[Openstack] RPC Semantics

Eric Windisch eric at cloudscaling.com
Tue Jun 12 20:27:06 UTC 2012


> 
> For instance, an instance migration can take a while since we need to
> copy many gigabytes of disks to another host. If we want to do a
> software upgrade, we either need to wait a long time for the migration
> to finish, or we need to restart the service and then restart the
> processing of the message.
> 
> 

You wait a long time, period. If you wait a long time and it fails, you restart. Having that happen automatically on the consumer side isn't necessarily a good thing. 
> 
> If all software gets restarted, then persistence is important.
Again, I see an argument for giving callers limited persistence, but not consumers.
> 
> > All calls have a timeout (TTL). The ZeroMQ driver also implements a TTL
> > on the casts, and I'm quite sure we should support this in Kombu/Qpid
> > as well to avoid a thundering-herd.
> > 
> 
> 
> What thundering herd problems exist in OpenStack?

Say we have one API service and one scheduler. If the scheduler fails, API requests to create an instance will pile up until the scheduler returns. The returning scheduler will receive all of those instance-creation requests and will launch those instances. (This would also apply to messages between the scheduler and a compute service.)

The end-user will see the run-instance command as potentially failing and may attempt to launch again. The queue will hold all of these requests and they will all get processed when the scheduler returns.

This is especially problematic with auto-scaling. How well will RightScale or enStratus run against a system that takes hours and hours to launch instances? They'll just retry and retry. You don't want those requests to simply queue up.
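This is where a cast TTL helps. Roughly the sort of thing I have in mind, as a sketch only (the in-memory queue and helper names below are illustrative, not the actual Kombu/Qpid driver API):

    import time

    CAST_TTL = 60  # seconds a queued cast stays valid (an assumed policy)

    def cast(queue, method, args, ttl=CAST_TTL):
        # Stamp the message with an expiry time when it is sent.
        queue.append({'method': method,
                      'args': args,
                      'expires_at': time.time() + ttl})

    def consume(queue, process):
        # A returning scheduler drains its backlog, but drops anything
        # that expired while it was down: the pile of stale run_instance
        # casts never becomes a thundering herd of launches.
        while queue:
            msg = queue.pop(0)
            if time.time() > msg['expires_at']:
                continue  # expired while queued; the caller gave up long ago
            process(msg['method'], msg['args'])

Kombu/Qpid could do the same expiry check on the consumer side before dispatching to the manager.
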
> I do know there are problems with queuing combined with timeouts. It
> makes less sense to process a get_nw_info request if the requestor has
> already timed out and will ignore the response. Is that what you're
> referring to with TTLs?

That is important too, in the case of calls, but not all that important. I'm not so concerned about machines sending useless replies; we can ignore them.

> 
> Idempotent actions want persistence so it will actually complete the
> action requested in the message. For instance, if nova-compute is
> stopped in the middle of an instance-create, we want to actually finish
> the create after the process is restarted.
Only if it hasn't timed out. Otherwise, you're only asking for a thundering herd.

What has happened on the caller side? Has it timed out and given the user an error? What about manager methods (RPC methods) that themselves call RPC; how deep does that stack go?

Perhaps it is better, if nova-compute is stopped in the middle of an instance-create, that it *clean up* on restart rather than attempt to continue down an arguably pointless and potentially dangerous path of actually creating that instance?
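In sketch form, the kind of restart hook I mean (the BUILDING state and the instance store here are stand-ins, not nova's actual schema):

    BUILDING, ERROR = 'building', 'error'

    def init_host(instances, release_resources):
        # On service restart, sweep for half-built instances and tear
        # them down instead of resuming the build: the original caller
        # has almost certainly timed out, so finishing would hand the
        # user an instance nobody is waiting for.
        for instance in instances:
            if instance['state'] == BUILDING:
                release_resources(instance)  # network, volumes, disk files
                instance['state'] = ERROR
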
> There is no process waiting for a return value, but we certainly would
> like for the message to be persisted so we can restart it.
> 
I'm not sure about that.
 
> 
> > Anyway, in the ZeroMQ driver, we could have a local queue to track
> > casts and remove them when the send() coroutine completes. This would
> > provide restart protection for casts. 
> > 
> 
> 
> Assuming the requesting process remains running the entire time?
I meant persisting ONLY in the requesting process. If the requesting process fails before that message is consumed, it can attempt to resubmit that message for consumption upon relaunch. The requesting process would track the amount of time spent waiting for the message to be consumed and would subtract that time from the remaining timeout. 
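Something like this, as a sketch (the journal here is an in-memory dict purely for illustration; a real caller would persist it to disk to survive the restart, and send() stands in for the driver's send coroutine):

    import time

    journal = {}

    def cast_with_journal(msg_id, msg, timeout, send):
        # Record the cast before sending; forget it once send() completes.
        journal[msg_id] = {'msg': msg, 'sent_at': time.time(),
                           'timeout': timeout}
        send(msg)
        del journal[msg_id]

    def resubmit_on_relaunch(send):
        # On relaunch, resend anything still journaled, charging the time
        # already spent waiting against the original timeout.
        for msg_id, entry in list(journal.items()):
            remaining = entry['timeout'] - (time.time() - entry['sent_at'])
            if remaining <= 0:
                del journal[msg_id]  # timed out while we were down; drop it
            else:
                cast_with_journal(msg_id, entry['msg'], remaining, send)
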

Regards,
Eric Windisch