[openstack-dev] [nova] Thoughts please on how to address a problem with multiple deletes leading to a nova-compute thread pool problem
Alex Glikson
GLIKSON at il.ibm.com
Sat Oct 26 10:11:00 UTC 2013
+1
Regards,
Alex
Joshua Harlow <harlowja at yahoo-inc.com> wrote on 26/10/2013 09:29:03 AM:
>
> An idea that others and I are having for a similar use case in
> cinder (or it appears to be similar).
>
> If there were one or more well-defined state machines in nova, with
> well-defined and managed transitions between states, then such a
> state machine could resume on failure as well as be interrupted
> when a "dueling" or preemptable operation arrives (a delete while
> being created, for example). This way not only would the set of
> states and transitions be very clear, but it would also be clear how
> preemption occurs (and in what cases).
>
> Right now in nova there is a distributed and ad-hoc state machine
> which, if it were more formalized, could inherit some of the
> useful capabilities described above. It would also be much more
> resilient to the types of locking problems that you described.
>
> IMHO that's the only way these types of problems will be fully
> fixed: not by more queues or more periodic tasks, but by solidifying
> & formalizing the state machines that compose the work nova does.
>
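To make the state machine idea concrete, here is a minimal sketch of an explicit transition table with delete as a preempting operation. Everything in it is an assumption made for illustration: the state names, the `TRANSITIONS` and `PREEMPTABLE` tables, and the `InstanceMachine` class are hypothetical and are not taken from nova or cinder.

```python
from enum import Enum


class State(Enum):
    BUILDING = "building"
    ACTIVE = "active"
    DELETING = "deleting"
    DELETED = "deleted"
    ERROR = "error"


# Explicit transition table: every legal move is listed here and nowhere else.
TRANSITIONS = {
    State.BUILDING: {State.ACTIVE, State.ERROR, State.DELETING},
    State.ACTIVE: {State.ERROR, State.DELETING},
    State.ERROR: {State.DELETING},
    State.DELETING: {State.DELETED, State.ERROR},
    State.DELETED: set(),
}

# States whose in-flight work may be interrupted by an incoming delete.
PREEMPTABLE = {State.BUILDING}


class InvalidTransition(Exception):
    pass


class InstanceMachine:
    def __init__(self, state=State.BUILDING):
        self.state = state
        self.history = [state]
        self.preempted = False

    def transition(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise InvalidTransition(f"{self.state.name} -> {new_state.name}")
        self.state = new_state
        self.history.append(new_state)

    def request_delete(self):
        """Accept a delete from any live state; it preempts a build."""
        if self.state in (State.DELETING, State.DELETED):
            return False  # already underway, nothing new to do
        self.preempted = self.state in PREEMPTABLE
        self.transition(State.DELETING)
        return True

    @classmethod
    def resume(cls, history):
        """Rebuild a machine from persisted history, validating each step."""
        machine = cls(history[0])
        for state in history[1:]:
            machine.transition(state)
        return machine
```

Because the legal moves live in one table, "what can preempt what" becomes a data question rather than something scattered across lock acquisitions, and a persisted history is enough to resume after a failure.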
> Sent from my really tiny device...
>
> > On Oct 25, 2013, at 3:52 AM, "Day, Phil" <philip.day at hp.com> wrote:
> >
> > Hi Folks,
> >
> > We're very occasionally seeing problems where a thread processing
> a create hangs (we've seen this when talking to Cinder and Glance).
> Whilst those issues need to be hunted down in their own right, they
> do show up what seems to me to be a weakness in the processing of
> delete requests that I'd like to get some feedback on.
> >
> > Delete is the one operation that is allowed regardless of the
> Instance state (since it's a one-way operation, and users should
> always be able to free up their quota). However when we get a
> create thread hung in one of these states, the delete requests when
> they hit the manager will also block as they are synchronized on the
> uuid. Because the user making the delete request doesn't see
> anything happen they tend to submit more delete requests. The
> Service is still up, so these go to the compute manager as well,
> and eventually all of the threads will be waiting for the lock, and
> the compute manager will stop consuming new messages.
> >
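The failure mode described above can be reproduced in a few lines. This is a sketch under stated assumptions, not nova code: plain OS threads stand in for whatever worker pool the compute manager uses, `POOL_SIZE` is an arbitrary value, and the `create` and `delete` functions are hypothetical stand-ins that only take the per-uuid lock.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError

POOL_SIZE = 4  # hypothetical; stands in for the compute manager's pool size


def demo(probe_timeout=0.5):
    locks = {}                          # one lock per instance uuid
    locks_guard = threading.Lock()
    create_holds_lock = threading.Event()
    unhang_create = threading.Event()

    def lock_for(uuid):
        with locks_guard:
            return locks.setdefault(uuid, threading.Lock())

    def create(uuid):
        with lock_for(uuid):
            create_holds_lock.set()
            unhang_create.wait()        # stands in for a hung Cinder/Glance call
        return "created " + uuid

    def delete(uuid):
        with lock_for(uuid):            # blocks while the create holds the lock
            return "deleted " + uuid

    pool = ThreadPoolExecutor(max_workers=POOL_SIZE)
    hung = pool.submit(create, "uuid-1")
    create_holds_lock.wait()
    # The impatient user retries the delete until every other worker is waiting.
    retries = [pool.submit(delete, "uuid-1") for _ in range(POOL_SIZE - 1)]
    # A request for an unrelated instance now has no worker to run on.
    other = pool.submit(delete, "uuid-2")
    try:
        other.result(timeout=probe_timeout)
        starved = False
    except TimeoutError:
        starved = True
    unhang_create.set()                 # release the hang so the demo can exit
    results = [hung.result(), other.result()] + [r.result() for r in retries]
    pool.shutdown()
    return starved, results
```

Calling `demo()` returns `starved` as true: the request for `uuid-2` never touches the contended lock, yet it cannot run because every worker is parked behind the hung create.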
> > The problem isn't limited to deletes - although in most cases the
> change of state in the API means that you have to keep making
> different calls to get past the state checker logic to do it with an
> instance stuck in another state. Users also seem to be more
> impatient with deletes, as they are trying to free up quota for other
> things.
> >
> > So while I know that we should never get a thread into a hung
> state in the first place, I was wondering about one of the
> following approaches to address just the delete case:
> >
> > i) Change the delete call on the manager so it doesn't wait for
> the uuid lock. Deletes should be coded so that they work regardless
> of the state of the VM, and other actions should be able to cope
> with a delete being performed from under them. There is of course
> no guarantee that the delete itself won't block as well.
> >
> > ii) Record in the API server that a delete has been started (it
> may be enough to use the task state being set to DELETING in the API,
> if we're sure this doesn't get cleared), and add a periodic task in
> the compute manager to check for and delete instances that have been
> in a "DELETING" state for more than some timeout. Then the API,
> knowing that the delete will be processed eventually, can just no-op
> any further delete requests.
> >
> > iii) Add some hook into the ServiceGroup API so that the timer
> could depend on getting a free thread from the compute manager pool
> (i.e. run some no-op task) - so that if there are no free threads then
> the service becomes down. That would (eventually) stop the scheduler
> from sending new requests to it, and cause deletes to be processed in
> the API server, but of course won't help with commands for other
> instances on the same host.
> >
> > iv) Move away from having a general topic and thread pool for all
> requests, and start a listener on an instance specific topic for
> each running instance on a host (leaving the general topic and pool
> just for creates and other non-instance calls like the hypervisor
> API). Then a blocked task would only affect requests for a
> specific instance.
> >
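Option ii) above can be sketched as two small functions: an API-side check that records the delete once and no-ops repeats, and a periodic reaper on the compute side. All names here are assumptions for illustration (`Instance`, `api_delete`, `reap_stuck_deletes`, `DELETE_TIMEOUT`, and the `delete_started_at` field are hypothetical and not nova's data model), and time is passed in explicitly to keep the sketch deterministic.

```python
from dataclasses import dataclass
from typing import Optional

DELETING = "deleting"
DELETE_TIMEOUT = 600.0  # seconds; hypothetical value


@dataclass
class Instance:
    uuid: str
    task_state: Optional[str] = None
    delete_started_at: Optional[float] = None
    deleted: bool = False


def api_delete(instance, now, cast_to_compute):
    """API side: record the delete once, then no-op any repeats."""
    if instance.deleted or instance.task_state == DELETING:
        return False                      # a delete is already recorded
    instance.task_state = DELETING
    instance.delete_started_at = now
    cast_to_compute(instance.uuid)        # hypothetical cast to the compute host
    return True


def reap_stuck_deletes(instances, now, do_delete, timeout=DELETE_TIMEOUT):
    """Periodic task: finish deletes that have been pending too long."""
    reaped = []
    for inst in instances:
        if inst.deleted or inst.task_state != DELETING:
            continue
        if now - inst.delete_started_at > timeout:
            do_delete(inst)
            inst.deleted = True
            inst.task_state = None
            reaped.append(inst.uuid)
    return reaped
```

With this shape, repeated delete requests never reach the compute manager's thread pool, so they cannot add to the pile-up. The sketch relies on the caveat already raised in ii): the recorded DELETING state must not be cleared by other code paths.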
> > I'm tending towards ii) as a simple and pragmatic solution in the
> near term, although I like both iii) and iv) as generally good
> enhancements - but iv) in particular feels like a pretty seismic
> change.
> >
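The liveness check in option iii) could look roughly like the following. This is a sketch only: `pool_has_free_worker` and `heartbeat` are hypothetical names, `report_up` is a placeholder callback for whatever the ServiceGroup driver does to mark a service alive, and a standard-library thread pool stands in for the compute manager's pool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def pool_has_free_worker(pool, timeout):
    """Return True if a no-op task gets to run within `timeout` seconds."""
    probe = pool.submit(lambda: None)
    try:
        probe.result(timeout=timeout)
        return True
    except TimeoutError:
        probe.cancel()   # drop the still-queued probe so probes don't pile up
        return False


def heartbeat(pool, report_up, timeout=1.0):
    """Report liveness only when the worker pool can still make progress."""
    alive = pool_has_free_worker(pool, timeout)
    if alive:
        report_up()
    return alive
```

If every worker is blocked, the heartbeat is simply skipped, so the service would eventually be treated as down by whatever consumes the heartbeats. The probe timeout needs to be comfortably shorter than the heartbeat interval.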
> > Thoughts please,
> >
> > Phil
> >
> > _______________________________________________
> > OpenStack-dev mailing list
> > OpenStack-dev at lists.openstack.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>