[openstack-dev] Nova's use of task_state for reboot serialization

Mark McLoughlin markmc at redhat.com
Fri Jan 18 14:10:18 UTC 2013


Hey

Here's a scenario I came across yesterday in a real Folsom deployment:

  https://bugs.launchpad.net/nova/+bug/982108

  - compute node locked up for over 12 due to what looks like a kernel 
    bug

  - during that time, someone came along and tried to reboot their 
    instance with horizon which does a hard reboot

  - the reboot message was cast to the compute node but never picked up 
    so the instance was in task_state=REBOOTING_HARD

  - once the compute node had come back to life, the user tried 
    rebooting again but wasn't allowed. The instance needed admin 
    intervention to get unstuck.

At first, I thought this was just an oversight that REBOOTING_HARD
wasn't one of the allowed states for rebooting.

However, I came across a discussion here:

  https://review.openstack.org/5090

which shows that we're using task_state to prevent multiple reboots of
the same type happening at once. I'm assuming that's because the reboots
would interfere with each other, e.g. attempting to create the same VM
twice.

That has me wondering why we don't just take a lock on the instance in
the compute manager during a reboot?

Isn't it the case that we should only be using task_state for policies
of the "it doesn't make sense to do foo while bar is happening" kind, as
opposed to task serialization?
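
To make the distinction concrete, here is a rough sketch of the two
mechanisms side by side. It is not Nova code: the policy table, the
function names and the set of allowed states are all made up for
illustration, and it uses plain threading locks rather than anything
from the compute manager.

```python
import threading

# Hypothetical policy table (an assumption, not Nova's real one): which
# task_states an API action is allowed in. None means "no task running".
ALLOWED_TASK_STATES = {
    "reboot": {None, "REBOOTING", "REBOOTING_HARD"},
    "resize": {None},
}

_locks = {}                      # instance uuid -> lock
_locks_guard = threading.Lock()  # protects the _locks dict itself


def check_policy(action, task_state):
    """Policy: does it make sense to do `action` while in `task_state`?"""
    return task_state in ALLOWED_TASK_STATES[action]


def run_serialized(instance_uuid, work):
    """Serialization: run at most one piece of work per instance at a time."""
    with _locks_guard:
        lock = _locks.setdefault(instance_uuid, threading.Lock())
    with lock:
        return work()
```

With this split, a second reboot request is accepted by the policy
check and simply waits on the lock, instead of being rejected because
of a task_state that may never be cleared.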

Also, if a task is kicked off with a cast, we never know whether the
message was actually received, so we don't know to revert the task_state
in that case. If we want asynchrony in these cases, shouldn't we use
call() but have the compute node spawn off a greenthread to carry out
the action? That way the message is acknowledged and we know that
task_state reversion should happen if it fails from that point on.

In summary, the patch I'd cook up for this would change the reboot
cast() to call() and have nova-compute spawn off a greenthread which
takes the instance lock for the duration of the reboot and reverts
task_state on failure.
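
Very roughly, the shape I have in mind is something like the sketch
below. Everything in it is a stand-in: FakeComputeManager, api_reboot
and the do_reboot callback are hypothetical names, a plain thread takes
the place of the greenthread, and a direct method call takes the place
of the RPC call().

```python
import threading


class FakeComputeManager:
    """Stand-in for nova-compute; all names here are hypothetical."""

    def __init__(self):
        self.task_states = {}   # instance uuid -> task_state
        self.failures = []      # uuids whose reboot raised
        self._locks = {}        # instance uuid -> lock
        self._guard = threading.Lock()

    def _lock_for(self, uuid):
        with self._guard:
            return self._locks.setdefault(uuid, threading.Lock())

    def reboot_instance(self, uuid, do_reboot):
        """Handler for the call(): hand off the work and return at once.

        Returning is the acknowledgement that the message was received.
        """
        worker = threading.Thread(target=self._reboot, args=(uuid, do_reboot))
        worker.start()
        return worker

    def _reboot(self, uuid, do_reboot):
        # Hold the instance lock for the whole reboot so that two
        # reboots of the same instance cannot run concurrently.
        with self._lock_for(uuid):
            try:
                do_reboot(uuid)
            except Exception:
                self.failures.append(uuid)  # a real service would log this
            finally:
                # Cleared on success, reverted on failure.
                self.task_states[uuid] = None


def api_reboot(manager, uuid, do_reboot):
    """API side: set task_state, then call() instead of cast()."""
    manager.task_states[uuid] = "REBOOTING_HARD"
    try:
        return manager.reboot_instance(uuid, do_reboot)
    except Exception:
        # The call was never acknowledged, so revert here.
        manager.task_states[uuid] = None
        raise
```

The point of the sketch is only that task_state gets reset on both
paths: by the API if the call is never acknowledged, and by the worker
once the reboot finishes or fails.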

Am I missing something?

Is this a fundamental change from our previous thinking and we should
audit for similar problems, or is this just an individual oddity?

Cheers,
Mark.



