[openstack-dev] [nova] instances stuck with task_state of REBOOTING
chris.friesen at windriver.com
Thu Mar 20 18:59:55 UTC 2014
On 03/20/2014 12:29 PM, Chris Friesen wrote:
> The fact that there are no success or error logs in nova-compute.log
> makes me wonder if we somehow got stuck in self.driver.reboot().
> Also, I'm kind of wondering what would happen if nova-compute was
> running reboot_instance() and we rebooted the controller at the same
> time. reboot_instance() could time out trying to update the instance
> with the new power state and a task_state of None. Later on in
> _sync_power_states() we would update the power_state, but nothing would
> update the task_state. I don't think this is what happened to us though
> since I'd expect to see logs of the timeout.
Actually, looking at the logs a bit more carefully it appears that what
happened is something like this:
We reboot the controllers.
Right after they come back up, something calls compute.api.API.reboot().
That sets instance.task_state = task_states.REBOOTING and then calls
instance.save() to update the database.
Then it calls self.compute_rpcapi.reboot_instance(), which does an RPC cast.
That message gets dropped on the floor due to communication issues
between the controller and the compute.
Now we're stuck with a task_state of REBOOTING.
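To make the sequence above concrete, here is a toy model of it. This is only a sketch: FakeInstance, LossyCast, api_reboot() and compute_reboot_instance() are hypothetical stand-ins I made up for illustration, not nova code. The only property it tries to capture is that a cast is fire-and-forget, so the API side has already persisted REBOOTING and never finds out the message was lost.

```python
# Toy model only: every name here is a hypothetical stand-in, not nova code.
REBOOTING = "rebooting"


class FakeInstance:
    """Stands in for an instance record persisted in the database."""

    def __init__(self, uuid):
        self.uuid = uuid
        self.task_state = None


class LossyCast:
    """Fire-and-forget messaging: the caller never learns about a drop."""

    def __init__(self, deliver):
        self.deliver = deliver  # False simulates the lost message

    def cast(self, handler, instance):
        if self.deliver:
            handler(instance)
        # A cast has no reply, so nothing is raised or returned on loss.


def compute_reboot_instance(instance):
    """Compute-side handler: do the reboot, then clear the task_state."""
    instance.task_state = None


def api_reboot(instance, rpc):
    """API-side path: persist REBOOTING first, then cast to the compute."""
    instance.task_state = REBOOTING  # plays the role of instance.save()
    rpc.cast(compute_reboot_instance, instance)


ok = FakeInstance("a")
api_reboot(ok, LossyCast(deliver=True))

stuck = FakeInstance("b")
api_reboot(stuck, LossyCast(deliver=False))

print(ok.task_state, stuck.task_state)
```

The second instance keeps its REBOOTING task_state indefinitely, because the only code that would clear it lives in the handler that never ran.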
I think that both of the RPC message loss scenarios are valid with
current nova code, so we really do need an audit to clean up after this
sort of thing.
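As a rough idea of what such an audit could look like, here is a minimal sketch. It assumes instance records carry an updated_at timestamp and that a REBOOTING task_state older than some threshold can be treated as stale. The REBOOT_TIMEOUT value, the dict-based records and both function names are assumptions for illustration, not existing nova options or APIs. Whether the right cleanup is to clear the task_state or to retry the reboot is a separate policy question; the sketch just clears it.

```python
# Sketch of a periodic audit; names and threshold are hypothetical.
REBOOTING = "rebooting"
REBOOT_TIMEOUT = 300  # seconds; assumed value, not a real nova option


def find_stuck_reboots(instances, now, timeout=REBOOT_TIMEOUT):
    """Return instances that have been in REBOOTING longer than timeout."""
    return [
        inst for inst in instances
        if inst["task_state"] == REBOOTING
        and now - inst["updated_at"] > timeout
    ]


def audit_reboots(instances, now, timeout=REBOOT_TIMEOUT):
    """Clear stale REBOOTING task_states and return the affected uuids."""
    stuck = find_stuck_reboots(instances, now, timeout)
    for inst in stuck:
        inst["task_state"] = None  # a real audit might retry the reboot
        inst["updated_at"] = now
    return [inst["uuid"] for inst in stuck]


instances = [
    {"uuid": "a", "task_state": REBOOTING, "updated_at": 0},     # stale
    {"uuid": "b", "task_state": REBOOTING, "updated_at": 900},   # recent
    {"uuid": "c", "task_state": None, "updated_at": 0},          # idle
]
cleaned = audit_reboots(instances, now=1000)
print(cleaned)
```

Only the instance whose REBOOTING state predates the threshold is touched; a reboot that is legitimately still in progress is left alone.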