[openstack-dev] Instance state recovery - LP bug 957009

Johannes Erdfelt johannes at erdfelt.com
Fri Nov 2 15:01:39 UTC 2012


On Fri, Nov 02, 2012, Gurjar, Unmesh <Unmesh.Gurjar at nttdata.com> wrote:
> I would like to collect some thoughts for fixing LP bug:
> https://bugs.launchpad.net/nova/+bug/957009.
> 
> Following is the scenario:
> 
> 1. The Compute service hosting an active instance goes down and a
> user requests a DELETE operation on that instance.
> 
> 2. The Nova API finds the Compute service running (since it had
> updated its heartbeat more recently than the 'service_down_time'
> period). It marks the instance task_state as DELETING and casts a
> delete to the Compute service.
> 
> 3. If the RabbitMQ service happens to restart before the Compute
> service comes up, the DELETE message to Compute will be lost and the
> instance will remain in the DELETING state (vm_state=ACTIVE,
> task_state=DELETING, power_state=1). Even the
> '_cleanup_running_deleted_instances' periodic task won't clean up such
> an instance.
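
The window described in step 2 can be sketched roughly as follows. This
is only an illustration, not Nova's actual code: the helper name
service_is_up, the 60-second value for service_down_time, and the
timestamps are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Assumed value for illustration; the real setting is 'service_down_time'.
SERVICE_DOWN_TIME = timedelta(seconds=60)


def service_is_up(last_heartbeat, now, down_time=SERVICE_DOWN_TIME):
    """Hypothetical liveness check: alive if the heartbeat is recent enough."""
    return now - last_heartbeat <= down_time


# Made-up timeline: the service crashes 5s after its last heartbeat and
# the DELETE request arrives 20s after the crash.
last_heartbeat = datetime(2012, 11, 2, 15, 0, 0)
crash_time = last_heartbeat + timedelta(seconds=5)
delete_request_time = crash_time + timedelta(seconds=20)

# Only 25s have passed since the last heartbeat, so the API still treats
# the crashed service as running and casts the delete to it.
still_considered_up = service_is_up(last_heartbeat, delete_request_time)

# Once the heartbeat is older than the down time, the service is seen as down.
considered_up_later = service_is_up(
    last_heartbeat, last_heartbeat + timedelta(seconds=61))
```

Any request that lands inside that window is cast to a service that is
not there to consume it.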

Losing messages at RabbitMQ is one way this can happen; restarting the
consuming service can also cause messages to be lost.

Similar problems can also happen with every other RPC cast. Instances
can be left in the middle of BUILD, RESIZE, etc.

> Solution:
> I am planning to add a periodic task to Nova Compute similar to the
> '_check_instance_build_time' periodic task, which will poll for
> instances that have been in the 'DELETING' state for longer than some
> configurable value and set them to ERROR state.

The problem with this solution is that it only solves a narrow set of
failures.

I'd like to see nova adopt a more generic solution rather than coming up
with one-off solutions for every possible failure scenario.
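
As a rough illustration of what "more generic" could mean (this is not
the design discussed at the summit, just a sketch), a single sweep could
apply a per-task_state timeout instead of having one periodic task per
failure case. The task_state names, timeout values, dict-based instance
records, and the find_stuck_instances helper below are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical per-task_state timeouts; real values would be configurable.
TASK_TIMEOUTS = {
    "deleting": timedelta(minutes=10),
    "spawning": timedelta(minutes=30),
    "resize_migrating": timedelta(hours=1),
}


def find_stuck_instances(instances, now, timeouts=TASK_TIMEOUTS):
    """Return UUIDs of instances whose task_state has outlived its timeout."""
    stuck = []
    for inst in instances:
        limit = timeouts.get(inst["task_state"])
        # Instances with no task in progress, or an unlisted task, are skipped.
        if limit is not None and now - inst["updated_at"] > limit:
            stuck.append(inst["uuid"])
    return stuck


now = datetime(2012, 11, 2, 15, 0, 0)
instances = [
    {"uuid": "a", "task_state": "deleting",
     "updated_at": now - timedelta(minutes=15)},
    {"uuid": "b", "task_state": "deleting",
     "updated_at": now - timedelta(minutes=2)},
    {"uuid": "c", "task_state": "spawning",
     "updated_at": now - timedelta(hours=1)},
    {"uuid": "d", "task_state": None,
     "updated_at": now - timedelta(days=1)},
]

stuck = find_stuck_instances(instances, now)
```

What to do with the instances it finds (retry the cast, roll back, or
set ERROR) is a separate decision and is left out here.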

There were some discussions at the design summit about this:

https://etherpad.openstack.org/grizzly-error-handling-recovery

JE



