[openstack-dev] [Nova] RPC Communication Errors Might Lead to a Bad State
shoham.peller at stratoscale.com
Wed Apr 13 15:25:56 UTC 2016
In some cases, a communication failure between the different
nova services can leave the system in a bad state.
For example, when "shelving" a VM, nova-api sets the VM's task_state to
"shelving" and sends an RPC to nova-compute, which shelves the VM and resets
its task_state in the DB.
But if, for some reason, nova-compute never gets the message (e.g. the RPC
service was down, there's a bug in the RPC service, nova-compute was down,
or there was a temporary network malfunction), the VM stays stuck in
"shelving", and the user can't perform any operation on it.
The same pattern applies to several other scenarios in the system that
involve communication between different services.
From nova-api's point of view, all it does is send a message through
RPC; it neither checks that the message was received nor waits
for a reply or an acknowledgement from the receiver.
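A minimal sketch of this failure mode, using in-memory stand-ins rather than
real Nova code. LossyBus, api_shelve, compute_drain, run and the dict-based
"DB" are all hypothetical names made up for illustration; only the
fire-and-forget behaviour of the cast is taken from the description above.

```python
import queue


class LossyBus:
    """Toy message bus (hypothetical) that can silently drop messages."""

    def __init__(self, drop=False):
        self.drop = drop
        self.messages = queue.Queue()

    def cast(self, method, **kwargs):
        # Fire-and-forget: the caller gets no reply and no error on loss.
        if not self.drop:
            self.messages.put((method, kwargs))


def api_shelve(db, bus, vm_id):
    # nova-api side: mark the VM as busy, send the message, return at once.
    db[vm_id]["task_state"] = "shelving"
    bus.cast("shelve_instance", vm_id=vm_id)


def compute_drain(db, bus):
    # nova-compute side: handle only the messages that actually arrived.
    while not bus.messages.empty():
        method, kwargs = bus.messages.get()
        if method == "shelve_instance":
            vm = db[kwargs["vm_id"]]
            vm["vm_state"] = "shelved"
            vm["task_state"] = None


def run(drop):
    # Fresh toy DB per run: one active VM with no task in progress.
    db = {"vm-1": {"vm_state": "active", "task_state": None}}
    bus = LossyBus(drop=drop)
    api_shelve(db, bus, "vm-1")
    compute_drain(db, bus)
    return db["vm-1"]


print(run(drop=False))  # message delivered
print(run(drop=True))   # message lost
```

When the message is delivered, the VM ends up "shelved" with task_state
cleared. When it is dropped, the VM stays "active" with task_state still set
to "shelving", and nothing on the API side ever notices.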
Of course, to work around this, a user can run "reset-state" on the VM and
try the action again, but this is error-prone and doesn't scale.
Possible solutions might be:
- nova-api should receive an acknowledgement from nova-compute. It is
unclear to me why today it uses a non-reply mechanism - probably to free
the worker as fast as it can.
- Change the task_state mechanism to prevent this kind of stuck state
from staying in the DB. nova-compute could be the one that writes the
task_state to the DB, but that is not enough on its own; maybe there's
another way?
- nova-api could start a timer for the action to complete. If the
shelving operation hasn't completed in X seconds, it cleans up the state
itself and rolls back or tries again.
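The first and third ideas can be combined into a rough sketch: wait a bounded
time for an acknowledgement and restore the previous task_state if none
arrives. This is only an illustration under stated assumptions, not how Nova
works today; shelve_with_ack, make_compute and the dict-based "DB" are
hypothetical names, and the ack is modelled with a plain threading.Event.

```python
import threading


def make_compute(db, reachable):
    """Return a hypothetical 'send' function standing in for nova-compute."""

    def send(vm_id, ack):
        if not reachable:
            return  # message lost: no ack, no state change
        ack.set()  # acknowledge receipt first
        db[vm_id]["vm_state"] = "shelved"
        db[vm_id]["task_state"] = None

    return send


def shelve_with_ack(db, vm_id, send, timeout=1.0):
    """Set task_state, send the request, roll back if no ack arrives in time."""
    ack = threading.Event()
    previous = db[vm_id]["task_state"]
    db[vm_id]["task_state"] = "shelving"
    send(vm_id, ack)
    if ack.wait(timeout):
        return True
    # No acknowledgement within the timeout: undo our own change so the
    # VM is not left stuck in "shelving".
    db[vm_id]["task_state"] = previous
    return False


db = {"vm-1": {"vm_state": "active", "task_state": None}}
print(shelve_with_ack(db, "vm-1", make_compute(db, False), timeout=0.1), db)
print(shelve_with_ack(db, "vm-1", make_compute(db, True), timeout=0.1), db)
```

The catch is that a timeout can't tell a lost message from a slow receiver.
If nova-compute picks up the message after the rollback, the two sides
disagree again, so a real version would also need a way to ignore or fence
off late messages.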
What do you think about the problem and the solutions?