[openstack-dev] [Nova] RPC Communication Errors Might Lead to a Bad State

Dan Smith dms at danplanet.com
Wed Apr 13 15:34:53 UTC 2016

>   * nova-api should receive an acknowledgement from nova-compute. It is
>     unclear to me why today it uses a non-reply mechanism - probably to
>     free the worker as fast as it can.

Yes, wherever possible, we want the API to return immediately and let
the action complete later. Making a wholesale change to blocking calls
from the API to any other service is not a good idea, IMHO.
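
To make the trade-off concrete, here is a rough sketch of the two styles. It is a toy, not Nova code: FakeCompute, cast() and call() are hypothetical stand-ins built on the standard library, named only to echo the cast/call split in our RPC layer.

```python
import queue
import threading
import time


class FakeCompute:
    """Hypothetical stand-in for a compute worker draining a message queue."""

    def __init__(self):
        self.inbox = queue.Queue()
        self.done = []
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            action, reply = self.inbox.get()
            time.sleep(0.05)  # pretend the work is slow
            self.done.append(action)
            if reply is not None:
                reply.put("ok")  # only blocking callers get an answer


def cast(compute, action):
    """Fire and forget: enqueue and return at once, no acknowledgement."""
    compute.inbox.put((action, None))


def call(compute, action, timeout=1.0):
    """Blocking: enqueue, then wait for the reply (raises queue.Empty on timeout)."""
    reply = queue.Queue(maxsize=1)
    compute.inbox.put((action, reply))
    return reply.get(timeout=timeout)
```

With cast() the API worker is free again right away, but it never learns whether the message arrived. With call() it does learn that, at the cost of being tied up for as long as the compute side takes.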

>   * Change the task_state mechanism to prevent this kind of a stuck
>     state to stay in the DB. nova-compute can be the one that writes the
>     task_state to the DB, but this is not enough of course, but maybe
>     there's another way?

The task_state being set in the API is our way of limiting/locking the
operation so that if the request is queued for a long time, a user
doesn't reissue the command a bunch of times and add load to the API
and/or jam up the queue with a thousand requests to do the same
operation just because it's taking a while.
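
Roughly, the guard behaves like a compare-and-swap on task_state. The sketch below is an in-memory illustration under that assumption. FakeInstanceDB, InstanceBusy and api_shelve are made-up names, and the real check happens in the database rather than behind a Python lock.

```python
import threading


class InstanceBusy(Exception):
    """Raised when another operation already holds the task_state."""


class FakeInstanceDB:
    """Hypothetical in-memory stand-in for the instance table."""

    def __init__(self):
        self._lock = threading.Lock()
        self._task_state = {}

    def set_task_state(self, uuid, new_state, expected=None):
        # Compare-and-swap: only write if the current value is what we expect.
        with self._lock:
            current = self._task_state.get(uuid)
            if current != expected:
                raise InstanceBusy(f"{uuid} is already in task_state {current!r}")
            self._task_state[uuid] = new_state


def api_shelve(db, outbox, uuid):
    """API side: take the task_state first, then queue the request."""
    db.set_task_state(uuid, "shelving", expected=None)
    outbox.append(("shelve_instance", uuid))
```

A second api_shelve() for the same instance fails fast with InstanceBusy and never reaches the queue, which is the point. The flip side is the problem in this thread: if the queued message is lost, nothing ever clears "shelving".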

>   * nova-api could start a timer for the action to complete. If the
>     shelving operation hasn't completed in X seconds, it will clean it
>     by itself and rollback\try-again.

I have wanted to make a change for a while that involves a TTL on
messages, along with a deadline record so that we can know when to retry
or revert things that were in flight. This requires a lot of machinery
to accomplish, and is probably interwoven with the task concept we've
had on the back burner for a while. The complexity of moving nova to
this sort of scheme means that nobody has picked it up as of yet, but
it's certainly in the minds of many of us as something we need to do
before too long.
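
For what it's worth, the shape I have in mind is something like the following. This is only a sketch of the idea, not a design. InFlight, should_process and sweep are hypothetical names, and "now" is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass


@dataclass
class InFlight:
    """Hypothetical deadline record written when the API sends a request."""
    uuid: str
    action: str
    deadline: float  # absolute time after which the request is presumed lost


def should_process(record, now):
    """Consumer side: a message past its TTL is dropped, not executed."""
    return now < record.deadline


def sweep(task_states, records, now):
    """Revert task_state for expired records; return the ones still pending."""
    pending = []
    for rec in records:
        if now >= rec.deadline:
            # Presumed lost: clear the lock so the user can retry.
            task_states[rec.uuid] = None
        else:
            pending.append(rec)
    return pending
```

The two halves have to agree on the deadline. If the sweeper reverts an instance while the consumer is still willing to run the late message, we are back to an inconsistent state, which is part of why this needs a lot of machinery.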

