<div dir="ltr"><div>Hi all,</div><div><br></div><div>There are cases in which a communication failure between the different nova services can leave the system in a bad state.<br></div><div><br></div><div>For example, when "shelving" a VM, nova-api sets the VM's task_state to "shelving" and sends an RPC to nova-compute, which shelves the VM and resets its task_state in the DB.</div><div>But if, for some reason, nova-compute never gets the message (e.g. the RPC service was down, there's a bug in the RPC service, nova-compute was down, or there was a temporary network malfunction), the VM is stuck in "shelving", and the user can't perform any operation on it.</div><div>The same pattern applies to several scenarios in the system that involve communication between different services.<br></div><div><br></div><div>From nova-api's point of view, all it does is send a message over RPC. It neither checks that the message was received nor waits for a reply or an acknowledgement from the receiver.</div><div><br></div><div>Of course, a user can work around this by running "reset-state" on the VM and trying the action again, but this is error-prone and doesn't scale.</div><div><br></div><div>Possible solutions might be:</div><div><ul><li>nova-api should receive an acknowledgement from nova-compute. It is unclear to me why it uses a non-reply mechanism today, probably to free the worker as quickly as possible.</li><li>Change the task_state mechanism to prevent this kind of stuck state from persisting in the DB. nova-compute could be the one that writes the task_state to the DB. That alone is not enough, of course, but maybe there's another way?</li><li>nova-api could start a timer for the action to complete. If the shelving operation hasn't completed within X seconds, nova-api cleans up the state itself and either rolls back or retries.</li></ul><div>What do you think about the problem and the solutions?</div></div><div><br></div><div>Thanks,</div><div>Shoham Peller</div></div>