[Openstack-operators] [nova] Interesting bug when unshelving an instance in an AZ and the AZ is gone

Matt Riedemann mriedemos at gmail.com
Mon Oct 16 15:22:59 UTC 2017


This is interesting from the user point of view:

https://bugs.launchpad.net/nova/+bug/1723880

- The user creates an instance in a non-default AZ.
- They shelve offload the instance.
- The admin deletes the AZ that the instance was using, for whatever reason.
- The user unshelves the instance, which goes back through scheduling and 
fails with NoValidHost because the AZ on the original request spec no 
longer exists.
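
As a rough illustration of that sequence, here is a toy model in Python. 
It is not nova code: RequestSpec here is a stand-in dataclass, and 
select_hosts, az_hosts and the host/AZ names are made up for the example. 
It only assumes that the scheduler restricts candidates to the AZ stored 
on the request spec.

```python
# Toy model only: every name here is hypothetical, not nova's implementation.
from dataclasses import dataclass
from typing import Dict, List, Optional


class NoValidHost(Exception):
    """Raised when no host satisfies the request."""


@dataclass
class RequestSpec:
    # AZ the user asked for at create time; persisted across shelve/unshelve.
    availability_zone: Optional[str]


def select_hosts(spec: RequestSpec, az_hosts: Dict[str, List[str]]) -> List[str]:
    """Return candidate hosts, restricted to the requested AZ if one is set."""
    if spec.availability_zone is None:
        hosts = [h for hs in az_hosts.values() for h in hs]
    else:
        hosts = list(az_hosts.get(spec.availability_zone, []))
    if not hosts:
        raise NoValidHost(f"no hosts for AZ {spec.availability_zone!r}")
    return hosts


az_hosts = {"nova": ["host1"], "az1": ["host2"]}
spec = RequestSpec(availability_zone="az1")  # user creates in a non-default AZ
assert select_hosts(spec, az_hosts) == ["host2"]

del az_hosts["az1"]  # admin deletes the AZ while the instance is shelved

try:
    select_hosts(spec, az_hosts)  # unshelve goes back through scheduling
    unshelve_failed = False
except NoValidHost:
    unshelve_failed = True
```

The point is only that the stale AZ on the spec outlives the AZ itself, so 
the filter has nothing left to match.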

Now the question is what, if anything, do we do about this bug? Some notes:

1. How reasonable is it for a user to expect in a stable production 
environment that AZs are going to be deleted from under them? We 
actually have a spec related to this but with AZ renames:

https://review.openstack.org/#/c/446446/

2. Should we null out the instance.availability_zone when it's shelved 
offloaded like we do for the instance.host and instance.node attributes? 
Similarly, we would not take into account the 
RequestSpec.availability_zone when scheduling during unshelve. I tend to 
prefer this option because once you shelve offload an instance, it's 
no longer associated with a host and therefore no longer associated with 
an AZ. However, is it reasonable to assume that the user doesn't care 
that the instance, once unshelved, is no longer in the originally 
requested AZ? Probably not a safe assumption.

3. When a user unshelves, they can't propose a new AZ (and I don't think 
we want to add that capability to the unshelve API). So if the original 
AZ is gone, should we automatically remove the 
RequestSpec.availability_zone when scheduling? I tend to not like this 
as it's very implicit and the user could see the AZ on their instance 
change before and after unshelve and be confused.

4. We could simply do nothing about this specific bug and assert the 
behavior is correct. The user requested an instance in a specific AZ, 
shelved that instance and when they wanted to unshelve it, it's no 
longer available so it fails. The user would have to delete the instance 
and create a new instance from the shelve snapshot image in a new AZ. If 
we implemented Sylvain's spec in #1 above, maybe we don't have this 
problem going forward since you couldn't remove/delete an AZ while any 
instances, even shelved offloaded ones, are still tied to it.
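
To make the difference between options 2, 3 and 4 concrete, here is a 
small sketch of which AZ constraint each one would hand to the scheduler 
on unshelve. Again, this is a hypothetical helper, not nova code: 
az_for_unshelve and the policy names are invented for illustration, and 
it assumes we can cheaply check whether the requested AZ still exists.

```python
# Hypothetical sketch of the options above; none of these names exist in nova.
from typing import Optional, Set


def az_for_unshelve(requested_az: Optional[str],
                    existing_azs: Set[str],
                    policy: str) -> Optional[str]:
    """Return the AZ to schedule with (None means any AZ), or raise.

    policy:
      "clear_on_shelve"  option 2: the AZ was dropped at shelve offload time
      "drop_if_missing"  option 3: silently ignore an AZ that no longer exists
      "strict"           option 4: keep the AZ and fail if it is gone
    """
    if policy not in ("clear_on_shelve", "drop_if_missing", "strict"):
        raise ValueError(f"unknown policy: {policy}")
    if policy == "clear_on_shelve":
        return None
    if requested_az is None or requested_az in existing_azs:
        return requested_az
    if policy == "drop_if_missing":
        return None
    # strict: this is the NoValidHost case from the bug report
    raise LookupError(f"availability zone {requested_az!r} no longer exists")
```

For example, with the AZ "az1" deleted, az_for_unshelve("az1", {"nova"}, 
"drop_if_missing") returns None, while the "strict" policy raises. Note 
that option 2 also drops the constraint when the AZ still exists, which is 
the part that may surprise users.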

Other options?

-- 

Thanks,

Matt


