[openstack-dev] [nova] Rocky RC time regression analysis

Matt Riedemann mriedemos at gmail.com
Tue Oct 9 19:19:52 UTC 2018


On 10/5/2018 6:59 PM, melanie witt wrote:
> 5) when live migration fails due to an internal error, rollback is not 
> handled correctly https://bugs.launchpad.net/nova/+bug/1788014
> 
> - Bug was reported on 2018-08-20
> - The change that caused the regression landed on 2018-07-26, FF day 
> https://review.openstack.org/434870
> - Unrelated to a blueprint, the regression was part of a bug fix
> - Was found because sean-k-mooney was doing live migrations and found 
> that when a LM failed because of a QEMU internal error, the VM remained 
> ACTIVE but the VM no longer had network connectivity.
> - Question: why wasn't this caught earlier?
> - Answer: We would need a live migration job scenario that intentionally 
> initiates and fails a live migration, then verify network connectivity 
> after the rollback occurs.
> - Question: can we add something like that?

Not in Tempest, no, but we could run something in the 
nova-live-migration job since that executes via its own script. We could 
hack something in like what we have proposed for testing evacuate:

https://review.openstack.org/#/c/602174/

The trick is figuring out how to introduce a fault in the destination 
host without taking down the service, because if the compute service is 
down we won't schedule to it.
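
To make the shape of such a check concrete, here is a toy model in plain
Python. It is only a sketch of the scenario (inject a fault on the
destination while its service stays up, then verify the source still has
its port after rollback). FakeHost, MigrationError and live_migrate are
hypothetical names made up for this illustration and are not nova or
nova-live-migration job code.

```python
class MigrationError(Exception):
    """Stand-in for a hypervisor failure such as a QEMU internal error."""


class FakeHost:
    """Hypothetical compute host used only for this illustration."""

    def __init__(self, name, fail_migration=False):
        self.name = name
        # The service stays up, so the host remains a valid target.
        self.service_up = True
        # Injected fault: the host accepts the request, then fails it.
        self.fail_migration = fail_migration
        self.bound_ports = set()

    def receive(self, port_id):
        if self.fail_migration:
            raise MigrationError("injected fault on %s" % self.name)
        self.bound_ports.add(port_id)


def live_migrate(port_id, source, dest):
    """Move a port from source to dest, rolling back on failure.

    Returns the host that ends up owning the port.
    """
    if not dest.service_up:
        # A down service would never be picked, so the fault has to be
        # introduced some other way.
        raise RuntimeError("destination would not be scheduled to")
    source.bound_ports.discard(port_id)
    try:
        dest.receive(port_id)
    except MigrationError:
        # Rollback: the source must get its port back, otherwise the
        # guest stays ACTIVE without connectivity.
        source.bound_ports.add(port_id)
        return source
    return dest


source = FakeHost("source")
source.bound_ports.add("port-1")
dest = FakeHost("dest", fail_migration=True)
owner = live_migrate("port-1", source, dest)
```

The assertion a real job would care about is the last step: after the
failed migration, the port is still usable on the source host.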

> 
> 6) nova-manage db online_data_migrations hangs on instances with no host 
> set https://bugs.launchpad.net/nova/+bug/1788115
> 
> - Bug was reported on 2018-08-21
> - The patch that introduced the bug landed on 2018-05-30 
> https://review.openstack.org/567878
> - Unrelated to a blueprint, the regression was part of a bug fix
> - Question: why wasn't this caught earlier?
> - Answer: To hit the bug, you had to have had instances with no host set 
> (that failed to schedule) in your database during an upgrade. This does 
> not happen during the grenade job.
> - Question: could we add anything to the grenade job that would leave 
> some instances with no host set to cover cases like this?

Probably - I'd think creating a server on the old side with some 
parameters that we know won't schedule would do it, maybe requesting an 
AZ that doesn't exist, or some other kind of scheduler hint that we know 
won't work so we get a NoValidHost. However, instances that fail to 
schedule are stored in cell0, and grenade probably doesn't run 
online_data_migrations against the cell0 database, so I'm not sure we 
would have caught that case.
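
As a rough sketch of the old-side setup, the request could look like the
body built here, using the standard server create fields (name, imageRef,
flavorRef, availability_zone). The helper names and the bogus AZ value
are made up for illustration, and whether an unknown AZ actually reaches
the scheduler and ends in NoValidHost (rather than being rejected
earlier) is an assumption that would need checking on the release under
test.

```python
def build_unschedulable_server_request(name, image_ref, flavor_ref,
                                       bogus_az="az-that-does-not-exist"):
    """Build a POST /servers body that asks for a non-existent AZ.

    Hypothetical helper: the AZ name is an arbitrary value that is
    assumed not to exist in the deployment.
    """
    return {
        "server": {
            "name": name,
            "imageRef": image_ref,
            "flavorRef": flavor_ref,
            "availability_zone": bogus_az,
        }
    }


def has_no_host(server):
    """Return True if a server record is in ERROR with no host set.

    'server' is a dict shaped like the admin view of GET /servers/{id},
    where the host is exposed as 'OS-EXT-SRV-ATTR:host'.
    """
    return (server.get("status") == "ERROR"
            and not server.get("OS-EXT-SRV-ATTR:host"))


body = build_unschedulable_server_request(
    "grenade-no-host", "IMAGE_UUID", "FLAVOR_ID")
```

The has_no_host check is what the old side would assert before the
upgrade, so the job fails early if the instance got scheduled after all.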

-- 

Thanks,

Matt


