On Wed, Aug 18, 2021, at 2:07 PM, Lee Yarwood wrote:
Hello all,
For a while now we've been trying to track down some infrequent but annoying Tempest test cleanup failures in CI when detaching volumes from an instance. After rewriting part of the Tempest logic that controls the cleanup, we've finally been able to confirm that the failures are caused by a kernel panic within the instance at boot time, as documented in the following bug:
Failure to detach volume during Tempest test cleanup due to APIC related kernel panic within the guest OS https://bugs.launchpad.net/nova/+bug/1939108
This was previously found in 2014, but at the time the only fix proposed was a Nova change that solves the problem when a supplied kernel image is used:
cirros 0.3.1 fails to boot https://bugs.launchpad.net/cirros/+bug/1312199
Use no_timer_check with soft-qemu https://review.opendev.org/c/openstack/nova/+/96090
Most (all?) of our CI currently running with [libvirt]virt_type=qemu uses the full Cirros 0.5.2 image. Does anyone have any suggestions on the best way of modifying the image(s) we use in CI to use the no_timer_check kernel command line arg?
The best way is probably to update the image upstream and then update the cirros version in our tests, either in the boot configuration ( https://github.com/cirros-dev/cirros/blob/master/src/boot/grub/menu.lst#L10 ) or maybe with a kernel build flag. Smoser does note in 1312199 above that baking this into the image is an option, though that was some time ago.
If you want to modify the existing images instead, it would probably be a good idea to have something like devstack do it rather than the CI system, so that people running tools like devstack outside of the CI system don't end up with different images.
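To make the menu.lst idea above concrete, here is a minimal Python sketch of appending no_timer_check to the kernel lines of a GRUB legacy style menu.lst. The function name add_kernel_arg and the sample menu contents are hypothetical and written only for illustration; they are not taken from the cirros tree, and the real file may be laid out differently.

```python
import re


def add_kernel_arg(menu_lst: str, arg: str = "no_timer_check") -> str:
    """Append `arg` to every `kernel` line of a menu.lst-style text.

    Hypothetical helper: lines that already carry the argument are left
    untouched, so running it twice gives the same result.
    """
    out = []
    for line in menu_lst.splitlines():
        stripped = line.strip()
        # Only touch lines that start with the `kernel` directive.
        if re.match(r"kernel\s", stripped) and arg not in stripped.split():
            line = line.rstrip() + " " + arg
        out.append(line)
    return "\n".join(out) + "\n"


# Illustrative input only; not the actual contents of the cirros menu.lst.
sample = (
    "default 0\n"
    "timeout 0\n"
    "title CirrOS\n"
    "  kernel /vmlinuz root=LABEL=cirros-rootfs ro\n"
    "  initrd /initrd.img\n"
)

if __name__ == "__main__":
    print(add_kernel_arg(sample), end="")
```

Whether this edit is applied upstream in the cirros build or by devstack against a downloaded image, an idempotent edit like this is the part that matters; getting the file in and out of the image is a separate step not covered here.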
Thanks in advance,
-- Lee Yarwood A5D1 9385 88CB 7E5F BE64 6618 BCA6 6E33 F672 2D76