[nova][qa][infra] Adding no_timer_check to the kernel command line of our CI images
Hello all,

For a while now we've been attempting to track down some infrequent but annoying Tempest test cleanup failures in CI when detaching volumes from an instance. Finally after rewriting part of the Tempest logic controlling the cleanup we've been able to confirm that this is being caused by a kernel panic within the instance at boot time as documented in the following bug:

Failure to detach volume during Tempest test cleanup due to APIC related kernel panic within the guest OS
https://bugs.launchpad.net/nova/+bug/1939108

This had been previously found in 2014 but at the time a fix was only proposed to Nova that would solve this when using a supplied kernel image:

cirros 0.3.1 fails to boot
https://bugs.launchpad.net/cirros/+bug/1312199

Use no_timer_check with soft-qemu
https://review.opendev.org/c/openstack/nova/+/96090

Most (all?) of our CI currently running with [libvirt]virt_type=qemu uses the full Cirros 0.5.2 image. Does anyone have any suggestions on the best way of modifying the image(s) we use in CI to use the no_timer_check kernel command line arg?

Thanks in advance,

-- Lee Yarwood A5D1 9385 88CB 7E5F BE64 6618 BCA6 6E33 F672 2D76
Thanks Lee. It was causing problems for a long time.

Thanks,
Arkady
-- Arkady Kanevsky, Ph.D. Phone: 972 707-6456 Corporate Phone: 919 729-5744 ext. 8176456
On Wed, Aug 18, 2021, at 2:07 PM, Lee Yarwood wrote:
Most (all?) of our CI currently running with [libvirt]virt_type=qemu uses the full Cirros 0.5.2 image. Does anyone have any suggestions on the best way of modifying the image(s) we use in CI to use the no_timer_check kernel command line arg?
The best way is probably to update the image upstream and then update the cirros version in our tests? https://github.com/cirros-dev/cirros/blob/master/src/boot/grub/menu.lst#L10 or maybe with a kernel build flag? Smoser does note in 1312199 above that baking this into the image is an option though that was some time ago.

If you want to modify the existing images instead it would probably be a good idea to have something like devstack do it rather than the CI system so that people running tools like devstack don't end up with different images outside of the CI system.
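To make the "modify the existing image" idea a bit more concrete, here is a minimal sketch of the text edit involved, assuming the image boots through a GRUB-legacy style menu.lst whose boot entries use a `kernel ...` line. The function name `add_kernel_arg` and the sample file contents are hypothetical and are not taken from Cirros or devstack; getting the edited file back into the image (mounting, guestfs, and so on) is deliberately left out.

```python
import re

FLAG = "no_timer_check"


def add_kernel_arg(menu_lst: str, arg: str = FLAG) -> str:
    """Append `arg` to every `kernel` line that does not already carry it."""
    out = []
    for line in menu_lst.splitlines():
        stripped = line.strip()
        # GRUB-legacy entries boot with "kernel /path/to/vmlinuz arg1 arg2 ...".
        # Only touch those lines, and only if the flag is not already present.
        if re.match(r"kernel\s", stripped) and arg not in stripped.split():
            line = line.rstrip() + " " + arg
        out.append(line)
    # Preserve the trailing newline if the input had one.
    return "\n".join(out) + ("\n" if menu_lst.endswith("\n") else "")


# Hypothetical menu.lst contents, for illustration only.
sample = (
    "default 0\n"
    "timeout 0\n"
    "title CirrOS\n"
    "  kernel /boot/vmlinuz root=LABEL=cirros-rootfs ro\n"
    "  initrd /boot/initrd.img\n"
)

patched = add_kernel_arg(sample)
print(patched)
```

The edit is idempotent, so running it twice over the same file leaves a single copy of the flag, which matters if both the upstream image and a devstack-side change end up adding it.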
On Wed, Aug 18, 2021 at 11:45 PM Clark Boylan <cboylan@sapwetik.org> wrote:
The best way is probably to update the image upstream and then update the cirros version in our tests? https://github.com/cirros-dev/cirros/blob/master/src/boot/grub/menu.lst#L10 or maybe with a kernel build flag? Smoser does note in 1312199 above that baking this into the image is an option though that was some time ago.
If you want to modify the existing images instead it would probably be a good idea to have something like devstack do it rather than the CI system so that people running tools like devstack don't end up with different images outside of the CI system.
+1 on both approaches, with a slight preference for just modifying Cirros upstream: it's not a production image, so we can tweak it to suit KVM-less QEMU constraints without worry.

-yoctozepto
On 19-08-21 09:07:39, Radosław Piliszek wrote:
+1 on both the approaches. With slight preference to just modify cirros upstream - it's not a production image so we can tweak it to suit kvm-less qemu constraints without worry.
Okay, I can try both for the time being, as I'm not entirely convinced that Cirros upstream will accept the change. I'll remove the devstack change if they ever do.

Thanks for the input both!

-- Lee Yarwood A5D1 9385 88CB 7E5F BE64 6618 BCA6 6E33 F672 2D76
On 19-08-21 10:02:27, Lee Yarwood wrote:
Okay I can try both for the time being as I'm not entirely convinced that Cirros upstream will accept the change, removing the devstack change if they ever do.
After talking to sean-k-mooney in #openstack-nova we have ended up reviving an old workaround option Sean had to remove the APIC entirely from our test instances:

https://review.opendev.org/q/topic:workaround-disable-apic

I've also pushed a PR upstream in Cirros but, as I said before, I'm pretty doubtful this will ever land:

https://github.com/cirros-dev/cirros/issues/69
https://github.com/cirros-dev/cirros/pull/70

Cheers,

-- Lee Yarwood A5D1 9385 88CB 7E5F BE64 6618 BCA6 6E33 F672 2D76
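For readers unfamiliar with what "removing the APIC" means at the libvirt level: a guest's domain XML normally lists `<apic/>` under `<features>`, and leaving that element out means the guest is not given an APIC. The sketch below only illustrates that XML difference with the Python standard library. It is not the code from the linked Nova reviews, and the helper name `drop_apic` and the example domain are made up for illustration.

```python
import xml.etree.ElementTree as ET


def drop_apic(domain_xml: str) -> str:
    """Return the domain XML with any <apic/> removed from <features>."""
    root = ET.fromstring(domain_xml)
    features = root.find("features")
    if features is not None:
        # Remove only the APIC feature; leave ACPI and the rest untouched.
        for apic in features.findall("apic"):
            features.remove(apic)
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed domain definition.
example = """<domain type='qemu'>
  <name>test-instance</name>
  <features>
    <acpi/>
    <apic/>
  </features>
</domain>"""

result = drop_apic(example)
print(result)
```

Compared with changing the guest image, this approach works on the hypervisor side, so it applies regardless of which kernel command line the image ships with.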
participants (4)
- Arkady Kanevsky
- Clark Boylan
- Lee Yarwood
- Radosław Piliszek