[ci] Kernel panics in the guest vm
Hi,

For some time I have noticed that scenario jobs quite often fail because SSH to the guest VM fails. When I checked the reason for the SSH failure, it turned out to be a kernel panic in the guest VM, e.g. [1]:

[ 0.000000] Console: colour VGA+ 80x25
[ 0.000000] printk: console [tty1] enabled
[ 0.000000] printk: console [ttyS0] enabled
[ 0.000000] ACPI: Core revision 20190703
[ 0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
[ 0.000000] APIC: Switch to symmetric I/O mode setup
[ 0.000000] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 0.000000] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
[ 0.000000] ...trying to set up timer (IRQ0) through the 8259A ...
[ 0.000000] ..... (found apic 0 pin 2) ...
[ 0.000000] ....... failed.
[ 0.000000] ...trying to set up timer as Virtual Wire IRQ...
[ 0.000000] ..... failed.
[ 0.000000] ...trying to set up timer as ExtINT IRQ...
[ 0.000000] ..... failed :(.
[ 0.000000] Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with apic=debug and send a report. Then try booting with the 'noapic' option.
[ 0.000000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-26-generic #28~18.04.1-Ubuntu
[ 0.000000] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.13.0-1ubuntu1 04/01/2014
[ 0.000000] Call Trace:
[ 0.000000] dump_stack+0x6d/0x95
[ 0.000000] panic+0xfe/0x2d4
[ 0.000000] check_timer+0x5e8/0x685
[ 0.000000] ? radix_tree_lookup+0xd/0x10
[ 0.000000] setup_IO_APIC+0x182/0x1ca
[ 0.000000] apic_intr_mode_init+0x1f5/0x1f8
[ 0.000000] x86_late_time_init+0x1b/0x22
[ 0.000000] start_kernel+0x4cb/0x58b
[ 0.000000] x86_64_start_reservations+0x24/0x26
[ 0.000000] x86_64_start_kernel+0x74/0x77
[ 0.000000] secondary_startup_64+0xa4/0xb0
[ 0.000000] ---[ end Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with apic=debug and send a report. Then try booting with the 'noapic' option. ]---

Logstash [2] tells me that this problem is not limited to neutron-related jobs. Has anyone already investigated this issue, and do you have any ideas about what we can do about it? In the specific example above [1], the Cirros 0.5.1 image was used, but I have not checked whether that is the case in all other occurrences.

[1] https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/z...
[2] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Kernel%20panic%20-%20not%20syncing%3A%20IO-APIC%20%2B%20timer%20doesn't%20work!%20%20Boot%20with%20apic%3Ddebug%20and%20send%20a%20report.%20%20Then%20try%20booting%20with%20the%20'noapic'%20option.%5C%22

--
Slawek Kaplonski
Principal Software Engineer
Red Hat
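For anyone triaging similar SSH failures, a guest console log can be checked for this specific panic with a few lines of Python. This is only a minimal sketch: the function name `find_ioapic_panic` and the `sample` text are hypothetical, and it assumes the console output is already available as a string. The marker string and the "Not tainted" line format are taken from the log in this thread.

```python
import re
from typing import Optional

# Panic signature copied from the console log quoted in this thread.
PANIC_MARKER = "Kernel panic - not syncing: IO-APIC + timer doesn't work!"
# The kernel release follows "Not tainted" on the "CPU: ..." line of the panic.
RELEASE_RE = re.compile(r"Not tainted (\S+)")


def find_ioapic_panic(console_log: str) -> Optional[str]:
    """Return the guest kernel release if the IO-APIC timer panic is present.

    Returns "unknown" if the panic is present but no release line is found,
    and None if the log does not contain this panic at all.
    """
    if PANIC_MARKER not in console_log:
        return None
    match = RELEASE_RE.search(console_log)
    return match.group(1) if match else "unknown"


# Hypothetical excerpt, shortened from the log above.
sample = """\
[ 0.000000] ..... failed :(.
[ 0.000000] Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with apic=debug and send a report.
[ 0.000000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-26-generic #28~18.04.1-Ubuntu
"""

print(find_ioapic_panic(sample))
```

Reporting the kernel release alongside the match makes it easier to see whether all occurrences come from the same guest kernel, which has not been verified here.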
On Sun, 2020-12-06 at 10:42 +0100, Slawek Kaplonski wrote:

This has been happening for months; it's not new. It might be an issue with the CI providers' QEMU version or with the kernel in the Cirros image. We could provide a way to disable the IO-APIC via nova, likely via an image property, which we would set on the Cirros image via devstack. Beyond that, I don't know what we can do other than move to something like Alpine, which is maintained, instead of Cirros.

RHEL (https://bugzilla.redhat.com/show_bug.cgi?id=221658) and Ubuntu (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/52553) have both hit this issue in the past, in the ~2.6 kernel timeframe.

Cirros uses an Ubuntu 18.04 kernel, so I think it's more likely to be https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1856387

That is in theory fixed in the 4.15 kernel that 18.04 defaults to, but Cirros is using a 5.3 kernel, which I think is from the cloud archive and might not be patched.
I wonder why we have not seen this in Kolla CIs. We always spawn one cirros instance. Could it be related to doing this concurrently? As in, some qemu/kvm component has an ugly race condition?

-yoctozepto

On Mon, Dec 7, 2020 at 2:02 PM Sean Mooney <smooney@redhat.com> wrote:
On Mon, 2020-12-07 at 15:04 +0100, Radosław Piliszek wrote:
> I wonder why we have not seen this in Kolla CIs. We always spawn one cirros instance. Could it be related to doing this concurrently? As in, some qemu/kvm component has an ugly race condition?

It is an intermittent failure. Generally one VM out of the entire tempest run will hit this and the rest will be fine.

I think this is just a guest kernel issue. Concurrency may be a factor, as might load, but if you are only spawning one VM I would guess it's just much less likely to fail in this manner.
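To put rough numbers on the "one VM versus a whole tempest run" argument, here is a small sketch. It assumes every guest boot panics independently with the same probability, which is a simplification given that load and concurrency may be factors. The per-boot probability `P_BOOT` and the boot counts are made-up illustration values, not measurements from CI.

```python
def p_any_panic(p_boot: float, n_boots: int) -> float:
    """Probability that at least one of n_boots independent boots panics."""
    return 1.0 - (1.0 - p_boot) ** n_boots


# Hypothetical per-boot panic probability; not measured anywhere in this thread.
P_BOOT = 0.005

# Compare a job that boots a single guest with jobs that boot many guests.
for n in (1, 50, 200):
    print(f"{n:>3} boots -> P(at least one panic) = {p_any_panic(P_BOOT, n):.3f}")
```

Under these assumptions the chance of seeing at least one panic grows quickly with the number of guests booted, which would be consistent with a job that spawns a single instance rarely hitting the problem.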
participants (3)
- Radosław Piliszek
- Sean Mooney
- Slawek Kaplonski