[ci] Kernel panics in the guest vm
smooney at redhat.com
Mon Dec 7 15:35:11 UTC 2020
On Mon, 2020-12-07 at 15:04 +0100, Radosław Piliszek wrote:
> I wonder why we have not seen this in Kolla CIs.
> We always spawn one cirros instance.
> Could it be related to doing this concurrently?
> As in, some qemu/kvm component has an ugly race condition?
It is an intermittent failure.
Generally one VM out of the entire tempest run will hit this and the rest will be fine.
I think this is just a guest kernel issue.
Concurrency may be a factor, as might load, but if you are only spawning one VM I would guess it's just
much less likely to fail in this manner.
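
Since only one VM in a run tends to hit this, it can help to check a guest's console output for the panic signature before blaming SSH or networking. The following is a minimal sketch, not part of tempest or any CI tooling: the function names find_kernel_panic and is_ioapic_timer_panic are hypothetical, and it assumes the console log is already available as a string. The matched text is taken from the console log quoted further down in this thread.

```python
import re

# Panic line format as it appears in the guest console log quoted in this thread.
PANIC_RE = re.compile(r"Kernel panic - not syncing: (?P<reason>.*)")
# Marker specific to the IO-APIC timer failure discussed here.
IOAPIC_TIMER_MARKER = "IO-APIC + timer doesn't work!"


def find_kernel_panic(console_log):
    """Return the reason text of the first kernel panic line, or None."""
    for line in console_log.splitlines():
        match = PANIC_RE.search(line)
        if match:
            return match.group("reason").strip()
    return None


def is_ioapic_timer_panic(console_log):
    """Return True if the first panic in the log is the IO-APIC timer one."""
    reason = find_kernel_panic(console_log)
    return reason is not None and IOAPIC_TIMER_MARKER in reason
```

A caller could run is_ioapic_timer_panic() on the console output of a guest that failed the SSH check and count how often the result is True across a run.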
> On Mon, Dec 7, 2020 at 2:02 PM Sean Mooney <smooney at redhat.com> wrote:
> > On Sun, 2020-12-06 at 10:42 +0100, Slawek Kaplonski wrote:
> > > Hi,
> > >
> > > For some time I have noticed that scenario jobs quite often fail because of
> > > an issue with SSH to the guest VM. When I checked the reason for the SSH
> > > failure, it seemed to be a kernel panic in the guest VM, e.g.:
> > >
> > > [ 0.000000] Console: colour VGA+ 80x25
> > > [ 0.000000] printk: console [tty1] enabled
> > > [ 0.000000] printk: console [ttyS0] enabled
> > > [ 0.000000] ACPI: Core revision 20190703
> > > [ 0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604467 ns
> > > [ 0.000000] APIC: Switch to symmetric I/O mode setup
> > > [ 0.000000] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
> > > [ 0.000000] ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> > > [ 0.000000] ...trying to set up timer (IRQ0) through the 8259A ...
> > > [ 0.000000] ..... (found apic 0 pin 2) ...
> > > [ 0.000000] ....... failed.
> > > [ 0.000000] ...trying to set up timer as Virtual Wire IRQ...
> > > [ 0.000000] ..... failed.
> > > [ 0.000000] ...trying to set up timer as ExtINT IRQ...
> > > [ 0.000000] ..... failed :(.
> > > [ 0.000000] Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with apic=debug and send a report. Then try booting with the 'noapic' option.
> > > [ 0.000000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-26-generic #28~18.04.1-Ubuntu
> > > [ 0.000000] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.13.0-1ubuntu1 04/01/2014
> > > [ 0.000000] Call Trace:
> > > [ 0.000000] dump_stack+0x6d/0x95
> > > [ 0.000000] panic+0xfe/0x2d4
> > > [ 0.000000] check_timer+0x5e8/0x685
> > > [ 0.000000] ? radix_tree_lookup+0xd/0x10
> > > [ 0.000000] setup_IO_APIC+0x182/0x1ca
> > > [ 0.000000] apic_intr_mode_init+0x1f5/0x1f8
> > > [ 0.000000] x86_late_time_init+0x1b/0x22
> > > [ 0.000000] start_kernel+0x4cb/0x58b
> > > [ 0.000000] x86_64_start_reservations+0x24/0x26
> > > [ 0.000000] x86_64_start_kernel+0x74/0x77
> > > [ 0.000000] secondary_startup_64+0xa4/0xb0
> > > [ 0.000000] ---[ end Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with apic=debug and send a report. Then try booting with the 'noapic' option. ]---
> > >
> > > Logstash tells me that this is a problem not only in neutron-related jobs.
> > > Has anyone already tried to investigate this issue, and do you
> > > have any ideas about what we can do about it?
> > > In the specific example above, the Cirros 0.5.1 image was used, but I didn't
> > > check whether that is the case in all the other failures, TBH.
> > This has been happening for months; it's not new.
> > This might be an issue with the CI providers' qemu version or the kernel in the cirros image.
> > We could provide a way to disable the IO-APIC via nova, likely via an image property which we would set on the cirros image via devstack.
> > Beyond that I don't know what we can do other than move to something like alpine, which is maintained, instead of cirros.
> > RHEL https://bugzilla.redhat.com/show_bug.cgi?id=221658 and Ubuntu https://bugs.launchpad.net/ubuntu/+source/linux/+bug/52553
> > have both hit this issue in the past in the ~2.6 kernel timeframe.
> > Cirros uses an Ubuntu 18.04 kernel, so I think it's more likely to be https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1856387
> > That is in theory fixed in the 4.15 kernel that 18.04 defaults to, but cirros is using a 5.3 kernel, which I think is from the cloud archive and
> > might not be patched.
> > >
> > >  https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_c50/764921/1/gate/neutron-tempest-plugin-scenario-openvswitch-iptables_hybrid/c501b2c/testr_results.html
> > >  http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Kernel%20panic%20-%20not%20syncing%3A%20IO-APIC%20%2B%20timer%20doesn't%20work!%20%20Boot%20with%20apic%3Ddebug%20and%20send%20a%20report.%20%20Then%20try%20booting%20with%20the%20'noapic'%20option.%5C%22
> > >