[OpenStack-Infra] [Tripleo][CI] check-tripleo outage

Robert Collins robertc at robertcollins.net
Tue Feb 25 00:08:26 UTC 2014


Today we had an outage of the tripleo test cloud :(.

tl;dr:
 - we were down for 14 hours
 - we don't know the fundamental cause
 - infra were not inconvenienced - yaaay
 - it's all OK now.

Read on for more information, what little we have.

We don't know exactly why it happened yet, but the control plane
dropped off the network. The console showed the node still had a
correct networking configuration, including openflow rules and bridges.
The node was arpingable, and could arping out, but could not be pinged.
Tcpdump showed the node sending a ping reply on its raw ethernet
device, but other machines on the same LAN did not see the packet.

From syslog we can see events like

Feb 24 06:28:31 ci-overcloud-notcompute0-gxezgcvv4v2q kernel: [1454708.543053] hpsa 0000:06:00.0: cmd_alloc returned NULL!

around the time frame that the drop-off would have happened, but they
go back many hours before and after that.
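
To check whether the hpsa messages cluster around the drop-off or are
just background noise, something like the following could bucket them
by hour. This is only a sketch: the function name and the default
search string are illustrative, and it assumes the classic
"Mon DD HH:MM:SS host ..." syslog line format with each event on a
single line, as in the message quoted above.

```python
import re
from collections import Counter

# Leading timestamp of a classic syslog line, e.g. "Feb 24 06:28:31 host ...".
# Assumption: the log uses this format and has no year field.
LINE_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<hour>\d{2}):\d{2}:\d{2}\s"
)


def count_events_per_hour(lines, needle="cmd_alloc returned NULL"):
    """Count lines containing `needle`, keyed by (month, day, hour)."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            # Not a recognisable syslog line (e.g. a wrapped continuation).
            continue
        key = (match["month"], int(match["day"]), int(match["hour"]))
        counts[key] += 1
    return counts
```

Feeding it the lines of the node's syslog file and printing the
resulting counts would show whether the rate changed near the outage
or stayed flat throughout, which is what the quoted observation
suggests.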

After exhausting everything that came to mind we rebooted the machine,
which promptly spat an NMI trace into the console:

[1502354.552431]  [<ffffffff810fdf98>] rcu_eqs_enter_common.isra.43+0x208/0x220
[1502354.552491]  [<ffffffff810ff9ed>] rcu_irq_exit+0x5d/0x90
[1502354.552549]  [<ffffffff81067670>] irq_exit+0x80/0xc0
[1502354.552605]  [<ffffffff816f9605>] smp_apic_timer_interrupt+0x45/0x60
[1502354.552665]  [<ffffffff816f7f9d>] apic_timer_interrupt+0x6d/0x80
[1502354.552722]  <EOI>  <NMI>  [<ffffffff816e1384>] ? panic+0x193/0x1d7
[1502354.552880]  [<ffffffffa02d18e5>] hpwdt_pretimeout+0xe5/0xe5 [hpwdt]
[1502354.552939]  [<ffffffff816efc88>] nmi_handle.isra.3+0x88/0x180
[1502354.552997]  [<ffffffff816eff11>] do_nmi+0x191/0x330
[1502354.553053]  [<ffffffff816ef201>] end_repeat_nmi+0x1e/0x2e
[1502354.553111]  [<ffffffff813d46c2>] ? intel_idle+0xc2/0x120
[1502354.553168]  [<ffffffff813d46c2>] ? intel_idle+0xc2/0x120
[1502354.553226]  [<ffffffff813d46c2>] ? intel_idle+0xc2/0x120
[1502354.553282]  <<EOE>>  [<ffffffff8159fe90>] cpuidle_enter_state+0x40/0xc0
[1502354.553408]  [<ffffffff8159ffd9>] cpuidle_idle_call+0xc9/0x210
[1502354.553466]  [<ffffffff8101bafe>] arch_cpu_idle+0xe/0x30
[1502354.553523]  [<ffffffff810b54c5>] cpu_startup_entry+0xe5/0x280
[1502354.553581]  [<ffffffff816d64b7>] rest_init+0x77/0x80
[1502354.553638]  [<ffffffff81d26ef7>] start_kernel+0x40a/0x416
[1502354.553695]  [<ffffffff81d268f6>] ? repair_env_string+0x5c/0x5c
[1502354.553753]  [<ffffffff81d26120>] ? early_idt_handlers+0x120/0x120
[1502354.553812]  [<ffffffff81d265de>] x86_64_start_reservations+0x2a/0x2c
[1502354.553871]  [<ffffffff81d266e8>] x86_64_start_kernel+0x108/0x117
[1502354.553929] ---[ end trace 166b62e89aa1f54b ]---

'yay'. After that we did a power reset in the console and it came up
OK; it just needed a minor nudge to refresh its heat configuration and
we were up and running again.
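
As a rough cross-check, the kernel timestamps in the quoted syslog line
and in the trace can be subtracted to see how far apart those two
messages were. This is a sketch, not a precise measurement: it assumes
both bracketed values are seconds since the same boot, that the year is
2014 (taken from the date of this mail, since syslog omits it), and it
ignores any drift between the kernel clock and wall-clock time.

```python
from datetime import datetime, timedelta

# Bracketed kernel timestamps copied from the log excerpts above.
HPSA_TS = 1454708.543053    # hpsa "cmd_alloc returned NULL!" line
TRACE_TS = 1502354.552431   # first line of the NMI trace

# Gap between the two messages, assuming both count from the same boot.
elapsed = timedelta(seconds=TRACE_TS - HPSA_TS)

# Wall-clock time of the hpsa line; the year is an assumption.
hpsa_wall = datetime(2014, 2, 24, 6, 28, 31)

# Estimated wall-clock time of the trace, in the syslog's own timezone.
trace_wall_estimate = hpsa_wall + elapsed

print(f"gap: {elapsed.total_seconds() / 3600:.1f} hours")
print(f"trace at about: {trace_wall_estimate:%b %d %H:%M:%S}")
```

Since the hpsa events go back many hours before the drop-off, this gap
only relates the two quoted lines to each other; it does not pin down
when the outage began.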

For some reason, neutron decided to rename its agents at this point
and we had to remove and reattach the l3 agent before VM connectivity
was restored.
https://bugs.launchpad.net/tripleo/+bug/1284354

However, about 90 nodepool nodes were stuck in states like ACTIVE
deleting, and did not clear until we did a rolling restart of every
nova compute process.
https://bugs.launchpad.net/tripleo/+bug/1284356

Cheers,
Rob

-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Converged Cloud


