[openstack-dev] [Tripleo][CI] check-tripleo outage

Robert Collins robertc at robertcollins.net
Sun Mar 2 04:49:29 UTC 2014


We've had 3 (I think) follow-on outages, identical to this one in cause
but addressed somewhat more rapidly, as we have less exploring to do
each time.

HP's DC folk have most recently done a firmware update to everything
in the machine, and advised that if we have another NMI occurrence
we'll replace the motherboard.

That said, infra isn't spinning up nodes properly, but manual testing
brings up nodes just fine with plenty of network speed, so we're not
sure what's up.

-Rob

On 25 February 2014 13:08, Robert Collins <robertc at robertcollins.net> wrote:
> Today we had an outage of the tripleo test cloud :(.
>
> tl;dr:
>  - we were down for 14 hours
>  - we don't know the fundamental cause
>  - infra were not inconvenienced - yaaay
>  - it's all OK now.
>
> Read on for more information, what little we have.
>
> We don't know exactly why it happened yet, but the control plane
> dropped off the network. The console showed the node still had a
> correct networking configuration, including openflow rules and
> bridges. The node was arpingable, and could arping out, but could not
> be pinged. Tcpdump showed the node sending a ping reply on its raw
> ethernet device, but other machines on the same LAN did not see the
> packet.
>
> From syslog we can see events like
>
> Feb 24 06:28:31 ci-overcloud-notcompute0-gxezgcvv4v2q kernel: [1454708.543053] hpsa 0000:06:00.0: cmd_alloc returned NULL!
>
> around the time frame that the drop-off would have happened, but they
> go back many hours before and after that.
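
One way to check whether the hpsa messages cluster around the drop-off
or are just background noise is to count them per hour. This is a
minimal sketch, not something we ran: it assumes the classic syslog
timestamp prefix ("Feb 24 06:28:31 ...") with each event on one line,
and the function name and the default search string are made up for
illustration.

```python
import re
from collections import Counter

# Classic syslog prefix, e.g. "Feb 24 06:28:31 host kernel: ..." (no year).
# Assumption: the log uses this format and each event sits on a single line.
SYSLOG_TS = re.compile(r"^([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}):\d{2}:\d{2}\s")


def count_per_hour(lines, needle="cmd_alloc returned NULL"):
    """Return a Counter mapping (month, day, hour) to the number of
    lines that contain `needle`."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        m = SYSLOG_TS.match(line)
        if m:  # skip lines without a recognisable timestamp
            key = (m.group(1), int(m.group(2)), int(m.group(3)))
            counts[key] += 1
    return counts
```

Feeding it an open log file (for example `count_per_hour(open(path))`,
where the path is whatever your syslog lives at) gives one bucket per
hour; a flat distribution across the day would suggest the messages are
unrelated to the moment the node dropped off.
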
>
> After exhausting everything that came to mind we rebooted the machine,
> which promptly spat an NMI trace into the console:
>
> [1502354.552431]  [<ffffffff810fdf98>] rcu_eqs_enter_common.isra.43+0x208/0x220
> [1502354.552491]  [<ffffffff810ff9ed>] rcu_irq_exit+0x5d/0x90
> [1502354.552549]  [<ffffffff81067670>] irq_exit+0x80/0xc0
> [1502354.552605]  [<ffffffff816f9605>] smp_apic_timer_interrupt+0x45/0x60
> [1502354.552665]  [<ffffffff816f7f9d>] apic_timer_interrupt+0x6d/0x80
> [1502354.552722]  <EOI>  <NMI>  [<ffffffff816e1384>] ? panic+0x193/0x1d7
> [1502354.552880]  [<ffffffffa02d18e5>] hpwdt_pretimeout+0xe5/0xe5 [hpwdt]
> [1502354.552939]  [<ffffffff816efc88>] nmi_handle.isra.3+0x88/0x180
> [1502354.552997]  [<ffffffff816eff11>] do_nmi+0x191/0x330
> [1502354.553053]  [<ffffffff816ef201>] end_repeat_nmi+0x1e/0x2e
> [1502354.553111]  [<ffffffff813d46c2>] ? intel_idle+0xc2/0x120
> [1502354.553168]  [<ffffffff813d46c2>] ? intel_idle+0xc2/0x120
> [1502354.553226]  [<ffffffff813d46c2>] ? intel_idle+0xc2/0x120
> [1502354.553282]  <<EOE>>  [<ffffffff8159fe90>] cpuidle_enter_state+0x40/0xc0
> [1502354.553408]  [<ffffffff8159ffd9>] cpuidle_idle_call+0xc9/0x210
> [1502354.553466]  [<ffffffff8101bafe>] arch_cpu_idle+0xe/0x30
> [1502354.553523]  [<ffffffff810b54c5>] cpu_startup_entry+0xe5/0x280
> [1502354.553581]  [<ffffffff816d64b7>] rest_init+0x77/0x80
> [1502354.553638]  [<ffffffff81d26ef7>] start_kernel+0x40a/0x416
> [1502354.553695]  [<ffffffff81d268f6>] ? repair_env_string+0x5c/0x5c
> [1502354.553753]  [<ffffffff81d26120>] ? early_idt_handlers+0x120/0x120
> [1502354.553812]  [<ffffffff81d265de>] x86_64_start_reservations+0x2a/0x2c
> [1502354.553871]  [<ffffffff81d266e8>] x86_64_start_kernel+0x108/0x117
> [1502354.553929] ---[ end trace 166b62e89aa1f54b ]---
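
For comparing this trace against the follow-on ones, it can help to
reduce each to a list of symbols. A rough sketch, assuming the frame
layout seen above ("[<address>] symbol+0xoffset/0xsize [module]", with a
leading "?" on frames the kernel's unwinder could not verify); the
function and field names are my own, not from any tool.

```python
import re

# Matches one stack frame as printed in the trace above, e.g.
#   [<ffffffffa02d18e5>] hpwdt_pretimeout+0xe5/0xe5 [hpwdt]
#   [<ffffffff816e1384>] ? panic+0x193/0x1d7
FRAME = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unverified>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(text):
    """Return one dict per stack frame found in `text`, in order.

    Lines without a frame (e.g. the "end trace" marker) are skipped.
    """
    frames = []
    for line in text.splitlines():
        m = FRAME.search(line)
        if m:
            frames.append({
                "symbol": m.group("symbol"),
                "module": m.group("module"),          # None for core kernel
                "reliable": m.group("unverified") is None,
            })
    return frames
```

Filtering the result on `module == "hpwdt"` picks out the HP watchdog
frame, which makes it quick to see whether a later trace goes through
the same path.
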
>
> 'yay'. After that, we did a power reset in the console and it came up
> OK; it just needed a minor nudge to refresh its heat configuration and
> we were up and running again.
>
> For some reason, neutron decided to rename its agents at this point
> and we had to remove and reattach the l3 agent before VM connectivity
> was restored.
> https://bugs.launchpad.net/tripleo/+bug/1284354
>
> However, about 90 nodepool nodes were stuck in states like ACTIVE
> deleting, and did not clear until we did a rolling restart of every
> nova compute process.
> https://bugs.launchpad.net/tripleo/+bug/1284356
>
> Cheers,
> Rob
>
> --
> Robert Collins <rbtcollins at hp.com>
> Distinguished Technologist
> HP Converged Cloud



-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Converged Cloud


