Open Stack

Thu Feb 27 02:55:23 UTC 2014

On 25/02/14 00:08, Robert Collins wrote:
> Today we had an outage of the tripleo test cloud :(.
> 
> tl;dr:
>  - we were down for 14 hours
>  - we don't know the fundamental cause
>  - infra were not inconvenienced - yaaay
>  - it's all ok now.
Looks like we've hit the same problem again tonight. So far I've:
o rebooted the server
o fixed up the hostname
o restarted nova and neutron services on the controller

VMs are still not getting IPs, and I'm not seeing DHCP requests from them
coming into dnsmasq. I spent some time trying to figure out the problem
with no luck; I'll pick this up again in a few hours if nobody else has
before then.

> 
> Read on for more information, what little we have.
> 
> We don't know exactly why it happened yet, but the control plane
> dropped off the network. Console showed node still had a correct
> networking configuration, including openflow rules and bridges. The
> node was arpingable, and could arping out, but could not be pinged.
> Tcpdump showed the node sending a ping reply on its raw ethernet
> device, but other machines on the same LAN did not see the packet.
> 
> From syslog we can see
> Feb 24 06:28:31 ci-overcloud-notcompute0-gxezgcvv4v2q kernel:
> [1454708.543053] hpsa 0000:06:00.0: cmd_alloc returned NULL!
> events
> 
> around the time frame that the drop-off would have happened, but they
> go back many hours before and after that.
> 
> After exhausting everything that came to mind we rebooted the machine,
> which promptly spat an NMI trace into the console:
> 
> [1502354.552431]  [<ffffffff810fdf98>] rcu_eqs_enter_common.isra.43+0x208/0x220
> [1502354.552491]  [<ffffffff810ff9ed>] rcu_irq_exit+0x5d/0x90
> [1502354.552549]  [<ffffffff81067670>] irq_exit+0x80/0xc0
> [1502354.552605]  [<ffffffff816f9605>] smp_apic_timer_interrupt+0x45/0x60
> [1502354.552665]  [<ffffffff816f7f9d>] apic_timer_interrupt+0x6d/0x80
> [1502354.552722]  <EOI>  <NMI>  [<ffffffff816e1384>] ? panic+0x193/0x1d7
> [1502354.552880]  [<ffffffffa02d18e5>] hpwdt_pretimeout+0xe5/0xe5 [hpwdt]
> [1502354.552939]  [<ffffffff816efc88>] nmi_handle.isra.3+0x88/0x180
> [1502354.552997]  [<ffffffff816eff11>] do_nmi+0x191/0x330
> [1502354.553053]  [<ffffffff816ef201>] end_repeat_nmi+0x1e/0x2e
> [1502354.553111]  [<ffffffff813d46c2>] ? intel_idle+0xc2/0x120
> [1502354.553168]  [<ffffffff813d46c2>] ? intel_idle+0xc2/0x120
> [1502354.553226]  [<ffffffff813d46c2>] ? intel_idle+0xc2/0x120
> [1502354.553282]  <<EOE>>  [<ffffffff8159fe90>] cpuidle_enter_state+0x40/0xc0
> [1502354.553408]  [<ffffffff8159ffd9>] cpuidle_idle_call+0xc9/0x210
> [1502354.553466]  [<ffffffff8101bafe>] arch_cpu_idle+0xe/0x30
> [1502354.553523]  [<ffffffff810b54c5>] cpu_startup_entry+0xe5/0x280
> [1502354.553581]  [<ffffffff816d64b7>] rest_init+0x77/0x80
> [1502354.553638]  [<ffffffff81d26ef7>] start_kernel+0x40a/0x416
> [1502354.553695]  [<ffffffff81d268f6>] ? repair_env_string+0x5c/0x5c
> [1502354.553753]  [<ffffffff81d26120>] ? early_idt_handlers+0x120/0x120
> [1502354.553812]  [<ffffffff81d265de>] x86_64_start_reservations+0x2a/0x2c
> [1502354.553871]  [<ffffffff81d266e8>] x86_64_start_kernel+0x108/0x117
> [1502354.553929] ---[ end trace 166b62e89aa1f54b ]---
> 
> 'yay'. After that, and a power reset from the console, it came up ok; it
> just needed a minor nudge to refresh its heat configuration and we were
> up and running again.
> 
> For some reason, neutron decided to rename its agents at this point
> and we had to remove and reattach the l3 agent before VM connectivity
> was restored.
> https://bugs.launchpad.net/tripleo/+bug/1284354
> 
> However, about 90 nodepool nodes were stuck in states like ACTIVE
> deleting, and did not clear until we did a rolling restart of every
> nova compute process.
> https://bugs.launchpad.net/tripleo/+bug/1284356
> 
> Cheers,
> Rob
> 
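
The quoted report notes that the hpsa "cmd_alloc returned NULL!" messages appear around the drop-off but also many hours before and after it. One way to check how those events are spread over time is to bucket them per hour. The following is only a rough sketch, assuming the classic "Mon DD HH:MM:SS host ..." syslog prefix seen in the quoted line and one event per line; the function name count_per_hour is made up for illustration and is not part of any tool used here.

```python
import re
from collections import Counter

# Assumed prefix of a classic syslog line, e.g. "Feb 24 06:28:31 host kernel: ..."
STAMP = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}):\d{2}:\d{2}\s")


def count_per_hour(lines, needle="cmd_alloc returned NULL"):
    """Count lines containing `needle`, bucketed by (month, day, hour).

    `lines` is any iterable of strings, for example an open log file.
    Lines without a recognisable timestamp prefix are skipped.
    """
    buckets = Counter()
    for line in lines:
        if needle not in line:
            continue
        match = STAMP.match(line)
        if match:
            month, day, hour = match.groups()
            buckets[(month, int(day), int(hour))] += 1
    return buckets
```

Passing an open syslog file to count_per_hour and printing the resulting buckets in order would show whether the event rate changes around the time of the outage or stays roughly flat, which is what "many hours before and after" suggests.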

[OpenStack-Infra] [openstack-dev] [Tripleo][CI] check-tripleo outage