[openstack-dev] [TripleO] CI outage

Dan Prince dprince at redhat.com
Sat Mar 21 01:41:11 UTC 2015

Short version:

The RH1 CI region has been down since yesterday afternoon.

We have a misbehaving switch and have filed a support ticket with the
vendor to troubleshoot things further. We hope to know more this
weekend, or Monday at the latest.

Long version:

Yesterday afternoon we started seeing issues in scheduling jobs on the
RH1 CI cloud. We haven't made any OpenStack configuration changes
recently, and things have been quite stable for some time now (our
uptime was 365 days on the controller).

Initially we found a misconfigured Keystone URL which was preventing
some diagnostic queries via OS clients external to the rack. However,
this setting hadn't been changed recently and didn't seem to bother
nodepool before, so I don't think it is the cause of the outage...

MySQL also got a bounce. It seemed happy enough after a restart as well.

After fixing the Keystone setting and bouncing MySQL, instances appeared
to go ACTIVE, but we were still having connectivity issues getting
floating IPs and DHCP working on overcloud instances. After a good bit
of debugging we started looking at the switches. Turns out one of them
has high CPU usage (above the warning threshold) and MAC addresses are
also unstable (ports are moving around).

Until this is resolved, RH1 is unavailable to host CI jobs. Will
post back here with an update once we have more information.

