[OpenStack-Infra] citycloud lon1 mirror postmortem

Ian Wienand iwienand at redhat.com
Thu Aug 10 12:34:56 UTC 2017


In response to sdague reporting that citycloud jobs were timing out, I
investigated the mirror, suspecting it was not providing data fast enough.

There were some 170 htcacheclean jobs running, and the host had a load
over 100.  I killed all these, but performance was still unacceptable.

I suspected networking, but since the host was in such a bad state I
decided to reboot it.  Unfortunately, after the reboot it would get an
address from DHCP but seemed to have DNS issues; eventually it would
respond to ping, but nothing else was working.

nodepool.o.o was placed in the emergency file and I removed lon1 to
avoid jobs going there.

I used the citycloud live chat, and Kim helpfully investigated and
ended up migrating mirror.lon1.citycloud.openstack.org to a new
compute node.  This appeared to fix things, for us at least.

nodepool.o.o has now been removed from the emergency file and the
original config restored.

With hindsight, the excessive htcacheclean processes were clearly a
symptom rather than the cause: each run was slowed by the network/DNS
issues, so runs piled up on top of each other and bunched up over
time.  However, I still think we could minimise further issues by
running it under a lock [1].  Other than that, I'm not sure there is
much else we can do; I think this was largely an upstream issue.
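
To illustrate what "running it under a lock" could look like, here is
a minimal sketch in Python.  It is not the change proposed in [1]; the
helper name run_locked and the lock file path are assumptions made up
for this example.  The idea is simply that a new run is skipped when a
previous one still holds the lock, so slow runs cannot pile up.

```python
import fcntl
import os
import subprocess
import sys


def run_locked(lock_path, cmd):
    """Run cmd only if nobody else holds an exclusive lock on lock_path.

    Returns the command's exit code, or None if the lock was already
    held and the run was skipped.  (Hypothetical helper, for
    illustration only.)
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # Non-blocking: fail straight away rather than queueing up
            # behind a slow run, which is how processes accumulate.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        # Pass the descriptor to the child so the lock stays held for
        # as long as the child runs, even if this wrapper is killed.
        return subprocess.call(cmd, pass_fds=(fd,))
    finally:
        # Closing our descriptor drops our reference to the lock.
        os.close(fd)
```

A periodic job could then call something like
run_locked("/var/run/htcacheclean.lock", ["htcacheclean", ...]) with
whatever arguments the existing job already uses, and treat a None
result as "previous run still going, nothing to do".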



[1] https://review.openstack.org/#/c/492481/

