[OpenStack-Infra] citycloud lon1 mirror postmortem

Paul Belanger pabelanger at redhat.com
Thu Aug 10 16:36:07 UTC 2017


On Thu, Aug 10, 2017 at 10:34:56PM +1000, Ian Wienand wrote:
> Hi,
> 
> In response to sdague reporting that citycloud jobs were timing out, I
> investigated the mirror, suspecting it was not providing data fast enough.
> 
> There were some 170 htcacheclean jobs running, and the host had a load
> over 100.  I killed all these, but performance was still unacceptable.
> 
> I suspected networking, but since the host was in such a bad state I
> decided to reboot it.  Unfortunately it would get an address from DHCP
> but seemed to have DNS issues ... eventually it would ping but nothing
> else was working.
> 
> nodepool.o.o was placed in the emergency file and I removed lon1 to
> avoid jobs going there.
> 
> I used the citycloud live chat, and Kim helpfully investigated and
> ended up migrating mirror.lon1.citycloud.openstack.org to a new
> compute node.  This appeared to fix things, for us at least.
> 
> nodepool.o.o is removed from the emergency file and original config
> restored.
> 
> With hindsight, the excessive htcacheclean processes were clearly a
> feedback effect: the network/dns issues slowed each run, so successive
> invocations bunched up over time.  Even so, I think we could minimise
> further issues by running it under a lock [1].  Beyond that I'm not
> sure there is much else we can do; this looks to have been largely an
> upstream issue.
> 
> Cheers,
> 
> -i
> 
> [1] https://review.openstack.org/#/c/492481/
> 
Thanks. I also noticed a job failing to download a package from
mirror.iad.rax.openstack.org. When I SSH'd to the server I too saw high load
(6.0+) and multiple htcacheclean processes running.

I audited the other mirrors and they had the same problem, so I killed all
the processes there too.  I can confirm the lock patch has merged; I will
keep an eye on it.

I did notice that mirror.lon1.citycloud.openstack.org was still slow to
respond to shell commands. I still think we have an IO bottleneck somewhere;
possibly the compute host is throttling something.  We should keep an eye
on it.
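A quick way to check for that kind of host-side IO throttling (a rough
sketch; the path and sizes are arbitrary, and iostat needs the sysstat
package installed):

```shell
#!/bin/sh
# Time a synced sequential write; conv=fdatasync forces the data to disk
# before dd reports its rate, so the number reflects real device speed
# rather than the page cache. A rate far below the flavor's expected
# disk throughput suggests the compute host is throttling us.
dd if=/dev/zero of=/var/tmp/ddtest bs=1M count=256 conv=fdatasync
rm -f /var/tmp/ddtest
# For a live view of device latency and saturation (requires sysstat),
# sample extended stats and watch the await and %util columns:
# iostat -dx 5 3
```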

-PB


