[OpenStack-Infra] citycloud lon1 mirror postmortem
Paul Belanger
pabelanger at redhat.com
Thu Aug 10 16:36:07 UTC 2017
On Thu, Aug 10, 2017 at 10:34:56PM +1000, Ian Wienand wrote:
> Hi,
>
> In response to sdague reporting that citycloud jobs were timing out, I
> investigated the mirror, suspecting it was not providing data fast enough.
>
> There were some 170 htcacheclean jobs running, and the host had a load
> over 100. I killed all these, but performance was still unacceptable.
>
> I suspected networking, but since the host was in such a bad state I
> decided to reboot it. Unfortunately it would get an address from DHCP
> but seemed to have DNS issues ... eventually it would ping but nothing
> else was working.
>
> nodepool.o.o was placed in the emergency file and I removed lon1 to
> avoid jobs going there.
>
> I used the citycloud live chat, and Kim helpfully investigated and
> ended up migrating mirror.lon1.citycloud.openstack.org to a new
> compute node. This appeared to fix things, for us at least.
>
> nodepool.o.o is removed from the emergency file and original config
> restored.
>
> With hindsight, clearly the excessive htcacheclean processes were due
> to negative feedback of slow processes due to the network/dns issues
> all starting to bunch up over time. However, I still think we could
> minimise further issues running it under a lock [1]. Other than that,
> not sure there is much else we can do, I think this was largely an
> upstream issue.
>
> Cheers,
>
> -i
>
> [1] https://review.openstack.org/#/c/492481/
>
Thanks, I also noticed a job fail to download a package from
mirror.iad.rax.openstack.org. When I SSH'd to the server I too saw high load
(6.0+) and multiple htcacheclean processes running.
I did an audit on the other mirrors and they too had the same, so I killed all
the processes there. I can confirm the lock patch merged but will keep an eye
on it.
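
For anyone following along, the idea of running the clean job under a lock
can be sketched roughly as below. This is only an illustration, not the
contents of the merged patch: the function name run_exclusive, the lock file
path and the htcacheclean arguments in the comment are all assumptions. It
takes a non-blocking exclusive flock and skips the run if a previous
invocation still holds it, so slow runs cannot pile up.

```python
import fcntl
import subprocess


def run_exclusive(lock_path, cmd):
    """Run cmd only if nobody else currently holds a lock on lock_path.

    Returns the command's exit status, or None if the lock was busy
    and the command was skipped.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # LOCK_NB makes this fail immediately instead of queueing
            # behind a run that is still in progress.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        # The lock is dropped when lock_file is closed on leaving the
        # with block, i.e. after the command has finished.
        return subprocess.run(cmd).returncode


# Hypothetical usage from a periodic job (path and limit are placeholders):
# run_exclusive("/var/run/htcacheclean.lock",
#               ["htcacheclean", "-n", "-t", "-p", "/var/cache/apache2/proxy",
#                "-l", "70000M"])
```

With something like this, an overlapping cron invocation returns right away
rather than starting another htcacheclean alongside the stuck one.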
I did notice that mirror.lon1.citycloud.openstack.org was still slow to react
to shell commands. I still think we have an IO bottleneck somewhere; possibly
the compute host is throttling something. We should keep an eye on it.
-PB