[openstack-dev] Many timeouts in zuul gates for TripleO
Ben Nemec
openstack at nemebean.com
Fri Jan 19 17:23:45 UTC 2018
On 01/18/2018 09:45 AM, Emilien Macchi wrote:
> On Thu, Jan 18, 2018 at 6:34 AM, Or Idgar <oidgar at redhat.com> wrote:
>> Hi,
>> we're encountering many timeouts for zuul gates in TripleO.
>> For example, see
>> http://logs.openstack.org/95/508195/28/check-tripleo/tripleo-ci-centos-7-ovb-ha-oooq/c85fcb7/.
>>
>> Rechecks don't help: sometimes a specific gate finishes successfully and
>> sometimes it does not.
>> The problem is that after a recheck it's not always the same gate that
>> fails.
>>
>> Does anyone have access to the server load data to see what is causing this?
>> Alternatively, is there something we can do to reduce the running
>> time of each gate?
>
> We're migrating to RDO Cloud for OVB jobs:
> https://review.openstack.org/#/c/526481/
> It's a work in progress but will help a lot for OVB timeouts on RH1.
>
> I'll let the CI folks comment on that topic.
>
I noticed that the timeouts on rh1 have been especially bad lately, so
I did a little testing and found that it did seem to be running more
slowly than it should. After some investigation I found that 6 of our
compute nodes have logged warnings that the CPU was throttled due to
high temperature. I've disabled the 4 that had a lot of warnings.
The other 2 had only a handful of warnings, so I'm hopeful we can leave
them active without affecting job performance too much. It won't
accomplish much if we disable the overheating nodes only to overload the
remaining ones.

I'll follow up with our hardware people and see if we can determine why
these specific nodes are overheating. They seem to be running 20
degrees C hotter than the rest of the nodes.
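
For anyone who wants to run a similar check on their own compute nodes, here is a rough sketch of the triage described above: count thermal throttle warnings per node, then separate nodes with many warnings from nodes with only a few. The message does not say how the warnings were actually found, so everything here is an assumption: the pattern is modeled on the usual Linux x86 kernel message ("temperature above threshold, cpu clock throttled"), and the function names, node names, and threshold are hypothetical.

```python
import re
from collections import Counter

# Assumed message shape, modeled on the Linux x86 thermal throttle warning.
# Adjust the pattern if your kernel logs a different string.
THROTTLE_RE = re.compile(
    r"temperature above threshold, cpu clock throttled", re.IGNORECASE
)


def count_throttle_warnings(log_lines_by_node):
    """Return a Counter mapping node name -> number of throttle warnings.

    log_lines_by_node is a dict of node name -> iterable of kernel log lines
    (for example, lines collected from each node's kernel log).
    """
    counts = Counter()
    for node, lines in log_lines_by_node.items():
        counts[node] = sum(1 for line in lines if THROTTLE_RE.search(line))
    return counts


def split_nodes(counts, threshold):
    """Split nodes into (disable, watch) lists by warning count.

    Nodes at or above the (hypothetical) threshold are candidates to disable;
    nodes with at least one warning but fewer than the threshold are left
    active and watched. Nodes with zero warnings appear in neither list.
    """
    disable = sorted(n for n, c in counts.items() if c >= threshold)
    watch = sorted(n for n, c in counts.items() if 0 < c < threshold)
    return disable, watch
```

The threshold is a judgment call. As noted above, disabling too many nodes just moves the load onto the remaining ones, so it is probably worth setting it high enough that only the worst offenders are taken out.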