On 01/18/2018 09:45 AM, Emilien Macchi wrote:
> On Thu, Jan 18, 2018 at 6:34 AM, Or Idgar <oidgar at redhat.com> wrote:
>> Hi,
>> we're encountering many timeouts for zuul gates in TripleO.
>> For example, see
>> http://logs.openstack.org/95/508195/28/check-tripleo/tripleo-ci-centos-7-ovb-ha-oooq/c85fcb7/.
>> Rechecks don't help: a specific gate sometimes finishes successfully and
>> sometimes does not.
>> The problem is that after a recheck it's not always the same gate that
>> fails.
>> Does anyone have access to the server load data to see what is causing this?
>> Alternatively, is there something we can do to reduce the running
>> time of each gate?
> We're migrating to RDO Cloud for OVB jobs:
> https://review.openstack.org/#/c/526481/
> It's a work in progress but will help a lot for OVB timeouts on RH1.
> I'll let the CI folks comment on that topic.

I noticed that the timeouts on rh1 have been especially bad as of late 
so I did a little testing and found that it did seem to be running more 
slowly than it should.  After some investigation I found that 6 of our 
compute nodes have warning messages that the CPU was throttled due to
high temperature.  I've disabled the 4 of them that had a lot of warnings.
The other 2 only had a handful of warnings so I'm hopeful we can leave 
them active without affecting job performance too much.  It won't 
accomplish much if we disable the overheating nodes only to overload the 
remaining ones.
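
For anyone who wants to run the same kind of check on their own nodes,
here is a rough sketch of the triage described above: count the
throttling warnings per node and only flag the ones well above a
handful. It assumes the warnings look like the Linux kernel's usual
"temperature above threshold, cpu clock throttled" message; the node
names, the log-gathering step, and the threshold are all hypothetical
and not what was actually used on rh1.

```python
import re

# Assumption: warnings follow the Linux kernel's thermal throttling message,
# e.g. "CPU0: Core temperature above threshold, cpu clock throttled (...)".
THROTTLE_RE = re.compile(r"temperature above threshold, cpu clock throttled")


def count_throttle_warnings(lines):
    """Return how many log lines report a thermal throttling event."""
    return sum(1 for line in lines if THROTTLE_RE.search(line))


def nodes_to_disable(logs_by_node, threshold):
    """Pick nodes whose warning count exceeds a (hypothetical) threshold.

    logs_by_node maps a node name to an iterable of its kernel log lines.
    Returns the sorted list of nodes to disable and the per-node counts.
    """
    counts = {
        node: count_throttle_warnings(lines)
        for node, lines in logs_by_node.items()
    }
    flagged = sorted(node for node, n in counts.items() if n > threshold)
    return flagged, counts
```

Nodes with only a few warnings stay below the threshold and remain
active, which keeps capacity up so the remaining nodes are not
overloaded.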

I'll follow up with our hardware people and see if we can determine why 
these specific nodes are overheating.  They seem to be running 20 
degrees C hotter than the rest of the nodes.

