[openstack-dev] Many timeouts in zuul gates for TripleO
whayutin at redhat.com
Mon Jan 22 17:20:50 UTC 2018
On Mon, Jan 22, 2018 at 6:55 AM, Or Idgar <oidgar at redhat.com> wrote:
> Still having timeouts but now in tripleo-heat-templates experimental gates
> (tripleo-ci-centos-7-ovb-fakeha-caserver and tripleo-ci-centos-7-ovb-ha-
> see examples:
> Anyone have an idea what we can do to fix it?
> On Sat, Jan 20, 2018 at 4:38 AM, Paul Belanger <pabelanger at redhat.com>
>> On Fri, Jan 19, 2018 at 11:23:45AM -0600, Ben Nemec wrote:
>> > On 01/18/2018 09:45 AM, Emilien Macchi wrote:
>> > > On Thu, Jan 18, 2018 at 6:34 AM, Or Idgar <oidgar at redhat.com> wrote:
>> > > > Hi,
>> > > > we're encountering many timeouts for zuul gates in TripleO.
>> > > > For example, see
>> > > > http://logs.openstack.org/95/508195/28/check-tripleo/tripleo
>> > > >
>> > > > Rechecks don't help: sometimes a specific gate ends successfully and
>> > > > sometimes it doesn't.
>> > > > The problem is that after a recheck it's not always the same gate that
>> > > > fails.
>> > > >
>> > > > Does anyone have access to the servers' load metrics to see what is
>> > > > causing this?
>> > > > Alternatively, is there something we can do to reduce the time each
>> > > > gate takes?
>> > >
>> > > We're migrating to RDO Cloud for OVB jobs:
>> > > https://review.openstack.org/#/c/526481/
>> > > It's a work in progress but will help a lot for OVB timeouts on RH1.
>> > >
>> > > I'll let the CI folks comment on that topic.
>> > >
>> > I noticed that the timeouts on rh1 have been especially bad lately, so I
>> > did a little testing and found that jobs did seem to be running more
>> > slowly than they should. After some investigation I found that 6 of our
>> > compute nodes had warning messages that the CPU was throttled due to high
>> > temperature. I've disabled 4 of them that had a lot of warnings. The
>> > other 2 only had a handful of warnings, so I'm hopeful we can leave them
>> > active without affecting job performance too much. It won't accomplish
>> > much if we disable the overheating nodes only to overload the remaining
>> > ones. I'll follow up with our hardware people and see if we can determine
>> > why these specific nodes are overheating; they seem to be running 20
>> > degrees C hotter than the rest of the nodes.
>> Did tripleo-test-cloud-rh1 get new kernels applied for Meltdown / Spectre?
>> It's possible that is impacting performance too.
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> Best regards,
> Or Idgar
FYI, we created a Launchpad bug to track decommissioning the OVB jobs on rh1
and moving them to third-party CI.
Up for comments: https://bugs.launchpad.net/tripleo/+bug/1744763