[openstack-dev] Many timeouts in zuul gates for TripleO

Wesley Hayutin whayutin at redhat.com
Mon Jan 22 17:20:50 UTC 2018


On Mon, Jan 22, 2018 at 6:55 AM, Or Idgar <oidgar at redhat.com> wrote:

> Hi,
> We're still seeing timeouts, but now in the tripleo-heat-templates
> experimental gates (tripleo-ci-centos-7-ovb-fakeha-caserver and
> tripleo-ci-centos-7-ovb-ha-tempest-oooq).
>
> see examples:
> http://logs.openstack.org/31/518331/23/experimental-tripleo/tripleo-ci-centos-7-ovb-fakeha-caserver/7502e82/
> http://logs.openstack.org/31/518331/23/experimental-tripleo/tripleo-ci-centos-7-ovb-ha-tempest-oooq/46e8e0d/
>
> Anyone have an idea what we can do to fix it?
>
> Thanks,
> Idgar
>
> On Sat, Jan 20, 2018 at 4:38 AM, Paul Belanger <pabelanger at redhat.com>
> wrote:
>
>> On Fri, Jan 19, 2018 at 11:23:45AM -0600, Ben Nemec wrote:
>> >
>> >
>> > On 01/18/2018 09:45 AM, Emilien Macchi wrote:
>> > > On Thu, Jan 18, 2018 at 6:34 AM, Or Idgar <oidgar at redhat.com> wrote:
>> > > > Hi,
>> > > > we're encountering many timeouts for zuul gates in TripleO.
>> > > > For example, see
>> > > > http://logs.openstack.org/95/508195/28/check-tripleo/tripleo-ci-centos-7-ovb-ha-oooq/c85fcb7/.
>> > > >
>> > > > Rechecks don't help: sometimes a given job finishes successfully
>> > > > and sometimes it doesn't. The problem is that after a recheck it's
>> > > > not always the same job that fails.
>> > > >
>> > > > Is there someone with access to the servers who can check the load
>> > > > and see what's causing this? Alternatively, is there something we
>> > > > can do to reduce the running time of each job?
>> > >
>> > > We're migrating to RDO Cloud for OVB jobs:
>> > > https://review.openstack.org/#/c/526481/
>> > > It's a work in progress but will help a lot for OVB timeouts on RH1.
>> > >
>> > > I'll let the CI folks comment on that topic.
>> > >
>> >
>> > I noticed that the timeouts on rh1 have been especially bad as of late,
>> > so I did a little testing and found that it did seem to be running more
>> > slowly than it should.  After some investigation I found that 6 of our
>> > compute nodes have warning messages that the cpu was throttled due to
>> > high temperature.  I've disabled 4 of them that had a lot of warnings.
>> > The other 2 only had a handful of warnings, so I'm hopeful we can leave
>> > them active without affecting job performance too much.  It won't
>> > accomplish much if we disable the overheating nodes only to overload
>> > the remaining ones.
>> >
>> > I'll follow up with our hardware people and see if we can determine why
>> > these specific nodes are overheating.  They seem to be running 20
>> > degrees C hotter than the rest of the nodes.
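>> >
>> > In case it's useful to anyone else, here's a rough sketch of the kind
>> > of check involved; the exact throttle message text varies by kernel
>> > version, so treat the pattern (and the script itself) as illustrative:
>> >
>> > #!/usr/bin/env python
>> > # Rough sketch: count thermal-throttle warnings in the kernel log on a
>> > # compute node.  The message format varies by kernel version, so the
>> > # regex below is an assumption, not a definitive match.
>> > import re
>> > import subprocess
>> >
>> > def count_throttle_warnings():
>> >     # dmesg lines look roughly like:
>> >     #   CPU3: Core temperature above threshold, cpu clock throttled ...
>> >     out = subprocess.check_output(['dmesg']).decode('utf-8', 'replace')
>> >     pattern = re.compile(r'temperature above threshold.*throttled', re.I)
>> >     return sum(1 for line in out.splitlines() if pattern.search(line))
>> >
>> > if __name__ == '__main__':
>> >     print('thermal throttle warnings: %d' % count_throttle_warnings())
>> >
>> > Nodes with a high count can then be pulled out of rotation, e.g. with
>> > "openstack compute service set --disable <host> nova-compute".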
>> >
>> Did tripleo-test-cloud-rh1 get new kernels applied for Meltdown /
>> Spectre?  It's possible that's impacting performance too.
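>>
>> For what it's worth, on kernels that carry the fixes the mitigation
>> state can be read straight from sysfs.  A minimal sketch, assuming a
>> kernel new enough to expose /sys/devices/system/cpu/vulnerabilities
>> (its absence on an otherwise up-to-date host is itself an answer):
>>
>> import glob
>> import os
>>
>> # Rough sketch: print the kernel's reported Meltdown/Spectre mitigation
>> # status.  The vulnerabilities directory only exists on kernels that
>> # shipped with (or backported) the mitigation patches.
>> for path in sorted(glob.glob('/sys/devices/system/cpu/vulnerabilities/*')):
>>     with open(path) as f:
>>         print('%s: %s' % (os.path.basename(path), f.read().strip()))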
>>
>> -Paul
>>
>
>
>
> --
> Best regards,
> Or Idgar
>
>
FYI, we created a Launchpad bug to track decommissioning the OVB jobs on
rh1 and moving them to third-party CI.
It's up for comments: https://bugs.launchpad.net/tripleo/+bug/1744763