[openstack-dev] [TripleO][review] Please treat -1s on check-tripleo-*-precise as voting.

Derek Higgins derekh at redhat.com
Fri Feb 21 16:36:06 UTC 2014


On 21/02/14 03:31, Robert Collins wrote:
> On 18 February 2014 04:30, Derek Higgins <derekh at redhat.com> wrote:
>> On 17/02/14 01:25, Robert Collins wrote:
>>> Hi!
>>>
>>> The nascent tripleo-gate is now running on all tripleo repositories,
>>> *and should pass*, but are not yet voting. They aren't voting because
>>> we cannot submit to the gate unless jenkins votes verified... *and* we
>>> have no redundancy for the tripleo-ci cloud now, so any glitch in the
>>> current region will take out our ability to land changes.
>>>
>>> We're working up the path to having two regions as fast as we can- and
>>> once we do we should be up to check or perhaps even gate in short
>>> order :).
>>>
>>> Note: unless you *expand* the jenkins vote, you can't tell if a -1 occurred.
>>>
>>> If, for some reason, we have an infrastructure failure that means
>>> spurious -1's will be occurring, then we'll put that in the #tripleo
>>> topic.
>>
>> It looks like we've hit a glitch, network access to our ci-overcloud
>> controller seems to be gone, I think invoking this clause is needed
>> until the problem is sorted, will update the topic and am working on
>> diagnosing the problem.
> 
> So we fixed that clause, but infra took us out of rotation as we took
> nodepool down before it was fixed.
> 
> We've now:
>  - improved nodepool to handle down clouds more gracefully
>  - moved the tripleo cloud using jobs to dedicated check and
> experimental pipelines
>  - and been reinstated
> 
> So - please look for comments from check-tripleo before approving merges!

The CI cloud seems to be running as expected today, but we still have a
bit of tuning to do.

check-tripleo-overcloud-precise is throwing out false negatives because
the testenv-worker has a timeout that is less than the timeout on the
jenkins job (and less than the length of time it takes to run the job).
o this should handle the false negatives (there's a rough sketch of the
timeout invariant after these items)
  https://review.openstack.org/#/c/75402/

o and this is a more permanent solution (it removes the possibility of
double booking environments - a toy sketch of that property also follows
below); a new test-env cluster will need to be built with it, which we
can do once we've ironed out anything else that pops up over the next
few days.
  https://review.openstack.org/#/c/75403/
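
To make the false-negative mode concrete, here's a rough sketch of the
invariant the first review is restoring. The names and numbers are made
up for illustration; they are not the actual tripleo-ci settings:

  # Hypothetical timeouts, in seconds - illustrative values only.
  JENKINS_JOB_TIMEOUT = 150 * 60     # how long Jenkins lets the job run
  TESTENV_WORKER_TIMEOUT = 180 * 60  # how long the worker holds the env
  TEARDOWN_BUFFER = 10 * 60          # slack for cleanup after the job

  def testenv_hold_is_safe(worker_timeout, job_timeout, buffer):
      """The worker must keep the environment reserved at least as long
      as Jenkins can keep the job running, plus some teardown slack.
      If it doesn't, the env can be reclaimed while the job is still
      using it, which shows up as a spurious -1 (false negative)."""
      return worker_timeout >= job_timeout + buffer

  if __name__ == "__main__":
      if not testenv_hold_is_safe(TESTENV_WORKER_TIMEOUT,
                                  JENKINS_JOB_TIMEOUT,
                                  TEARDOWN_BUFFER):
          raise SystemExit("testenv-worker timeout is shorter than the "
                           "jenkins job timeout: expect false negatives")
      print("timeout ordering looks sane")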
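
And for what the second review is guarding against, a toy sketch of the
"no double booking" property - an environment that has been handed to a
job stays unavailable until it is released or its lease expires. This is
not the actual testenv-worker code, just the behaviour it is meant to
guarantee:

  import threading
  import time

  class EnvBroker(object):
      """Hands out test environments; each env is leased to one job
      at a time, so it can't be double booked."""

      def __init__(self, env_ids, lease_seconds):
          self._lock = threading.Lock()
          self._lease_seconds = lease_seconds
          # env id -> lease expiry time (0 means free)
          self._leases = dict((e, 0.0) for e in env_ids)

      def claim(self):
          """Return a free environment id, or None if all are busy."""
          now = time.monotonic()
          with self._lock:
              for env_id, expires in self._leases.items():
                  if expires <= now:  # free, or its lease ran out
                      self._leases[env_id] = now + self._lease_seconds
                      return env_id
          return None

      def release(self, env_id):
          """Give an environment back so another job can claim it."""
          with self._lock:
              self._leases[env_id] = 0.0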

Current status is that a lot of jobs are failing because the
"nova-manage db sync" on the seed isn't completing quickly enough. This
only started happening today and doesn't immediately suggest a problem
with our test environment setup (unless we are overcommitting resources
on the test environments); I suspect some part of the seed boot process,
on or before the db sync, is now taking longer than it used to. I was
trying to track down the problem but I'm about to run out of time.
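
If anyone wants to poke at it, a quick way to see how long the db sync
itself takes on a seed (rather than something earlier in the boot) is
just to time it. A rough sketch, assuming python3 is on the box; the
600-second budget below is an arbitrary number for illustration, not
anything our jobs enforce:

  import subprocess
  import time

  BUDGET_SECONDS = 600  # illustrative threshold, not a real CI limit

  start = time.monotonic()
  try:
      # "nova-manage db sync" is the real command the seed runs;
      # here we just re-run it under a wall-clock budget.
      subprocess.run(["nova-manage", "db", "sync"],
                     check=True, timeout=BUDGET_SECONDS)
  except subprocess.TimeoutExpired:
      print("db sync still running after %ds - it's the bottleneck"
            % BUDGET_SECONDS)
  else:
      print("db sync finished in %.1fs" % (time.monotonic() - start))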

This raises the question:
  If this proves to be a failure in tripleo-ci that is being caused by a
change that happened outside of tripleo, should we stop merging commits?
Or are we ok to go ahead and merge while also helping the other project
solve the problem? Of course, if we were gating on all projects this
problem would be far less frequent than I suspect it will be, but for
now how do we proceed in these situations?

Derek.


> 
> The tripleo test cloud is still one region, CI is running on 10
> hypervisors and 10 emulated baremetal backend systems, so we have
> reasonable capacity.
> 
> Additionally, running 'check experimental' will now run tripleo jobs
> against everything we include in tripleo images - nova, cinder, swift
> etc etc.
> 
> See the config layout.yaml for details, and I'll send a broader
> announcement once we've had a little bit of run-time with this.
> 
> -Rob
> 



