[openstack-dev] [tripleo] tripleo gate is blocked - please read

Emilien Macchi emilien at redhat.com
Sat Jun 16 06:15:01 UTC 2018


Sending an update before the weekend:

Gate was in very bad shape again today (long queue, lots of failures), and it
turns out we had a few more issues, which we tracked here:
https://etherpad.openstack.org/p/tripleo-gate-issues-june-2018

## scenario007 broke because of a patch in networking-ovn
https://bugs.launchpad.net/tripleo/+bug/1777168
We made the job non-voting and in the meantime managed to fix it:
https://review.rdoproject.org/r/#/c/14155/
Breaking commit was:
https://github.com/openstack/networking-ovn/commit/2365df1cc3e24deb2f3745c925d78d6d8e5bb5df
Kudos to Daniel Alvarez for having the patch ready!
Also thanks to Wes for making the job non voting in the meantime.
I've reverted the non-voting change since the situation is fixed now, so we can
vote again on this job.

## Dockerhub proxy issue
Infra was using the wrong object storage proxy for Dockerhub image layers:
https://review.openstack.org/#/c/575787/
Huge thanks to the infra team, especially Clark, for fixing this super quickly.
It clearly helped stabilize our container jobs; I actually haven't seen
timeouts since we merged your patch. Thanks a ton!

## RDO master wasn't consistent anymore, python-cloudkittyclient broke
The client was refactored:
https://git.openstack.org/cgit/openstack/python-cloudkittyclient/commit/?id=d070f6a68cddf51c57e77107f1b823a8f75770ba
This broke the RPM, and we had to completely rewrite the dependencies so we
could build the package again:
https://review.rdoproject.org/r/#/c/14265/
A thousand thanks to Heikel for your responsive help at 3am, so we could get
back to a consistent state and have our latest RPMs, which contained a bunch
of fixes.

## Where we are now

Gate looks stable now. You can recheck and approve things. I went ahead and
rechecked everything and made sure nothing was left abandoned. Steve's work
has merged, so I think we can reconsider
https://review.openstack.org/#/c/575330/.
Special thanks to everyone involved in these issues, and to Alex & John, who
also stepped up to help.
Enjoy your weekend!

On Thu, Jun 14, 2018 at 6:40 AM, Emilien Macchi <emilien at redhat.com> wrote:

> It sounds like we merged a bunch last night thanks to the revert, so I
> went ahead and restored/rechecked everything that was out of the gate. I've
> checked and nothing was left over, but let me know in case I missed
> something.
> I'll keep updating this thread with the progress made to improve the
> situation etc.
> So from now on, the situation is back to "normal"; recheck/+W is OK.
>
> Thanks again for your patience,
>
> On Wed, Jun 13, 2018 at 10:39 PM, Emilien Macchi <emilien at redhat.com>
> wrote:
>
>> https://review.openstack.org/575264 just landed (and didn't time out in
>> check or gate without a recheck, so it's a good sign it helped to mitigate).
>>
>> I've restored and rechecked some patches that I evacuated from the gate;
>> please do not restore, recheck, or approve anything else for now, and let's
>> see how it goes with a few patches.
>> We're still working with Steve on his patches to optimize the way we
>> deploy containers on the registry and are investigating how we could make
>> it faster with a proxy.
>>
>> Stay tuned and thanks for your patience.
>>
>> On Wed, Jun 13, 2018 at 5:50 PM, Emilien Macchi <emilien at redhat.com>
>> wrote:
>>
>>> TL;DR: gate queue was 25h+, we put all patches from gate on standby, do
>>> not restore/recheck until further announcement.
>>>
>>> We recently enabled the containerized undercloud for multinode jobs, and we
>>> believe this was a bit premature: the container download process wasn't
>>> optimized yet, so the same containers get pulled from the mirrors multiple
>>> times.
>>> This increased the job runtime and probably also the load on the docker.io
>>> mirrors hosted by OpenStack Infra, making them slower at serving the same
>>> containers multiple times. The time taken to prepare containers on the
>>> undercloud and then for the overcloud caused the jobs to randomly time out,
>>> and therefore the gate to fail a large share of the time, so we decided to
>>> remove all patches from the gate by abandoning them temporarily (I have them
>>> in my browser and will restore them when things are stable again, please do
>>> not touch anything).
>>>
>>> Steve Baker has been working on a series of patches that optimize the way we
>>> prepare the containers; basically the workflow will be (rough sketch after
>>> the list):
>>> - pull containers needed for the undercloud into a local registry, using
>>> infra mirror if available
>>> - deploy the containerized undercloud
>>> - pull containers needed for the overcloud minus the ones already pulled
>>> for the undercloud, using infra mirror if available
>>> - update containers on the overcloud
>>> - deploy the overcloud
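>>>
>>> For illustration only, here is a rough Python sketch of that flow. The
>>> registry address, mirror hostname, image names and helper function are all
>>> hypothetical (this is not Steve's actual implementation, which lives in the
>>> reviews linked below); it just shows the "pull once, reuse for the
>>> overcloud" idea:
>>>
>>> import subprocess
>>>
>>> # Hypothetical endpoints: the undercloud-local registry and an infra mirror.
>>> LOCAL_REGISTRY = "192.168.24.1:8787"
>>> MIRROR = "mirror.regionone.example.org:8082"
>>>
>>> # Hypothetical image lists; in reality they come from the prepare step.
>>> UNDERCLOUD_IMAGES = {
>>>     "tripleomaster/centos-binary-keystone:current-tripleo",
>>>     "tripleomaster/centos-binary-mariadb:current-tripleo",
>>> }
>>> OVERCLOUD_IMAGES = UNDERCLOUD_IMAGES | {
>>>     "tripleomaster/centos-binary-nova-compute:current-tripleo",
>>> }
>>>
>>> def pull_into_local_registry(images, already_pulled=frozenset()):
>>>     """Pull each image (via the mirror) and push it to the local registry,
>>>     skipping anything already pulled for the undercloud."""
>>>     for image in sorted(set(images) - set(already_pulled)):
>>>         source = "%s/%s" % (MIRROR, image)
>>>         local = "%s/%s" % (LOCAL_REGISTRY, image)
>>>         subprocess.check_call(["docker", "pull", source])
>>>         subprocess.check_call(["docker", "tag", source, local])
>>>         subprocess.check_call(["docker", "push", local])
>>>
>>> pull_into_local_registry(UNDERCLOUD_IMAGES)
>>> # ... deploy the containerized undercloud ...
>>> pull_into_local_registry(OVERCLOUD_IMAGES, already_pulled=UNDERCLOUD_IMAGES)
>>> # ... update the overcloud container parameters to point at LOCAL_REGISTRY,
>>> # then deploy the overcloud ...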
>>>
>>> With that process, we hope to reduce the runtime of the deployment and
>>> therefore reduce the timeouts in the gate.
>>> To enable it, we need to land, in this order: https://review.openstack.org/#/c/571613/,
>>> https://review.openstack.org/#/c/574485/, https://review.openstack.org/#/c/571631/
>>> and https://review.openstack.org/#/c/568403.
>>>
>>> In the meantime, we are disabling the containerized undercloud that was
>>> recently enabled on all scenarios (https://review.openstack.org/#/c/575264/)
>>> as a mitigation, in the hope of stabilizing things until Steve's patches land.
>>> Hopefully we can merge Steve's work tonight/tomorrow and re-enable the
>>> containerized undercloud on the scenarios after checking that we no longer
>>> hit timeouts and that deployment runtimes are reasonable.
>>>
>>> That's the plan we came up with; if you have any questions or feedback,
>>> please share them.
>>> --
>>> Emilien, Steve and Wes
>>>
>>
>>
>>
>> --
>> Emilien Macchi
>>
>
>
>
> --
> Emilien Macchi
>



-- 
Emilien Macchi

