[openstack-dev] [tripleo] tripleo gate is blocked - please read
Bogdan Dobrelya
bdobreli at redhat.com
Thu Jun 14 09:47:32 UTC 2018
On 6/14/18 3:50 AM, Emilien Macchi wrote:
> TL;DR: gate queue was 25h+, we put all patches from gate on standby, do
> not restore/recheck until further announcement.
>
> We recently enabled the containerized undercloud for multinode jobs, and
> we believe this was a bit premature: the container download process
> isn't optimized yet, so it still pulls the same containers from the
> mirrors multiple times.
> This increased the job runtime and probably slowed down the docker.io
> mirrors hosted by OpenStack Infra, which had to serve the same containers
> multiple times. The time taken to prepare containers on the undercloud
> and then for the overcloud caused jobs to time out randomly, and
> therefore the gate to fail very often, so we decided to remove all jobs
> from the gate by abandoning the patches temporarily (I have them in my
> browser and will restore them when things are stable again; please do
> not touch anything).
>
> Steve Baker has been working on a series of patches that optimize the
> way we prepare the containers but basically the workflow will be:
> - pull containers needed for the undercloud into a local registry, using
> infra mirror if available
> - deploy the containerized undercloud
> - pull containers needed for the overcloud minus the ones already pulled
> for the undercloud, using infra mirror if available
> - update containers on the overcloud
> - deploy the overcloud
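For illustration, the "minus the ones already pulled" step in the workflow above amounts to a set difference over image names. A minimal Python sketch, assuming hypothetical image names and a hypothetical mirror address (none of these come from Steve's patches):

```python
def plan_pulls(undercloud_images, overcloud_images, mirror=None):
    """Return (undercloud_pulls, overcloud_pulls) so that no image is pulled twice."""
    def source(image):
        # Prefer the infra mirror when one is available (hypothetical prefix).
        return f"{mirror}/{image}" if mirror else image

    undercloud = sorted(set(undercloud_images))
    # Only pull what the undercloud step has not already put in the local registry.
    overcloud = sorted(set(overcloud_images) - set(undercloud_images))
    return [source(i) for i in undercloud], [source(i) for i in overcloud]


# Hypothetical image lists, for illustration only.
uc, oc = plan_pulls(
    ["keystone", "mariadb", "heat-engine"],
    ["keystone", "mariadb", "nova-compute"],
    mirror="mirror.example.org:8082",
)
```

Here only nova-compute is left to pull for the overcloud; keystone and mariadb are already in the local registry from the undercloud step.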
Let me also note that it may be time to introduce job dependencies
[0]. Dependencies might somewhat alleviate registry/mirror DoS
issues, like the one we currently have, by running jobs in batches
instead of firing them all off at once.
We still have options to think about. The undercloud deployment takes
longer than standalone, but it provides better coverage and is therefore
a better predictor of (and early cut-off for) future overcloud failures in
the dependent jobs. Standalone is still less stable, though. The containers
update check may also be an option for step 1 or step 2, before the
remaining multinode jobs execute.
Skipping the dependent jobs when an earlier step fails, in turn, reduces
the DoS effect on registries and mirrors.
[0]
https://review.openstack.org/#/q/status:open+project:openstack-infra/tripleo-ci+topic:ci_pipelines
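To make the batching idea concrete, here is a rough Python sketch of dependency-ordered execution where dependents of a failed job are skipped rather than started. It is not how Zuul implements this; the job names and the run callback are hypothetical.

```python
def run_in_batches(jobs, run):
    """Run jobs in dependency order, skipping dependents of failed jobs.

    jobs: dict mapping job name -> list of parent job names.
    run:  callable(name) -> bool, True on success.
    Returns a dict mapping job name -> "SUCCESS" | "FAILURE" | "SKIPPED".
    """
    results = {}
    pending = dict(jobs)
    while pending:
        # A job is ready once all of its parents have a result.
        batch = [name for name, parents in pending.items()
                 if all(p in results for p in parents)]
        if not batch:
            raise ValueError("dependency cycle or unknown parent job")
        for name in batch:
            parents = pending.pop(name)
            if any(results[p] != "SUCCESS" for p in parents):
                # Never started, so it puts no load on registries or mirrors.
                results[name] = "SKIPPED"
            else:
                results[name] = "SUCCESS" if run(name) else "FAILURE"
    return results


# Hypothetical job names, for illustration only.
jobs = {
    "undercloud-containers": [],
    "multinode-scenario001": ["undercloud-containers"],
    "multinode-scenario002": ["undercloud-containers"],
}
started = []

def fake_run(name):
    started.append(name)
    return name != "undercloud-containers"  # simulate a failing first step

results = run_in_batches(jobs, fake_run)
```

With the first step failing, only one job ever starts; both multinode jobs end up SKIPPED.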
>
> With that process, we hope to reduce the runtime of the deployment and
> therefore reduce the timeouts in the gate.
> To enable it, we need to land the following patches, in this order:
> https://review.openstack.org/#/c/571613/,
> https://review.openstack.org/#/c/574485/,
> https://review.openstack.org/#/c/571631/ and
> https://review.openstack.org/#/c/568403.
>
> In the meantime, as a mitigation, we are disabling the containerized
> undercloud that was recently enabled on all scenarios:
> https://review.openstack.org/#/c/575264/. We hope this stabilizes
> things until Steve's patches land.
> Hopefully, we can merge Steve's work tonight/tomorrow and re-enable the
> containerized undercloud on scenarios after checking that we don't have
> timeouts and that deployment runtimes are reasonable.
>
> That's the plan we came up with; if you have any questions or feedback,
> please share them.
> --
> Emilien, Steve and Wes
--
Best regards,
Bogdan Dobrelya,
Irc #bogdando