[openstack-dev] [tripleo] Zuul Queue backlogs and resource usage
cboylan at sapwetik.org
Tue Oct 30 21:16:07 UTC 2018
On Tue, Oct 30, 2018, at 1:01 PM, Ben Nemec wrote:
> On 10/30/18 1:25 PM, Clark Boylan wrote:
> > On Tue, Oct 30, 2018, at 10:42 AM, Alex Schultz wrote:
> >> On Tue, Oct 30, 2018 at 11:36 AM Ben Nemec <openstack at nemebean.com> wrote:
> >>> Tagging with tripleo since my suggestion below is specific to that project.
> >>> On 10/30/18 11:03 AM, Clark Boylan wrote:
> >>>> Hello everyone,
> >>>> A little while back I sent email explaining how the gate queues work and how fixing bugs helps us test and merge more code. All of this is still true and we should keep pushing to improve our testing to avoid gate resets.
> >>>> Last week we migrated Zuul and Nodepool to a new Zookeeper cluster. In the process of doing this we had to restart Zuul, which brought in a new logging feature that exposes node resource usage by jobs. Using this data I've been able to generate some report information on where our node demand is going. This change produces this report.
> >>>> As with optimizing software we want to identify which changes will have the biggest impact and to be able to measure whether or not changes have had an impact once we have made them. Hopefully this information is a start at doing that. Currently we can only look back to the point Zuul was restarted, but we have a thirty day log rotation for this service and should be able to look at a month's worth of data going forward.
> >>>> Looking at the data you might notice that Tripleo is using many more node resources than our other projects. They are aware of this and have a plan to reduce their resource consumption. We'll likely be using this report generator to check progress of this plan over time.
> >>> I know at one point we had discussed reducing the concurrency of the
> >>> tripleo gate to help with this. Since tripleo is still using >50% of the
> >>> resources it seems like maybe we should revisit that, at least for the
> >>> short-term until the more major changes can be made? Looking through the
> >>> merge history for tripleo projects I don't see a lot of cases (any, in
> >>> fact) where more than a dozen patches made it through anyway*, so I
> >>> suspect it wouldn't have a significant impact on gate throughput, but it
> >>> would free up quite a few nodes for other uses.
> >> It's the failures in gate and resets. At this point I think it would
> >> be a good idea to turn down the concurrency of the tripleo queue in
> >> the gate if possible. As of late it's been timeouts but we've been
> >> unable to track down why it's timing out specifically. I personally
> >> have a feeling it's the container download times since we do not have
> >> a local registry available and are only able to leverage the mirrors
> >> for some levels of caching. Unfortunately we don't get the best
> >> information about this out of docker (or the mirrors) and it's really
> >> hard to determine what exactly makes things run a bit slower.
> > We actually tried this not too long ago https://git.openstack.org/cgit/openstack-infra/project-config/commit/?id=22d98f7aab0fb23849f715a8796384cffa84600b but decided to revert it because it didn't decrease the check queue backlog significantly. We were still running at several hours behind most of the time.
> I'm surprised to hear that. Counting the tripleo jobs in the gate at
> positions 11-20 right now, I see around 84 nodes tied up in long-running
> jobs and another 32 for shorter unit test jobs. The latter probably
> don't have much impact, but the former is a non-trivial amount. It may
> not erase the entire 2300+ job queue that we have right now, but it
> seems like it should help.
> > If we want to set up better monitoring and measuring and try it again we can do that. But we probably want to measure queue sizes with and without a change like that to better understand if it helps.
> This seems like good information to start capturing, otherwise we are
> kind of just guessing. Is there something in infra already that we could
> use or would it need to be new tooling?
Digging around in graphite, we currently track mean time enqueued in pipelines. This is probably a reasonable metric to use for this specific case.
Looking at the check queue, one graph shows the mean time enqueued in check during the rough period when the window floor was 10, and a second shows it since then. The 26th and 27th are bigger peaks than previously seen (possibly due to losing inap temporarily), but otherwise a queue backlog of ~200 minutes was "normal" in both time periods.
You should be able to change check to e.g. gate or other queue names and poke around more if you like. Note that the scale factor converts milliseconds to minutes.
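
For anyone who would rather script this than click around, here is a rough sketch of building such a query. It relies on Graphite's render API and its scale() function. The host name and the metric path below are assumptions for illustration only (placeholders, not necessarily what our graphite actually uses), so substitute the names you see in the graph URLs.

```python
from urllib.parse import urlencode

MS_PER_MINUTE = 60 * 1000  # the metric is recorded in milliseconds


def ms_to_minutes(ms):
    """Convert a duration in milliseconds to minutes."""
    return ms / MS_PER_MINUTE


def render_url(pipeline, base="https://graphite.example.org", start="-7days"):
    """Build a Graphite render URL for a pipeline's mean time enqueued.

    The metric path is an assumed example; swap "check" for "gate" or
    another queue name via the pipeline argument.
    """
    # Assumed metric name, for illustration only.
    metric = (
        "stats.timers.zuul.tenant.openstack.pipeline."
        f"{pipeline}.resident_time.mean"
    )
    # scale() multiplies every datapoint by a constant: ms -> minutes.
    target = f"scale({metric},{1 / MS_PER_MINUTE:.10f})"
    query = urlencode({"target": target, "from": start, "format": "json"})
    return f"{base}/render?{query}"


if __name__ == "__main__":
    print(render_url("check"))
    print(render_url("gate"))
    print(ms_to_minutes(12_000_000))  # a 12,000,000 ms backlog in minutes
```

Comparing the output for two time windows (by changing the start argument, or adding an "until" parameter) is one way to measure queue backlog with and without a concurrency change.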