[OpenStack-Infra] RFC: Zuul executor congestion control

James E. Blair corvus at inaugust.com
Tue Jan 16 16:47:59 UTC 2018


<Tobias.Henkel at bmw.de> writes:

> Hi zuulers,
>
> the zuul-executor resource governor topic seems to be recurring now
> and we might want to take the step of making it a bit smarter.

To be honest, it keeps coming up because we haven't gotten around to
finishing the work already in progress on this.  We're not done with the
current approach yet, so let's not declare it a failure until we've
tried it and learned what we can.

> I think the current approach of a set of on/off governors based on the
> current conditions may not be sufficient. I have thought about this
> and would like your feedback on it.
>
> TLDR; I propose having a congestion control algorithm managing a
> congestion window utilizing a slow start with a generic sensor
> interface and weighted job costs.
>
> Algorithm
> --------------
>
> The algorithm I propose would manage a congestion window of an
> abstract metric measured in points. This is intended to leverage some
> (simple) weighting of jobs, as multi-node jobs, e.g., probably take
> more resources than single-node jobs.
>
> The algorithm consists of two threads. One managing the congestion
> window, one accepting the jobs.
>
> Congestion window management:
>
>   1.  Start with a window of size START_WINDOW_SIZE points
>   2.  Get current used percentage of window
>   3.  Ask sensors for green/red
>   4.  If all green AND window-usage > INCREASE_WINDOW_THRESHOLD
>      *   Increase window
>   5.  If one red
>      *   Decrease window below current usage
>   6.  Loop back to step 2
>
> Job accepting:
>
>   1.  Get current window usage
>   2.  If window-usage < window-size
>      *   Register function if necessary
>      *   Accept job
>      *   Add the job's cost to the window usage (the job updates it asynchronously when finished)
>   3.  Else
>      *   Deregister function if necessary
>   4.  Loop back to step 1
>
> The magic numbers used are subject to further discussion and
> algorithm tweaking.
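
For illustration, the two loops above could be sketched roughly like
this in Python; the constants, the growth factor, and the sensor hook
are placeholders for discussion, not existing Zuul code:

  # Illustrative sketch only; constants, the growth factor, and helper
  # names are placeholders, not existing Zuul code.
  import threading
  import time

  START_WINDOW_SIZE = 100          # points
  INCREASE_WINDOW_THRESHOLD = 0.8  # grow once >80% of the window is used

  class CongestionWindow:
      def __init__(self):
          self.lock = threading.Lock()
          self.size = START_WINDOW_SIZE   # window size in points
          self.used = 0                   # points consumed by running jobs

      def try_accept(self, cost):
          # Job-accepting loop, step 2: accept only if there is room left.
          with self.lock:
              if self.used < self.size:
                  self.used += cost
                  return True
              return False

      def release(self, cost):
          # Called asynchronously when a job finishes.
          with self.lock:
              self.used -= cost

      def adjust(self, sensors_green):
          # Window management loop: grow while green and well used,
          # shrink below current usage as soon as a sensor goes red.
          with self.lock:
              if not sensors_green:
                  self.size = max(1, int(self.used * 0.8))
              elif self.used / self.size > INCREASE_WINDOW_THRESHOLD:
                  self.size = int(self.size * 1.5)

  def window_management_loop(window, sensors_ok, interval=10):
      while True:
          window.adjust(sensors_ok())
          time.sleep(interval)
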
>
>
> Weighting of jobs
> ------------------------
> Now, different jobs take different amounts of resources, so we would
> need some simple estimation of that. This could be tuned in the
> future. To start, I'd propose something simple like this:
>
> Cost_job = 5 + 5 * size of inventory
>
> In the future this could be improved to estimate costs based on the
> historical data of individual jobs.
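
As a worked example, that formula weights a single-node job at
5 + 5 * 1 = 10 points and a ten-node job at 5 + 5 * 10 = 55 points,
e.g. as a hypothetical helper:

  def job_cost(inventory_size):
      # Proposed weighting: a base of 5 points plus 5 per node, so a
      # single-node job costs 10 points and a ten-node job costs 55.
      return 5 + 5 * inventory_size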

This is the part I'm most concerned about.  The current approach
involves some magic numbers, but they are only there to help approximate
what an appropriate load (or soon memory) for a host might be, and would
be straightforward for an operator to tune if necessary.

Approximating what resources a job uses is a much more difficult matter.
We could have a job which uses 0 nodes and runs for 30 seconds, or a job
that uses 10 nodes and runs for 6 hours collecting a lot of output.  We
could be running both of those at the same time.  Tuning that magic
number would not be straightforward for anyone, and may be impossible.

Automatically collecting that data would improve the accuracy, but would
also be very difficult.  Collecting the CPU usage, memory consumption,
disk usage, etc., over time and using it to predict impact on the system
is a very sophisticated task, and I'm afraid we'll spend a lot of time
on it.

> Sensors
> ----------
>
> Further, different deployment styles will have different needs for
> the sensors. E.g. the load and RAM sensors, which utilize load1 and
> memfree, won't work in kubernetes-based deployments, as they assume
> the executor runs exclusively on a VM. To mitigate this, I'd like to
> have a generic sensor interface into which we could plug a cgroups
> sensor that checks resource usage against the cgroup limit (which is
> what we need for a kubernetes-hosted zuul). We could also add a
> filesystem sensor which monitors whether there is enough local
> storage. For hooking this into the algorithm I think we could start
> with a single function
>
> def isStatusOk() -> bool

This is a good idea, and I see no reason why we shouldn't go ahead and
work toward an interface like this even with the current system.  That
will make it more flexible, and if we decide to implement a more
sophisticated system like the one described here in the future, it will
be easy to incorporate this.
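
For illustration only (not Zuul's actual API), such an interface with a
couple of hypothetical sensors might look like:

  # Illustrative sketch only; the class names, thresholds, and paths
  # are assumptions, not Zuul's actual API.
  import abc
  import os

  class Sensor(abc.ABC):
      @abc.abstractmethod
      def isStatusOk(self) -> bool:
          """Return True (green) if this resource looks healthy."""

  class LoadSensor(Sensor):
      def __init__(self, max_load_per_cpu=2.5):
          self.max_load = max_load_per_cpu * os.cpu_count()

      def isStatusOk(self) -> bool:
          load1, _, _ = os.getloadavg()
          return load1 < self.max_load

  class DiskSensor(Sensor):
      def __init__(self, path='/var/lib/zuul', min_free=5 * 1024 ** 3):
          self.path = path
          self.min_free = min_free

      def isStatusOk(self) -> bool:
          st = os.statvfs(self.path)
          return st.f_bavail * st.f_frsize > self.min_free

  def all_green(sensors) -> bool:
      return all(s.isStatusOk() for s in sensors)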

> Exposing the data
> -------------------------
>
> The window-usage and window-size values could also be exported to
> statsd. This could enable autoscaling of the number of executors in
> deployments supporting that.

I agree that whatever we do, we should expose the data to statsd.
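
Something along these lines could emit those values using the Python
statsd client; the metric names here are made up for illustration:

  # Illustrative sketch; the metric names are made up.
  import statsd

  client = statsd.StatsClient('localhost', 8125, prefix='zuul.executor')

  def report_window(window_size, window_used):
      # Gauges let a dashboard (or an autoscaler) see how close each
      # executor is to its limit.
      client.gauge('congestion.window_size', window_size)
      client.gauge('congestion.window_used', window_used)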

> What are your thoughts about that?

Let me lay out my goals for this:

The governor doesn't need to be perfect -- it only needs to keep the
system from becoming so overloaded that running jobs fail due to
running out of resources, or the kernel kills the process.  Ideally an
operator should be able to look at a graph and see
that a significant number of executors have gone into self-protection
and stopped accepting jobs, and know that they should add more (or
kubernetes should add more).

I don't want operators to have to tune the system at all.  I'm hoping we
can make this automatically work in most cases (though operators should
be able to override settings if we get the heuristic wrong).  If it
turns out that operators do need to tune something, we should just drop
all of this and expose a 'max jobs' setting -- that would be the easiest
to tune.

Now, aside from not incorporating anything related to RAM into the
current system, the other major deficiency we're dealing with is the
fact that the load average and RAM usage are trailing indicators -- it
takes several minutes at least for an increase in usage to be reflected.
Currently, our executors will, even when fully loaded, only wait a few
seconds between accepting jobs.  So we need to slow down the acceptance
rate to allow the indicators to catch up.  Our current plan is to simply
increase the delay between job acceptance proportionate to the number of
jobs we're running.  The slow start and window scaling could help with
that; however, I think we need to focus on it a bit more -- how do we
avoid increasing the window size too quickly, but still allow it to
increase?  That's really a fundamental problem shared by both
approaches.
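
A trivial sketch of that delay idea, with entirely made-up numbers:

  # Illustrative only: the base delay and per-job factor are made up.
  import time

  def acceptance_delay(running_jobs, base=2.0, per_job=0.5):
      # Wait longer between accepting jobs as more are already running,
      # giving trailing indicators like load average time to catch up.
      return base + per_job * running_jobs

  def accept_loop(get_running_jobs, accept_one_job):
      while True:
          accept_one_job()
          time.sleep(acceptance_delay(get_running_jobs()))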

I'm starting to view the algorithm you describe here as a formalization
of what we're trying to achieve with the current approach.  I think it
can be a good way of thinking about it.

I wonder if we can simplify things a bit by dealing with averages.
Declare that the window is the number of jobs we run at once, then
increase the window if we are at the max and our sensors are all green
for a certain amount of time (minutes?), and as soon as one goes red,
decrease the window.  This way we don't have to estimate the cost of
each job, we just observe the behavior of the system in aggregate.
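
Roughly sketched, with placeholder names and an arbitrary hold time:

  # Illustrative sketch; the hold time and starting size are placeholders.
  import time

  GREEN_HOLD_SECONDS = 300  # stay green at the limit this long before growing

  class JobCountWindow:
      def __init__(self, initial=4):
          self.size = initial     # max number of concurrent jobs
          self.running = 0
          self._green_since = None

      def update(self, sensors_green):
          now = time.monotonic()
          if not sensors_green:
              # Back off immediately when any sensor goes red.
              self.size = max(1, self.size - 1)
              self._green_since = None
          elif self.running >= self.size:
              # Only grow after staying green at the limit for a while.
              if self._green_since is None:
                  self._green_since = now
              elif now - self._green_since >= GREEN_HOLD_SECONDS:
                  self.size += 1
                  self._green_since = None
          else:
              self._green_since = None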

While we continue to discuss the window approach, I'd also like to
continue to implement the RAM governor, and then also increase the
acceptance delay as load scales.  I think this work will be useful to
build on in the future regardless, and we can hopefully address some
immediate problems and learn more about the behavior.

-Jim


