[openstack-dev] [Heat] convergence rally test results (so far)

Zane Bitter zbitter at redhat.com
Thu Sep 3 14:42:41 UTC 2015


On 03/09/15 02:56, Angus Salkeld wrote:
> On Thu, Sep 3, 2015 at 3:53 AM Zane Bitter <zbitter at redhat.com
> <mailto:zbitter at redhat.com>> wrote:
>
>     On 02/09/15 04:55, Steven Hardy wrote:
>      > On Wed, Sep 02, 2015 at 04:33:36PM +1200, Robert Collins wrote:
>      >> On 2 September 2015 at 11:53, Angus Salkeld
>     <asalkeld at mirantis.com <mailto:asalkeld at mirantis.com>> wrote:
>      >>
>      >>> 1. limit the number of resource actions in parallel (maybe base
>     on the
>      >>> number of cores)
>      >>
>      >> I'm having trouble mapping that back to 'and heat-engine is
>     running on
>      >> 3 separate servers'.
>      >
>      > I think Angus was responding to my test feedback, which was a
>     different
>      > setup, one 4-core laptop running heat-engine with 4 worker processes.
>      >
>      > In that environment, the level of additional concurrency becomes
>     a problem
>      > because all heat workers become so busy that creating a large stack
>      > DoSes the Heat services, and in my case also the DB.
>      >
>      > If we had a configurable option, similar to num_engine_workers, which
>      > enabled control of the number of resource actions in parallel, I
>     probably
>      > could have controlled that explosion in activity to a more
>     manageable series
>      > of tasks, e.g. I'd set num_resource_actions to
>     (num_engine_workers*2) or
>      > something.
>
>     I think that's actually the opposite of what we need.
>
>     The resource actions are just sent to the worker queue to get processed
>     whenever. One day we will get to the point where we are overflowing the
>     queue, but I guarantee that we are nowhere near that day. If we are
>     DoSing ourselves, it can only be because we're pulling *everything* off
>     the queue and starting it in separate greenthreads.
>
>
> The worker does not use a greenthread per job like service.py does.
> The issue is that if you have actions that are fast, you can hit the DB hard.
>
> QueuePool limit of size 5 overflow 10 reached, connection timed out,
> timeout 30
>
> It seems like it's not very hard to hit this limit. It comes from simply
> loading
> the resource in the worker:
> "/home/angus/work/heat/heat/engine/worker.py", line 276, in check_resource
> "/home/angus/work/heat/heat/engine/worker.py", line 145, in _load_resource
> "/home/angus/work/heat/heat/engine/resource.py", line 290, in load
> resource_objects.Resource.get_obj(context, resource_id)

This is probably me being naive, but that sounds strange. I would have 
thought that there is no way to exhaust the connection pool by doing 
lots of actions in rapid succession. I'd have guessed that the only way 
to exhaust a connection pool would be to have lots of connections open 
simultaneously. That suggests to me that either we are failing to 
expeditiously close connections and return them to the pool, or we are 
- explicitly or implicitly - processing a bunch of messages in parallel.
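
To make the pool behaviour concrete, here's a tiny standalone sketch 
against SQLAlchemy itself (nothing to do with Heat's actual session 
handling; the sqlite URL and the short timeout are just for 
demonstration). Rapid serial use never exhausts the pool; only holding 
connections open concurrently does:

from sqlalchemy import create_engine, exc
from sqlalchemy.pool import QueuePool

# Same limits as the error above: 5 pooled + 10 overflow connections.
engine = create_engine('sqlite:///demo.db', poolclass=QueuePool,
                       pool_size=5, max_overflow=10, pool_timeout=2)

# Rapid serial use: at most one connection is checked out at a time,
# so the pool never runs dry no matter how many iterations we do.
for _ in range(1000):
    conn = engine.connect()
    conn.close()

# Concurrent use: hold 15 connections and the 16th times out with
# "QueuePool limit of size 5 overflow 10 reached, connection timed out".
held = [engine.connect() for _ in range(15)]
try:
    engine.connect()
except exc.TimeoutError as e:
    print(e)
finally:
    for c in held:
        c.close()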

>     In an ideal world, we might only ever pull one task off that queue at a
>     time. Any time the task is sleeping, we would use for processing stuff
>     off the engine queue (which needs a quick response, since it is serving
>     the ReST API). The trouble is that you need a *huge* number of
>     heat-engines to handle stuff in parallel. In the reductio-ad-absurdum
>     case of a single engine only processing a single task at a time, we're
>     back to creating resources serially. So we probably want a higher number
>     than 1. (Phase 2 of convergence will make tasks much smaller, and may
>     even get us down to the point where we can pull only a single task at a
>     time.)
>
>     However, the fewer engines you have, the more greenthreads we'll have to
>     allow to get some semblance of parallelism. To the extent that more
>     cores means more engines (which assumes all running on one box, but
>     still), the number of cores is negatively correlated with the number of
>     tasks that we want to allow.
>
>     Note that all of the greenthreads run in a single CPU thread, so having
>     more cores doesn't help us at all with processing more stuff in
>     parallel.
>
>
> Except, as I said above, we are not creating greenthreads in worker.

Well, maybe we'll need to in order to make things still work sanely with 
a low number of engines :) (Should be pretty easy to do with a semaphore.)
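
Roughly what I have in mind (pure sketch, invented option name, not what 
worker.py looks like today) - a per-process semaphore that bounds how 
many check_resource jobs run at once, whether or not each job gets its 
own greenthread:

import eventlet
from eventlet.semaphore import Semaphore

class WorkerService(object):
    def __init__(self, max_concurrent_checks=4):
        # Hypothetical knob: how many check_resource jobs one worker
        # process may run concurrently.
        self._check_sem = Semaphore(max_concurrent_checks)

    def check_resource(self, cnxt, resource_id, current_traversal, data):
        # Blocks (yielding to other greenthreads) once the limit is hit,
        # so at most max_concurrent_checks resources touch the DB at once.
        with self._check_sem:
            self._load_and_check(cnxt, resource_id, current_traversal, data)

    def _load_and_check(self, cnxt, resource_id, current_traversal, data):
        # Stand-in for the real work (load the resource, run the check).
        eventlet.sleep(0.1)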

I think what y'all are suggesting is limiting the number of jobs that go 
into the queue... that's quite wrong IMO. Apart from the fact that it's 
impossible (resources put jobs into the queue entirely independently, 
and have no knowledge of the global state required to throttle inputs), 
we shouldn't implement an in-memory queue with long-running tasks 
containing state that can be lost if the process dies - the whole point 
of convergence is that we have... a message queue for that. We need to 
limit the rate that stuff comes *out* of the queue. And, again, since we 
have no knowledge of global state, we can only control the rate at which 
an individual worker processes tasks. The way to avoid killing the DB is 
to put a constant ceiling on the workers * concurrent_tasks_per_worker 
product.
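
i.e. something like this back-of-the-envelope calculation (numbers 
entirely made up), where the per-worker limit shrinks as workers are 
added so the total stays flat:

# Suppose the DB can comfortably serve ~48 concurrent connections across
# the whole deployment, and each in-flight resource check holds roughly
# one connection.
DB_CONNECTION_BUDGET = 48

def concurrent_tasks_per_worker(num_workers):
    # Ceiling on the product: workers * tasks_per_worker <= budget.
    return max(1, DB_CONNECTION_BUDGET // num_workers)

for workers in (1, 4, 12, 48):
    per_worker = concurrent_tasks_per_worker(workers)
    print(workers, per_worker, workers * per_worker)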

cheers,
Zane.


