[openstack-dev] [Heat] convergence rally test results (so far)

Angus Salkeld asalkeld at mirantis.com
Fri Sep 4 00:17:20 UTC 2015


On Fri, Sep 4, 2015 at 12:48 AM Zane Bitter <zbitter at redhat.com> wrote:

> On 03/09/15 02:56, Angus Salkeld wrote:
> > On Thu, Sep 3, 2015 at 3:53 AM Zane Bitter <zbitter at redhat.com
> > <mailto:zbitter at redhat.com>> wrote:
> >
> >     On 02/09/15 04:55, Steven Hardy wrote:
> >      > On Wed, Sep 02, 2015 at 04:33:36PM +1200, Robert Collins wrote:
> >      >> On 2 September 2015 at 11:53, Angus Salkeld
> >     <asalkeld at mirantis.com <mailto:asalkeld at mirantis.com>> wrote:
> >      >>
> >      >>> 1. limit the number of resource actions in parallel (maybe base
> >     on the
> >      >>> number of cores)
> >      >>
> >      >> I'm having trouble mapping that back to 'and heat-engine is
> >     running on
> >      >> 3 separate servers'.
> >      >
> >      > I think Angus was responding to my test feedback, which was a
> >     different
> >      > setup, one 4-core laptop running heat-engine with 4 worker
> processes.
> >      >
> >      > In that environment, the level of additional concurrency becomes
> >     a problem
> >      > because all heat workers become so busy that creating a large
> stack
> >      > DoSes the Heat services, and in my case also the DB.
> >      >
> >      > If we had a configurable option, similar to num_engine_workers,
> which
> >      > enabled control of the number of resource actions in parallel, I
> >     probably
> >      > could have controlled that explosion in activity to a more
> >     manageable series
> >      > of tasks, e.g. I'd set num_resource_actions to
> >     (num_engine_workers*2) or
> >      > something.
> >
> >     I think that's actually the opposite of what we need.
> >
> >     The resource actions are just sent to the worker queue to get
> processed
> >     whenever. One day we will get to the point where we are overflowing
> the
> >     queue, but I guarantee that we are nowhere near that day. If we are
> >     DoSing ourselves, it can only be because we're pulling *everything*
> off
> >     the queue and starting it in separate greenthreads.
> >
> >
> > The worker does not use a greenthread per job like service.py does.
> > The issue is that if you have fast actions, you can hit the db hard.
> >
> > QueuePool limit of size 5 overflow 10 reached, connection timed out,
> > timeout 30
> >
> > It seems like it's not very hard to hit this limit. It comes from simply
> > loading
> > the resource in the worker:
> > "/home/angus/work/heat/heat/engine/worker.py", line 276, in
> check_resource
> > "/home/angus/work/heat/heat/engine/worker.py", line 145, in
> _load_resource
> > "/home/angus/work/heat/heat/engine/resource.py", line 290, in load
> > resource_objects.Resource.get_obj(context, resource_id)
>
> This is probably me being naive, but that sounds strange. I would have
> thought that there is no way to exhaust the connection pool by doing
> lots of actions in rapid succession. I'd have guessed that the only way
> to exhaust a connection pool would be to have lots of connections open
> simultaneously. That suggests to me that either we are failing to
> expeditiously close connections and return them to the pool, or that we
> are - explicitly or implicitly - processing a bunch of messages in
> parallel.
>

I suspect we are leaking sessions. I have updated this bug to make sure we
focus on figuring out the root cause of this before jumping to conclusions:
https://bugs.launchpad.net/heat/+bug/1491185
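
If it is a leak, the failure mode would look roughly like the sketch below.
(Illustrative only, not Heat's actual code path; the pool numbers are just
SQLAlchemy's defaults, which are the ones showing up in the error above.)

    import sqlalchemy
    from sqlalchemy.pool import QueuePool

    engine = sqlalchemy.create_engine('sqlite:////tmp/pool_demo.db',
                                      poolclass=QueuePool, pool_size=5,
                                      max_overflow=10, pool_timeout=30)

    def leaky_load():
        # Connection is checked out of the pool and never checked back in.
        conn = engine.connect()
        return conn.execute(sqlalchemy.text('select 1')).scalar()

    def well_behaved_load():
        # Context manager returns the connection to the pool on exit.
        with engine.connect() as conn:
            return conn.execute(sqlalchemy.text('select 1')).scalar()

    # After 15 leaky_load() calls, the 16th blocks for pool_timeout seconds
    # and then raises "QueuePool limit of size 5 overflow 10 reached,
    # connection timed out, timeout 30".

Lots of fast, short actions should be fine as long as every one of them
gives its session back; if even a small fraction hang on to it, the pool
drains quickly.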

-A


>
> >     In an ideal world, we might only ever pull one task off that queue
> at a
> >     time. Any time the task is sleeping, we would use for processing
> stuff
> >     off the engine queue (which needs a quick response, since it is
> serving
> >     the ReST API). The trouble is that you need a *huge* number of
> >     heat-engines to handle stuff in parallel. In the reductio-ad-absurdum
> >     case of a single engine only processing a single task at a time,
> we're
> >     back to creating resources serially. So we probably want a higher
> number
> >     than 1. (Phase 2 of convergence will make tasks much smaller, and may
> >     even get us down to the point where we can pull only a single task
> at a
> >     time.)
> >
> >     However, the fewer engines you have, the more greenthreads we'll
> have to
> >     allow to get some semblance of parallelism. To the extent that more
> >     cores means more engines (which assumes all running on one box, but
> >     still), the number of cores is negatively correlated with the number
> of
> >     tasks that we want to allow.
> >
> >     Note that all of the greenthreads run in a single CPU thread, so
> having
> >     more cores doesn't help us at all with processing more stuff in
> >     parallel.
> >
> >
> > Except, as I said above, we are not creating greenthreads in worker.
>
> Well, maybe we'll need to in order to make things still work sanely with
> a low number of engines :) (Should be pretty easy to do with a semaphore.)
>
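
Yep, something along these lines would probably do it. (Rough sketch only,
not real Heat code; "max_concurrent_checks" is a made-up knob, not an
existing option.)

    import eventlet

    max_concurrent_checks = 8   # e.g. num_engine_workers * 2
    _check_sem = eventlet.semaphore.Semaphore(max_concurrent_checks)

    def handle_check_resource(do_check):
        # Gate each incoming check_resource message, so a single engine
        # process never has more than max_concurrent_checks resource
        # actions (and hence roughly that many DB sessions) in flight.
        with _check_sem:
            do_check()

Anything we haven't got to yet just stays on the queue, which is where we
want the backlog to live anyway.
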
> I think what y'all are suggesting is limiting the number of jobs that go
> into the queue... that's quite wrong IMO. Apart from the fact it's
> impossible (resources put jobs into the queue entirely independently,
> and have no knowledge of the global state required to throttle inputs),
> we shouldn't implement an in-memory queue with long-running tasks
> containing state that can be lost if the process dies - the whole point
> of convergence is we have... a message queue for that. We need to limit
> the rate that stuff comes *out* of the queue. And, again, since we have
> no knowledge of global state, we can only control the rate at which an
> individual worker processes tasks. The way to avoid killing the DB is to
> put a constant ceiling on the workers * concurrent_tasks_per_worker
> product.
>
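
To make the ceiling concrete, a back-of-the-envelope version (with made-up
option names):

    num_engine_processes = 4    # num_engine_workers in heat.conf
    tasks_per_process = 4       # hypothetical concurrent_tasks_per_worker
    pool_per_process = 5 + 10   # SQLAlchemy default pool_size + max_overflow

    # Each in-flight resource action holds roughly one DB session, so
    # tasks_per_process needs to stay well under pool_per_process (4 vs 15
    # here), leaving headroom for the engine (ReST-facing) queue. The total
    # pressure on the database is then bounded by:
    num_engine_processes * tasks_per_process   # == 16 concurrent actions

Keep that product constant as engines are added or removed and a big stack
shouldn't be able to DoS the DB.
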
> cheers,
> Zane.
>