<div dir="ltr"><div class="gmail_quote"><div dir="ltr">On Fri, Sep 4, 2015 at 12:48 AM Zane Bitter <<a href="mailto:zbitter@redhat.com">zbitter@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 03/09/15 02:56, Angus Salkeld wrote:<br>

> On Thu, Sep 3, 2015 at 3:53 AM Zane Bitter <<a href="mailto:zbitter@redhat.com" target="_blank">zbitter@redhat.com</a>> wrote:<br>

><br>

>     On 02/09/15 04:55, Steven Hardy wrote:<br>

>      > On Wed, Sep 02, 2015 at 04:33:36PM +1200, Robert Collins wrote:<br>

>      >> On 2 September 2015 at 11:53, Angus Salkeld<br>

>     <<a href="mailto:asalkeld@mirantis.com" target="_blank">asalkeld@mirantis.com</a> <mailto:<a href="mailto:asalkeld@mirantis.com" target="_blank">asalkeld@mirantis.com</a>>> wrote:<br>

>      >><br>

>      >>> 1. limit the number of resource actions in parallel (maybe base<br>

>     on the<br>

>      >>> number of cores)<br>

>      >><br>

>      >> I'm having trouble mapping that back to 'and heat-engine is<br>

>     running on<br>

>      >> 3 separate servers'.<br>

>      ><br>

>      > I think Angus was responding to my test feedback, which was a<br>

>     different<br>

>      > setup, one 4-core laptop running heat-engine with 4 worker processes.<br>

>      ><br>

>      > In that environment, the level of additional concurrency becomes<br>

>     a problem<br>

>      > because all heat workers become so busy that creating a large stack<br>

>      > DoSes the Heat services, and in my case also the DB.<br>

>      ><br>

>      > If we had a configurable option, similar to num_engine_workers, which<br>

>      > enabled control of the number of resource actions in parallel, I<br>

>     probably<br>

>      > could have controlled that explosion in activity to a more<br>

>     manageable series<br>

>      > of tasks, e.g. I'd set num_resource_actions to<br>

>     (num_engine_workers*2) or<br>

>      > something.<br>

><br>

>     I think that's actually the opposite of what we need.<br>

><br>

>     The resource actions are just sent to the worker queue to get processed<br>

>     whenever. One day we will get to the point where we are overflowing the<br>

>     queue, but I guarantee that we are nowhere near that day. If we are<br>

>     DoSing ourselves, it can only be because we're pulling *everything* off<br>

>     the queue and starting it in separate greenthreads.<br>

><br>

><br>

> worker does not use a greenthread per job like service.py does.<br>

> The issue is that if you have fast actions, you can hit the DB hard.<br>

><br>

> QueuePool limit of size 5 overflow 10 reached, connection timed out,<br>

> timeout 30<br>

><br>

> It seems like it's not very hard to hit this limit. It comes from simply<br>

> loading<br>

> the resource in the worker:<br>

> "/home/angus/work/heat/heat/engine/worker.py", line 276, in check_resource<br>

> "/home/angus/work/heat/heat/engine/worker.py", line 145, in _load_resource<br>

> "/home/angus/work/heat/heat/engine/resource.py", line 290, in load<br>

> resource_objects.Resource.get_obj(context, resource_id)<br>

<br>

This is probably me being naive, but that sounds strange. I would have<br>

thought that there is no way to exhaust the connection pool by doing<br>

lots of actions in rapid succession. I'd have guessed that the only way<br>

to exhaust a connection pool would be to have lots of connections open<br>

simultaneously. That suggests to me that either we are failing to<br>

expeditiously close connections and return them to the pool, or that we<br>

are - explicitly or implicitly - processing a bunch of messages in parallel.<br></blockquote><div><br></div><div>I suspect we are leaking sessions. I have updated this bug to make sure we</div><div>focus on figuring out the root cause before jumping to conclusions:</div><div><a href="https://bugs.launchpad.net/heat/+bug/1491185">https://bugs.launchpad.net/heat/+bug/1491185</a><br></div><div><br></div><div>-A</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

>     In an ideal world, we might only ever pull one task off that queue at a<br>

>     time. Any time the task is sleeping, we would use that time for processing stuff<br>

>     off the engine queue (which needs a quick response, since it is serving<br>

>     the ReST API). The trouble is that you need a *huge* number of<br>

>     heat-engines to handle stuff in parallel. In the reductio-ad-absurdum<br>

>     case of a single engine only processing a single task at a time, we're<br>

>     back to creating resources serially. So we probably want a higher number<br>

>     than 1. (Phase 2 of convergence will make tasks much smaller, and may<br>

>     even get us down to the point where we can pull only a single task at a<br>

>     time.)<br>

><br>

>     However, the fewer engines you have, the more greenthreads we'll have to<br>

>     allow to get some semblance of parallelism. To the extent that more<br>

>     cores means more engines (which assumes all running on one box, but<br>

>     still), the number of cores is negatively correlated with the number of<br>

>     tasks that we want to allow.<br>

><br>

>     Note that all of the greenthreads run in a single CPU thread, so having<br>

>     more cores doesn't help us at all with processing more stuff in<br>

>     parallel.<br>

><br>

><br>

> Except, as I said above, we are not creating greenthreads in worker.<br>

<br>

Well, maybe we'll need to in order to make things still work sanely with<br>

a low number of engines :) (Should be pretty easy to do with a semaphore.)<br>
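<br>
A rough sketch of the semaphore idea, for illustration only. It uses stdlib threads as a stand-in for eventlet greenthreads, and `consume`, `handler` and `MAX_CONCURRENT_TASKS` are hypothetical names, not Heat code or Heat config options. The point it tries to show is that the slot is taken before the next message is pulled, so work that has not started stays in the queue:<br>

```python
import queue
import threading

MAX_CONCURRENT_TASKS = 4  # hypothetical per-worker cap, not a Heat option


def consume(tasks, handler, limit=MAX_CONCURRENT_TASKS):
    """Drain `tasks`, never running more than `limit` handlers at once.

    A slot is acquired *before* the next message is taken, so anything
    not yet started stays in the queue rather than in process memory.
    """
    slots = threading.BoundedSemaphore(limit)
    threads = []

    def run(task):
        try:
            handler(task)
        finally:
            slots.release()  # free the slot even if the handler raises

    while True:
        slots.acquire()  # wait for a free slot first
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            slots.release()  # nothing left to do; give the slot back
            break
        thread = threading.Thread(target=run, args=(task,))
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()
```

With a real message queue the loop would block waiting for the next message instead of exiting when the queue is empty; the early exit here only keeps the sketch self-contained.<br>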

<br>

I think what y'all are suggesting is limiting the number of jobs that go<br>

into the queue... that's quite wrong IMO. Apart from the fact it's<br>

impossible (resources put jobs into the queue entirely independently,<br>

and have no knowledge of the global state required to throttle inputs),<br>

we shouldn't implement an in-memory queue with long-running tasks<br>

containing state that can be lost if the process dies - the whole point<br>

of convergence is we have... a message queue for that. We need to limit<br>

the rate that stuff comes *out* of the queue. And, again, since we have<br>

no knowledge of global state, we can only control the rate at which an<br>

individual worker processes tasks. The way to avoid killing the DB is to<br>

put a constant ceiling on the workers * concurrent_tasks_per_worker product.<br>
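<br>
As a worked illustration of holding that product constant (a sketch under assumptions: `tasks_per_worker` and the ceiling of 32 are made up for the example, not Heat options or measured values), the per-worker limit falls as the number of workers rises:<br>

```python
def tasks_per_worker(total_ceiling, workers):
    """Split a fixed global ceiling of concurrent tasks across workers.

    Returns at least 1 so that every worker can make progress, which
    means the ceiling is exceeded once workers outnumber total_ceiling.
    """
    if workers < 1:
        raise ValueError("need at least one worker")
    return max(1, total_ceiling // workers)


# Hypothetical ceiling; the right value depends on what the DB can take.
CEILING = 32

for workers in (1, 4, 8):
    per_worker = tasks_per_worker(CEILING, workers)
    print(workers, "workers:", per_worker, "concurrent tasks each")
```

This matches the earlier point that more engines means fewer concurrent tasks are wanted per engine.<br>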

<br>

cheers,<br>

Zane.<br>

<br>


</blockquote></div></div>