[openstack-dev] Moving task flow to conductor - concern about scale
harlowja at yahoo-inc.com
Sun Jul 21 02:26:49 UTC 2013
Looking at the conductor code, it still seems to me to provide a low-level database API that succumbs to the same races as the old db access did. A get call followed by some response, followed by some python code, followed by some rpc update, followed by more code is still susceptible to consistency & fragility issues.
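To make the get-then-update race concrete, here is a minimal sketch. It is not nova code: the dict-backed `db` and the `get`/`update` helpers are hypothetical stand-ins for a data-oriented rpc API, and the two "conductors" are interleaved by hand so the outcome is deterministic.

```python
# Hypothetical in-memory stand-in for the database; not nova code.
db = {"instance-1": {"task_state": None, "deleted": False}}


def get(uuid):
    # Return a copy, the way a row fetched over rpc is a snapshot.
    return dict(db[uuid])


def update(uuid, values):
    # Blind write: nothing checks that the row is unchanged since the read.
    db[uuid].update(values)


# Conductor A and conductor B both read the same idle instance.
seen_by_a = get("instance-1")
seen_by_b = get("instance-1")

# A decides to build because the instance looked idle.
if seen_by_a["task_state"] is None:
    update("instance-1", {"task_state": "building"})

# B decides to delete, still acting on its stale snapshot.
if seen_by_b["task_state"] is None:
    update("instance-1", {"deleted": True})

final_state = db["instance-1"]
print(final_state)
```

Both checks pass because each conductor acts on its own snapshot, so the instance ends up marked as building and deleted at the same time.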
The API provided is data oriented rather than action oriented. I would argue that a data-oriented API leads to lots of consistency issues with multiple conductors. An action/task-oriented API, if that is ever accomplished, allows the conductor to lock resources that are being "manipulated" so that another conductor cannot alter the same resource at the same time.
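A rough sketch of what "action oriented with locking" could look like, under heavy assumptions: `TaskConductor` and its methods are hypothetical names, not the nova-conductor interface, and the `threading.Lock` only protects callers inside one process. Multiple conductor processes would need a shared lock (database row lock, lock service, or similar) in its place.

```python
import threading
from collections import defaultdict


class TaskConductor:
    """Hypothetical action-oriented facade; not the real nova-conductor API."""

    def __init__(self):
        self._instances = {}
        self._locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()

    def _lock_for(self, uuid):
        # Guard the lock table itself so two callers get the same lock object.
        with self._locks_guard:
            return self._locks[uuid]

    def build(self, uuid):
        # The read, the decision, and the write all happen under one lock.
        with self._lock_for(uuid):
            inst = self._instances.setdefault(uuid, {"state": "new"})
            if inst["state"] == "deleted":
                return False  # refuse to build a deleted instance
            inst["state"] = "active"
            return True

    def delete(self, uuid):
        with self._lock_for(uuid):
            inst = self._instances.setdefault(uuid, {"state": "new"})
            inst["state"] = "deleted"
            return True

    def state(self, uuid):
        with self._lock_for(uuid):
            return self._instances[uuid]["state"]


conductor = TaskConductor()
conductor.build("instance-1")
conductor.delete("instance-1")
rebuilt = conductor.build("instance-1")  # rejected: the instance is deleted
print(rebuilt, conductor.state("instance-1"))
```

The point of the shape is that callers never see a raw get or update, so the "deleted while building" check lives in one auditable place.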
Nova currently has a lot of devoted and hard to follow logic for when resources are simultaneously manipulated (deleted while building, for example). Just look for *not found* exceptions being thrown in the conductor from *get/update function calls and check where each exception is handled (are all of them? are all resources cleaned up??). These seem like examples of an API that is too low level, and they wouldn't be exposed in an action/task oriented API. It appears that nova is trying to handle all of these special "does not exist" or "already exists" (or similar consistency violation) cases correctly, which is good, but having said logic scattered sure doesn't inspire confidence, to me, that it does the right thing under all scenarios. Does that not worry anyone else??
IMHO adding task logic in the conductor on top of the already hard to follow logic for these scenarios worries me personally. That's why I previously thought (and others seem to think) task logic and correct locking and such ... should be located in a service that can devote its code to just doing said tasks reliably. Honestly said code will be much much more complex than a database-rpc access layer (especially when the races and simultaneous manipulation problems are not hidden/scattered but are dealt with in an upfront and easily auditable manner).
But maybe this is nothing new to folks and all of this is already being thought about (solutions do seem to be appearing and more discussion about said ideas is always beneficial).
Just my thoughts...
Sent from my really tiny device...
On Jul 19, 2013, at 5:30 PM, "Peter Feiner" <peter at gridcentric.ca> wrote:
> On Fri, Jul 19, 2013 at 4:36 PM, Joshua Harlow <harlowja at yahoo-inc.com> wrote:
>> This seems to me to be a good example where a library "problem" is leaking into the openstack architecture right? That is IMHO a bad path to go down.
>> I like to think of a world where this isn't a problem, design the correct solution there, and fix the eventlet problem instead. Other large applications don't fall back to rpc calls to get around database/eventlet scaling issues afaik.
>> Honestly I would almost just want to finally fix the eventlet problem (chris b. I think has been working on it) and design a system that doesn't try to work around a library's shortcomings. But maybe that's too much idealism, idk...
> Well, there are two problems that multiple nova-conductor processes
> fix. One is the bad interaction between eventlet and native code. The
> other is allowing multiprocessing. That is, once nova-conductor
> starts to handle enough requests, enough time will be spent holding
> the GIL to make it a bottleneck; in fact I've had to scale keystone
> using multiple processes because of GIL contention (i.e., keystone was
> steadily at 100% CPU utilization when I was hitting OpenStack with
> enough requests). So multiple processes aren't avoidable. Indeed, other
> software that strives for high concurrency, such as apache, uses
> multiple processes to avoid contention for per-process kernel
> resources like the mmap semaphore.
>> This doesn't even touch on the synchronization issues that can happen when u start pumping db traffic over a mq. E.g., an update is now queued behind another update and the second one conflicts with the first; where does resolution happen when an async mq call is used? What about when you have X conductors doing Y reads and Z updates; I don't even want to think about the sync/races there (and so on...). Did u hit / check for any consistency issues in your tests? Consistency issues under high load using multiple conductors scare the bejesus out of me....
> If a sequence of updates needs to be atomic, then they should be made
> in the same database transaction. Hence nova-conductor's interface
> isn't do_some_sql(query), it's a bunch of high-level nova operations
> that are implemented using transactions.
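As an illustration of the "same database transaction" point quoted above, here is a minimal sketch using Python's sqlite3 module as a stand-in for the real database. The `instances` and `quotas` tables and the `delete_instance` function are hypothetical, not nova's schema or API. The quota row is seeded inconsistently on purpose so the second update fails and the first one is rolled back with it.

```python
import sqlite3


def delete_instance(conn, uuid, project):
    """Hypothetical high-level operation: both updates commit or neither does."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "UPDATE instances SET state = 'deleted' "
            "WHERE uuid = ? AND state != 'deleted'",
            (uuid,),
        )
        if cur.rowcount != 1:
            raise LookupError(uuid)  # missing or already deleted
        conn.execute(
            "UPDATE quotas SET used = used - 1 WHERE project = ?",
            (project,),
        )


def instance_state(conn, uuid):
    row = conn.execute(
        "SELECT state FROM instances WHERE uuid = ?", (uuid,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances (uuid TEXT PRIMARY KEY, state TEXT)")
conn.execute(
    "CREATE TABLE quotas (project TEXT PRIMARY KEY, "
    "used INTEGER CHECK (used >= 0))"
)
conn.execute("INSERT INTO instances VALUES ('i-1', 'active')")
conn.execute("INSERT INTO quotas VALUES ('demo', 0)")  # deliberately inconsistent
conn.commit()

try:
    delete_instance(conn, "i-1", "demo")
except sqlite3.IntegrityError:
    pass  # the quota update violated the CHECK, so the whole operation rolled back

state_after_failure = instance_state(conn, "i-1")
print(state_after_failure)
```

A transaction makes one operation atomic. It does not by itself order a read in one rpc call against a write in a later one, which is the gap the concern above is about.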