Open Stack

Fri Dec 20 16:34:06 UTC 2013

Excerpts from Radomir Dopieralski's message of 2013-12-20 01:13:20 -0800:
> On 20/12/13 00:17, Jay Pipes wrote:
> > On 12/19/2013 04:55 AM, Radomir Dopieralski wrote:
> >> On 14/12/13 16:51, Jay Pipes wrote:
> >>
> >> [snip]
> >>
> >>> Instead of focusing on locking issues -- which I agree are very
> >>> important in the virtualized side of things where resources are
> >>> "thinner" -- I believe that in the bare-metal world, a more useful focus
> >>> would be to ensure that the Tuskar API service treats related group
> >>> operations (like "deploy an undercloud on these nodes") in a way that
> >>> can handle failures in a graceful and/or atomic way.
> >>
> >> Atomicity of operations can be achieved by introducing critical sections.
> >> You basically have two ways of doing that, optimistic and pessimistic.
> >> Pessimistic critical section is implemented with a locking mechanism
> >> that prevents all other processes from entering the critical section
> >> until it is finished.
> > 
> > I'm familiar with the traditional non-distributed software concept of a
> > mutex (or in Windows world, a critical section). But we aren't dealing
> > with traditional non-distributed software here. We're dealing with
> > highly distributed software where components involved in the
> > "transaction" may not be running on the same host or have much awareness
> > of each other at all.
> 
> Yes, that is precisely why you need to have a single point where they
> can check if they are not stepping on each other's toes. If you don't,
> you get race conditions and non-deterministic behavior. The only
> difference with traditional, non-distributed software is that since the
> components involved are communicating over a relatively slow network,
> you have a much, much greater chance of actually having a conflict.
> Scaling the whole thing to hundreds of nodes practically guarantees trouble.
> 

Radomir, what Jay is suggesting is that it seems pretty unlikely that
two individuals would be given a directive to deploy OpenStack into a
single pool of hardware at such a scale that they will both use the
whole thing.

Worst case, if it does happen, they both run out of hardware, one
individual deletes their deployment, the other one resumes. This is the
optimistic position and it will work fine. Assuming you are driving this
all through Heat (which, AFAIK, Tuskar still uses), there's even a
blueprint to support you that I'm working on:

https://blueprints.launchpad.net/heat/+spec/retry-failed-update

Even if both operators put the retry in a loop, one would actually
finish at some point.
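
To make the optimistic scenario concrete, here is a minimal sketch of it.
Everything in it (NodePool, Deployment, two_operators_demo) is a
hypothetical in-memory stand-in made up for illustration; it is not Heat,
Tuskar, or Ironic code. It only assumes that a deployment claims nodes one
at a time and fails when the pool is empty.

```python
class OutOfNodes(Exception):
    """Raised when the shared pool has no free hardware left."""


class NodePool:
    """Hypothetical stand-in for a shared pool of bare-metal nodes."""

    def __init__(self, nodes):
        self.free = list(nodes)

    def claim_one(self):
        if not self.free:
            raise OutOfNodes("no free nodes left")
        return self.free.pop()

    def release(self, nodes):
        self.free.extend(nodes)


class Deployment:
    """Hypothetical deployment that wants `wanted` nodes from the pool."""

    def __init__(self, pool, wanted):
        self.pool = pool
        self.wanted = wanted
        self.nodes = []

    def step(self):
        # Claim one more node; no lock is taken up front (optimistic).
        self.nodes.append(self.pool.claim_one())

    def update(self):
        """Try, or retry, to reach the wanted size. True when complete."""
        try:
            while len(self.nodes) < self.wanted:
                self.step()
        except OutOfNodes:
            return False  # keep what we have; the operator decides what next
        return True

    def delete(self):
        self.pool.release(self.nodes)
        self.nodes = []


def two_operators_demo():
    pool = NodePool(["n1", "n2", "n3", "n4"])
    a = Deployment(pool, wanted=3)
    b = Deployment(pool, wanted=3)
    for _ in range(2):          # the two operators interleave their claims
        a.step()
        b.step()
    first_try = (a.update(), b.update())   # both run out of hardware
    a.delete()                             # one operator backs off
    return first_try, b.update(), len(b.nodes)
```

In this sketch the conflict is only detected when a claim fails, and it is
resolved by one operator deleting their deployment so that the other
operator's retry can complete.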

> > Trying to make a complex series of related but distributed actions --
> > like the underlying actions of the Tuskar -> Ironic API calls -- into an
> > atomic operation is just not a good use of programming effort, IMO.
> > Instead, I'm advocating that programming effort be spent
> > coding a workflow/taskflow pipeline that can gracefully retry failed
> > operations and report the state of the total taskflow back to the user.
> 
> Sure, there are many ways to solve any particular synchronisation
> problem. Let's say that we have one that can actually be solved by
> retrying. Do you want to retry infinitely? Would you like to increase
> the delays between retries exponentially? If so, where are you going to
> keep the shared counters for the retries? Perhaps in tuskar-api, hmm?
> 

I don't think a sane person would retry more than maybe once without
checking with the other operators.
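
As a rough sketch of what a bounded retry could look like, assuming the
attempt counter and delay live in the caller rather than in any shared
service: retry_update and its parameters are hypothetical names, and the
update callable stands in for whatever actually re-runs the failed
operation.

```python
import time


def retry_update(update, max_attempts=2, base_delay=1.0, sleep=time.sleep):
    """Call update() until it returns True or the attempts run out.

    Returns the number of attempts used. Raises RuntimeError once the
    budget is exhausted, so that a human can decide what to do next.
    """
    for attempt in range(1, max_attempts + 1):
        if update():
            return attempt
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("update still failing after %d attempts" % max_attempts)
```

With the default of two attempts this amounts to "retry once, then go and
talk to the other operators"; the counter is a local variable, so nothing
has to be stored centrally for it.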

> Or are you just saying that we should pretend that the nondeterministic
> bugs appearing due to the lack of synchronization simply don't exist?
> They cannot be easily reproduced, after all. We could just close our
> eyes, cover our ears, sing "lalalala" and close any bug reports with
> such errors with "could not reproduce on my single-user, single-machine
> development installation". I know that a lot of software companies do
> exactly that, so I guess it's a valid business practice, I just want to
> make sure that this is actually the tactic that we are going to take,
> before committing to an architectural decision that will make those bugs
> impossible to fix.
> 

OpenStack is non-deterministic. Deterministic systems are rigid and unable
to handle diverse failure modes. We tend to err toward
pushing problems back to the user and giving them tools to resolve the
problem. Avoiding spurious problems is important too, no doubt. However,
what Jay has been suggesting is that the situation a pessimistic locking
system would avoid is entirely user created, and thus lower priority
than, say, actually having a complete UI for deploying OpenStack.

[openstack-dev] [TripleO] Tuskar CLI after architecture changes
