[openstack-dev] [TripleO] Tuskar CLI after architecture changes

Jay Pipes jaypipes at gmail.com
Sat Dec 21 05:10:16 UTC 2013


On 12/20/2013 11:34 AM, Clint Byrum wrote:
> Excerpts from Radomir Dopieralski's message of 2013-12-20 01:13:20 -0800:
>> On 20/12/13 00:17, Jay Pipes wrote:
>>> On 12/19/2013 04:55 AM, Radomir Dopieralski wrote:
>>>> On 14/12/13 16:51, Jay Pipes wrote:
>>>>
>>>> [snip]
>>>>
>>>>> Instead of focusing on locking issues -- which I agree are very
>>>>> important in the virtualized side of things where resources are
>>>>> "thinner" -- I believe that in the bare-metal world, a more useful focus
>>>>> would be to ensure that the Tuskar API service treats related group
>>>>> operations (like "deploy an undercloud on these nodes") in a way that
>>>>> can handle failures in a graceful and/or atomic way.
>>>>
>>>> Atomicity of operations can be achieved by introducing critical sections.
>>>> You basically have two ways of doing that: optimistic and pessimistic.
>>>> A pessimistic critical section is implemented with a locking mechanism
>>>> that prevents all other processes from entering the critical section
>>>> until it is finished.
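
For anyone skimming the thread, here are the two patterns in miniature.
This is purely illustrative Python, not anything from Tuskar or Ironic;
the Node model and the version-check interface are made up:

import threading

class Node(object):
    def __init__(self, name):
        self.name = name
        self.free = True

_lock = threading.Lock()

def claim_pessimistic(inventory, count):
    # Pessimistic: hold a lock for the whole critical section, so no
    # other actor can touch the inventory while we work.
    with _lock:
        free = [n for n in inventory if n.free]
        if len(free) < count:
            raise RuntimeError("not enough free nodes")
        for n in free[:count]:
            n.free = False
        return free[:count]

class Conflict(Exception):
    pass

def claim_optimistic(inventory, count, expected_version, read_version):
    # Optimistic: work without a lock, then check at "commit" time
    # whether someone else changed the inventory; if so, raise and let
    # the caller retry or report back to the user.
    free = [n for n in inventory if n.free]
    if len(free) < count:
        raise RuntimeError("not enough free nodes")
    if read_version() != expected_version:
        raise Conflict("inventory changed underneath us")
    for n in free[:count]:
        n.free = False
    return free[:count]
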
>>>
>>> I'm familiar with the traditional non-distributed software concept of a
>>> mutex (or, in the Windows world, a critical section). But we aren't dealing
>>> with traditional non-distributed software here. We're dealing with
>>> highly distributed software where components involved in the
>>> "transaction" may not be running on the same host or have much awareness
>>> of each other at all.
>>
>> Yes, that is precisely why you need to have a single point where they
>> can check if they are not stepping on each other's toes. If you don't,
>> you get race conditions and non-deterministic behavior. The only
>> difference with traditional, non-distributed software is that since the
>> components involved are communicating over a relatively slow network,
>> you have a much, much greater chance of actually having a conflict.
>> Scaling the whole thing to hundreds of nodes practically guarantees trouble.
>>
>
> Radomir, what Jay is suggesting is that it seems pretty unlikely that
> two individuals would be given a directive to deploy OpenStack into a
> single pool of hardware at such a scale that they would both use the
> whole thing.
>
> Worst case, if it does happen, they both run out of hardware; one
> individual deletes their deployment and the other one resumes. This is the
> optimistic position, and it will work fine. Assuming you are driving this
> all through Heat (which, AFAIK, Tuskar still does), there's even a
> blueprint to support you that I'm working on:
>
> https://blueprints.launchpad.net/heat/+spec/retry-failed-update
>
> Even if both operators put the retry in a loop, one would actually
> finish at some point.

Yes, thank you Clint. That is precisely what I was saying.
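
In code terms, the flow Clint describes is roughly the sketch below. It is
illustrative only; update_stack and report are stand-ins for "ask Heat to
update the stack" and "tell the operator", not real heatclient calls:

MAX_ATTEMPTS = 2  # try, retry once, then hand the problem to the operator

class OutOfCapacity(Exception):
    """Stand-in for 'the deployment failed because no free nodes remain'."""

def deploy(update_stack, plan, report):
    # Run the update optimistically; on a capacity failure, retry once
    # and then surface the error instead of serializing everyone
    # through a lock up front.
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return update_stack(plan)
        except OutOfCapacity as exc:
            report("attempt %d failed: %s" % (attempt, exc))
            if attempt == MAX_ATTEMPTS:
                raise  # the operators sort it out between themselves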

>>> Trying to make a complex series of related but distributed actions --
>>> like the underlying actions of the Tuskar -> Ironic API calls -- into an
>>> atomic operation is just not a good use of programming effort, IMO.
>>> Instead, I'm advocating that programming effort should instead be spent
>>> coding a workflow/taskflow pipeline that can gracefully retry failed
>>> operations and report the state of the total taskflow back to the user.
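
To make the shape of that concrete, here is a rough sketch along the lines
of what the taskflow library offers. The tasks and what they print are
invented for illustration; a real pipeline would call into Ironic and Heat:

from taskflow import engines
from taskflow import task
from taskflow.patterns import linear_flow

class RegisterNodes(task.Task):
    def execute(self):
        print("registering nodes with Ironic")

    def revert(self, *args, **kwargs):
        # Called if a later task fails, so the whole flow unwinds cleanly.
        print("a later step failed; unregistering nodes")

class DeployUndercloud(task.Task):
    def execute(self):
        print("asking Heat to bring up the undercloud stack")

    def revert(self, *args, **kwargs):
        print("deployment failed; tearing the stack back down")

flow = linear_flow.Flow("deploy-undercloud")
flow.add(RegisterNodes(), DeployUndercloud())

# The engine runs the tasks in order, reverts completed ones on failure,
# and exposes the flow's state so it can be reported back to the user.
engines.run(flow)
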
>>
>> Sure, there are many ways to solve any particular synchronisation
>> problem. Let's say that we have one that can actually be solved by
>> retrying. Do you want to retry infinitely? Would you like to increase
>> the delays between retries exponentially? If so, where are you going to
>> keep the shared counters for the retries? Perhaps in tuskar-api, hmm?
>>
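
For reference, the bounded, exponentially backed-off retry loop being
described is small enough to sketch; note that the counter and delays below
live in the retrying process itself, not in a shared service:

import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=1.0):
    # Bounded retries with exponentially growing, jittered delays.
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
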
>
> I don't think a sane person would retry more than maybe once without
> checking with the other operators.
>
>> Or are you just saying that we should pretend that the non-deterministic
>> bugs appearing due to the lack of synchronization simply don't exist?
>> They cannot be easily reproduced, after all. We could just close our
>> eyes, cover our ears, sing "lalalala" and close any bug reports with
>> such errors as "could not reproduce on my single-user, single-machine
>> development installation". I know that a lot of software companies do
>> exactly that, so I guess it's a valid business practice; I just want to
>> make sure that this is actually the tactic we are going to take
>> before committing to an architectural decision that will make those bugs
>> impossible to fix.
>>
>
> OpenStack is non-deterministic. Deterministic systems are rigid and unable
> to handle any real diversity of failure modes. We tend to err toward
> pushing problems back to the user and giving them tools to resolve the
> problem. Avoiding spurious problems is important too, no doubt. However,
> what Jay has been suggesting is that the situation a pessimistic locking
> system would avoid is entirely user-created, and thus lower priority
> than, say, actually having a complete UI for deploying OpenStack.

Bingo.

Thanks,
-jay



