[openstack-dev] [TripleO] Tuskar CLI after architecture changes

Jiří Stránský jistr at redhat.com
Fri Jan 3 12:57:41 UTC 2014


On 21.12.2013 06:10, Jay Pipes wrote:
> On 12/20/2013 11:34 AM, Clint Byrum wrote:
>> Excerpts from Radomir Dopieralski's message of 2013-12-20 01:13:20 -0800:
>>> On 20/12/13 00:17, Jay Pipes wrote:
>>>> On 12/19/2013 04:55 AM, Radomir Dopieralski wrote:
>>>>> On 14/12/13 16:51, Jay Pipes wrote:
>>>>>
>>>>> [snip]
>>>>>
>>>>>> Instead of focusing on locking issues -- which I agree are very
>>>>>> important in the virtualized side of things where resources are
>>>>>> "thinner" -- I believe that in the bare-metal world, a more useful focus
>>>>>> would be to ensure that the Tuskar API service treats related group
>>>>>> operations (like "deploy an undercloud on these nodes") in a way that
>>>>>> can handle failures in a graceful and/or atomic way.
>>>>>
>>>>> Atomicity of operations can be achieved by introducing critical sections.
>>>>> You basically have two ways of doing that, optimistic and pessimistic.
>>>>> A pessimistic critical section is implemented with a locking mechanism
>>>>> that prevents all other processes from entering the critical section
>>>>> until it is finished.
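To make the two options concrete, here's a rough in-process sketch (the
resource and the names are invented, but the shape is the usual one: the
pessimistic variant holds a lock for the whole critical section, while the
optimistic one does the work lock-free and only checks for a conflict at
commit time):

    import threading

    lock = threading.Lock()
    resource = {'state': 'free', 'version': 0}

    # Pessimistic: take the lock up front; nobody else can enter the
    # critical section until we are done.
    def reserve_pessimistic():
        with lock:
            if resource['state'] != 'free':
                raise Exception('already reserved')
            resource['state'] = 'reserved'

    # Optimistic: remember what we saw, do the (potentially slow) work
    # without holding anything, then commit only if nothing changed
    # underneath us.
    def reserve_optimistic():
        seen = resource['version']
        # ... slow work happens here, no lock held ...
        # The commit itself must be atomic; in a database-backed service
        # this would be a single "UPDATE ... WHERE version = :seen".
        with lock:
            if resource['version'] != seen:
                raise Exception('conflict, caller should retry')
            resource['state'] = 'reserved'
            resource['version'] += 1
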
>>>>
>>>> I'm familiar with the traditional non-distributed software concept of a
>>>> mutex (or in Windows world, a critical section). But we aren't dealing
>>>> with traditional non-distributed software here. We're dealing with
>>>> highly distributed software where components involved in the
>>>> "transaction" may not be running on the same host or have much awareness
>>>> of each other at all.
>>>
>>> Yes, that is precisely why you need to have a single point where they
>>> can check if they are not stepping on each other's toes. If you don't,
>>> you get race conditions and non-deterministic behavior. The only
>>> difference with traditional, non-distributed software is that since the
>>> components involved are communicating over a relatively slow network,
>>> you have a much, much greater chance of actually having a conflict.
>>> Scaling the whole thing to hundreds of nodes practically guarantees trouble.
>>>
>>
>> Radomir, what Jay is suggesting is that it seems pretty unlikely that
>> two individuals would be given a directive to deploy OpenStack into a
>> single pool of hardware at such a scale that they would both need the
>> whole thing.
>>
>> Worst case, if it does happen, they both run out of hardware, one
>> individual deletes their deployment, the other one resumes. This is the
>> optimistic position and it will work fine. Assuming you are driving this
>> all through Heat (and AFAIK Tuskar still does), there's even a
>> blueprint to support you that I'm working on:
>>
>> https://blueprints.launchpad.net/heat/+spec/retry-failed-update
>>
>> Even if both operators put the retry in a loop, one would actually
>> finish at some point.
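For the record, that optimistic loop would look roughly like this (the
heatclient calls are written from memory and only meant to show the shape
of the loop; the point is a bounded, client-side retry with no extra
locking service anywhere):

    import time

    def wait_for_stack(heat, stack_id, poll=10):
        # Poll until Heat reports a terminal state for the stack.
        while True:
            stack = heat.stacks.get(stack_id)
            if not stack.stack_status.endswith('IN_PROGRESS'):
                return stack
            time.sleep(poll)

    def update_with_retry(heat, stack_id, template, attempts=3):
        for attempt in range(attempts):
            heat.stacks.update(stack_id, template=template)
            stack = wait_for_stack(heat, stack_id)
            if stack.stack_status == 'UPDATE_COMPLETE':
                return stack
        raise RuntimeError('update failed %d times, give up and ask a human'
                           % attempts)
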
>
> Yes, thank you Clint. That is precisely what I was saying.
>
>>>> Trying to make a complex series of related but distributed actions --
>>>> like the underlying actions of the Tuskar -> Ironic API calls -- into an
>>>> atomic operation is just not a good use of programming effort, IMO.
>>>> Instead, I'm advocating that programming effort should instead be spent
>>>> coding a workflow/taskflow pipeline that can gracefully retry failed
>>>> operations and report the state of the total taskflow back to the user.
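A rough sketch of what that pipeline could look like on top of taskflow
(the task names are invented and the API is quoted from memory, so treat
it as illustrative rather than a concrete proposal):

    from taskflow import engines, task
    from taskflow.patterns import linear_flow

    # Stand-ins for the Tuskar -> Ironic steps.  Each task knows how to
    # undo itself, so a failure half-way through reverts the earlier
    # steps instead of leaving the deployment half-applied.
    class RegisterNodes(task.Task):
        def execute(self, nodes):
            print('registering %s' % nodes)

        def revert(self, nodes, **kwargs):
            print('unregistering %s' % nodes)

    class DeployOvercloud(task.Task):
        def execute(self, nodes):
            print('deploying overcloud on %s' % nodes)

    flow = linear_flow.Flow('deploy-overcloud').add(
        RegisterNodes(),
        DeployOvercloud(),
    )
    engines.run(flow, store={'nodes': ['node-1', 'node-2']})

The whole flow can simply be re-run after a failure, and the engine keeps
track of each task's state, which covers the "report the state of the
total taskflow back to the user" part.
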
>>>
>>> Sure, there are many ways to solve any particular synchronisation
>>> problem. Let's say that we have one that can actually be solved by
>>> retrying. Do you want to retry infinitely? Would you like to increase
>>> the delays between retries exponentially? If so, where are you going to
>>> keep the shared counters for the retries? Perhaps in tuskar-api, hmm?
>>>
>>
>> I don't think a sane person would retry more than maybe once without
>> checking with the other operators.
>>
>>> Or are you just saying that we should pretend that the nondeterministic
>>> bugs appearing due to the lack of synchronization simply don't exist?
>>> They cannot be easily reproduced, after all. We could just close our
>>> eyes, cover our ears, sing "lalalala" and close any bug reports with
>>> such errors with "could not reproduce on my single-user, single-machine
>>> development installation". I know that a lot of software companies do
>>> exactly that, so I guess it's a valid business practice, I just want to
>>> make sure that this is actually the tactic that we are going to take,
>>> before committing to an architectural decision that will make those bugs
>>> impossible to fix.
>>>
>>
>> OpenStack is non-deterministic. Deterministic systems are rigid and unable
>> to handle any real diversity of failure modes. We tend to err toward
>> pushing problems back to the user and giving them tools to resolve the
>> problem. Avoiding spurious problems is important too, no doubt. However,
>> what Jay has been suggesting is that the situation a pessimistic locking
>> system would avoid is entirely user-created, and thus lower priority
>> than say, actually having a complete UI for deploying OpenStack.

+1. I very much agree with Jay and Clint on this matter.

Jirka


