[openstack-dev] [TripleO] Tuskar CLI after architecture changes
Radomir Dopieralski
openstack at sheep.art.pl
Fri Dec 20 09:13:20 UTC 2013
On 20/12/13 00:17, Jay Pipes wrote:
> On 12/19/2013 04:55 AM, Radomir Dopieralski wrote:
>> On 14/12/13 16:51, Jay Pipes wrote:
>>
>> [snip]
>>
>>> Instead of focusing on locking issues -- which I agree are very
>>> important in the virtualized side of things where resources are
>>> "thinner" -- I believe that in the bare-metal world, a more useful focus
>>> would be to ensure that the Tuskar API service treats related group
>>> operations (like "deploy an undercloud on these nodes") in a way that
>>> can handle failures in a graceful and/or atomic way.
>>
>> Atomicity of operations can be achieved by introducing critical sections.
>> You basically have two ways of doing that, optimistic and pessimistic.
>> A pessimistic critical section is implemented with a locking mechanism
>> that prevents all other processes from entering the critical section
>> until it is finished.
>
> I'm familiar with the traditional non-distributed software concept of a
> mutex (or in Windows world, a critical section). But we aren't dealing
> with traditional non-distributed software here. We're dealing with
> highly distributed software where components involved in the
> "transaction" may not be running on the same host or have much awareness
> of each other at all.
Yes, that is precisely why you need to have a single point where they
can check whether they are stepping on each other's toes. If you don't,
you get race conditions and non-deterministic behavior. The only
difference from traditional, non-distributed software is that since the
components involved are communicating over a relatively slow network,
you have a much, much greater chance of actually hitting a conflict.
Scaling the whole thing to hundreds of nodes practically guarantees trouble.
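To make this concrete, here is a rough sketch of what I mean by having
tuskar-api be that single point, using the optimistic variant: a
compare-and-swap on a version column, so whichever request loses the
race fails cleanly instead of silently overwriting the other. The table,
column and function names are made up for illustration; this is not
actual Tuskar code:

# Sketch only: an optimistic "critical section" implemented as a
# compare-and-swap on a version column in tuskar-api's database.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE overcloud (id INTEGER PRIMARY KEY, "
           "node_ids TEXT, version INTEGER)")
db.execute("INSERT INTO overcloud VALUES (1, '', 0)")
db.commit()

def assign_nodes(conn, overcloud_id, node_ids, seen_version):
    """Apply the node assignment that the UI showed the user.

    Succeeds only if nobody changed the record since the user last
    read it (the stored version still equals seen_version); otherwise
    the caller has to re-read and ask the user to confirm again.
    """
    cur = conn.execute(
        "UPDATE overcloud SET node_ids = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (",".join(node_ids), overcloud_id, seen_version))
    conn.commit()
    return cur.rowcount == 1   # False means we lost the race

# The first admin read version 0 and their deploy goes through...
print(assign_nodes(db, 1, ["node-1", "node-2"], seen_version=0))  # True
# ...the second admin also read version 0, so their stale request
# fails instead of silently replacing the first admin's node set.
print(assign_nodes(db, 1, ["node-3"], seen_version=0))            # False

The pessimistic variant would instead hold a lock (a database row lock,
for example) for the duration of the whole operation; either way, the
check has to live in one shared place.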
> And, in any case (see below), I don't think that this is a problem that
> needs to be solved in Tuskar.
>
>> Perhaps you have some other way of making them atomic that I can't
>> think of?
>
> I should not have used the term atomic above. I actually do not think
> that the things that Tuskar/Ironic does should be viewed as an atomic
> operation. More below.
OK, no operations performed by Tuskar need to be atomic, noted.
>>> For example, if the construction or installation of one compute worker
>>> failed, adding some retry or retry-after-wait-for-event logic would be
>>> more useful than trying to put locks in a bunch of places to prevent
>>> multiple sysadmins from trying to deploy on the same bare-metal nodes
>>> (since it's just not gonna happen in the real world, and IMO, if it did
>>> happen, the sysadmins/deployers should be punished and have to clean up
>>> their own mess ;)
>>
>> I don't see why they should be punished, if the UI was assuring them
>> that they are doing exactly the thing that they wanted to do, at every
>> step, and in the end it did something completely different, without any
>> warning. If anyone deserves punishment in such a situation, it's the
>> programmers who wrote the UI in such a way.
>
> The issue I am getting at is that, in the real world, the problem of
> multiple users of Tuskar attempting to deploy an undercloud on the exact
> same set of bare metal machines is just not going to happen. If you
> think this is actually a real-world problem, and have seen two sysadmins
> actively trying to deploy an undercloud on bare-metal machines at the
> same time, unbeknownst to each other, then I feel bad for the
> sysadmins that found themselves in such a situation, but I feel it's
> their own fault for not knowing what the other was doing.
How can it be their fault, when at every step of their interaction with
the user interface, the user interface was assuring them that they were
going to do the right thing (deploy a certain set of nodes), but when
they finally hit the confirmation button, it did a completely different
thing (deployed a different set of nodes)? The only fault I see is in
them using such software. Or are you suggesting that they should
implement the lock themselves, through e-mails or some other means of
communication?
Don't get me wrong, the deploy button is just one easy example of this
problem. We have it all over the user interface. Even such a simple
operation as retrieving a list of node ids and then displaying the
corresponding information to the user has a race condition in it -- what
if some of the nodes get deleted after we get the list of ids, but
before we make the call to get node details about them? This should be
done as an atomic operation that either locks, or fails if there was a
change in the middle of it, and since the calls go to different
systems, the only place where you can set a lock or check whether there
was a change is the tuskar-api. And no, retrying the request for
information about a deleted node won't help -- you can keep retrying for
years, and the node will still remain deleted. This is all over the
place. And saying that "this is the user's fault" doesn't help.
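Here is a small sketch of that list-then-fetch race, with the API calls
replaced by made-up in-process stand-ins so the problem is easy to see:

# The "client" calls below are stand-ins for the real API calls; the
# node store is simulated in-process for illustration only.
NODES = {"node-1": {"id": "node-1", "power": "on"},
         "node-2": {"id": "node-2", "power": "off"}}

class NotFound(Exception):
    pass

def list_node_ids():
    return list(NODES)

def get_node(node_id):
    try:
        return NODES[node_id]
    except KeyError:
        raise NotFound(node_id)

ids = list_node_ids()       # the UI gets ["node-1", "node-2"]
del NODES["node-2"]         # someone deletes a node in the meantime

details = []
for node_id in ids:
    try:
        details.append(get_node(node_id))
    except NotFound:
        # Retrying here is pointless -- the node is gone for good.
        # Either the whole read happens atomically behind one API, or
        # the caller has to detect and handle a stale view like this.
        pass

print(details)              # only node-1; the list we showed is stale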
> Trying to make a complex series of related but distributed actions --
> like the underlying actions of the Tuskar -> Ironic API calls -- into an
> atomic operation is just not a good use of programming effort, IMO.
> Instead, I'm advocating that programming effort should instead be spent
> coding a workflow/taskflow pipeline that can gracefully retry failed
> operations and report the state of the total taskflow back to the user.
Sure, there are many ways to solve any particular synchronisation
problem. Let's say that we have one that can actually be solved by
retrying. Do you want to retry infinitely? Would you like to increase
the delays between retries exponentially? If so, where are you going to
keep the shared counters for the retries? Perhaps in tuskar-api, hmm?
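Even the simplest sane retry policy needs state and configuration
somewhere. A rough sketch, nothing Tuskar-specific, with the operation
being whatever call we decide is actually safe to retry:

# Illustration only: retry with exponential backoff and jitter. Note
# that the attempt counter lives in this one process; if several
# services retry the same operation, that shared state still has to
# live somewhere central.
import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise               # give up and surface the failure
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)

And this only works for operations where retrying can ever succeed,
which, as with the deleted node above, is not a given.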
Or are you just saying that we should pretend that the nondeterministic
bugs appearing due to the lack of synchronization simply don't exist?
They cannot be easily reproduced, after all. We could just close our
eyes, cover our ears, sing "lalalala" and close any bug reports with
such errors with "could not reproduce on my single-user, single-machine
development installation". I know that a lot of software companies do
exactly that, so I guess it's a valid business practice, I just want to
make sure that this is actually the tactic that we are going to take,
before committing to an architectural decision that will make those bugs
impossible to fix.
--
Radomir Dopieralski