[openstack-dev] [TripleO] Tuskar CLI after architecture changes

Ladislav Smola lsmola at redhat.com
Fri Dec 20 11:25:51 UTC 2013


May I propose we keep the conversation Icehouse-related? I don't think
we can build any sort of locking mechanism in I.

Though it would be worth creating a wiki page that presents the whole
thing in a consistent manner. I am getting lost in these emails. :-)

So, what do you think are the biggest issues for the Icehouse tasks we
have?

1. GET operations?
I don't think we need to be atomic here. We basically join resources
from multiple APIs together. I think it's perfectly fine that something
may be deleted in the process. Even right now we only join together
things that exist, and we can handle it when something is missing.
There is no need for locking or retrying here AFAIK.
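
For illustration, here is a minimal sketch of what such a tolerant join
could look like (assuming python-ironicclient/python-novaclient style
handles; the clients and fields here are placeholders, not a claim
about Tuskar's actual code):

    from novaclient import exceptions as nova_exc

    def join_node_details(ironic, nova):
        # Join Ironic nodes with their Nova instances; instead of
        # locking, just skip anything that disappears mid-listing.
        joined = []
        for node in ironic.node.list():
            if not node.instance_uuid:
                continue  # node not deployed, nothing to join
            try:
                server = nova.servers.get(node.instance_uuid)
            except nova_exc.NotFound:
                # Deleted after we fetched the list; leave it out,
                # exactly as argued above.
                continue
            joined.append({'node': node, 'server': server})
        return joined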

2. Heat stack create, update
The stack is locked for the duration of the operation, so nobody can
mess with it while it is updating or creating. Once we pack all the
operations that now happen on the side into the stack create/update
itself, we should be alright, and that should be doable in I. So we
should push towards this rather than building some temporary locking
solution in Tuskar-API.
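
A rough sketch of leaning on Heat's own lock (python-heatclient v1
style calls; 'heat' is a client handle, and the error handling is
deliberately minimal):

    def safe_stack_update(heat, stack_id, template, parameters):
        stack = heat.stacks.get(stack_id)
        if stack.stack_status.endswith('IN_PROGRESS'):
            # Heat is already operating on the stack and would reject
            # a concurrent update anyway, so report it and bail out.
            raise RuntimeError('stack %s is busy (%s)'
                               % (stack_id, stack.stack_status))
        heat.stacks.update(stack_id, template=template,
                           parameters=parameters)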

3. Reservation of resources
Since we can deploy only one stack now, I don't think multiple users
should be a problem there. If somebody deletes resources from the
'free pool' while a deploy is in progress, the deploy will fail with
'Not enough free resources', and I guess that is fine.
Also, I am not sure how it works now, but it should be possible to
deploy smartly, so the stack keeps working even with a smaller amount
of resources. Then we would just run heat stack-update with the
numbers it actually ended up with, and it would switch to an OK status
without changing anything.
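
A sketch of that reconciliation step (python-heatclient style again;
count_active_nodes() and the 'compute_count' parameter are hypothetical
stand-ins for however the overcloud template actually counts nodes):

    def reconcile_node_count(heat, stack_id, template):
        # Ask the deployed stack how many nodes really came up.
        actual = count_active_nodes(heat, stack_id)  # hypothetical
        # Re-run the update with the number we ended up with; if it
        # already matches reality, Heat should converge back to a
        # *_COMPLETE status without touching anything.
        heat.stacks.update(stack_id, template=template,
                           parameters={'compute_count': actual})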

So, are there any other critical sections you see?

I know we did this the wrong way in the previous Tuskar-API, and I
think we are avoiding that now. And we will avoid it in the future,
simply by not doing this kind of thing until there is a proper way to
do it.

Thanks,
Ladislav


On 12/20/2013 10:13 AM, Radomir Dopieralski wrote:
> On 20/12/13 00:17, Jay Pipes wrote:
>> On 12/19/2013 04:55 AM, Radomir Dopieralski wrote:
>>> On 14/12/13 16:51, Jay Pipes wrote:
>>>
>>> [snip]
>>>
>>>> Instead of focusing on locking issues -- which I agree are very
>>>> important in the virtualized side of things where resources are
>>>> "thinner" -- I believe that in the bare-metal world, a more useful focus
>>>> would be to ensure that the Tuskar API service treats related group
>>>> operations (like "deploy an undercloud on these nodes") in a way that
>>>> can handle failures in a graceful and/or atomic way.
>>> Atomicity of operations can be achieved by introducing critical sections.
>>> You basically have two ways of doing that, optimistic and pessimistic.
>>> A pessimistic critical section is implemented with a locking mechanism
>>> that prevents all other processes from entering the critical section
>>> until it is finished.
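
[Aside: a toy illustration of the two approaches, in plain Python with
only the standard library -- nothing Tuskar-specific.]

    import threading

    _lock = threading.Lock()
    _version = 0

    # Pessimistic: hold a lock for the whole critical section.
    def pessimistic_reserve(pool, n):
        with _lock:                  # others block until we are done
            if len(pool) < n:
                raise ValueError('Not enough free resources')
            return [pool.pop() for _ in range(n)]

    # Optimistic: block nobody; remember the version we started from
    # and fail if it changed before we could commit.
    def optimistic_reserve(pool, n, seen_version):
        global _version
        with _lock:                  # guards only the commit itself
            if seen_version != _version:
                raise RuntimeError('Conflict, state changed underneath us')
            if len(pool) < n:
                raise ValueError('Not enough free resources')
            _version += 1
            return [pool.pop() for _ in range(n)]
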
>> I'm familiar with the traditional non-distributed software concept of a
>> mutex (or in Windows world, a critical section). But we aren't dealing
>> with traditional non-distributed software here. We're dealing with
>> highly distributed software where components involved in the
>> "transaction" may not be running on the same host or have much awareness
>> of each other at all.
> Yes, that is precisely why you need to have a single point where they
> can check if they are not stepping on each other's toes. If you don't,
> you get race conditions and non-deterministic behavior. The only
> difference with traditional, non-distributed software is that since the
> components involved are communicating over a relatively slow network,
> you have a much, much greater chance of actually having a conflict.
> Scaling the whole thing to hundreds of nodes practically guarantees trouble.
>
>> And, in any case (see below), I don't think that this is a problem that
>> needs to be solved in Tuskar.
>>
>>> Perhaps you have some other way of making them atomic that I can't
>>> think of?
>> I should not have used the term atomic above. I actually do not think
>> that the things that Tuskar/Ironic does should be viewed as an atomic
>> operation. More below.
> OK, no operations performed by Tuskar need to be atomic, noted.
>
>>>> For example, if the construction or installation of one compute worker
>>>> failed, adding some retry or retry-after-wait-for-event logic would be
>>>> more useful than trying to put locks in a bunch of places to prevent
>>>> multiple sysadmins from trying to deploy on the same bare-metal nodes
>>>> (since it's just not gonna happen in the real world, and IMO, if it did
>>>> happen, the sysadmins/deployers should be punished and have to clean up
>>>> their own mess ;)
>>> I don't see why they should be punished, if the UI was assuring them
>>> that they are doing exactly the thing that they wanted to do, at every
>>> step, and in the end it did something completely different, without any
>>> warning. If anyone deserves punishment in such a situation, it's the
>>> programmers who wrote the UI in such a way.
>> The issue I am getting at is that, in the real world, the problem of
>> multiple users of Tuskar attempting to deploy an undercloud on the exact
>> same set of bare metal machines is just not going to happen. If you
>> think this is actually a real-world problem, and have seen two sysadmins
>> actively trying to deploy an undercloud on bare-metal machines at the
> same time, unbeknownst to each other, then I feel bad for the
> sysadmins that found themselves in such a situation, but I feel it's
>> their own fault for not knowing about what the other was doing.
> How can it be their fault, when at every step of their interaction with
> the user interface, the user interface was assuring them that they are
> going to do the right thing (deploy a certain set of nodes), but when
> they finally hit the confirmation button, did a completely different
> thing (deployed a different set of nodes)? The only fault I see is in
> them using such software. Or are you suggesting that they should
> implement the lock themselves, through e-mails or some other means of
> communication?
>
> Don't get me wrong, the deploy button is just one easy example of this
> problem. We have it all over the user interface. Even such a simple
> operation as retrieving a list of node ids and then displaying the
> corresponding information to the user has a race condition in it -- what
> if some of the nodes get deleted after we get the list of ids, but
> before we make the call to get node details about them? This should be
> done as an atomic operation that either locks, or fails if there was a
> change in the middle of it, and since the calls are to different
> systems, the only place where you can set a lock or check if there was a
> change is the tuskar-api. And no, trying to fetch the information about a
> deleted node again won't help -- you can keep retrying for years, and
> the node will still remain deleted. This is all over the place. And,
> saying that "this is the user's fault" doesn't help.
>
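
[Aside: the fail-on-change variant of that node lookup could look
roughly like this; list_node_ids() and get_node() are hypothetical
stand-ins for the underlying API calls.]

    def get_nodes_consistent(list_node_ids, get_node, max_attempts=3):
        # Optimistic read: fetch the ids, fetch the details, then check
        # that the list did not change underneath us; retry a bounded
        # number of times, then give up loudly.
        for _ in range(max_attempts):
            ids = list_node_ids()
            details = []
            stale = False
            for node_id in ids:
                node = get_node(node_id)   # returns None if deleted
                if node is None:
                    stale = True
                    break
                details.append(node)
            if not stale and list_node_ids() == ids:
                return details
        raise RuntimeError('node list kept changing, giving up')
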
>> Trying to make a complex series of related but distributed actions --
>> like the underlying actions of the Tuskar -> Ironic API calls -- into an
>> atomic operation is just not a good use of programming effort, IMO.
>> Instead, I'm advocating that programming effort should instead be spent
>> coding a workflow/taskflow pipeline that can gracefully retry failed
>> operations and report the state of the total taskflow back to the user.
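
[Aside: a minimal sketch of such a pipeline with the OpenStack taskflow
library -- a linear flow with a bounded retry; the task body is a
placeholder, not Tuskar code.]

    from taskflow import engines, retry, task
    from taskflow.patterns import linear_flow

    class DeployNode(task.Task):
        def execute(self):
            # placeholder for the real Ironic/Heat calls
            print('deploying a node')

    flow = linear_flow.Flow('deploy-undercloud',
                            retry=retry.Times(3)).add(DeployNode())
    engines.run(flow)
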
> Sure, there are many ways to solve any particular synchronisation
> problem. Let's say that we have one that can actually be solved by
> retrying. Do you want to retry infinitely? Would you like to increase
> the delays between retries exponentially? If so, where are you going to
> keep the shared counters for the retries? Perhaps in tuskar-api, hmm?
>
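
[Aside: a process-local exponential backoff is the easy part; the point
above is that the retry counter below lives in a single process, so
sharing it across services needs a common store somewhere.]

    import random
    import time

    def retry_with_backoff(fn, max_attempts=5, base_delay=0.5):
        for attempt in range(max_attempts):
            try:
                return fn()
            except Exception:
                if attempt == max_attempts - 1:
                    raise            # bounded: do not retry forever
                # exponential backoff with a little jitter
                time.sleep(base_delay * 2 ** attempt
                           + random.uniform(0, 0.1))
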
> Or are you just saying that we should pretend that the nondeterministic
> bugs appearing due to the lack of synchronization simply don't exist?
> They cannot be easily reproduced, after all. We could just close our
> eyes, cover our ears, sing "lalalala" and close any bug reports with
> such errors with "could not reproduce on my single-user, single-machine
> development installation". I know that a lot of software companies do
> exactly that, so I guess it's a valid business practice; I just want to
> make sure that this is actually the tactic that we are going to take,
> before committing to an architectural decision that will make those bugs
> impossible to fix.
>
