Open Stack

Tue Mar 18 13:07:38 UTC 2014

On 03/17/2014 01:54 PM, John Garbutt wrote:
> On 15 March 2014 18:39, Chris Friesen <chris.friesen at windriver.com> wrote:
>> Hi,
>>
>> I'm curious why the specified git commit chose to fix the anti-affinity race
>> condition by aborting the boot and triggering a reschedule.
>>
>> It seems to me that it would have been more elegant for the scheduler to do
>> a database transaction that would atomically check that the chosen host was
>> not already part of the group, and then add the instance (with the chosen
>> host) to the group.  If the check fails then the scheduler could update the
>> group_hosts list and reschedule.  This would prevent the race condition in
>> the first place rather than detecting it later and trying to work around it.
>>
>> This would require setting the "host" field in the instance at the time of
>> scheduling rather than the time of instance creation, but that seems like it
>> should work okay.  Maybe I'm missing something though...
> 
> We deal with memory races in the same way as this today, when they
> race against the scheduler.
> 
> Given the scheduler split, writing that value into the nova db from
> the scheduler would be a step backwards, and it probably breaks lots
> of code that assumes the host is not set until much later.

This is exactly the reason I did it this way.  It fits the existing
pattern with how we deal with host scheduling races today.  We do the
final claiming and validation on the compute node itself and kick back
to the scheduler if something doesn't work out.  Alternatives are *way*
too risky to be doing in feature freeze, IMO.
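
As a rough illustration of the pattern described above (a toy sketch, not
the actual nova code): the scheduler picks a host from a possibly stale
view of the group, the final check-and-claim happens late, and a failed
claim kicks the request back for another pick. Every name here
(ServerGroup, RescheduleNeeded, pick_host, boot, stale_view) is
hypothetical, and the in-memory lock merely stands in for whatever makes
the late check atomic.

```python
import threading


class RescheduleNeeded(Exception):
    """Hypothetical signal: the late check failed, pick another host."""


class ServerGroup:
    """Hypothetical in-memory stand-in for an anti-affinity group."""

    def __init__(self):
        self._lock = threading.Lock()
        self._hosts = {}  # instance_id -> host

    def claim(self, instance_id, host):
        # Late validation: check and add under one lock so that two
        # racing boots cannot both claim the same host.
        with self._lock:
            if host in self._hosts.values():
                raise RescheduleNeeded(host)
            self._hosts[instance_id] = host

    def hosts(self):
        with self._lock:
            return set(self._hosts.values())


def pick_host(all_hosts, excluded):
    """Toy scheduler: first host not in the excluded set, else None."""
    for host in all_hosts:
        if host not in excluded:
            return host
    return None


def boot(instance_id, group, all_hosts, stale_view=None, max_attempts=3):
    # stale_view simulates the race: the scheduler's idea of the group's
    # hosts may already be out of date when it makes its choice.
    excluded = set(group.hosts() if stale_view is None else stale_view)
    for _ in range(max_attempts):
        host = pick_host(all_hosts, excluded)
        if host is None:
            break
        try:
            group.claim(instance_id, host)
            return host
        except RescheduleNeeded:
            # Kick back to the "scheduler" with a refreshed host list.
            excluded = group.hosts() | {host}
    raise RuntimeError("no valid host found")
```

With hosts ["h1", "h2"], a first boot lands on "h1"; a second boot that
schedules against an empty (stale) view first tries "h1", fails the late
check, and is rescheduled onto "h2".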

I think it's great to see discussion of better ways to approach these
things, but it would have to be Juno work.

-- 
Russell Bryant

[openstack-dev] [nova] question about e41fb84 "fix anti-affinity race condition on boot"
