Open Stack

Wed Sep 7 15:44:39 UTC 2016

On 09/07/2016 06:46 AM, Chris Dent wrote:
> 
> More updates on resource providers work:
> 
> Yesterday we realized that a SQL join for associating inventories
> with allocations and resource providers was missing a critical AND
> clause. This was leading to allocations failing to be written when
> there should have been plenty of capacity.
> 
> This was fixed in:
> 
>     https://review.openstack.org/#/c/366245/
> 
> It will be merged in a few minutes. There are still some concerns that
> we don't understand why tests (of the prior code) were not failing.

As a follow-up here, I actually got to the bottom of why the old tests
didn't catch this.

There were no tests which had > 1 resource class and > 1 consumer for a
resource provider. And even if there had been, they probably wouldn't
have failed unless the scales of the resource providers were such that
mixing up the free / used values between them would have caused an issue.

To reproduce the key issue you need to have active allocations in the
database that are not owned by your consumer, because one of the first
things that happens when setting allocations for your consumer is that
its existing allocations are deleted.

If nothing is in the allocations table, the left outer join has no usage
rows to add up and join against. Basically the column set:

    cols_in_output = [
        _RP_TBL.c.id.label('resource_provider_id'),
        _RP_TBL.c.uuid,
        _RP_TBL.c.generation,
        _INV_TBL.c.resource_class_id,
        _INV_TBL.c.total,
        _INV_TBL.c.reserved,
        _INV_TBL.c.allocation_ratio,
        usage.c.used,
    ]

ends up with None in the final column, so you'll get rows like:

1,$uuid,1,2,1024,4,16.0,None
1,$uuid,1,9,40,4,1.0,None

However, if there are existing allocations there, the left outer join
blows this out into a matrix and you'd get:

1,$uuid,1,2,1024,4,16.0,16
1,$uuid,1,9,40,4,1.0,16
1,$uuid,1,2,1024,4,16.0,1
1,$uuid,1,9,40,4,1.0,1

where 1 is the usage for resource class 9, and 16 is the usage for
resource class 2. This is because of a missing join condition,
inventory.resource_class_id == allocs.resource_class_id.
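
The fan-out described above can be reproduced in miniature. This is only
a sketch: it uses SQLite and two simplified stand-in tables (the table
layout, the alloc_usage alias and the 'other-consumer' id are assumptions
for illustration, not nova's real schema or its SQLAlchemy query), and it
adds an ORDER BY purely so the output is stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inventories (
        resource_provider_id INTEGER, resource_class_id INTEGER,
        total INTEGER, reserved INTEGER, allocation_ratio REAL);
    CREATE TABLE allocations (
        resource_provider_id INTEGER, resource_class_id INTEGER,
        consumer_id TEXT, used INTEGER);
    INSERT INTO inventories VALUES (1, 2, 1024, 4, 16.0), (1, 9, 40, 4, 1.0);
    -- Allocations owned by some other consumer (hypothetical id).
    INSERT INTO allocations VALUES
        (1, 2, 'other-consumer', 16), (1, 9, 'other-consumer', 1);
""")

# {extra} is where the missing join condition goes.
QUERY = """
    SELECT inv.resource_class_id, inv.total, alloc_usage.used
    FROM inventories AS inv
    LEFT OUTER JOIN (
        SELECT resource_provider_id, resource_class_id, SUM(used) AS used
        FROM allocations
        GROUP BY resource_provider_id, resource_class_id
    ) AS alloc_usage
    ON inv.resource_provider_id = alloc_usage.resource_provider_id {extra}
    ORDER BY inv.resource_class_id, alloc_usage.used
"""
FIX = "AND inv.resource_class_id = alloc_usage.resource_class_id"

# Without the resource class condition every inventory row is paired
# with every usage row for the provider: 2 x 2 = 4 rows.
buggy = conn.execute(QUERY.format(extra="")).fetchall()

# With the condition each inventory row only sees its own usage.
fixed = conn.execute(QUERY.format(extra=FIX)).fetchall()

# With an empty allocations table the buggy query looks fine, which is
# why tests without a second consumer could not fail.
conn.execute("DELETE FROM allocations")
empty = conn.execute(QUERY.format(extra="")).fetchall()

print(buggy)  # [(2, 1024, 1), (2, 1024, 16), (9, 40, 1), (9, 40, 16)]
print(fixed)  # [(2, 1024, 16), (9, 40, 1)]
print(empty)  # [(2, 1024, None), (9, 40, None)]
```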

The fix provides a test that will explode if we regress this.

Because this is only exposed when we've got existing allocations by a
different consumer (i.e. a concurrently running guest), this explains
why it was sporadically showing up in the gate. Only when 3 guests were
stood up at the same time (either in a test, or between) would we get
this issue. Our guests run with 64M of memory, and we run on 8 cpu hosts
with a 16x modifier.

If we compare consumed ram to available cpu (which was the actual failure
happening), the first guest up consumes 64M ram and 1 vcpu. 128 vcpu
(8 cpus x 16) can be consumed, and 128 - 64 >= 0. The second guest gets
us to 128M ram and 2 vcpu. Again, we can actually survive the column
shift. But once we are at >= 3 guests at once we can hit this. There are
no ORDER BY clauses inside the SQL monster
(https://github.com/openstack/nova/blob/25abb68039ca122b4b3796a9f8c9e3495db22772/nova/objects/resource_provider.py#L637),
so the order in which the joined rows come back is not guaranteed, and
sometimes we'll be comparing correctly, sometimes we won't. But until you
get to 3 guests at once, you'll never be able to see it.
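
The arithmetic can be checked with a few lines of Python. This is a rough
sketch, not the real capacity check: the names below are made up for
illustration, and it assumes the mismatched row is the one being compared
(consumed ram in MB against vcpu capacity).

```python
HOST_CPUS = 8               # cpus per gate host
CPU_ALLOCATION_RATIO = 16   # the 16x modifier
GUEST_RAM_MB = 64           # memory per guest

vcpu_capacity = HOST_CPUS * CPU_ALLOCATION_RATIO  # 128


def first_failing_guest(max_guests=10):
    """Return the guest count at which ram usage, wrongly compared
    against vcpu capacity, first goes over; None if it never does."""
    for guests in range(1, max_guests + 1):
        ram_used = guests * GUEST_RAM_MB
        if vcpu_capacity - ram_used < 0:
            return guests
    return None


print(first_failing_guest())  # 3
```

With 1 or 2 guests the mismatched comparison still passes (128 - 64 and
128 - 128 are both >= 0), so the bug stays hidden no matter which row
the join hands back.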

	-Sean

-- 
Sean Dague
http://dague.net

[openstack-dev] [nova] Next steps for resource providers work
