[OpenStack-Infra] Debugging nodepool over-allocation with history tracking (109132)
Ian Wienand
iwienand at redhat.com
Fri Aug 29 06:12:24 UTC 2014
On 08/26/2014 04:04 PM, Ian Wienand wrote:
> I'm having a hard time getting the description in [1] to trigger after
> trying several different approaches.
A huge thank-you to jeblair for getting the logs out and a day of
analysis. In the logs [1] we can see the allocator going badly wrong
and handing out negative allocations.
---
nodepool.NodePool: <AllocationRequest for 248.0 of bare-precise>
nodepool.NodePool: <AllocationSubRequest for -116.776699029 (out of 248.0) of bare-precise from hpcloud-b3>
nodepool.NodePool: <AllocationSubRequest for -109.553398058 (out of 248.0) of bare-precise from hpcloud-b2>
nodepool.NodePool: <AllocationSubRequest for -71.0291262136 (out of 248.0) of bare-precise from rax-iad>
nodepool.NodePool: <AllocationSubRequest for -115.572815534 (out of 248.0) of bare-precise from hpcloud-b1>
nodepool.NodePool: <AllocationSubRequest for -198.640776699 (out of 248.0) of bare-precise from rax-dfw>
nodepool.NodePool: <AllocationSubRequest for 1018.48543689 (out of 248.0) of bare-precise from hpcloud-b5>
nodepool.NodePool: <AllocationSubRequest for -44.5436893204 (out of 248.0) of bare-precise from rax-ord>
nodepool.NodePool: <AllocationSubRequest for -114.368932039 (out of 248.0) of bare-precise from hpcloud-b4>
---
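As a quick sanity check on the excerpt above, here is a minimal sketch
that parses those log lines and adds up the sub-requests. It is not
part of nodepool; the regexes and the summarize() helper are my own
assumptions, based only on the line format shown in this excerpt.

```python
import re

# Log excerpt copied verbatim from the message above.
LOG = """\
nodepool.NodePool: <AllocationRequest for 248.0 of bare-precise>
nodepool.NodePool: <AllocationSubRequest for -116.776699029 (out of 248.0) of bare-precise from hpcloud-b3>
nodepool.NodePool: <AllocationSubRequest for -109.553398058 (out of 248.0) of bare-precise from hpcloud-b2>
nodepool.NodePool: <AllocationSubRequest for -71.0291262136 (out of 248.0) of bare-precise from rax-iad>
nodepool.NodePool: <AllocationSubRequest for -115.572815534 (out of 248.0) of bare-precise from hpcloud-b1>
nodepool.NodePool: <AllocationSubRequest for -198.640776699 (out of 248.0) of bare-precise from rax-dfw>
nodepool.NodePool: <AllocationSubRequest for 1018.48543689 (out of 248.0) of bare-precise from hpcloud-b5>
nodepool.NodePool: <AllocationSubRequest for -44.5436893204 (out of 248.0) of bare-precise from rax-ord>
nodepool.NodePool: <AllocationSubRequest for -114.368932039 (out of 248.0) of bare-precise from hpcloud-b4>
"""

# Hypothetical patterns, inferred from the excerpt only.
REQUEST_RE = re.compile(r"<AllocationRequest for (-?[\d.]+) of (\S+)>")
SUB_RE = re.compile(
    r"<AllocationSubRequest for (-?[\d.]+) \(out of (-?[\d.]+)\) "
    r"of (\S+) from (\S+)>"
)


def summarize(log_text):
    """Return (requested amount, sum of sub-requests, providers with negative amounts)."""
    requested = None
    sub_requests = {}  # provider name -> allocated amount
    for line in log_text.splitlines():
        match = REQUEST_RE.search(line)
        if match:
            requested = float(match.group(1))
            continue
        match = SUB_RE.search(line)
        if match:
            sub_requests[match.group(4)] = float(match.group(1))
    negatives = sorted(p for p, amount in sub_requests.items() if amount < 0)
    return requested, sum(sub_requests.values()), negatives


requested, total, negatives = summarize(LOG)
print("requested:", requested)
print("sum of sub-requests:", round(total, 6))
print("providers with negative allocations:", negatives)
```

On this excerpt the sub-requests still add up to roughly the 248.0
that was requested, but seven of the eight providers are given a
negative amount and hpcloud-b5 is asked for over a thousand nodes. So
the total is preserved while the split across providers is nonsense.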
In the comment of [2] I traced through, using a simple example, what
happens in the allocator when it ends up with negative values. The
same trace explains what happens in the included test-case.
I feel it explains why the history-tracking got itself into this
situation: because it promotes small allocations, it was vastly
over-allocating. It would just keep getting worse as more and more
nodes started to fail.
The extant allocator doesn't notice this, especially in the current
busy environment, because it would generally result in already
over-capacity requests being larger. It could still over-allocate
(the test-case shows this) but because smaller allocations don't get
any preferential treatment, it doesn't become a run-away problem.
Thus I do not think the existing change [3] really needs updating,
other than being rebased on the fix (which it is).
-i
[1] http://nodepool.openstack.org/allocator-failure.log
[2] https://review.openstack.org/#/c/109185/
[3] https://review.openstack.org/#/c/109890/