On Wed, 2019-11-20 at 16:04 -0800, melanie witt wrote:
On 11/20/19 15:16, Albert Braden wrote:
The expected result (that I was seeing last week) is that, if my cluster has capacity for 4 VMs and I use --max 5, 4 will go active and 1 will go to error. This week all 5 are going to error. I can still build 4 VMs of that flavor, one at a time, or use --max 4, but if I use --max 5, then all 5 will fail. If I use smaller VMs, the --max numbers get bigger but I still see the same symptom.
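For reference, the command is along the lines of the following (the image/flavor names here are just placeholders):

  openstack server create --image bionic --flavor m1.medium --min 1 --max 5 test-vm

i.e. asking nova to build up to 5 instances of that flavor in one request.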
The behavior you're describing is an old issue described here:
https://bugs.launchpad.net/nova/+bug/1458122
I don't understand how it's possible that you saw the "4 active, 1 in error" behavior last week. The behavior has been "error all" since 2015, at least. Unless there's some kind of race condition bug happening, maybe. Did you consistently see it fulfill less than --max last week, or was it just once?

For what it's worth, I have definitely seen the behavior where you use --max and only some go active while some go to error. I can't recall if it was post-Rocky, but I agree with the bug that they should only all go to error if the --min value was not met; --max should not error out.
Changing the behavior would be an API change, so it would need a spec and new microversion, I think. It's been an undesirable behavior for a long time but it seemingly hasn't been enough of a pain point for someone to sign up and do the work.

Well, what would the API change be? I previously thought that the behavior was that some would go active and some would not; if that is not the current behavior, then it was changed without a spec and that is a regression.
I think the behavior might change if the --max value exceeds the batch size. We group the requests in sets of 10 by default? If all the VMs in a batch go active and later VMs in a different set fail, the first VMs will remain active. I can't remember which config option controls that, but there is one; it's max_concurrent_builds or something like that.
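If I remember right, it's something like this in nova.conf on the compute nodes (10 is the default; the value here is just for illustration):

  [DEFAULT]
  max_concurrent_builds = 10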
-melanie
The --max thing is pretty useful and we use it a lot; it allows us to use up the cluster without knowing exactly how much space we have.
-----Original Message-----
From: Matt Riedemann <mriedemos@gmail.com>
Sent: Wednesday, November 20, 2019 2:00 PM
To: openstack-discuss@lists.openstack.org
Subject: Re: All VMs fail when --max exceeds available resources
On 11/20/2019 3:21 PM, Albert Braden wrote:
I think the document is saying that we need to set them in nova.conf on each HV. I tried that and it seems to fix the allocation failure:
root@us01odc-dev1-ctrl1:~# os resource provider inventory list f20fa03d-18f4-486b-9b40-ceaaf52dabf8
+----------------+------------------+----------+----------+-----------+----------+--------+
| resource_class | allocation_ratio | max_unit | reserved | step_size | min_unit | total  |
+----------------+------------------+----------+----------+-----------+----------+--------+
| VCPU           | 1.0              | 16       | 2        | 1         | 1        | 16     |
| MEMORY_MB      | 1.0              | 128888   | 8192     | 1         | 1        | 128888 |
| DISK_GB        | 1.0              | 1208     | 246      | 1         | 1        | 1208   |
+----------------+------------------+----------+----------+-----------+----------+--------+
Yup, the config on the controller doesn't apply to the computes or placement because the computes are what report the inventory to placement, so you have to configure the allocation ratios there, or, starting in Stein, via (resource provider) aggregate.
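E.g., something along these lines in nova.conf on each compute node (the ratio values here are just examples, tune them for your environment), then restart nova-compute so it reports the updated inventory:

  [DEFAULT]
  cpu_allocation_ratio = 2.0
  ram_allocation_ratio = 1.0
  disk_allocation_ratio = 1.0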
This fixed the "allocation ratio" issue but I still see the --max issue. What could be causing that?
That's something else, yeah? I didn't quite dig into that email, and the allocation ratio thing popped out at me since it's been a long-standing, known painful issue/behavior change since Ocata.
One question though: I read your original email as essentially "(1) I did x and got some failures, then (2) I changed something and now everything fails". Are you running from a clean environment in both test scenarios? Because if you have VMs on the computes when you're doing (2), then that's going to change the scheduling results in (2), i.e. the computes will have less capacity since there are resources allocated on them in placement.
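You can sanity check that with something like this (using the compute node's resource provider UUID from your earlier output):

  openstack resource provider usage show f20fa03d-18f4-486b-9b40-ceaaf52dabf8

which shows how much VCPU/MEMORY_MB/DISK_GB is already allocated against that provider in placement.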