All VMs fail when --max exceeds available resources

Sean Mooney smooney at redhat.com
Thu Nov 21 12:04:24 UTC 2019


On Wed, 2019-11-20 at 16:04 -0800, melanie witt wrote:
> On 11/20/19 15:16, Albert Braden wrote:
> > The expected result (that I was seeing last week) is that, if my cluster has capacity for 4 VMs and I use --max 5, 4
> > will go active and 1 will go to error. This week all 5 are going to error. I can still build 4 VMs of that flavor,
> > one at a time, or use --max 4, but if I use --max 5, then all 5 will fail. If I use smaller VMs, the --max numbers
> > get bigger but I still see the same symptom.
> 
> The behavior you're describing is an old issue described here:
> 
> https://bugs.launchpad.net/nova/+bug/1458122
> 
> I don't understand how it's possible that you saw the 4 active 1 in 
> error behavior last week. The behavior has been "error all" since 2015, 
> at least. Unless there's some kind of race condition bug happening, 
> maybe. Did you consistently see it fulfill less than --max last week or 
> was it just once?
For what it's worth, I have definitely seen the behavior where you use --max and only some
go active while some go to error. I can't recall if that was post-Rocky, but I agree with the
bug that they should all go to error only if the --min value was not met;
falling short of --max should not error out the whole request.
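For example (hypothetical flavor, image and network names, and assuming the remaining
capacity only fits four such instances), I would expect something like:

    # flavor/image/network names below are made up for illustration
    openstack server create --flavor m1.large --image cirros \
        --network private --min 1 --max 5 test-vm

to leave four instances ACTIVE and only the fifth in ERROR, since --min 1 was satisfied;
the whole batch should only fail if fewer than --min instances could be scheduled.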
> 
> Changing the behavior would be an API change so it would need a spec and 
> new microversion, I think. It's been an undesirable behavior for a long 
> time but it seemingly hasn't been enough of a pain point for someone to 
> sign up and do the work.
Well, what would the API change be? I previously thought the behavior was that some would go active
and some would not; if that is not the current behavior, then it was changed without a spec and that is a regression.

I think the behavior might change when the --max value exceeds the batch size. We group the requests
in sets of 10 by default; if all the VMs in one batch go active and later VMs in a different batch fail,
the first VMs will remain active. I can't remember which config option controls that, but there is one; it's max_concurrent_builds or something like that.
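If that is the right option, it lives in the [DEFAULT] section of nova.conf on the compute
nodes; the value below is just the documented default, shown for illustration:

    [DEFAULT]
    # maximum number of instance builds this nova-compute runs in parallel
    # (10 is the documented default)
    max_concurrent_builds = 10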
> 
> -melanie
> 
> > The --max thing is pretty useful and we use it a lot; it allows us to use up the cluster without knowing exactly how
> > much space we have.
> > 
> > -----Original Message-----
> > From: Matt Riedemann <mriedemos at gmail.com>
> > Sent: Wednesday, November 20, 2019 2:00 PM
> > To: openstack-discuss at lists.openstack.org
> > Subject: Re: All VMs fail when --max exceeds available resources
> > 
> > On 11/20/2019 3:21 PM, Albert Braden wrote:
> > > I think the document is saying that we need to set them in nova.conf on each HV. I tried that and it seems to fix
> > > the allocation failure:
> > > 
> > > root@us01odc-dev1-ctrl1:~# os resource provider inventory list f20fa03d-18f4-486b-9b40-ceaaf52dabf8
> > > +----------------+------------------+----------+----------+-----------+----------+--------+
> > > | resource_class | allocation_ratio | max_unit | reserved | step_size | min_unit |  total |
> > > +----------------+------------------+----------+----------+-----------+----------+--------+
> > > | VCPU           |              1.0 |       16 |        2 |         1 |        1 |     16 |
> > > | MEMORY_MB      |              1.0 |   128888 |     8192 |         1 |        1 | 128888 |
> > > | DISK_GB        |              1.0 |     1208 |      246 |         1 |        1 |   1208 |
> > > +----------------+------------------+----------+----------+-----------+----------+--------+
> > 
> > Yup, the config on the controller doesn't apply to the computes or
> > placement, because the computes are what report the inventory to
> > placement, so you have to configure the allocation ratios there, or,
> > starting in Stein, via (resource provider) aggregate.
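In other words, something like the following in nova.conf on each compute node (the values
here are only illustrative, chosen to match the inventory Albert posted above):

    [DEFAULT]
    # ratios this nova-compute reports to placement
    cpu_allocation_ratio = 1.0
    ram_allocation_ratio = 1.0
    disk_allocation_ratio = 1.0
    # capacity held back from the inventory (2 VCPUs, 8 GB RAM, ~246 GB disk)
    reserved_host_cpus = 2
    reserved_host_memory_mb = 8192
    reserved_host_disk_mb = 251904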
> > 
> > > 
> > > This fixed the "allocation ratio" issue but I still see the --max issue. What could be causing that?
> > 
> > That's something else, yeah? I didn't quite dig into that email, and the
> > allocation ratio thing jumped out at me since it's been a long-standing,
> > known painful issue/behavior change since Ocata.
> > 
> > One question though: I read your original email as essentially "(1) I
> > did x and got some failures, then (2) I changed something and now
> > everything fails", but are you running from a clean environment in both
> > test scenarios? If you have VMs on the computes when you're doing (2),
> > then that's going to change the scheduling results in (2), i.e. the
> > computes will have less capacity since there are resources allocated on
> > them in placement.
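A quick way to check that before the second run, assuming the osc-placement plugin is
installed, is to look at what is already allocated against the compute's resource provider,
e.g.:

    # UUID taken from the inventory listing above
    openstack resource provider usage show f20fa03d-18f4-486b-9b40-ceaaf52dabf8

That shows the VCPU, MEMORY_MB and DISK_GB currently consumed on that host, which you can
compare against the totals and reserved values in the inventory.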
> > 
> 
> 



