Re: All VMs fail when --max exceeds available resources
The other symptom is that the scheduler will send single VMs to a full hypervisor and overload it, even though we have cpu_allocation_ratio and ram_allocation_ratio set to 1:

root@us01odc-dev1-ctrl1:~# os hypervisor list --long
+----+------------------------------------------+-----------------+---------------+-------+------------+-------+----------------+-----------+
| ID | Hypervisor Hostname                       | Hypervisor Type | Host IP       | State | vCPUs Used | vCPUs | Memory MB Used | Memory MB |
+----+------------------------------------------+-----------------+---------------+-------+------------+-------+----------------+-----------+
| 1  | us01odc-dev1-hv003.internal.synopsys.com  | QEMU            | 10.195.116.16 | up    | 42         | 16    | 161792         | 128888    |
| 3  | us01odc-dev1-hv002.internal.synopsys.com  | QEMU            | 10.195.116.15 | up    | 43         | 16    | 165888         | 128888    |
| 4  | us01odc-dev1-hv001.internal.synopsys.com  | QEMU            | 10.195.116.14 | up    | 38         | 16    | 161792         | 128888    |
+----+------------------------------------------+-----------------+---------------+-------+------------+-------+----------------+-----------+

In the logs I see the scheduler returning 1 host:

/var/log/nova/nova-scheduler.log:2019-11-19 16:38:20.930 895454 DEBUG nova.filters [req-0703f1f8-a52a-4fb6-a402-226dd25e9988 2cb6757679d54a69803a5b6e317b3a93 474ae347d8ad426f8118e55eee47dcfd - default 7d3a4deab35b434bba403100a6729c81] Filter NUMATopologyFilter returned 1 host(s) get_filtered_objects /usr/lib/python2.7/dist-packages/nova/filters.py:104

It weighs the host and reports negative RAM:

/var/log/nova/nova-scheduler.log:2019-11-19 16:38:20.930 895454 DEBUG nova.scheduler.filter_scheduler [req-0703f1f8-a52a-4fb6-a402-226dd25e9988 2cb6757679d54a69803a5b6e317b3a93 474ae347d8ad426f8118e55eee47dcfd - default 7d3a4deab35b434bba403100a6729c81] Weighed [WeighedHost [host: (us01odc-dev1-hv002, us01odc-dev1-hv002.internal.synopsys.com) ram: -6280MB disk: 683008MB io_ops: 1 instances: 5, weight: 0.0]] _get_sorted_hosts /usr/lib/python2.7/dist-packages/nova/scheduler/filter_scheduler.py:454

Then it selects that host:

/var/log/nova/nova-scheduler.log:2019-11-19 16:38:20.931 895454 DEBUG nova.scheduler.utils [req-0703f1f8-a52a-4fb6-a402-226dd25e9988 2cb6757679d54a69803a5b6e317b3a93 474ae347d8ad426f8118e55eee47dcfd - default 7d3a4deab35b434bba403100a6729c81] Attempting to claim resources in the placement API for instance 834dd112-26f8-424c-a92b-23423baa185a claim_resources /usr/lib/python2.7/dist-packages/nova/scheduler/utils.py:935

/var/log/nova/nova-scheduler.log:2019-11-19 16:38:21.567 895454 DEBUG nova.scheduler.filter_scheduler [req-0703f1f8-a52a-4fb6-a402-226dd25e9988 2cb6757679d54a69803a5b6e317b3a93 474ae347d8ad426f8118e55eee47dcfd - default 7d3a4deab35b434bba403100a6729c81] [instance: 834dd112-26f8-424c-a92b-23423baa185a] Selected host: (us01odc-dev1-hv002, us01odc-dev1-hv002.internal.synopsys.com) ram: -6280MB disk: 683008MB io_ops: 1 instances: 5 _consume_selected_host /usr/lib/python2.7/dist-packages/nova/scheduler/filter_scheduler.py:346

The VM builds successfully and goes to ACTIVE.

What should I be looking for here? Obviously I broke the scheduler, but my nova config is the same as the working cluster.
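For reference, the log lines above can be pulled out by grepping the scheduler log for the request ID (debug logging is enabled on this cluster):

root@us01odc-dev1-ctrl1:~# grep req-0703f1f8-a52a-4fb6-a402-226dd25e9988 /var/log/nova/nova-scheduler.log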
On 11/20/2019 1:02 PM, Albert Braden wrote:
The other symptom is that the scheduler will send single VMs to a full hypervisor and overload it even though we have cpu_allocation_ratio and ram_allocation_ratio set to 1:
You're on Rocky, correct? If allocation ratios are acting funky, you should read through this:

https://docs.openstack.org/nova/rocky/admin/configuration/schedulers.html#bug-1804125

There were some changes in Stein to help with configuring nova to deal with allocation ratios per compute or via aggregate:

https://docs.openstack.org/nova/latest/admin/configuration/schedulers.html#allocation-ratios

But what you'll likely need to do is manage the allocation ratios in aggregate on the resource providers in placement. Fortunately there is a CLI for doing that:

https://docs.openstack.org/osc-placement/latest/cli/index.html#resource-provider-inventory-set

e.g.

openstack resource provider inventory set <aggregate_uuid> --resource VCPU:allocation_ratio=1.0 --aggregate --amend

Anyway, see if that documented bug with allocation ratios is your issue first and then go through the workarounds.

--
Thanks,
Matt
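For a single compute node's provider, the same CLI can inspect and amend the ratios directly; a minimal sketch (the provider UUID is a placeholder):

# find the resource provider UUID for the compute node
openstack resource provider list
# show its current inventory, including allocation_ratio per resource class
openstack resource provider inventory list <provider_uuid>
# amend only the ratios, leaving the rest of the inventory untouched
openstack resource provider inventory set <provider_uuid> \
  --resource VCPU:allocation_ratio=1.0 \
  --resource MEMORY_MB:allocation_ratio=1.0 \
  --amend

Note that on Rocky the compute can overwrite these values on its next periodic inventory update if its own config disagrees, which is part of what the linked bug discussion describes.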
Yes, we are on Rocky. If I'm reading correctly, the document says that setting allocation ratios by aggregate may not work after Ocata, but we are setting them in nova.conf on the controller, and that setting does appear to have failed. The settings are 1:

root@us01odc-dev1-ctrl1:~# grep allocation_ /etc/nova/nova.conf
cpu_allocation_ratio = 1
ram_allocation_ratio = 1.0

But the inventory shows different values:

root@us01odc-dev1-ctrl1:~# os resource provider inventory list f20fa03d-18f4-486b-9b40-ceaaf52dabf8
+----------------+------------------+----------+----------+-----------+----------+--------+
| resource_class | allocation_ratio | max_unit | reserved | step_size | min_unit | total  |
+----------------+------------------+----------+----------+-----------+----------+--------+
| VCPU           | 16.0             | 16       | 2        | 1         | 1        | 16     |
| MEMORY_MB      | 1.5              | 128888   | 8192     | 1         | 1        | 128888 |
| DISK_GB        | 1.0              | 1208     | 246      | 1         | 1        | 1208   |
+----------------+------------------+----------+----------+-----------+----------+--------+

I think the document is saying that we need to set them in nova.conf on each HV. I tried that and it seems to fix the allocation failure:

root@us01odc-dev1-ctrl1:~# os resource provider inventory list f20fa03d-18f4-486b-9b40-ceaaf52dabf8
+----------------+------------------+----------+----------+-----------+----------+--------+
| resource_class | allocation_ratio | max_unit | reserved | step_size | min_unit | total  |
+----------------+------------------+----------+----------+-----------+----------+--------+
| VCPU           | 1.0              | 16       | 2        | 1         | 1        | 16     |
| MEMORY_MB      | 1.0              | 128888   | 8192     | 1         | 1        | 128888 |
| DISK_GB        | 1.0              | 1208     | 246      | 1         | 1        | 1208   |
+----------------+------------------+----------+----------+-----------+----------+--------+

This fixed the "allocation ratio" issue but I still see the --max issue. What could be causing that?
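For reference, the per-hypervisor change was along these lines; a sketch, assuming the options live in [DEFAULT] as they do on the controller and that the compute service is managed by systemd:

# /etc/nova/nova.conf on each hypervisor
[DEFAULT]
cpu_allocation_ratio = 1.0
ram_allocation_ratio = 1.0

# restart so the compute re-reports its inventory to placement
systemctl restart nova-compute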
On 11/20/2019 3:21 PM, Albert Braden wrote:
I think the document is saying that we need to set them in nova.conf on each HV. I tried that and it seems to fix the allocation failure:
root@us01odc-dev1-ctrl1:~# os resource provider inventory list f20fa03d-18f4-486b-9b40-ceaaf52dabf8
+----------------+------------------+----------+----------+-----------+----------+--------+
| resource_class | allocation_ratio | max_unit | reserved | step_size | min_unit | total  |
+----------------+------------------+----------+----------+-----------+----------+--------+
| VCPU           | 1.0              | 16       | 2        | 1         | 1        | 16     |
| MEMORY_MB      | 1.0              | 128888   | 8192     | 1         | 1        | 128888 |
| DISK_GB        | 1.0              | 1208     | 246      | 1         | 1        | 1208   |
+----------------+------------------+----------+----------+-----------+----------+--------+
Yup, the config on the controller doesn't apply to the computes or placement, because the computes are what report the inventory to placement. So you have to configure the allocation ratios there or, starting in Stein, via (resource provider) aggregate.
This fixed the "allocation ratio" issue but I still see the --max issue. What could be causing that?
That's something else, yeah? I didn't quite dig into that email; the allocation ratio thing popped out at me since it's been a long-standing, painfully well-known issue/behavior change since Ocata.

One question though. I read your original email as essentially "(1) I did x and got some failures, then (2) I changed something and now everything fails". Are you running from a clean environment in both test scenarios? If you have VMs on the computes when you're doing (2), that's going to change the scheduling results in (2), i.e. the computes will have less capacity since there are resources allocated on them in placement.

--
Thanks,
Matt
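One way to check what placement already thinks is consumed on a compute is the osc-placement usage command; a sketch reusing the provider UUID from earlier in the thread:

# resources currently allocated against this provider across all consumers
openstack resource provider usage show f20fa03d-18f4-486b-9b40-ceaaf52dabf8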
The expected result (that I was seeing last week) is that, if my cluster has capacity for 4 VMs and I use --max 5, 4 will go active and 1 will go to error. This week all 5 are going to error. I can still build 4 VMs of that flavor, one at a time, or use --max 4, but if I use --max 5, then all 5 will fail. If I use smaller VMs, the --max numbers get bigger but I still see the same symptom.

The --max thing is pretty useful and we use it a lot; it allows us to use up the cluster without knowing exactly how much space we have.
On 11/20/2019 5:16 PM, Albert Braden wrote:
The expected result (that I was seeing last week) is that, if my cluster has capacity for 4 VMs and I use --max 5, 4 will go active and 1 will go to error. This week all 5 are going to error. I can still build 4 VMs of that flavor, one at a time, or use --max 4, but if I use --max 5, then all 5 will fail. If I use smaller VMs, the --max numbers get bigger but I still see the same symptom.
The --max thing is pretty useful and we use it a lot; it allows us to use up the cluster without knowing exactly how much space we have.
OK, so I think you're hitting this with the NoValidHost error:

https://github.com/openstack/nova/blob/18.0.0/nova/conductor/manager.py#L120...

And that's putting all of the instances into ERROR status, even though 4 out of the 5 did successfully allocate resources in the scheduler. The scheduler would have rolled back the allocations here if it couldn't fit everything:

https://github.com/openstack/nova/blob/18.0.0/nova/scheduler/filter_schedule...

Which release did you say the --max 5 scenario worked in, where 4 would be successfully built and the remaining one would go to ERROR status? I'm just trying to figure out where/when the regression in behavior occurred.

--
Thanks,
Matt
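For anyone trying to reproduce this, the request in question is a multi-create; a sketch with placeholder image/flavor/network names:

# ask for at least 1 and at most 5 servers in a single request
openstack server create --image <image> --flavor <flavor> \
  --network <network> --min 1 --max 5 max-test

# per the behavior above, a NoValidHost for the batch puts all 5 into ERROR
openstack server list --name max-test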
On 11/20/2019 5:51 PM, Matt Riedemann wrote:
Which release did you say that the --max 5 scenario worked where 4 would be successfully built but the remaining one would go to ERROR status? I'm just trying to figure out where/when the regression in behavior occurred.
Reading back on your original email, I guess it's the same release (Rocky). I can't really understand how you got the first scenario, where you used --max 5 and 4 were built but one failed and was put into ERROR status, especially if the environment and server create request are the same. Given the links in my previous email, I would expect them all to go to ERROR status when the scheduler raises NoValidHost. And yeah, that's likely a regression, but the inconsistent behavior is what is weird to me.

--
Thanks,
Matt
On 11/20/19 15:16, Albert Braden wrote:
The expected result (that I was seeing last week) is that, if my cluster has capacity for 4 VMs and I use --max 5, 4 will go active and 1 will go to error. This week all 5 are going to error. I can still build 4 VMs of that flavor, one at a time, or use --max 4, but if I use --max 5, then all 5 will fail. If I use smaller VMs, the --max numbers get bigger but I still see the same symptom.
The behavior you're describing is an old issue, described here:

https://bugs.launchpad.net/nova/+bug/1458122

I don't understand how it's possible that you saw the 4-active, 1-in-error behavior last week. The behavior has been "error all" since 2015, at least. Unless there's some kind of race condition bug happening, maybe. Did you consistently see it fulfill less than --max last week, or was it just once?

Changing the behavior would be an API change, so it would need a spec and a new microversion, I think. It's been an undesirable behavior for a long time, but it seemingly hasn't been enough of a pain point for someone to sign up and do the work.

-melanie
On Wed, 2019-11-20 at 16:04 -0800, melanie witt wrote:
On 11/20/19 15:16, Albert Braden wrote:
The expected result (that I was seeing last week) is that, if my cluster has capacity for 4 VMs and I use --max 5, 4 will go active and 1 will go to error. This week all 5 are going to error. I can still build 4 VMs of that flavor, one at a time, or use --max 4, but if I use --max 5, then all 5 will fail. If I use smaller VMs, the --max numbers get bigger but I still see the same symptom.
The behavior you're describing is an old issue described here:
https://bugs.launchpad.net/nova/+bug/1458122
I don't understand how it's possible that you saw the 4-active, 1-in-error behavior last week. The behavior has been "error all" since 2015, at least. Unless there's some kind of race condition bug happening, maybe. Did you consistently see it fulfill less than --max last week, or was it just once?

For what it's worth, I have definitely seen the behavior where you use max and only some go active and some go to error. I can't recall if it was post-Rocky, but I agree with the bug in that they should only all go to error if the min value was not met; not reaching max should not error out.
Changing the behavior would be an API change, so it would need a spec and a new microversion, I think. It's been an undesirable behavior for a long time, but it seemingly hasn't been enough of a pain point for someone to sign up and do the work.

Well, what would the API change be? I previously thought that the behavior was that some would go active and some would not. If that is not the current behavior, it was changed without a spec and that is a regression.
I think the behavior might change if the max value exceeds the batch size. We group the requests in sets of 10 by default? If all the VMs in a batch go active and later VMs in a different set fail, the first VMs will remain active. I can't remember which config option controls that, but there is one; it's max concurrent builds or something like that.
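If it is the option I am thinking of, it would be max_concurrent_builds; a sketch of what setting it would look like, assuming that name is right and that it lives in [DEFAULT]:

# /etc/nova/nova.conf on a compute node
[DEFAULT]
# cap on how many instance builds this nova-compute runs in parallel (default 10)
max_concurrent_builds = 10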
On 11/21/2019 6:04 AM, Sean Mooney wrote:
I think the behavior might change if the max value exceeds the batch size. We group the requests in sets of 10 by default? If all the VMs in a batch go active and later VMs in a different set fail, the first VMs will remain active. I can't remember which config option controls that, but there is one; it's max concurrent builds or something like that.
That batch size option is per-compute. For what Albert was hitting, it failed with NoValidHost in the scheduler, so the compute isn't involved.

What you're describing is likely legacy behavior where the scheduler said, "yup, sure, putting 20 instances on a few computes is probably OK", and then they raced to do the RT claim on the compute, failed late, and went to ERROR while some went ACTIVE. That window was closed for vcpu/ram/disk claims in Pike, when the scheduler started using placement to create atomic resource allocation claims.

So if someone can reproduce this issue with --max where some go active while some go to error in the same request post-Pike, I'd be surprised. Doing that in *concurrent* requests I could understand, since the scheduler could be a bit split-brain there, but placement still would not be.

--
Thanks,
Matt
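To see the atomic claim that placement holds for a given instance, osc-placement can show allocations by consumer UUID; a sketch reusing the instance UUID from Albert's scheduler logs:

# allocations are keyed by consumer; for servers the consumer is the instance UUID
openstack resource provider allocation show 834dd112-26f8-424c-a92b-23423baa185a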
On 11/21/19 04:04, Sean Mooney wrote:
On Wed, 2019-11-20 at 16:04 -0800, melanie witt wrote:
Changing the behavior would be an API change, so it would need a spec and a new microversion, I think. It's been an undesirable behavior for a long time, but it seemingly hasn't been enough of a pain point for someone to sign up and do the work.

Well, what would the API change be? I previously thought that the behavior was that some would go active and some would not. If that is not the current behavior, it was changed without a spec and that is a regression.
If it's a regression, sure. But the bug [1] was opened on 2015-05-22, which was Liberty, and I'm not aware that the behavior has ever been different prior to Liberty (save for the parallel requests/race condition case). I don't think it's a regression.

That said, if everyone else is cool with changing it without a spec, that's fine with me. Either way, someone would have to spend the time and do the work.

-melanie

[1] https://bugs.launchpad.net/nova/+bug/1458122
My co-worker and I both thought that we had seen the 4/5-active behavior last week, but now we can't duplicate it, so maybe we were confused. I think that is a standard condition among OpenStack operators!
participants (4):
- Albert Braden
- Matt Riedemann
- melanie witt
- Sean Mooney