[nova][placement] Openstack only building one VM per machine in cluster, then runs out of resources

Laurent Dumont laurentfdumont at gmail.com
Wed Jun 30 21:52:54 UTC 2021


I unfortunately can't add much as I don't have an Ussuri cloud to test
with. That said, I would be curious about the debug-level output from both
the controller scheduler/placement and from one compute node where a claim
could have happened.

The allocation ratio for CPU is 16 by default. That said, you could also
leverage pinned CPUs to prevent any overcommit. But that's not as "simple"
as telling OpenStack not to oversubscribe on CPU cores.
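To illustrate the two options, here is a rough sketch, assuming a stock
deployment (the flavor name m1.pinned below is made up for the example):

```shell
# Option 1: in nova.conf on each compute node, disable CPU
# oversubscription entirely:
#   [DEFAULT]
#   cpu_allocation_ratio = 1.0

# Option 2: pin CPUs per flavor instead; hw:cpu_policy=dedicated is a
# standard flavor extra spec that gives each guest vCPU a dedicated
# host core, so those cores can never be oversubscribed:
openstack flavor set m1.pinned --property hw:cpu_policy=dedicated
```

Note that pinned and non-pinned guests should not be mixed on the same
hosts, which is part of why it's not as "simple" as it sounds.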

On Wed, Jun 30, 2021 at 5:06 PM Jeffrey Mazzone <jmazzone at uchicago.edu>
wrote:

> Yes, this is almost exactly what I did. No, I am not running mysql in an HA
> deployment, and I have run nova-manage api_db sync several times throughout
> the process below.
>
> I think I found a workaround, but I'm not sure how feasible it is.
>
> I first changed the allocation ratio to 1:1 in the nova.conf on the
> controller. Nova would not accept this for some reason; it seemed like it
> needed to be changed on the compute node. So I deleted the hypervisor,
> resource provider, and compute service, changed the ratios on the compute
> node itself, and then re-added it. Now the capacity changed to 64, which
> is the number of cores on the system. When starting a VM, it still gets
> the same number for "used" in the placement-api.log. See below:
>
> New ratios
>
> ~# openstack resource provider inventory list 554f2a3b-924e-440c-9847-596064ea0f3f
> +----------------+------------------+----------+----------+----------+-----------+--------+
> | resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size |  total |
> +----------------+------------------+----------+----------+----------+-----------+--------+
> | VCPU           |              1.0 |        1 |       64 |        0 |         1 |     64 |
> | MEMORY_MB      |              1.0 |        1 |   515655 |      512 |         1 | 515655 |
> | DISK_GB        |              1.0 |        1 |     7096 |        0 |         1 |   7096 |
> +----------------+------------------+----------+----------+----------+-----------+--------+
>
>
> Error from placement.log
>
> 2021-06-30 13:49:24.877 4381 WARNING placement.objects.allocation [req-7dc8930f-1eac-401a-ade7-af36e64c2ba8 a770bde56c9d49e68facb792cf69088c 6da06417e0004cbb87c1e64fe1978de5 - default default] Over capacity for VCPU on resource provider c4199e84-8259-4d0e-9361-9b0d9e6e66b7. Needed: 4, Used: 8206, Capacity: 64.0
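One thing worth trying against that provider: placement's view of who
actually holds those allocations can be inspected directly. A sketch of the
diagnostics I would run (provider UUID taken from the log line above;
requires the osc-placement CLI plugin):

```shell
# Per-resource-class usage as placement sees it for this provider:
openstack resource provider usage show c4199e84-8259-4d0e-9361-9b0d9e6e66b7

# Every consumer (server UUID) holding an allocation against the provider.
# Stale allocations left behind by deleted instances would show up here:
openstack resource provider show c4199e84-8259-4d0e-9361-9b0d9e6e66b7 \
    --allocations

# On Ussuri, nova-manage can also look for orphaned or incorrect
# allocations and (optionally) repair them:
nova-manage placement audit --verbose
nova-manage placement heal_allocations
```

If "used" is 8206 against a 64-core capacity, the allocation list should
show exactly which consumers are contributing to that sum.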
>
>
> With that in mind, I did the same procedure again but set the ratio to 1024.
>
> New ratios
>
> ~# openstack resource provider inventory list 519c1e10-3546-4e3b-a017-3e831376cde8
> +----------------+------------------+----------+----------+----------+-----------+--------+
> | resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size |  total |
> +----------------+------------------+----------+----------+----------+-----------+--------+
> | VCPU           |           1024.0 |        1 |       64 |        0 |         1 |     64 |
> | MEMORY_MB      |              1.0 |        1 |   515655 |      512 |         1 | 515655 |
> | DISK_GB        |              1.0 |        1 |     7096 |        0 |         1 |   7096 |
> +----------------+------------------+----------+----------+----------+-----------+--------+
>
>
>
> Now I can spin up VMs without issues.
>
> I have one test AZ with 2 hosts inside, both set to the ratio above. I was
> able to spin up approximately 45 four-core VMs without issues and with no
> signs of hitting an upper limit on the hosts.
>
> 120 | VCPU=64    | 519c1e10-3546-4e3b-a017-3e831376cde8 | VCPU=88/65536
> 23 | VCPU=64    | 8f97a3ba-98a0-475e-a3cf-41425569b2cb | VCPU=96/65536
>
>
>
> I have 2 problems with this fix.
>
> 1) The overcommit is now extremely high, and I have no way, besides quotas,
> to guarantee the system won't be over-provisioned.
> 2) I still don't know how that "used" resource value is being calculated.
> When this issue first started, the "used" value was a different number.
> Over the past two days, the used value for a 4-core virtual machine has
> remained at 8206, but I have no way to guarantee it will stay there.
>
> My initial test when this started was to compare the resource values when
> building different-size VMs. Here is that list:
>
> 1 core - 4107
> 2 core - 4108
> 4 core- 4110
> 8 core - 4114
> 16 core - 4122
> 32 core - 8234
>
>
> The number on the right is what the "used" value used to be. Yesterday and
> today, it has changed to 8206 for a 4-core VM; I have not tested the rest.
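For what it's worth, there is a pattern in those numbers that may be a
clue: "used" minus the requested core count is a constant. A quick check
of the arithmetic, just restating the values above:

```shell
# For 1-16 cores the offset is a constant 4106; at 32 cores it jumps to
# 8202, which is the same offset seen in the later 4-core value
# (8206 - 4 = 8202). That looks less like a per-VM calculation and more
# like a fixed quantity -- e.g. stale allocations -- being added to every
# claim, which doubled at some point.
for pair in 1:4107 2:4108 4:4110 8:4114 16:4122 32:8234; do
  cores=${pair%%:*}; used=${pair#*:}
  echo "${cores} cores -> offset $((used - cores))"
done
```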
>
> Before I commit to combing through the placement API source code to figure
> out how the "used" value in the placement log is calculated, I'm hoping
> someone knows where and how that value is computed. It does not seem to be
> a fixed value in the database, and it doesn't seem to be affected by the
> allocation ratios.
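As far as I know, placement derives "used" by summing the rows of the
allocations table for the provider, so the raw rows should show where 8206
comes from without reading the source. A sketch of the query, assuming the
usual schema (the database name varies by deployment, often nova_api or
placement; resource_class_id 0 is the standard VCPU class):

```shell
mysql placement -e "
  SELECT c.uuid AS consumer, a.used
  FROM allocations a
  JOIN resource_providers rp ON rp.id = a.resource_provider_id
  JOIN consumers c ON c.id = a.consumer_id
  WHERE rp.uuid = 'c4199e84-8259-4d0e-9361-9b0d9e6e66b7'
    AND a.resource_class_id = 0;"
```

Any consumer UUID in that output that doesn't correspond to a live server
would point to leftover allocations inflating the total.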
>
>
> Thank you in advance!!
> -Jeff Mazzone
> Senior Linux Systems Administrator
> Center for Translational Data Science
> University of Chicago.
>
>
>
> On Jun 30, 2021, at 2:40 PM, Laurent Dumont <laurentfdumont at gmail.com>
> wrote:
>
> In some cases, the DEBUG messages are a bit verbose but can really walk
> you through the allocation/scheduling process. You could increase the log
> level for nova and restart the api + scheduler on the controllers. I wonder
> if a desync of the DB could be the cause? Are you running an HA deployment
> for the mysql backend?
>
>
>

