If you can post the full logs (in debug mode) somewhere, I can have a look. Based on what you're saying, it looks like there might be a race between updating the host topology and another instance claiming resources, although claims are supposed to be race-free because they are serialized by the COMPUTE_RESOURCE_SEMAPHORE [1].

[1] https://github.com/openstack/nova/blob/082c91a9286ae55fd5eb6adeed52500dc75be...

On Tue, Sep 17, 2019 at 8:44 AM wang.ya <wang.ya@99cloud.net> wrote:
Hi:
The main code of NUMA-aware live migration has been merged, and I've been testing it recently.
If I only set the NUMA properties ('hw:numa_nodes', 'hw:numa_cpus', 'hw:numa_mem'), it works well. But if I add the property "hw:cpu_policy='dedicated'", the result is no longer correct after several live migrations.
That is, the live migrations succeed, but the vCPU pinning is wrong: two instances end up with several of the same vCPU pins on the same host.
Below are my test steps.
env:
code: master branch (built on 16 September 2019, including the NUMA-aware live migration patches)
three compute nodes:
- s1: 24C, 48G (2 NUMA nodes)
- stein-2: 12C, 24G (2 NUMA nodes)
- stein-3: 8C, 16G (2 NUMA nodes)
flavor1 (2c2g): hw:cpu_policy='dedicated', hw:numa_cpus.0='0', hw:numa_cpus.1='1', hw:numa_mem.0='1024', hw:numa_mem.1='1024', hw:numa_nodes='2'
flavor2 (4c4g): hw:cpu_policy='dedicated', hw:numa_cpus.0='0,1,2', hw:numa_cpus.1='3', hw:numa_mem.0='1024', hw:numa_mem.1='3072', hw:numa_nodes='2'
The image has no properties set.
I created four instances (2 × flavor1, 2 × flavor2), then live migrated them one by one (as soon as one instance's migration completed, the next one started) and checked after each migration whether the vCPU pinning was correct.
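Roughly, this is the check I do on each host (a minimal sketch; the pin maps are copied by hand from 'virsh vcpupin' output, and with the dedicated policy any overlap between two instances is wrong):

    def overlapping_pcpus(pins_a, pins_b):
        """Each argument maps vCPU -> set of host pCPUs, as reported
        by 'virsh vcpupin'. Returns the pCPUs claimed by both."""
        used_a = set().union(*pins_a.values())
        used_b = set().union(*pins_b.values())
        return used_a & used_b

    # The two instances that ended up on stein-3 (see below):
    print(overlapping_pcpus({0: {2}, 1: {7}}, {0: {2}, 1: {7}}))  # -> {2, 7}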
After several live migrations, the vCPU pinning is no longer correct. (You can find the full migration list in the attached file.) The last live migrations were:
+-----+--------------------------------------+-------------+-----------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------------------+----------------------------+----------------+
| Id | UUID | Source Node | Dest Node | Source Compute | Dest Compute | Dest Host | Status | Instance UUID | Old Flavor | New Flavor | Created At | Updated At | Type |
+-----+--------------------------------------+-------------+-----------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------------------+----------------------------+----------------+
| 470 | 2a9ba183-4f91-4fbf-93cf-6f0e55cc085a | s1 | stein-3 | s1 | stein-3 | 172.16.130.153 | completed | bf0466f6-4815-4824-8586-899817207564 | 1 | 1 | 2019-09-17T10:28:46.000000 | 2019-09-17T10:29:09.000000 | live-migration |
| 469 | c05ea0e8-f040-463e-8957-a59f70ed8bf6 | s1 | stein-3 | s1 | stein-3 | 172.16.130.153 | completed | a3ec7a29-80de-4541-989d-4b9f4377f0bd | 1 | 1 | 2019-09-17T10:28:21.000000 | 2019-09-17T10:28:45.000000 | live-migration |
| 468 | cef4c609-157e-4b39-b6cc-f5528d49c75a | s1 | stein-2 | s1 | stein-2 | 172.16.130.152 | completed | 83dab721-3343-436d-bee7-f5ffc0d0d38d | 4 | 4 | 2019-09-17T10:27:57.000000 | 2019-09-17T10:28:21.000000 | live-migration |
| 467 | 5471e441-2a50-465a-bb63-3fe1bb2e81b9 | s1 | stein-2 | s1 | stein-2 | 172.16.130.152 | completed | e3c19fbe-7b94-4a65-a803-51daa9934378 | 4 | 4 | 2019-09-17T10:27:32.000000 | 2019-09-17T10:27:57.000000 | live-migration |
+-----+--------------------------------------+-------------+-----------+----------------+--------------+----------------+-----------+--------------------------------------+------------+------------+----------------------------+----------------------------+----------------+
Two instances landed on stein-3, and they have the same vCPU pins:
(nova-libvirt)[root@stein-3 /]# virsh list --all
 Id Name State
----------------------------------------------------
 32 instance-00000025 running
 33 instance-00000024 running
(nova-libvirt)[root@stein-3 /]# virsh vcpupin 32
VCPU: CPU Affinity
----------------------------------
 0: 2
 1: 7
(nova-libvirt)[root@stein-3 /]# virsh vcpupin 33
VCPU: CPU Affinity
----------------------------------
 0: 2
 1: 7
I checked nova-compute's log on stein-3 (you can find it in the attached file) and found that 'host_topology' isn't updated when 'hardware.numa_fit_instance_to_host' is called during claims. 'host_topology' is a property of 'objects.ComputeNode', which is cached in the 'ResourceTracker'; the cached 'cn' is used to build the 'claim' when 'check_can_live_migrate_destination' is called. So my guess is that the cache is not updated, is updated too late, or something similar.
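To illustrate my guess (a toy sketch only, not the real nova code; the lock name and the fitting logic are simplified stand-ins for 'nova/compute/resource_tracker.py' and 'hardware.numa_fit_instance_to_host'):

    from oslo_concurrency import lockutils

    COMPUTE_RESOURCE_SEMAPHORE = 'compute_resources'

    class ToyResourceTracker(object):
        def __init__(self):
            # Cached host view; in nova this is the cached ComputeNode
            # whose numa_topology ('host_topology') I see going stale.
            self.cached_free_pcpus = {2, 7}

        @lockutils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
        def claim(self, n_vcpus=2):
            # Stand-in for hardware.numa_fit_instance_to_host(): fit
            # the instance against the *cached* topology.
            return sorted(self.cached_free_pcpus)[:n_vcpus]

    rt = ToyResourceTracker()
    print(rt.claim())  # -> [2, 7]
    # If the cache is not refreshed before the next claim arrives,
    # the lock alone does not help:
    print(rt.claim())  # -> [2, 7] again: the collision I observe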
I also checked the database: the NUMA topologies of the two instances have the same vCPU pins, "[0,2], [1,7]", while the compute node stein-3 only records the pins "[2], [7]".
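In other words, the host's recorded usage only accounts for one instance's worth of pins. Checking the numbers (pins copied from the database):

    # vCPU -> pCPU pins recorded for the two instances:
    inst_a = {0: 2, 1: 7}
    inst_b = {0: 2, 1: 7}

    # pCPUs the compute node stein-3 records as pinned:
    host_pinned = {2, 7}

    # With dedicated pinning, 4 vCPUs should consume 4 distinct pCPUs,
    # but only 2 are accounted for -- consistent with the second claim
    # fitting against a host view that never saw the first one.
    print(len(inst_a) + len(inst_b))                    # 4
    print(set(inst_a.values()) | set(inst_b.values()))  # {2, 7}
    print(host_pinned)                                  # {2, 7}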
Please correct me if I've got something wrong :)
Best Regards