[nova] Testing NUMA-aware live migration
Hi,

The main code for NUMA-aware live migration has been merged, and I've been testing it recently. If I only set the NUMA properties ('hw:numa_nodes', 'hw:numa_cpus', 'hw:numa_mem'), it works well. But if I also add the property "hw:cpu_policy='dedicated'", things go wrong after several live migrations: the live migrations themselves succeed, but the vCPU pins are not correct (two instances end up with some of the same vCPU pins on the same host).

Below are my test steps.

env:

code: master branch (built on 16 September 2019, including the NUMA-aware live migration patches)

three compute nodes:
- s1: 24C, 48G (2 NUMA nodes)
- stein-2: 12C, 24G (2 NUMA nodes)
- stein-3: 8C, 16G (2 NUMA nodes)

flavor1 (2c2g): hw:cpu_policy='dedicated', hw:numa_cpus.0='0', hw:numa_cpus.1='1', hw:numa_mem.0='1024', hw:numa_mem.1='1024', hw:numa_nodes='2'

flavor2 (4c4g): hw:cpu_policy='dedicated', hw:numa_cpus.0='0,1,2', hw:numa_cpus.1='3', hw:numa_mem.0='1024', hw:numa_mem.1='3072', hw:numa_nodes='2'

The image has no properties.

I create four instances (2 x flavor1, 2 x flavor2), then live migrate them one by one (as soon as one instance finishes migrating, the next one starts) and check whether the vCPU pins are correct. After several live migrations, the vCPU pins are no longer correct. (You can find the full migration list in the attached file.) The last live migrations are:

| Id | UUID | Source Node | Dest Node | Source Compute | Dest Compute | Dest Host | Status | Instance UUID | Old Flavor | New Flavor | Created At | Updated At | Type |
| 470 | 2a9ba183-4f91-4fbf-93cf-6f0e55cc085a | s1 | stein-3 | s1 | stein-3 | 172.16.130.153 | completed | bf0466f6-4815-4824-8586-899817207564 | 1 | 1 | 2019-09-17T10:28:46.000000 | 2019-09-17T10:29:09.000000 | live-migration |
| 469 | c05ea0e8-f040-463e-8957-a59f70ed8bf6 | s1 | stein-3 | s1 | stein-3 | 172.16.130.153 | completed | a3ec7a29-80de-4541-989d-4b9f4377f0bd | 1 | 1 | 2019-09-17T10:28:21.000000 | 2019-09-17T10:28:45.000000 | live-migration |
| 468 | cef4c609-157e-4b39-b6cc-f5528d49c75a | s1 | stein-2 | s1 | stein-2 | 172.16.130.152 | completed | 83dab721-3343-436d-bee7-f5ffc0d0d38d | 4 | 4 | 2019-09-17T10:27:57.000000 | 2019-09-17T10:28:21.000000 | live-migration |
| 467 | 5471e441-2a50-465a-bb63-3fe1bb2e81b9 | s1 | stein-2 | s1 | stein-2 | 172.16.130.152 | completed | e3c19fbe-7b94-4a65-a803-51daa9934378 | 4 | 4 | 2019-09-17T10:27:32.000000 | 2019-09-17T10:27:57.000000 | live-migration |

Two instances landed on stein-3, and they have the same vCPU pins:

(nova-libvirt)[root@stein-3 /]# virsh list --all
 Id    Name                 State
----------------------------------------------------
 32    instance-00000025    running
 33    instance-00000024    running

(nova-libvirt)[root@stein-3 /]# virsh vcpupin 32
VCPU: CPU Affinity
----------------------------------
   0: 2
   1: 7

(nova-libvirt)[root@stein-3 /]# virsh vcpupin 33
VCPU: CPU Affinity
----------------------------------
   0: 2
   1: 7

I checked nova-compute's log on stein-3 (you can find the log in the attachment) and found that 'host_topology' isn't updated when 'hardware.numa_fit_instance_to_host' is called in the claims code. 'host_topology' is a property of 'objects.ComputeNode' and it is cached in the 'ResourceTracker'; the cached 'cn' is used to build the 'claim' when 'check_can_live_migrate_destination' is called. So my guess is that the cache was not updated, was updated too late, or something else along those lines.

I also checked the database: the NUMA topologies of the two instances have the same vCPU pins, "[0,2], [1,7]", while the compute node stein-3 only has the vCPU pins "[2], [7]" recorded.

Please correct me if there is something wrong :)

Best Regards
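For anyone who wants to reproduce this, a rough sketch of the per-host overlap check described above. It shells out to the same virsh commands shown in the message and assumes single-CPU affinities (as produced by hw:cpu_policy='dedicated'); it is an illustration, not the exact script used in the test.

# List the running domains and flag any physical CPU that shows up in more
# than one domain's pin set.
import subprocess
from collections import defaultdict


def pinned_cpus(domain):
    out = subprocess.check_output(['virsh', 'vcpupin', domain], text=True)
    cpus = set()
    for line in out.splitlines():
        line = line.strip()
        if line and line[0].isdigit() and ':' in line:
            # Data lines look like "0: 2" (vCPU index, then pCPU affinity).
            cpus.add(int(line.split(':', 1)[1].strip()))
    return cpus


def find_overlaps():
    names = subprocess.check_output(['virsh', 'list', '--name'],
                                    text=True).split()
    owners = defaultdict(list)
    for name in names:
        for cpu in pinned_cpus(name):
            owners[cpu].append(name)
    return {cpu: doms for cpu, doms in owners.items() if len(doms) > 1}


if __name__ == '__main__':
    print(find_overlaps() or 'no overlapping pins')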
If you can post the full logs (in debug mode) somewhere, I can have a look.

Based on what you're saying, it looks like there might be a race between updating the host topology and another instance claiming resources - although claims are supposed to be race-free because they use the COMPUTE_RESOURCES_SEMAPHORE [1].

[1] https://github.com/openstack/nova/blob/082c91a9286ae55fd5eb6adeed52500dc75be...
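For anyone following along, a heavily simplified sketch of the serialization Artom is referring to: claims on a host are serialized under one lock, so two incoming migrations should not be able to grab the same pinned CPUs unless the topology the claim reads is stale. The class and method names below are illustrative only, not nova's actual API.

# Toy model of semaphore-serialized CPU claims. If the cached host topology
# is stale (free_cpus not refreshed after the first migration), both claims
# pass the check and the instances end up pinned to the same pCPUs.
import threading

COMPUTE_RESOURCES_SEMAPHORE = threading.Lock()


class HostState:
    def __init__(self, free_cpus):
        self.free_cpus = set(free_cpus)

    def claim_pinned_cpus(self, requested):
        with COMPUTE_RESOURCES_SEMAPHORE:
            requested = set(requested)
            if not requested <= self.free_cpus:
                raise RuntimeError(
                    'pCPUs already pinned: %s' % sorted(requested - self.free_cpus))
            self.free_cpus -= requested
            return requested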
On 9/17/2019 7:44 AM, wang.ya wrote:
But if I also add the property "hw:cpu_policy='dedicated'", things go wrong after several live migrations.
Which means the live migrations themselves succeed, but the vCPU pins are not correct (two instances end up with some of the same vCPU pins on the same host).
Is the race you're describing the same issue reported in this bug? https://bugs.launchpad.net/nova/+bug/1829349

Also, what is the max_concurrent_live_migrations config option set to? That defaults to 1, but I'm wondering if you've changed it at all.

--
Thanks,
Matt
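A quick way to double-check what a compute node is actually running with is to read the option straight out of nova.conf. A minimal sketch, assuming the stock /etc/nova/nova.conf path; the option lives in [DEFAULT] and defaults to 1 when unset.

# Print max_concurrent_live_migrations from the local nova.conf.
import configparser

cfg = configparser.ConfigParser(interpolation=None, strict=False)
cfg.read('/etc/nova/nova.conf')
print(cfg.get('DEFAULT', 'max_concurrent_live_migrations', fallback='1'))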
I think the two issues should be similar. As I said, the first instance live migrates to the host, but the cached 'cn' in the resource tracker is not updated; at that moment the second instance live migrates to the same host, and the vCPU pinning ends up broken. The issue is not reproducible every time; it takes multiple live migrations to hit (I wrote a script to run the live migrations automatically).

I have checked nova's config; the 'max_concurrent_live_migrations' option is left at its default :)

I've reported the issue on Launchpad, and you can find the log in the attachment there: https://bugs.launchpad.net/nova/+bug/1845146
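For reference, a sketch of the kind of driver loop mentioned above, which live migrates the instances one at a time and waits for each to settle before starting the next. It is written against openstacksdk; the cloud name and the crude waiting logic are assumptions to adapt to your own environment rather than the exact script used here.

# Serially live migrate every server, letting the scheduler pick the
# destination, and wait for the instance to return to ACTIVE each time.
import time

import openstack

conn = openstack.connect(cloud='mycloud')  # hypothetical clouds.yaml entry

for server in list(conn.compute.servers()):
    conn.compute.live_migrate_server(server, host=None, block_migration=None)
    time.sleep(5)  # crude: give the migration time to leave ACTIVE first
    conn.compute.wait_for_server(server, status='ACTIVE', wait=600)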
I've proposed [1], which I think should solve the issue. Could you test with that patch and let us know if the bug goes away? Thanks again for helping improve this!

[1] https://review.opendev.org/#/c/684409/
--
Artom Lifshitz
Software Engineer, OpenStack Compute DFG