[openstack] how to speed up live migration?


Ignazio Cassano

3 Aug 2022

3:27 a.m.

Hello all,

I am looking for a way to speed up live migration. For instances that use RAM heavily, such as Java application servers, live migration takes a long time (more than 20 minutes for an instance with 8 GB of RAM), even though converge mode is already set to True in nova.conf. I also tried post_copy, but it made no difference. After the first live migration (very slow), migrating the same instance again is very fast. I presume the first migration is slow because of memory fragmentation when an instance has been running on the same compute node for a long time. I am looking for a solution, considering that my compute nodes can have a little RAM overcommit. In any case, I am increasing the number of compute nodes to reduce it.

Thanks,
Ignazio


Felix Hüttner

3 Aug 2022

3:55 a.m.

Hi Ignazio,

Is it the actual live migration that takes long (e.g. the libvirt migration you can watch with "virsh domjobinfo <instance>"), or the whole live-migration process as observed by nova?

We have seen a few times that the step that actually takes long is plugging the neutron port on the target hypervisor (although I think this only applies to ml2-ovs). For us this seems to happen because the neutron-openvswitch-agent can take some time to assemble the firewall rules for the security group of the port (especially if you use large remote security groups). This would also explain why migrating back is fast: the neutron-openvswitch-agent on the source will still have the information cached.

Alternatively, you could have multiple live migrations queued for the same source hypervisor, but nova only handles them one by one (unless you set max_concurrent_live_migrations).

--
Felix Huettner
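To compare the libvirt-level migration with the time nova reports, it can help to capture the "virsh domjobinfo <instance>" output at intervals and turn it into key/value pairs. The following is a minimal sketch, assuming the output consists of "Key: value" lines; the sample text and its numbers are illustrative only (not captured from a real host), and the exact fields vary with the libvirt version.

```python
def parse_domjobinfo(text):
    """Parse 'Key: value' lines, as printed by `virsh domjobinfo`, into a dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep or not key.strip():
            continue  # skip blank or malformed lines
        # Collapse the runs of spaces that are only there for column alignment.
        stats[key.strip()] = " ".join(value.split())
    return stats


# Illustrative sample only; the values are made up for this sketch.
SAMPLE = """\
Job type:         Unbounded
Operation:        Outgoing migration
Time elapsed:     31042        ms
Memory processed: 1.172 GiB
Memory remaining: 14.301 GiB
Memory total:     16.005 GiB
"""

if __name__ == "__main__":
    for key, value in parse_domjobinfo(SAMPLE).items():
        print(f"{key} = {value}")
```

On a compute node the text could be obtained by running the command through `subprocess.run(["virsh", "domjobinfo", domain], capture_output=True, text=True)` while the migration is in progress; if the elapsed time there is much shorter than what nova shows, the delay is probably outside the memory copy itself.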

Fabian Zimmermann

4 Aug 2022

2:34 a.m.

Hi,

take a look at:

https://docs.openstack.org/nova/latest/admin/configuring-migrations.html#adv...

especially Auto-convergence and Post-copy.

Fabian
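Before tuning further, it may be worth confirming which of the options discussed in this thread are really set on a given compute node. The following is a minimal sketch, assuming the option names and sections used by nova.conf ([libvirt] live_migration_permit_auto_converge, [libvirt] live_migration_permit_post_copy, [DEFAULT] max_concurrent_live_migrations); the sample file content is hypothetical.

```python
import configparser

# (section, option) pairs for the settings mentioned in this thread.
OPTIONS = [
    ("libvirt", "live_migration_permit_auto_converge"),
    ("libvirt", "live_migration_permit_post_copy"),
    ("DEFAULT", "max_concurrent_live_migrations"),
]


def report_live_migration_options(conf_text):
    """Return {(section, option): value or None} for the options above."""
    parser = configparser.ConfigParser(
        default_section="__unused__",  # treat [DEFAULT] as an ordinary section
        interpolation=None,            # values may contain '%'
        strict=False,                  # tolerate repeated options
    )
    parser.read_string(conf_text)
    return {
        (section, option): parser.get(section, option, fallback=None)
        for section, option in OPTIONS
    }


# Hypothetical nova.conf fragment, for illustration only.
SAMPLE_CONF = """\
[DEFAULT]
max_concurrent_live_migrations = 1

[libvirt]
live_migration_permit_auto_converge = True
"""

if __name__ == "__main__":
    for (section, option), value in report_live_migration_options(SAMPLE_CONF).items():
        shown = value if value is not None else "(not set, nova default applies)"
        print(f"[{section}] {option} = {shown}")
```

A value of None only means the option is absent from the file, so nova falls back to its built-in default; the script does not know what that default is.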

Ignazio Cassano

2:58 a.m.

Hello,

many thanks for your reply. I tried both post-copy and auto-converge, but live migration is always very slow for application servers like Tomcat with Java. The very strange behaviour is that migrating from compute node A to compute node B takes more than 20 minutes, and after that, migrating back from compute node B to compute node A takes a few seconds. I do not know whether this is because dirty memory is cleaned during the first live migration. The live migration network is 10 Gb/s, so I do not think the first live migration is limited by network performance.

Ignazio

Gorka Eguileor

7:56 a.m.

On 03/08, Ignazio Cassano wrote:
> Hello all, I am looking for a way to speed up live migration. For instances that use RAM heavily, such as Java application servers, live migration takes a long time (more than 20 minutes for an instance with 8 GB of RAM), even though converge mode is already set to True in nova.conf.

Hi,

Probably doesn't affect your case, but I assume you are using ephemeral nova boot volumes.

Have you tried using only Cinder volumes on the VM?

Cheers,
Gorka.

> I also tried post_copy, but it made no difference. After the first live migration (very slow), migrating the same instance again is very fast. I presume the first migration is slow because of memory fragmentation when an instance has been running on the same compute node for a long time. I am looking for a solution, considering that my compute nodes can have a little RAM overcommit. In any case, I am increasing the number of compute nodes to reduce it. Thanks, Ignazio

Ignazio Cassano

8:09 a.m.

Hi,

I am using Cinder volumes.

Ignazio

Gorka Eguileor

5 Aug 2022

1:17 a.m.

On 04/08, Ignazio Cassano wrote:
> Hi, I am using Cinder volumes. Ignazio

Hi,

In that case there is no volume data being copied for the instance migration, and the volume attach on the destination should not account for more than 30 seconds of those 20 minutes, so there is not much improvement possible there.

Cheers,
Gorka.

Ignazio Cassano

1:24 a.m.

Hello,

first of all, thank you for your reply, and sorry for coming back to ask: why does the first migration from A to B take 20 minutes, while migrating from B to A afterwards takes a few seconds? I wonder whether memory is reorganized after the first migration. Does the first live migration lose time fetching memory pages?

Ignazio

Radosław Piliszek

1:40 a.m.

On Fri, 5 Aug 2022 at 10:28, Ignazio Cassano <ignaziocassano@gmail.com> wrote:
> why does the first migration from A to B take 20 minutes, while migrating from B to A afterwards takes a few seconds? I wonder whether memory is reorganized after the first migration. Does the first live migration lose time fetching memory pages?

Just curious: did you try migrating the same instance again after that, i.e. again from A to B? Is it still fast, or is it slow again? Does it only happen with long-running instances?

-yoctozepto

Ignazio Cassano

1:42 a.m.

Hi,

I am going to try it. Thanks.

Ignazio Cassano

2:05 a.m.

Hi,

migration from A to B: 750 sec
migration from B to A: 10 sec
migration from A to B: 10 sec

Ignazio

Radosław Piliszek

2:32 a.m.

On Fri, 5 Aug 2022 at 11:06, Ignazio Cassano <ignaziocassano@gmail.com> wrote:
> migration from A to B: 750 sec
> migration from B to A: 10 sec
> migration from A to B: 10 sec

Interesting! So it indeed looks like a dirty/cold case. However, as Gorka and others have already mentioned, you really need to pinpoint WHAT takes that long, i.e. which of the involved components spends too long doing its job. It could be that in these 740 secs there is actually no real throughput happening, just something waiting for a timeout before progressing on the second try.

-yoctozepto

Ignazio Cassano

2:40 a.m.

Hi,

in my previous email I sent what happens during the first live migration:

migration running for 30 secs, memory 89% remaining; (bytes processed=1258508063, remaining=15356194816, total=17184923648)
2022-08-05 10:47:23.910 55600 INFO nova.virt.libvirt.driver [req-ff02667e-9d38-4a08-9c63-013ed1064218 66adb965bef64eaaab2af93ade87e2ca 85cace94dcc7484c85ff9337eb1d0c4c - default default] [instance: d1aae4bb-9a2b-454f-9018-568af6a98cc3] Migration running for 60 secs, memory 87% remaining; (bytes processed=1489083638, remaining=15035801600, total=17184923648)
[...]08-9c63-013ed1064218 66adb965bef64eaaab2af93ade87e2ca 85cace94dcc7484c85ff9337eb1d0c4c - default default] [instance: d1aae4bb-9a2b-454f-9018-568af6a98cc3] Migration running for 90 secs, memory 86% remaining; (bytes processed=1689004421, remaining=14802731008, total=17184923648)

and so on.

Running tcpdump from A to B, I can see a lot of traffic. Do you suggest enabling debug? It seems clear that the memory content migration is slow.

Ignazio
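The progress lines quoted in this thread already allow a rough estimate of how fast memory is moving. The following is a minimal sketch that parses the three samples reported above and computes the bytes processed per second between consecutive samples; it takes the "bytes processed" counter at face value, and whether that counter matches bytes on the wire is an assumption.

```python
import re

# The three progress messages reported in the thread (log prefixes omitted).
LOG = """\
migration running for 30 secs, memory 89% remaining; (bytes processed=1258508063, remaining=15356194816, total=17184923648)
Migration running for 60 secs, memory 87% remaining; (bytes processed=1489083638, remaining=15035801600, total=17184923648)
Migration running for 90 secs, memory 86% remaining; (bytes processed=1689004421, remaining=14802731008, total=17184923648)
"""

PATTERN = re.compile(
    r"running for (\d+) secs, memory \d+% remaining; "
    r"\(bytes processed=(\d+), remaining=(\d+), total=(\d+)\)"
)


def parse_progress(text):
    """Return a list of (seconds, processed, remaining, total) tuples."""
    return [tuple(int(group) for group in match.groups())
            for match in PATTERN.finditer(text)]


def transfer_rates(samples):
    """Bytes processed per second between consecutive samples."""
    return [
        (cur[1] - prev[1]) / (cur[0] - prev[0])
        for prev, cur in zip(samples, samples[1:])
    ]


if __name__ == "__main__":
    samples = parse_progress(LOG)
    for rate in transfer_rates(samples):
        print(f"{rate / 2**20:.2f} MiB/s")
```

For these samples the rate works out to roughly 7.3 and 6.4 MiB/s, which is far below what a 10 Gb/s link can carry, so if the counters are representative the network itself is unlikely to be the limit.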

Gorka Eguileor

1:49 a.m.

On 05/08, Ignazio Cassano wrote:
> Hello, first of all, thank you for your reply, and sorry for coming back to ask: why does the first migration from A to B take 20 minutes, while migrating from B to A afterwards takes a few seconds? I wonder whether memory is reorganized after the first migration. Does the first live migration lose time fetching memory pages? Ignazio

Hi,

I work on Cinder, so my knowledge of live migrations is mostly limited to the attach/detach flow of the volumes.

I thought that if you were using ephemeral nova volumes (non-Cinder), maybe the volume had not yet been deleted from the old node, or maybe it was using a qcow2 base file shared by multiple instances on the source (each using a different chain on top of it) and this qcow2 was not originally present on the destination (hence the time to copy it). Then, when migrating back, since other instances on the destination (the original location) were also using it, only the difference would need to be copied.

But these are just brainstorming ideas, since I don't really know how Nova handles all this.

I would recommend setting the Nova log to debug mode on both the source and destination nodes and looking at where the time difference really is, in case it's not where you think it is.

Cheers,
Gorka.
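Once debug logging is enabled, one way to see where the time goes is to look for the longest pauses between consecutive log lines of the instance being migrated. The following is a minimal sketch, assuming log lines start with a timestamp in the format seen in the excerpt in this thread ("2022-08-05 10:47:23.910") and carry an "[instance: <uuid>]" tag; the sample lines, module name, and UUID are hypothetical.

```python
from datetime import datetime


def largest_gaps(lines, instance_uuid, top=3):
    """Return the `top` longest pauses between consecutive log lines of one instance.

    Each result is (gap_in_seconds, earlier_line, later_line).
    """
    stamped = []
    for line in lines:
        if f"[instance: {instance_uuid}]" not in line:
            continue
        try:
            when = datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")
        except ValueError:
            continue  # line does not start with a timestamp
        stamped.append((when, line.rstrip()))
    stamped.sort(key=lambda item: item[0])
    gaps = [
        ((later[0] - earlier[0]).total_seconds(), earlier[1], later[1])
        for earlier, later in zip(stamped, stamped[1:])
    ]
    gaps.sort(key=lambda gap: gap[0], reverse=True)
    return gaps[:top]


# Hypothetical log lines, for illustration only.
UUID = "00000000-0000-0000-0000-000000000000"
SAMPLE = [
    f"2022-08-05 10:46:00.000 55600 DEBUG nova.example [req-x] [instance: {UUID}] step one",
    f"2022-08-05 10:46:05.500 55600 DEBUG nova.example [req-x] [instance: {UUID}] step two",
    f"2022-08-05 10:58:05.500 55600 DEBUG nova.example [req-x] [instance: {UUID}] step three",
    "2022-08-05 10:59:00.000 55600 DEBUG nova.example [req-y] [instance: other] unrelated",
]

if __name__ == "__main__":
    for seconds, before, after in largest_gaps(SAMPLE, UUID, top=1):
        print(f"{seconds:.1f} s between:\n  {before}\n  {after}")
```

Lines from the source and destination nodes can be fed in together, provided the clocks of both nodes are synchronized; otherwise the gaps across nodes are not meaningful.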

Ignazio Cassano

2:18 a.m.

Hi,

this is the volume, on NetApp NFS, attached to the VM I am migrating:

qemu-img info volume-002ff8af-9067-4f84-a01c-d147cdd1f70d
image: volume-002ff8af-9067-4f84-a01c-d147cdd1f70d
file format: raw
virtual size: 40G (42949672960 bytes)
disk size: 21G

As you can see, it is raw and it does not have a base image.

Ignazio

Ignazio Cassano

2:27 a.m.

Migrating again to a new node (compute C) takes 10 sec. The first migration from A to B (750 sec) is slow in migrating memory:

migration running for 30 secs, memory 89% remaining; (bytes processed=1258508063, remaining=15356194816, total=17184923648)
2022-08-05 10:47:23.910 55600 INFO nova.virt.libvirt.driver [req-ff02667e-9d38-4a08-9c63-013ed1064218 66adb965bef64eaaab2af93ade87e2ca 85cace94dcc7484c85ff9337eb1d0c4c - default default] [instance: d1aae4bb-9a2b-454f-9018-568af6a98cc3] Migration running for 60 secs, memory 87% remaining; (bytes processed=1489083638, remaining=15035801600, total=17184923648)
[...]08-9c63-013ed1064218 66adb965bef64eaaab2af93ade87e2ca 85cace94dcc7484c85ff9337eb1d0c4c - default default] [instance: d1aae4bb-9a2b-454f-9018-568af6a98cc3] Migration running for 90 secs, memory 86% remaining; (bytes processed=1689004421, remaining=14802731008, total=17184923648)

and so on.

Gorka Eguileor

2:45 a.m.

On 05/08, Ignazio Cassano wrote:
> Migrating again to a new node (compute C) takes 10 sec. The first migration from A to B (750 sec) is slow in migrating memory:
>
> migration running for 30 secs, memory 89% remaining; (bytes processed=1258508063, remaining=15356194816, total=17184923648)
> 2022-08-05 10:47:23.910 55600 INFO nova.virt.libvirt.driver [req-ff02667e-9d38-4a08-9c63-013ed1064218 66adb965bef64eaaab2af93ade87e2ca 85cace94dcc7484c85ff9337eb1d0c4c - default default] [instance: d1aae4bb-9a2b-454f-9018-568af6a98cc3] Migration running for 60 secs, memory 87% remaining; (bytes processed=1489083638, remaining=15035801600, total=17184923648)
> [...]08-9c63-013ed1064218 66adb965bef64eaaab2af93ade87e2ca 85cace94dcc7484c85ff9337eb1d0c4c - default default] [instance: d1aae4bb-9a2b-454f-9018-568af6a98cc3] Migration running for 90 secs, memory 86% remaining; (bytes processed=1689004421, remaining=14802731008, total=17184923648)
>
> and so on

That sounds crazy to me. Unless the first node has more load or more network usage than the others, or the VM isn't actually running on Compute B so the migration is not really of a running VM...

Ignazio Cassano

2:53 a.m.

When the instance is migrated again from te second to the first it takes 10 seconds. If first node has more loads on network or memory, it should take a long time in any case. Keep in mind I am not using hugepages but default configuration. I am convinced that it is about how the memory of an instance is managed after it runs for a long time on a node Ignazio Il giorno ven 5 ago 2022 alle ore 11:45 Gorka Eguileor <geguileo@redhat.com> ha scritto:

...

On 05/08, Ignazio Cassano wrote:

...
Migrating again to a new node (COMPUTE C) it takes 10 sec. The first migration from A to B (750 sec) is slow in migrating memory :

*migration running for 30 secs, memory 89% remaining; (bytes processed=1258508063, remaining=15356194816, total=17184923648)2022-08-05 10:47:23.910 55600 INFO nova.virt.libvirt.driver [req-ff02667e-9d38-4a08-9c63-013ed1064218 66adb965bef64eaaab2af93ade87e2ca 85cace94dcc7484c85ff9337eb1d0c4c - default default] [instance: d1aae4bb-9a2b-454f-9018-568af6a98cc3] Migration running for 60 secs, memory 87% remaining; (bytes processed=1489083638, remaining=15035801600, total=17184923648)08-9c63-013ed1064218 66adb965bef64eaaab2af93ade87e2ca 85cace94dcc7484c85ff9337eb1d0c4c - default default] [instance: d1aae4bb-9a2b-454f-9018-568af6a98cc3] Migration running for 90 secs, memory 86% remaining; (bytes processed=1689004421, remaining=14802731008, total=17184923648)*

and so on

That sounds crazy to me. Unless the first node has more load or more network usage than the others, or the VM isn't actually running on Compute B so the migration is not really of a running VM...

...
Il giorno ven 5 ago 2022 alle ore 11:18 Ignazio Cassano < ignaziocassano@gmail.com> ha scritto:

...
Hi, this is the volume attached on netapp nfs about the vm I am

...
...
qemu-img info volume-002ff8af-9067-4f84-a01c-d147cdd1f70d
image: volume-002ff8af-9067-4f84-a01c-d147cdd1f70d
file format: raw
virtual size: 40G (42949672960 bytes)
disk size: 21G

As you can see it is raw and it does not have a base image.

Ignazio

On Fri, 5 Aug 2022 at 10:49, Gorka Eguileor <geguileo@redhat.com> wrote:

...
On 05/08, Ignazio Cassano wrote:

...
Hello, first let me thank you for your reply, and sorry if I come back to ask why the first migration from A to B takes 20 minutes and then, when I migrate from B to A, it takes a few seconds. I wonder if the memory is reorganized after the first migration. Does the first live migration lose time fetching memory pages?

Ignazio

Hi,

I work on Cinder, so my knowledge on live migrations is mostly limited to the attach/detach flow of the volumes.

I thought that maybe, if you were using ephemeral nova volumes (non-cinder), the volume had not yet been deleted from the old node, or that it was using a qcow2 base file shared by multiple instances on the source (each using a different chain on top of it) and this qcow2 was not originally present on the destination (hence the time to copy it). Then, when we do a migration back, since there are other instances that were also using it on the destination (the original location), only the difference needs to be copied.

But these are just brainstorming ideas, since I don't really know how Nova handles all this.

I would recommend setting Nova log to debug mode in both source and destination nodes and look at where the time difference really is, in case it's not where you think it is.

Cheers, Gorka.
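Editor's sketch of the suggestion above: once both nodes log at debug level, one way to see "where the time difference really is" would be to look for the largest gaps between consecutive log timestamps. The helper name `largest_gaps` and the sample lines are hypothetical; only the timestamp layout is taken from the nova log line quoted in this thread.

```python
import re
from datetime import datetime

# Timestamp layout as seen in the nova log line quoted in this thread,
# e.g. "2022-08-05 10:47:23.910 ...". Lines without it are skipped.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")

def largest_gaps(lines, top=3):
    """Return the `top` largest (gap_seconds, earlier_line, later_line) tuples."""
    stamped = []
    for line in lines:
        m = TS.match(line)
        if m:
            when = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
            stamped.append((when, line.rstrip()))
    gaps = [
        ((b[0] - a[0]).total_seconds(), a[1], b[1])
        for a, b in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]

# Made-up sample lines, for illustration only.
sample = [
    "2022-08-05 10:46:00.000 55600 DEBUG example.module [-] step one",
    "2022-08-05 10:46:01.500 55600 DEBUG example.module [-] step two",
    "2022-08-05 10:46:53.500 55600 DEBUG example.module [-] step three",
]
for gap, before, after in largest_gaps(sample, top=1):
    print(f"{gap:.1f}s between:\n  {before}\n  {after}")
```

Running the same helper over the source and destination logs separately would show which side sits idle the longest, assuming the clocks on both nodes are in sync.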

...
On Fri, 5 Aug 2022 at 10:17, Gorka Eguileor <geguileo@redhat.com> wrote:

...
On 04/08, Ignazio Cassano wrote:
> Hi,
> I am using cinder volumes.
> Ignazio

Hi,

In that case there is no volume data being copied for the instance migration, and volume attach on the destination should not account for more than 30 seconds of those 20 minutes, so not much improvement possible there.

Cheers, Gorka.


Radosław Piliszek

3:04 a.m.

On Fri, 5 Aug 2022 at 12:00, Ignazio Cassano <ignaziocassano@gmail.com> wrote:

...

When the instance is migrated again from the second to the first it takes 10 seconds. If first node has more loads on network or memory, it should take a long time in any case. Keep in mind I am not using hugepages but default configuration.

I am convinced that it is about how the memory of an instance is managed after it runs for a long time on a node

Just keep in mind the transfer rates you get are VERY LOW for anything RAM-like. It's around 20 MiB/s - my old HDD could go faster than that with mediocre fragmentation. ;-) It's more likely it spends time waiting for something instead of doing real work.

-yoctozepto
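Editor's sketch: one way to arrive at a figure in that range is to divide the `total=` value from the nova log by the 750 seconds reported for the A to B migration. This assumes every byte is sent exactly once, which a live migration does not guarantee, so treat it as a rough average rather than a measurement. The function name is made up for illustration.

```python
# Figures quoted earlier in this thread.
TOTAL_BYTES = 17_184_923_648  # "total=" field in the nova log line
DURATION_S = 750              # reported duration of the A -> B migration

def average_rate_mib_s(total_bytes: int, seconds: float) -> float:
    """Average throughput in MiB/s if total_bytes were sent once in `seconds`."""
    return total_bytes / seconds / 2**20

rate = average_rate_mib_s(TOTAL_BYTES, DURATION_S)
print(f"average rate: {rate:.1f} MiB/s")
```

The result lands a little above 20 MiB/s, the same order of magnitude as the figure mentioned above.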

Sean Mooney

3:34 a.m.

One thing to be aware of is that if the VM writes even a single byte to a memory page during the migration, then the entire page needs to be transferred again, not just that one byte. This gets expensive if you use hugepages, as a one-byte write gets amplified to a 2 MB or 1 GB page copy. Even for the default 4k pages it's expensive. Post-copy and auto-converge help with that to a degree, but yes, it sounds like this might be memory related, though it could still be a network bandwidth limitation. Using jumbo frames on the migration network may help, as well as disabling TCP slow start. I'm not sure if there is really anything that can be done to improve the initial migration time beyond that.

On Fri, Aug 5, 2022 at 11:26 AM Radosław Piliszek <radoslaw.piliszek@gmail.com> wrote:

...

On Fri, 5 Aug 2022 at 12:00, Ignazio Cassano <ignaziocassano@gmail.com> wrote:

...
When the instance is migrated again from the second to the first it takes 10 seconds. If first node has more loads on network or memory, it should take a long time in any case. Keep in mind I am not using hugepages but default configuration.

I am convinced that it is about how the memory of an instance is managed after it runs for a long time on a node

Just keep in mind the transfer rates you get are VERY LOW for anything RAM-like. It's around 20 MiB/s - my old HDD could go faster than that with mediocre fragmentation. ;-) It's more likely it spends time waiting for something instead of doing real work.

-yoctozepto
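Editor's sketch of Sean's point about whole pages being re-sent: a toy model of an iterative pre-copy loop, where each round transfers what was dirtied during the previous round. It is a deliberate simplification and not how QEMU is implemented: it assumes a constant write rate, that every write lands on a different page (worst case), and it ignores auto-converge throttling and post-copy. All names and the numbers in the usage lines are illustrative assumptions.

```python
def precopy_rounds(ram_bytes, bandwidth_bytes_s, writes_per_s, page_size,
                   stop_below=64 * 2**20, max_rounds=30):
    """Toy pre-copy model.

    Returns (rounds, seconds) once the data dirtied during a round is small
    enough to send in a final short pause, or None if that never happens
    within max_rounds.
    """
    to_send = ram_bytes
    elapsed = 0.0
    for round_no in range(1, max_rounds + 1):
        t = to_send / bandwidth_bytes_s
        elapsed += t
        # Worst case: each write during this round dirties a distinct page,
        # and every dirty page has to be re-sent in full.
        dirtied = min(ram_bytes, writes_per_s * page_size * t)
        if dirtied <= stop_below:
            return round_no, elapsed
        to_send = dirtied
    return None

RAM = 16 * 2**30   # 16 GiB guest (assumed)
LINK = 2**30       # 1 GiB/s usable migration bandwidth (assumed)

small_pages = precopy_rounds(RAM, LINK, writes_per_s=1000, page_size=4096)
huge_pages = precopy_rounds(RAM, LINK, writes_per_s=1000, page_size=2 * 2**20)
print("4 KiB pages:", small_pages)
print("2 MiB pages:", huge_pages)
```

With the same 1000 scattered writes per second, the 4 KiB case settles after a single pass, while the 2 MiB case dirties memory faster than the assumed link can carry it and never converges in this model.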

Ignazio Cassano

3:47 a.m.

Hi Sean, I am going to test it. At this time the live migration interfaces are bonded on two 10 Gb/s NICs, but they are also used for tenant and provider networks. I have a free NIC (1 Gb/s) on a VLAN where there is no traffic. Do you think I can try switching to that NIC even if it is only 1 Gb/s?

On Fri, 5 Aug 2022 at 12:34, Sean Mooney <smooney@redhat.com> wrote:

...

One thing to be aware of is that if the VM writes even a single byte to a memory page during the migration, then the entire page needs to be transferred again, not just that one byte. This gets expensive if you use hugepages, as a one-byte write gets amplified to a 2 MB or 1 GB page copy.

Even for the default 4k pages it's expensive. Post-copy and auto-converge help with that to a degree, but yes, it sounds like this might be memory related, though it could still be a network bandwidth limitation.

Using jumbo frames on the migration network may help, as well as disabling TCP slow start.

I'm not sure if there is really anything that can be done to improve the initial migration time beyond that.

On Fri, Aug 5, 2022 at 11:26 AM Radosław Piliszek <radoslaw.piliszek@gmail.com> wrote:

...
On Fri, 5 Aug 2022 at 12:00, Ignazio Cassano <ignaziocassano@gmail.com> wrote:

...
When the instance is migrated again from the second to the first it takes 10 seconds. If first node has more loads on network or memory, it should take a long time in any case. Keep in mind I am not using hugepages but default configuration.

I am convinced that it is about how the memory of an instance is managed after it runs for a long time on a node

Just keep in mind the transfer rates you get are VERY LOW for anything RAM-like. It's around 20 MiB/s - my old HDD could go faster than that with mediocre fragmentation. ;-) It's more likely it spends time waiting for something instead of doing real work.

-yoctozepto

Radosław Piliszek

2:57 a.m.

On Fri, 5 Aug 2022 at 11:51, Gorka Eguileor <geguileo@redhat.com> wrote:

...

On 05/08, Ignazio Cassano wrote:

...
Migrating again to a new node (COMPUTE C) it takes 10 sec. The first migration from A to B (750 sec) is slow in migrating memory :

migration running for 30 secs, memory 89% remaining; (bytes processed=1258508063, remaining=15356194816, total=17184923648)
2022-08-05 10:47:23.910 55600 INFO nova.virt.libvirt.driver [req-ff02667e-9d38-4a08-9c63-013ed1064218 66adb965bef64eaaab2af93ade87e2ca 85cace94dcc7484c85ff9337eb1d0c4c - default default] [instance: d1aae4bb-9a2b-454f-9018-568af6a98cc3] Migration running for 60 secs, memory 87% remaining; (bytes processed=1489083638, remaining=15035801600, total=17184923648)
08-9c63-013ed1064218 66adb965bef64eaaab2af93ade87e2ca 85cace94dcc7484c85ff9337eb1d0c4c - default default] [instance: d1aae4bb-9a2b-454f-9018-568af6a98cc3] Migration running for 90 secs, memory 86% remaining; (bytes processed=1689004421, remaining=14802731008, total=17184923648)

and so on

That sounds crazy to me. Unless the first node has more load or more network usage than the others, or the VM isn't actually running on Compute B so the migration is not really of a running VM...

Wow, I agree it looks crazy just like Gorka has said. Indeed, by looking at the counters, it seems the process is progressing at approx. the rate of 1.4 GB per minute so approx. 12 minutes total makes perfect sense.

So it really boils down to "why is the memory migration so slow?". More like a topic to discuss with libvirt, QEMU and KVM folks as I doubt nova (and the rest of the OpenStack stuff) has any impact on it.

-yoctozepto
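Editor's sketch: the counters in the quoted nova log can be pulled out with a regular expression to see how quickly `remaining` shrinks between samples, and to check how long the `total` would take at the quoted 1.4 GB per minute. Only the three samples posted in the thread are used, so the per-interval figures describe the first 90 seconds and need not match the whole-run average. The helper names are made up for illustration.

```python
import re

# Message part of the three nova log samples quoted in this thread.
LOG = """\
migration running for 30 secs, memory 89% remaining; (bytes processed=1258508063, remaining=15356194816, total=17184923648)
Migration running for 60 secs, memory 87% remaining; (bytes processed=1489083638, remaining=15035801600, total=17184923648)
Migration running for 90 secs, memory 86% remaining; (bytes processed=1689004421, remaining=14802731008, total=17184923648)
"""

SAMPLE = re.compile(
    r"[Mm]igration running for (\d+) secs, memory (\d+)% remaining; "
    r"\(bytes processed=(\d+), remaining=(\d+), total=(\d+)\)"
)

def parse_samples(text):
    """Return (secs, percent, processed, remaining, total) integer tuples."""
    return [tuple(int(x) for x in m.groups()) for m in SAMPLE.finditer(text)]

def interval_rates(samples):
    """MB/s by which `remaining` shrinks between consecutive samples."""
    rates = []
    for (t0, _, _, rem0, _), (t1, _, _, rem1, _) in zip(samples, samples[1:]):
        rates.append((rem0 - rem1) / (t1 - t0) / 1e6)
    return rates

def minutes_at(total_bytes, gb_per_minute):
    """Minutes needed to move total_bytes once at a constant rate."""
    return total_bytes / 1e9 / gb_per_minute

samples = parse_samples(LOG)
rates = interval_rates(samples)
total = samples[0][4]
print([f"{r:.2f} MB/s" for r in rates])
print(f"{minutes_at(total, 1.4):.1f} min at 1.4 GB/min")
```

On these samples `remaining` shrinks by roughly 10.7 MB/s and then 7.8 MB/s, and moving the full `total` at 1.4 GB per minute works out to a little over 12 minutes, in line with the estimate above.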


participants (6)

Fabian Zimmermann
Felix Hüttner
Gorka Eguileor
Ignazio Cassano
Radosław Piliszek
Sean Mooney
