How to debug silent live migration errors
How would I debug silent (or mostly silent) live migration errors? We're using the Stein release of Canonical's Charmed OpenStack. I have configured it for live migration per the instructions at this link:
https://docs.openstack.org/nova/pike/admin/configuring-migrations.html#secti...
Specifically:
1. I did not specify vncserver_listen=0.0.0.0 in nova.conf because we are not running VNC on our instances
2. instances_path is /var/lib/nova/instances on all compute nodes
3. I believe that MAAS is "the sole provider of DHCP and DNS for the network hosting the MAAS cluster", per https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/lates...
4. Identical authorized_keys files are present on all compute nodes with keys from all compute nodes by default
5. I manually configured the firewalls on all compute nodes to allow libvirt to communicate between compute hosts with:
sudo ufw allow 49152:49261/tcp
6. The following settings are specified in nova.conf on each compute node:
live_migration_downtime = 500
live_migration_downtime_steps = 10
live_migration_downtime_delay = 75
live_migration_permit_post_copy=true
Here's what happens when I try to Live Migrate from the Horizon Dashboard:
1. As admin, in the Admin --> Instances menu, I select the dropdown arrow to the right of the instance. Live Migrate Instance appears (but in black, unlike Migrate Instance, which appears in red). I select Live Migrate Instance, and whether I choose Automatically schedule new host or manually select a new host, the Task column says "Migrating" and then it stops and reverts to None. The server never changes. The Action Log shows the live migration request but the Message column is blank.
2. I do the very same thing but this time select Disk Over Commit. Same results. Migrating reverts back to None and the server never changes.
3. I do the very same thing but this time select Block Migration. This time I do get an error: "Failed to live migrate instance to host 'AUTO_SCHEDULE'". And this time the Action Log has "Error" in the Message column.
Same behavior with the CLI. For example, the CLI command below completes silently, yet the server for the instance never changes.
john@vm-dev-john:~/bin$ openstack server migrate <instanceID> --live <newServerName>
[Silent failure]
john@vm-dev-john:~/bin$ openstack server show <instanceID>
[Still running on original server]
Note that I *can* successfully Migrate, both using the Horizon Dashboard and the CLI. What fails is Live Migration. I just have no idea why, and no error is displayed in the Action Log for the instance.
For reference, the instance is an m1.small with 2GB of RAM, 1 VCPU, and a 20GB Cinder disk volume attached on /dev/vda.
Any and all debugging ideas would be most welcome. Without logs I am simply guessing in the dark at this point.
Thanks! Enjoy!
John M. Linebarger, PhD, MBA
Principal Member of Technical Staff
Sandia National Laboratories
(Office) 505-845-8282
(Cell) 505-681-4879
Live migration is an asynchronous operation, so without --wait on the command line the client returns as soon as the API responds with 202 to indicate that the request was accepted [1].
As an admin you can use the server migrations API [2] to track the status of the migration via openstackclient:
$ openstack server migration list --server $instance_uuid
$ openstack server migration show $instance_uuid $migration_id
You can also use the event list to find the specific request ID associated with the live migration and trace it through your logs:
$ openstack server event list $instance_uuid
$ openstack server event show $instance_uuid $request_id
Hope that helps,
Lee
[1] https://docs.openstack.org/api-ref/compute/?expanded=live-migrate-server-os-...
[2] https://docs.openstack.org/api-ref/compute/?expanded=show-migration-details-...
[3] https://docs.openstack.org/api-guide/compute/faults.html
On Tue, 30 Mar 2021 at 14:33, Linebarger, John <jmlineb@sandia.gov> wrote:
> How would I debug silent (or mostly silent) live migration errors? [...]
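Lee's suggestion to track the migration through the server migrations API could be scripted roughly as follows. This is a minimal sketch, not a tested tool: the function names are hypothetical, the list of terminal statuses is an assumption, and it assumes admin credentials are already loaded and that the installed openstackclient provides "server migration list" with the usual -f/-c output options.

```shell
#!/bin/sh
# Sketch only: poll a server's migration records until one reaches a terminal
# state. Assumes admin credentials are already exported in the environment.

# Succeeds if the status is one this sketch treats as terminal (assumed list).
is_terminal_status() {
  case "$1" in
    completed|error|failed|cancelled) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the Status column of the first migration row returned for a server.
# Assumption: the first row is the most recent migration; verify on your cloud.
first_migration_status() {
  openstack server migration list --server "$1" -f value -c Status | head -n 1
}

# Polls up to $2 times (default 60), sleeping $3 seconds (default 5) between polls.
# Returns 0 on "completed", 1 on any other terminal status, 2 on timeout.
watch_migration() {
  wm_server="$1"
  wm_attempts="${2:-60}"
  wm_delay="${3:-5}"
  wm_i=0
  while [ "$wm_i" -lt "$wm_attempts" ]; do
    wm_status="$(first_migration_status "$wm_server")"
    echo "migration status: ${wm_status:-none}"
    if is_terminal_status "$wm_status"; then
      if [ "$wm_status" = "completed" ]; then return 0; fi
      return 1
    fi
    wm_i=$((wm_i + 1))
    sleep "$wm_delay"
  done
  echo "no terminal status after $wm_attempts polls" >&2
  return 2
}
```

After issuing the live migration, something like `watch_migration "$instance_uuid"` would print each observed status. A return code of 1 means the migration record ended in a non-success state, which is the cue to pull the request ID from the event list and go to the logs.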
John;
I recently had to work through a similar issue, though I am working with Victoria, so take this with a grain of salt.
I finally found the correct path by looking in the hypervisor's logs on the machines sending and receiving the live migration. For us that is KVM.
Thank you,
Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International Inc.
DHilsbos@PerformAir.com
www.PerformAir.com
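Checking the logs on both the sending and the receiving host, as Dominic describes, might look like the sketch below. The helper name is hypothetical, and the log paths in the comments are assumed Ubuntu/libvirt defaults rather than anything confirmed in this thread, so adjust them to wherever your deployment writes nova-compute and libvirt/QEMU logs.

```shell
#!/bin/sh
# Sketch only: search log files for a request ID or instance UUID.
# Run it on both the source and the destination compute host.

# Prints every line containing the fixed string $1 from each readable file in
# the remaining arguments, prefixed with the file name. Unreadable or missing
# files are skipped, and "no match" is not treated as an error.
search_logs() {
  sl_needle="$1"
  shift
  for sl_file in "$@"; do
    [ -r "$sl_file" ] || continue
    grep -HF -- "$sl_needle" "$sl_file" || true
  done
  return 0
}

# Example invocations (paths are assumptions, not taken from the thread):
#   search_logs "$request_id" /var/log/nova/nova-compute.log
#   search_logs "$instance_uuid" /var/log/nova/nova-compute.log /var/log/libvirt/qemu/*.log
```

Searching for the request ID from `openstack server event list` narrows the nova-compute log to the one failed operation, while searching for the instance UUID casts a wider net across the hypervisor-side logs.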
participants (3)
- DHilsbos@performair.com
- Lee Yarwood
- Linebarger, John