How to debug silent live migration errors
DHilsbos at performair.com
Wed Mar 31 17:55:17 UTC 2021
John;
I recently had to work through a similar issue, though I am working with Victoria, so take this with a grain of salt.
I finally found the correct path by looking in the hypervisor's logs on the machines sending and receiving the live migration. For us that is KVM.
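As a rough illustration of that log check, here is a minimal Python sketch that scans a compute node's logs for lines that look like problems. The log locations are assumptions (typical defaults for a libvirt/KVM node and a packaged nova-compute), not paths confirmed in this thread, and the keyword list is only a starting point. Run it on both the source and the destination host.

```python
import glob
import re

# Assumed default log locations on a libvirt/KVM compute node; adjust as needed.
DEFAULT_LOG_GLOBS = [
    "/var/log/libvirt/qemu/*.log",      # per-instance QEMU logs (assumed path)
    "/var/log/nova/nova-compute.log",   # nova-compute service log (assumed path)
]

# Case-insensitive keywords that usually mark a line worth reading.
PROBLEM_PATTERN = re.compile(r"error|fail|denied|refused|timed? ?out", re.IGNORECASE)


def find_problem_lines(lines, pattern=PROBLEM_PATTERN):
    """Return (line_number, text) pairs for every line matching the pattern."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


def scan_logs(log_globs=DEFAULT_LOG_GLOBS):
    """Print matching lines from every log file that can be read."""
    for log_glob in log_globs:
        for path in sorted(glob.glob(log_glob)):
            try:
                with open(path, errors="replace") as handle:
                    hits = find_problem_lines(handle)
            except OSError as exc:  # for example, permission denied without sudo
                print(f"{path}: cannot read ({exc})")
                continue
            for number, text in hits:
                print(f"{path}:{number}: {text}")


if __name__ == "__main__":
    scan_logs()
```

Comparing the timestamps of the hits on the sending and receiving hosts against the time of the migration attempt narrows things down quickly.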
Thank you,
Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International Inc.
DHilsbos at PerformAir.com
www.PerformAir.com
From: Linebarger, John [mailto:jmlineb at sandia.gov]
Sent: Tuesday, March 30, 2021 6:24 AM
To: openstack-discuss at lists.openstack.org
Cc: Hostetler, Sarah N; Shurtz, Peter; Urbaniak, Kendrick
Subject: How to debug silent live migration errors
How would I debug silent (or mostly silent) live migration errors? We're using the Stein release of Canonical's Charmed OpenStack. I have configured it for live migration per the instructions at this link:
https://docs.openstack.org/nova/pike/admin/configuring-migrations.html#section-configuring-compute-migrations
Specifically:
1. I did not specify vncserver_listen=0.0.0.0 in nova.conf because we are not running VNC on our instances
2. instances_path is /var/lib/nova/instances on all compute nodes
3. I believe that MAAS is "the sole provider of DHCP and DNS for the network hosting the MAAS cluster", per https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/latest/install-maas.html
4. Identical authorized_keys files are present on all compute nodes with keys from all compute nodes by default
5. I manually configured the firewalls on all compute nodes to allow libvirt to communicate between compute hosts with:
    sudo ufw allow 49152:49261/tcp
6. The following settings are specified in nova.conf on each compute node:
    live_migration_downtime = 500
    live_migration_downtime_steps = 10
    live_migration_downtime_delay = 75
    live_migration_permit_post_copy = true
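One thing worth ruling out for item 6 is where in nova.conf those options ended up. As far as I know, Nova reads the live_migration_* options from the [libvirt] group, so a copy placed under another section would be ignored without any error. The sketch below is a minimal check under that assumption; the function name is hypothetical and the file path in the usage note is the usual default, not something stated in this thread.

```python
import configparser


def find_live_migration_options(conf_text):
    """Map (section, option) -> value for every live_migration_* option found."""
    parser = configparser.ConfigParser(
        interpolation=None,            # keep '%' characters literal
        strict=False,                  # tolerate repeated sections and keys
        default_section="__unused__",  # treat [DEFAULT] as an ordinary section
    )
    parser.read_string(conf_text)
    found = {}
    for section in parser.sections():
        for key, value in parser.items(section):
            if key.startswith("live_migration"):
                found[(section, key)] = value
    return found


def report(conf_text):
    """Print each option and flag any that sit outside the [libvirt] group."""
    for (section, key), value in sorted(find_live_migration_options(conf_text).items()):
        note = "" if section == "libvirt" else "  <-- not in [libvirt]"
        print(f"[{section}] {key} = {value}{note}")
```

To use it, read the file on each compute node (for example `report(open("/etc/nova/nova.conf").read())`) and compare the output across nodes.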
Here's what happens when I try to Live Migrate from the Horizon Dashboard:
1. As admin, in the Admin --> Instances menu, I select the dropdown arrow to the right of the instance. "Live Migrate Instance" appears (but in black, unlike "Migrate Instance", which appears in red). I select "Live Migrate Instance". Whether I choose "Automatically schedule new host" or manually select a new host, the Task column says "Migrating", then stops and reverts to None. The server never changes. The Action Log shows the live migration request, but the Message column is blank.
2. I do the very same thing but this time select Disk Over Commit. Same results. Migrating reverts back to None and the server never changes.
3. I do the very same thing but this time select Block Migration. This time I do get an error: "Failed to live migrate instance to host 'AUTO_SCHEDULE'". And this time the Action Log has "Error" in the Message column.
Same behavior with the CLI. For example, this CLI command below completes silently, yet the server for the instance never changes.
john@vm-dev-john:~/bin$ openstack server migrate <instanceID> --live <newServerName>
[Silent failure]
john@vm-dev-john:~/bin$ openstack server show <instanceID>
[Still running on original server]
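Since the CLI gives no feedback, it can help to record the instance's host, status and task state before and after the attempt. The following is a minimal sketch that wraps `openstack server show -f json`. The field names (`OS-EXT-SRV-ATTR:host`, `OS-EXT-STS:task_state`) are what an admin typically sees from the client of this era and should be treated as assumptions, and the helper names are hypothetical.

```python
import json
import subprocess

# Field names as typically shown to an admin; treat them as assumptions.
HOST_FIELD = "OS-EXT-SRV-ATTR:host"
TASK_FIELD = "OS-EXT-STS:task_state"


def server_show(server_id):
    """Run `openstack server show <id> -f json` and return the parsed fields."""
    result = subprocess.run(
        ["openstack", "server", "show", server_id, "-f", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def summarize(server):
    """Keep only the fields that matter when watching a live migration."""
    return {
        "host": server.get(HOST_FIELD),
        "status": server.get("status"),
        "task_state": server.get(TASK_FIELD),
    }


def host_changed(before, after):
    """True only if both snapshots name a host and the two hosts differ."""
    old, new = before.get(HOST_FIELD), after.get(HOST_FIELD)
    return old is not None and new is not None and old != new
```

Call `server_show("<instanceID>")` once before the migrate command and again a minute or so after it, then compare the two with `host_changed`. If your client supports it, `openstack server event list <instanceID>` should also show the request recorded for the live migration attempt, which gives a timestamp to search for in the compute logs.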
Note that I *can* successfully Migrate, both using the Horizon Dashboard and the CLI. What fails is Live Migration. I just have no idea why, and no error is displayed in the Action Log for the instance.
For reference, the instance is an m1.small with 2GB of RAM, 1 VCPU, and a 20GB Cinder disk volume attached on /dev/vda.
Any and all debugging ideas would be most welcome. Without logs I am simply guessing in the dark at this point.
Thanks! Enjoy!
John M. Linebarger, PhD, MBA
Principal Member of Technical Staff
Sandia National Laboratories
(Office) 505-845-8282
(Cell) 505-681-4879