OpenStack Nova live migration pain
This is the OpenStack Train release. Nova live migration is extremely painful to deal with. Every time we complete a live migration, the nova-compute service on the source host is still up and running, but it stops incrementing the report_count field in the nova database (the host heartbeat), so the controller decides the hypervisor host has failed via its is_up() check on the report_count and updated_at fields, IIRC. So we end up having to manually migrate one VM at a time, then restart the service, then manually migrate the next VM...

Any ideas? I already tried setting debug=True in nova.conf, even for database tracing, but so far I could not find any obvious error message. Each live migration (no shared storage) succeeds, but each time we have to restart the nova-compute service. This is so bad...

Any suggestions on this would be highly appreciated.

Thanks,
Hai
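For reference, the heartbeat described above can be watched directly; the following is just a sketch, assuming the default database name (nova), a MariaDB/MySQL client on the controller, and admin credentials loaded for the openstack client:

$ openstack compute service list --service nova-compute
$ mysql -u root -p nova -e 'SELECT host, `binary`, report_count, updated_at FROM services;'

If report_count and updated_at stop advancing for the source host after a migration while the nova-compute process is still running, that matches the is_up() failure described here.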
I also tried to switch from using the "Database ServiceGroup driver" to using the "Memcache ServiceGroup driver" per the doc at https://docs.openstack.org/nova/rocky/admin/service-groups.html, by modifying the following entries in nova.conf on all hosts, including the controller host:

servicegroup_driver = "mc"
memcached_servers = <None>
service_down_time = 60

But it failed: the controller host could not see that any host's nova-compute service was actually up... So I am stuck with having to use the database servicegroup driver...
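One thing worth double-checking there: memcached_servers = <None> is the literal default shown on that doc page, not a usable value, so the driver has no memcached endpoint to talk to. A rough sketch of a working configuration is below; the controller1:11211 endpoint is a placeholder, and depending on the exact Nova release the memcache settings may be read from the [cache] section rather than [DEFAULT], so check the configuration reference for your version:

[DEFAULT]
servicegroup_driver = mc
service_down_time = 60

[cache]
enabled = true
backend = oslo_cache.memcache_pool
memcache_servers = controller1:11211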
Hi,

I experienced the same kind of trouble with Nova as well, but this seems to be fixed in more recent versions. What is the exact point release that you're running?

Cheers,

Thomas Goirand (zigo)
From the controller, this command says 4.0.0:

$ openstack --version
openstack 4.0.0

Is this the one you are looking for? We are currently using the Debian native OpenStack deb packages for this Train release.
That's reporting the version of OpenStackClient you have installed. More likely you should be looking at something like the output of `dpkg -l python3-nova` to see what version of Nova you have installed on the controller.

-- Jeremy Stanley
Here is what we use for python3-nova:

$ dpkg -l python3-nova
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=================================
ii  python3-nova   2:18.1.0-6   all          OpenStack Compute - libraries

Here is what we have for everything with 'nova' in the package name:

$ dpkg -l | grep nova
ii  nova-common         2:18.1.0-6          all  OpenStack Compute - common files
ii  nova-compute        2:18.1.0-6          all  OpenStack Compute - compute node
ii  nova-compute-kvm    2:18.1.0-6          all  OpenStack Compute - compute node (KVM)
ii  python3-nova        2:18.1.0-6          all  OpenStack Compute - libraries
ii  python3-novaclient  2:15.1.0-1~bpo10+1  all  client library for OpenStack Compute API - 3.x
In your first message you stated "this is Openstack train release," but that's a version of Nova from Rocky (originally released a year before Train). It looks like you're using packages which come as part of normal Debian 10 (buster), rather than using something like buster-train-backports from osbpo.debian.net.

What made you think it was OpenStack Train, and could this be part of the problem? Do you maybe have a mixed deployment with some components from Train and some from Rocky?

-- Jeremy Stanley
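A quick way to confirm which suite each package is actually coming from, and whether the osbpo repository is configured at all, is something like the following (just a sketch; adjust the package list to whatever is installed on the node):

$ apt policy nova-common nova-compute python3-nova
$ grep -r train /etc/apt/sources.list /etc/apt/sources.list.d/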
I am new to the current environment. I just asked around, and you are correct: we had to use a mix of both buster-train-backports and buster, because some of the packages were breaking system scripts, and there was one python-nova that caused rabbitmq to get into an endless loop...

Is it possible to just upgrade this package to some later package from Train, one which does not have this bug, in order to work around this particular buggy state? We could do some tests here; we have a test system which is built the same way, and it is suffering from the exact same live migration issue...
Hi,

Could you be more precise, please? What package is breaking which system scripts? I have never heard of something like this... I have never heard about what you wrote on "python-nova that caused the rabbitmq to get into an endless loop" either. We've been running Nova from the Train release for quite some time now, and we never experienced this.

Also, we did have the issue you described with Nova from Rocky, but it went away after upgrading to Train.
Yes, you can upgrade, but not directly from Rocky to Train: you must update your control plane to Stein first, do the db-sync (i.e. upgrade your db schema), then upgrade to Train and re-do the db-sync. In fact, we did this in production. While it took 6 hours, it worked well, without a single issue.

BTW, it's nice that we got in touch, though I regret you didn't do it earlier about the other issues you described. We (the Debian OpenStack package maintainers) don't bite and are happy to help! :)

Cheers,
Thomas Goirand (zigo)
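For the db-sync steps described above, the usual commands on the controller look roughly like this (a sketch only; they are run after upgrading the control-plane packages at each step, and the online_data_migrations pass is the usual companion between releases):

$ nova-manage api_db sync
$ nova-manage db sync
$ nova-manage db online_data_migrations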
Thanks a lot! I am not sure about the details of the issues hit before; it might have something to do with certain internally available Debian native python3 packages conflicting with the ones from buster-train-backports.

I am not sure about the current state of our environment; maybe its db schema is already the one for the Train release. Is there any way to verify that (without changing anything), in order to see the current state, like issuing some SQL query to see what's in place for the db schema, and issuing some dpkg command to see what kind of OpenStack related packages are currently in place? I am hoping that we might already have the db schema for the Train release, and we just need to upgrade some python3 packages from buster-train-backports... I understand that some db schema might come from certain packages, and if those packages do not match the ones from buster-train-backports, then we might have to go the db-sync path later.

Thanks,
Hai
Someone more familiar than me with the nova db schema should be able to answer this question. However...
... it doesn't really matter; the db-sync is supposed to be idempotent, so you can:
- upgrade to Stein
- run the db-sync in Stein
- upgrade to Train
- run the db-sync in Train

If you've already done the Train db-sync, then the above db-sync will do nothing and that's it... you'll have performed a working upgrade anyway.

Cheers,
Thomas Goirand (zigo)
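Regarding the earlier question about verifying the current state without changing anything, something like the following should show both the schema version and the installed packages (a sketch; the numbers printed are internal migration version numbers, which can be compared against what a known-good Train controller reports):

$ nova-manage db version
$ nova-manage api_db version
$ dpkg -l | grep -E 'nova|placement'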
Yep, running db sync should be safe if you have already run it.
Also, Nova technically does not support mixing versions that are more than one upstream release apart, e.g. running some Nova Rocky components with others using Train. You can, if you really know what you are doing, make this work in some cases, but it is not advised and entirely untested upstream.

If the RPC versions are pinned correctly on all nodes, the controllers are a newer version than the computes, and all controllers are the same version, it technically can function correctly. But each controller must run exactly the same version, and some features like the NUMA live migration code will automatically disable themselves until all compute services are on Train. Upgrading compute nodes before controllers is entirely unsupported and not intended to work. So I would strongly suggest you try to align all hosts to Train and see if that resolves your issue.

The compute agent locking up when a live migration happens had two causes that I know of in the past: one was a long-running IO operation that blocked the main thread, and the other was due to not properly proxying all libvirt objects, which again caused the main thread to block on a call to libvirt. Both were fixed, so it's probably that your Rocky nodes are just missing those fixes.
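For the RPC version pinning mentioned above, the relevant knob is the [upgrade_levels] section in nova.conf on the controller nodes; a minimal sketch while the computes are still behind:

[upgrade_levels]
compute = auto

With auto, the compute RPC version cap is derived from the lowest nova-compute service version found in the deployment; an explicit release name such as rocky also works.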
On 9/10/21 9:50 PM, hai wu wrote:
From the controller, this command says 4.0.0:

$ openstack --version
openstack 4.0.0

Is this the one you are looking for? We are currently using the Debian native OpenStack deb packages for this Train release.
What you've displayed above is the version of python3-openstackclient. Please log in to the controller machine and run:

$ dpkg -l python3-nova

Cheers,
Thomas Goirand (zigo)
participants (4)
- hai wu
- Jeremy Stanley
- Sean Mooney
- Thomas Goirand