Live migration fails

Szabo, Istvan (Agoda) Istvan.Szabo at agoda.com
Wed Apr 28 03:36:09 UTC 2021


Hi,

Answering your questions:
1. No, the old compute node name is still in the list.
2. This is the log for that machine ID on that day: https://justpaste.it/7usq5
Maybe this network client issue on the destination host is the cause? You can also see that the memory copy took too long and restarted from the beginning many times.

You said you don't support changes in the DB.
Actually, I only change this value for a compute node that we are planning to drain, and of course I try to avoid touching it.
Also, I think it is not the compute node entry in the DB; it is in the instances table in the nova DB. Please have a look at the screenshot of the entry: https://i.ibb.co/WKB3sGM/Capture.png

We are using the command-line clients, not direct API calls, but I guess the result is the same.

When it was in the active/migrating state and we tried it, we got this:

1. nova live-migration-force-complete 6b3c5ef1-293a-426d-89e5-230f59c2d06f 870
ERROR (BadRequest): Migration 870 state of instance 6b3c5ef1-293a-426d-89e5-230f59c2d06f is completed. Cannot force complete while the migration is in this state. (HTTP 400) (Request-ID: req-e409114b-c4ec-4f25-8ff0-d9dc34460bc9)

2. After I restarted the nova-compute service on the source compute node, it put the machine into the error state; however, the VM is still running.
nova live-migration-force-complete 6b3c5ef1-293a-426d-89e5-230f59c2d06f 870
ERROR (Conflict): Cannot 'force_complete' instance 6b3c5ef1-293a-426d-89e5-230f59c2d06f while it is in vm_state error (HTTP 409) (Request-ID: req-2f0a2fcd-62be-44ae-bafc-cac952a63c82)

3. After I changed the state back to active and tried again, I got this:
nova live-migration-force-complete 6b3c5ef1-293a-426d-89e5-230f59c2d06f 870
ERROR (Conflict): Instance 6b3c5ef1-293a-426d-89e5-230f59c2d06f is in an invalid state for 'force_complete' (HTTP 409) (Request-ID: req-3d1631e9-973a-4549-a998-fb1d3d95572a)

Just to summarize your options:

1.
if the instance.host matches the host on which it is now running then you should be able to set the status and task state back to active/migrating respectively, at which point you can force complete the migration.

This is not our case unfortunately ☹

2.
if the vm is running correctly on the destination host and the instance.host is set correctly, it might just be simpler to update the migration record to complete and ensure the task state is set to none on the instance.

Does it have to be done in the DB? I haven't really found an option to update the migration record. Or do you mean the force complete?

3.
if the instance.host still has the source host but it's running on the dest host, then you should update it to reflect the correct host, then mark the migration as complete.

I guess this is also the force complete, right? Change the host and node in the instances table and then force complete?

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo at agoda.com
---------------------------------------------------

-----Original Message-----
From: Sean Mooney <smooney at redhat.com> 
Sent: Tuesday, April 27, 2021 5:54 PM
To: openstack-discuss at lists.openstack.org
Subject: Re: Live migration fails

On Tue, 2021-04-27 at 09:17 +0000, Szabo, Istvan (Agoda) wrote:
> Hi,
> 
> We are trying to live migrate instances off compute nodes and to automate this, but it seems we can't really do it when a migration gets stuck.
> Let me explain the issue a bit:
> 
> 1. We initiate live migration
> 2. The live migration finished; the machine disappeared from the /var/lib/nova/instances/<server id> directory on the source server.
> 3. But when I query it or look in Horizon, it is stuck in the migrating phase. We collected information such as the migration ID and tried to force it, but it is already finished and can't be forced to complete.
> 4. I restarted the nova service on the source node; that just put the machine into the error state, and the force complete does not work either.
> 5. I changed the state from error to active, but it still can't be force completed.
> 
> What can I do to change the name of the compute node in the DB?
> 

you should not change the name of the compute node in the db.
we do not support changing the compute node name if it has instances on it.
if you meant in the migration record, you also should not change it, as the resources would not be claimed correctly.

>  How can I force it without touching the db?
> 
i don't think you can fix it without touching the db.

so if the vm is removed from the source node there are 2 things you should check:
1. is the instance.host set to the dest host where it is now running?
2. if you look in the logs, was there an error in post live migrate?

basically, what i think is the most likely issue is that an operation in post live migrate failed before the migration record was set to complete.

the preconditions for force complete are:
The server OS-EXT-STS:vm_state value must be active and the server OS-EXT-STS:task_state value must be migrating.
https://docs.openstack.org/api-ref/compute/?expanded=force-migration-complete-action-force-complete-action-detail#force-migration-complete-action-force-complete-action
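
As a rough illustration of those preconditions, here is a minimal Python sketch. The function name and its arguments are hypothetical and not part of nova. The vm_state and task_state checks come from the API text above; the check that the migration itself is still "running" is an assumption based on the first error quoted earlier in this thread ("Migration 870 state ... is completed").

```python
def force_complete_blocker(vm_state, task_state, migration_status):
    """Return None if force complete looks allowed, else a short reason.

    Hypothetical helper for illustration only; it is not a nova API.
    """
    # Preconditions taken from the API reference text quoted above.
    if vm_state != "active":
        return "vm_state is %r, expected 'active'" % (vm_state,)
    if task_state != "migrating":
        return "task_state is %r, expected 'migrating'" % (task_state,)
    # Assumption: only a migration that is still running can be forced.
    if migration_status != "running":
        return "migration status is %r, expected 'running'" % (migration_status,)
    return None


# The situations described in this thread, as far as they can be inferred:
print(force_complete_blocker("active", "migrating", "completed"))
print(force_complete_blocker("error", None, "completed"))
```

The values passed in would have to be read from the server details (OS-EXT-STS:vm_state, OS-EXT-STS:task_state) and from the migration record; how they are fetched is left out on purpose.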

if the instance.host matches the host on which it is now running then you should be able to set the status and task state back to active/migrating respectively, at which point you can force complete the migration.

if the vm is running correctly on the destination host and the instance.host is set correctly, it might just be simpler to update the migration record to complete and ensure the task state is set to none on the instance.

if the instance.host still has the source host but it's running on the dest host, then you should update it to reflect the correct host, then mark the migration as complete.

all of the above will require at least some db modifications.
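
The three cases above could be summarized as a small decision helper. This is only a sketch of one reading of them: the function and argument names are hypothetical, it only returns advice strings and touches nothing, and treating the first case as "the vm is not on the destination, but is running where instance.host says" is an assumption rather than something stated above.

```python
def suggest_recovery(instance_host, running_host, dest_host):
    """Map the three cases described above to a list of suggested steps.

    Hypothetical helper: it does not talk to nova or to the database.
    """
    if running_host == dest_host:
        if instance_host == dest_host:
            # vm runs on the destination and instance.host is already correct.
            return ["mark the migration record as complete",
                    "set the instance task state to none"]
        # vm runs on the destination but instance.host still names the source.
        return ["update instance.host to the destination host",
                "mark the migration record as complete"]
    if instance_host == running_host:
        # Assumed reading of the first case: vm is where instance.host says.
        return ["set vm_state/task_state back to active/migrating",
                "force complete the migration"]
    return ["unclear: check the logs for an error in post live migrate"]


# Placeholder host names, not taken from this thread:
print(suggest_recovery("src-host", "dst-host", "dst-host"))
```

Every branch still ends in a manual step, which matches the point that none of these can be done without some db modification.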

>  

> 
> The goal is to automate compute node draining with as little user intervention as possible.
> 
> 
> -----Original Message-----
> From: Szabo, Istvan (Agoda) <Istvan.Szabo at agoda.com>
> Sent: Friday, April 23, 2021 9:13 AM
> To: Sean Mooney <smooney at redhat.com>; 
> openstack-discuss at lists.openstack.org
> Subject: RE: Live migration fails
> 
> My /etc/hostname has only the short name.
> The nova.conf host value is also the short name.
> The host has been selected by the scheduler: nova live-migration 
> --block-migrate 1517a2ac-3b51-4d8d-80b3-89a5614d1ae0
> 
> What was changed is the node field of the VM in the instances table in the nova DB. So I didn't actually change the compute host value; I only edited the VM's value.
> 
> 
> -----Original Message-----
> From: Sean Mooney <smooney at redhat.com>
> Sent: Thursday, April 22, 2021 4:13 PM
> To: openstack-discuss at lists.openstack.org
> Subject: Re: Live migration fails
> 
> On Thu, 2021-04-22 at 06:01 +0000, Eugen Block wrote:
> > Yeah, the column "node" has the FQDN in my DB, too, only "host" is 
> > the short name. The question is how did the short name get into the "node"
> > column, but it will probably be difficult to get to the bottom of that.
> well, by default we do not expect to have FQDNs in either field.
> nova's default for both is the hostname of the host, which will be the short name, not the fqdn, unless you set an fqdn in /etc/hostname, which is not generally the recommended practice.
> 
> nova in general does not support changing the hostname (/etc/hostname) of a host, and you should avoid changing the "host" value in the nova.conf too.
> 
> changing these values can result in the creation of additional placement RPs, compute service records and compute nodes, and that can result in a hard-to-fix situation where old vms are using one set of resources and new vms are using the updated ones.
> 
> so you should not modify either value in the db.
> 
> did you perhaps specify a host when live migrating and just pass the wrong value, or was the host selected by the scheduler?
> > 
> > 
> > Quoting "Szabo, Istvan (Agoda)" <Istvan.Szabo at agoda.com>:
> > 
> > > I think I found the issue: in the instances table of the nova DB, the
> > > compute node name in the node column somehow changed to the short
> > > hostname. It works with the FQDN but it doesn't work with the short
> > > name ☹ I hope I don't mess anything up if I change it to the FQDN to
> > > make it work.
> > > 
> > > 
> > > -----Original Message-----
> > > From: Szabo, Istvan (Agoda) <Istvan.Szabo at agoda.com>
> > > Sent: Thursday, April 22, 2021 11:19 AM
> > > To: Eugen Block <eblock at nde.ag>
> > > Cc: openstack-discuss at lists.openstack.org
> > > Subject: RE: Live migration fails
> > > 
> > > Sorry, in the log I haven't commented out the servername ☹ it is
> > > xy-osfecn-40250
> > > 
> > > 
> > > -----Original Message-----
> > > From: Eugen Block <eblock at nde.ag>
> > > Sent: Wednesday, April 21, 2021 5:37 PM
> > > To: Szabo, Istvan (Agoda) <Istvan.Szabo at agoda.com>
> > > Cc: openstack-discuss at lists.openstack.org
> > > Subject: Re: Live migration fails
> > > 
> > > The error message seems correct, I can't find am-osfecn-4025 
> > > either in the list of compute nodes. Can you check in the database 
> > > if there's an active instance (or several) allocated to that 
> > > compute node? In that case you would need to correct the 
> > > allocation in order for the migration to work.
> > > 
> > > 
> > > > Quoting "Szabo, Istvan (Agoda)" <Istvan.Szabo at agoda.com>:
> > > 
> > > > Sure:
> > > > 
> > > > https://jpst.it/2u3uh
> > > > 
> > > > > These are the ones where we can't live migrate:
> > > > xy-osfecn-40250
> > > > xy-osfecn-40281
> > > > xy-osfecn-40290
> > > > xy-osbecn-40073
> > > > xy-osfecn-40238
> > > > 
> > > > > The compute services are disabled on these because we don't want
> > > > > anybody to spawn a VM on them anymore, so we want to evacuate all VMs.
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: Eugen Block <eblock at nde.ag>
> > > > Sent: Wednesday, April 21, 2021 3:26 PM
> > > > To: openstack-discuss at lists.openstack.org
> > > > Subject: Re: Live migration fails
> > > > 
> > > > Hi,
> > > > 
> > > > can you share the output of these commands?
> > > > 
> > > > nova-manage cell_v2 list_hosts
> > > > openstack compute service list
> > > > 
> > > > 
> > > > > Quoting "Szabo, Istvan (Agoda)" <Istvan.Szabo at agoda.com>:
> > > > 
> > > > > Hi,
> > > > > 
> > > > > > I have a couple of compute nodes where live migration fails
> > > > > > with existing VMs.
> > > > > > When I quickly spawn a VM and try live migration it works, so I
> > > > > > assume there shouldn't be a big problem with the compute node.
> > > > > > However, I have many existing VMs where it fails with a
> > > > > > servername not found error.
> > > > > 
> > > > > /var/log/nova/nova-conductor.log:2021-04-21 14:47:12.605 
> > > > > 227612 ERROR nova.conductor.tasks.migrate
> > > > > [req-f4067a26-a233-4673-8c07-9a8a290980b0
> > > > > dce35e6eceea4312bb0baa0510cef363 
> > > > > ca7e35079f4440c78bd9870724b9638b - default default] [instance:
> > > > > 1517a2ac-3b51-4d8d-80b3-89a5614d1ae0]
> > > > > Unable to find record for source node servername on servername:
> > > > > ComputeHostNotFound: Compute host servername could not be found.
> > > > > /var/log/nova/nova-conductor.log:2021-04-21 14:47:12.605 
> > > > > 227612 WARNING nova.scheduler.utils
> > > > > [req-f4067a26-a233-4673-8c07-9a8a290980b0
> > > > > dce35e6eceea4312bb0baa0510cef363 
> > > > > ca7e35079f4440c78bd9870724b9638b - default default] Failed to
> > > > > compute_task_migrate_server: Compute host servername could not 
> > > > > be found.: ComputeHostNotFound: Compute host servername could not be found.
> > > > > /var/log/nova/nova-conductor.log:2021-04-21 14:47:12.605 
> > > > > 227612 WARNING nova.scheduler.utils
> > > > > [req-f4067a26-a233-4673-8c07-9a8a290980b0
> > > > > dce35e6eceea4312bb0baa0510cef363 
> > > > > ca7e35079f4440c78bd9870724b9638b - default default] [instance:
> > > > > 1517a2ac-3b51-4d8d-80b3-89a5614d1ae0]
> > > > > Setting instance to ACTIVE state.: ComputeHostNotFound: 
> > > > > Compute host servername could not be found.
> > > > > /var/log/nova/nova-conductor.log:2021-04-21 14:47:12.672 
> > > > > 227612 ERROR oslo_messaging.rpc.server
> > > > > [req-f4067a26-a233-4673-8c07-9a8a290980b0
> > > > > dce35e6eceea4312bb0baa0510cef363 
> > > > > ca7e35079f4440c78bd9870724b9638b - default default] Exception during message handling:
> > > > > ComputeHostNotFound: Compute host am-osfecn-4025
> > > > > 
> > > > > Tried with this command:
> > > > > 
> > > > > nova live-migration --block-migrate id.
> > > > > 
> > > > > Any idea?
> > > > > 
> > > > > Thank you.
> > > > > 
> > > > > ________________________________ This message is confidential 
> > > > > and is for the sole use of the intended recipient(s). It may 
> > > > > also be privileged or otherwise protected by copyright or 
> > > > > other legal rules. If you have received it by mistake please 
> > > > > let us know by reply email and delete it from your system. It 
> > > > > is prohibited to copy this message or disclose its content to anyone.
> > > > > Any confidentiality or privilege is not waived or lost by any 
> > > > > mistaken delivery or unauthorized disclosure of the message. 
> > > > > All messages sent to and from Agoda may be monitored to ensure 
> > > > > compliance with company policies, to protect the company's 
> > > > > interests and to remove potential malware. Electronic messages 
> > > > > may be intercepted, amended, lost or deleted, or contain viruses.
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 




