[nova] Nova evacuate issue
Hi,

We’re running OpenStack Rocky on a high-availability setup with Neutron Open vSwitch. The setup has roughly 50 compute nodes and 2 controller nodes. We’ve run into an issue when evacuating a dead compute node: the first instance evacuation goes through, but the second one fails (we evacuate our instances one by one). The second one seems to fail because Neutron tries to plug the port back in on the dead compute node, as Nova instructs it to do. Here’s an example of nova-api log output after compute22 died and while we were trying to evacuate an instance.

f3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] Creating event network-vif-unplugged:80371c01-930d-4ea2-9d28-14438e948b65 for instance 4aeb7761-cb23-4c51-93dd-79b55afbc7dc on compute22
2021-01-06 13:31:31.750 2858 INFO nova.osapi_compute.wsgi.server [req-4f9b3e17-1a9d-48f0-961a-bbabdf922ad6 0d0ef3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] 10.30.1.224 "POST /v2.1/os-server-external-events HTTP/1.1" status: 200 len: 1091 time: 0.4987640
2021-01-06 13:31:40.145 2863 INFO nova.osapi_compute.wsgi.server [req-abaac9df-7338-4d10-9326-4006021ff54d 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1 HTTP/1.1" status: 302 len: 318 time: 0.0072701
2021-01-06 13:31:40.156 2863 INFO nova.osapi_compute.wsgi.server [req-c393e74b-a118-4a98-8a83-be6007913dc0 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/ HTTP/1.1" status: 200 len: 789 time: 0.0070350
2021-01-06 13:31:43.289 2865 INFO nova.osapi_compute.wsgi.server [req-b87268b7-a673-44c1-9162-f9564647ec33 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/servers/4aeb7761-cb23-4c51-93dd-79b55afbc7dc HTTP/1.1" status: 200 len: 5654 time: 2.7543190
2021-01-06 13:31:43.413 2863 INFO nova.osapi_compute.wsgi.server [req-4cab23ba-c5cb-4dda-bf42-bc452d004783 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/servers/4aeb7761-cb23-4c51-93dd-79b55afbc7dc/os-volume_attachments HTTP/1.1" status: 200 len: 770 time: 0.1135709
2021-01-06 13:31:43.883 2865 INFO nova.osapi_compute.wsgi.server [req-f5e5a586-65f3-4798-b03b-98e01326a00b 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/flavors/574a7152-f079-4337-b1eb-b7eca4370b73 HTTP/1.1" status: 200 len: 877 time: 0.5751688
2021-01-06 13:31:47.194 2864 INFO nova.api.openstack.compute.server_external_events [req-7e639b1f-8408-4e8e-9bb8-54588290edfe 0d0ef3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] Creating event network-vif-plugged:80371c01-930d-4ea2-9d28-14438e948b65 for instance 4aeb7761-cb23-4c51-93dd-79b55afbc7dc on compute22

As you can see, Nova "creates an event" as the virtual interface is unplugged, but then immediately creates another event to plug the virtual interface back in on the same compute node, which is dead. Meanwhile, the instance is being created on another compute node. Is this a known bug? I have not found anything about it in the bug database. Additionally, I am not able to reproduce it in our staging environment, which is smaller and running on Stein.

Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
4414-4416 Louis B Mayer
Laval, QC, H7P 0G1, Canada
TEL : +1.514.802.1644 - Poste : 2644
FAX : +1.514.612.0678
CA/US : 1.855.774.4678
FR : 01 76 60 41 43
UK : 0808 189 0423
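For context on those two "Creating event" lines: they are Nova recording external events that Neutron posts to nova-api when a port's binding or status changes. A rough sketch of such a call is below; this is an illustration rather than a capture from this environment, $NOVA_API and $OS_TOKEN are placeholders, and the UUIDs are simply reused from the log excerpt above.

  # Illustrative only: Neutron's nova notifier POSTs external events like this.
  # Nova then dispatches the event to the host it currently has on record for
  # the instance (here, still the dead compute22).
  curl -s -X POST "$NOVA_API/v2.1/os-server-external-events" \
       -H "X-Auth-Token: $OS_TOKEN" \
       -H "Content-Type: application/json" \
       -d '{"events": [{"name": "network-vif-plugged",
                        "server_uuid": "4aeb7761-cb23-4c51-93dd-79b55afbc7dc",
                        "tag": "80371c01-930d-4ea2-9d28-14438e948b65",
                        "status": "completed"}]}'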
On 06-01-21 18:26:24, Jean-Philippe Méthot wrote:
Hi,
We’re running OpenStack Rocky on a high-availability setup with Neutron Open vSwitch. The setup has roughly 50 compute nodes and 2 controller nodes. We’ve run into an issue when evacuating a dead compute node: the first instance evacuation goes through, but the second one fails (we evacuate our instances one by one). The second one seems to fail because Neutron tries to plug the port back in on the dead compute node, as Nova instructs it to do. Here’s an example of nova-api log output after compute22 died and while we were trying to evacuate an instance.

f3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] Creating event network-vif-unplugged:80371c01-930d-4ea2-9d28-14438e948b65 for instance 4aeb7761-cb23-4c51-93dd-79b55afbc7dc on compute22
2021-01-06 13:31:31.750 2858 INFO nova.osapi_compute.wsgi.server [req-4f9b3e17-1a9d-48f0-961a-bbabdf922ad6 0d0ef3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] 10.30.1.224 "POST /v2.1/os-server-external-events HTTP/1.1" status: 200 len: 1091 time: 0.4987640
2021-01-06 13:31:40.145 2863 INFO nova.osapi_compute.wsgi.server [req-abaac9df-7338-4d10-9326-4006021ff54d 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1 HTTP/1.1" status: 302 len: 318 time: 0.0072701
2021-01-06 13:31:40.156 2863 INFO nova.osapi_compute.wsgi.server [req-c393e74b-a118-4a98-8a83-be6007913dc0 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/ HTTP/1.1" status: 200 len: 789 time: 0.0070350
2021-01-06 13:31:43.289 2865 INFO nova.osapi_compute.wsgi.server [req-b87268b7-a673-44c1-9162-f9564647ec33 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/servers/4aeb7761-cb23-4c51-93dd-79b55afbc7dc HTTP/1.1" status: 200 len: 5654 time: 2.7543190
2021-01-06 13:31:43.413 2863 INFO nova.osapi_compute.wsgi.server [req-4cab23ba-c5cb-4dda-bf42-bc452d004783 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/servers/4aeb7761-cb23-4c51-93dd-79b55afbc7dc/os-volume_attachments HTTP/1.1" status: 200 len: 770 time: 0.1135709
2021-01-06 13:31:43.883 2865 INFO nova.osapi_compute.wsgi.server [req-f5e5a586-65f3-4798-b03b-98e01326a00b 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/flavors/574a7152-f079-4337-b1eb-b7eca4370b73 HTTP/1.1" status: 200 len: 877 time: 0.5751688
2021-01-06 13:31:47.194 2864 INFO nova.api.openstack.compute.server_external_events [req-7e639b1f-8408-4e8e-9bb8-54588290edfe 0d0ef3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] Creating event network-vif-plugged:80371c01-930d-4ea2-9d28-14438e948b65 for instance 4aeb7761-cb23-4c51-93dd-79b55afbc7dc on compute22

As you can see, Nova "creates an event" as the virtual interface is unplugged, but then immediately creates another event to plug the virtual interface back in on the same compute node, which is dead. Meanwhile, the instance is being created on another compute node. Is this a known bug? I have not found anything about it in the bug database. Additionally, I am not able to reproduce it in our staging environment, which is smaller and running on Stein.
Would you be able to trace an example evacuation request fully and pastebin it somewhere, using `openstack server event list $instance` [1] output to determine the request-id etc.? Feel free to also open a bug about this and we can just triage there instead of the ML.

The fact that q-api has sent the network-vif-plugged:80371c01-930d-4ea2-9d28-14438e948b65 event to n-api suggests that the q-agt is actually alive on compute22, was that the case? Note that a pre-condition of calling the evacuation API is that the source host has been fenced [2].

That all said, I wonder if this is somehow related to the following Stein change:

https://review.opendev.org/c/openstack/nova/+/603844

Cheers,

Lee

[1] https://docs.openstack.org/python-openstackclient/rocky/cli/command-objects/...
[2] https://docs.openstack.org/api-ref/compute/?expanded=evacuate-server-evacuat...

--
Lee Yarwood A5D1 9385 88CB 7E5F BE64 6618 BCA6 6E33 F672 2D76
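A minimal sketch of that kind of trace (the instance UUID is taken from the log excerpt above; <request-id> is a placeholder for whichever request-id the evacuate action shows):

  # List the instance's actions and their request-ids:
  openstack server event list 4aeb7761-cb23-4c51-93dd-79b55afbc7dc
  # Show the events recorded for the evacuate action:
  openstack server event show 4aeb7761-cb23-4c51-93dd-79b55afbc7dc <request-id>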
Considering that this issue happened in our production environment, it’s not exactly possible to try to reproduce it without shutting down servers that are currently in use. That said, if the current logs I have are enough, I will try opening a bug on the bugtracker.

Compute22, the source host, was completely dead. It refused to boot up through IPMI.

It is possible that that Stein fix prevented me from reproducing the problem in my staging environment (production is on Rocky, staging is on Stein).

Also, it may be important to note that our Neutron is split, as we use neutron-rpc-server to answer RPC calls. It’s also HA, as we have two controllers with neutron-rpc-server and the API running (and that won’t work anymore when we upgrade production to Stein, but that’s another problem entirely and probably off-topic here).

Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
4414-4416 Louis B Mayer
Laval, QC, H7P 0G1, Canada
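(For anyone unfamiliar with that split layout, the RPC workers are typically started on their own, roughly along these lines; the config-file paths below are assumptions for a stock ML2 deployment and will differ per installation.)

  # Run Neutron's RPC workers separately from the neutron-server API process:
  neutron-rpc-server --config-file /etc/neutron/neutron.conf \
                     --config-file /etc/neutron/plugins/ml2/ml2_conf.ini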
On 7 Jan 2021, at 09:26, Lee Yarwood <lyarwood@redhat.com> wrote:
Would you be able to trace an example evacuation request fully and pastebin it somewhere using `openstack server event list $instance [1]` output to determine the request-id etc? Feel free to also open a bug about this and we can just triage there instead of the ML.
The fact that q-api has sent the network-vif-plugged:80371c01-930d-4ea2-9d28-14438e948b65 to n-api suggests that the q-agt is actually alive on compute22, was that the case? Note that a pre-condition of calling the evacuation API is that the source host has been fenced [2].
That all said, I wonder if this is somehow related to the following Stein change:
https://review.opendev.org/c/openstack/nova/+/603844
It would be great to have debug log level; it’s easier to troubleshoot migration issues that way.

DVD - written from my phone, please ignore the typos

On Thu., Jan. 7, 2021, 1:27 p.m. Jean-Philippe Méthot <jp.methot@planethoster.info> wrote:
Considering that this issue happened in our production environment, it’s not exactly possible to try to reproduce it without shutting down servers that are currently in use. That said, if the current logs I have are enough, I will try opening a bug on the bugtracker.
Compute22, the source host, was completely dead. It refused to boot up through IPMI.
It is possible that that Stein fix prevented me from reproducing the problem in my staging environment (production is on Rocky, staging is on Stein).
Also, it may be important to note that our neutron is split, as we use neutron-rpc-server to answer rpc calls. It’s also HA, as we have two controllers with neutron-rpc-server and the api running (and that won’t work anymore when we upgrade production to stein, but that’s another problem entirely and probably off-topic here).
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
4414-4416 Louis B Mayer
Laval, QC, H7P 0G1, Canada
On 7 Jan 2021, at 09:26, Lee Yarwood <lyarwood@redhat.com> wrote:
Would you be able to trace an example evacuation request fully and pastebin it somewhere using `openstack server event list $instance [1]` output to determine the request-id etc? Feel free to also open a bug about this and we can just triage there instead of the ML.
The fact that q-api has sent the network-vif-plugged:80371c01-930d-4ea2-9d28-14438e948b65 to n-api suggests that the q-agt is actually alive on compute22, was that the case? Note that a pre-condition of calling the evacuation API is that the source host has been fenced [2].
That all said, I wonder if this is somehow related to the following Stein change:
On 07-01-21 11:58:05, Jean-Philippe Méthot wrote:
Considering that this issue happened in our production environment, it’s not exactly possible to try to reproduce it without shutting down servers that are currently in use. That said, if the current logs I have are enough, I will try opening a bug on the bugtracker.
Yup, appreciate that. If you still have logs, then using the event list to determine the request-id for the evacuation, and providing any n-api/n-cpu logs referencing that request-id in the bug, would be great. There's lots more detail in the following doc: https://docs.openstack.org/api-guide/compute/faults.html
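Something along these lines, for example; the log locations are assumptions for a typical package-based install and the request-id is a placeholder:

  # Substitute the evacuation's request-id from 'openstack server event list':
  REQ_ID="req-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
  grep "$REQ_ID" /var/log/nova/nova-api.log       # on the controllers
  grep "$REQ_ID" /var/log/nova/nova-compute.log   # on the destination compute node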
Compute22, the source host, was completely dead. It refused to boot up through IPMI.
ACK.
It is possible that that Stein fix prevented me from reproducing the problem in my staging environment (production is on Rocky, staging is on Stein).
Also, it may be important to note that our neutron is split, as we use neutron-rpc-server to answer rpc calls. It’s also HA, as we have two controllers with neutron-rpc-server and the api running (and that won’t work anymore when we upgrade production to stein, but that’s another problem entirely and probably off-topic here).
I doubt that played a part; we've fixed many, many bugs in Nova's evacuation logic over the releases, so for now I'm going to assume it's something within Nova.
On 7 Jan 2021, at 09:26, Lee Yarwood <lyarwood@redhat.com> wrote:
Would you be able to trace an example evacuation request fully and pastebin it somewhere using `openstack server event list $instance [1]` output to determine the request-id etc? Feel free to also open a bug about this and we can just triage there instead of the ML.
The fact that q-api has sent the network-vif-plugged:80371c01-930d-4ea2-9d28-14438e948b65 to n-api suggests that the q-agt is actually alive on compute22, was that the case? Note that a pre-condition of calling the evacuation API is that the source host has been fenced [2].
That all said, I wonder if this is somehow related to the following Stein change:
https://review.opendev.org/c/openstack/nova/+/603844
--
Lee Yarwood A5D1 9385 88CB 7E5F BE64 6618 BCA6 6E33 F672 2D76
I was not able to find anything in the event list, possibly because the instance was recreated, so its ID doesn’t exist anymore? Anyway, I did just create a bug report with as much info as I could, which is not much more than what I already posted in this mail chain. Hopefully we can get somewhere with this: https://bugs.launchpad.net/nova/+bug/1911474

Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
4414-4416 Louis B Mayer
Laval, QC, H7P 0G1, Canada