We’re running Openstack Rocky on a high-availability setup with neutron openvswitch. The setup has roughly 50 compute nodes and 2 controller nodes. We’ve run into an issue when we’re trying to evacuate a dead compute node where the first instance evacuate goes through, but the second one fails (we evacuate our instances one by one). The reason why the second one fails seems to be because Neutron is trying to plug the port back on the dead compute, as nova instructs it to do. Here’s an example of nova-api log output after compute22 died and we’ve been trying to evacuate an instance.
f3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] Creating event network-vif-unplugged:80371c01-930d-4ea2-9d28-14438e948b65 for instance 4aeb7761-cb23-4c51-93dd-79b55afbc7dc on compute22
2021-01-06 13:31:31.750 2858 INFO nova.osapi_compute.wsgi.server [req-4f9b3e17-1a9d-48f0-961a-bbabdf922ad6 0d0ef3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] 10.30.1.224 "POST /v2.1/os-server-external-events HTTP/1.1" status: 200 len: 1091 time: 0.4987640
2021-01-06 13:31:40.145 2863 INFO nova.osapi_compute.wsgi.server [req-abaac9df-7338-4d10-9326-4006021ff54d 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1 HTTP/1.1" status: 302 len: 318 time: 0.0072701
2021-01-06 13:31:40.156 2863 INFO nova.osapi_compute.wsgi.server [req-c393e74b-a118-4a98-8a83-be6007913dc0 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/ HTTP/1.1" status: 200 len: 789 time: 0.0070350
2021-01-06 13:31:43.289 2865 INFO nova.osapi_compute.wsgi.server [req-b87268b7-a673-44c1-9162-f9564647ec33 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/servers/4aeb7761-cb23-4c51-93dd-79b55afbc7dc HTTP/1.1" status: 200 len: 5654 time: 2.7543190
2021-01-06 13:31:43.413 2863 INFO nova.osapi_compute.wsgi.server [req-4cab23ba-c5cb-4dda-bf42-bc452d004783 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/servers/4aeb7761-cb23-4c51-93dd-79b55afbc7dc/os-volume_attachments HTTP/1.1" status: 200 len: 770 time: 0.1135709
2021-01-06 13:31:43.883 2865 INFO nova.osapi_compute.wsgi.server [req-f5e5a586-65f3-4798-b03b-98e01326a00b 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/flavors/574a7152-f079-4337-b1eb-b7eca4370b73 HTTP/1.1" status: 200 len: 877 time: 0.5751688
2021-01-06 13:31:47.194 2864 INFO nova.api.openstack.compute.server_external_events [req-7e639b1f-8408-4e8e-9bb8-54588290edfe 0d0ef3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] Creating event network-vif-plugged:80371c01-930d-4ea2-9d28-14438e948b65 for instance 4aeb7761-cb23-4c51-93dd-79b55afbc7dc on compute22
As you can see, Nova "creates an event" as the virtual interface is unplugged but then immediately creates another event to plug the virtual interface in the same compute node that is dead. However, at the same time, the instance is being created on another compute node. Is this a known bug? I have not found anything about this in the bug database. Additionally, I am not able to reproduce in our staging environment which is smaller and running on Stein.