[nova] Nova evacuate issue

Lee Yarwood lyarwood at redhat.com
Thu Jan 7 14:26:38 UTC 2021


On 06-01-21 18:26:24, Jean-Philippe Méthot wrote:
> Hi,
> 
> We’re running Openstack Rocky on a high-availability setup with
> neutron openvswitch. The setup has roughly 50 compute nodes and 2
> controller nodes. We’ve run into an issue when we’re trying to
> evacuate a dead compute node where the first instance evacuate goes
> through, but the second one fails (we evacuate our instances one by
> one). The reason why the second one fails seems to be because Neutron
> is trying to plug the port back on the dead compute, as nova instructs
> it to do. Here’s an example of nova-api log output after compute22
> died and we’ve been trying to evacuate an instance.
> 
> f3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] Creating event network-vif-unplugged:80371c01-930d-4ea2-9d28-14438e948b65 for instance 4aeb7761-cb23-4c51-93dd-79b55afbc7dc on compute22
> 2021-01-06 13:31:31.750 2858 INFO nova.osapi_compute.wsgi.server [req-4f9b3e17-1a9d-48f0-961a-bbabdf922ad6 0d0ef3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] 10.30.1.224 "POST /v2.1/os-server-external-events HTTP/1.1" status: 200 len: 1091 time: 0.4987640
> 2021-01-06 13:31:40.145 2863 INFO nova.osapi_compute.wsgi.server [req-abaac9df-7338-4d10-9326-4006021ff54d 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1 HTTP/1.1" status: 302 len: 318 time: 0.0072701
> 2021-01-06 13:31:40.156 2863 INFO nova.osapi_compute.wsgi.server [req-c393e74b-a118-4a98-8a83-be6007913dc0 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/ HTTP/1.1" status: 200 len: 789 time: 0.0070350
> 2021-01-06 13:31:43.289 2865 INFO nova.osapi_compute.wsgi.server [req-b87268b7-a673-44c1-9162-f9564647ec33 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/servers/4aeb7761-cb23-4c51-93dd-79b55afbc7dc HTTP/1.1" status: 200 len: 5654 time: 2.7543190
> 2021-01-06 13:31:43.413 2863 INFO nova.osapi_compute.wsgi.server [req-4cab23ba-c5cb-4dda-bf42-bc452d004783 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/servers/4aeb7761-cb23-4c51-93dd-79b55afbc7dc/os-volume_attachments HTTP/1.1" status: 200 len: 770 time: 0.1135709
> 2021-01-06 13:31:43.883 2865 INFO nova.osapi_compute.wsgi.server [req-f5e5a586-65f3-4798-b03b-98e01326a00b 6cb55894e59c47b3800f97a27c9c4ee9 ccfa9d8d76b8409f8c5a8d71ce32625a - default default] 10.30.1.224 "GET /v2.1/flavors/574a7152-f079-4337-b1eb-b7eca4370b73 HTTP/1.1" status: 200 len: 877 time: 0.5751688
> 2021-01-06 13:31:47.194 2864 INFO nova.api.openstack.compute.server_external_events [req-7e639b1f-8408-4e8e-9bb8-54588290edfe 0d0ef3839ca64f58ac779f6f810758c0 61e62a49d34a44f9b1161a338a7f1fdd - default default] Creating event network-vif-plugged:80371c01-930d-4ea2-9d28-14438e948b65 for instance 4aeb7761-cb23-4c51-93dd-79b55afbc7dc on compute22
> 
> As you can see, Nova "creates an event" as the virtual interface is
> unplugged but then immediately creates another event to plug the
> virtual interface in the same compute node that is dead. However, at
> the same time, the instance is being created on another compute node.
> Is this a known bug? I have not found anything about this in the bug
> database. Additionally, I am not able to reproduce in our staging
> environment which is smaller and running on Stein.

Would you be able to trace an example evacuation request fully and
pastebin it somewhere, using the output of
`openstack server event list $instance` [1] to determine the request-id
etc.? Feel free to also open a bug about this and we can triage there
instead of on the ML.

The fact that neutron-server (q-api) has sent the
network-vif-plugged:80371c01-930d-4ea2-9d28-14438e948b65 event to
nova-api (n-api) suggests that the neutron agent (q-agt) is actually
alive on compute22. Was that the case? Note that a pre-condition of
calling the evacuation API is that the source host has been fenced [2].
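
If it helps while collecting the trace, here is a minimal sketch for
pulling the external events out of the n-api log so the unplugged and
plugged events for a port can be lined up against their request-ids.
The regexes are only inferred from the log lines quoted above, and the
names `EVENT_RE`, `REQ_RE` and `external_events` are hypothetical
helpers for illustration, not anything from Nova itself.

```python
import re

# Assumption: the message format matches the "Creating event ..." lines
# quoted earlier in this thread; other releases may log differently.
EVENT_RE = re.compile(
    r"Creating event (?P<name>[\w-]+):(?P<port>[0-9a-f-]+) "
    r"for instance (?P<instance>[0-9a-f-]+) on (?P<host>\S+)"
)
# Assumption: the request-id is the first token inside the [...] context.
REQ_RE = re.compile(r"\[(req-[0-9a-f-]+) ")


def external_events(lines):
    """Return one dict per 'Creating event' line found in `lines`."""
    events = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue  # not an external event line
        event = match.groupdict()
        req = REQ_RE.search(line)
        # Truncated lines may have lost their request-id, so allow None.
        event["request_id"] = req.group(1) if req else None
        events.append(event)
    return events
```

Feeding it the n-api log lines (for example, an open file object) gives
a list of events in log order, which makes it easy to see which host
each network-vif-plugged event was created for.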

That all said, I wonder if this is somehow related to the following
Stein change:

https://review.opendev.org/c/openstack/nova/+/603844

Cheers,

Lee

[1] https://docs.openstack.org/python-openstackclient/rocky/cli/command-objects/server-event.html#server-event-list
[2] https://docs.openstack.org/api-ref/compute/?expanded=evacuate-server-evacuate-action-detail#evacuate-server-evacuate-action

-- 
Lee Yarwood                 A5D1 9385 88CB 7E5F BE64  6618 BCA6 6E33 F672 2D76