Considering that this issue happened in our production environment, it's not really possible to try to reproduce it without shutting down servers that are currently in use. That said, if the logs I currently have are enough, I will try opening a bug on the bug tracker.

Compute22, the source host, was completely dead: it refused to boot up even through IPMI. It is possible that the Stein fix prevented me from reproducing the problem in my staging environment (production is on Rocky, staging is on Stein).

Also, it may be important to note that our Neutron is split, as we use neutron-rpc-server to answer RPC calls. It's also HA, as we have two controllers running neutron-rpc-server and the API (and that won't work anymore when we upgrade production to Stein, but that's another problem entirely and probably off-topic here).

Jean-Philippe Méthot
Senior OpenStack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
4414-4416 Louis B Mayer
Laval, QC, H7P 0G1, Canada
On 7 Jan 2021, at 09:26, Lee Yarwood <lyarwood@redhat.com> wrote:
Would you be able to trace an example evacuation request fully and pastebin it somewhere, using the output of `openstack server event list $instance` [1] to determine the request-id, etc.? Feel free to also open a bug about this and we can triage there instead of on the ML.
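Something along these lines, with $instance, <request-id> and the log paths below being placeholders to fill in for your deployment:

    # List the API actions recorded against the instance; the evacuate
    # row carries the request-id to trace through the logs.
    $ openstack server event list $instance

    # Show the details of that one action.
    $ openstack server event show $instance <request-id>

    # Then grep that request-id across the nova/neutron logs on the
    # controllers and the destination host.
    $ grep -r '<request-id>' /var/log/nova /var/log/neutron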
The fact that q-api has sent the network-vif-plugged:80371c01-930d-4ea2-9d28-14438e948b65 event to n-api suggests that the q-agt was actually alive on compute22. Was that the case? Note that a pre-condition of calling the evacuation API is that the source host has been fenced [2].
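For context, this is roughly what neutron-server posts to Nova's os-server-external-events API when the vif is plugged; just a sketch, with $TOKEN, $NOVA_API and the server UUID as placeholders (the tag is the port UUID from the event name above):

    $ curl -s -X POST "$NOVA_API/v2.1/os-server-external-events" \
        -H "X-Auth-Token: $TOKEN" -H "Content-Type: application/json" \
        -d '{"events": [{"name": "network-vif-plugged",
                         "server_uuid": "<instance-uuid>",
                         "tag": "80371c01-930d-4ea2-9d28-14438e948b65",
                         "status": "completed"}]}'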
That all said, I wonder if this is somehow related to the following Stein change:
https://review.opendev.org/c/openstack/nova/+/603844