<div dir="ltr">I've seen port binding failures when RabbitMQ was not in a good state. Messages between services transit through RabbitMQ, so if it is unhealthy Nova/Neutron might not be able to follow the flow correctly.<div><br></div><div>Can you double-check that RabbitMQ is healthy?</div><div><ul><li>rabbitmqctl cluster_status</li><li>rabbitmqctl list_queues</li></ul><div>I would also recommend setting the log level to DEBUG for all the services and following the request ID of a server create through the logs.</div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Apr 16, 2022 at 4:44 AM Eugen Block &lt;<a href="mailto:eblock@nde.ag">eblock@nde.ag</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi *,<br>

<br>

I have a kind of strange case which I've been trying to solve for hours; I  <br>

could use some fresh ideas.<br>

It's a HA cloud (Victoria) deployed by Salt and the 2 control nodes  <br>

are managed by pacemaker, the third controller will join soon. There  <br>

are around 16 compute nodes at the moment.<br>

This two-node control plane works well, except if there are unplanned  <br>

outages. Since the last outage of one control node we struggle to  <br>

revive neutron (I believe neutron is the issue here). I'll try to  <br>

focus on the main issue here, let me know if more details are required.<br>

After the failed node was back online all openstack agents show as  <br>

"up" (openstack compute service list, openstack network agent list).  <br>

Running VMs don't seem to be impacted (as far as I can tell). But we  <br>

can't create new instances in existing networks, and since we use  <br>

Octavia we also can't (re)build any LBs at the moment. When I create a  <br>

new test network the instance spawns successfully and is active within  <br>

a few seconds. For existing networks we get the famous "port binding  <br>

failed" from nova-compute.log. But I see the port being created, it  <br>

just can't be attached to the instance. One more strange thing: I  <br>

don't see any entries in the nova-scheduler.log or nova-conductor.log  <br>

for the successfully built instance, except for the recently mentioned  <br>

etcd3gw message from nova-conductor, but this didn't impact the  <br>

instance creation yet.<br>

We have investigated this for hours, we have rebooted both control  <br>

nodes multiple times in order to kill any remaining processes. The  <br>

galera DB seems fine, rabbitmq also behaves normally (I think), we  <br>

tried multiple times to put one node in standby to only have one node  <br>

to look at which also didn't help.<br>

So basically we restarted everything multiple times on the control  <br>

nodes and also nova-compute and openvswitch-agent on all compute  <br>

nodes, the issue is still not resolved.<br>

Does anyone have further ideas to resolve this? I'd be happy to  <br>

provide more details, just let me know what you need.<br>

<br>

Happy Easter!<br>

Eugen<br>

<br>

<br>

</blockquote></div>
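<div dir="ltr">To make the list_queues check suggested at the top a bit easier to read, here is a rough sketch of how I might scan the output for queues that hold messages but have no consumer attached, which is one possible sign that an agent never reconnected after the outage. It assumes the output of <code>rabbitmqctl -q list_queues name messages consumers</code> is tab-separated; the helper names (<code>parse_list_queues</code>, <code>suspicious_queues</code>, <code>fetch_list_queues</code>) and the <code>backlog</code> threshold are made up for illustration, so treat it as a starting point rather than a definitive health check.</div>
<pre>
```python
import subprocess


def parse_list_queues(output):
    """Parse tab-separated `name messages consumers` rows.

    Lines that do not have exactly three fields, or whose last two
    fields are not plain integers (banners, header rows), are skipped.
    """
    rows = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        name, messages, consumers = parts
        if not (messages.isdigit() and consumers.isdigit()):
            continue
        rows.append((name, int(messages), int(consumers)))
    return rows


def suspicious_queues(rows, backlog=0):
    """Return queues with more than `backlog` messages and zero consumers.

    `backlog` is an illustrative threshold, not a RabbitMQ setting.
    """
    return [r for r in rows if r[1] > backlog and r[2] == 0]


def fetch_list_queues():
    """Run rabbitmqctl on a RabbitMQ node and return its raw stdout.

    Assumes rabbitmqctl is on PATH and the caller may run it.
    """
    cmd = ["rabbitmqctl", "-q", "list_queues", "name", "messages", "consumers"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```
</pre>
<div dir="ltr">On a controller, something like <code>suspicious_queues(parse_list_queues(fetch_list_queues()))</code> would list the candidates. An empty result does not prove RabbitMQ is fine, but a growing backlog with no consumers on a Neutron-related queue would be worth a closer look.</div>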