<div dir="ltr">I've seen failures with port bindings when rabbitmq was not in a good state. Messages between services transit through Rabbit so Nova/Neutron might not be able to follow the flow correctly.<div><br></div><div>Can you double check that rabbit is good to go?</div><div><ul><li>rabbitmqctl cluster_status</li><li>rabbitmqctl list_queues</li></ul><div>I would also recommend turning the logs to DEBUG for all the services and trying to follow a server create request-id.</div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Apr 16, 2022 at 4:44 AM Eugen Block <<a href="mailto:eblock@nde.ag">eblock@nde.ag</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi *,<br>
On Sat, Apr 16, 2022 at 4:44 AM Eugen Block <eblock@nde.ag> wrote:

> Hi *,
>
> I have a kind of strange case that I've been trying to solve for hours,
> and I could use some fresh ideas.
> It's an HA cloud (Victoria) deployed by Salt; the two control nodes are
> managed by Pacemaker and a third controller will join soon. There are
> around 16 compute nodes at the moment.
> This two-node control plane works well, except when there are unplanned
> outages. Since the last outage of one control node we have been struggling
> to revive neutron (I believe neutron is the issue here). I'll try to focus
> on the main issue; let me know if more details are required.
> After the failed node was back online, all openstack agents showed as "up"
> (openstack compute service list, openstack network agent list). Running VMs
> don't seem to be impacted (as far as I can tell). But we can't create new
> instances in existing networks, and since we use Octavia we also can't
> (re)build any LBs at the moment. When I create a new test network and boot
> an instance in it, the instance spawns successfully and is active within a
> few seconds. For existing networks we get the famous "port binding failed"
> from nova-compute.log, but I can see the port being created; it just can't
> be attached to the instance.
>
> One more strange thing: I don't see any entries in nova-scheduler.log or
> nova-conductor.log for the successfully built instance, except for the
> recently mentioned etcd3gw message from nova-conductor, but that hasn't
> impacted instance creation so far.
> We have investigated this for hours and have rebooted both control nodes
> multiple times in order to kill any remaining processes. The galera DB
> seems fine and rabbitmq also behaves normally (I think). We tried multiple
> times to put one node in standby so there was only one node to look at,
> which also didn't help. So basically we restarted everything multiple
> times on the control nodes, as well as nova-compute and openvswitch-agent
> on all compute nodes, and the issue is still not resolved.
> Does anyone have further ideas on how to resolve this? I'd be happy to
> provide more details; just let me know what you need.
>
> Happy Easter!
> Eugen