I've seen failures with port bindings when rabbitmq was not in a good state. Messages between services transit through Rabbit so Nova/Neutron might not be able to follow the flow correctly.
On Sat, 2022-04-16 at 09:30 -0400, Laurent Dumont wrote:
That is not quite right. Inter-service messaging happens via HTTP REST APIs; intra-service communication happens via Rabbit. Nova never calls Neutron over Rabbit, nor does Neutron call Nova over Rabbit. However, it is true that Rabbit issues can sometimes cause port binding issues. If you are using ml2/ovs, the agent report/heartbeat can be lost from the perspective of the neutron server, and it can consider the service down. If the agent is "down", the ml2/ovs mech driver will refuse to bind the port.
Assuming the agent is up in the DB, the request to bind the port never actually transits RabbitMQ. The compute node makes an HTTP request to the neutron-server, which hosts the API endpoint and executes the ml2 drivers. The ml2/ovs driver only uses info from the neutron DB, which it accesses directly.
The neutron server debug logs should have records for the binding request which should detail why the port binding failed. They should show each loaded ml2 driver being tried in sequence to bind the port and, if it can't, log the reason why.
I would start by checking that the ovs l2 agents show as up in the DB/API, then find a port ID for one of the failed port bindings and trace the debug logs for that port binding in the neutron server logs for the error. If you find one, post it here.
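For the log tracing step, something like the following rough sketch may help pull out every line that mentions a given port ID. It is only an illustration: the helper names `trace_port` and `problems` are made up for this example, and it assumes the log lines carry a standard level name (DEBUG, INFO, WARNING, ERROR, CRITICAL) somewhere in the line. It does not look for any particular Neutron message text, so you still have to read the matching lines yourself.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Assumption: each log line contains a standard level name as a whole word.
LEVEL_RE = re.compile(r"\b(DEBUG|INFO|WARNING|ERROR|CRITICAL)\b")


def trace_port(lines: Iterable[str], port_id: str) -> Iterator[Tuple[str, str]]:
    """Yield (level, line) for every log line that mentions port_id."""
    for line in lines:
        if port_id not in line:
            continue
        match = LEVEL_RE.search(line)
        level = match.group(1) if match else "UNKNOWN"
        yield level, line.rstrip("\n")


def problems(lines: Iterable[str], port_id: str) -> List[str]:
    """Return only the WARNING/ERROR/CRITICAL lines that mention port_id."""
    bad = {"WARNING", "ERROR", "CRITICAL"}
    return [line for level, line in trace_port(lines, port_id) if level in bad]
```

To use it, open the neutron-server log file (the path depends on your deployment) and pass the file object together with the ID of one of the failed ports to `trace_port`, or to `problems` if you only want the warnings and errors.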
Can you double check that rabbit is good to go?
- rabbitmqctl cluster_status
- rabbitmqctl list_queues
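If the queue list is long, a small filter can make backlogs easier to spot. The sketch below is only an illustration: `backlogged_queues` is a made-up helper, the threshold of 100 messages is an arbitrary value and not a recommendation, and it assumes the default two-column output of `rabbitmqctl list_queues` (queue name, then message count). Banner and header lines are skipped because their last field is not an integer.

```python
from typing import Iterable, List, Tuple


def backlogged_queues(lines: Iterable[str], threshold: int = 100) -> List[Tuple[str, int]]:
    """Return (queue, messages) pairs whose message count exceeds threshold.

    Assumes whitespace-separated columns with the queue name first and the
    message count last. The threshold is an arbitrary example value.
    """
    found = []
    for line in lines:
        fields = line.split()
        # Skip blank lines, banners and the column header.
        if len(fields) < 2 or not fields[-1].isdigit():
            continue
        name, count = fields[0], int(fields[-1])
        if count > threshold:
            found.append((name, count))
    # Largest backlog first.
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Save the output of `rabbitmqctl list_queues` to a file and pass its lines to the function. Queues that keep growing between two runs are usually more telling than a single large number.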
I would also recommend turning the logs to DEBUG for all the services and trying to follow a server create request-id.
On Sat, Apr 16, 2022 at 4:44 AM Eugen Block <eblock@nde.ag> wrote:
Hi *,
I have a kind of strange case which I've been trying to solve for hours; I could use some fresh ideas. It's an HA cloud (Victoria) deployed by Salt, and the 2 control nodes are managed by pacemaker; the third controller will join soon. There are around 16 compute nodes at the moment. This two-node control plane works well, except if there are unplanned outages. Since the last outage of one control node we have struggled to revive neutron (I believe neutron is the issue here). I'll try to focus on the main issue here, let me know if more details are required.
After the failed node was back online, all openstack agents show as "up" (openstack compute service list, openstack network agent list). Running VMs don't seem to be impacted (as far as I can tell). But we can't create new instances in existing networks, and since we use Octavia we also can't (re)build any LBs at the moment. When I create a new test network, the instance spawns successfully and is active within a few seconds. For existing networks we get the famous "port binding failed" from nova-compute.log. But I see the port being created, it just can't be attached to the instance.
One more strange thing: I don't see any entries in the nova-scheduler.log or nova-conductor.log for the successfully built instance, except for the recently mentioned etcd3gw message from nova-conductor, but this hasn't impacted the instance creation so far.
We have investigated this for hours, and we have rebooted both control nodes multiple times in order to kill any remaining processes. The Galera DB seems fine, rabbitmq also behaves normally (I think). We tried multiple times to put one node in standby so that we only have one node to look at, which also didn't help. So basically we restarted everything multiple times on the control nodes, and also nova-compute and openvswitch-agent on all compute nodes, and the issue is still not resolved.
Does anyone have further ideas to resolve this? I'd be happy to provide more details, just let me know what you need.
Happy Easter!
Eugen