[neutron][nova] port binding fails for existing networks
laurentfdumont at gmail.com
Sat Apr 16 13:30:40 UTC 2022
I've seen port binding failures when RabbitMQ was not in a healthy
state. Messages between services pass through RabbitMQ, so if it is
degraded, Nova and Neutron might not be able to complete the flow.
Can you double check that rabbit is good to go?
- rabbitmqctl cluster_status
- rabbitmqctl list_queues
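
If the queue list is long, a small script can help spot queues that look
stuck. The following is only a rough sketch: it assumes the output of
`rabbitmqctl list_queues name messages consumers` (tab-separated columns),
and the function name, the threshold and the sample queue names are made up
for illustration. A queue with a backlog and no consumers is a hint worth
looking at, not proof of a problem.

```python
def find_stuck_queues(list_queues_output, max_backlog=0):
    """Return (queue, messages) pairs that have a backlog and no consumers.

    Expects the text printed by:
        rabbitmqctl list_queues name messages consumers
    Banner and header lines are skipped because they do not have
    three tab-separated fields with numeric counts.
    """
    stuck = []
    for line in list_queues_output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # banner line such as "Listing queues for vhost / ..."
        name, messages, consumers = parts
        if not (messages.isdigit() and consumers.isdigit()):
            continue  # column header row
        if int(messages) > max_backlog and int(consumers) == 0:
            stuck.append((name, int(messages)))
    return stuck


# Made-up sample output, not taken from a real cluster.
sample = "\n".join([
    "Listing queues for vhost / ...",
    "name\tmessages\tconsumers",
    "example-queue-ok\t0\t4",
    "example-queue-a\t152\t0",
    "example-queue-b\t3\t2",
])

for queue, backlog in find_stuck_queues(sample):
    print(f"{queue}: {backlog} messages, no consumers")
```

On a real node you would pass in the captured command output instead of
the sample string.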
I would also recommend setting the log level to DEBUG for all the services
and following a server create request through the logs by its request-id.
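
To follow one request across several log files, something like the sketch
below can merge the matching lines in time order. It assumes the log lines
start with a 23-character timestamp ("YYYY-MM-DD HH:MM:SS.mmm") and contain
the request-id as plain text; the function name and the sample lines are
invented for illustration. Note that the same request-id will not
necessarily show up in every service's log.

```python
def trace_request(logs, request_id):
    """Collect lines mentioning request_id from several logs, sorted by time.

    logs: dict mapping a label (e.g. the service name) to an iterable of
          log lines.
    Returns a list of (timestamp, label, line) tuples.
    """
    hits = []
    for source, lines in logs.items():
        for line in lines:
            if request_id in line:
                # Assumption: the first 23 characters are the timestamp.
                hits.append((line[:23], source, line.rstrip("\n")))
    hits.sort()  # timestamps in this format sort correctly as strings
    return hits


# Invented sample lines, not real log output.
sample_logs = {
    "nova-compute": [
        "2022-04-16 13:30:41.500 100 DEBUG example "
        "[req-11111111-2222-3333-4444-555555555555 - -] step two",
        "2022-04-16 13:30:41.900 100 DEBUG example "
        "[req-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee - -] unrelated",
    ],
    "neutron-server": [
        "2022-04-16 13:30:41.200 200 DEBUG example "
        "[req-11111111-2222-3333-4444-555555555555 - -] step one",
    ],
}

trace = trace_request(sample_logs, "req-11111111-2222-3333-4444-555555555555")
for timestamp, source, line in trace:
    print(f"{source}: {line}")
```

With real logs, each dict value could simply be an open file object for
the corresponding log file.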
On Sat, Apr 16, 2022 at 4:44 AM Eugen Block <eblock at nde.ag> wrote:
> Hi *,
> I have a rather strange case that I've been trying to solve for hours,
> and I could use some fresh ideas.
> It's an HA cloud (Victoria) deployed by Salt and the 2 control nodes
> are managed by pacemaker, the third controller will join soon. There
> are around 16 compute nodes at the moment.
> This two-node control plane works well, except if there are unplanned
> outages. Since the last outage of one control node we struggle to
> revive neutron (I believe neutron is the issue here). I'll try to
> focus on the main issue here, let me know if more details are required.
> After the failed node was back online all openstack agents show as
> "up" (openstack compute service list, openstack network agent list).
> Running VMs don't seem to be impacted (as far as I can tell). But we
> can't create new instances in existing networks, and since we use
> Octavia we also can't (re)build any LBs at the moment. When I create a
> new test network the instance spawns successfully and is active within
> a few seconds. For existing networks we get the famous "port binding
> failed" from nova-compute.log. But I see the port being created, it
> just can't be attached to the instance. One more strange thing: I
> don't see any entries in the nova-scheduler.log or nova-conductor.log
> for the successfully built instance, except for the recently mentioned
> etcd3gw message from nova-conductor, but this didn't impact the
> instance creation yet.
> We have investigated this for hours and have rebooted both control
> nodes multiple times in order to kill any remaining processes. The
> galera DB seems fine, and rabbitmq also behaves normally (I think). We
> tried multiple times to put one node in standby so that there is only
> one node to look at, which also didn't help.
> So basically we restarted everything multiple times on the control
> nodes and also nova-compute and openvswitch-agent on all compute
> nodes, but the issue is still not resolved.
> Does anyone have further ideas to resolve this? I'd be happy to
> provide more details, just let me know what you need.
> Happy Easter!