[neutron][nova] port binding fails for existing networks

Laurent Dumont laurentfdumont at gmail.com
Sat Apr 16 23:33:15 UTC 2022


You can probably try re-enabling each service in turn. It might be an issue
with one of them.
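
Something along these lines on the second control node might help narrow it
down; the unit names below are just placeholders and will differ depending on
your distro/deployment:

    # re-enable one service at a time and test instance creation in between
    for svc in apache2 memcached neutron-server nova-api octavia-api; do
        systemctl start "$svc"
        read -r -p "started $svc - test a server create, then press enter"
    done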

On Sat, Apr 16, 2022 at 6:23 PM Eugen Block <eblock at nde.ag> wrote:

> Thank you both for your comments, I appreciate it!
> Before digging into the logs I tried again with one of the two control
> nodes disabled. I didn't disable all services though, only apache,
> memcached, neutron, nova and octavia, so all my requests would go to
> the active control node while rabbit and galera would stay in sync.
> This already seemed to clean things up somehow: now I was able to
> launch instances and LBs into an active state. Awesome! Then I started
> the mentioned services on the other control node again and things
> stopped working. Note that this setup worked for months, and we have
> another cloud with two control nodes which has worked like a charm for
> years now. The only significant thing I noticed while switching back
> to a single active neutron/nova/octavia node was this message in
> neutron-dhcp-agent.log:
>
> 2022-04-16 23:59:29.180 36882 ERROR neutron_lib.rpc
> [req-905aecd6-ff22-4549-a0cb-ef5259692f5d - - - - -] Timeout in RPC
> method get_active_networks_info. Waiting for 510 seconds before next
> attempt. If the server is not down, consider increasing the
> rpc_response_timeout option as Neutron server(s) may be overloaded and
> unable to respond quickly enough.:
> oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a
> reply to message ID 6676c45f5b0c42af8e34f8fb4aba3aca
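>
> (The timeout it refers to can be raised in neutron.conf on the agent nodes,
> e.g. something like the following, where 300 is only an example value and
> the default is 60 seconds:
>
>   [DEFAULT]
>   # give neutron-server more time to answer agent RPC calls
>   rpc_response_timeout = 300
>
> but I assume that would only hide the symptom if the server or rabbit is
> actually unhealthy.)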
>
> I'll need to take a closer look for more of these messages after the
> weekend, but more importantly figure out why we can't seem to re-enable
> the second node. I'll enable debug logs then and hopefully find a trace
> to the root cause.
> If you have other comments please don't hesitate, I'm thankful for any ideas.
>
> Thanks!
> Eugen
>
>
> Quoting Sean Mooney <smooney at redhat.com>:
>
> > On Sat, 2022-04-16 at 09:30 -0400, Laurent Dumont wrote:
> >> I've seen failures with port bindings when rabbitmq was not in a good
> >> state. Messages between services transit through Rabbit so Nova/Neutron
> >> might not be able to follow the flow correctly.
> > That is not quite right.
> >
> > Inter-service messaging happens via HTTP REST APIs.
> > Intra-service communication happens via rabbit.
> > Nova never calls Neutron over rabbit, nor does Neutron call Nova over
> > rabbit.
> >
> > However, it is true that rabbit issues can sometimes cause port
> > binding issues.
> > If you are using ml2/ovs, the agent report/heartbeat can be lost from
> > the perspective of the neutron server and it can consider the agent
> > down. If the agent is "down" then the ml2/ovs mech driver will refuse
> > to bind the port.
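> >
> > A quick way to check that from the API is something like the following
> > (roughly, from memory):
> >
> >   # every compute host should show Alive ":-)" and State UP
> >   openstack network agent list --agent-type open-vswitch
> >
> > The related knobs are report_interval on the agent side and
> > agent_down_time on the neutron-server side; agent_down_time should be
> > at least double the report_interval.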
> >
> > Assuming the agent is up in the DB, the request to bind the port
> > never actually transits rabbitmq.
> >
> > The compute node makes an HTTP request to the neutron-server, which
> > hosts the API endpoint and executes the ml2 drivers.
> > The ml2/ovs driver only uses info from the neutron DB, which it
> > accesses directly.
> >
> > The neutron server debug logs should have records for the binding
> > requests which should detail why the port binding failed.
> > They should show each loaded ml2 driver being tried in sequence to
> > bind the port and, if it can't bind, log the reason why.
> >
> > I would start by checking that the OVS L2 agents show as up in the
> > DB/API, then find a port ID for one of the failed port bindings and
> > trace that binding through the neutron-server debug logs for the
> > error; if you find one, post it here.
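> >
> > Something like this usually does it (the port ID and log path are just
> > placeholders for your environment):
> >
> >   # find the ports of the failed instance
> >   openstack port list --server <instance-uuid>
> >
> >   # then follow one of those ports through the neutron-server debug log
> >   grep <port-uuid> /var/log/neutron/neutron-server.log | grep -i bind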
> >
> >>
> >> Can you double check that rabbit is good to go?
> >>
> >>    - rabbitmqctl cluster_status
> >>    - rabbitmqctl list_queues
> >>
> >> I would also recommend turning the logs to DEBUG for all the services
> >> and trying to follow a server create request-id.
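> >>
> >> The DEBUG part is just the standard oslo.log flag in each service's
> >> config (e.g. neutron.conf, nova.conf, octavia.conf), followed by a
> >> restart of the services:
> >>
> >>   [DEFAULT]
> >>   debug = True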
> >>
> >> On Sat, Apr 16, 2022 at 4:44 AM Eugen Block <eblock at nde.ag> wrote:
> >>
> >> > Hi *,
> >> >
> >> > I have a kind of strange case which I've been trying to solve for
> >> > hours, and I could use some fresh ideas.
> >> > It's an HA cloud (Victoria) deployed by Salt, and the 2 control nodes
> >> > are managed by pacemaker; a third controller will join soon. There
> >> > are around 16 compute nodes at the moment.
> >> > This two-node control plane works well, except when there are
> >> > unplanned outages. Since the last outage of one control node we have
> >> > been struggling to revive neutron (I believe neutron is the issue
> >> > here). I'll try to focus on the main issue; let me know if more
> >> > details are required.
> >> > After the failed node was back online all openstack agents show as
> >> > "up" (openstack compute service list, openstack network agent list).
> >> > Running VMs don't seem to be impacted (as far as I can tell). But we
> >> > can't create new instances in existing networks, and since we use
> >> > Octavia we also can't (re)build any LBs at the moment. When I create
> >> > a new test network, an instance launched in it spawns successfully
> >> > and is active within a few seconds. For existing networks we get the
> >> > famous "port binding failed" in nova-compute.log. I do see the port
> >> > being created, it just can't be attached to the instance. One more
> >> > strange thing: I don't see any entries in nova-scheduler.log or
> >> > nova-conductor.log for the successfully built instance, except for
> >> > the recently mentioned etcd3gw message from nova-conductor, but that
> >> > hasn't impacted instance creation so far.
> >> > We have investigated this for hours and have rebooted both control
> >> > nodes multiple times in order to kill any remaining processes. The
> >> > galera DB seems fine and rabbitmq also behaves normally (I think).
> >> > We tried multiple times to put one node in standby so that there was
> >> > only one node to look at, which also didn't help.
> >> > So basically we restarted everything multiple times on the control
> >> > nodes and also nova-compute and openvswitch-agent on all compute
> >> > nodes, but the issue is still not resolved.
> >> > Does anyone have further ideas to resolve this? I'd be happy to
> >> > provide more details, just let me know what you need.
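> >> > For example, I can share the output of something like
> >> >
> >> >   openstack port show <port-uuid> -c status -c binding_host_id -c binding_vif_type
> >> >
> >> > for one of the affected ports, if that helps.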
> >> >
> >> > Happy Easter!
> >> > Eugen
> >> >
> >> >
> >> >
>
>
>
>