[neutron][nova] port binding fails for existing networks

Eugen Block eblock at nde.ag
Tue Apr 19 08:08:33 UTC 2022


I have an interesting update on this. For the last two days I let the
cloud work in a degraded (and unmanaged) state with respect to
pacemaker, meaning I stopped the apache, memcached, neutron,
openvswitch, nova and octavia services on one control node. Today I
wanted to start the services again one by one, hoping to find the
responsible one. But everything works fine: after starting each
service I tried to launch an instance, and all attempts were
successful. So I disabled the pacemaker maintenance mode and retried,
and still everything worked. I assume that some of the services still
had cached references to the disabled control node and couldn't
recover; does that make sense? On the other hand, we rebooted both
control nodes a few times, and I expected that to clean up anything
like that. So while the issue seems to be resolved, I still have no
idea what went wrong. :-( Anyway, I hope to bring the third node
online this week so we'll hopefully be more resilient against control
node failure.
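
For anyone who wants to reproduce the maintenance part: the toggle I
mean is roughly this (crmsh syntax as an example, pcs has an
equivalent property command):

  # resources keep running but pacemaker stops managing them
  crm configure property maintenance-mode=true
  # ... stop/start services by hand, test ...
  # hand control back to pacemaker
  crm configure property maintenance-mode=false
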
Thanks again for your comments!
Eugen

Quoting Laurent Dumont <laurentfdumont at gmail.com>:

> You can probably try each one in turn. Might be an issue with one of the
> two.
>
> On Sat, Apr 16, 2022 at 6:23 PM Eugen Block <eblock at nde.ag> wrote:
>
>> Thank you both for your comments, I appreciate it!
>> Before digging into the logs I tried again with one of the two control
>> nodes disabled. But I didn't disable all services, only apache,
>> memcached, neutron, nova and octavia, so all my requests would go to
>> the active control node while rabbit and galera would stay in sync.
>> This already seemed to clean things up somehow; now I was able to
>> launch instances and LBs into an active state. Awesome! Then I started
>> the
>> mentioned services on the other control node again and things stopped
>> working. Note that this setup worked for months, and we have another
>> cloud with two control nodes that has worked like a charm for years.
>> The only significant thing I noticed while switching back to one
>> active neutron/nova/octavia node was this message from the
>> neutron-dhcp-agent.log:
>>
>> 2022-04-16 23:59:29.180 36882 ERROR neutron_lib.rpc
>> [req-905aecd6-ff22-4549-a0cb-ef5259692f5d - - - - -] Timeout in RPC
>> method get_active_networks_info. Waiting for 510 seconds before next
>> attempt. If the server is not down, consider increasing the
>> rpc_response_timeout option as Neutron server(s) may be overloaded and
>> unable to respond quickly enough.:
>> oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a
>> reply to message ID 6676c45f5b0c42af8e34f8fb4aba3aca
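>>
>> (For reference, the option that message mentions lives in the [DEFAULT]
>> section of neutron.conf, both on the server and on the agent hosts; the
>> value below is just an example, not a recommendation:)
>>
>> [DEFAULT]
>> # default is 60 seconds
>> rpc_response_timeout = 120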
>>
>> I'll need to take a closer look for more of these messages after the
>> weekend, but more importantly figure out why we can't seem to re-enable
>> the second node. I'll enable debug logs then and hopefully find a trace
>> to the root cause.
>> If you have other comments please don't hesitate, I'm thankful for any ideas.
>>
>> Thanks!
>> Eugen
>>
>>
>> Quoting Sean Mooney <smooney at redhat.com>:
>>
>> > On Sat, 2022-04-16 at 09:30 -0400, Laurent Dumont wrote:
>> >> I've seen failures with port bindings when rabbitmq was not in a good
>> >> state. Messages between services transit through Rabbit, so Nova/Neutron
>> >> might not be able to follow the flow correctly.
>> > That is not quite right.
>> >
>> > Inter-service messaging happens via HTTP REST APIs;
>> > intra-service communication happens via rabbit.
>> > Nova never calls Neutron over rabbit, nor does Neutron call Nova over
>> > rabbit.
>> >
>> > However, it is true that rabbit issues can sometimes cause port
>> > binding issues. If you are using ml2/ovs, the agent report/heartbeat
>> > can be lost from the perspective of the neutron server and it can
>> > consider the agent down. If the agent is "down", then the ml2/ovs
>> > mech driver will refuse to bind the port.
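>> >
>> > (The knobs involved in that heartbeat check, if you want to look at
>> > them, are roughly these; the values shown are the defaults as far as
>> > I recall:)
>> >
>> > # neutron.conf on the server side
>> > [DEFAULT]
>> > agent_down_time = 75
>> >
>> > # agent side (openvswitch_agent.ini or the agent's neutron.conf)
>> > [agent]
>> > report_interval = 30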
>> >
>> > Assuming the agent is up in the db, the request to bind the port never
>> > actually transits rabbitmq.
>> >
>> > The compute node makes an HTTP request to the neutron-server, which
>> > hosts the API endpoint and executes the ml2 drivers.
>> > The ml2/ovs driver only uses info from the neutron db, which it
>> > accesses directly.
>> >
>> > The neutron server debug logs should have records for the binding
>> > request which should detail why the port binding failed.
>> > It should show each loaded ml2 driver being tried in sequence to
>> > bind the port and, if it can't bind it, log the reason why.
>> >
>> > I would start by checking that the OVS L2 agents show as up in the
>> > db/API, then find a port id for one of the failed port bindings,
>> > trace the port binding in the neutron server debug logs for the
>> > error, and if you find one post it here.
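>> >
>> > (Something along these lines; the log path and the grep are only
>> > illustrative, adjust them to your deployment:)
>> >
>> > openstack network agent list --agent-type open-vswitch
>> > openstack port show <port-id> -c binding_vif_type -c binding_host_id
>> > grep <port-id> /var/log/neutron/neutron-server.log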
>> >
>> >>
>> >> Can you double check that rabbit is good to go?
>> >>
>> >>    - rabbitmqctl cluster_status
>> >>    - rabbitmqctl list_queues
>> >>
>> >> I would also recommend turning the logs to DEBUG for all the services
>> >> and trying to follow a server create request-id.
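>> >>
>> >> (That would be something like this in neutron.conf / nova.conf on the
>> >> controllers, followed by a restart of the services; then grep the logs
>> >> for the req-... id of the failing server create:)
>> >>
>> >> [DEFAULT]
>> >> debug = True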
>> >>
>> >> On Sat, Apr 16, 2022 at 4:44 AM Eugen Block <eblock at nde.ag> wrote:
>> >>
>> >> > Hi *,
>> >> >
>> >> > I have a kind of strange case which I've been trying to solve for
>> >> > hours; I could use some fresh ideas.
>> >> > It's an HA cloud (Victoria) deployed by Salt; the 2 control nodes
>> >> > are managed by pacemaker and the third controller will join soon.
>> >> > There are around 16 compute nodes at the moment.
>> >> > This two-node control plane works well, except when there are
>> >> > unplanned outages. Since the last outage of one control node we have
>> >> > struggled to revive neutron (I believe neutron is the issue here).
>> >> > I'll try to focus on the main issue here; let me know if more
>> >> > details are required.
>> >> > After the failed node was back online all openstack agents show as
>> >> > "up" (openstack compute service list, openstack network agent list).
>> >> > Running VMs don't seem to be impacted (as far as I can tell). But we
>> >> > can't create new instances in existing networks, and since we use
>> >> > Octavia we also can't (re)build any LBs at the moment. When I create a
>> >> > new test network, the instance spawns successfully and is active within
>> >> > a few seconds. For existing networks we get the famous "port binding
>> >> > failed" from nova-compute.log. But I see the port being created, it
>> >> > just can't be attached to the instance. One more strange thing: I
>> >> > don't see any entries in the nova-scheduler.log or nova-conductor.log
>> >> > for the successfully built instance, except for the recently mentioned
>> >> > etcd3gw message from nova-conductor, but that hasn't impacted
>> >> > instance creation so far.
>> >> > We have investigated this for hours and have rebooted both control
>> >> > nodes multiple times in order to kill any remaining processes. The
>> >> > galera DB seems fine and rabbitmq also behaves normally (I think);
>> >> > we tried multiple times to put one node in standby so we'd only have
>> >> > one node to look at, which also didn't help.
>> >> > So basically we restarted everything multiple times on the control
>> >> > nodes and also nova-compute and openvswitch-agent on all compute
>> >> > nodes, but the issue is still not resolved.
>> >> > Does anyone have further ideas to resolve this? I'd be happy to
>> >> > provide more details; just let me know what you need.
>> >> >
>> >> > Happy Easter!
>> >> > Eugen





