I have an interesting update on this. For the last two days I let the cloud run in a degraded (and unmanaged) state with respect to pacemaker, meaning I stopped the apache, memcached, neutron, openvswitch, nova and octavia services on one control node. Today I wanted to start those services again one by one, hoping to find the responsible one. But everything works fine: after each newly started service I tried to launch an instance, and all attempts were successful. So I disabled the pacemaker maintenance mode and retried, and still everything worked.

I assume that some of the services still held cached references to the disabled control node and couldn't recover. Does that make sense? On the other hand, we rebooted both control nodes a few times, and I expected that to clean up anything like that. So while the issue seems to be resolved, I still have no idea what went wrong. :-(

Anyway, I hope to bring the third node online this week so we'll hopefully be more resilient against control node failures.

Thanks again for your comments!
Eugen

Quoting Laurent Dumont <laurentfdumont@gmail.com>:
You can probably try each one in turn. Might be an issue with one of the two.
On Sat, Apr 16, 2022 at 6:23 PM Eugen Block <eblock@nde.ag> wrote:
Thank you both for your comments, I appreciate it! Before digging into the logs I tried again with one of the two control nodes disabled. But I didn't disable all services, only apache, memcached, neutron, nova and octavia, so all my requests would go to the active control node while rabbit and galera would stay in sync. This already seemed to clean things up somehow: I was now able to launch instances and LBs into an active state. Awesome! Then I started the mentioned services on the other control node again and things stopped working. Note that this setup worked for months, and we have another cloud with two control nodes which has worked like a charm for years. The only significant thing I noticed while switching back to one active neutron/nova/octavia node was this message from the neutron-dhcp-agent.log:
2022-04-16 23:59:29.180 36882 ERROR neutron_lib.rpc [req-905aecd6-ff22-4549-a0cb-ef5259692f5d - - - - -] Timeout in RPC method get_active_networks_info. Waiting for 510 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 6676c45f5b0c42af8e34f8fb4aba3aca
I'll need to take a closer look for more of these messages after the weekend, but more importantly at why we can't seem to re-enable the second node. I'll enable debug logs then and hopefully find a trace to the root cause. If you have other comments please don't hesitate, I'm thankful for any ideas.
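To look for more of these timeout messages, a small script along the following lines might help. It is only a sketch: it assumes the log entries use the same "Timeout in RPC method ..." wording as the line quoted above, and the function name `count_rpc_timeouts` is made up for illustration.

```python
import re
from collections import Counter

# Matches the wording of the neutron_lib.rpc error quoted above (assumed stable).
TIMEOUT_RE = re.compile(r"Timeout in RPC method (\w+)")


def count_rpc_timeouts(lines):
    """Return a Counter mapping RPC method name to number of timeout entries."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical usage (the log path is an assumption, adjust to your deployment):
# with open("/var/log/neutron/neutron-dhcp-agent.log", errors="replace") as fh:
#     for method, hits in count_rpc_timeouts(fh).most_common():
#         print(hits, method)
```

Counting per method rather than just grepping makes it easier to see whether only one RPC call is affected or whether all calls to the neutron server time out.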
Thanks! Eugen
Quoting Sean Mooney <smooney@redhat.com>:
On Sat, 2022-04-16 at 09:30 -0400, Laurent Dumont wrote:
I've seen failures with port bindings when rabbitmq was not in a good state. Messages between services transit through Rabbit so Nova/Neutron might not be able to follow the flow correctly.
That is not quite right.
Inter-service messaging happens via HTTP REST APIs. Intra-service communication happens via rabbit. Nova never calls neutron over rabbit, nor does neutron call nova over rabbit.
However, it is true that rabbit issues can sometimes cause port binding issues. If you are using ml2/ovs, the agent report/heartbeat can be lost from the perspective of the neutron server, and it can consider the service down. If the agent is "down" then the ml2/ovs mech driver will refuse to bind the port.
Assuming the agent is up in the db, the request to bind the port never actually transits rabbitmq.
The compute node makes an HTTP request to the neutron-server, which hosts the API endpoint and executes the ml2 drivers. The ml2/ovs driver only uses info from the neutron db, which it accesses directly.
The neutron server debug logs should have records for binding requests, which should detail why the port binding failed. They should show each loaded ml2 driver being tried in sequence to bind the port and, if it can't, log the reason why.
I would start by checking that the ovs l2 agents show as up in the db/api, then find a port id for one of the failed port bindings and trace the debug logs for the port binding in the neutron server logs for the error. If you find one, post it here.
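The first check (are the agents reported as up?) could be scripted roughly as below. This is a sketch under assumptions: it expects the output of `openstack network agent list -f json`, assumes the entries carry "Host", "Binary" and "Alive" keys, and assumes "Alive" is either a boolean or the ":-)" / "XXX" marker depending on the client version. The function name `dead_agents` is hypothetical.

```python
import json


def dead_agents(agent_list_json):
    """Return (host, binary) pairs for agents that are not reported alive."""
    dead = []
    for agent in json.loads(agent_list_json):
        alive = agent.get("Alive")
        # Assumption: newer clients emit a boolean, older ones ":-)" or "XXX".
        if alive is True or alive == ":-)":
            continue
        dead.append((agent.get("Host"), agent.get("Binary")))
    return dead


# Hypothetical usage:
#   openstack network agent list -f json > agents.json
# with open("agents.json") as fh:
#     for host, binary in dead_agents(fh.read()):
#         print(host, binary)
```

An empty result only says the agents look alive to the neutron server right now; the reason for a failed binding still has to come from the neutron server debug logs.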
Can you double check that rabbit is good to go?
- rabbitmqctl cluster_status
- rabbitmqctl list_queues
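The `list_queues` output can get long on a cloud this size, so a small filter for queues with a message backlog may be useful. The sketch below assumes the default output of `rabbitmqctl list_queues`, i.e. tab-separated queue name and message count, possibly preceded by banner or header lines. The function name and the threshold of 100 are arbitrary choices for illustration.

```python
def queues_with_backlog(list_queues_output, threshold=100):
    """Return (queue_name, messages) pairs at or above threshold, largest first."""
    backlog = []
    for line in list_queues_output.splitlines():
        parts = line.split("\t")
        # Skip banner and header lines: keep only "<name><TAB><integer>" rows.
        if len(parts) != 2 or not parts[1].strip().isdigit():
            continue
        name, messages = parts[0], int(parts[1])
        if messages >= threshold:
            backlog.append((name, messages))
    return sorted(backlog, key=lambda item: item[1], reverse=True)


# Hypothetical usage:
#   rabbitmqctl list_queues > queues.txt
# with open("queues.txt") as fh:
#     for name, messages in queues_with_backlog(fh.read()):
#         print(messages, name)
```

A queue that keeps growing usually means nothing is consuming from it, which would point at the service that owns that queue.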
I would also recommend turning the logs to DEBUG for all the services and trying to follow a server create request-id.
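Following a request id through the logs can be done with grep, but a minimal Python sketch is shown here for completeness. It assumes request ids have the `req-<uuid>` form seen in the neutron-dhcp-agent log line quoted earlier in this thread and that error lines contain the literal level name " ERROR ". Both function names are made up for illustration.

```python
import re

# Request ids as they appear in oslo.log context, e.g. req-905aecd6-ff22-...
REQ_ID_RE = re.compile(
    r"req-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def lines_for_request(lines, request_id):
    """Return all log lines mentioning the given request id (or any other token)."""
    return [line.rstrip("\n") for line in lines if request_id in line]


def request_ids_with_errors(lines):
    """Return request ids that appear on ERROR-level lines, in first-seen order."""
    seen = []
    for line in lines:
        if " ERROR " not in line:
            continue
        match = REQ_ID_RE.search(line)
        if match and match.group(0) not in seen:
            seen.append(match.group(0))
    return seen
```

Since `lines_for_request` is a plain substring match, it can also be given a port UUID instead of a request id to trace one failed port binding.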
On Sat, Apr 16, 2022 at 4:44 AM Eugen Block <eblock@nde.ag> wrote:
Hi *,
I have a rather strange case which I've been trying to solve for hours, and I could use some fresh ideas. It's an HA cloud (Victoria) deployed by Salt, and the 2 control nodes are managed by pacemaker; the third controller will join soon. There are around 16 compute nodes at the moment. This two-node control plane works well, except when there are unplanned outages. Since the last outage of one control node we have struggled to revive neutron (I believe neutron is the issue here). I'll try to focus on the main issue here, let me know if more details are required.

After the failed node was back online all openstack agents show as "up" (openstack compute service list, openstack network agent list). Running VMs don't seem to be impacted (as far as I can tell). But we can't create new instances in existing networks, and since we use Octavia we also can't (re)build any LBs at the moment. When I create a new test network the instance spawns successfully and is active within a few seconds. For existing networks we get the famous "port binding failed" from nova-compute.log. But I see the port being created, it just can't be attached to the instance. One more strange thing: I don't see any entries in the nova-scheduler.log or nova-conductor.log for the successfully built instance, except for the recently mentioned etcd3gw message from nova-conductor, but this hasn't impacted the instance creation so far.

We have investigated this for hours and have rebooted both control nodes multiple times in order to kill any remaining processes. The galera DB seems fine and rabbitmq also behaves normally (I think). We tried multiple times to put one node in standby to have only one node to look at, which also didn't help. So basically we restarted everything multiple times on the control nodes, and also nova-compute and openvswitch-agent on all compute nodes, and the issue is still not resolved. Does anyone have further ideas to resolve this? I'd be happy to provide more details, just let me know what you need.
Happy Easter! Eugen