<div dir="ltr">You can probably try each one in turn. Might be an issue with one of the two.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Apr 16, 2022 at 6:23 PM Eugen Block <<a href="mailto:eblock@nde.ag">eblock@nde.ag</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Thank you both for your comments, I appreciate it!<br>

Before digging into the logs I tried again with one of the two control nodes disabled. I didn't disable all services, only apache, memcached, neutron, nova and octavia, so that all my requests would go to the active control node while rabbit and galera stayed in sync. This already seemed to clean things up somehow: I was now able to launch instances and LBs into an active state. Awesome! Then I started the mentioned services on the other control node again and things stopped working. Note that this setup worked for months, and we have another cloud with two control nodes which has worked like a charm for years.<br>

The only significant thing I noticed while switching back to one active neutron/nova/octavia node was this message from the neutron-dhcp-agent.log:<br>

<br>

2022-04-16 23:59:29.180 36882 ERROR neutron_lib.rpc [req-905aecd6-ff22-4549-a0cb-ef5259692f5d - - - - -] Timeout in RPC method get_active_networks_info. Waiting for 510 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 6676c45f5b0c42af8e34f8fb4aba3aca<br>

<br>

I'll need to take a closer look for more of these messages after the weekend, but more importantly at why we can't seem to re-enable the second node. I'll enable debug logs then and hopefully find a trace to the root cause.<br>
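For scanning a log for more of these timeout messages, something like the following Python sketch might help. It assumes each record sits on a single line, as it does in the log file itself. The name find_rpc_timeouts and the regular expression are illustrative assumptions based only on the one message quoted above, not anything shipped with neutron or oslo.messaging.<br>
<pre>
```python
import re

# Hypothetical pattern for the neutron_lib.rpc timeout record quoted above.
# It assumes the whole record sits on one line, as it does in the log file.
TIMEOUT_RE = re.compile(
    r"^(\S+ \S+) \d+ ERROR neutron_lib\.rpc .*"
    r"Timeout in RPC method (\w+)\. "
    r"Waiting for (\d+) seconds before next attempt"
)


def find_rpc_timeouts(lines):
    """Return (timestamp, rpc_method, wait_seconds) for each timeout record."""
    hits = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            timestamp, method, wait = match.groups()
            hits.append((timestamp, method, int(wait)))
    return hits


sample = [
    # The record from this mail, joined back into one line.
    "2022-04-16 23:59:29.180 36882 ERROR neutron_lib.rpc "
    "[req-905aecd6-ff22-4549-a0cb-ef5259692f5d - - - - -] Timeout in RPC "
    "method get_active_networks_info. Waiting for 510 seconds before next "
    "attempt. If the server is not down, consider increasing the "
    "rpc_response_timeout option as Neutron server(s) may be overloaded and "
    "unable to respond quickly enough.",
    # Made-up line that should not match.
    "some unrelated line",
]
print(find_rpc_timeouts(sample))
```
</pre>
With a real log, an open file object can be passed as the lines argument, since iterating over a file yields one line at a time.<br>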

If you have other comments please don't hesitate, I'm thankful for any ideas.<br>

<br>

Thanks!<br>

Eugen<br>

<br>

<br>

Quoting Sean Mooney <<a href="mailto:smooney@redhat.com" target="_blank">smooney@redhat.com</a>>:<br>

<br>

> On Sat, 2022-04-16 at 09:30 -0400, Laurent Dumont wrote:<br>

>> I've seen failures with port bindings when rabbitmq was not in a good<br>

>> state. Messages between services transit through Rabbit so Nova/Neutron<br>

>> might not be able to follow the flow correctly.<br>

> that is not quite right.<br>

><br>

> inter-service messages happen via http rest apis.<br>

> intra-service communication happens via rabbit.<br>

> nova never calls neutron over rabbit, nor does neutron call nova over rabbit.<br>

><br>

> however it is true that rabbit issues can sometimes cause port binding issues.<br>

> if you are using ml2/ovs the agent report/heartbeat can be lost from the perspective of the neutron server,<br>

> and it can consider the service down. if the agent is "down" then the ml2/ovs mech driver will refuse to bind the port.<br>

><br>

> assuming the agent is up in the db, the request to bind the port never actually transits rabbitmq.<br>

><br>

> the compute node makes an http request to the neutron-server, which hosts the api endpoint and executes the ml2 drivers.<br>

> the ml2/ovs driver only uses info from the neutron db, which it accesses directly.<br>

><br>

> the neutron server debug logs should have records for the binding request which should detail why the port binding failed.<br>

> it should show each loaded ml2 driver being tried in sequence to bind the port and, if it can't, log the reason why.<br>

><br>

> i would start by checking that the ovs l2 agents show as up in the db/api,<br>

> then find a port id for one of the failed port bindings and trace the debug logs for the port binding in the neutron server logs for the error, and if you find one post it here.<br>

><br>
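The tracing step described above (pick the port id of a failed binding, then follow it through the neutron server debug log) could be sketched in Python roughly as follows. This is only an illustration: the helper names trace_port and only_problems, the placeholder port id and the sample lines are all made up, and the matching is plain substring search rather than anything neutron provides.<br>
<pre>
```python
def trace_port(lines, port_id):
    """Return the log lines that mention port_id, in their original order."""
    return [line.rstrip("\n") for line in lines if port_id in line]


def only_problems(lines, levels=("ERROR", "WARNING")):
    """Keep lines whose text contains one of the given log level names."""
    return [line for line in lines if any(level in line for level in levels)]


# Placeholder port ID and made-up log lines, for illustration only.
port_id = "11111111-2222-3333-4444-555555555555"
sample = [
    "DEBUG example message about port " + port_id + "\n",
    "DEBUG example message about another port\n",
    "ERROR example failure for port " + port_id + "\n",
]

# First narrow the log down to one port, then to the problem records.
matches = trace_port(sample, port_id)
print(only_problems(matches))
```
</pre>
With a real log, an open file object for the neutron server log can be passed as the lines argument; the log path depends on the deployment.<br>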

>><br>

>> Can you double check that rabbit is good to go?<br>

>><br>

>>    - rabbitmqctl cluster_status<br>

>>    - rabbitmqctl list_queues<br>

>><br>

>> I would also recommend turning the logs to DEBUG for all the services and<br>

>> trying to follow a server create request-id.<br>

>><br>

>> On Sat, Apr 16, 2022 at 4:44 AM Eugen Block <<a href="mailto:eblock@nde.ag" target="_blank">eblock@nde.ag</a>> wrote:<br>

>><br>

>> > Hi *,<br>

>> ><br>

>> > I have a kind of strange case which I've been trying to solve for hours, I<br>

>> > could use some fresh ideas.<br>

>> > It's a HA cloud (Victoria) deployed by Salt and the 2 control nodes<br>

>> > are managed by pacemaker, the third controller will join soon. There<br>

>> > are around 16 compute nodes at the moment.<br>

>> > This two-node control plane works well, except if there are unplanned<br>

>> > outages. Since the last outage of one control node we struggle to<br>

>> > revive neutron (I believe neutron is the issue here). I'll try to<br>

>> > focus on the main issue here, let me know if more details are required.<br>

>> > After the failed node was back online all openstack agents show as<br>

>> > "up" (openstack compute service list, openstack network agent list).<br>

>> > Running VMs don't seem to be impacted (as far as I can tell). But we<br>

>> > can't create new instances in existing networks, and since we use<br>

>> > Octavia we also can't (re)build any LBs at the moment. When I create a<br>

>> > new test network the instance spawns successfully and is active within<br>

>> > a few seconds. For existing networks we get the famous "port binding<br>

>> > failed" from nova-compute.log. But I see the port being created, it<br>

>> > just can't be attached to the instance. One more strange thing: I<br>

>> > don't see any entries in the nova-scheduler.log or nova-conductor.log<br>

>> > for the successfully built instance, except for the recently mentioned<br>

>> > etcd3gw message from nova-conductor, but this didn't impact the<br>

>> > instance creation yet.<br>

>> > We have investigated this for hours, we have rebooted both control<br>

>> > nodes multiple times in order to kill any remaining processes. The<br>

>> > galera DB seems fine, rabbitmq also behaves normally (I think), we<br>

>> > tried multiple times to put one node in standby to only have one node<br>

>> > to look at which also didn't help.<br>

>> > So basically we restarted everything multiple times on the control<br>

>> > nodes and also nova-compute and openvswitch-agent on all compute<br>

>> > nodes, the issue is still not resolved.<br>

>> > Does anyone have further ideas to resolve this? I'd be happy to<br>

>> > provide more details, just let me know what you need.<br>

>> ><br>

>> > Happy Easter!<br>

>> > Eugen<br>

>> ><br>

>> ><br>

>> ><br>

<br>

<br>

<br>

</blockquote></div>