<div dir="ltr">You can probably try each one in turn. Might be an issue with one of the two.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Apr 16, 2022 at 6:23 PM Eugen Block <<a href="mailto:eblock@nde.ag">eblock@nde.ag</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Thank you both for your comments, I appreciate it!<br>
Before digging into the logs I tried again with one of the two control <br>
nodes disabled. But I didn't disable all services, only apache, <br>
memcached, neutron, nova and octavia so all my requests would go to <br>
the active control node but rabbit and galera would be in sync. This <br>
already seemed to clean things up somehow, now I was able to launch <br>
instances and LBs into an active state. Awesome! Then I started the <br>
mentioned services on the other control node again and things stopped <br>
working. Note that this setup had worked for months, and we have another <br>
cloud with two control nodes which has worked like a charm for years now.<br>
The only significant thing I noticed while switching back to one <br>
active neutron/nova/octavia node was this message from the <br>
neutron-dhcp-agent.log:<br>
<br>
2022-04-16 23:59:29.180 36882 ERROR neutron_lib.rpc <br>
[req-905aecd6-ff22-4549-a0cb-ef5259692f5d - - - - -] Timeout in RPC <br>
method get_active_networks_info. Waiting for 510 seconds before next <br>
attempt. If the server is not down, consider increasing the <br>
rpc_response_timeout option as Neutron server(s) may be overloaded and <br>
unable to respond quickly enough.: <br>
oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a <br>
reply to message ID 6676c45f5b0c42af8e34f8fb4aba3aca<br>
<br>
I'll need to take a closer look for more of these messages after the <br>
weekend, but more importantly why we can't seem to reenable the second <br>
node. I'll enable debug logs then and hopefully find a trace to the <br>
root cause.<br>
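<br>
One way to look for more of these messages is a small script that counts the "Timeout in RPC method" errors per RPC method. This is only a minimal sketch: it assumes each log entry sits on a single line in the actual log file (the wrapping above comes from the mail client), and the function name count_rpc_timeouts is made up for illustration.<br>

```python
import re
from collections import Counter

# Matches the timeout error quoted above and captures the RPC method name.
# Assumption: the whole log entry is on one line in the real log file.
TIMEOUT_RE = re.compile(r"Timeout in RPC method (\w+)\.")


def count_rpc_timeouts(lines):
    """Return a Counter mapping RPC method name to number of timeout errors."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It accepts any iterable of lines, so an open file handle for neutron-dhcp-agent.log can be passed directly; counts.most_common() then lists the methods that time out most often.<br>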
If you have other comments please don't hesitate, I'm thankful for any ideas.<br>
<br>
Thanks!<br>
Eugen<br>
<br>
<br>
Quoting Sean Mooney <<a href="mailto:smooney@redhat.com" target="_blank">smooney@redhat.com</a>>:<br>
<br>
> On Sat, 2022-04-16 at 09:30 -0400, Laurent Dumont wrote:<br>
>> I've seen failures with port bindings when rabbitmq was not in a good<br>
>> state. Messages between services transit through Rabbit so Nova/Neutron<br>
>> might not be able to follow the flow correctly.<br>
> that is not quite right.<br>
><br>
> inter-service messages happen via http rest apis.<br>
> intra-service communication happens via rabbit.<br>
> nova never calls neutron over rabbit, nor does neutron call nova over rabbit.<br>
><br>
> however it is true that rabbit issues can sometimes cause port binding issues.<br>
> if you are using ml2/ovs the agent report/heartbeat can be lost from <br>
> the perspective of the neutron server<br>
> and it can consider the service down. if the agent is "down" then <br>
> the ml2/ovs mech driver will refuse to<br>
> bind the port.<br>
><br>
> assuming the agent is up in the db, the request to bind the port never <br>
> actually transits rabbitmq.<br>
><br>
> the compute node makes an http request to the neutron-server which <br>
> hosts the api endpoint and executes the ml2 drivers.<br>
> the ml2/ovs driver only uses info from the neutron db which it <br>
> accesses directly.<br>
><br>
> the neutron server debug logs should have records for binding <br>
> requests which should detail why the port binding failed.<br>
> it should show each loaded ml2 driver being tried in sequence to <br>
> bind the port and, if it can't, log the reason why.<br>
><br>
> i would start by checking that the ovs l2 agents show as up in the db/api<br>
> then find a port id for one of the failed port bindings and trace <br>
> the debug logs for the port binding in the neutron server<br>
> logs for the error and if you find one post it here.<br>
><br>
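The tracing step described above could be sketched roughly as follows. This is a hedged illustration, not neutron tooling: the function name trace_port_binding is hypothetical, and it assumes only that the relevant debug lines contain the port UUID and the substring "bind". It makes no assumption about the exact wording of neutron's log messages.<br>

```python
def trace_port_binding(lines, port_id):
    """Return log lines that mention port_id and the word 'bind' (any case).

    Assumption: binding-related debug lines include the port UUID and some
    form of 'bind' ('binding', 'bound' is NOT matched), nothing more.
    """
    hits = []
    for line in lines:
        if port_id in line and "bind" in line.lower():
            hits.append(line.rstrip("\n"))
    return hits
```

Passing an open neutron server log file and the UUID of one of the failed ports returns the candidate lines in file order, which can then be read for the reason each ml2 driver gave.<br>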
>><br>
>> Can you double check that rabbit is good to go?<br>
>><br>
>> - rabbitmqctl cluster_status<br>
>> - rabbitmqctl list_queues<br>
>><br>
>> I would also recommend turning the logs to DEBUG for all the services and<br>
>> trying to follow a server create request-id.<br>
>><br>
>> On Sat, Apr 16, 2022 at 4:44 AM Eugen Block <<a href="mailto:eblock@nde.ag" target="_blank">eblock@nde.ag</a>> wrote:<br>
>><br>
>> > Hi *,<br>
>> ><br>
>> > I have a kind of strange case which I've been trying to solve for hours, I<br>
>> > could use some fresh ideas.<br>
>> > It's a HA cloud (Victoria) deployed by Salt and the 2 control nodes<br>
>> > are managed by pacemaker, the third controller will join soon. There<br>
>> > are around 16 compute nodes at the moment.<br>
>> > This two-node-control plane works well, except if there are unplanned<br>
>> > outages. Since the last outage of one control node we struggle to<br>
>> > revive neutron (I believe neutron is the issue here). I'll try to<br>
>> > focus on the main issue here, let me know if more details are required.<br>
>> > After the failed node was back online all openstack agents show as<br>
>> > "up" (openstack compute service list, openstack network agent list).<br>
>> > Running VMs don't seem to be impacted (as far as I can tell). But we<br>
>> > can't create new instances in existing networks, and since we use<br>
>> > Octavia we also can't (re)build any LBs at the moment. When I create a<br>
>> > new test network the instance spawns successfully and is active within<br>
>> > a few seconds. For existing networks we get the famous "port binding<br>
>> > failed" from nova-compute.log. But I see the port being created, it<br>
>> > just can't be attached to the instance. One more strange thing: I<br>
>> > don't see any entries in the nova-scheduler.log or nova-conductor.log<br>
>> > for the successfully built instance, except for the recently mentioned<br>
>> > etcd3gw message from nova-conductor, but this didn't impact the<br>
>> > instance creation yet.<br>
>> > We have investigated this for hours, we have rebooted both control<br>
>> > nodes multiple times in order to kill any remaining processes. The<br>
>> > galera DB seems fine, rabbitmq also behaves normally (I think), we<br>
>> > tried multiple times to put one node in standby to only have one node<br>
>> > to look at which also didn't help.<br>
>> > So basically we restarted everything multiple times on the control<br>
>> > nodes and also nova-compute and openvswitch-agent on all compute<br>
>> > nodes, the issue is still not resolved.<br>
>> > Does anyone have further ideas to resolve this? I'd be happy to<br>
>> > provide more details, just let me know what you need.<br>
>> ><br>
>> > Happy Easter!<br>
>> > Eugen<br>
>> ><br>
>> ><br>
>> ><br>
<br>
<br>
<br>
</blockquote></div>