You can probably try each one in turn. Might be an issue with one of the two.

On Sat, Apr 16, 2022 at 6:23 PM Eugen Block <eblock@nde.ag> wrote:
Thank you both for your comments, I appreciate it!
Before digging into the logs I tried again with one of the two control 
nodes disabled. But I didn't disable all services, only apache, 
memcached, neutron, nova and octavia so all my requests would go to 
the active control node but rabbit and galera would be in sync. This 
already seemed to clean things up somehow, now I was able to launch 
instances and LBs into an active state. Awesome! Then I started the 
mentioned services on the other control node again and things stopped 
working. Note that this setup worked for months and we have another 
cloud with two control nodes which has worked like a charm for years now.
The only significant thing I noticed while switching back to one 
active neutron/nova/octavia node was this message from the 
neutron-dhcp-agent.log:

2022-04-16 23:59:29.180 36882 ERROR neutron_lib.rpc 
[req-905aecd6-ff22-4549-a0cb-ef5259692f5d - - - - -] Timeout in RPC 
method get_active_networks_info. Waiting for 510 seconds before next 
attempt. If the server is not down, consider increasing the 
rpc_response_timeout option as Neutron server(s) may be overloaded and 
unable to respond quickly enough.: 
oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a 
reply to message ID 6676c45f5b0c42af8e34f8fb4aba3aca
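
A minimal sketch for the follow-up search for more of these messages: it scans agent log lines for RPC timeout errors shaped like the excerpt above and tallies them per RPC method. The name `count_rpc_timeouts` and the regular expression are assumptions derived only from the single log line quoted here, and each log record is assumed to sit on one unwrapped line in the real file.

```python
import re
from collections import Counter

# Pattern modeled on the one excerpt above; an assumption, not an official format.
TIMEOUT_RE = re.compile(r"ERROR neutron_lib\.rpc .*Timeout in RPC method (\w+)\.")


def count_rpc_timeouts(lines):
    """Return a Counter mapping RPC method name -> number of timeout errors."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open file handle for neutron-dhcp-agent.log as `lines` should work, since the function only iterates over it; `counts.most_common()` then lists the methods that time out most often.
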

I'll need to take a closer look for more of these messages after the 
weekend, but more importantly find out why we can't seem to re-enable 
the second node. I'll enable debug logs then and hopefully find a trace 
to the root cause.
If you have other comments please don't hesitate, I'm thankful for any ideas.

Thanks!
Eugen


Zitat von Sean Mooney <smooney@redhat.com>:

> On Sat, 2022-04-16 at 09:30 -0400, Laurent Dumont wrote:
>> I've seen failures with port bindings when rabbitmq was not in a good
>> state. Messages between services transit through Rabbit so Nova/Neutron
>> might not be able to follow the flow correctly.
> that is not quite right.
>
> inter-service messages happen via http rest apis.
> intra-service communication happens via rabbit.
> nova never calls neutron over rabbit, nor does neutron call nova over rabbit.
>
> however it is true that rabbit issues can sometimes cause port binding issues.
> if you are using ml2/ovs the agent report/heartbeat can be lost from
> the perspective of the neutron server
> and it can consider the service down. if the agent is "down" then
> the ml2/ovs mech driver will refuse to
> bind the port.
>
> assuming the agent is up in the db the request to bind the port never
> actually transits rabbitmq.
>
> the compute node makes an http request to the neutron-server which
> hosts the api endpoint and executes the ml2 drivers.
> the ml2/ovs driver only uses info from the neutron db which it
> accesses directly.
>
> the neutron server debug logs should have records for binding
> requests which should detail why the port binding failed.
> it should show each loaded ml2 driver being tried in sequence to
> bind the port and, if it can't, log the reason why.
>
> i would start by checking that the ovs l2 agents show as up in the db/api
> then find a port id for one of the failed port bindings and trace
> the debug logs for the port binding in the neutron server
> logs for the error and if you find one post it here.
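
The tracing step described above could be sketched roughly like this: filter neutron-server debug log lines for one port ID and flag the ones logged at ERROR or WARNING level. `trace_port` is a hypothetical helper, and matching on the literal level names assumes the log line layout seen in the dhcp-agent excerpt earlier in this thread.

```python
def trace_port(lines, port_id):
    """Yield (is_problem, line) for every log line mentioning port_id.

    is_problem is True when the line contains an ERROR or WARNING level
    token; this is a plain substring heuristic, not a real log parser.
    """
    for line in lines:
        if port_id in line:
            is_problem = " ERROR " in line or " WARNING " in line
            yield is_problem, line.rstrip("\n")
```

Feeding it an open handle on the neutron server log together with the ID of a port whose binding failed should give the ordered sequence of binding attempts, with the flagged lines being the likely candidates to post here.
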
>
>>
>> Can you double check that rabbit is good to go?
>>
>>    - rabbitmqctl cluster_status
>>    - rabbitmqctl list_queues
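
As a sketch of how the queue listing could be checked, the snippet below parses the text printed by `rabbitmqctl list_queues name messages` and returns the queues whose message count exceeds a threshold. The helper name `queues_with_backlog` and the threshold of 100 are arbitrary assumptions; it also assumes tab-separated `name<TAB>count` rows and skips every other line (banners, headers).

```python
def queues_with_backlog(output, threshold=100):
    """Return [(queue_name, message_count)] for queues above threshold.

    Expects the text printed by `rabbitmqctl list_queues name messages`;
    lines that are not `name<TAB>integer` are ignored.
    """
    backlog = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 2 or not parts[1].strip().isdigit():
            continue  # banner, header, or unexpected row
        count = int(parts[1])
        if count > threshold:
            backlog.append((parts[0], count))
    # Largest backlog first, so the most suspicious queue is at the top.
    return sorted(backlog, key=lambda item: item[1], reverse=True)
```

Queues that keep growing while no consumer drains them would be one hint that a service lost its connection to rabbit; what counts as "too many" messages depends on the deployment.
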
>>
>> I would also recommend turning the logs to DEBUG for all the services and
>> trying to follow a server create request-id.
>>
>> On Sat, Apr 16, 2022 at 4:44 AM Eugen Block <eblock@nde.ag> wrote:
>>
>> > Hi *,
>> >
>> > I have a kind of strange case which I've been trying to solve for hours, I
>> > could use some fresh ideas.
>> > It's a HA cloud (Victoria) deployed by Salt and the 2 control nodes
>> > are managed by pacemaker, the third controller will join soon. There
>> > are around 16 compute nodes at the moment.
>> > This two-node control plane works well, except if there are unplanned
>> > outages. Since the last outage of one control node we struggle to
>> > revive neutron (I believe neutron is the issue here). I'll try to
>> > focus on the main issue here, let me know if more details are required.
>> > After the failed node was back online all openstack agents show as
>> > "up" (openstack compute service list, openstack network agent list).
>> > Running VMs don't seem to be impacted (as far as I can tell). But we
>> > can't create new instances in existing networks, and since we use
>> > Octavia we also can't (re)build any LBs at the moment. When I create a
>> > new test network the instance spawns successfully and is active within
>> > a few seconds. For existing networks we get the famous "port binding
>> > failed" from nova-compute.log. But I see the port being created, it
>> > just can't be attached to the instance. One more strange thing: I
>> > don't see any entries in the nova-scheduler.log or nova-conductor.log
>> > for the successfully built instance, except for the recently mentioned
>> > etcd3gw message from nova-conductor, but this didn't impact the
>> > instance creation yet.
>> > We have investigated this for hours, we have rebooted both control
>> > nodes multiple times in order to kill any remaining processes. The
>> > galera DB seems fine, rabbitmq also behaves normally (I think), we
>> > tried multiple times to put one node in standby to only have one node
>> > to look at which also didn't help.
>> > So basically we restarted everything multiple times on the control
>> > nodes and also nova-compute and openvswitch-agent on all compute
>> > nodes, the issue is still not resolved.
>> > Does anyone have further ideas to resolve this? I'd be happy to
>> > provide more details, just let me know what you need.
>> >
>> > Happy Easter!
>> > Eugen
>> >
>> >
>> >