[openstack-helm] health-probes timeout for neutron agents (l3 and dhcp) and nova compute
Hi,

We’re facing an issue when deploying OpenStack 2024.1 with OpenStack Helm and OVS: the health probes fail for neutron-l3-agent, neutron-dhcp-agent and nova-compute, while the internal RPC communication between the nova and neutron components looks fine.

Neutron:

openstack network agent list
+--------------------------------------+--------------------+--------------------------------+-------------------+-------+-------+---------------------------+
| ID                                   | Agent Type         | Host                           | Availability Zone | Alive | State | Binary                    |
+--------------------------------------+--------------------+--------------------------------+-------------------+-------+-------+---------------------------+
| 433c2b07-5c82-4545-991d-6d0de6044ae9 | DHCP agent         | fig-virt-intdev-alberto-node-0 | nova              | :-)   | UP    | neutron-dhcp-agent        |
| 56c050fa-e58e-46a6-8191-0d0d481de246 | Open vSwitch agent | fig-virt-intdev-alberto-node-0 | None              | :-)   | UP    | neutron-openvswitch-agent |
| 5ea28388-3fac-4ebb-bce4-6b210f8a3766 | Metadata agent     | fig-virt-intdev-alberto-node-0 | None              | :-)   | UP    | neutron-metadata-agent    |
| 6a4506ca-6fe4-489a-b370-6e426db40d50 | DHCP agent         | fig-virt-intdev-alberto-node-1 | nova              | :-)   | UP    | neutron-dhcp-agent        |
| 772a1d3a-b432-4ac3-8dfe-d3affc6f226d | Metadata agent     | fig-virt-intdev-alberto-node-1 | None              | :-)   | UP    | neutron-metadata-agent    |
| 7bd8360c-468a-4473-a470-abfb1ab435da | Open vSwitch agent | fig-virt-intdev-alberto-node-1 | None              | :-)   | UP    | neutron-openvswitch-agent |
| 9d362a63-5740-44ed-ac77-4548eee59205 | L3 agent           | fig-virt-intdev-alberto-node-0 | nova              | :-)   | UP    | neutron-l3-agent          |
| a2cc9dc4-1600-4c5d-a250-2e43cf062cf3 | L3 agent           | fig-virt-intdev-alberto-node-1 | nova              | :-)   | UP    | neutron-l3-agent          |
+--------------------------------------+--------------------+--------------------------------+-------------------+-------+-------+---------------------------+

The neutron-l3 and neutron-dhcp pods are not ready:

k -n openstack get pods -l application=neutron |grep 'l3\|dhcp'
neutron-dhcp-agent-default-fw5rm   0/1   Running   0   127m
neutron-dhcp-agent-default-vh7c4   0/1   Running   0   127m
neutron-l3-agent-default-4h5tp     0/1   Running   0   24m
neutron-l3-agent-default-bdngv     0/1   Running   0   24m

Because the health probes fail:

Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  4m38s (x119 over 127m)  kubelet  Readiness probe failed: Health probe timed out. Agent is down or response timed out

When the health probe is executed manually from inside the container, it fails with a timeout after 60 seconds (RPC_PROBE_TIMEOUT=60):

neutron@fig-virt-intdev-alberto-node-1:/$ python /tmp/health-probe.py --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --agent-queue-name dhcp_agent --use-fqdn
Health probe timed out. Agent is down or response timed out

The situation for nova is quite similar:

openstack compute service list
+--------------------------------------+----------------+---------------------------------+----------+---------+-------+----------------------------+
| ID                                   | Binary         | Host                            | Zone     | Status  | State | Updated At                 |
+--------------------------------------+----------------+---------------------------------+----------+---------+-------+----------------------------+
| eaf4684c-88a4-47dd-b340-b7dbebf148ba | nova-conductor | nova-conductor-679d57f977-zbq92 | internal | enabled | up    | 2025-03-27T13:11:33.000000 |
| 9164d63f-608c-44cc-8a92-2f2bac07c622 | nova-scheduler | nova-scheduler-845f87b9b5-vq9p5 | internal | enabled | up    | 2025-03-27T13:11:36.000000 |
| 464cd394-8b26-49ad-8c82-f014ca00ec7e | nova-compute   | fig-virt-intdev-alberto-node-1  | nova     | enabled | up    | 2025-03-27T13:11:33.000000 |
| aa7e5f6d-2533-4953-bb8d-9f9bfb7fc6bd | nova-compute   | fig-virt-intdev-alberto-node-0  | nova     | enabled | up    | 2025-03-27T13:11:33.000000 |
+--------------------------------------+----------------+---------------------------------+----------+---------+-------+----------------------------+

And the nova-compute pods are not ready because the health probes fail:

k -n openstack get pods -l component=compute
NAME                         READY   STATUS    RESTARTS      AGE
nova-compute-default-t8w7x   1/2     Running   1 (70m ago)   3h31m
nova-compute-default-vz7nf   1/2     Running   1 (70m ago)   3h31m

To be sure that the OpenStack RPC is working fine, the health probes were removed from nova-compute, then nova-cell-setup was executed and the hypervisors became ready. Some VMs have been launched successfully in this environment (nova-compute), getting an IP from the DHCP agent or having floating IPs added (l3-agent), and the OpenStack behaviour is normal.

So I’m confused about the health probes failing. Any idea on what could be wrong or what to check is welcome.

Thanks!

Alberto

Note: This is a Kubernetes cluster running on two VMs (OpenStack). This virtual environment has been redeployed several times with some modifications and the issue is fully reproducible: the probes for nova-compute, neutron-l3-agent and neutron-dhcp-agent always fail.
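
One check that might narrow this down, sketched under assumptions: the manual probe above is invoked with --use-fqdn, while the agents are registered under short host names such as fig-virt-intdev-alberto-node-0. Assuming (not verified here) that the probe script picks its RPC target host with socket.getfqdn() when --use-fqdn is given and socket.gethostname() otherwise, a probe whose target host does not match a registered host could time out even though the agent itself is healthy. The helper names below are hypothetical and only illustrate the comparison; this is not the probe's actual code.

```python
import socket

# Host names as registered in "openstack network agent list" above.
REGISTERED_HOSTS = [
    "fig-virt-intdev-alberto-node-0",
    "fig-virt-intdev-alberto-node-1",
]


def probe_target_host(use_fqdn: bool) -> str:
    """Return the host name a probe would address.

    Assumption: with --use-fqdn the probe uses socket.getfqdn(),
    otherwise socket.gethostname(). This mirrors the assumed behaviour
    of the probe script and is not taken from its source.
    """
    return socket.getfqdn() if use_fqdn else socket.gethostname()


def matches_registered_host(target: str, registered_hosts: list) -> bool:
    """True if the probe's target host is one the agents registered under."""
    return target in registered_hosts


if __name__ == "__main__":
    # Run inside the agent container to compare both resolution modes.
    for use_fqdn in (False, True):
        target = probe_target_host(use_fqdn)
        ok = matches_registered_host(target, REGISTERED_HOSTS)
        print(f"use_fqdn={use_fqdn}: target={target!r} registered={ok}")
```

If the FQDN line reports registered=False while the short-name line reports True, the host name used by the probe would be the first thing to look at; if both match, the cause is probably elsewhere.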
Alberto Molina Coballes