[openstack-helm] health-probes timeout for neutron agents (l3 and dhcp) and nova compute

27 Mar 2025

Hi,

We’re facing an issue when deploying OpenStack 2024.1 with OpenStack Helm and OVS: the health probes fail for neutron-l3-agent, neutron-dhcp-agent and nova-compute, while the internal RPC communication between the nova and neutron components looks fine:

Neutron:

openstack network agent list
+--------------------------------------+--------------------+--------------------------------+-------------------+-------+-------+---------------------------+
| ID                                   | Agent Type         | Host                           | Availability Zone | Alive | State | Binary                    |
+--------------------------------------+--------------------+--------------------------------+-------------------+-------+-------+---------------------------+
| 433c2b07-5c82-4545-991d-6d0de6044ae9 | DHCP agent         | fig-virt-intdev-alberto-node-0 | nova              | :-)   | UP    | neutron-dhcp-agent        |
| 56c050fa-e58e-46a6-8191-0d0d481de246 | Open vSwitch agent | fig-virt-intdev-alberto-node-0 | None              | :-)   | UP    | neutron-openvswitch-agent |
| 5ea28388-3fac-4ebb-bce4-6b210f8a3766 | Metadata agent     | fig-virt-intdev-alberto-node-0 | None              | :-)   | UP    | neutron-metadata-agent    |
| 6a4506ca-6fe4-489a-b370-6e426db40d50 | DHCP agent         | fig-virt-intdev-alberto-node-1 | nova              | :-)   | UP    | neutron-dhcp-agent        |
| 772a1d3a-b432-4ac3-8dfe-d3affc6f226d | Metadata agent     | fig-virt-intdev-alberto-node-1 | None              | :-)   | UP    | neutron-metadata-agent    |
| 7bd8360c-468a-4473-a470-abfb1ab435da | Open vSwitch agent | fig-virt-intdev-alberto-node-1 | None              | :-)   | UP    | neutron-openvswitch-agent |
| 9d362a63-5740-44ed-ac77-4548eee59205 | L3 agent           | fig-virt-intdev-alberto-node-0 | nova              | :-)   | UP    | neutron-l3-agent          |
| a2cc9dc4-1600-4c5d-a250-2e43cf062cf3 | L3 agent           | fig-virt-intdev-alberto-node-1 | nova              | :-)   | UP    | neutron-l3-agent          |
+--------------------------------------+--------------------+--------------------------------+-------------------+-------+-------+---------------------------+

The neutron-l3-agent and neutron-dhcp-agent pods are not ready:

k -n openstack get pods -l application=neutron |grep 'l3\|dhcp'
neutron-dhcp-agent-default-fw5rm           0/1     Running     0               127m
neutron-dhcp-agent-default-vh7c4           0/1     Running     0               127m
neutron-l3-agent-default-4h5tp             0/1     Running     0               24m
neutron-l3-agent-default-bdngv             0/1     Running     0               24m

Because the health probes fail:

Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  4m38s (x119 over 127m)  kubelet  Readiness probe failed: Health probe timed out. Agent is down or response timed out

When the health probe is run manually from inside the container, it also fails with a timeout after 60 seconds (RPC_PROBE_TIMEOUT=60):

neutron@fig-virt-intdev-alberto-node-1:/$ python /tmp/health-probe.py --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --agent-queue-name dhcp_agent --use-fqdn
Health probe timed out. Agent is down or response timed out
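One check that may be worth making here, offered as a sketch rather than a confirmed diagnosis: assuming that --use-fqdn makes the probe address the agent by the node's fully qualified name, the probe would wait on a queue nobody consumes if the agent registered under the short host name instead. The helper below, with the hypothetical name compare_host_names, compares the Host value from the agent list with what the container itself reports.

```python
import socket


def compare_host_names(registered_host, short_name=None, fqdn=None):
    """Compare the host an agent registered with against local names.

    registered_host: the Host column from `openstack network agent list`.
    short_name / fqdn: default to what this machine reports; they can be
    passed explicitly to check values collected elsewhere.
    """
    if short_name is None:
        short_name = socket.gethostname()
    if fqdn is None:
        fqdn = socket.getfqdn()
    return {
        "registered": registered_host,
        "short_name": short_name,
        "fqdn": fqdn,
        "matches_short_name": registered_host == short_name,
        "matches_fqdn": registered_host == fqdn,
    }


if __name__ == "__main__":
    # Host value copied from the agent list above; run inside the agent pod.
    result = compare_host_names("fig-virt-intdev-alberto-node-1")
    for key, value in result.items():
        print(f"{key}: {value}")
```

If matches_fqdn is False while matches_short_name is True, the host-name assumption above would be worth investigating further; if both match, the cause is probably elsewhere.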

The situation for nova is quite similar:

openstack compute service list
+--------------------------------------+----------------+---------------------------------+----------+---------+-------+----------------------------+
| ID                                   | Binary         | Host                            | Zone     | Status  | State | Updated At                 |
+--------------------------------------+----------------+---------------------------------+----------+---------+-------+----------------------------+
| eaf4684c-88a4-47dd-b340-b7dbebf148ba | nova-conductor | nova-conductor-679d57f977-zbq92 | internal | enabled | up    | 2025-03-27T13:11:33.000000 |
| 9164d63f-608c-44cc-8a92-2f2bac07c622 | nova-scheduler | nova-scheduler-845f87b9b5-vq9p5 | internal | enabled | up    | 2025-03-27T13:11:36.000000 |
| 464cd394-8b26-49ad-8c82-f014ca00ec7e | nova-compute   | fig-virt-intdev-alberto-node-1  | nova     | enabled | up    | 2025-03-27T13:11:33.000000 |
| aa7e5f6d-2533-4953-bb8d-9f9bfb7fc6bd | nova-compute   | fig-virt-intdev-alberto-node-0  | nova     | enabled | up    | 2025-03-27T13:11:33.000000 |
+--------------------------------------+----------------+---------------------------------+----------+---------+-------+----------------------------+

And the nova-compute pods are not ready because the health probes fail:

k -n openstack get pods -l component=compute
NAME                         READY   STATUS    RESTARTS      AGE
nova-compute-default-t8w7x   1/2     Running   1 (70m ago)   3h31m
nova-compute-default-vz7nf   1/2     Running   1 (70m ago)   3h31m
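For repeated redeployments it can help to check the READY column mechanically instead of by eye. The following is a minimal sketch that parses text in the format of the pod listings above; the function name not_ready_pods is hypothetical, and it assumes the default column layout (NAME, READY, STATUS, ...).

```python
import re

# Matches "<name>  <ready>/<total>  <status>" at the start of a pod line.
POD_LINE = re.compile(r"^(\S+)\s+(\d+)/(\d+)\s+(\S+)")


def not_ready_pods(kubectl_output):
    """Return (name, ready, status) for pods with fewer ready containers
    than total containers."""
    pods = []
    for line in kubectl_output.splitlines():
        match = POD_LINE.match(line.strip())
        if match is None:
            continue  # header line, blank line, or unrelated text
        name, ready, total, status = match.groups()
        if int(ready) < int(total):
            pods.append((name, f"{ready}/{total}", status))
    return pods


if __name__ == "__main__":
    # Sample copied from the nova-compute listing above.
    sample = """\
NAME                         READY   STATUS    RESTARTS      AGE
nova-compute-default-t8w7x   1/2     Running   1 (70m ago)   3h31m
nova-compute-default-vz7nf   1/2     Running   1 (70m ago)   3h31m
"""
    for pod in not_ready_pods(sample):
        print(pod)
```

The same function works on the grep-filtered neutron listing, since lines without a READY value such as the header are skipped.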

To confirm that OpenStack RPC is working, the health probes were removed from nova-compute. With the probes removed, nova-cell-setup runs and the hypervisors are ready.

Several VMs have been launched successfully in this environment (nova-compute); they get IP addresses from the DHCP agent and floating IPs can be added (l3-agent), so OpenStack itself behaves normally. That is why I’m confused about the health probes failing.

Any ideas about what might be wrong or what to check are welcome.

Thanks!
Alberto

Note: This is a Kubernetes cluster running on two VMs (OpenStack). This virtual environment has been redeployed several times with some modifications and the issue is fully reproducible: the probes for nova-compute, neutron-l3-agent and neutron-dhcp-agent always fail.

Alberto Molina Coballes