Hi,
We’re facing an issue when deploying OpenStack 2024.1 with OpenStack Helm and OVS: the health probes fail for neutron-l3-agent, neutron-dhcp-agent and nova-compute, while
the internal RPC communication between the nova and neutron components looks fine.
Neutron:
openstack network agent list
+--------------------------------------+--------------------+--------------------------------+-------------------+-------+-------+---------------------------+
| ID | Agent Type | Host | Availability Zone | Alive | State | Binary |
+--------------------------------------+--------------------+--------------------------------+-------------------+-------+-------+---------------------------+
| 433c2b07-5c82-4545-991d-6d0de6044ae9 | DHCP agent | fig-virt-intdev-alberto-node-0 | nova | :-) | UP | neutron-dhcp-agent |
| 56c050fa-e58e-46a6-8191-0d0d481de246 | Open vSwitch agent | fig-virt-intdev-alberto-node-0 | None | :-) | UP | neutron-openvswitch-agent |
| 5ea28388-3fac-4ebb-bce4-6b210f8a3766 | Metadata agent | fig-virt-intdev-alberto-node-0 | None | :-) | UP | neutron-metadata-agent |
| 6a4506ca-6fe4-489a-b370-6e426db40d50 | DHCP agent | fig-virt-intdev-alberto-node-1 | nova | :-) | UP | neutron-dhcp-agent |
| 772a1d3a-b432-4ac3-8dfe-d3affc6f226d | Metadata agent | fig-virt-intdev-alberto-node-1 | None | :-) | UP | neutron-metadata-agent |
| 7bd8360c-468a-4473-a470-abfb1ab435da | Open vSwitch agent | fig-virt-intdev-alberto-node-1 | None | :-) | UP | neutron-openvswitch-agent |
| 9d362a63-5740-44ed-ac77-4548eee59205 | L3 agent | fig-virt-intdev-alberto-node-0 | nova | :-) | UP | neutron-l3-agent |
| a2cc9dc4-1600-4c5d-a250-2e43cf062cf3 | L3 agent | fig-virt-intdev-alberto-node-1 | nova | :-) | UP | neutron-l3-agent |
+--------------------------------------+--------------------+--------------------------------+-------------------+-------+-------+---------------------------+
The neutron-l3 and neutron-dhcp pods are not ready:
k -n openstack get pods -l application=neutron |grep 'l3\|dhcp'
neutron-dhcp-agent-default-fw5rm 0/1 Running 0 127m
neutron-dhcp-agent-default-vh7c4 0/1 Running 0 127m
neutron-l3-agent-default-4h5tp 0/1 Running 0 24m
neutron-l3-agent-default-bdngv 0/1 Running 0 24m
Because the health probes fail:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 4m38s (x119 over 127m) kubelet Readiness probe failed: Health probe timed out. Agent is down or response timed out
When the health probe is executed manually from inside the container, it times out after 60 seconds (RPC_PROBE_TIMEOUT=60):
neutron@fig-virt-intdev-alberto-node-1:/$ python /tmp/health-probe.py --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --agent-queue-name dhcp_agent --use-fqdn
Health probe timed out. Agent is down or response timed out
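One possible check, sketched under an assumption: if --use-fqdn makes the probe address the agent by the value of socket.getfqdn() (and by socket.gethostname() without the flag), then the name the probe targets can be compared with the Host column from the agent list above. This has not been verified against the probe script, and the helper name probe_target_matches is hypothetical.

```python
import socket


def probe_target_matches(registered_host: str, use_fqdn: bool = True) -> bool:
    """Return True if the name the probe would target equals the registered host.

    Assumption (not verified against health-probe.py): the probe uses
    socket.getfqdn() when --use-fqdn is set and socket.gethostname() otherwise.
    """
    target = socket.getfqdn() if use_fqdn else socket.gethostname()
    return target == registered_host


if __name__ == "__main__":
    # Print both candidate names so they can be compared by eye as well.
    print("hostname:", socket.gethostname())
    print("fqdn:    ", socket.getfqdn())
    # Host value copied from the "openstack network agent list" output above.
    print("match:   ", probe_target_matches("fig-virt-intdev-alberto-node-1"))
```

Run inside the agent container, a "match: False" result would suggest the probe is sending its RPC call to a server name that differs from the one the agent registered with, which would be consistent with a timeout even though the agent itself is alive.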
The situation for nova is quite similar:
openstack compute service list
+--------------------------------------+----------------+---------------------------------+----------+---------+-------+----------------------------+
| ID | Binary | Host | Zone | Status | State | Updated At |
+--------------------------------------+----------------+---------------------------------+----------+---------+-------+----------------------------+
| eaf4684c-88a4-47dd-b340-b7dbebf148ba | nova-conductor | nova-conductor-679d57f977-zbq92 | internal | enabled | up | 2025-03-27T13:11:33.000000 |
| 9164d63f-608c-44cc-8a92-2f2bac07c622 | nova-scheduler | nova-scheduler-845f87b9b5-vq9p5 | internal | enabled | up | 2025-03-27T13:11:36.000000 |
| 464cd394-8b26-49ad-8c82-f014ca00ec7e | nova-compute | fig-virt-intdev-alberto-node-1 | nova | enabled | up | 2025-03-27T13:11:33.000000 |
| aa7e5f6d-2533-4953-bb8d-9f9bfb7fc6bd | nova-compute | fig-virt-intdev-alberto-node-0 | nova | enabled | up | 2025-03-27T13:11:33.000000 |
+--------------------------------------+----------------+---------------------------------+----------+---------+-------+----------------------------+
And the nova-compute pods are not ready because the health probes fail:
k -n openstack get pods -l component=compute
NAME READY STATUS RESTARTS AGE
nova-compute-default-t8w7x 1/2 Running 1 (70m ago) 3h31m
nova-compute-default-vz7nf 1/2 Running 1 (70m ago) 3h31m
To confirm that OpenStack RPC is working, I removed the health probes from nova-compute and ran nova-cell-setup, after which the hypervisors were ready.
Several VMs were then launched successfully in this environment (nova-compute), obtaining IP addresses from the DHCP agent and accepting floating IPs (l3-agent), and OpenStack
behaves normally. So I’m confused about why the health probes fail.
Any ideas about what could be wrong or what to check are welcome.
Thanks!
Alberto
Note: This is a Kubernetes cluster running on two VMs (OpenStack). This virtual environment has been redeployed several times with some modifications and the issue is
fully reproducible: the probes for nova-compute, neutron-l3-agent and neutron-dhcp-agent always fail.