Hi,

 

We’re facing an issue when deploying OpenStack 2024.1 with OpenStack-Helm and OVS: the health probes fail for neutron-l3-agent, neutron-dhcp-agent and nova-compute, while the internal RPC communication between the nova and neutron components looks fine.

 

Neutron:

 

openstack network agent list

+--------------------------------------+--------------------+--------------------------------+-------------------+-------+-------+---------------------------+
| ID                                   | Agent Type         | Host                           | Availability Zone | Alive | State | Binary                    |
+--------------------------------------+--------------------+--------------------------------+-------------------+-------+-------+---------------------------+
| 433c2b07-5c82-4545-991d-6d0de6044ae9 | DHCP agent         | fig-virt-intdev-alberto-node-0 | nova              | :-)   | UP    | neutron-dhcp-agent        |
| 56c050fa-e58e-46a6-8191-0d0d481de246 | Open vSwitch agent | fig-virt-intdev-alberto-node-0 | None              | :-)   | UP    | neutron-openvswitch-agent |
| 5ea28388-3fac-4ebb-bce4-6b210f8a3766 | Metadata agent     | fig-virt-intdev-alberto-node-0 | None              | :-)   | UP    | neutron-metadata-agent    |
| 6a4506ca-6fe4-489a-b370-6e426db40d50 | DHCP agent         | fig-virt-intdev-alberto-node-1 | nova              | :-)   | UP    | neutron-dhcp-agent        |
| 772a1d3a-b432-4ac3-8dfe-d3affc6f226d | Metadata agent     | fig-virt-intdev-alberto-node-1 | None              | :-)   | UP    | neutron-metadata-agent    |
| 7bd8360c-468a-4473-a470-abfb1ab435da | Open vSwitch agent | fig-virt-intdev-alberto-node-1 | None              | :-)   | UP    | neutron-openvswitch-agent |
| 9d362a63-5740-44ed-ac77-4548eee59205 | L3 agent           | fig-virt-intdev-alberto-node-0 | nova              | :-)   | UP    | neutron-l3-agent          |
| a2cc9dc4-1600-4c5d-a250-2e43cf062cf3 | L3 agent           | fig-virt-intdev-alberto-node-1 | nova              | :-)   | UP    | neutron-l3-agent          |
+--------------------------------------+--------------------+--------------------------------+-------------------+-------+-------+---------------------------+

 

The neutron-l3 and neutron-dhcp pods are not ready:

 

k -n openstack get pods -l application=neutron |grep 'l3\|dhcp'
neutron-dhcp-agent-default-fw5rm           0/1     Running     0               127m
neutron-dhcp-agent-default-vh7c4           0/1     Running     0               127m
neutron-l3-agent-default-4h5tp             0/1     Running     0               24m
neutron-l3-agent-default-bdngv             0/1     Running     0               24m

 

Because the health probes fail:

 

Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  4m38s (x119 over 127m)  kubelet  Readiness probe failed: Health probe timed out. Agent is down or response timed out

 

When I run the health probe manually inside the container, it also fails with a timeout after 60 seconds (RPC_PROBE_TIMEOUT=60):

 

neutron@fig-virt-intdev-alberto-node-1:/$ python /tmp/health-probe.py --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --agent-queue-name dhcp_agent --use-fqdn

Health probe timed out. Agent is down or response timed out
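
One thing that may be worth ruling out is a host name mismatch. This is only a sketch under an assumption that is not confirmed here: that with --use-fqdn the probe addresses its RPC message to the name returned by socket.getfqdn(), and without the flag to socket.gethostname(). If that name differs from the Host column shown by "openstack network agent list", the message would go to a queue nobody consumes and the probe would time out even though the agent is healthy. The helper names below are hypothetical and the script uses only the Python standard library; it is meant to be run inside the agent container.

```python
import socket
from typing import Optional


def probe_target_host(use_fqdn: bool,
                      hostname: Optional[str] = None,
                      fqdn: Optional[str] = None) -> str:
    """Return the host name a probe would address (assumed behaviour)."""
    # Assumption: with --use-fqdn the probe targets socket.getfqdn(),
    # otherwise socket.gethostname(). The overrides exist only for testing.
    if hostname is None:
        hostname = socket.gethostname()
    if fqdn is None:
        fqdn = socket.getfqdn()
    return fqdn if use_fqdn else hostname


def targets_registered_host(registered_host: str, use_fqdn: bool,
                            hostname: Optional[str] = None,
                            fqdn: Optional[str] = None) -> bool:
    """True if the assumed probe target equals the agent's registered host."""
    return probe_target_host(use_fqdn, hostname, fqdn) == registered_host


if __name__ == "__main__":
    # Host value copied from the "openstack network agent list" output.
    registered = "fig-virt-intdev-alberto-node-1"
    for use_fqdn in (False, True):
        target = probe_target_host(use_fqdn)
        verdict = "match" if target == registered else "MISMATCH"
        print(f"use_fqdn={use_fqdn}: probe would target {target!r} -> {verdict}")
```

If the line for use_fqdn=True reports a mismatch while use_fqdn=False matches, running the probe once without --use-fqdn would be a cheap way to test the hypothesis.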

 

The situation for nova is quite similar:

 

openstack compute service list

+--------------------------------------+----------------+---------------------------------+----------+---------+-------+----------------------------+
| ID                                   | Binary         | Host                            | Zone     | Status  | State | Updated At                 |
+--------------------------------------+----------------+---------------------------------+----------+---------+-------+----------------------------+
| eaf4684c-88a4-47dd-b340-b7dbebf148ba | nova-conductor | nova-conductor-679d57f977-zbq92 | internal | enabled | up    | 2025-03-27T13:11:33.000000 |
| 9164d63f-608c-44cc-8a92-2f2bac07c622 | nova-scheduler | nova-scheduler-845f87b9b5-vq9p5 | internal | enabled | up    | 2025-03-27T13:11:36.000000 |
| 464cd394-8b26-49ad-8c82-f014ca00ec7e | nova-compute   | fig-virt-intdev-alberto-node-1  | nova     | enabled | up    | 2025-03-27T13:11:33.000000 |
| aa7e5f6d-2533-4953-bb8d-9f9bfb7fc6bd | nova-compute   | fig-virt-intdev-alberto-node-0  | nova     | enabled | up    | 2025-03-27T13:11:33.000000 |
+--------------------------------------+----------------+---------------------------------+----------+---------+-------+----------------------------+

 

And the nova-compute pods are not ready because the health probes fail:

 

k -n openstack get pods -l component=compute
NAME                         READY   STATUS    RESTARTS      AGE
nova-compute-default-t8w7x   1/2     Running   1 (70m ago)   3h31m
nova-compute-default-vz7nf   1/2     Running   1 (70m ago)   3h31m

 

To confirm that OpenStack RPC itself is working, I removed the health probes from nova-compute and then ran nova-cell-setup; after that, the hypervisors are ready.

 

Some VMs have been launched successfully in this environment (nova-compute), getting their IPs from the DHCP agent and having floating IPs attached (l3-agent), and OpenStack behaves normally. So I’m confused about why the health probes fail.

 

Any ideas on what could be wrong or what to check are welcome.

 

Thanks!

Alberto

 

Note: This is a Kubernetes cluster running on two VMs (OpenStack). This virtual environment has been redeployed several times with some modifications and the issue is fully reproducible: the probes for nova-compute, neutron-l3-agent and neutron-dhcp-agent always fail.