Hi,

We’re facing an issue when deploying OpenStack 2024.1 with OpenStack Helm and OVS: the health probes fail for neutron-l3-agent, neutron-dhcp-agent and nova-compute, while the internal RPC communication between the nova and neutron components looks fine:

Neutron:

openstack network agent list

+--------------------------------------+--------------------+--------------------------------+-------------------+-------+-------+---------------------------+

+--------------------------------------+--------------------+--------------------------------+-------------------+-------+-------+---------------------------+

+--------------------------------------+--------------------+--------------------------------+-------------------+-------+-------+---------------------------+

The neutron-l3 and neutron-dhcp pods are not ready:

k -n openstack get pods -l application=neutron |grep 'l3\|dhcp'

neutron-dhcp-agent-default-fw5rm 0/1 Running 0 127m

neutron-dhcp-agent-default-vh7c4 0/1 Running 0 127m

neutron-l3-agent-default-4h5tp 0/1 Running 0 24m

neutron-l3-agent-default-bdngv 0/1 Running 0 24m

They are not ready because the readiness probes fail:

Events:

Type Reason Age From Message

---- ------ ---- ---- -------

Warning Unhealthy 4m38s (x119 over 127m) kubelet Readiness probe failed: Health probe timed out. Agent is down or response timed out

When I run the health probe manually inside the container, it fails with a timeout after 60 seconds (RPC_PROBE_TIMEOUT=60):

neutron@fig-virt-intdev-alberto-node-1:/$ python /tmp/health-probe.py --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/dhcp_agent.ini --agent-queue-name dhcp_agent --use-fqdn

Health probe timed out. Agent is down or response timed out
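Since the probe is invoked with --use-fqdn, one thing that may be worth comparing is the hostname the container resolves against the Host column of `openstack network agent list`. The snippet below is only a minimal sketch of that check. It assumes (I have not verified this against health-probe.py) that the probe addresses the agent by the container's FQDN when --use-fqdn is set; `hostname_candidates` is a hypothetical helper name, not part of OpenStack Helm.

```python
import socket


def hostname_candidates():
    """Return the short hostname and the FQDN as seen inside the container.

    Assumption: the probe targets the agent by one of these two names,
    so a mismatch with the agent's registered host could explain a timeout.
    """
    short = socket.gethostname()
    fqdn = socket.getfqdn()
    return {"short": short, "fqdn": fqdn, "match": short == fqdn}


if __name__ == "__main__":
    info = hostname_candidates()
    print("gethostname():", info["short"])
    print("getfqdn():    ", info["fqdn"])
    print("identical:    ", info["match"])
```

If the two values differ, and the agents are registered under the short name, running the probe without --use-fqdn might show whether the hostname is the cause.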

The situation for nova is quite similar:

openstack compute service list

+--------------------------------------+----------------+---------------------------------+----------+---------+-------+----------------------------+

+--------------------------------------+----------------+---------------------------------+----------+---------+-------+----------------------------+

+--------------------------------------+----------------+---------------------------------+----------+---------+-------+----------------------------+

And the nova-compute pods are not ready because the health probes fail:

k -n openstack get pods -l component=compute

NAME READY STATUS RESTARTS AGE

nova-compute-default-t8w7x 1/2 Running 1 (70m ago) 3h31m

nova-compute-default-vz7nf 1/2 Running 1 (70m ago) 3h31m

To confirm that OpenStack RPC itself is working, we removed the health probes from nova-compute and then ran nova-cell-setup; after that, the hypervisors are ready.

Some VMs have been launched successfully in this environment (nova-compute), getting IP addresses from the DHCP agent and floating IPs through the l3-agent, and OpenStack behaves normally. So I’m confused about why the health probes fail.

Any ideas about what could be wrong or what to check are welcome.

Thanks!

Alberto

Note: This is a Kubernetes cluster running on two VMs (OpenStack). This virtual environment has been redeployed several times with some modifications and the issue is fully reproducible: the probes for nova-compute, neutron-l3-agent and neutron-dhcp-agent always fail.