[octavia][nova][neutron] Failed to fail over load balancer
Hello stackers!

Release: Zed
Deploy tool: kolla-ansible

After an outage, all the load balancers in my cluster went to provisioning status ERROR because the o-hm0 interface was unavailable on the controller nodes. I recreated the interfaces and triggered a failover on the load balancers. The octavia-worker logs show that the failover completed successfully, but the provisioning status remains ERROR. Looking into the nova-compute logs, I see that an exception was raised:

os_vif.exception.NetworkInterfaceNotFound: Network interface qvob8ac0f5f-46 not found

The instance ID for that interface matches the new amphora ID. I'll attach the nova-compute logs here [1]. In the neutron-server logs there is no ERROR or WARNING that I can find; the only suspicious entry is:

Port b8ac0f5f-4613-4b95-9690-7cf59f739fd0 cannot update to ACTIVE because it is not bound. _port_provisioned /var/lib/kolla/venv/lib/python3.10/site-packages/neutron/plugins/ml2/plugin.py:360

Can anyone shed some light on this?

[1] https://pastebin.com/hV3xgu0e
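For context, this is roughly what I ran to check the status and retry the failover (the load balancer ID below is a placeholder, and the amphora listing filter may depend on your python-octaviaclient version):

  # Provisioning/operating status of the affected load balancer
  openstack loadbalancer list
  openstack loadbalancer show <lb-id> -c provisioning_status -c operating_status

  # Retry the failover and watch which amphora gets built
  openstack loadbalancer failover <lb-id>
  openstack loadbalancer amphora list --loadbalancer <lb-id>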
Digging into the logs, I found these entries:

2025-03-19 20:43:37.834 26 WARNING neutron.plugins.ml2.drivers.mech_agent [req-f4e98255-be23-4d7a-9d4a-c680b1320bdf req-c4a562cf-817f-4d7d-be1c-907e47c4c940 aa8b08700bdd4acc99a8e4a33180f764 e8866ffd910d4a08bdb347aedd80cdf1 - - default default] Refusing to bind port 6961fd23-9edd-4dc7-99ad-88213965c796 to dead agent: {'id': 'ab9612a8-0a92-4545-a088-9ea0dd1e527b', 'agent_type': 'Open vSwitch agent', 'binary': 'neutron-openvswitch-agent', 'topic': 'N/A', 'host': xx, 'admin_state_up': True, 'created_at': datetime.datetime(2024, 6, 12, 17, 4, 25), 'started_at': datetime.datetime(2025, 3, 15, 21, 1, 4), 'heartbeat_timestamp': datetime.datetime(2025, 3, 19, 20, 41, 53), 'description': None, 'resources_synced': None, 'availability_zone': None, 'alive': False,

ERROR neutron.plugins.ml2.managers [req-f4e98255-be23-4d7a-9d4a-c680b1320bdf req-c4a562cf-817f-4d7d-be1c-907e47c4c940 aa8b08700bdd4acc99a8e4a33180f764 e8866ffd910d4a08bdb347aedd80cdf1 - - default default] Failed to bind port 6961fd23-9edd-4dc7-99ad-88213965c796 on host lsd-srv-115 for vnic_type normal using segments

and that led me here [1]. I think it is a problem with RabbitMQ, which is reporting to Neutron that the OVS agent is not alive even though the container has a "healthy" status.

When I run "openstack network agent list" the output is inconsistent: one run shows some agents as not alive, and the next run shows a different set of agents as alive/not alive. Is a rolling restart of RabbitMQ the way to go? Has anyone faced this problem before?

[1] https://github.com/openstack/neutron/blob/unmaintained/zed/neutron/plugins/m...

Regards.
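For reference, this is how I'm checking it (the container name "rabbitmq" is from my kolla deployment; adjust to yours):

  # Watch whether the "Alive" column flaps between runs;
  # on a healthy cluster, consecutive runs should be identical
  watch -n 5 "openstack network agent list"

  # Check the RabbitMQ cluster itself from inside the container
  docker exec rabbitmq rabbitmqctl cluster_status
  docker exec rabbitmq rabbitmqctl list_queues name messages consumers | head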
I would probably start with restarting the neutron agents. I've come across comparable situations a couple of times, or at least it seems like it. Usually I stop all agents, with neutron-server last; then I start neutron-server first and the other agents afterwards. But if you had a wider outage, restarting RabbitMQ might be necessary as well. Check rabbitmqctl cluster_status and the RabbitMQ logs to determine whether RabbitMQ is okay.
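On a kolla-ansible deployment that would look roughly like the following (container names are from a stock kolla install and depend on which services you have enabled; the agent containers run on the network/compute nodes, neutron_server on the controllers):

  # Stop the agents first, neutron-server last
  docker stop neutron_openvswitch_agent neutron_l3_agent neutron_dhcp_agent neutron_metadata_agent
  docker stop neutron_server

  # Start neutron-server first, then the agents
  docker start neutron_server
  docker start neutron_openvswitch_agent neutron_l3_agent neutron_dhcp_agent neutron_metadata_agent

  # Verify RabbitMQ health on each controller
  docker exec rabbitmq rabbitmqctl cluster_status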
I've noticed that the Nova services are behaving similarly (some are reported as down but weren't). Is the restart procedure the same as for Neutron? Maybe nova-api is the one to be stopped last and started first.

Also, checking rabbitmqctl cluster_status, the cluster looks ok.
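For the record, this is what I'm looking at on the Nova side (plain client commands, nothing deployment-specific; the hostname is a placeholder):

  # "State" column flaps between up and down across runs
  openstack compute service list

  # Narrow it down to a single compute host
  openstack compute service list --service nova-compute --host <compute-host>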
Hi,

I haven't experienced a situation where it was required to restart nova in a specific order, only neutron under heavy load. So I'd say you can restart nova in any order. Regarding RabbitMQ: cluster_status is one thing, but did you inspect the logs as well? Do you see any rabbit issues, or just the neutron/nova agents disconnecting and reconnecting?

Years ago I helped a customer debug issues after an outage. I believe it was a single control node at the time. IIRC, neutron had triggered the OOM killer. We restarted all kinds of services, but nothing really worked. We were debugging for around 10 hours straight, to no avail. We then decided to reboot the control node, and the issues were gone. So we still didn't really know what was going on, but what I've learned from that is that a reboot can be much easier than trying to fix some broken processes. ;-) At the same time, I try to avoid reboots as much as possible if I see a chance of fixing things without one. :-)
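To see whether it is just connection churn, something along these lines on a controller (and a compute node) usually tells the story; the log paths and file names assume a kolla-ansible layout, so adjust them to your environment:

  # AMQP connection churn on the RabbitMQ side
  grep -i "closing AMQP connection" /var/log/kolla/rabbitmq/rabbit@$(hostname -s).log | tail

  # Reconnect attempts on the agent/service side (oslo.messaging)
  grep -i "Reconnected to AMQP server" /var/log/kolla/neutron/neutron-openvswitch-agent.log | tail
  grep -i "Reconnected to AMQP server" /var/log/kolla/nova/nova-compute.log | tail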
Hello, sorry for the late response.

Digging into the database entries for the agent heartbeats, I noticed that the "updated_at" timestamps were inconsistent. Comparing the date/time on the infra nodes and on the computes, I saw that their clocks had drifted apart. The fix was to sync them against an NTP server.

Since nova and neutron were dispatching requests to agents they considered "dead", the requested resources ended up in an error status.
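In case it helps anyone else, this is roughly how I confirmed it (container and database names are from my kolla deployment, and the heartbeat columns are as I found them in the Zed schema; double-check on your version):

  # On each node: is the clock actually synchronized?
  timedatectl | grep -E "Local time|synchronized"
  chronyc tracking | grep "System time"

  # Heartbeats as seen by the control plane
  docker exec -it mariadb mysql -u root -p -e \
    "SELECT host, agent_type, heartbeat_timestamp FROM neutron.agents ORDER BY heartbeat_timestamp;"
  docker exec -it mariadb mysql -u root -p -e \
    "SELECT host, updated_at FROM nova.services WHERE binary='nova-compute';"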
Thanks for the update. I'm wondering why this hit you only now; I'd expect it to show up quite soon after deployment, since proper time sync is quite important. But it's good you figured it out.
Participants (2):
- Eugen Block
- Winicius Allan