[neutron][ops][upgrade] l3-agent failure during upgrade to Zed
Hi *,

maybe I missed some announcement or something, but usually I read the release notes [0] before upgrading our OpenStack cloud, and I didn't notice anything regarding DB schema upgrades. After the upgrade from Yoga to Zed in a test environment went well, I tried the same in our production today. Note that I didn't have a router in my test cloud, so that's probably why I didn't notice anything there.

Unfortunately, there was a schema change, which is why the l3-agent failed to start properly with this error:

2025-03-21 12:29:14.527 846393 CRITICAL neutron [None req-e225ff0a-82e1-473b-9eba-9a11caa7ace7 - - - - - -] Unhandled error: oslo_messaging.rpc.client.RemoteError: Remote error: OperationalError (pymysql.err.OperationalError) (1054, "Unknown column 'portforwardings.external_port' in 'SELECT'")

Indeed, the upgraded control node no longer had "external_port" in /usr/lib/python3/dist-packages/neutron/db/models/port_forwarding.py, while the not yet upgraded control node still did. So the situation could only be resolved by proceeding with the upgrade. But that meant an interruption for our virtual routers, causing floating IPs to be unreachable for a couple of minutes.

Note that we're using highly available routers. I thought about setting "no-ha" for each router, but that can only be done on disabled routers, which is not an option, of course. And it doesn't really fit the "rolling upgrade" concept, which has worked great so far. Since we moved to Ubuntu last September (while still on Victoria), we've been able to upgrade up to Yoga without any issues.

And while today's interruption was not too critical, I was still surprised that such an important change didn't even make it into the Zed release notes. Was that a mistake, or did I miss something? Are there other places I need to check before attempting an upgrade?

Thanks,
Eugen

[0] https://docs.openstack.org/releasenotes/neutron/zed.html
Hello,

We upgrade in a very specific order, as mentioned in [1]: first the database expand, then neutron-server is upgraded on all nodes, then the contract, and only after that any agents.

[1] https://docs.openstack.org/neutron/latest/contributor/internals/upgrade.html

/Tobias
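For reference, a minimal sketch of that ordering, assuming the stock neutron-db-manage commands (how the package and service upgrades are carried out in between depends on the distribution):

  # 1. additive schema changes; intended to be runnable while the old neutron-server code is still active
  neutron-db-manage upgrade --expand

  # 2. upgrade and restart neutron-server on every controller node, one at a time

  # 3. destructive schema changes; only once no neutron-server runs the old code any more
  neutron-db-manage upgrade --contract

  # optional: check whether any pending migration would require a full server shutdown
  neutron-db-manage has_offline_migrations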
Thanks for your quick response, appreciate it!

I've read that page as well, but that's been a while. I guess I didn't pay too much attention since the recent upgrades all went well. Until now, I just ran 'apt upgrade' on the first node (which, of course, upgrades all packages), then ran the expand, and the contract command was issued on the last control node.

So what would be the ideal way? First upgrade only neutron-server and the l2 agents on all control nodes ('apt upgrade --only-upgrade <neutron-server|openvswitch-agent>'), then expand and contract, and then upgrade the rest of the packages?
Hello,

In more detail, this is the procedure we're using; we recently upgraded twice, first from Zed to Antelope, then from Antelope to Caracal.

- Install the new version of Neutron and run the database expand
- Upgrade neutron-server on all "controller" nodes
- Run the database contract
- Upgrade the OVS, L3, Metadata and DHCP agents on the network nodes (on the controller nodes in some setups)
  - First OVS, then wait for it to start correctly
  - Stop DHCP, L3, Metadata (in that order)
  - Upgrade the agents and start them in the same order as above
- Upgrade the OVS agent on the compute nodes

Happy to take feedback if there is room for improvement in the above.

From what I remember, during all these years we've only had issues with upgrades twice: once was a keepalived bug, and the other was when Neutron switched to the primary/backup wording for L3 HA, which I think could also have been because we did a double-jump upgrade, causing us to miss some translation patch somewhere, or similar.

/Tobias
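As one possible translation of those steps into commands on an Ubuntu/Debian style deployment (package and systemd unit names are assumed to be the distro defaults, so this is only a sketch, not a verified procedure):

  ## controller nodes
  neutron-db-manage upgrade --expand                 # once, against the shared database
  apt-get install --only-upgrade neutron-server      # on every controller; dependencies such as
                                                     # neutron-common may be pulled in as well
  systemctl restart neutron-server
  neutron-db-manage upgrade --contract               # once, after all servers run the new code

  ## network nodes
  apt-get install --only-upgrade neutron-openvswitch-agent
  systemctl restart neutron-openvswitch-agent        # wait until the agent reports as alive again
  systemctl stop neutron-dhcp-agent                  # stop DHCP, L3, metadata in that order
  systemctl stop neutron-l3-agent
  systemctl stop neutron-metadata-agent
  apt-get install --only-upgrade neutron-dhcp-agent neutron-l3-agent neutron-metadata-agent
  systemctl start neutron-dhcp-agent                 # start them again in the same order as above
  systemctl start neutron-l3-agent
  systemctl start neutron-metadata-agent

  ## compute nodes
  apt-get install --only-upgrade neutron-openvswitch-agent
  systemctl restart neutron-openvswitch-agent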
Hi,

thanks for sharing! I'll have to adapt my upgrade procedure and test it properly. This could take a while, though.
It didn't take that long to evaluate. Unfortunately, this approach doesn't work for me. I tried upgrading only the neutron-server package, but there are dependencies on the other neutron agents, so they get upgraded as well. I could reduce the downtime of the L3 agent, though. Since this isn't a recurring issue (upgrades in general, but also db schema changes), we'll stick with our current upgrade procedure.

But I'm still voting for adding db schema changes to the release notes.

Thanks again,
Eugen
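For what it's worth, the selective upgrade attempted here would look roughly like this (package names assumed; the simulate run makes the dependency problem visible up front):

  # dry run: show which dependent packages apt would upgrade alongside neutron-server
  apt-get -s install --only-upgrade neutron-server

  # actual upgrade; on Ubuntu the agent packages share dependencies (e.g. neutron-common),
  # which is why they can end up being upgraded together with neutron-server
  apt-get install --only-upgrade neutron-server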
Hello:

The DB schema change is handled in the Neutron DB object [1]. The agents, via RPC, do not receive the raw DB object but a JSON blob derived from the Neutron DB object. If the target (the agent) expects a lower version, the JSON blob is converted to that version. This is why it is not considered necessary to announce DB schema changes between versions.

In order to properly debug this issue we would need a traceback from the Neutron API and from the L3 agent. A reproducer would also be useful, including the current environment conditions. What L3 agent call is causing this issue?

Regards.

[1] https://review.opendev.org/c/openstack/neutron/+/798961/35/neutron/objects/p...
Some info about the environment:

- 2 control nodes, most services managed by pacemaker
- recently upgraded to Ubuntu 22.04

I started the upgrade procedure yesterday (2025-03-24 12:55:55) by stopping all openstack services on the first node (controller02), expanding the keystone db and the neutron db, and running db sync for all other services; neutron at exactly:

2025-03-24 13:07:31 neutron-db-manage upgrade --expand

At 13:10:39 the services were restarted again.

The other control node (controller01) logged the first error in neutron-server.log at:

2025-03-24 13:08:10.356 5804 ERROR neutron.db.agentschedulers_db [req-b5a1c4f0-a28a-4cea-96f7-e27df915bd4c - - - - -] Unexpected exception occurred while removing network 0682ba75-b750-4318-a75e-c92c347c923b from agent 20709aa0-2c55-4a18-8f7f-2c65c1bc1297: sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (1054, "Unknown column 'portforwardings.external_port' in 'SELECT'")

The entire stack trace is a bit lengthy, I'll paste it here:
https://paste.openstack.org/show/bJAP7rgEXoFH76wC6zSj/

The l3-agent on controller02 was started and failing at:

2025-03-24 13:11:28.942 1238229 INFO neutron.agent.dhcp.agent [None req-21447e37-7f29-455e-a477-7cc85fb570a5 - - - - - -] Agent has just been revived. Scheduling full sync ...
2025-03-24 13:11:31.367 1238229 ERROR neutron.agent.dhcp.agent [None req-b438d86f-a928-43f8-a8db-d256dcb8179b - - - - - -] Unable to disable dhcp for 0682ba75-b750-4318-a75e-c92c347c923b.: oslo_messaging.rpc.client.RemoteError: Remote error: OperationalError (pymysql.err.OperationalError) (1054, "Unknown column 'portforwardings.external_port' in 'SELECT'")

stack trace at: https://paste.openstack.org/show/baInRC9qfafXEpaTqBCW/

Then I upgraded the neutron packages on the second node a few minutes later, contracting the db at:

2025-03-24 13:19:55 neutron-db-manage upgrade --contract

I seem to be able to reproduce this easily, I just need to roll back my VMs to a previous snapshot and run the upgrade procedure again. If you need more information, please let me know.

Thanks!
Eugen
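A quick way to confirm the schema state each side was seeing at that point could be something like the following (assuming the default 'neutron' database name and direct MySQL access; just a sketch):

  # which alembic revisions (expand/contract branches) the database is currently at
  neutron-db-manage current

  # whether the column the not-yet-upgraded code still selects exists
  mysql neutron -e "SHOW COLUMNS FROM portforwardings LIKE 'external_port';"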
Hello Eugen:

Ok, I think I found the problem: the DB migration includes a field drop. If that is done with the Neutron API running, it will fail when the DB resource is queried by older API code. In Neutron we stopped the contract migrations, but at the same time allowed (**at some risk**) expand migrations with exceptions (L130). I'll open a bug for this, not for this error in particular (it is already in a maintenance branch) but for future DB migrations.

Regards.

[1] https://review.opendev.org/c/openstack/neutron/+/798961/35/neutron/db/migrat...
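Until something like that is addressed, one possible pre-flight check before a rolling upgrade could be to look for destructive operations in the new release's expand migrations (install path assumed for Ubuntu packages; expand migrations are supposed to be purely additive, so any hit deserves a closer look before upgrading):

  grep -rnE "drop_column|drop_table" \
    /usr/lib/python3/dist-packages/neutron/db/migration/alembic_migrations/versions/*/expand/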
Sure, I didn't expect a fix for Zed. Thanks for looking into it, and it's amazing that you have already figured out the possible root cause! :-)

Thanks!
participants (3)
- Eugen Block
- Rodolfo Alonso Hernandez
- Tobias Urdin - Binero IT