Hello Eugen:

Ok, I think I found the problem: the DB migration includes a field drop. If that is done while the Neutron API is running, any older API code that still queries that DB resource will fail. In Neutron we stopped doing contract migrations but at the same time allowed (**at some risk**) expand migrations with exceptions [1]. I'll open a bug for this, not for this particular error (it is already in a maintenance branch) but for future DB migrations.
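
For illustration, this is the kind of expand-phase migration that bites
here; a minimal sketch with placeholder revision identifiers, not the
actual migration from [1]:

    # Illustrative expand migration that drops a column; the table and
    # column names are taken from the error in this thread, the rest is
    # a sketch.
    from alembic import op

    revision = 'placeholder_new_rev'
    down_revision = 'placeholder_old_rev'


    def upgrade():
        # While this runs, older neutron-server code is still SELECTing
        # portforwardings.external_port, so every such query afterwards
        # fails with pymysql error 1054 ("Unknown column ... in 'SELECT'").
        op.drop_column('portforwardings', 'external_port')

In the expand/contract model, destructive changes like a column drop are
normally deferred to the contract phase.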

Regards.

[1] https://review.opendev.org/c/openstack/neutron/+/798961/35/neutron/db/migration/alembic_migrations/versions/zed/expand/I43e0b669096_port_forwarding_port_ranges.py

On Tue, Mar 25, 2025 at 4:04 PM Eugen Block <eblock@nde.ag> wrote:
Some info about the environment:

- 2 control nodes, most services managed by pacemaker
- recently upgraded to Ubuntu 22.04

I started the upgrade procedure yesterday (2025-03-24 12:55:55) by 
stopping all OpenStack services on the first node (controller02), 
then expanding the keystone and neutron DBs and running db sync for 
all other services. The neutron expand ran at exactly:

2025-03-24 13:07:31 neutron-db-manage upgrade --expand

At 13:10:39 the services were restarted again.

The other control node (controller01) logged the first error in 
neutron-server.log at:

2025-03-24 13:08:10.356 5804 ERROR neutron.db.agentschedulers_db 
[req-b5a1c4f0-a28a-4cea-96f7-e27df915bd4c - - - - -] Unexpected 
exception occurred while removing network 
0682ba75-b750-4318-a75e-c92c347c923b from agent 
20709aa0-2c55-4a18-8f7f-2c65c1bc1297: sqlalchemy.exc.OperationalError: 
(pymysql.err.OperationalError) (1054, "Unknown column 
'portforwardings.external_port' in 'SELECT'")

The entire stack trace is a bit lengthy, so I pasted it here:

https://paste.openstack.org/show/bJAP7rgEXoFH76wC6zSj/

The l3-agent on controller02 was started and began failing at:

2025-03-24 13:11:28.942 1238229 INFO neutron.agent.dhcp.agent [None 
req-21447e37-7f29-455e-a477-7cc85fb570a5 - - - - - -] Agent has just 
been revived. Scheduling full sync
...
2025-03-24 13:11:31.367 1238229 ERROR neutron.agent.dhcp.agent [None 
req-b438d86f-a928-43f8-a8db-d256dcb8179b - - - - - -] Unable to 
disable dhcp for 0682ba75-b750-4318-a75e-c92c347c923b.: 
oslo_messaging.rpc.client.RemoteError: Remote error: OperationalError 
(pymysql.err.OperationalError) (1054, "Unknown column 
'portforwardings.external_port' in 'SELECT'")

stack trace at: https://paste.openstack.org/show/baInRC9qfafXEpaTqBCW/

Then I upgraded the neutron packages on the second node a few minutes 
later, contracting the db at:

2025-03-24 13:19:55 neutron-db-manage upgrade --contract

I seem to be able to reproduce it easily; I just need to roll back my 
VMs to a previous snapshot and run the upgrade procedure again. If you 
need more information, please let me know.

Thanks!
Eugen

Zitat von Rodolfo Alonso Hernandez <ralonsoh@redhat.com>:

> Hello:
>
> The DB schema change is handled in the Neutron DB object [1]. The
> agents, via RPC, do not receive the raw DB object but a JSON blob derived
> from the Neutron DB object. If the target (the agent) expects a lower
> version, the JSON blob is downgraded accordingly. This is why it is not
> necessary to announce DB schema changes between versions.
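
For reference, the downgrade described above is done through the
oslo.versionedobjects obj_make_compatible() hook; a schematic sketch
with illustrative field names and version numbers, not the actual
object code referenced in [1] below:

    # Schematic versioned-object downgrade: a 1.0 consumer gets the old
    # field back even though the 1.1 object no longer carries it.
    import uuid

    from oslo_utils import versionutils
    from oslo_versionedobjects import base as ovo_base
    from oslo_versionedobjects import fields as ovo_fields


    @ovo_base.VersionedObjectRegistry.register
    class PortForwardingSketch(ovo_base.VersionedObject):
        # 1.0: integer 'external_port'; 1.1: string 'external_port_range'
        VERSION = '1.1'

        fields = {
            'id': ovo_fields.UUIDField(),
            'external_port_range': ovo_fields.StringField(nullable=True),
        }

        def obj_make_compatible(self, primitive, target_version):
            super().obj_make_compatible(primitive, target_version)
            target = versionutils.convert_version_to_tuple(target_version)
            if target < (1, 1) and 'external_port_range' in primitive:
                # A 1.0 consumer only knows 'external_port'; assume the
                # range string looks like "8080:8090" and return its
                # first port.
                port_range = primitive.pop('external_port_range')
                primitive['external_port'] = int(port_range.split(':')[0])


    # What an agent that still registers object version 1.0 would receive:
    pf = PortForwardingSketch(id=str(uuid.uuid4()),
                              external_port_range='8080:8080')
    old_blob = pf.obj_to_primitive(target_version='1.0')

An agent that still reports the older object version therefore never
sees the new field and keeps working across the schema change.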
>
> In order to properly debug this issue we would need a traceback from the
> Neutron API and from the L3 agent. A reproducer could also be useful,
> including the current environment conditions. What L3 agent call is
> causing this issue?
>
> Regards.
>
> [1]
> https://review.opendev.org/c/openstack/neutron/+/798961/35/neutron/objects/port_forwarding.py#144
>
> On Tue, Mar 25, 2025 at 1:24 PM Eugen Block <eblock@nde.ag> wrote:
>
>> It didn't take that long to evaluate. Unfortunately, this approach
>> doesn't work for me. I tried upgrading only the neutron-server
>> package, but due to package dependencies the other neutron agents are
>> upgraded as well. I could reduce the downtime of the L3-agent, though.
>> Since this isn't a recurring issue (upgrades in general are infrequent,
>> and so are db schema changes), we'll stick with our current upgrade
>> procedure.
>>
>> But I'm still voting for adding db schema changes to the release notes.
>>
>> Thanks again,
>> Eugen
>>
>> Zitat von Eugen Block <eblock@nde.ag>:
>>
>> > Hi,
>> >
>> > thanks for sharing!
>> > I'll have to adapt my upgrade procedure and test it properly. This
>> > could take a while, though.
>> >
>> > Zitat von Tobias Urdin - Binero IT <tobias.urdin@binero.com>:
>> >
>> >> Hello,
>> >>
>> >> In more detail, this is the procedure we’re using; we recently
>> >> upgraded twice, first from Zed to Antelope and then from Antelope
>> >> to Caracal.
>> >>
>> >> - Install new version of Neutron and run database expand
>> >>
>> >> - Upgrade neutron-server on all “controller” nodes
>> >>
>> >> - Run database contract
>> >>
>> >> - Upgrade OVS, L3, Metadata, DHCP agents on network nodes (on
>> >> controller nodes in some people's setups)
>> >>
>> >>  - First OVS and then wait for it to start correctly
>> >>
>> >>  - Stop DHCP, L3, Metadata (in that order)
>> >>
>> >>  - Upgrade agents and start in same order as above
>> >>
>> >> - Upgrade OVS agent on compute nodes
>> >>
>> >> Happy to take feedback if there is improvement possible on the above
>> >>
>> >> From what I remember, during all these years we’ve only had issues
>> >> with upgrades twice: once was a keepalived bug, and the other was
>> >> when Neutron switched to the primary/backup wording for L3 HA, which
>> >> I think could also have been because we did a double-jump upgrade and
>> >> missed some translation patch somewhere or similar.
>> >>
>> >> /Tobias
>> >>
>> >>> On 21 Mar 2025, at 15:44, Eugen Block <eblock@nde.ag> wrote:
>> >>>
>> >>> Thanks for your quick response, appreciate it!
>> >>> I've read that page as well, but that's been a while. I guess I
>> >>> didn't pay too much attention since the recent upgrades all went
>> >>> well. Until now, I just ran 'apt upgrade' on the first node, which
>> >>> of course upgrades all packages, then did an expand, and the
>> >>> contract command was issued on the last control node.
>> >>>
>> >>> So what would be the ideal way? First upgrade only neutron-server
>> >>> and l2 agents on all control nodes ('apt upgrade --only-upgrade
>> >>> <neutron-server|openvswitch-agent>'), then expand and contract,
>> >>> and then upgrade the rest of the packages?
>> >>>
>> >>>
>> >>> Zitat von Tobias Urdin - Binero IT <tobias.urdin@binero.com>:
>> >>>
>> >>>> Hello,
>> >>>>
>> >>>> We upgrade in a very specific order as mentioned in [1]: first the
>> >>>> database expand, then all neutron-server applications are upgraded,
>> >>>> then the contract, before any agents.
>> >>>>
>> >>>> [1]
>> >>>> https://docs.openstack.org/neutron/latest/contributor/internals/upgrade.html
>> >>>>
>> >>>> /Tobias
>> >>>>
>> >>>>> On 21 Mar 2025, at 15:12, Eugen Block <eblock@nde.ag> wrote:
>> >>>>>
>> >>>>> Hi *,
>> >>>>>
>> >>>>> maybe I missed some announcement or something, but usually I read
>> >>>>> the release notes [0] before upgrading our OpenStack cloud. I
>> >>>>> didn't notice anything regarding DB schema upgrades. And after the
>> >>>>> upgrade from Yoga to Zed went well in a test environment, I tried
>> >>>>> the same in our production cloud today. Note that I didn't have
>> >>>>> a router in my test cloud, so that's probably why I didn't
>> >>>>> notice anything.
>> >>>>>
>> >>>>> Unfortunately, there has been a schema change, which is why the
>> >>>>> l3-agent failed to start properly with this error:
>> >>>>>
>> >>>>> 2025-03-21 12:29:14.527 846393 CRITICAL neutron [None
>> >>>>> req-e225ff0a-82e1-473b-9eba-9a11caa7ace7 - - - - - -] Unhandled
>> >>>>> error: oslo_messaging.rpc.client.RemoteError: Remote error:
>> >>>>> OperationalError (pymysql.err.OperationalError) (1054, "Unknown
>> >>>>> column 'portforwardings.external_port' in 'SELECT'")
>> >>>>>
>> >>>>> Indeed, the upgraded control node didn't have "external_port"
>> >>>>> anymore in
>> >>>>>
>> >>>>> /usr/lib/python3/dist-packages/neutron/db/models/port_forwarding.py,
>> >>>>> while the not yet upgraded control node did. So the situation
>> >>>>> could only be resolved by proceeding with the upgrade. But that
>> >>>>> meant an interruption for our virtual routers, causing floating
>> >>>>> IPs to be unreachable for a couple of minutes.
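
Put differently, the failure mode looks roughly like this; a standalone
sketch with an illustrative model definition, not the real Neutron code:

    # The not-yet-upgraded node still maps a column that the expand
    # migration has already dropped from the database.
    import sqlalchemy as sa
    from sqlalchemy import orm

    Base = orm.declarative_base()


    class PortForwarding(Base):
        __tablename__ = 'portforwardings'
        id = sa.Column(sa.String(36), primary_key=True)
        external_port = sa.Column(sa.Integer)  # already dropped from the DB


    # Placeholder DSN; point it at a DB where the column has been dropped.
    engine = sa.create_engine('mysql+pymysql://user:password@controller/neutron')
    with orm.Session(engine) as session:
        # The generated SELECT still lists portforwardings.external_port,
        # so MySQL returns error 1054 and SQLAlchemy raises
        # sqlalchemy.exc.OperationalError, the error in the tracebacks above.
        session.query(PortForwarding).all()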
>> >>>>>
>> >>>>> Note that we're using highly-available routers. I thought about
>> >>>>> setting "no-ha" for each router, but that can only be done for
>> >>>>> disabled routers, which is not an option, of course. And it
>> >>>>> doesn't really fit into the "rolling upgrade" concept, which has
>> >>>>> worked great so far. Since we moved to Ubuntu last September
>> >>>>> (while still on Victoria), we've been able to upgrade to Yoga
>> >>>>> without any issues.
>> >>>>>
>> >>>>> And while the interruption today was not too critical, I was
>> >>>>> still surprised that such an important change didn't even make
>> >>>>> it into the Zed release notes. Was that a mistake or did I miss
>> >>>>> something? Are there other places I need to check before
>> >>>>> attempting an upgrade?
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Eugen
>> >>>>>
>> >>>>> [0] https://docs.openstack.org/releasenotes/neutron/zed.html
>> >>>>>