[openstack-dev] [nova][heat] sqlalchemy-migrate tool to alembic
Mike Bayer
mbayer at redhat.com
Thu May 21 06:57:42 UTC 2015
On 5/14/15 11:40 AM, Mike Bayer wrote:
> The "online schema changes" patch has been abandoned. I regret that
> I was not able to review the full nature of this spec in time to note
> some concerns I have, namely that Alembic does not plan on ever
> achieving 100% "automation" of migration generation; such a thing is
> not possible, and would in any case require a vast amount of development
> resources to constantly keep up with the ever-changing features and
> behaviors of all target databases. The online migration spec, AFAICT,
> does not offer any place for manual migrations to be added, and I
> think this will be a major problem. I have always stated that the
> decisions made by "autogenerate" should be manually reviewed and
> corrected, so I'd be very nervous about a system that uses
> autogenerate on the fly and sends those changes directly to a
> production database without any review.
So it looks like Nova has decided at the summit to forge ahead with
online schema migrations. Travel issues prevented me from being
present at the summit, and therefore from the session where this was
discussed. But had I been there, a short 40-minute session wouldn't
have been a venue in which I could have organized my thoughts well
enough to be any more effective in discussing this feature, so it's
probably better that I wasn't there. As I've mentioned before, the
timing of the blueprint for this feature was just not well synchronized
for me; it was proposed during my first month working for Red Hat and
OpenStack, when I hardly knew what anything was, and as you can see
from my comments at https://review.openstack.org/#/c/102545/9, I was
primarily alarmed at the notion that this system was going to be built
entirely on SQLAlchemy-Migrate internals, a project that one of my
primary tasks at my new job was to replace with Alembic. I hardly
understood what the actual proposal was, as I was still learning how
to install OpenStack at that point, so I really missed the chance to
dig deeply into it. The spec went quiet after a few weeks and I mostly
forgot about it, until November, when it suddenly awoke, burned ahead
through Christmas, and was approved on Jan 6. Again, terrible timing
for me, as my wife gave birth to our son in late October and I was
pretty much 24/7 dealing with a newborn, not to mention getting through
the holidays. So I missed the boat on the blueprint entirely.
For now, I have to assume that Nova will go ahead with this. But let me
at least take some effort to explain more fully what I think the problem
with this approach is. I don't think this problem will necessarily be
that big a deal for Nova, at least most of the time; but when it is a
problem, it might be pretty bad. My concern is that the system has no
way to provide for manual migration steps, nor any control over how
schema migrations proceed; and critically, that it makes no provision
for the very common case of schema migrations that also need data to
be moved and manipulated. The blueprint makes no mention whatsoever of
how data migrations will be handled, not even within sections such as
"Developer impact" or "Testing". Right now, data migrations are just
part of the sqlalchemy-migrate or Alembic scripts. But with this
change we no longer write such scripts, nor do we even have a place to
put them if we wanted to; data migrations are no longer integrated with
the system and have to be dealt with externally.
It may be the case that Nova has a schema that is no longer in need of
major changes, and we are only talking about adding new columns and new
tables to support new features and removing some old cruft, such that
moving data around is just not going to be needed. But once you build
a system that makes data migrations second-class or even non-citizens,
you close the door on how much you can do with your schema. Big changes
down the road are basically no longer possible without the ability to
also migrate data as the DDL is emitted.
So OK, of course we can still do data migrations. The spec doesn't
need to say anything; it should be obvious that they need to be
performed during the "migrate" phase, in between "expand" and
"contract", when you have both the new tables/columns available as a
destination for the data and the old tables/columns still present as
the source. As far as what form they take, we no longer have migration
scripts or versions within a major release, so we have to assume they
will be just a big series of scripts somewhere, tagged to a major
release like "Kilo" or "Liberty": a bunch of database code that runs
while we're in "migrate".
I have no doubt that's what Nova will do, and usually it will be fine.
But to illustrate, here's the kind of situation where this goes wrong
in a way that is at best pretty annoying and at worst a serious,
error-prone development and testing headache. Let's start with a
hypothetical schema that has some design issues: two tables, "widget"
and "widget_status", where "widget_status" holds some kind of
information about a "widget" and also stores a timestamp, unfortunately
as a string:
CREATE TABLE widget (
id INTEGER PRIMARY KEY,
name VARCHAR(30) NOT NULL
)
CREATE TABLE widget_status (
widget_id INTEGER PRIMARY KEY REFERENCES widget(id),
status_flag INTEGER NOT NULL,
modified_date VARCHAR(30)
)
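Since the online approach works from the models rather than from
migration scripts, it's worth keeping in mind what "the model" is here.
A rough sketch of a declarative SQLAlchemy mapping corresponding to
this schema might look like the following; this is purely illustrative,
not Nova code:

import sqlalchemy as sa
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Widget(Base):
    __tablename__ = 'widget'

    id = sa.Column(sa.Integer, primary_key=True)
    name = sa.Column(sa.String(30), nullable=False)


class WidgetStatus(Base):
    __tablename__ = 'widget_status'

    widget_id = sa.Column(sa.Integer, sa.ForeignKey('widget.id'),
                          primary_key=True)
    status_flag = sa.Column(sa.Integer, nullable=False)

    # the timestamp, unfortunately stored as a string
    modified_date = sa.Column(sa.String(30))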
Let's say that two entirely different changes, by two different
developers, want to accomplish two things: 1. convert "modified_date"
into a new column "modified_timestamp", which is a DATETIME type rather
than a string, and 2. merge these two tables into one, as the need to
JOIN all the time is non-performant and unnecessary. That is, we'll
end up with this:
CREATE TABLE widget (
id INTEGER PRIMARY KEY,
name VARCHAR(30) NOT NULL,
status_flag INTEGER NOT NULL,
modified_timestamp DATETIME
)
Right off, let's keep in mind that when the online schema migration
system runs, the fact that there's a #1 and a #2 migration to the
schema is lost. Even though migration #1 would add "modified_timestamp"
to "widget_status", when we run the sum of both #1 and #2 that interim
state of the schema never exists; no changes are made to widget_status
except for the final DROP, because those changes aren't visible by
looking only at the two endpoints of the schema, which, unless I'm
totally misunderstanding, is how online schema changes work.
Developer #1 changes the model such that there's a new column on
"widget_status" called "modified_timestamp", which stores the value as
a datetime; this new column replaces the modified_date column, and
because we need to do a data migration, both columns need to exist
simultaneously while the string-based dates are UPDATEd into the
timestamp column. The developer writes a data migration script that
will transfer this data while the table has both columns. If we look at
the SQL mapped to online schema migration steps, they are:
"expand": ALTER TABLE widget_status ADD COLUMN modified_timestamp DATETIME;
"migrate": UPDATE widget_status SET
modified_timestamp=convert_string_to_datetime(modified_date)
"contract": ALTER TABLE widget_status DROP COLUMN modified_date
The UPDATE statement above is coded into some script somewhere,
"liberty/01_migrate_widget_status_timestamp.py". The developer commits
all this and everything works great.
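For concreteness, a minimal sketch of what that script might look like
is below; the idea that it's simply handed a SQLAlchemy connection
during "migrate", and the particular date format, are just assumptions
for illustration:

# liberty/01_migrate_widget_status_timestamp.py  (hypothetical)
# runs during the "migrate" phase, after "expand" has added the
# modified_timestamp column and before "contract" drops modified_date

from datetime import datetime

import sqlalchemy as sa


def migrate(connection):
    rows = connection.execute(sa.text(
        "SELECT widget_id, modified_date FROM widget_status")).fetchall()
    for widget_id, modified_date in rows:
        # stand-in for whatever convert_string_to_datetime() really does
        ts = datetime.strptime(modified_date, "%Y-%m-%d %H:%M:%S")
        connection.execute(
            sa.text("UPDATE widget_status SET modified_timestamp = :ts "
                    "WHERE widget_id = :id"),
            {"ts": ts, "id": widget_id})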
Developer #2 comes along two months later. She looks at the model and
sees no mention of any column called "modified_date"; indeed, this
column name is not in the source code of the application anywhere at
all, except in that liberty/01_...py script which she isn't looking
at. She makes her changes to the model, moving all the columns of
widget_status into widget and removing the widget_status model. She
also writes a data migration script to copy all the data. If we again
look at the SQL mapped to online schema migration steps, they are:
"expand":
ALTER TABLE widget ADD COLUMN status_flag INTEGER NULL
ALTER TABLE widget ADD COLUMN modified_timestamp DATETIME
"migrate":
UPDATE widget SET status_flag=(SELECT status_flag FROM widget_status
WHERE widget_id=widget.id)
UPDATE widget SET modified_timestamp=(SELECT modified_timestamp FROM
widget_status WHERE widget_id=widget.id)
"contract":
DROP TABLE widget_status
Let's say above the migrate step is in another script,
"liberty/02_migrate_widget_status_to_widget.py".
Some readers may see where this is going. Before we go there, let me
also note that these schema changes certainly *do* require that the
developer write a "migration". Online schema changes can spare us that
in very simplistic cases, but even for a basic series of operations
like the above, we need a "migration". IMO the mere need for "writing
a migration script" isn't part of the "problem description", as the
blueprint states; but even if it is, the blueprint only provides a
solution for this issue in the most trivial cases.
Also notice there's really no way to deal with the fact that we'd
really like "widget.status_flag" to be NOT NULL; under a traditional
script model, we'd add the column as nullable, populate it, then alter
it to be NOT NULL; online schema migrations remove any place for this
to happen, unless we consider it to belong to the "contract" phase.
Looking at the current code, I see nothing that attempts to deal with
this, and the blueprint makes no mention of this thorny issue.
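For comparison, here's a rough sketch of how those three steps would
look in a traditional Alembic script, all within a single upgrade; the
table and column names are just those from the example above:

import sqlalchemy as sa
from alembic import op


def upgrade():
    # add the column as nullable so existing rows don't violate it
    op.add_column('widget',
                  sa.Column('status_flag', sa.Integer(), nullable=True))

    # populate it while the old table is still present
    op.execute(
        "UPDATE widget SET status_flag = "
        "(SELECT status_flag FROM widget_status "
        "WHERE widget_status.widget_id = widget.id)")

    # now that every row has a value, tighten the constraint
    op.alter_column('widget', 'status_flag',
                    existing_type=sa.Integer(), nullable=False)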
So with our two model changes and our two data migration scripts,
let's see what online schema changes do with them. All the "expand"
steps are lumped into a single net migration, and necessarily all occur
automatically, with no ability to intervene or change how they run.
All the "contract" steps, same thing. Which means that the addition of
the column "modified_timestamp" to "widget_status" never happens,
because in the comparison of pre-expand to post-contract, the
"widget_status" table is simply gone. And because we've dropped this
table, developer #2 now has to become aware of the
01_migrate_widget_status_timestamp.py migration and change that script
as well. Without changing it, this is what runs:
"expand":
ALTER TABLE widget ADD COLUMN status_flag INTEGER NULL;
ALTER TABLE widget ADD COLUMN modified_timestamp DATETIME;
"migrate":
01 -> UPDATE widget_status SET
modified_timestamp=convert_string_to_datetime(modified_date) # -->
fails in all cases, no "modified_timestamp" column is present.
02 ->
UPDATE widget SET status_flag=(SELECT status_flag FROM widget_status
WHERE widget_id=widget.id)
UPDATE widget SET modified_timestamp=(SELECT modified_timestamp FROM
widget_status WHERE widget_id=widget.id) # --> fails on a pre-migrated
database, no "modified_timestamp" column is present on widget_status
"contract":
DROP TABLE widget_status
I'm not sure by what mechanism the above failures would be discovered
in development. But let's assume that they certainly are, as is the
case now: in CI we do a "Kilo"->"Liberty" run with data present, and
the failure of these scripts is discovered.
The developer of 02_migrate_widget_status_to_widget.py basically has to
remove the 01_migrate_widget_status_timestamp.py script entirely and
merge the work that it does into her migration. That is, instead of
having two independent and isolated data migrations:
01 -> UPDATE widget_status SET
modified_timestamp=convert_string_to_datetime(modified_date)
02 ->
UPDATE widget SET status_flag=(SELECT status_flag FROM widget_status
WHERE widget_id=widget.id)
UPDATE widget SET modified_timestamp=(SELECT modified_timestamp FROM
widget_status WHERE widget_id=widget.id)
we have to munge the two migrations together, because the state of the
database that 01 was coded against will no longer ever exist:
01 -> gone
02 -> moved to 01, and changed to:
01 ->
UPDATE widget SET status_flag=(SELECT status_flag FROM widget_status
WHERE widget_id=widget.id)
UPDATE widget SET modified_timestamp=(SELECT
convert_string_to_datetime(modified_date) FROM widget_status WHERE
widget_id=widget.id)
The developer of 02 would not likely have a clear idea that this is how
the migration has to be built, unless she carefully reads all the
migration scripts preceding hers that refer to the same tables or
columns, or until she sees a full run from "Kilo" -> "Liberty" fail;
this strongly suggests that things like looking at older versions of
the code and reading through history will be part of a typical strategy
for figuring out the correct steps. Using traditional migration steps,
none of this complexity is needed; data migrations can be coded
independently of each other, against an explicit and fixed database
state that will always exist while that data migration runs.
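To make that contrast concrete, developer #1's change as a single
traditional Alembic migration would look roughly like this; the
conversion expression is again just a stand-in for whatever
convert_string_to_datetime() really is:

import sqlalchemy as sa
from alembic import op


def upgrade():
    op.add_column('widget_status',
                  sa.Column('modified_timestamp', sa.DateTime(),
                            nullable=True))

    # both columns exist at this exact point in the version history, so
    # this data step is coded against a state that is guaranteed to be
    # present whenever this script runs
    op.execute(
        "UPDATE widget_status SET modified_timestamp = "
        "convert_string_to_datetime(modified_date)")

    op.drop_column('widget_status', 'modified_date')

When developer #2 later merges widget_status into widget, this script
keeps working, because the schema state it was coded against still
exists at its position in the version history.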
Basically, if we start with a series of traditional schema migrations
and associated data migrations, we could illustrate that as follows:
As -> Ad -> Bs -> Bd -> Cs -> Cd -> Ds -> Dd
Where "Xs" is a schema migration and "Xd" is a data migration. Online
schema changes basically remove all the "s" in between, leaving us with:
Ae -> De -> Ad -> Bd -> Cd -> Dd -> Dc -> Ac
Where "Xe" is an expand and "Xc" is a contract. Without a traditional
versioning model, there is no more schema state for B or C; these states
no longer exist, even though the data migrations, when they were
written, were coded against these states. We've written data
migrations against a schema that ceases to ever exist once a new
migration is added; as shown below, our script that was written against
the state "Ds" will also no longer have a schema at that state once we
add "E":
Ae -> Ee -> Ad -> Bd -> Cd -> Dd -> Ed -> Ec -> Ac # "De / Dc" has
disappeared
Basically, online schema migrations mean that known database states are
constantly being wiped out, and data migration scripts which were
written against these states are constantly being broken. As the
number of changes grows, so does the number of scripts that lose the
discrete state they were coded against and that require manual
re-coding.
The process of constantly folding data migrations together, written as
discrete sequential steps but interpreted at runtime against an
entirely different database schema than the one available during
development, is, for a modest series of changes, somewhat tedious and
error prone; for a series of changes that represents a more serious
schema refactoring, it would quickly become unmanageable. The net
result is that as the things we want to do to our schema leave the
realm of the extremely simple and trivial, online schema migrations
quickly end up *creating* more work than they remove, and this could
lead to a hesitancy to take on more comprehensive schema migrations,
thus making technical debt that much more difficult to clean up.
This is the very long form version of what I've been hypothesizing. If
it is the case that Nova is in a place where more comprehensive, or
even modest, schema refactorings are assumed to never be needed, and
changes are always going to be small, simple, and involve little to no
data migration, then we're fine. But without any space to write data
migrations against known states of the schema, since those known states
keep disappearing, we give up the ability to make incremental changes to
a database where significant data migrations are needed. I think
that's a very high price to pay.