[openstack-dev] [nova][heat] sqlalchemy-migrate tool to alembic

Mike Bayer mbayer at redhat.com
Thu May 21 06:57:42 UTC 2015



On 5/14/15 11:40 AM, Mike Bayer wrote:
> The "online schema changes" patch has been abandoned.    I regret that 
> I was not able to review the full nature of this spec in time to note 
> some concerns I have, namely that Alembic does not plan on ever 
> achieving 100% "automation" of migration generation; such a thing is 
> not possible and would require a vast amount of development resources 
> in any case to constantly keep up with the ever changing features and 
> behaviors of all target databases. The online migration spec, AFAICT, 
> does not offer any place for manual migrations to be added, and I 
> think this will be a major problem. The decisions made by 
> "autogenerate" I have always stated should always be manually reviewed 
> and corrected, so I'd be very nervous about a system that uses 
> autogenerate on the fly and sends those changes directly to a 
> production database without any review.

So it looks like Nova has decided at the summit to forge ahead with 
online schema migrations.    Travel issues prevented me from being 
present at the summit, and therefore from the session where this was 
discussed.   But had I been there, a short 40-minute session wouldn't 
have been a venue in which I could have organized my thoughts enough to 
be any more effective in discussing this feature, so it's probably 
better that I wasn't there.   As I've mentioned before, the timing of 
the blueprint for this feature was just not well synchronized for me; it 
was proposed in my first month working for Red Hat and OpenStack, when I 
hardly knew what things were, and as you can see from my comments at 
https://review.openstack.org/#/c/102545/9, I was primarily alarmed at 
the notion that this system was going to be built entirely on 
SQLAlchemy-Migrate internals, a project that one of my primary tasks at 
my new job was to replace with Alembic.   I hardly understood what 
the actual proposal was, as I was still learning how to install OpenStack 
at that point, so I really missed being able to dig deeply into it.  The 
spec went quiet after a few weeks and I mostly forgot about it, until 
November, when it suddenly awoke, burned ahead through Christmas and was 
approved on Jan 6.  Again, terrible timing for me, as my wife gave birth 
to our son in late October and I was pretty much 24/7 dealing with a 
newborn, not to mention getting through the holidays.  So I missed the 
boat on the blueprint entirely.

For now, I have to assume that Nova will go ahead with this.  But let me 
at least take some effort to explain more fully what I think the problem 
with this approach is.  I don't think this problem will necessarily be 
that big a deal for Nova, at least most of the time; but when it is a 
problem, it might be pretty bad.   My concern is that the system has no 
way at all to provide for manual migration steps, nor any control over 
how schema migrations proceed; and critically, that it makes no 
provision for the very common case of schema migrations that also need 
data to be moved and manipulated.   The blueprint makes no mention 
whatsoever of how migrations of data will be handled, not 
even within sections such as "Developer impact" or "Testing".   Right 
now, data migrations are just part of the sqlalchemy-migrate scripts or 
Alembic scripts.  But with the change that we no longer write such 
scripts, nor even have a place to put them if we wanted to, data 
migrations are no longer integrated within this system and have to be 
dealt with externally.
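
For contrast, here is a rough sketch of the kind of Alembic script I 
mean, where the DDL and the data movement live in the same place; the 
table, column, and revision names here are made up purely for 
illustration:

# hypothetical Alembic migration script; revision identifiers omitted
from alembic import op
import sqlalchemy as sa

def upgrade():
    # schema change: add the new column, nullable for the moment
    op.add_column('thing', sa.Column('new_value', sa.Integer(),
                                     nullable=True))

    # data migration: populate it from the old column, right here in the
    # same script, against a schema state that is known and fixed
    op.execute("UPDATE thing SET new_value = CAST(old_value AS INTEGER)")

    # schema change: now that it's populated, make it NOT NULL and drop
    # the old column
    op.alter_column('thing', 'new_value', existing_type=sa.Integer(),
                    nullable=False)
    op.drop_column('thing', 'old_value')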

It may be the case that Nova has a schema that is no longer in need of 
major changes, that we are only talking about adding new columns and new 
tables to support new features and removing some old cruft, and that 
moving data around is just not going to be needed.    But once you build 
a system that makes data migrations second-class or even non-citizens, 
you close the door on how much you can do with your schema.   Big 
changes down the road are basically no longer possible without the 
ability to also migrate data as the DDL is emitted.

So OK, of course we can still do data migrations.  The spec doesn't need 
to say anything; it should be obvious that they need to be performed 
during the "migrate" phase, in between "expand" and "contract", when you 
have both the new tables/columns available as a destination for data and 
the old tables/columns still present as the source.   As far as what 
form they take, we no longer have migration scripts or versions within a 
major release, so we have to assume it will be just a big series of 
scripts somewhere, tagged to the major release like "Kilo" or 
"Liberty": just a bunch of database code that runs while we're in 
"migrate".

I have no doubt that's what Nova will do, and usually it will be fine.   
But to illustrate, here's the kind of scenario where this goes wrong, in 
a way that is at best pretty annoying and at worst a serious and 
error-prone development and testing headache.    Let's start with a 
hypothetical schema that has some design issues: two tables, "widget" 
and "widget_status", where "widget_status" stores some kind of 
information about a "widget", along with a timestamp, unfortunately as a string:

CREATE TABLE widget (
     id INTEGER PRIMARY KEY,
     name VARCHAR(30) NOT NULL
)

CREATE TABLE widget_status (
    widget_id INTEGER PRIMARY KEY REFERENCES widget(id),
    status_flag INTEGER NOT NULL,
    modified_date VARCHAR(30)
)


Let's say that two different developers, in two entirely separate 
changes, want to accomplish two things:  1. convert "modified_date" 
into a new column "modified_timestamp" which is a DATETIME type, not a 
string, and 2. merge these two tables into one, as the need to JOIN all 
the time is non-performant and unnecessary.   That is, we'll end up with 
this:


CREATE TABLE widget (
    id INTEGER PRIMARY KEY,
    name VARCHAR(30) NOT NULL,
    status_flag INTEGER NOT NULL,
    modified_timestamp DATETIME
)

Right off, let's keep in mind that when the online schema migration runs, 
the fact that there's a #1 and a #2 migration to the schema is lost.   
Even though migration #1 will add "modified_timestamp" to 
"widget_status", when we run the sum of both #1 and #2, that interim 
state of the schema will never exist; no changes are made to 
widget_status except for the final DROP, as these changes aren't visible 
by just looking at the two endpoints of the schema, which, unless I'm 
totally misunderstanding, is how online schema changes work.

Developer #1 changes the model such that there's a new column on 
"widget_status" called "modified_timestamp" which includes value as a 
datetime; this is a new column add that replaces the modified_date 
column, and because we need to do a data migration, both columns need to 
exist simultaneously while the string-based dates are UPDATEd into the 
timestamp column.   The developer writes a data migration script that 
will transfer this data while the table has both columns.  If we look at 
the SQL mapped to online schema migration steps, they are:

"expand":  ALTER TABLE widget_status ADD COLUMN modified_timestamp DATETIME;
"migrate":  UPDATE widget_status SET 
modified_timestamp=convert_string_to_datetime(modified_date)
"contract": ALTER TABLE widget_status DROP COLUMN modified_date

The UPDATE statement above is coded into some script somewhere, 
"liberty/01_migrate_widget_status_timestamp.py".  The developer commits 
all this and everything works great.
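
Since the blueprint doesn't specify what form these per-release scripts 
take, I'm imagining something minimal; purely as a hypothetical sketch, 
01 might be nothing more than a function handed a database connection 
during the "migrate" phase:

# liberty/01_migrate_widget_status_timestamp.py -- hypothetical layout;
# runs during "migrate", when widget_status has both modified_date and
# modified_timestamp
def migrate(connection):
    connection.execute(
        "UPDATE widget_status SET "
        "modified_timestamp=convert_string_to_datetime(modified_date)"
    )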

Developer #2 comes along two months later.  She looks at the model and 
sees no mention of any column called "modified_date"; indeed, this 
column name is not in the source code of the application anywhere at 
all, except in that liberty/01_...py script which she isn't looking 
at.    She makes her changes to the model, moving all the columns of 
widget_status into widget and removing the widget_status model.  She 
also writes a data migration script to copy all the data.   If we again 
look at the SQL mapped to online schema migration steps, they are:

"expand":

ALTER TABLE widget ADD COLUMN status_flag INTEGER NULL
ALTER TABLE widget ADD COLUMN modified_timestamp DATETIME

"migrate":

UPDATE widget SET status_flag=(SELECT status_flag FROM widget_status 
WHERE widget_id=widget.id)
UPDATE widget SET modified_timestamp=(SELECT modified_timestamp FROM 
widget_status WHERE widget_id=widget.id)

"contract":

DROP TABLE widget_status

Let's say the migrate step above is in another script, 
"liberty/02_migrate_widget_status_to_widget.py".

Some readers may see where this is going.  Before we go there, let me 
also note that these schema changes certainly *do* require that the 
developer write a "migration".   Online schema changes can save us work 
in very simplistic cases, but even for a basic series of operations like 
the above, we need a "migration".   IMO the need to "write a migration 
script" isn't part of the "problem description", as the blueprint states 
it is; but even if it were, the blueprint does not provide a solution to 
this issue for anything but the most trivial cases.  
Also notice there's really no way to deal with the fact that we'd really 
like "widget.status_flag" to be NOT NULL; under a traditional script 
model, we'd add the column, populate it, then alter it to be NOT NULL; 
online schema migrations remove any place for this to happen, unless 
we consider it to be part of the "contract" phase.  Looking at the 
current code I see nothing that attempts to deal with this, and the 
blueprint makes no mention of this thorny issue.
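
Just to spell out what that traditional sequence looks like, here it is 
in Alembic terms, as a sketch only:

# the traditional three-step approach for widget.status_flag, all within
# a single migration script; a sketch, not the blueprint's mechanism
from alembic import op
import sqlalchemy as sa

def upgrade():
    # add the column as nullable so that existing rows are allowed
    op.add_column('widget', sa.Column('status_flag', sa.Integer(),
                                      nullable=True))

    # populate it while widget_status is still present
    op.execute(
        "UPDATE widget SET status_flag="
        "(SELECT status_flag FROM widget_status WHERE widget_id=widget.id)"
    )

    # only now, with every row populated, make it NOT NULL
    op.alter_column('widget', 'status_flag', existing_type=sa.Integer(),
                    nullable=False)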

So with our two model changes, and our two data migration scripts, let's 
see what online schema changes do to them.   All the "expand" steps are 
lumped into a single net migration, and necessarily all occur 
automatically, with no ability to intervene or change how they run.   
All the "contract" steps, same thing.   Which means that the addition of 
the column "modified_timestamp" to "widget_status" never happens, 
because in the comparison of pre-expand to post-contract, the 
"widget_status" table is simply gone.   Which means that, because we've 
dropped this table, developer #2 now has to become aware of 
migration 01_migrate_widget_status_timestamp.py and change that script 
as well.  Without changing it, this is what runs:

"expand":
ALTER TABLE widget ADD COLUMN status_flag INTEGER NULL;
ALTER TABLE widget ADD COLUMN modified_timestamp DATETIME;

"migrate":
01 -> UPDATE widget_status SET 
modified_timestamp=convert_string_to_datetime(modified_date)   # --> 
fails in all cases, no "modified_timestamp" column is present.

02 ->
UPDATE widget SET status_flag=(SELECT status_flag FROM widget_status 
WHERE widget_id=widget.id)
UPDATE widget SET modified_timestamp=(SELECT modified_timestamp FROM 
widget_status WHERE widget_id=widget.id)  # --> fails on a pre-migrated 
database, no "modified_timestamp" column is present on widget_status

"contract":
DROP TABLE widget_status

I'm not sure by what mechanism the above failures would be discovered in 
development.   But let's assume that they certainly are, as is the case 
now: in CI we do a "Kilo"->"Liberty" run with data present, and the 
failure of these scripts is discovered.

The developer of 02_migrate_widget_status_to_widget.py basically has to 
remove the 01_migrate_widget_status_timestamp.py script entirely and 
merge the work that it does into her migration.  That is, instead of 
having two independent and isolated data migrations:

01 -> UPDATE widget_status SET 
modified_timestamp=convert_string_to_datetime(modified_date)

02 ->
UPDATE widget SET status_flag=(SELECT status_flag FROM widget_status 
WHERE widget_id=widget.id)
UPDATE widget SET modified_timestamp=(SELECT modified_timestamp FROM 
widget_status WHERE widget_id=widget.id)

we have to munge the two migrations together, because the state of the 
database that 01 was coded against will never exist:

01 -> gone
02 -> moved to 01, and changed to:
01 ->
UPDATE widget SET status_flag=(SELECT status_flag FROM widget_status 
WHERE widget_id=widget.id)
UPDATE widget SET modified_timestamp=(SELECT 
convert_string_to_datetime(modified_date) FROM widget_status WHERE 
widget_id=widget.id)

The developer of 02 would not likely have a clear idea that this is how 
the migration has to be built, unless she carefully reads all migration 
scripts preceding hers that refer to the same tables or columns, or until 
she sees a full run from "Kilo" -> "Liberty" fail; this strongly suggests 
that looking at older versions of the code and reading 
through history will be part of a typical strategy for figuring 
out the correct steps.   Using traditional migration steps, none of this 
complexity is needed; data migrations can be coded independently of each 
other, against an explicit and fixed database state that will always 
exist while that data migration runs.

Basically, if we start with a series of traditional schema migrations 
and associated data migrations, we could illustrate that as follows:

As -> Ad -> Bs -> Bd  -> Cs -> Cd -> Ds -> Dd

Where "Xs" is a schema migration and "Xd" is a data migration. Online 
schema changes basically remove all the "s" in between, leaving us with:

Ae -> De -> Ad -> Bd -> Cd -> Dd -> Dc -> Ac

Where "Xe" is an expand and "Xc" is a contract.   Without a traditional 
versioning model, there is no more schema state for B or C; these states 
no longer exist, even though the data migrations, when they were 
written, were coded against these states.    We've written data 
migrations against a schema that ceases to ever exist once a new 
migration is added, as below, our script that was written against the 
state "Ds" will also no longer have a schema at that state, once we add "E":

Ae -> Ee -> Ad -> Bd -> Cd -> Dd -> Ed -> Ec -> Ac      # "De / Dc" has 
disappeared

Basically, online schema migrations mean that known database states are 
constantly being wiped out, and data migration scripts which were 
written against these states are constantly being broken.  As the number 
of changes increases, so does the number of scripts potentially broken by 
losing the discrete state they were coded against and requiring manual 
re-coding.

The process of constantly folding over data migrations, written as 
discrete sequential steps but interpreted only at runtime against an 
entirely different database schema than the one made available during 
development, is, for a modest series of changes, somewhat 
tedious and error prone; for a series of changes that represent more 
serious schema refactorings, it would quickly become unmanageable.    The 
net result is that as the things we want to do to our schema leave the 
realm of the extremely simple and trivial, online schema migrations 
quickly end up *creating* more work than they remove, and this 
could lead to a hesitancy to take on more comprehensive 
schema migrations, thus making technical debt that much more difficult 
to clean up.

This is the very long form version of what I've been hypothesizing.  If 
Nova is in a place where more comprehensive, or 
even modest, schema refactorings are assumed to never be needed, and 
changes are always going to be small, simple, and involve little to no 
data migration, then we're fine.     But without any place to write data 
migrations against known states of the schema, since those known states 
keep disappearing, we give up the ability to make incremental changes to 
a database where significant data migrations are needed.   I think 
that's a very high price to pay.








