[openstack-dev] [nova][heat] sqlalchemy-migrate tool to alembic
Jay Pipes
jaypipes at gmail.com
Thu May 21 16:28:47 UTC 2015
Hi Mike,
I had some of the same initial concerns around the online db schema
migration work that you raise below. However, once I realized one big
thing, those concerns were allayed.
Here's the big thing that you're missing: no data migrations are allowed
any more in DB migration scripts.
Yes, that sounds exceedingly weird, I know, but hear me out in the
comments inline... :) I've trimmed your original email for brevity.
On 05/20/2015 11:57 PM, Mike Bayer wrote:
> But to illustrate, here's the kind of place that goes wrong in such a
> way that is at best pretty annoying and at worst a serious and
> error-prone development and testing headache. Let's start with a
> hypothetical schema that has some design issues. Two tables "widget"
> and "widget_status", where "widget_status" has some kind of information
> about a "widget", and it also stores a timestamp, unfortunately as a
> string:
>
> CREATE TABLE widget (
> id INTEGER PRIMARY KEY,
> name VARCHAR(30) NOT NULL
> )
>
> CREATE TABLE widget_status (
> widget_id INTEGER PRIMARY KEY REFERENCES widget(id),
> status_flag INTEGER NOT NULL,
> modified_date VARCHAR(30)
> )
>
> Let's say that two entirely different changes by two different
> developers want to accomplish two things: 1. convert "modified_date"
> into a new column "modified_timestamp" which is a DATETIME type, not a
> string and 2. merge these two tables into one, as the need to JOIN all
> the time is non-performant and unnecessary. That is, we'll end up with
> this:
>
> CREATE TABLE widget (
> id INTEGER PRIMARY KEY,
> name VARCHAR(30) NOT NULL,
> status_flag INTEGER NOT NULL,
> modified_timestamp DATETIME
> )
>
> Right off, let's keep in mind that when online schema migrations runs,
> the fact that there's a #1 and a #2 migration to the schema is lost.
And this is where you're missing the important piece :) While it *is*
true that the expand phase would combine the *DDL* schema migrations
into just a final table structure as shown above, it is actually *not*
true that the knowledge of how the model schemas changed is lost.
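To make "combine the DDL" concrete, here's a rough sketch of what the
combined expand migration might look like in Alembic -- this is my
illustration, not an actual Nova migration, and the revision ids are
made up:

import sqlalchemy as sa
from alembic import op

revision = '1a2b3c'       # hypothetical
down_revision = None      # hypothetical

def upgrade():
    # Purely additive DDL: no data is moved, and the old
    # widget_status table (and its modified_date column) stays in
    # place until the contract phase.
    op.add_column('widget',
                  sa.Column('modified_timestamp', sa.DateTime(),
                            nullable=True))
    op.add_column('widget',
                  sa.Column('status_flag', sa.Integer(),
                            nullable=True))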
The key to all of this is that each change to the DB schema would be in
a patch that contained corresponding changes to the nova.objects.*
classes that are used to represent the data. And it is *these* object
classes that have object versioning and model changes tracked in them.
So, for instance, let's say the widget was represented by the
nova.objects.widget.Widget class in the following code, before the
patch that added the #1 schema migration:
class Widget(base.NovaPersistentObject):
    VERSION = '1.0'

    fields = {
        'id': fields.IntegerField(),
        'name': fields.StringField(),
        'modified_date': fields.StringField(),
        'status_code': fields.IntegerField()
    }
The above object would have two important methods on it; one that would
know how to save the data in the object's fields to the backend database
and another method (obj_make_compatible()) that would know how to return
the object to callers that expected the object's schema to look a
certain way at a certain version of the object.
Before the patch that landed DB schema migration #1, the Widget object's
obj_make_compatible() method would be blank -- since there are no version
transformations that need to occur. The Widget.save() method would take
the values in the fields and write them to the widget and widget_status
tables in a single transaction.
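For concreteness, here's a sketch of that 1.0 save() -- this is mine,
not actual Nova code; widget_save() and db_api are the hypothetical DB
API used throughout the examples below:

@remotable
def save(self):
    # Writes both tables in a single transaction:
    #
    #   BEGIN;
    #   UPDATE widget SET name = ? WHERE id = ?;
    #   UPDATE widget_status SET status_flag = ?, modified_date = ?
    #    WHERE widget_id = ?;
    #   COMMIT;
    db_api.widget_save(self._context, self.id,
                       self.obj_get_changes())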
Now, along comes the patch that adds DB schema migration #1 that changes
the data type of the underlying widget_status.modified_date from a
VARCHAR to a DATETIME, and simultaneously changes the name of the DB
column from modified_date to modified_timestamp. In this same patch,
there would be a corresponding change to the nova.objects.Widget class
that would make the class definition look something like this:
class Widget(base.NovaPersistentObject):
    VERSION = '1.1'

    fields = {
        'id': fields.IntegerField(),
        'name': fields.StringField(),
        'modified_timestamp': fields.DatetimeField(),
        'status_code': fields.IntegerField()
    }

    @staticmethod
    def _from_db_object(context, widget, db_obj):
        # db_obj is what comes back from the DB API call
        # widget_get(). The SQL for this call will look like this:
        #
        #   SELECT w.id, w.name, ws.modified_date,
        #          w.modified_timestamp,
        #          ws.status_flag AS status_code
        #   FROM widget w JOIN widget_status ws ON w.id = ws.widget_id
        #   WHERE w.id = ?
        #
        # The returned data will look like this:
        #   {'id': 1, 'name': 'my widget',
        #    'modified_date': None or string,
        #    'modified_timestamp': None or datetime,
        #    'status_code': 42}
        for f in ('id', 'status_code', 'name'):
            widget[f] = db_obj[f]
        ts = db_obj['modified_timestamp']
        if ts is None:
            # This row's data hasn't been migrated yet; parse the
            # legacy string (assuming a YYYYMMDD format) into a
            # datetime.
            ts = datetime.datetime.strptime(db_obj['modified_date'],
                                            '%Y%m%d')
        widget['modified_timestamp'] = ts
        return widget

    def obj_make_compatible(self, primitive, target_version):
        super(Widget, self).obj_make_compatible(primitive,
                                                target_version)
        # nova.utils.convert_version_to_tuple turns '1.0' into (1, 0)
        target_version = utils.convert_version_to_tuple(target_version)
        if target_version < (1, 1):
            # Older compute nodes don't know about the
            # modified_timestamp field, only the
            # modified_date field, so "translate" the
            # returned result.
            dstring = str(primitive['modified_timestamp'])
            primitive['modified_date'] = dstring
            del primitive['modified_timestamp']

    @remotable
    def save(self):
        # Above, the _from_db_object() method ensured that the
        # translation occurred between the still-existing-but-old
        # modified_date DB column and the new modified_timestamp
        # column. This method calls widget_save() in the DB API,
        # which will hit both the widget and widget_status tables:
        #
        #   BEGIN;
        #   UPDATE widget SET modified_timestamp = ?, name = ?
        #    WHERE id = ?;
        #   UPDATE widget_status SET status_flag = ?
        #    WHERE widget_id = ?;
        #   COMMIT;
        db_api.widget_save(self._context, self.id,
                           self.obj_get_changes())
Over time, the modified_timestamp DB column will be populated and there
will be no need to run a single data migration in any SQLAlchemy-migrate
or Alembic script...
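To make "over time" concrete: under this scheme, merely reading and
re-saving a widget migrates its row forward, with no migration script
involved. (Widget.get_by_id() is a hypothetical query method here.)

widget = Widget.get_by_id(context, 1)  # parses modified_date into
                                       # the modified_timestamp field
widget.save()                          # writes modified_timestamp
                                       # back to the database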
OK, now here comes another developer and she adds DB schema migration #2,
which adds the status_flag column to the widget table (the expand half of
the table merge). Note that the widget_status table will still be around!
The contract phase, which would remove the widget_status table, isn't
going to be run until the DBA has determined that all data migrations
that happen in the Nova objects have occurred...
The Widget class would look like this in the patch that adds the #2 DB
schema migration:
class Widget(base.NovaPersistentObject):
    # The object's fields don't change in this patch, so VERSION
    # stays at 1.1.
    VERSION = '1.1'

    fields = {
        'id': fields.IntegerField(),
        'name': fields.StringField(),
        'modified_timestamp': fields.DatetimeField(),
        'status_code': fields.IntegerField()
    }

    @staticmethod
    def _from_db_object(context, widget, db_obj):
        # db_obj is what comes back from the DB API call
        # widget_get(). The SQL for this call will look like this:
        #
        #   SELECT w.id, w.name, ws.modified_date,
        #          w.modified_timestamp,
        #          w.status_flag AS status_code,
        #          ws.status_flag AS ws_status_code
        #   FROM widget w JOIN widget_status ws ON w.id = ws.widget_id
        #   WHERE w.id = ?
        #
        # The returned data will look like this:
        #   {'id': 1, 'name': 'my widget',
        #    'modified_date': '20150601',
        #    'modified_timestamp': None or datetime,
        #    'status_code': None or integer,
        #    'ws_status_code': None or integer}
        for f in ('id', 'name'):
            widget[f] = db_obj[f]
        ts = db_obj['modified_timestamp']
        if ts is None:
            ts = datetime.datetime.strptime(db_obj['modified_date'],
                                            '%Y%m%d')
        widget['modified_timestamp'] = ts
        sc = db_obj['status_code']
        if sc is None:
            # The new widget.status_flag column hasn't been populated
            # for this row yet, so fall back to the old value in the
            # widget_status table.
            sc = db_obj['ws_status_code']
        widget['status_code'] = sc
        return widget

    @remotable
    def save(self):
        # Above, the _from_db_object() method ensured that the
        # translation occurred between the still-existing-but-old
        # modified_date DB column and the new modified_timestamp
        # column, as well as between the old widget_status.status_flag
        # column and the new widget.status_flag column. This method
        # calls widget_save() in the DB API, which will only now hit
        # the widget table:
        #
        #   BEGIN;
        #   UPDATE widget SET modified_timestamp = ?, name = ?,
        #          status_flag = ?
        #    WHERE id = ?;
        #   COMMIT;
        db_api.widget_save(self._context, self.id,
                           self.obj_get_changes())
In this way, data migrations occur *over time* in the nova.objects
classes, and the contract phase can be delayed indefinitely until such a
point that the DBA has determined all data has been properly migrated.
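For symmetry with the expand sketch earlier, the eventual contract
migration is then trivial -- again, a hypothetical Alembic sketch with
made-up revision ids:

import sqlalchemy as sa
from alembic import op

revision = '4d5e6f'        # hypothetical
down_revision = '1a2b3c'   # hypothetical

def upgrade():
    # Safe only once the DBA has verified that every row has been
    # migrated through the objects layer. Dropping widget_status
    # also removes the legacy modified_date and status_flag copies.
    op.drop_table('widget_status')
    op.alter_column('widget', 'status_flag',
                    existing_type=sa.Integer(), nullable=False)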
Hope this helps explain things a bit more, Mike!
Best,
-jay