[openstack-dev] Online Migrations.
Mike Bayer
mbayer at redhat.com
Mon Jun 15 22:37:30 UTC 2015
On 6/15/15 4:21 PM, Andrew Laski wrote:
> On 06/15/15 at 03:23pm, Mike Bayer wrote:
>>
>> 1. at runtime? e.g. your nova service is running, it's doing "SELECT
>> x, y FROM thing", then some magic thing happens somewhere and the app
>> suddenly sees, hey "y" is gone! change all queries to "SELECT x FROM
>> thing". What would this magic thing be? Are you going to run a
>> reflection of the table schema on every query (you definitely
>> aren't). So I don't know that this is possible.
>
> Would it be dangerous to signal that 'y' is gone by having a query
> fail and at that point the model could be updated? In other words, is
> there a chance of a query failing in such a way as to leave data in an
> inconsistent or undesirable state?
Nova currently breaks up its database activity into many small
transactions, because most of its methods call get_session() to get a
brand new session. So it already has the problem that a failed database
transaction is not necessarily atomic with respect to other things that
have happened within a particular API request. We're looking to improve
this with enginefacade; however, I can't rule out that some Nova
operations currently rely on this transactional structure in order to
succeed.
As for the effects of a transaction that fails because a column was
removed while the transaction was in progress: on the MySQL side I'd not
be surprised if some bad things can happen, as its DDL operations are
not transactional, but I don't know of anything specific. As for the
case where the column was removed some number of seconds ago and a brand
new transaction targets that column, unaware that it was removed: that
query / transaction just fails in the traditional way, opening us up
only to the same kinds of issues as any other failure within a
transaction does right now.
But an approach that builds on this is at the very least far outside
the mainstream of how relational databases are normally used. It means
that Nova is being built such that wide-scale service failures are now
part of its design; any time a table or column is removed, all running
nodes are guaranteed to experience failures, because we are relying on a
purely optimistic approach. Unless we build in a highly synchronized
system, all nodes, and even individual threads/greenlets, will be
rushing out to the database to perform live schema inspection in order
to literally fix their own bugs on the fly, because we don't have any
specific kind of messaging (either versioning, or messages that indicate
a list of columns that have been dropped) describing what changes have
been made. It also means that this inspection step has to take place on
application startup in any case, because the schema state is unknown
except from live inspection of the DB.
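To make the optimistic approach concrete, here is a minimal sketch using
the stdlib sqlite3 module rather than Nova's actual session code. The
names known_columns, live_columns() and optimistic_select() are
hypothetical, made up for illustration only.

```python
import sqlite3

# Hypothetical model: the columns the application believes "thing" has.
known_columns = {"thing": ["x", "y"]}


def live_columns(conn, table):
    """Reflect the current column names from the SQLite schema catalog."""
    return [row[1] for row in conn.execute("PRAGMA table_info(%s)" % table)]


def optimistic_select(conn, table):
    """Run the query as the model describes it; on failure, reflect and retry."""
    cols = known_columns[table]
    try:
        return conn.execute(
            "SELECT %s FROM %s" % (", ".join(cols), table)).fetchall()
    except sqlite3.OperationalError:
        # The first attempt has already failed by the time we get here.
        present = set(live_columns(conn, table))
        known_columns[table] = [c for c in cols if c in present]
        return conn.execute(
            "SELECT %s FROM %s" % (", ".join(known_columns[table]), table)
        ).fetchall()


conn = sqlite3.connect(":memory:")
# The schema has already been contracted: "y" is gone.
conn.execute("CREATE TABLE thing (x INTEGER)")
conn.execute("INSERT INTO thing (x) VALUES (1)")
rows = optimistic_select(conn, "thing")
```

Note that every worker that hits the failure does its own reflection
round trip, and the unsynchronized update of known_columns is exactly
the kind of shared mutation that is not safe across threads/greenlets.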
If I had to visualize an approach that does this somewhat cleanly,
other than just putting off contract until the API has naturally moved
beyond it, it would involve a fixed and structured source of truth about
the specific changes we care about, such as a versioning table or other
data table listing the specific "remove()" directives we're checking
for, and the application would be organized such that it can always get
to this information from an in-memory cached source before it makes
decisions about queries. The information would need to support being
pushed in from the outside, such as via a message queue. This would
still not protect operations currently in progress from failing, but it
would at least prevent future operations from failing a first time.
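A rough sketch of what such an in-memory source of truth might look
like, assuming each "remove()" directive carries a unique version
number. The DroppedColumns class and its methods are hypothetical;
nothing like this exists in Nova or oslo.db as far as this thread is
concerned.

```python
import threading


class DroppedColumns(object):
    """In-memory cache of remove() directives, keyed by table name."""

    def __init__(self):
        self._lock = threading.Lock()
        self._dropped = {}    # table name -> set of dropped column names
        self._seen = set()    # directive versions already applied

    def apply(self, version, table, column):
        """Record one directive; usable at startup or from an MQ consumer."""
        with self._lock:
            if version in self._seen:
                return False  # duplicate delivery, nothing to do
            self._seen.add(version)
            self._dropped.setdefault(table, set()).add(column)
            return True

    def usable(self, table, columns):
        """Filter a column list down to those not known to be dropped."""
        with self._lock:
            dropped = set(self._dropped.get(table, ()))
        return [c for c in columns if c not in dropped]


registry = DroppedColumns()
registry.apply(1, "thing", "y")   # e.g. loaded from a versioning table
cols = registry.usable("thing", ["x", "y"])
```

The same apply() call would serve both the startup load and a message
pushed in later, so a query built after the message arrives never
references the dropped column in the first place.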
We also need to decide on "change the model" vs. "change the queries".
I keep thinking it's going to have to be "change the queries". ORM and
schema models aren't designed to be mutable in a subtractive sense at
runtime (e.g. there is no "remove column"; removes are much more
difficult to book-keep around than additions), and even if they were,
the whole scheme would not be safe for concurrency; that is, if 10
greenlets / threads all decided to change the model at the same time,
only the first one could win, and the operation would definitely fail
if several of them tried to do it at once. Also, the Nova Cells model,
if I understand correctly, means that the same set of model classes can
be used to talk to databases at several different schema versions at
once; so even if we did go through all the trouble to change the models
on the fly, that would then break in a Cells environment, assuming not
every database had the same contract steps run.
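As an illustration of "change the queries" under Cells, here is a
sketch where the declared columns are never mutated and the column list
is decided per database at query time. THING_COLUMNS, dropped_by_cell
and select_thing() are hypothetical names, and the plain SQL string
stands in for whatever the ORM query would be.

```python
# Static declaration, never mutated at runtime (stands in for the ORM model).
THING_COLUMNS = ("x", "y")

# Hypothetical per-database state: which columns each cell has contracted.
dropped_by_cell = {
    "cell1": {"thing": {"y"}},   # contract already run here
    "cell2": {},                 # contract not yet run
}


def select_thing(cell):
    """Build the SELECT for one cell without touching THING_COLUMNS."""
    dropped = dropped_by_cell.get(cell, {}).get("thing", set())
    cols = [c for c in THING_COLUMNS if c not in dropped]
    return "SELECT %s FROM thing" % ", ".join(cols)
```

With this, select_thing("cell1") leaves out "y" while
select_thing("cell2") still includes it, and both come from the same
unchanged model declaration.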
>
>>
>> 2. at application start time? e.g. nova service starts up,
>> something happens before "MyThing" is first declared where MyThing
>> knows that "y" is no longer there for this run (or something that
>> will impact all the queries and persistence operations, less desirable).
>>
>> #2 is much more possible. But still, how does it run? How do we
>> know that "y" is there on one run, and is not there on another? do we:
>>
>> 2a. When the app starts up, we run reflection queries against the DB
>> (e.g. what autogenerate / OSM does, looking in schema catalogs).
>> This is doable, but can get expensive on startup if we really have
>> lots of columns/tables to worry about; it also means that the changes
>> to the queries have to happen totally at query time (intricate,
>> difficult-ish), because for the change to happen at model definition
>> time (simple, easy) the app needs to be connected to the database
>> before it imports the models, and this is the complete opposite of
>> how Nova's api.py is constructed right now. Plus the feature needs
>> to accommodate Cells, where there's a totally different database
>> involved (maybe this has to be query time for that reason alone).
>>
>> 2b. In a config file somewhere? Some kind of directive that says,
>> "hey, we have now dropped thing.y". What would that look like?
>>
>> 2c. Based on some kind of version number in the database? Not too
>> much different from #2a.
>>
>>>
>>> That said, I still think we should get the original thing merged. Even
>>> if we did contractions purely with the manual migrations for the
>>> foreseeable future, that'd be something we could deal with.
>>>
>>> --Dan
>>>
>>> __________________________________________________________________________
>>>
>>> OpenStack Development Mailing List (not for usage questions)
>>> Unsubscribe:
>>> OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev