[openstack-dev] Online Migrations.

Mike Bayer mbayer at redhat.com
Mon Jun 15 22:37:30 UTC 2015



On 6/15/15 4:21 PM, Andrew Laski wrote:
> On 06/15/15 at 03:23pm, Mike Bayer wrote:
>>
>> 1. at runtime?  e.g. your nova service is running, it's doing "SELECT 
>> x, y FROM thing", then some magic thing happens somewhere and the app 
>> suddenly sees, hey "y" is gone!  change all queries to "SELECT x FROM 
>> thing".  What would this magic thing be?  Are you going to run a 
>> reflection of the table schema on every query?  (You definitely 
>> aren't.)  So I don't know that this is possible.
>
> Would it be dangerous to signal that 'y' is gone by having a query 
> fail and at that point the model could be updated?  In other words, is 
> there a chance of a query failing in such a way as to leave data in an 
> inconsistent or undesirable state?

Nova currently breaks up its database activity into many small 
transactions, because most of its methods call get_session() to obtain 
a brand new session.  So it already has the problem that a failed 
database transaction is not necessarily atomic with respect to other 
things that have happened within the same API request.  We're looking 
to improve this with enginefacade; however, I can't rule out that some 
Nova operations currently rely on this transactional structure in 
order to succeed.

As for the effects of a transaction that fails because a column was 
removed while the transaction was in progress: on the MySQL side I 
wouldn't be surprised if some bad things can happen there, as its DDL 
operations are not transactional, but I don't know of anything 
specific.  As for the case where the column was removed some number of 
seconds ago and a brand new transaction targets that column, unaware 
that it was removed earlier: that query / transaction just fails in the 
traditional way, opening us up only to the same kinds of issues as any 
other mid-transaction failure does right now.
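
To make the "many small transactions" point concrete, here is a rough 
sketch.  It uses the stdlib sqlite3 module rather than Nova's session 
code, and the "thing" / "audit" tables and the function names are made 
up for illustration.  The column removal is simulated by rebuilding the 
table.  The first transaction of the request commits, the column goes 
away, and the second transaction fails in the ordinary way while the 
first one's work stays committed:

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode so that we
# control transaction boundaries explicitly with BEGIN / COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE thing (id INTEGER PRIMARY KEY, x TEXT, y TEXT)")
conn.execute("CREATE TABLE audit (msg TEXT)")


def drop_column_y(conn):
    # Stand-in for a "contract" migration: rebuild the table without "y".
    conn.execute("BEGIN")
    conn.execute("CREATE TABLE thing_new (id INTEGER PRIMARY KEY, x TEXT)")
    conn.execute("INSERT INTO thing_new (id, x) SELECT id, x FROM thing")
    conn.execute("DROP TABLE thing")
    conn.execute("ALTER TABLE thing_new RENAME TO thing")
    conn.execute("COMMIT")


def api_request(conn):
    # Transaction 1: commits independently of anything that follows.
    conn.execute("BEGIN")
    conn.execute("INSERT INTO audit (msg) VALUES ('request started')")
    conn.execute("COMMIT")

    # The column is removed between the two transactions.
    drop_column_y(conn)

    # Transaction 2: still written against the old schema.
    conn.execute("BEGIN")
    try:
        conn.execute("INSERT INTO thing (x, y) VALUES ('a', 'b')")
        conn.execute("COMMIT")
        return "ok"
    except sqlite3.OperationalError as exc:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        return "failed: %s" % exc


result = api_request(conn)
audit_rows = conn.execute("SELECT COUNT(*) FROM audit").fetchone()[0]
thing_rows = conn.execute("SELECT COUNT(*) FROM thing").fetchone()[0]
print(result, audit_rows, thing_rows)
```

The request as a whole failed, but the audit row from the first 
transaction is still there; that is the non-atomicity I'm referring to.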

But an approach that builds on this is, at the very least, far outside 
the mainstream of how relational databases are normally used.  It 
means that Nova is being built such that wide-scale service failures 
are now part of its design; any time a table or column is removed, all 
running nodes are guaranteed to experience failures, because we are 
relying on a purely optimistic approach.  Unless we build in a highly 
synchronized system, all nodes, and even individual threads/greenlets, 
will be rushing out to the database to perform live schema inspection 
in order to literally fix their own bugs on the fly, because we don't 
have any specific kind of messaging (either versioning, or messages 
that indicate a list of columns that have been dropped) describing 
what changes have been made.  It also means that this inspection step 
has to take place on application startup in any case, because the 
schema state is unknown except from live inspection of the DB.

If I had to visualize an approach that does this somewhat cleanly, 
other than just putting off contract until the API has naturally moved 
beyond it, it would involve a fixed and structured source of truth 
about the specific changes we care about, such as a versioning table or 
other data table indicating the specific "remove()" directives we're 
checking for.  The application would be organized such that it can 
always get to this information from an in-memory cache before it makes 
decisions about queries, and the information would need to support 
being pushed in from the outside, such as via a message queue.  This 
would still not protect operations currently in progress from failing, 
but it would at least prevent future operations from failing a first 
time.
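
As a very rough sketch of that idea, and nothing more than that: 
assume a hypothetical "schema_removals" table listing (table, column) 
pairs, and a hypothetical RemovalRegistry class that caches it in 
memory and can also be fed from outside, e.g. by a message queue 
consumer calling push().  None of these names exist in Nova or oslo.db; 
plain sqlite3 is used to keep it self-contained:

```python
import sqlite3
import threading


class RemovalRegistry:
    """In-memory cache of columns that have been declared removed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._removed = frozenset()

    def load(self, conn):
        # Refresh the cache from the (hypothetical) schema_removals table.
        rows = conn.execute(
            "SELECT table_name, column_name FROM schema_removals"
        ).fetchall()
        with self._lock:
            self._removed = frozenset(rows)

    def push(self, table, column):
        # Entry point for an outside notification, e.g. a queue consumer.
        with self._lock:
            self._removed = self._removed | {(table, column)}

    def live_columns(self, table, columns):
        # Readers take a reference to the current immutable set; writers
        # replace the whole set under the lock rather than mutating it.
        removed = self._removed
        return [c for c in columns if (table, c) not in removed]


# The column list the application code was written against.
THING_COLUMNS = ("x", "y")


def select_thing(conn, registry):
    # The query is assembled from whatever columns are still "live".
    cols = registry.live_columns("thing", THING_COLUMNS)
    return conn.execute("SELECT %s FROM thing" % ", ".join(cols)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schema_removals (table_name TEXT, column_name TEXT)")
conn.execute("CREATE TABLE thing (x TEXT, y TEXT)")
conn.execute("INSERT INTO thing (x, y) VALUES ('a', 'b')")
conn.commit()

registry = RemovalRegistry()
registry.load(conn)
before = select_thing(conn, registry)   # still selects x and y

registry.push("thing", "y")             # removal announced from outside
after = select_thing(conn, registry)    # no longer refers to y
print(before, after)
```

Note that this only works if the directive reaches the application 
before the contract step actually drops the column; operations already 
in flight are not helped by it.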

We also need to decide on "change the model" vs. "change the queries".  
I keep thinking it's going to have to be "change the queries".  ORM and 
schema models aren't designed to be mutable in a subtractive sense at 
runtime (e.g. there is no "remove column"; removes are much more 
difficult to book-keep around than additions), and even if they were, 
the whole scheme would not be safe for concurrency; that is, if 10 
greenlets / threads all decided to change the model at the same time, 
only the first greenlet/thread would win, and the operation would 
definitely fail if multiple threads tried to do it at once.  Also, the 
Nova Cells model, if I understand correctly, means that the same set of 
model classes can be used to talk to multiple versions of the database 
at once; so even if we did go through all the trouble to change the 
models on the fly, that would then break in a Cells environment, 
assuming not every database had the same contract steps run.
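
Here is roughly what I mean by "change the queries", again only as a 
sketch with made-up names (THING, cells, fetch_things) and plain 
sqlite3 standing in for the real thing.  The model definition is never 
mutated; the column list is worked out per query from the removal set 
belonging to whichever database is being talked to, so two databases at 
different contract states can be served by the same model:

```python
import sqlite3

# Static "model": a table name plus the columns the code was written
# against.  Nothing below ever modifies it.
THING = ("thing", ("x", "y"))


def make_db(with_y):
    # Build an in-memory database either before or after the contract.
    conn = sqlite3.connect(":memory:")
    if with_y:
        conn.execute("CREATE TABLE thing (x TEXT, y TEXT)")
        conn.execute("INSERT INTO thing (x, y) VALUES ('a', 'b')")
    else:
        conn.execute("CREATE TABLE thing (x TEXT)")
        conn.execute("INSERT INTO thing (x) VALUES ('a')")
    conn.commit()
    return conn


# Each cell has its own database and its own set of removed columns.
cells = {
    "cell1": (make_db(with_y=True), frozenset()),
    "cell2": (make_db(with_y=False), frozenset({("thing", "y")})),
}


def fetch_things(cell_name):
    conn, removed = cells[cell_name]
    table, columns = THING
    # Decide at query time which columns this particular database has.
    live = [c for c in columns if (table, c) not in removed]
    sql = "SELECT %s FROM %s" % (", ".join(live), table)
    return [dict(zip(live, row)) for row in conn.execute(sql).fetchall()]


rows_cell1 = fetch_things("cell1")
rows_cell2 = fetch_things("cell2")
print(rows_cell1, rows_cell2)
```

Since nothing shared is modified, there is no race between greenlets / 
threads over the model itself; the cost is that every query path has to 
go through something like fetch_things().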




>
>>
>> 2. at application start time?   e.g. nova service starts up, 
>> something happens before "MyThing" is first declared where MyThing 
>> knows that "y" is no longer there for this run (or something that 
>> will impact all the queries and persistence operations, less desirable).
>>
>> #2 is much more possible.  But still, how does it run?   How do we 
>> know that "y" is there on one run, and is not there on another?   do we:
>>
>> 2a.  When the app starts up, we run reflection queries against the DB 
>> (e.g. what autogenerate / OSM does, looking in schema catalogs).  
>> This is doable, but can get expensive on startup if we really have 
>> lots of columns/tables to worry about.  It also means that either the 
>> changes to the queries happen totally at query time (intricate, 
>> difficult-ish), or they happen at model definition time (simple, 
>> easy), which means the app needs to be connected to the database 
>> before it imports the models, and this is the complete opposite of 
>> how Nova's api.py is constructed right now.  Plus the feature needs 
>> to accommodate Cells, where there's a totally different database 
>> in play (maybe this has to be query time for that reason alone).
>>
>> 2b. In a config file somewhere?   Some kind of directive that says, 
>> "hey, we have now dropped thing.y".  What would that look like?
>>
>> 2c. Based on some kind of version number in the database?   Not too 
>> much different from #2a.
>>
>>
>>
>>
>>
>>
>>>
>>> That said, I still think we should get the original thing merged. Even
>>> if we did contractions purely with the manual migrations for the
>>> foreseeable future, that'd be something we could deal with.
>>>
>>> --Dan
>>>
>>> __________________________________________________________________________ 
>>>
>>> OpenStack Development Mailing List (not for usage questions)
>>> Unsubscribe: 
>>> OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>>



