[nova] super long online_data_migrations

melanie witt melwittt at gmail.com
Wed Apr 3 16:51:47 UTC 2019


On Wed, 3 Apr 2019 11:11:58 -0500, Matt Riedemann <mriedemos at gmail.com> 
wrote:
> Following up on bug https://bugs.launchpad.net/nova/+bug/1822613 there
> are a few things:
> 
> 1. https://review.openstack.org/#/c/649648/ drops the problematic
> migrate_keypairs_to_api_db data migration since we shouldn't need it
> anymore.

Thanks for doing that; it's now approved.

> 2. As noted in the bug and this thread, some online data migrations
> track a marker for paging and some don't, but there is no generic marker
> to say an online data migration is 'done' and doesn't need to be run
> again (like the migrations table for the sqlalchemy-migrate based schema
> migrations). Persisting something like that would require more thought
> for the online data migrations.
> 
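
Yeah, agreed it needs more thought. Just to sketch what a persisted
'done' marker might look like -- purely hypothetical and untested,
nova has no such table today -- something in the spirit of the
sqlalchemy-migrate 'migrations' table could work:

    # Hypothetical sketch: a tiny completion-marker table so the
    # runner can skip migrations that have already finished.
    from sqlalchemy import Column, DateTime, MetaData, String, Table

    metadata = MetaData()

    online_migrations_done = Table(
        'online_migrations_done', metadata,
        # Name of the migration function, e.g.
        # 'populate_missing_availability_zones'.
        Column('name', String(255), primary_key=True),
        Column('completed_at', DateTime, nullable=False),
    )

    def mark_done(connection, name, now):
        # Record that a migration reported found=0, i.e. there is
        # nothing left for it to migrate.
        connection.execute(online_migrations_done.insert().values(
            name=name, completed_at=now))

    def is_done(connection, name):
        # Lets the runner skip this migration entirely on later runs.
        row = connection.execute(
            online_migrations_done.select().where(
                online_migrations_done.c.name == name)).first()
        return row is not None

The tricky part is deciding when it's actually safe to write the
marker (can old-style records still appear after we see found=0?),
which is probably where the extra thought is needed.
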
> 3. The other complaint in the bug is that --max-rows defaulting to 50 is
> too small in a large cloud (similar to bug 1742649 for --max-count on
> the map_instances command). I think a compromise here could be something
> like what Dan added for the archive_deleted_rows command:
> 
> https://review.openstack.org/#/c/378718/
> 
> Where you could specify a batch size using --max-rows but also specify
> an --until-complete option to run through all of the online data
> migrations rather than just stopping at --max-rows records migrated. If
> people like that idea, then I think we just want a specless blueprint
> for it like what was used for the change above for archive_deleted_rows.

I think it's a good idea -- we need it for larger deployments. A 
specless blueprint sounds good to me.
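
For reference, what archive_deleted_rows does is basically a batching
loop, and the online data migrations runner could presumably batch the
same way. A rough sketch, not the actual nova code -- here
run_migration_batch stands in for the existing per-batch runner and is
assumed to return the number of rows it migrated:

    def run_online_migrations(run_migration_batch, max_rows=50,
                              until_complete=False):
        total = 0
        while True:
            migrated = run_migration_batch(max_rows)
            total += migrated
            if migrated == 0:
                # Nothing left to migrate anywhere -- done.
                break
            if not until_complete:
                # Current behavior: stop after a single batch.
                break
        return total

That keeps --max-rows as the batch size, so each pass stays small and
transaction-friendly, while --until-complete just keeps looping until
everything is migrated.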

> On 3/31/2019 9:21 PM, Mohammed Naser wrote:
>> Hi there,
>>
>> During upgrades, I've noticed that when running online_data_migrations
>> with "infinite-until-done" mode, it loops over all of the migrations
>> one by one.
>>
>> However, one of the online data migrations
>> (instance_obj.populate_missing_availability_zones) makes a query that
>> takes a really long time because it seems inefficient (and eventually
>> returns 0, since it has already run), which means that as it loops in
>> "blocks" of 50, there's a 2-3 to 8 minute wait in really large
>> environments.
>>
>> The specific query in question is:
>>
>> SELECT count(*) AS count_1
>> FROM (SELECT instance_extra.created_at AS instance_extra_created_at,
>>              instance_extra.updated_at AS instance_extra_updated_at,
>>              instance_extra.deleted_at AS instance_extra_deleted_at,
>>              instance_extra.deleted AS instance_extra_deleted,
>>              instance_extra.id AS instance_extra_id,
>>              instance_extra.instance_uuid AS instance_extra_instance_uuid
>>       FROM instance_extra
>>       WHERE instance_extra.keypairs IS NULL
>>         AND instance_extra.deleted = 0) AS anon_1
>>
>> The explain for the DB query in this example:
>>
>> +----+-------------+----------------+------+---------------+------+---------+------+--------+-------------+
>> | id | select_type | table          | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
>> +----+-------------+----------------+------+---------------+------+---------+------+--------+-------------+
>> |  1 | SIMPLE      | instance_extra | ALL  | NULL          | NULL | NULL    | NULL | 382473 | Using where |
>> +----+-------------+----------------+------+---------------+------+---------+------+--------+-------------+
>>
>> It's possible for it to be even worse; this number is from another
>> very long-running environment.
>>
>> +----+-------------+----------------+------+---------------+------+---------+------+---------+-------------+
>> | id | select_type | table          | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
>> +----+-------------+----------------+------+---------------+------+---------+------+---------+-------------+
>> |  1 | SIMPLE      | instance_extra | ALL  | NULL          | NULL | NULL    | NULL | 3008741 | Using where |
>> +----+-------------+----------------+------+---------------+------+---------+------+---------+-------------+
>>
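
The type=ALL / possible_keys=NULL in both plans means MySQL has no
usable index for that WHERE clause, so the count is a full table scan
every time it runs. If we wanted the query itself to be fast, a prefix
index over (deleted, keypairs) ought to let MySQL answer it from the
index instead. Purely hypothetical and untested, and not something I'm
proposing we actually ship:

    # Hypothetical: index the columns in the WHERE clause. keypairs
    # is a MEDIUMTEXT column, so MySQL needs a prefix length for it.
    from sqlalchemy import Index, MetaData, Table, create_engine

    engine = create_engine('mysql+pymysql://user:pass@localhost/nova')
    metadata = MetaData()
    instance_extra = Table('instance_extra', metadata,
                           autoload=True, autoload_with=engine)

    Index('instance_extra_deleted_keypairs_idx',
          instance_extra.c.deleted, instance_extra.c.keypairs,
          mysql_length={'keypairs': 16}).create(bind=engine)

Adding an index just to speed up a one-time data migration is a pretty
heavy hammer, though, so skipping completed migrations (sketched below)
seems like the better fix.
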
>> I'm not an SQL expert -- could we optimize this?  Alternatively,
>> could we update the online data migrations code to "pop out" any
>> migrations that return 0 from the next iteration, so that it only
>> works on the online_data_migrations that *have* to be done and
>> ignores those it knows are done?
>>
>> Thanks,
>> Mohammed
>>
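
Regarding the "pop out" idea: that seems doable in the runner itself
without any schema change, at least within a single run of the command.
A sketch -- hypothetical, not the actual code in nova/cmd/manage.py,
but each migration callable taking (context, max_count) and returning
(found, done) matches how the online data migrations are invoked:

    def run_until_complete(context, migrations, max_count=50):
        # Keep only migrations that still report work to do, so once
        # a migration returns found=0 its (possibly slow) query is
        # never run again within this invocation.
        remaining = list(migrations)
        while remaining:
            still_active = []
            for migration in remaining:
                found, done = migration(context, max_count)
                if found:
                    # Still has rows to process; keep it in the
                    # rotation for the next pass.
                    still_active.append(migration)
            remaining = still_active

That wouldn't persist anything across runs -- that's the 'done' marker
question from earlier in the thread -- but it would at least avoid
re-running the expensive count query every 50 rows within one
invocation.

-melanie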