On 03/31/2019 10:21 PM, Mohammed Naser wrote:
Hi there,
During upgrades, I've noticed that when running online_data_migrations with "infinite-until-done" mode, it loops over all of the migrations one by one.
However, one of the online data migrations (instance_obj.populate_missing_availability_zones) makes a query that takes a really long time as it seems inefficient (which eventually results in 0, cause it already ran), which means as it loops in "blocks" of 50, there's almost a 2-3 to 8 minute wait in really large environments.
The question ends up in specific:
SELECT count(*) AS count_1 FROM (SELECT instance_extra.created_at AS instance_extra_created_at, instance_extra.updated_at AS instance_extra_updated_at, instance_extra.deleted_at AS instance_extra_deleted_at, instance_extra.deleted AS instance_extra_deleted, instance_extra.id AS instance_extra_id, instance_extra.instance_uuid AS instance_extra_instance_uuid FROM instance_extra WHERE instance_extra.keypairs IS NULL AND instance_extra.deleted = 0) AS anon_1
Ugh. :( The online data migration shouldn't be calling the above SQL statement at all. Instead, the migration should be doing something like this: SELECT ie.instance_uuid FROM instance_extra AS ie WHERE ie.keypairs IS NULL AND ie.deletd = 0 LIMIT 100 and then while getting any rows returned from the above, perform the work of transforming the problematic data in the table for each matched instance_uuid. I'm actually not sure what the above query has to do with availability zones, but I'll look into it later on this morning. Can you report a bug about this and we'll get on it ASAP? Best, -jay
The explain for the DB query in this example:
+------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+ | 1 | SIMPLE | instance_extra | ALL | NULL | NULL | NULL | NULL | 382473 | Using where | +------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+
It's possible that it can be ever worse, as this number is from another very-long running environments.
+------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+ | 1 | SIMPLE | instance_extra | ALL | NULL | NULL | NULL | NULL | 3008741 | Using where | +------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+
I'm not the SQL expert, could we not optimize this? Alternatively, could we update the online data migrations code to "pop out" any of the migrations that return 0 for the next iteration, that way it only works on those online_data_migrations that *have* to be done, and ignore those it knows are done?
Thanks, Mohammed