[nova] super long online_data_migrations
Hi there,

During upgrades, I've noticed that when running online_data_migrations in "infinite-until-done" mode, it loops over all of the migrations one by one. However, one of the online data migrations (instance_obj.populate_missing_availability_zones) makes a query that takes a really long time because it seems inefficient (and eventually returns 0, since it has already run), which means that as it loops in blocks of 50, there is anywhere from a 2-3 up to an 8 minute wait in really large environments.

The query in question is:

SELECT count(*) AS count_1
FROM (SELECT instance_extra.created_at AS instance_extra_created_at,
             instance_extra.updated_at AS instance_extra_updated_at,
             instance_extra.deleted_at AS instance_extra_deleted_at,
             instance_extra.deleted AS instance_extra_deleted,
             instance_extra.id AS instance_extra_id,
             instance_extra.instance_uuid AS instance_extra_instance_uuid
      FROM instance_extra
      WHERE instance_extra.keypairs IS NULL
        AND instance_extra.deleted = 0) AS anon_1

The EXPLAIN for the DB query in this example:

+------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+
| id   | select_type | table          | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+
|    1 | SIMPLE      | instance_extra | ALL  | NULL          | NULL | NULL    | NULL | 382473 | Using where |
+------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+

It can be even worse; this number is from another very long-running environment:

+------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+
| id   | select_type | table          | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
+------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+
|    1 | SIMPLE      | instance_extra | ALL  | NULL          | NULL | NULL    | NULL | 3008741 | Using where |
+------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+

I'm not an SQL expert -- could we optimize this query? Alternatively, could we update the online data migrations code to "pop out" any migration that returns 0 from the next iteration, so that it only works on the online_data_migrations that *have* to be done and ignores those it knows are done?

Thanks,
Mohammed

--
Mohammed Naser — vexxhost
D. 514-316-8872 | D. 800-910-1726 ext. 200 | E. mnaser@vexxhost.com | W. http://vexxhost.com
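The "pop out" idea in the message above could be sketched roughly as follows. This is a hypothetical illustration, not nova's actual manage code; it assumes each migration is a callable returning a (found, done) pair of row counts, similar in spirit to nova's online migration convention:

```python
def run_migrations_until_done(migrations, batch_size=50):
    """Loop over the migrations in batches, dropping a migration from the
    working list as soon as it reports zero matching rows, so completed
    migrations are not queried again on later passes."""
    remaining = list(migrations)
    while remaining:
        still_pending = []
        for migrate in remaining:
            found, _done = migrate(batch_size)
            if found:
                # Rows matched this pass; keep the migration for the next one.
                still_pending.append(migrate)
        remaining = still_pending
```

The trade-off is that a migration reporting 0 once is assumed to stay done for the rest of the run, which holds within a single invocation but says nothing across invocations.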
On Sun, Mar 31, 2019 at 10:21 PM Mohammed Naser <mnaser@vexxhost.com> wrote:
Hi there,
During upgrades, I've noticed that when running online_data_migrations with "infinite-until-done" mode, it loops over all of the migrations one by one.
However, one of the online data migrations (instance_obj.populate_missing_availability_zones) makes a query that takes a really long time because it seems inefficient (and eventually returns 0, since it has already run), which means that as it loops in blocks of 50, there is anywhere from a 2-3 up to an 8 minute wait in really large environments.
The query in question is:
SELECT count(*) AS count_1
FROM (SELECT instance_extra.created_at AS instance_extra_created_at,
             instance_extra.updated_at AS instance_extra_updated_at,
             instance_extra.deleted_at AS instance_extra_deleted_at,
             instance_extra.deleted AS instance_extra_deleted,
             instance_extra.id AS instance_extra_id,
             instance_extra.instance_uuid AS instance_extra_instance_uuid
      FROM instance_extra
      WHERE instance_extra.keypairs IS NULL
        AND instance_extra.deleted = 0) AS anon_1
The EXPLAIN for the DB query in this example:
+------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+
| id   | select_type | table          | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+
|    1 | SIMPLE      | instance_extra | ALL  | NULL          | NULL | NULL    | NULL | 382473 | Using where |
+------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+
It can be even worse; this number is from another very long-running environment:
+------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+
| id   | select_type | table          | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
+------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+
|    1 | SIMPLE      | instance_extra | ALL  | NULL          | NULL | NULL    | NULL | 3008741 | Using where |
+------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+
I'm not an SQL expert -- could we optimize this query? Alternatively, could we update the online data migrations code to "pop out" any migration that returns 0 from the next iteration, so that it only works on the online_data_migrations that *have* to be done and ignores those it knows are done?
and while we're at it, can we just bump the default rows-per-run to something more than 50 rows? it seems super... small :)
Thanks, Mohammed
Hi Mohammed,

On Mon, Apr 1, 2019 at 4:29 AM Mohammed Naser <mnaser@vexxhost.com> wrote:
On Sun, Mar 31, 2019 at 10:21 PM Mohammed Naser <mnaser@vexxhost.com> wrote:
Hi there,
During upgrades, I've noticed that when running online_data_migrations with "infinite-until-done" mode, it loops over all of the migrations one by one.
However, one of the online data migrations (instance_obj.populate_missing_availability_zones) makes a query that takes a really long time because it seems inefficient (and eventually returns 0, since it has already run), which means that as it loops in blocks of 50, there is anywhere from a 2-3 up to an 8 minute wait in really large environments.
Hmm, all we do in that migration is try to get instance records whose availability_zone is None [1], and if no records are found we just return "all done". That said, I agree that once a migration is done, the next time we loop through all the migrations we still run the query at least once to confirm that it returns zero records for most of the migrations (we don't always use persistent markers to record that a migration completed in a previous run), which means we do run through the whole table.
The query in question is:
SELECT count(*) AS count_1
FROM (SELECT instance_extra.created_at AS instance_extra_created_at,
             instance_extra.updated_at AS instance_extra_updated_at,
             instance_extra.deleted_at AS instance_extra_deleted_at,
             instance_extra.deleted AS instance_extra_deleted,
             instance_extra.id AS instance_extra_id,
             instance_extra.instance_uuid AS instance_extra_instance_uuid
      FROM instance_extra
      WHERE instance_extra.keypairs IS NULL
        AND instance_extra.deleted = 0) AS anon_1
This is the keypair_obj.migrate_keypairs_to_api_db migration that was added in Newton. Since we are just counting, I guess we don't need to pull the whole record (though I'm not sure how much improvement that would yield). I am not an SQL expert myself; maybe jaypipes can help here.
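As a toy illustration of the counting point (using sqlite3 as a stand-in engine; the real table lives in MySQL, and the names below mirror the query in this thread), the count can be issued directly without selecting every column of the record:

```python
import sqlite3

def count_unmigrated(conn):
    # Count the rows still needing migration without materializing
    # every column of each record in a subquery.
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM instance_extra "
        "WHERE keypairs IS NULL AND deleted = 0").fetchone()
    return count

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instance_extra "
             "(instance_uuid TEXT, keypairs TEXT, deleted INTEGER)")
conn.executemany("INSERT INTO instance_extra VALUES (?, ?, ?)",
                 [("u1", None, 0), ("u2", "[]", 0), ("u3", None, 1)])
print(count_unmigrated(conn))  # -> 1: only u1 is unmigrated and not deleted
```

Note that without an index covering the predicate, MySQL still performs a full table scan either way, which is what the EXPLAIN output in this thread shows (type=ALL).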
The explain for the DB query in this example:
+------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | instance_extra | ALL | NULL | NULL | NULL | NULL | 382473 | Using where |
+------+-------------+----------------+------+---------------+------+---------+------+--------+-------------+
It can be even worse; this number is from another very long-running environment:
+------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+
| 1 | SIMPLE | instance_extra | ALL | NULL | NULL | NULL | NULL | 3008741 | Using where |
+------+-------------+----------------+------+---------------+------+---------+------+---------+-------------+
I'm not an SQL expert -- could we optimize this query? Alternatively, could we update the online data migrations code to "pop out" any migration that returns 0 from the next iteration, so that it only works on the online_data_migrations that *have* to be done and ignores those it knows are done?
I don't know of a good way to persistently store the state of finished migrations to ensure they are never executed again (that is, to avoid making the query at all) once they are done. It would also be nice to be able to opt in to specific migrations, especially since these span releases.
and while we're at it, can we just bump the default rows-per-run to something more than 50 rows? it seems super... small :)
I agree the default of 50 is a pretty small batch size, especially for large deployments.

[1] https://github.com/openstack/nova/blob/95a87bce9fa7575c172a7d06344fd3cd070db...

Thanks for bringing this up,
Regards, Surya.
On 04/01/2019 05:30 AM, Surya Seetharaman wrote:
I don't know of a good way to persistently store the state of finished migrations to ensure they are never executed again (that is, to avoid making the query at all) once they are done. It would also be nice to be able to opt in to specific migrations, especially since these span releases.
Well, that's another problem with not doing data migrations inside an Alembic/sqlalchemy-migrate migration script: you have to re-execute the query over and over again to see if a migration is necessary. If the data migrations were in the Alembic/sqlalchemy-migrate script, then the migrate_version DB table could be used to determine whether a migration was done, and none of this would be an issue.

Best,
-jay
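Along the lines Jay describes, a persistent "done" marker could look something like this sketch. The table and helper names are hypothetical, not nova's actual schema, and sqlite3 stands in for the real database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical bookkeeping table, analogous in spirit to migrate_version:
# once a migration's name is recorded here, it is never queried again.
conn.execute("CREATE TABLE online_migration_done (name TEXT PRIMARY KEY)")

def is_done(name):
    row = conn.execute(
        "SELECT 1 FROM online_migration_done WHERE name = ?",
        (name,)).fetchone()
    return row is not None

def mark_done(name):
    # INSERT OR IGNORE keeps the call idempotent if a migration
    # reports completion more than once.
    conn.execute(
        "INSERT OR IGNORE INTO online_migration_done (name) VALUES (?)",
        (name,))
    conn.commit()
```

A runner could then skip any is_done() migration outright instead of re-issuing an expensive count against a multi-million-row table.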
On 03/31/2019 10:21 PM, Mohammed Naser wrote:
Hi there,
During upgrades, I've noticed that when running online_data_migrations with "infinite-until-done" mode, it loops over all of the migrations one by one.
However, one of the online data migrations (instance_obj.populate_missing_availability_zones) makes a query that takes a really long time because it seems inefficient (and eventually returns 0, since it has already run), which means that as it loops in blocks of 50, there is anywhere from a 2-3 up to an 8 minute wait in really large environments.
The query in question is:
SELECT count(*) AS count_1
FROM (SELECT instance_extra.created_at AS instance_extra_created_at,
             instance_extra.updated_at AS instance_extra_updated_at,
             instance_extra.deleted_at AS instance_extra_deleted_at,
             instance_extra.deleted AS instance_extra_deleted,
             instance_extra.id AS instance_extra_id,
             instance_extra.instance_uuid AS instance_extra_instance_uuid
      FROM instance_extra
      WHERE instance_extra.keypairs IS NULL
        AND instance_extra.deleted = 0) AS anon_1
Ugh. :(

The online data migration shouldn't be calling the above SQL statement at all. Instead, the migration should be doing something like this:

SELECT ie.instance_uuid
FROM instance_extra AS ie
WHERE ie.keypairs IS NULL AND ie.deleted = 0
LIMIT 100

and then, while any rows are returned from the above, perform the work of transforming the problematic data in the table for each matched instance_uuid.

I'm actually not sure what the above query has to do with availability zones, but I'll look into it later on this morning. Can you report a bug about this and we'll get on it ASAP?

Best,
-jay
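The batch-then-transform pattern Jay outlines could look roughly like this sketch (sqlite3 as a stand-in database; the UPDATE is a placeholder for the real per-row migration work):

```python
import sqlite3

def migrate_in_batches(conn, batch_size=100):
    """Fetch only rows that still need migrating, a batch at a time,
    instead of counting the whole table up front."""
    total = 0
    while True:
        rows = conn.execute(
            "SELECT instance_uuid FROM instance_extra "
            "WHERE keypairs IS NULL AND deleted = 0 "
            "LIMIT ?", (batch_size,)).fetchall()
        if not rows:
            break  # nothing left to migrate
        for (uuid,) in rows:
            # Placeholder transformation; it must make the row stop
            # matching the WHERE clause, or the loop would never end.
            conn.execute(
                "UPDATE instance_extra SET keypairs = '[]' "
                "WHERE instance_uuid = ?", (uuid,))
        conn.commit()
        total += len(rows)
    return total
```

Because each pass selects only rows still matching the predicate, a completed migration's very first batch comes back empty and the function returns immediately.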
On Mon, Apr 1, 2019 at 8:25 AM Jay Pipes <jaypipes@gmail.com> wrote:
On 03/31/2019 10:21 PM, Mohammed Naser wrote:
Hi there,
During upgrades, I've noticed that when running online_data_migrations with "infinite-until-done" mode, it loops over all of the migrations one by one.
However, one of the online data migrations (instance_obj.populate_missing_availability_zones) makes a query that takes a really long time because it seems inefficient (and eventually returns 0, since it has already run), which means that as it loops in blocks of 50, there is anywhere from a 2-3 up to an 8 minute wait in really large environments.
The query in question is:
SELECT count(*) AS count_1
FROM (SELECT instance_extra.created_at AS instance_extra_created_at,
             instance_extra.updated_at AS instance_extra_updated_at,
             instance_extra.deleted_at AS instance_extra_deleted_at,
             instance_extra.deleted AS instance_extra_deleted,
             instance_extra.id AS instance_extra_id,
             instance_extra.instance_uuid AS instance_extra_instance_uuid
      FROM instance_extra
      WHERE instance_extra.keypairs IS NULL
        AND instance_extra.deleted = 0) AS anon_1
Ugh. :(
The online data migration shouldn't be calling the above SQL statement at all.
Instead, the migration should be doing something like this:
SELECT ie.instance_uuid FROM instance_extra AS ie WHERE ie.keypairs IS NULL AND ie.deleted = 0 LIMIT 100
and then while getting any rows returned from the above, perform the work of transforming the problematic data in the table for each matched instance_uuid.
I'm actually not sure what the above query has to do with availability zones, but I'll look into it later on this morning.
Can you report a bug about this and we'll get on it ASAP?
Thanks for looking into this. I've noticed that some of the newer data migrations actually take the limit into account, but this one doesn't, from the looks of it.

https://bugs.launchpad.net/nova/+bug/1822613

I'd be happy to test it out against a big data set, if you'd like.
Following up on bug https://bugs.launchpad.net/nova/+bug/1822613, there are a few things:

1. https://review.openstack.org/#/c/649648/ drops the problematic migrate_keypairs_to_api_db data migration, since we shouldn't need it anymore.

2. As noted in the bug and this thread, some online data migrations track a marker for paging and some don't, but there is no generic marker to say an online data migration is 'done' and doesn't need to be run again (like the migrations table for the sqlalchemy-migrate based schema migrations). Persisting something like that would require more thought for the online data migrations.

3. The other complaint in the bug is that --max-rows defaulting to 50 is too small in a large cloud (similar to bug 1742649 for --max-count on the map_instances command). I think a compromise here could be something like what Dan added for the archive_deleted_rows command: https://review.openstack.org/#/c/378718/ There you can specify a batch size using --max-rows but also specify an --until-complete option to run through all of the online data migrations rather than just stopping at --max-rows records migrated. If people like that idea, then I think we just want a specless blueprint for it, like what was used for the change above for archive_deleted_rows.
-- Thanks, Matt
On Wed, 3 Apr 2019 11:11:58 -0500, Matt Riedemann <mriedemos@gmail.com> wrote:
Following up on bug https://bugs.launchpad.net/nova/+bug/1822613 there are a few things:
1. https://review.openstack.org/#/c/649648/ drops the problematic migrate_keypairs_to_api_db data migration since we shouldn't need it anymore.
Thanks for doing that, it's now approved.
2. As noted in the bug and this thread, some online data migrations track a marker for paging and some don't, but there is no generic marker to say an online data migration is 'done' and doesn't need to be run again (like the migrations table for the sqlalchemy-migrate based schema migrations). Persisting something like that would require more thought for the online data migrations.
3. The other complaint in the bug is that --max-rows defaulting to 50 is too small in a large cloud (similar to bug 1742649 for --max-count on the map_instances command). I think a compromise here could be something like what Dan added for the archive_deleted_rows command:
https://review.openstack.org/#/c/378718/
Where you could specify a batch size using --max-rows but also specify an --until-complete option to run through all of the online data migrations rather than just stopping at --max-rows records migrated. If people like that idea, then I think we just want a specless blueprint for it, like what was used for the change above for archive_deleted_rows.
I think it's a good idea; we need it for larger deployments. Specless blueprint sounds good to me.

-melanie
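The --max-rows plus --until-complete combination discussed here could be wired up roughly like this (a hypothetical sketch of the option handling; nova-manage's real code differs):

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(prog="online_data_migrations")
    parser.add_argument("--max-rows", type=int, default=50,
                        help="Batch size per pass (default: 50).")
    parser.add_argument("--until-complete", action="store_true",
                        help="Keep looping in --max-rows batches until all "
                             "migrations report zero remaining rows.")
    return parser.parse_args(argv)

def run(migrations, argv=None):
    """Run each migration once per pass; with --until-complete, repeat
    passes until nothing is migrated. Returns the number of passes."""
    args = parse_args(argv)
    passes = 0
    while True:
        passes += 1
        migrated = sum(m(args.max_rows) for m in migrations)
        if migrated == 0 or not args.until_complete:
            break
    return passes
```

This keeps the operator in control of the batch size while removing the need to re-invoke the command by hand until it reports completion.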
participants (5)
- Jay Pipes
- Matt Riedemann
- melanie witt
- Mohammed Naser
- Surya Seetharaman