[nova][ptg] drop the shadow table concept
CERN reported two issues with the archive_deleted_rows CLI:

- When one record gets inserted into shadow_instance_extra but doesn't get deleted from instance_extra (I know this is in a single transaction but sometimes it happens), manual cleanup in the database is needed (a cleanup sketch follows below).
- Also there could be two cells running this command at the same time, fighting for the API db lock.

TODOs:

- tssurya to report bugs / improvements on the archive_deleted_rows CLI based on CERN's experience with long table locking
- mnaser to report a wishlist bug / specless bp about a one-step db purge CLI which would skip the shadow tables
Cheers, gibi
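For readers hitting the first issue, a minimal cleanup sketch (this is not nova code; it assumes nova's convention that each shadow table mirrors its source table's id column, and the connection URL is a placeholder): find rows that were copied into shadow_instance_extra but whose originals were never deleted from instance_extra, then drop the stale shadow copies so a later archive run can redo them cleanly.

    # Sketch only: remove shadow_instance_extra rows whose originals are
    # still present in instance_extra (the half-finished archive case).
    # Run this against the affected cell database.
    from sqlalchemy import create_engine, text

    # Placeholder URL -- point at your own cell database.
    engine = create_engine("mysql+pymysql://nova:secret@localhost/nova")

    with engine.begin() as conn:  # one transaction, like the archive itself
        stale = conn.execute(text(
            "SELECT s.id FROM shadow_instance_extra s "
            "JOIN instance_extra i ON i.id = s.id"
        )).fetchall()
        for (row_id,) in stale:
            conn.execute(
                text("DELETE FROM shadow_instance_extra WHERE id = :id"),
                {"id": row_id},
            )
        print("removed %d stale shadow rows" % len(stale))

Deleting the shadow copy rather than the live row is the safe direction here: the original is still soft-deleted in instance_extra, so re-running archive_deleted_rows afterwards moves it properly.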
On 11/10/2019 10:29 AM, Balázs Gibizer wrote:
> - Also there could be two cells running this command at the same time, fighting for the API db lock.
In Train the --all-cells option was added to the CLI so that should resolve this issue. I think Mel said she backported those changes internally so I'm not sure how hard it would be for those to go back to Stein or Rocky or whatever release CERN is using now.
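For reference, the Train invocation is something like "nova-manage db archive_deleted_rows --all-cells --until-complete" (--until-complete keeps archiving until nothing is left), run once from a node with API database access instead of racing separate per-cell runs against each other.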
On 11/10/19 12:41, Matt Riedemann wrote:
> On 11/10/2019 10:29 AM, Balázs Gibizer wrote:
>> - Also there could be two cells running this command at the same time, fighting for the API db lock.
> In Train the --all-cells option was added to the CLI so that should resolve this issue. I think Mel said she backported those changes internally so I'm not sure how hard it would be for those to go back to Stein or Rocky or whatever release CERN is using now.
That's correct, I backported --all-cells [1][2][3][4] to Stein, Rocky, and Queens downstream. I found it not to be easy but YMMV.
The primary conflicts in Stein were with --before, so I went ahead and brought those patches back as well [5][6][7] since we also needed --before to help people avoid the "orphaned virt guests if archive runs while nova-compute is down" problem.
Same deal for Rocky.
And finally with Queens, there's an additional conflict around deleting instance group members [8], so I also brought that back because it's related to all of the database cleanup issues that support has repeatedly faced with customers.
Hope this helps anyone considering backporting --all-cells.
Cheers, -melanie
[1] https://review.opendev.org/675218
[2] https://review.opendev.org/675209
[3] https://review.opendev.org/675205
[4] https://review.opendev.org/507486
[5] https://review.opendev.org/661289
[6] https://review.opendev.org/556751
[7] https://review.opendev.org/643779
[8] https://review.opendev.org/598953
On 11/11/19 08:50, melanie witt wrote:
> On 11/10/19 12:41, Matt Riedemann wrote:
>> On 11/10/2019 10:29 AM, Balázs Gibizer wrote:
>>> - Also there could be two cells running this command at the same time, fighting for the API db lock.
>> In Train the --all-cells option was added to the CLI so that should resolve this issue. I think Mel said she backported those changes internally so I'm not sure how hard it would be for those to go back to Stein or Rocky or whatever release CERN is using now.
> That's correct, I backported --all-cells [1][2][3][4] to Stein, Rocky, and Queens downstream. I found it not to be easy but YMMV.
> The primary conflicts in Stein were with --before, so I went ahead and brought those patches back as well [5][6][7] since we also needed --before to help people avoid the "orphaned virt guests if archive runs while nova-compute is down" problem.
> Same deal for Rocky.
> And finally with Queens, there's an additional conflict around deleting instance group members [8], so I also brought that back because it's related to all of the database cleanup issues that support has repeatedly faced with customers.
Sorry, I have to be pedantic and amend the info about Queens ^ to add that --purge [9][10][11] was another conflict in Queens that I also backported, because we had a separate request open from support for that as well.
> Hope this helps anyone considering backporting --all-cells.
> Cheers, -melanie
> [1] https://review.opendev.org/675218
> [2] https://review.opendev.org/675209
> [3] https://review.opendev.org/675205
> [4] https://review.opendev.org/507486
> [5] https://review.opendev.org/661289
> [6] https://review.opendev.org/556751
> [7] https://review.opendev.org/643779
> [8] https://review.opendev.org/598953
[9] https://review.opendev.org/550171
[10] https://review.opendev.org/550182
[11] https://review.opendev.org/550502
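If I recall the flags correctly, the --purge discussed here is the archive_deleted_rows option that deletes the just-archived rows from the shadow tables in the same run (e.g. something like "nova-manage db archive_deleted_rows --before <date> --purge"), and the same series also added a standalone "nova-manage db purge" command for emptying the shadow tables on their own.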
On 11/10/2019 10:29 AM, Balázs Gibizer wrote:
> - When one record gets inserted into shadow_instance_extra but doesn't get deleted from instance_extra (I know this is in a single transaction but sometimes it happens), manual cleanup in the database is needed.
Is this potentially caused by the issue attempting to be fixed here?
https://review.opendev.org/#/c/412771/
On Mon, Nov 11, 2019 at 12:33 AM Balázs Gibizer balazs.gibizer@est.tech wrote:
> CERN reported two issues with the archive_deleted_rows CLI:
>
> - When one record gets inserted into shadow_instance_extra but doesn't get deleted from instance_extra (I know this is in a single transaction but sometimes it happens), manual cleanup in the database is needed.
> - Also there could be two cells running this command at the same time, fighting for the API db lock.
>
> TODOs:
>
> - tssurya to report bugs / improvements on the archive_deleted_rows CLI based on CERN's experience with long table locking
> - mnaser to report a wishlist bug / specless bp about a one-step db purge CLI which would skip the shadow tables
I did my homework:
https://bugs.launchpad.net/nova/+bug/1852121
I don't think I have time to iterate and work on it right now, but at least it's documented.
> Cheers, gibi
On 11/11/2019 1:05 PM, Mohammed Naser wrote:
> I did my homework:
> https://bugs.launchpad.net/nova/+bug/1852121
> I don't think I have time to iterate and work on it right now, but at least it's documented.
I commented in the bug and, without more details, I don't see how it's really worth the trouble of refactoring the archive/purge code to deal with this optimization, but I can probably be proven wrong.
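For anyone weighing that trade-off, a minimal sketch of what the wishlist's one-step purge amounts to (again not nova code; the table list, the deleted/deleted_at soft-delete columns, and the URL are assumptions based on nova's schema conventions): delete soft-deleted rows directly instead of archiving them into shadow tables first.

    # Sketch only: one-step purge that skips the shadow tables entirely,
    # deleting rows soft-deleted before a cutoff straight from the live
    # tables. Child tables go first so foreign keys are respected; a real
    # implementation would enumerate every soft-deletable table.
    import datetime

    from sqlalchemy import create_engine, text

    engine = create_engine("mysql+pymysql://nova:secret@localhost/nova")  # placeholder
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=90)

    for table in ("instance_extra", "instance_faults", "instances"):
        with engine.begin() as conn:
            result = conn.execute(
                # Table names cannot be bound parameters; the f-string is
                # safe here because the names come from the fixed list above.
                text(f"DELETE FROM {table} WHERE deleted != 0 "
                     "AND deleted_at < :cutoff"),
                {"cutoff": cutoff},
            )
            print(f"purged {result.rowcount} rows from {table}")

Batching the deletes (the CLI's --max_rows idea) would matter at CERN's scale to keep lock times short, but this is just the shape of the proposal.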
participants (4)
- Balázs Gibizer
- Matt Riedemann
- melanie witt
- Mohammed Naser