[goals][upgrade-checkers] Retrospective
Now that Stein is GA I figured I'd write up a quick retrospective for the upgrade-checkers goal [1].

Statistics
----------

From the tasks in the story [2]:

- There were 52 total tasks.
- Merged: 40
- Review: 4 (still open)
- Invalid: 8* (not applicable or opted out)

Excluding Invalid tasks, the overall completion percentage of the goal was 91%, which is pretty good. Granted, the majority of projects just added the placeholder framework for the upgrade check command, which is not very useful for operators, but it's a start for those projects to build in real checks later.

*5 of the Invalid tasks were for deployment projects which were not a target for the goal [3]. The others were for swift, telemetry and panko.

oslo library
------------

To ease development, Ben Nemec extracted the framework of the nova check code into the oslo.upgradecheck library [4]. This made getting the framework in place for all projects much easier and ensures a high level of consistency in the command structure, which is really important for this kind of tooling. Thanks again Ben!

Not universal
-------------

As mentioned above, not all projects implemented the command, either because it may not have made sense for them (swift, mistral and adjutant are good examples) or because they just aren't in a place development-wise to put much priority on the goal (telemetry and panko).

Extension points
----------------

Neutron created an extension point for their upgrade checks to support stadium projects and have already been using it. This was not something I thought about when we started the goal so it was cool that the neutron team embraced this and put their own spin on it.

Contributions from NEC developers
---------------------------------

Akhil Jain and Rajat Dhasmana from NEC volunteered to add the framework of the command in most projects which was extremely helpful. Thank you Akhil and Rajat, and NEC for providing development resources to work on a cross-community goal like this.

Next
----

While most projects have the framework in place for upgrade checks, a lot of projects released Stein with a simple placeholder check. A few projects have started adding real checks to aid in upgrades to Stein, which is great. I'm hopeful that other projects will add real checks as they encounter upgrade issues during Train development and need a way to signal those changes to operators in an automated way. Remember: if you're adding an upgrade release note, consider whether you can automate that note with an upgrade check.

Another next step is to get deployment projects to build in a call to each project's upgrade check command because without deployment projects using the tool it doesn't serve much purpose. Let me note that this does not need to be driven by the deployment projects either - developers from each project can integrate their check into a deployment project of choice, like I did for OpenStack-Ansible [5]. I'm sure the deployment project dev teams would appreciate the service project teams driving this work.

[1] https://governance.openstack.org/tc/goals/stein/upgrade-checkers.html
[2] https://storyboard.openstack.org/#!/story/2003657
[3] See the note: https://governance.openstack.org/tc/goals/stein/upgrade-checkers.html#comple...
[4] https://docs.openstack.org/oslo.upgradecheck/latest/
[5] https://review.openstack.org/#/c/575125/

--

Thanks,

Matt
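For readers unfamiliar with the library mentioned in the "oslo library" section, the following is a minimal sketch of how a project wires its checks into oslo.upgradecheck, modeled on the pattern in the library's documentation; the 'myservice' project name and the placeholder check are illustrative only, not taken from any particular project.

    import sys

    from oslo_config import cfg
    from oslo_upgradecheck import upgradecheck


    class Checks(upgradecheck.UpgradeCommands):
        """Upgrade checks for a hypothetical 'myservice' project."""

        def _check_placeholder(self):
            # A real check would inspect configuration or the database;
            # the placeholder simply reports success.
            return upgradecheck.Result(upgradecheck.Code.SUCCESS)

        # (display name, check method) pairs run by "upgrade check".
        _upgrade_checks = (
            ('Placeholder', _check_placeholder),
        )


    def main():
        # Exposes the checks as a "myservice-status upgrade check" style CLI.
        return upgradecheck.main(
            cfg.CONF, project='myservice', upgrade_command=Checks())


    if __name__ == '__main__':
        sys.exit(main())

The command exits 0 when all checks pass, 1 when any check warns and 2 when any check fails, which is the return code convention the deployment tool discussion later in the thread relies on.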
Matt Riedemann wrote:
Now that Stein is GA I figured I'd write up a quick retrospective for the upgrade-checkers goal [1]. [...]
Thanks Matt for driving this, and thanks to Akhil, Rajat, Ben and others for working on it! Hoping to see that framework more widely used in the future to help operators proactively address OpenStack upgrade issues.

--
Thierry Carrez (ttx)
On Mon, Apr 15, 2019, at 23:04, Matt Riedemann wrote:
Now that Stein is GA I figured I'd write up a quick retrospective for the upgrade-checkers goal [1]. (snipped)
Another next step is to get deployment projects to build in a call to each project's upgrade check command because without deployment projects using the tool it doesn't serve much purpose. Let me note that this does not need to be driven by the deployment projects either - developers from each project can integrate their check into a deployment project of choice, like I did for OpenStack-Ansible [5]. I'm sure the deployment project dev teams would appreciate the service project teams driving this work.
Thanks Matt for leading this work, and for the work to bring the upgrade checks into OSA. It was indeed very much appreciated, given our limited resources.

Regards,
Jean-Philippe Evrard (evrardjp)
On Mon, 15 Apr 2019 at 22:01, Matt Riedemann <mriedemos@gmail.com> wrote:
Now that Stein is GA I figured I'd write up a quick retrospective for the upgrade-checkers goal [1].
Statistics
----------
From the tasks in the story [2]:
- There were 52 total tasks.
- Merged: 40
- Review: 4 (still open)
- Invalid: 8* (not applicable or opted out)
Excluding Invalid tasks, the overall completion percentage of the goal was 91%, which is pretty good. Granted, the majority of projects just added the placeholder framework for the upgrade check command, which is not very useful for operators, but it's a start for those projects to build in real checks later.
*5 of the Invalid tasks were for deployment projects which were not a target for the goal [3]. The others were for swift, telemetry and panko.
oslo library
------------
To ease development, Ben Nemec extracted the framework of the nova check code into the oslo.upgradecheck library [4]. This made getting the framework in place for all projects much easier and ensures a high level of consistency in the command structure, which is really important for this kind of tooling. Thanks again Ben!
Not universal
-------------
As mentioned above, not all projects implemented the command, either because it may not have made sense for them (swift, mistral and adjutant are good examples) or because they just aren't in a place development-wise to put much priority on the goal (telemetry and panko).
Extension points
----------------
Neutron created an extension point for their upgrade checks to support stadium projects and have already been using it. This was not something I thought about when we started the goal so it was cool that the neutron team embraced this and put their own spin on it.
Contributions from NEC developers
---------------------------------
Akhil Jain and Rajat Dhasmana from NEC volunteered to add the framework of the command in most projects which was extremely helpful. Thank you Akhil and Rajat, and NEC for providing development resources to work on a cross-community goal like this.
Next
----
While most projects have the framework in place for upgrade checks, a lot of projects released Stein with a simple placeholder check. A few projects have started adding real checks to aid in upgrades to Stein, which is great. I'm hopeful that other projects will add real checks as they encounter upgrade issues during Train development and need a way to signal those changes to operators in an automated way. Remember: if you're adding an upgrade release note, consider whether you can automate that note with an upgrade check.
Another next step is to get deployment projects to build in a call to each project's upgrade check command because without deployment projects using the tool it doesn't serve much purpose. Let me note that this does not need to be driven by the deployment projects either - developers from each project can integrate their check into a deployment project of choice, like I did for OpenStack-Ansible [5]. I'm sure the deployment project dev teams would appreciate the service project teams driving this work.
I put together a patch for kolla-ansible with support for upgrade checks for some projects: https://review.opendev.org/644528. It's on the backburner at the moment but I plan to return to it during the Train cycle. Perhaps you could clarify a few things about expected usage.

1. Should the tool be run using the new code? I would assume so.

2. How would you expect this to be run with multiple projects? I was thinking of adding a new command that performs upgrade checks for all projects that would be read-only, then also performing the check again as part of the upgrade procedure.

3. For the warnings, would you recommend a -Werror style argument that optionally flags up warnings as errors? Reporting non-fatal errors is quite difficult in Ansible.
[1] https://governance.openstack.org/tc/goals/stein/upgrade-checkers.html
[2] https://storyboard.openstack.org/#!/story/2003657
[3] See the note: https://governance.openstack.org/tc/goals/stein/upgrade-checkers.html#comple...
[4] https://docs.openstack.org/oslo.upgradecheck/latest/
[5] https://review.openstack.org/#/c/575125/
--
Thanks,
Matt
On 4/24/2019 8:21 AM, Mark Goddard wrote:
I put together a patch for kolla-ansible with support for upgrade checks for some projects: https://review.opendev.org/644528. It's on the backburner at the moment but I plan to return to it during the Train cycle. Perhaps you could clarify a few things about expected usage.
Cool. I'd probably try to pick one service (nova?) to start with before trying to bite off all of these in a single change (that review is kind of daunting). Also, as part of the community wide goal I wrote up reference docs in the nova tree [1] which might answer your questions with links for more details.
1. Should the tool be run using the new code? I would assume so.
Depends on what you mean by "new code". When nova introduced this in Ocata it was meant to be run in a venv or container after upgrading the newton schema and data migrations to ocata, but before restarting the services with the ocata code and that's how grenade uses it. But the checks should also be idempotent and can be run as a post-install/upgrade verify step, which is how OSA uses it (and is described in the nova install docs [2]).
2. How would you expect this to be run with multiple projects? I was thinking of adding a new command that performs upgrade checks for all projects that would be read-only, then also performing the check again as part of the upgrade procedure.
Hmm, good question. This probably depends on each deployment tool and how they roll through services to do the upgrade. Obviously you'd want to run each project's checks as part of upgrading that service, but I guess you're looking for some kind of "should we even start this whole damn upgrade if we can detect early that there are going to be issues?". If the early run is read-only though - and I'm assuming by read-only you mean they won't cause a failure - how are you going to expose that there is a problem without failing? Would you make that configurable? Otherwise the checks themselves are supposed to be read-only and not change your data (they aren't the same thing as an online data migration routine for example).
3. For the warnings, would you recommend a -Werror style argument that optionally flags up warnings as errors? Reporting non-fatal errors is quite difficult in Ansible.
OSA fails on any return codes that aren't 0 (success) or 1 (warning). It's hard to say when a warning should be considered an error, really. When writing these checks I think of a warning as a case where you might be OK but we don't really know for sure, so it can aid in debugging upgrade-related issues after the fact but might not necessarily mean you shouldn't upgrade.

mnaser has brought up the idea in the past of making the output more machine readable so tooling could pick and choose which things it considers to be a failure (assuming the return code was 1). That's an interesting idea but one I haven't put a lot of thought into. It might be as simple as outputting a unique code per check per project, sort of like the error code concept in the API guidelines [3] which the placement project is using [4].

[1] https://docs.openstack.org/nova/latest/reference/upgrade-checks.html
[2] https://docs.openstack.org/nova/latest/install/verify.html
[3] https://specs.openstack.org/openstack/api-wg/guidelines/errors.html
[4] https://opendev.org/openstack/placement/src/branch/master/placement/errors.p...

--

Thanks,

Matt
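As a rough sketch of how a deployment tool could act on those return codes, including a -Werror style switch like the one Mark asked about, something along these lines would work; the nova-status command is just an example target and the wrapper itself is hypothetical, not part of any deployment project.

    import argparse
    import subprocess
    import sys

    # Return code convention of the oslo.upgradecheck-based commands:
    # 0 = all checks passed, 1 = at least one warning, 2 = at least one failure.
    SUCCESS, WARNING, FAILURE = 0, 1, 2


    def run_upgrade_check(command, warnings_as_errors=False):
        """Run one project's upgrade check and decide whether to proceed."""
        proc = subprocess.run(command)
        if proc.returncode == SUCCESS:
            return True
        if proc.returncode == WARNING:
            print('%s reported warnings' % command[0], file=sys.stderr)
            # The -Werror style behavior: warnings only block the upgrade
            # when the operator explicitly asks for it.
            return not warnings_as_errors
        print('%s reported failures' % command[0], file=sys.stderr)
        return False


    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--warnings-as-errors', action='store_true')
        args = parser.parse_args()
        ok = run_upgrade_check(['nova-status', 'upgrade', 'check'],
                               warnings_as_errors=args.warnings_as_errors)
        sys.exit(0 if ok else 1)

A deployment tool would run one such check per service as part of its upgrade flow, treating anything other than 0 or 1 as fatal, which is essentially what OSA does today.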
On Thu, 25 Apr 2019 at 23:50, Matt Riedemann <mriedemos@gmail.com> wrote:
On 4/24/2019 8:21 AM, Mark Goddard wrote:
I put together a patch for kolla-ansible with support for upgrade checks for some projects: https://review.opendev.org/644528. It's on the backburner at the moment but I plan to return to it during the Train cycle. Perhaps you could clarify a few things about expected usage.
Cool. I'd probably try to pick one service (nova?) to start with before trying to bite off all of these in a single change (that review is kind of daunting).
Also, as part of the community wide goal I wrote up reference docs in the nova tree [1] which might answer your questions with links for more details.
1. Should the tool be run using the new code? I would assume so.
Depends on what you mean by "new code". When nova introduced this in Ocata it was meant to be run in a venv or container after upgrading the newton schema and data migrations to ocata, but before restarting the services with the ocata code and that's how grenade uses it. But the checks should also be idempotent and can be run as a post-install/upgrade verify step, which is how OSA uses it (and is described in the nova install docs [2]).
In kolla land, I mean should I use the container image for the current release or the target release to execute the nova-status command. It sounds like it's the latter, which also implies we're using the target version of kolla/kolla-ansible. I hadn't twigged that we'd need to perform the schema upgrade and online migrations.
2. How would you expect this to be run with multiple projects? I was thinking of adding a new command that performs upgrade checks for all projects that would be read-only, then also performing the check again as part of the upgrade procedure.
Hmm, good question. This probably depends on each deployment tool and how they roll through services to do the upgrade. Obviously you'd want to run each project's checks as part of upgrading that service, but I guess you're looking for some kind of "should we even start this whole damn upgrade if we can detect early that there are going to be issues?". If the early run is read-only though - and I'm assuming by read-only you mean they won't cause a failure - how are you going to expose that there is a problem without failing? Would you make that configurable? Otherwise the checks themselves are supposed to be read-only and not change your data (they aren't the same thing as an online data migration routine for example).
If we need to have run the schema upgrade and migrations before the upgrade check, I think that reduces the usefulness of a separate check operation. I was thinking you might be able to run the checks against the system prior to making any upgrade changes, but it seems not. I guess a separate check after the upgrade might still be useful for diagnosing upgrade issues from warnings.
3. For the warnings, would you recommend a -Werror style argument that optionally flags up warnings as errors? Reporting non-fatal errors is quite difficult in Ansible.
OSA fails on any return codes that aren't 0 (success) or 1 (warning). It's hard to say when a warning should be considered an error, really. When writing these checks I think of a warning as a case where you might be OK but we don't really know for sure, so it can aid in debugging upgrade-related issues after the fact but might not necessarily mean you shouldn't upgrade.

mnaser has brought up the idea in the past of making the output more machine readable so tooling could pick and choose which things it considers to be a failure (assuming the return code was 1). That's an interesting idea but one I haven't put a lot of thought into. It might be as simple as outputting a unique code per check per project, sort of like the error code concept in the API guidelines [3] which the placement project is using [4].
Machine readable would be nice. Perhaps there's something we could do to generate a report of the combined results.
[1] https://docs.openstack.org/nova/latest/reference/upgrade-checks.html
[2] https://docs.openstack.org/nova/latest/install/verify.html
[3] https://specs.openstack.org/openstack/api-wg/guidelines/errors.html
[4] https://opendev.org/openstack/placement/src/branch/master/placement/errors.p...
--
Thanks,
Matt
On 4/26/2019 3:59 AM, Mark Goddard wrote:
I was thinking you might be able to run the checks against the system prior to making any upgrade changes, but it seems not.
You can, and it might produce a warning or error for some checks saying to run online data migrations, but not all checks are related to or remedied with online data migrations. You can look at the History of Checks for nova [1]. The cells v2 ones in Ocata were very much about the need to create cell mappings and such - something we (nova) don't do for you. There are other checks for the minimum API version on a dependent external service (like placement, and I have one for cinder here [2]).

[1] https://docs.openstack.org/nova/latest/cli/nova-status.html#upgrade
[2] https://review.opendev.org/#/c/649759/

--

Thanks,

Matt
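To make the difference between a placeholder check and a "real" check concrete, here is a rough sketch of a check in the style described above, warning about pending online data migrations; count_unmigrated_records() is a stand-in for a project-specific database query, not a helper that exists in any project.

    from oslo_upgradecheck import upgradecheck


    def count_unmigrated_records():
        # Stand-in for a project-specific query counting rows that still
        # need an online data migration; hard-coded so the sketch runs.
        return 0


    class Checks(upgradecheck.UpgradeCommands):
        def _check_online_migrations(self):
            remaining = count_unmigrated_records()
            if remaining:
                return upgradecheck.Result(
                    upgradecheck.Code.WARNING,
                    '%d records still need to be migrated; run the online '
                    'data migrations command before upgrading.' % remaining)
            return upgradecheck.Result(upgradecheck.Code.SUCCESS)

        _upgrade_checks = (
            ('Online data migrations', _check_online_migrations),
        )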
On 4/26/19 3:59 AM, Mark Goddard wrote:
3. For the warnings, would you recommend a -Werror style argument that optionally flags up warnings as errors? Reporting non-fatal errors is quite difficult in Ansible.
OSA fails on any return codes that aren't 0 (success) or 1 (warning). It's hard to say when a warning should be considered an error, really. When writing these checks I think of a warning as a case where you might be OK but we don't really know for sure, so it can aid in debugging upgrade-related issues after the fact but might not necessarily mean you shouldn't upgrade.

mnaser has brought up the idea in the past of making the output more machine readable so tooling could pick and choose which things it considers to be a failure (assuming the return code was 1). That's an interesting idea but one I haven't put a lot of thought into. It might be as simple as outputting a unique code per check per project, sort of like the error code concept in the API guidelines [3] which the placement project is using [4].
Machine readable would be nice. Perhaps there's something we could do to generate a report of the combined results.
Note that there's a TODO [0] in the oslo.upgradecheck code to switch to cliff for the output. That would allow us to easily output in machine-readable formats.

0: https://github.com/openstack/oslo.upgradecheck/blob/master/oslo_upgradecheck...
On Fri, Apr 26, 2019 at 5:03 AM Mark Goddard <mark@stackhpc.com> wrote:
On Thu, 25 Apr 2019 at 23:50, Matt Riedemann <mriedemos@gmail.com> wrote:
On 4/24/2019 8:21 AM, Mark Goddard wrote:
I put together a patch for kolla-ansible with support for upgrade checks for some projects: https://review.opendev.org/644528. It's on the backburner at the moment but I plan to return to it during the Train cycle. Perhaps you could clarify a few things about expected usage.
Cool. I'd probably try to pick one service (nova?) to start with before trying to bite off all of these in a single change (that review is kind of daunting).
Also, as part of the community wide goal I wrote up reference docs in the nova tree [1] which might answer your questions with links for more details.
1. Should the tool be run using the new code? I would assume so.
Depends on what you mean by "new code". When nova introduced this in Ocata it was meant to be run in a venv or container after upgrading the newton schema and data migrations to ocata, but before restarting the services with the ocata code and that's how grenade uses it. But the checks should also be idempotent and can be run as a post-install/upgrade verify step, which is how OSA uses it (and is described in the nova install docs [2]).
In kolla land, I mean should I use the container image for the current release or the target release to execute the nova-status command. It sounds like it's the latter, which also implies we're using the target version of kolla/kolla-ansible. I hadn't twigged that we'd need to perform the schema upgrade and online migrations.
2. How would you expect this to be run with multiple projects? I was thinking of adding a new command that performs upgrade checks for all projects that would be read-only, then also performing the check again as part of the upgrade procedure.
Hmm, good question. This probably depends on each deployment tool and how they roll through services to do the upgrade. Obviously you'd want to run each project's checks as part of upgrading that service, but I guess you're looking for some kind of "should we even start this whole damn upgrade if we can detect early that there are going to be issues?". If the early run is read-only though - and I'm assuming by read-only you mean they won't cause a failure - how are you going to expose that there is a problem without failing? Would you make that configurable? Otherwise the checks themselves are supposed to be read-only and not change your data (they aren't the same thing as an online data migration routine for example).
If we need to have run the schema upgrade and migrations before the upgrade check, I think that reduces the usefulness of a separate check operation. I was thinking you might be able to run the checks against the system prior to making any upgrade changes, but it seems not. I guess a separate check after the upgrade might still be useful for diagnosing upgrade issues from warnings.
3. For the warnings, would you recommend a -Werror style argument that optionally flags up warnings as errors? Reporting non-fatal errors is quite difficult in Ansible.
OSA fails on any return codes that aren't 0 (success) or 1 (warning). It's hard to say when a warning should be considered an error, really. When writing these checks I think of a warning as a case where you might be OK but we don't really know for sure, so it can aid in debugging upgrade-related issues after the fact but might not necessarily mean you shouldn't upgrade.

mnaser has brought up the idea in the past of making the output more machine readable so tooling could pick and choose which things it considers to be a failure (assuming the return code was 1). That's an interesting idea but one I haven't put a lot of thought into. It might be as simple as outputting a unique code per check per project, sort of like the error code concept in the API guidelines [3] which the placement project is using [4].
Machine readable would be nice. Perhaps there's something we could do to generate a report of the combined results.
Interesting you bring that up - I made an attempt a while back but didn't have the resources to drive it through: https://review.opendev.org/#/c/576944/
[1] https://docs.openstack.org/nova/latest/reference/upgrade-checks.html
[2] https://docs.openstack.org/nova/latest/install/verify.html
[3] https://specs.openstack.org/openstack/api-wg/guidelines/errors.html
[4] https://opendev.org/openstack/placement/src/branch/master/placement/errors.p...
--
Thanks,
Matt
--
Mohammed Naser — vexxhost
-----------------------------------------------------
D. 514-316-8872
D. 800-910-1726 ext. 200
E. mnaser@vexxhost.com
W. http://vexxhost.com
participants (6)

- Ben Nemec
- Jean-Philippe Evrard
- Mark Goddard
- Matt Riedemann
- Mohammed Naser
- Thierry Carrez