[openstack-dev] [tripleo] Validations before upgrades and updates

Ben Nemec openstack at nemebean.com
Mon May 15 14:57:42 UTC 2017

On 05/08/2017 06:45 AM, Marios Andreou wrote:
> Hi folks, after some discussion locally with colleagues about improving
> the upgrades experience, one of the items that came up was pre-upgrade
> and update validations. I took an action item to look at the current status of
> tripleo-validations [0] and posted a simple WIP [1] intended to be run
> before an undercloud update/upgrade and which just checks service
> status. It was pointed out by shardy that for such checks it is better
> to instead continue to use the per-service manifests where possible
> like [2], for example, where we check status before the N..O major upgrade.
> There may still be some undercloud specific validations that we can land
> into the tripleo-validations repo (thinking about things like the
> neutron networks/ports, validating the current nova nodes state etc?).
> So do folks have any thoughts about this subject, for example the kinds
> of things we should be checking? Steve said he had some reviews in
> progress for collecting the overcloud ansible puppet/docker config into
> an ansible playbook that the operator can invoke for upgrade of the
> 'manual' nodes (for example compute in the N..O workflow). The point
> is that we can add more per-service ansible validation tasks into the
> service manifests for execution when the play is run by the operator,
> but I'll let Steve point at and talk about those.

We had a similar discussion regarding controller node replacement 
because starting that process with the overcloud in an inconsistent 
state tends to end badly.  Unfortunately those docs are only available 
downstream at this time, but the basics were:

- Verify that the stack is in a *_COMPLETE state (this may seem obvious,
  but we've had people try to do these major processes while the stack is
  in a broken state).
- Verify undercloud disk space.  For node replacement we recommended a
  minimum of 10 GB free.
- Verify that all pacemaker services are up.
- Check the Galera and RabbitMQ clusters and verify all nodes are up.
- For node replacement we also disabled stonith.  That might be a good
  idea during upgrades as well in case some services take a while to come
  back up.  You really don't want a node getting killed during the process.
- Run general undercloud service checks (nova, ironic, etc.).
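
To make the first two items concrete, here is a rough sketch of how they might look as standalone Python checks. This is only an illustration, not code from tripleo-validations: the function names, the default path, and the choice to treat 10 GB as 10 GiB are all assumptions, and the stack status string is expected to be supplied by the caller (for example, copied from the output of "openstack stack list").

```python
import shutil

GIB = 1024 ** 3  # assumption: the 10 GB recommendation is treated as 10 GiB


def stack_is_complete(stack_status):
    """Return True if a Heat stack status such as 'UPDATE_COMPLETE' ends in _COMPLETE."""
    return stack_status.strip().upper().endswith("_COMPLETE")


def has_free_space(path="/", min_free_gib=10):
    """Return True if the filesystem holding `path` has at least min_free_gib free."""
    return shutil.disk_usage(path).free >= min_free_gib * GIB


def run_prechecks(stack_status, path="/", min_free_gib=10):
    """Run both checks and return a list of human-readable failure messages."""
    failures = []
    if not stack_is_complete(stack_status):
        failures.append("stack status is %s, expected *_COMPLETE" % stack_status)
    if not has_free_space(path, min_free_gib):
        failures.append("less than %d GiB free on %s" % (min_free_gib, path))
    return failures
```

An empty list from run_prechecks() would mean both checks passed.  The pacemaker, Galera, RabbitMQ and stonith items are left out on purpose, since they depend on cluster tooling whose output I would not want to guess at here.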
