[openstack-dev] [Nova] State machines in Nova
Murray, Paul (HP Cloud)
pmurray at hpe.com
Thu Jun 2 13:08:44 UTC 2016
> -----Original Message-----
> From: Monty Taylor [mailto:mordred at inaugust.com]
> Sent: 01 June 2016 13:54
> To: openstack-dev at lists.openstack.org
> Subject: Re: [openstack-dev] [Nova] State machines in Nova
> On 06/01/2016 03:50 PM, Andrew Laski wrote:
> > On Wed, Jun 1, 2016, at 05:51 AM, Miles Gould wrote:
> >> On 31/05/16 21:03, Timofei Durakov wrote:
> >>> there is blueprint that was approved during Liberty and
> >>> resubmitted to Newton(with spec).
> >>> The idea is to define state machines for operations as
> >>> live-migration, resize, etc. and to deal with them operation states.
> >> +1 to introducing an explicit state machine - IME they make complex
> >> logic much easier to reason about. However, think carefully about how
> >> you'll make changes to that state machine later. In Ironic, this is
> >> an ongoing problem: every time we change the state machine, we have
> >> to decide whether to lie to older clients (and if so, what lie to
> >> tell them), or whether to present them with the truth (and if so, how
> >> badly they'll break). AIUI this would be a much smaller problem if
> >> we'd considered this possibility carefully at the beginning.
> > This is a great point. I think most people have an implicit assumption
> > that the state machine will be exposed to end users via the API. I
> > would like to avoid that for exactly the reason you've mentioned. Of
> > course we'll want to expose something to users but whatever that is
> > should be loosely coupled with the internal states that actually drive the
I think this raises an interesting point.
tl;dr: I am starting to think we should not do the migration state machine spec being proposed before the tasks. But we should at least make the states we assign something other than arbitrary strings (e.g. constants defined in a particular place) and we should use the state names consistently.
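As a minimal sketch of what "constants defined in a particular place" could look like: the class name, the helper and the state names below are hypothetical, chosen only for illustration, and are not Nova's actual migration statuses.

```python
"""Hypothetical single home for migration state names (illustrative only)."""
import enum


class MigrationState(str, enum.Enum):
    # These names and values are assumptions for illustration,
    # not the set of statuses Nova actually uses.
    QUEUED = "queued"
    PREPARING = "preparing"
    RUNNING = "running"
    COMPLETED = "completed"
    ERROR = "error"


def parse_state(raw: str) -> MigrationState:
    """Convert a stored string into a known state, rejecting typos."""
    try:
        return MigrationState(raw)
    except ValueError:
        raise ValueError(f"unknown migration state: {raw!r}") from None
```

With something like this, a misspelled state name fails loudly at the point of use instead of silently creating a new state that nothing else recognises.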
Transitions can come from two places: 1) The user invokes the API to change the state of an instance. This is a good place to check that the instance is in a state that allows the externally visible transition. 2) The state of the instance changes due to an internal event (host crash, deliberate operation...). This implies a change in the externally visible state of the instance, and it cannot be prevented just because the state machine says it shouldn't happen (usually this is captured by the error state, but we can sometimes do better).
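To make the distinction between the two sources of transitions concrete, here is a rough sketch. The states, the transition table and the method names are all assumptions made up for this example, not Nova code.

```python
import enum


class State(enum.Enum):
    # Hypothetical externally visible instance states.
    ACTIVE = "active"
    MIGRATING = "migrating"
    STOPPED = "stopped"
    ERROR = "error"


# Transitions a user may request through the API (assumed for illustration).
API_TRANSITIONS = {
    State.ACTIVE: {State.MIGRATING, State.STOPPED},
    State.STOPPED: {State.ACTIVE},
}


class InvalidTransition(Exception):
    pass


class Instance:
    def __init__(self, state=State.ACTIVE):
        self.state = state

    def request_transition(self, target):
        """Case 1: API-driven change; refuse it if the table forbids it."""
        if target not in API_TRANSITIONS.get(self.state, set()):
            raise InvalidTransition(f"{self.state.name} -> {target.name}")
        self.state = target

    def record_observed_state(self, observed):
        """Case 2: internal event; it already happened, so just record it."""
        self.state = observed
```

The API path can say no. The internal path cannot: a host crash has already happened by the time we hear about it, so the only choice left is how to report it.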
I think the state machines that are being defined in these changes are actually high level phases of the migration process that are currently observed by the user. I'm not sure they are particularly useful for coordinating the migration process itself and so are maybe not the right place to enforce internal transitions.
Live migration is an oddity in nova. Usually an instance is a single entity running on a single host (ignoring clustered hypervisors for the moment). There is a host manager responsible for that host that has the best view of the actual state of the instance or operations being performed on it. Generally the host manager is the natural place to coordinate operations on the instance.
In the case of live migration there are actually two VMs running on different hosts at the same time. The migration process involves coordinating transitions of those two VMs (attaching disks, plugging networks, starting the target VM, starting the migration, rebinding ports, stopping the source VM...). The two VMs and their own individual states in this process are not represented explicitly. We only have an overall process coordinated by a distributed sequence of RPCs. There is a current spec moving that coordination to the conductor. When that sequence is interrupted or even completely lost (e.g. by a conductor failing or being restarted) we get into trouble. I think this is where our real problem lies.
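A toy illustration of the interruption problem, assuming progress is recorded somewhere durable after each step. The step names come from the list above; the journal, the function and its parameters are hypothetical and are not how Nova coordinates a migration today.

```python
# Step names are taken from the list above; the journal and resume logic
# are assumptions for illustration only.
STEPS = [
    "attach_disks",
    "plug_networks",
    "start_target_vm",
    "start_migration",
    "rebind_ports",
    "stop_source_vm",
]


def run_migration(journal, execute, fail_after=None):
    """Run the remaining steps, recording each one once it completes.

    journal: list of completed step names (stands in for durable storage).
    execute: callable invoked with each step name.
    fail_after: optional step name after which a crash is simulated.
    """
    # The slice is taken once, so appending to the journal below is safe.
    for step in STEPS[len(journal):]:
        execute(step)
        journal.append(step)
        if step == fail_after:
            raise RuntimeError(f"coordinator lost after {step}")
```

If the coordinator dies after `plug_networks`, a second call with the same journal picks up at `start_target_vm`. Without an explicit record like this, a restarted coordinator has no way of knowing how far the sequence got.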
We should sort out the internal process. The external view given to the user can be a true reflection of the current state of the instance. The transitions of the instance should be internally coordinated.