[Openstack-operators] [HA] RFC: user story including hypervisor reservation / host maintenance / storage AZs / event history (fwd)

Adam Spiers aspiers at suse.com
Tue Jun 28 15:41:57 UTC 2016


Juvonen, Tomi (Nokia - FI/Espoo) <tomi.juvonen at nokia.com> wrote:
> Thank you very much for the interest. I need to look over the other
> discussions and perhaps have a session in Barcelona to look at the
> way forward after the change in Nova.

Indeed, sounds good!

> > -----Original Message-----
> > From: Adam Spiers [mailto:aspiers at suse.com]
> > Sent: Monday, June 20, 2016 4:43 PM
> > To: Juvonen, Tomi (Nokia - FI/Espoo) <tomi.juvonen at nokia.com>
> > Cc: openstack-operators mailing list <openstack-operators at lists.openstack.org>
> > Subject: Re: [Openstack-operators] [HA] RFC: user story including
> > hypervisor reservation / host maintenance / storage AZs / event history
> > (fwd)
> >
> > Hi Tomi,
> >
> > Juvonen, Tomi (Nokia - FI/Espoo) <tomi.juvonen at nokia.com> wrote:
> > > I'm working in the OPNFV Doctor project that is about fault
> > > management and maintenance (NFV). The goal of the project is to
> > > build fault management and maintenance framework for high
> > > availability of Network Services on top of virtualized
> > > infrastructure.
> > >
> > > https://wiki.opnfv.org/display/doctor
> > >
> > > An effort has already landed in OpenStack to detect failures
> > > quickly, change states in OpenStack (Nova), add state information
> > > that was missing and expose it to the owner of a VM. An alarm is
> > > also triggered. With all this, one can now rely on the states and
> > > be notified of faults in a split second. The system can of course
> > > be configured to monitor different faults and take actions based
> > > on configured policies, or leave some actions to the consumers of
> > > the alarms raised.
> >
> > Sounds very interesting - thanks.  Does this really have to be limited
> > to OPNFV though?  It sounds like it would be very useful within
> > OpenStack generally.
> Surely not just for OPNFV, but for all operators.

Right - so why is it part of the OPNFV project?  That gives the
impression that it would only be usable in NFV contexts.

> Suppose we play with the idea of having a link to some external tool
> in order to offer more than "host_maintenance_reason": as it now
> seems, something more generic like "host_details", where one could
> have an external REST API to call for any host-specific details that
> one would also like to expose to the tenant/owner of the server.

Sounds like you are talking about some kind of "whiteboard" feature
per instance which would act as a sort of communication channel
between the project user/owner and the cloud operator?  Can you
describe a use case which is unrelated to maintenance?

> With such a tool, one could also implement scenarios specific to
> maintenance or host failure. The admin could do things manually, or
> configure the tool per VNF / instance to take some actions.

I think we should distinguish between a place to store freeform
human-readable text, and a way for the cloud operator to plan and then
carry out maintenance actions in a manner which would be communicated
to affected users.  The latter would require structured
machine-readable values, otherwise it would be impossible to reliably
implement well-defined workflows.

If we implement a new freeform text field and then it gets treated as
machine-readable by external tools, then there will be no consistency
across different clouds, which will make it hard for operators to
share those tools without conflicting with other uses.
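
To illustrate the distinction, here is a minimal sketch in Python. All
names in it (MaintenanceState, HostMaintenance, the field names) are
hypothetical and are not taken from the Nova spec or any existing API.
The point is only that a tool can act reliably on typed fields, whereas
a freeform note would have to be interpreted by guesswork.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class MaintenanceState(Enum):
    # Hypothetical states; a real design would need to agree on these.
    NONE = "none"
    PLANNED = "planned"
    IN_PROGRESS = "in_progress"
    COMPLETE = "complete"


@dataclass
class HostMaintenance:
    host: str
    state: MaintenanceState
    window_start: Optional[datetime] = None
    window_end: Optional[datetime] = None
    note: str = ""  # freeform text, intended for humans only


def affects(record: HostMaintenance, when: datetime) -> bool:
    """Return True if `when` falls inside a planned or active window."""
    active = (MaintenanceState.PLANNED, MaintenanceState.IN_PROGRESS)
    if record.state not in active:
        return False
    if record.window_start is None or record.window_end is None:
        return False
    return record.window_start <= when < record.window_end
```

An external tool can call something like affects() without parsing the
note field at all, so the freeform text stays free for whatever each
operator wants to tell their users.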

> The OPNFV use case here is just the more specific maintenance state
> to begin with, but who knows what one might want to implement there in
> the end. Auto evacuate... ?

Please be careful of the word evacuate, because it is ambiguous, as I
explained in my Austin talk:

  https://youtu.be/lddtWUP_IKQ?t=13m8s

> That is in any case far off in the next steps, as it is complex to
> build. What to do is even case-specific across different scenarios:
> - Manually do any action by admin.
> - Automatically move VM (maybe not if problem with bigger scale)
> - Let it stay on host over maintenance (not busy hour for service)
> - Let VM owner remove/add VM (to host already gone through maintenance)
> ...

Yes, these are all possible scenarios.  It depends very much on the
kind of maintenance.  The HA community talked about this topic a lot
in Austin, and agreed that any solution supporting automatic workflows
should be configurable so that each cloud operator can configure their
cloud to behave in the way which makes the most sense for them.  Our
discussion was captured in this etherpad, although it might be
slightly difficult to wade through for people who did not attend the
meetings:

  https://etherpad.openstack.org/p/newton-instance-ha
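
As a rough illustration of what "configurable" could mean here, the
following Python sketch maps a kind of maintenance to an action. The
action names loosely follow the scenarios listed above, and the
maintenance kinds, the policy format and choose_action() are all
hypothetical; nothing here reflects an agreed design or an existing
OpenStack interface.

```python
from typing import Dict

# Hypothetical action names, loosely following the scenarios listed above.
ACTIONS = {"manual", "migrate", "stay", "notify_owner"}

# Fall back to leaving the decision with the admin.
DEFAULT_ACTION = "manual"


def choose_action(policy: Dict[str, str], maintenance_kind: str) -> str:
    """Look up the operator-configured action for a kind of maintenance."""
    action = policy.get(maintenance_kind, DEFAULT_ACTION)
    if action not in ACTIONS:
        raise ValueError(f"unknown action {action!r} for {maintenance_kind!r}")
    return action


# Example policy for one cloud; another operator could choose differently.
example_policy = {
    "kernel_upgrade": "migrate",
    "firmware_update": "notify_owner",
    "network_recabling": "stay",
}
```

With this policy, choose_action(example_policy, "kernel_upgrade")
selects "migrate", while a kind of maintenance the operator has not
configured falls back to "manual".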

> > > For maintenance I had a session in Austin to talk with Ops and Nova
> > > core about the maintenance part. There it was seen that Nova didn't
> > > want more specific information about host maintenance (maintenance
> > > state, maintenance window...), so as a result of the discussion
> > > there is a spec that was now transferred to Ocata:
> > >
> > > https://review.openstack.org/310510/
> >
> > That's great - thanks a lot for highlighting, as it certainly seems to
> > overlap a lot with the functionality which NTT proposed and is now
> > described here:
> >
> >   http://specs.openstack.org/openstack/openstack-user-stories/user-stories/proposed/ha_vm.html
>
> Thanks, I need to familiarize myself with this as well as other
> requests in the field.

The talk which I mentioned above might help you get familiar with this area:

  https://www.openstack.org/videos/video/high-availability-for-pets-and-hypervisors-state-of-the-nation

> > > The spec proposes a link from Nova to an external tool to provide
> > > more specific information about host (compute) maintenance and,
> > > per the latest comments, it could hold any extra host-specific
> > > information in the same place (for example the event history you
> > > mentioned). Still, if we are looking at this kind of tool, why not
> > > make it configurable for anything convenient for different
> > > operator scenarios, such as automatic operations if so wanted.
> >
> > Yes, that definitely makes sense to me.
> >
> > > Anyhow, projects like Nova do not want big new functionality, so
> > > all "more complex flows" should reside somewhere outside.
> >
> > Right.  I can certainly understand that desire, but I'm a bit confused
> > why the spec is proposing both extending Nova's API / DB schema *and*
> > adding an external tool.
>
> I understand this point, as the text field alone is also usable. The
> external tool is kind of out of scope for the spec.

OK, so you mean that nova just provides the mechanism for
reading/writing the data, but it is up to operators to decide how to
use it?

> Anyhow, I would mention it so that it is understood that the aim is
> to build more functionality into OpenStack in the future and not to
> limit it to what a single string can offer.

I see.  I'm a bit worried that this might turn into a mess, but I
guess we can try it and see :-)

Anyway thanks a lot for the discussion and info shared!


