[openstack-dev] [cinder][drivers] Backend and volume health reporting

Avishay Traeger avishay at stratoscale.com
Mon Aug 15 08:28:59 UTC 2016

On Sun, Aug 14, 2016 at 5:53 PM, John Griffith <john.griffith8 at gmail.com> wrote:
> ​I'd like to get a more detailed use case and example of a problem you
> want to solve with this.  I have a number of concerns including those I
> raised in your "list manageable volumes" proposal.​  Most importantly
> there's really no clear definition of what these fields mean and how they
> should be interpreted.

I intentionally haven't specified what anything means yet - the idea was to
first gather information here about what the various backends can report,
and then make an educated decision about which health states make sense to
expose.

I see Cinder's potential as a single pane of glass management for all of my
cloud's storage.  Once I do some initial configuration, I hope to look at
the backend's UI as little as possible.  Today a user can create a volume,
but can't know anything about its resiliency or availability.  The user
has a volume that's "available" and is happy.  But what does the user
really care about?  In my opinion not Cinder's internal state machine, but
things like "Is my data safe?" and "Is my data accessible?"  That's the
problem that I want to solve here.

> For backends, I'm not sure what you want to solve that can't be handled
> already by the scheduler and report-capabilities periodic job?  You can
> already report back from your backend to the scheduler that you shouldn't
> be used for any scheduling activities going forward.  More detailed info
> than that might be useful, but I'm not sure it wouldn't fall into an
> already existing OpenStack monitoring project like Monasca?

My storage requires maintenance and now all volumes are inaccessible.  I
have management access and can create as many volumes as I want, but cannot
attach them.  Or the storage is down entirely.  Or it is up, but
performance/reliability is degraded due to rebuilds in progress.  Or
multiple disks failed, and I lost the data on 100 volumes.

In all these cases, all I see is that my volumes are available/in-use.  To
have any real insight into what is going on, the admin has to go to the
storage backend and use vendor-specific APIs to find out.  Why not abstract
these APIs as well, to allow the admin to monitor the storage?  It can be
as simple as "Hey, there's a problem, your volumes aren't accessible - go
look at the backend's UI" - without going into details.

Do you propose every vendor write a Monasca plugin?  It doesn't seem to be
in line with their goal...

> As far as volumes, I personally don't think volumes should have more than a
> few states.  They're either "ok" and available for an operation or they're
> not.

I agree.  In my opinion volumes have way too many states today.  But that's
another topic.  What I am proposing is not new states, or a new state
machine, but rather a simple health property: volume['health'] = "healthy",
volume['health'] = "error".  Whatever the backend reports.
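As a rough illustration of how small this proposal is: the per-volume health could be a single extra field populated from whatever the backend reports, leaving the existing status field untouched. The `get_volume_health` callable below is a hypothetical driver hook, not an existing Cinder API:

```python
# Illustrative sketch only: annotate volumes with a backend-reported
# 'health' field, without adding any new states to Cinder's state machine.
# get_volume_health is a hypothetical driver call, not a real Cinder API.

def refresh_volume_health(volumes, get_volume_health):
    """Set volume['health'] from the backend's report, leaving the
    existing 'status' field (available/in-use/...) untouched."""
    for volume in volumes:
        volume["health"] = get_volume_health(volume["id"])
    return volumes


volumes = [{"id": "vol-1", "status": "available"},
           {"id": "vol-2", "status": "in-use"}]
# Stand-in for a vendor backend's per-volume report:
backend_report = {"vol-1": "healthy", "vol-2": "error"}
refresh_volume_health(volumes, backend_report.get)
# 'status' stays as-is; 'health' reflects whatever the backend reports.
```

Note that the health field is purely informational here - nothing in the scheduler or the state machine keys off it in this sketch.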

> The list you have seems ok to me, but I don't see a ton of value in fault
> prediction or going to great lengths to avoid something failing. The
> current model we have of a volume being "ok" until it's "not" seems
> perfectly reasonable to me.  Typically my experience is that trying to be
> clever and polling/monitoring to try and preemptively change the status of
> a volume does little more than result in complexity, confusion and false
> status changes of resources.  I'm pretty strongly opposed to having a level
> of granularity of the volume here.  At least for now, I'd rather see what
> you have in mind for the backend and nail that down to something that's
> solid and basically bullet proof before trying to tackle thousands of
> volumes which have transient states.  And of course the biggest question I
> have still "what problem" you hope to solve here?

This is not about fault prediction, or preemptive changes, or anything
fancy like that.  It's simply reporting on the current health.  "You have
lost the data in this volume, sorry".  "Don't bother trying to attach this
volume right now, it's not accessible."  "The storage is currently doing
something with your volume and performance will suck."

I don't know exactly what we want to expose - I'd rather answer that after
getting feedback from vendors about what information is available.  But
providing some real, up-to-date health status on storage resources would be
of great value to customers.


*Avishay Traeger, PhD*
*System Architect*

Mobile: +972 54 447 1475
E-mail: avishay at stratoscale.com

Web <http://www.stratoscale.com/> | Blog <http://www.stratoscale.com/blog/>
 | Twitter <https://twitter.com/Stratoscale> | Google+
 | Linkedin <https://www.linkedin.com/company/stratoscale>