[openstack-dev] [glance] Periodically checking Glance image files

Sergio A. de Carvalho Jr. scarvalhojr at gmail.com
Wed Sep 14 10:16:47 UTC 2016


I think a proactive background check service could be useful in some cases,
but of course it'd have to be optional and configurable to allow operators
to tune the trade-off between the effort required to check all images and
the risk of hitting a rogue file.
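
As a rough illustration of what I have in mind (just a sketch, not a
proposal for Glance itself): an out-of-band job that walks the active
images in batches, re-downloads the data and compares it with the checksum
Glance recorded at upload time. It assumes python-glanceclient (v2 API)
and a valid Keystone token; the endpoint, token and batch size below are
placeholders to be filled in per deployment.

    import hashlib
    import itertools

    from glanceclient import Client

    GLANCE_ENDPOINT = 'http://glance.example.com:9292'  # placeholder
    AUTH_TOKEN = '<keystone token>'                     # placeholder
    BATCH_SIZE = 100                                    # tune per deployment

    def recomputed_md5(glance, image_id):
        # Stream the image data back out of the store and re-hash it.
        md5 = hashlib.md5()
        for chunk in glance.images.data(image_id):
            md5.update(chunk)
        return md5.hexdigest()

    def audit_batch(glance, batch):
        for image in batch:
            if image.status != 'active' or not image.checksum:
                continue
            try:
                actual = recomputed_md5(glance, image.id)
            except Exception as exc:
                print('%s: data unavailable (%s)' % (image.id, exc))
                continue
            if actual != image.checksum:
                print('%s: checksum mismatch (%s != %s)'
                      % (image.id, actual, image.checksum))

    def main():
        glance = Client('2', endpoint=GLANCE_ENDPOINT, token=AUTH_TOKEN)
        images = glance.images.list()
        while True:
            batch = list(itertools.islice(images, BATCH_SIZE))
            if not batch:
                break
            audit_batch(glance, batch)

    if __name__ == '__main__':
        main()

Something like this could be driven from cron and throttled (sleep between
batches, cap bandwidth) so each operator can pick their own point on the
effort-vs-risk curve.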

Letting the backend report health back to Glance, as suggested by Avishay,
is also an option, but not every backend has this capability (e.g. a local
filesystem), and it would also require the backend to keep track of the
checksum originally recorded in Glance, which again might not always be
possible.
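
For backends that do expose such events (like the S3 object-loss
notification Avishay points to below), a small out-of-tree monitor could
translate them into image updates. Here's a boto3/SQS sketch of the
consuming side; the queue, the notification wiring and the mapping from S3
key back to a Glance image ID are all assumptions here, not something the
S3 driver gives you today:

    import json

    import boto3

    QUEUE_URL = 'https://sqs.example.com/123456789012/glance-events'  # placeholder
    sqs = boto3.client('sqs', region_name='us-east-1')                # placeholder

    def mark_image_unhealthy(s3_key):
        # Hypothetical: map the S3 key back to a Glance image ID and
        # update its status / emit a notification.
        print('object lost: %s' % s3_key)

    def poll_loss_events():
        while True:
            resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                       MaxNumberOfMessages=10,
                                       WaitTimeSeconds=20)
            for msg in resp.get('Messages', []):
                body = json.loads(msg['Body'])
                for record in body.get('Records', []):
                    # s3:ReducedRedundancyLostObject is the loss event S3
                    # publishes (see the link in Avishay's reply below).
                    if record.get('eventName', '').endswith(
                            'ReducedRedundancyLostObject'):
                        mark_image_unhealthy(record['s3']['object']['key'])
                sqs.delete_message(QueueUrl=QUEUE_URL,
                                   ReceiptHandle=msg['ReceiptHandle'])

Backends without any event mechanism (the local filesystem case) would
still need some scrubbing of their own, as discussed further down.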

Another option I see is to update the image status when Glance attempts to
serve an image and notices that the file isn't available or doesn't match
the stored checksum. In Icehouse, Glance simply returns an HTTP 500, which
doesn't get properly reported back to the user when a VM is being created.
I'm not sure whether this is handled better in later versions of Glance and
Nova.
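
If we went down that road, the serving path could verify the stored
checksum while streaming and flag the image instead of only returning a
500. Purely illustrative (this is not how Glance's data path is actually
structured), but the idea is roughly:

    import hashlib

    def verified_stream(chunks, expected_md5, on_corruption):
        # Yield the data as-is, but hash it along the way and call back
        # if what was served doesn't match the stored checksum.
        md5 = hashlib.md5()
        for chunk in chunks:
            md5.update(chunk)
            yield chunk
        if md5.hexdigest() != expected_md5:
            on_corruption()

The callback could then move the image to an error/deactivated state and
emit a notification, so the failure shows up somewhere the user (or Nova)
can see it rather than only as a 500 in the Glance logs.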


On Tue, Sep 13, 2016 at 8:01 AM, Avishay Traeger <avishay at stratoscale.com>
wrote:

> On Tue, Sep 13, 2016 at 7:16 AM, Nikhil Komawar <nik.komawar at gmail.com>
> wrote:
> >     Firstly, I'd like to mention that Glance is built to be (and, if
> >     deployed correctly, is) self-resilient in ensuring that you do NOT
> >     need an audit of such files. In fact, if any operator (particularly a
> >     large-scale operator) needs such a system, we have a serious issue
> >     where potentially important /user/ data is likely to be lost,
> >     resulting in legal issues (so please beware).
>
> Can you please elaborate on how Glance is self-resilient?
>
>> Hey Sergio,
>>
>> Glad to know that you're not having any feature-related issues (to me
>> this is a good sign). Based on your answers, it makes sense to require a
>> reliability solution for backend data (or some sort of health monitoring
>> for the user data).
>>
>
> All backends will at some point lose some data.  The ask is for reflecting
> the image's "health" to the user.
>
>
>> So, I wonder what your thoughts are for such an audit system. At first
>> glance, this doesn't look scalable, at least if you plan to run the
>> audit on all of the active images. Consider a deployment trying to run
>> this for around 100-500K active image records. It would need to be run
>> in batches, so completing the list of records and saying that you've
>> done a full audit of the active images is an NP-complete problem (new
>> images can be introduced, some images can be updated in the meantime,
>> etc.).
>>
>
> NP-complete?  Really?  Every storage system scrubs all data periodically
> to protect from disk errors.  Glance images should be relatively static
> anyway.
>
>
>> The failure rate is low, so a random (sparse) check on the image data
>> won't help either. Would a cron job set up to do the audit for smaller
>> deployments work? Maybe we can look into some known cron solutions to
>> do the trick?
>>
>
> How about letting the backend report the health?  S3, for example, reports
> an event on object loss
> <http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html#supported-notification-event-types>.
> The S3 driver could monitor those events and update the image status.  Swift
> performs scrubbing to determine object health - I haven't checked whether it
> reports an event on object loss, but I don't see any reason not to.  For a
> local filesystem, it would need its own scrubbing process (e.g., recalculate
> the hash for each object every N days).  On the other hand, if it is a mount
> of some filer, the filer should be able to report on health.
>
> Thanks,
> Avishay
>
> --
> *Avishay Traeger, PhD*
> *System Architect*
>
> Mobile: +972 54 447 1475
> E-mail: avishay at stratoscale.com
>
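
On Avishay's point about a local filesystem needing its own scrubbing
process, a backend-side check could be as simple as walking the store
directory from cron every N days. This sketch assumes the default
filesystem_store_datadir layout (one file per image, named after the image
UUID) and a mapping of image UUID to checksum exported from the Glance DB:

    import hashlib
    import os

    STORE_DIR = '/var/lib/glance/images'  # filesystem_store_datadir
    CHUNK = 64 * 1024

    def file_md5(path):
        md5 = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(CHUNK), b''):
                md5.update(chunk)
        return md5.hexdigest()

    def scrub(expected):
        # expected: dict of image UUID -> md5 checksum from the Glance DB.
        for name in os.listdir(STORE_DIR):
            if name not in expected:
                continue
            actual = file_md5(os.path.join(STORE_DIR, name))
            if actual != expected[name]:
                print('%s: checksum mismatch (%s != %s)'
                      % (name, actual, expected[name]))

That covers the filesystem case; filers and object stores are better served
by their own health reporting, as Avishay describes above.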