[openstack-dev] [Openstack-operators] [nova] Is verification of images in the image cache necessary?
mbooth at redhat.com
Tue May 24 10:49:48 UTC 2016
On Tue, May 24, 2016 at 11:06 AM, John Garbutt <john at johngarbutt.com> wrote:
> On 24 May 2016 at 10:16, Matthew Booth <mbooth at redhat.com> wrote:
> > During its periodic task, ImageCacheManager does a checksum of every image
> > in the cache. It verifies this checksum against a previously stored value,
> > or creates that value if it doesn't already exist. Based on this
> > information it generates a log message if the image is corrupt, but
> > otherwise takes no action. Going by git, this has been the case since it
> > was first introduced. The commit which added it was associated with 'blueprint
> > nova-image-cache-management phase 1'. I can't find this blueprint, but I
> > did find this page:
> > . This talks about 'detecting images which are corrupt'. It doesn't say
> > why we would want to do that, though. It also doesn't seem to have been
> > followed through in the last 4 years, suggesting that nobody's really
> > bothered.
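For anyone unfamiliar with what the periodic task is doing, it amounts to roughly the following. This is a minimal illustrative sketch, not Nova's actual code; the function names and the JSON store are my own invention, and Nova's real implementation differs in detail:

```python
# Hypothetical sketch of the periodic image cache check: hash each cached
# image and compare against a previously stored value, creating that value
# on first sight. Names here are illustrative, not Nova's actual API.
import hashlib
import json
import os


def file_checksum(path, algo="sha256", chunk=1 << 20):
    """Stream the file through a hash so large images don't fill memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def verify_cached_image(image_path, store_path):
    """Return True if the image matches its stored checksum.

    If no checksum is stored yet, record one now (as the periodic task
    does) and report the image as clean.
    """
    try:
        with open(store_path) as f:
            stored = json.load(f)
    except FileNotFoundError:
        stored = {}
    current = file_checksum(image_path)
    key = os.path.basename(image_path)
    if key not in stored:
        stored[key] = current
        with open(store_path, "w") as f:
            json.dump(stored, f)
        return True
    return stored[key] == current
```

Note that even in this toy form, the full read of every cached image on every pass is where the expense comes from.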
> > I understand that corruption of bits on disks is a thing, but it's a
> > problem for more than just the image cache. I feel that this is a problem
> > much better solved at other layers, prime candidates being the block and
> > filesystem layers. There are existing robust solutions to bitrot at both of
> > these layers which would cover all aspects of data corruption, not just a
> > randomly selected slice.
> That might mean improved docs on the need to configure such a thing.
> > As it stands, I think this code is regularly running a pretty expensive
> > task looking for something which will very rarely happen, only to generate
> > a log message which nobody is looking for. And it could be solved better
> > in other ways. Would anybody be sad if I deleted it?
> For completeness, we need to deprecate it using the usual cycles:
I guess I'm arguing that it isn't a feature, and never has been: it really
doesn't do anything at all except generate a log message. Are log messages
part of the deprecation contract?
If operators are genuinely finding corrupt images to be a problem and are
finding this log message helpful, that would be extremely useful to know.
> I like the idea of checking the md5 matches before each boot, as it
> mirrors the check we do after downloading from glance. It's possible
> that's very unlikely to spot anything that shouldn't already be worried
> about by something else. It may just be my love of symmetry that makes
> me like that idea?
It just feels arbitrary to me for a few reasons. Firstly, it's only
relevant to storage schemes which use the file in the image cache as a
backing file. In the libvirt driver, this is just the qcow2 backend. While
this is the default, most users are actually using ceph. Assuming it isn't
cloning directly from ceph-backed glance, the Rbd backend imports from
the image cache during spawn, and has nothing to do with it thereafter. So
for Rbd we'd want to check during spawn. The same goes for the Flat, Lvm
and Ploop backends.
Except that it's still arbitrary because we're not checking the Qcow
overlay on each boot. Or ephemeral or swap disks. Or Lvm, Flat or Rbd disks
at all. Or the operating system. And it's still expensive, and better done
by the block or filesystem layer.
I'm not personally convinced there's all that much point checking during
download either, but given that we're loading all the bits anyway that
check is essentially free. However, even if we decided we needed to defend
the system against bitrot above the block/filesystem layer (and I'm not at
all convinced of that) we'd want a coordinated design for it. Without one,
we risk implementing a bunch of disconnected/incomplete stuff that doesn't
meet anybody's needs, but burns resources anyway.
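To illustrate why the download-time check is "essentially free": every byte of the image already passes through the downloader, so hashing the stream in flight adds almost nothing over the I/O we're doing anyway. A rough sketch, assuming a readable source object and an expected checksum such as glance supplies (names are mine, not Nova's):

```python
# Sketch of an integrity check done during download: hash the stream as
# it is written out, so verification costs only the hashing itself.
# download_and_verify and its parameters are hypothetical names.
import hashlib


def download_and_verify(src, dst_path, expected_md5, chunk=1 << 20):
    """Copy src (a readable binary file object) to dst_path, hashing the
    stream in flight; raise IOError on a checksum mismatch."""
    h = hashlib.md5()
    with open(dst_path, "wb") as dst:
        while True:
            block = src.read(chunk)
            if not block:
                break
            h.update(block)
            dst.write(block)
    if h.hexdigest() != expected_md5:
        raise IOError("checksum mismatch for %s" % dst_path)
```

Contrast this with the periodic task, which has to re-read every cached image from disk purely to recompute a hash.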
Red Hat Engineering, Virtualisation Team
Phone: +442070094448 (UK)