<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Tue, May 24, 2016 at 11:06 AM, John Garbutt <span dir="ltr"><<a href="mailto:john@johngarbutt.com" target="_blank">john@johngarbutt.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On 24 May 2016 at 10:16, Matthew Booth <<a href="mailto:mbooth@redhat.com">mbooth@redhat.com</a>> wrote:<br>

> During its periodic task, ImageCacheManager does a checksum of every image<br>

> in the cache. It verifies this checksum against a previously stored value,<br>

> or creates that value if it doesn't already exist.[1] Based on this<br>

> information it generates a log message if the image is corrupt, but<br>

> otherwise takes no action. Going by git, this has been the case since 2012.<br>

><br>

> The commit which added it was associated with 'blueprint<br>

> nova-image-cache-management phase 1'. I can't find this blueprint, but I did<br>

> find this page: <a href="https://wiki.openstack.org/wiki/Nova-image-cache-management" rel="noreferrer" target="_blank">https://wiki.openstack.org/wiki/Nova-image-cache-management</a><br>

> . This talks about 'detecting images which are corrupt'. It doesn't explain<br>

> why we would want to do that, though. It also doesn't seem to have been<br>

> followed through in the last 4 years, suggesting that nobody's really that<br>

> bothered.<br>

><br>

> I understand that corruption of bits on disks is a thing, but it's a thing<br>

> for more than just the image cache. I feel that this is a problem much<br>

> better solved at other layers, prime candidates being the block and<br>

> filesystem layers. There are existing robust solutions to bitrot at both of<br>

> these layers which would cover all aspects of data corruption, not just this<br>

> randomly selected slice.<br>

<br>

</span>+1<br>

<br>

That might mean improved docs on the need to configure such a thing.<br>

<span class=""><br>

> As it stands, I think this code is regularly running a pretty expensive task<br>

> looking for something which will very rarely happen, only to generate a log<br>

> message which nobody is looking for. And it could be solved better in other<br>

> ways. Would anybody be sad if I deleted it?<br>

<br>

</span>For completeness, we need to deprecate it using the usual cycles:<br>

<a href="https://governance.openstack.org/reference/tags/assert_follows-standard-deprecation.html" rel="noreferrer" target="_blank">https://governance.openstack.org/reference/tags/assert_follows-standard-deprecation.html</a></blockquote><div><br></div><div>I guess I'm arguing that it isn't a feature, and never has been: it really doesn't do anything at all except generate a log message. Are log messages part of the deprecation contract?</div><div><br></div><div>If operators are genuinely finding corrupt images to be a problem and find this log message useful that would be extremely useful to know.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

I like the idea of checking the md5 matches before each boot, as it<br>

mirrors the check we do after downloading from glance. Its possible<br>

thats very unlikely to spot anything that shouldn't already be worried<br>

about by something else. It may just be my love of symmetry that makes<br>

me like that idea?<br></blockquote><div><br></div><div>It just feels arbitrary to me for a few reasons. Firstly, it's only relevant to storage schemes which use the file in the image cache as a backing file. In this libvirt driver, this is just the qcow2 backend. While this is the default, most users are actually using ceph. Assuming it isn't cloning it directly from ceph-backed glance, the Rbd backend imports from the image cache during spawn, and has nothing to do with it thereafter. So for Rbd we'd want to check during spawn. Same for the Flat, Lvm and Ploop backends.</div><div><br></div><div>Except that it's still arbitrary because we're not checking the Qcow overlay on each boot. Or ephemeral or swap disks. Or Lvm, Flat or Rbd disks at all. Or the operating system. And it's still expensive, and better done by the block or filesystem layer.</div><div><br></div><div>I'm not personally convinced there's all that much point checking during download either, but given that we're loading all the bits anyway that check is essentially free. However, even if we decided we needed to defend the system against bitrot above the block/filesystem layer (and I'm not at all convinced of that) we'd want a coordinated design for it. Without one, we risk implementing a bunch of disconnected/incomplete stuff that doesn't meet anybody's needs, but burns resources anyway.</div><div><br></div><div>Matt</div></div>-- <br><div class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><span style="font-size:12.8px">Matthew Booth</span><br></div><div>Red Hat Engineering, Virtualisation Team</div><div><br></div><div>Phone: +442070094448 (UK)</div><div><br></div></div></div></div></div>

</div></div>