Thanks for your reply, Sean!
I've submitted a bug for nova: https://bugs.launchpad.net/nova/+bug/2130429
I'll mention here that it's actually fully reproducible in a devstack environment. I had mistakenly switched to file-backed nova VMs in my environment during testing, and such uploads are treated as qcow2 instead of raw, in which case Glance is capable of detecting the data corruption, which fails the shelve procedure and prevents the VM from being offloaded.
I've provided detailed reproduction steps in the bug description.
On Fri, Oct 31, 2025 at 12:38 PM Sean Mooney <smooney@redhat.com> wrote:
Hi, thanks for the report. Can we get the details into a Launchpad bug?
You can file a single bug for nova and we can triage it fully, but I'll reply inline as well.
On 31/10/2025 08:51, Vladimir Prokofev wrote:
Hello everyone.
I've discovered a bug in nova/glanceclient interoperation that leads to image corruption during shelving under some circumstances involving connection disruption.
I was able to reproduce it in two different production environments running on xena and 2024.2 respectively. I'm also able to reproduce it in devstack, though without image corruption: there it just ends up in a shelve failure due to detected corruption.
My setup assumes LVM-backed QEMU VMs with a Ceph backend for Glance, but I believe this is applicable to a variety of nova/glance backends.
When shelve is triggered, nova-compute creates an image file locally and then initiates the upload of that image file into Glance[0]. If something happens to the connection during the upload ("broken pipe", "connection timeout"), nova-compute retries the upload operation[1] while the image object is removed from the Ceph backend by Glance[2]. The problem here is that the image_file object that is passed to glanceclient by nova-compute is a byte stream created with an open()[3] call, and upon retry it resumes the upload from the point where it was interrupted. This is easily confirmed by calling image_data.tell() in the glanceclient.v2.images.Controller.upload() function: it will be at zero initially and non-zero on retry.
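To make the failure mode described above concrete, here is a minimal sketch in plain Python. It does not use nova or glanceclient at all: FlakyUploader and upload_with_naive_retry are hypothetical stand-ins for the upload call and the retry wrapper, and the "backend discards the partial object" step is an assumption that mirrors the behaviour described for Glance. It only shows that a retry which reuses the same stream continues from the interrupted position.

```python
import io


class FlakyUploader:
    """Hypothetical stand-in for an upload that fails once mid-transfer."""

    def __init__(self, fail_after_chunks):
        self.fail_after_chunks = fail_after_chunks
        self.stored = b""
        self.positions = []  # stream position seen at the start of each attempt

    def upload(self, image_data, chunk_size=4):
        self.positions.append(image_data.tell())
        self.stored = b""  # assumption: the backend drops the partial object
        chunks_sent = 0
        while True:
            chunk = image_data.read(chunk_size)
            if not chunk:
                return
            if chunks_sent == self.fail_after_chunks:
                self.fail_after_chunks = None  # fail only once
                raise ConnectionError("broken pipe")
            self.stored += chunk
            chunks_sent += 1


def upload_with_naive_retry(uploader, image_data, attempts=2):
    """Retry the upload without touching the stream position."""
    for attempt in range(1, attempts + 1):
        try:
            uploader.upload(image_data)
            return
        except ConnectionError:
            if attempt == attempts:
                raise


payload = b"0123456789abcdef"
uploader = FlakyUploader(fail_after_chunks=2)
upload_with_naive_retry(uploader, io.BytesIO(payload))

print(uploader.positions)  # position at the start of each attempt
print(uploader.stored)     # what the "backend" ended up with
```

In this sketch the second attempt starts at a non-zero position, so the stored object contains only the tail of the payload.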
So looking at the wrapper object, fixing this in nova will be kind of annoying.
The fix would be to just seek the byte stream back to the start of the file.
However, the retry is done dynamically today via the overloaded call function, which does not currently have awareness of which method is being invoked and as a result does not reset the stream.
If the retry logic was scoped to the upload function, say here
https://opendev.org/openstack/nova/src/branch/master/nova/image/glance.py#L5... it would be cleaner to fix.
That does not mean we can't do something like check the method that is passed here
https://opendev.org/openstack/nova/src/branch/master/nova/image/glance.py#L1...
and do some processing on the args
or modify
https://opendev.org/openstack/nova/src/branch/master/nova/virt/libvirt/drive... or
https://opendev.org/openstack/nova/src/branch/master/nova/image/glance.py#L6...
to be a closure/wrapped in a decorator that would reset the stream.
In fact, the simplest fix might be to add a finally block here
https://opendev.org/openstack/nova/src/branch/master/nova/image/glance.py#L7... that just seeks the data back to 0.
That way it undoes the internal modification to the stream position.
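A rough sketch of the finally-block idea, outside of nova's actual code: upload_and_rewind, retry and fake_upload are hypothetical names for illustration, not the functions in nova/image/glance.py, and the sketch assumes the data passed in is seekable (as a file opened with open() is).

```python
import io


def upload_and_rewind(upload, image_id, data):
    """Call upload(image_id, data) and always rewind data afterwards."""
    try:
        return upload(image_id, data)
    finally:
        # Undo whatever the upload did to the stream position, so that a
        # retry by the caller starts from the beginning again.
        data.seek(0)


def retry(func, *args, attempts=3):
    """Generic retry loop that knows nothing about the stream."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except ConnectionError:
            if attempt == attempts:
                raise


received = []            # (start position, bytes) of each successful upload
failures = {"left": 1}   # fail the first attempt only


def fake_upload(image_id, data):
    start = data.tell()
    head = data.read(5)  # part of the stream is consumed before the failure
    if failures["left"] > 0:
        failures["left"] -= 1
        raise ConnectionError("connection timeout")
    received.append((start, head + data.read()))


stream = io.BytesIO(b"raw-disk-image-bytes")
retry(upload_and_rewind, fake_upload, "image-1", stream)
print(received)
```

Because the rewind happens in the finally block, the generic retry loop does not need to know which method it is retrying.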
An alternative would be to fix this in glance client, but not glance itself.
https://github.com/openstack/python-glanceclient/blob/master/glanceclient/v2...
We could modify the upload function to first seek the image data to location 0 and then proceed with the upload.
The API contract of the method says that the image data is a file-like object
https://docs.python.org/3/glossary.html#term-file-like-object, which just says that a file-like object is a synonym for https://docs.python.org/3/glossary.html#term-file-object
```
An object exposing a file-oriented API (with methods such as |read()| or |write()|) to an underlying resource. Depending on the way it was created, a file object can mediate access to a real on-disk file or to another type of storage or communication device (for example standard input/output, in-memory buffers, sockets, pipes, etc.). File objects are also called /file-like objects/ or /streams/.
There are actually three categories of file objects: raw binary files <https://docs.python.org/3/glossary.html#term-binary-file>, buffered binary files <https://docs.python.org/3/glossary.html#term-binary-file> and text files <https://docs.python.org/3/glossary.html#term-text-file>. Their interfaces are defined in the |io| <https://docs.python.org/3/library/io.html#module-io> module. The canonical way to create a file object is by using the |open()| <https://docs.python.org/3/library/functions.html#open> function.
```
OK, so if we look at https://docs.python.org/3/library/io.html#class-hierarchy, what does the interface require?
The base class of the interface is https://docs.python.org/3/library/io.html#io.IOBase
It provides "fileno, seek, and truncate" as stubs which later classes must implement, and "close, closed, __enter__, __exit__, flush, isatty, __iter__, __next__, readable, readline, readlines, seekable, tell, writable, and writelines" as mixin methods.
So since https://docs.python.org/3/library/io.html#io.IOBase.seek is required for a seekable stream, we can use it to reset the stream we pass to the beginning.
Now, does the glance client require that you pass a file-like object that supports random access in its API contract? Technically no. So it would be more correct for nova to do the resetting than the glance client, as we, as the application, can enforce the slightly stricter requirement without narrowing the API contract.
Glance client could in the future narrow the contract and use https://docs.python.org/3/library/io.html#io.IOBase.seekable, implementing the resetting of the stream if it's not at position 0 and raising an exception if it's not seekable, but I don't think that is correct.
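For illustration only, the narrower contract described above might look roughly like this. rewind_or_fail is a hypothetical helper, not an existing glanceclient function, and raising ValueError is an arbitrary choice for the sketch. The pipe is just there to show a real file object for which seekable() is false.

```python
import io
import os


def rewind_or_fail(image_data):
    """Hypothetical helper: ensure the stream is at position 0 before upload."""
    if not image_data.seekable():
        # A stream without random access (pipe, socket, ...) cannot be reset.
        raise ValueError("image data is not seekable and cannot be rewound")
    if image_data.tell() != 0:
        image_data.seek(0)
    return image_data


# A seekable stream that was partially consumed is reset to the start.
partially_read = io.BytesIO(b"image-bytes")
partially_read.read(4)
rewind_or_fail(partially_read)
print(partially_read.tell())

# A pipe is a file object too, but it does not support seeking.
read_fd, write_fd = os.pipe()
with os.fdopen(read_fd, "rb") as pipe_reader, os.fdopen(write_fd, "wb"):
    pipe_is_seekable = pipe_reader.seekable()
    try:
        rewind_or_fail(pipe_reader)
        rejected = False
    except ValueError:
        rejected = True
print(pipe_is_seekable, rejected)
```

This is the behaviour that would narrow the client's contract: callers passing a non-seekable stream would start getting an exception, which is the reason for preferring the reset on the nova side.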
I think this should be fixed as a nova bug.
I haven't thoroughly checked the protection offered in Glance master; I believe it may have been significantly improved. But in older releases such as 2024.2 this leads to a corrupted shelved image, because upon retry only part of the image object is uploaded, after which the original VM is offloaded (removed) and you end up with lost data.
Now, my issue here is where to submit this bug: glanceclient or nova? This problem is easily fixable in glanceclient by calling image_data.seek(0) in glanceclient.v2.images.Controller.upload(): it makes sense to always point the byte stream to the beginning before initiating an upload, but should it really be the responsibility of a client to perform such a sanity check? I'm also not sure whether there are cases this would break, for example if a non-seekable object is passed into glanceclient, though I don't know if that is even possible.
[0] https://opendev.org/openstack/nova/src/branch/master/nova/virt/libvirt/drive...
[1] https://opendev.org/openstack/nova/src/branch/master/nova/image/glance.py#L1...
[2] https://opendev.org/openstack/glance_store/src/branch/master/glance_store/_d...
[3] https://opendev.org/openstack/nova/src/branch/master/nova/virt/libvirt/drive...