[openstack-dev] [nova] [glance] How to deal with aborted image read?

Robert Collins robertc at robertcollins.net
Sun Jun 7 22:22:17 UTC 2015


On 6 June 2015 at 13:08, Ian Cordasco <ian.cordasco at rackspace.com> wrote:
>
>
> On 6/5/15, 02:55, "Flavio Percoco" <flavio at redhat.com> wrote:
>
>>On 04/06/15 11:46 -0600, Chris Friesen wrote:
>>>On 06/04/2015 03:01 AM, Flavio Percoco wrote:
>>>>On 03/06/15 16:46 -0600, Chris Friesen wrote:
>>>>>We recently ran into an issue where nova couldn't write an image file
>>>>>due to lack of space and so just quit reading from glance.
>>>>>
>>>>>This caused glance to be stuck with an open file descriptor, which
>>>>>meant that the image consumed space even after it was deleted.
>>>>>
>>>>>I have a crude fix for nova at
>>>>>"https://review.openstack.org/#/c/188179/"
>>>>>which basically continues to read the image even though it can't write
>>>>>it. That seems less than ideal for large images though.
>>>>>
>>>>>Is there a better way to do this?  Is there a way for nova to indicate
>>>>>to glance that it's no longer interested in that image and glance can
>>>>>close the file?
>>>>>
>>>>>If I've followed this correctly, on the glance side I think the code in
>>>>>question is ultimately
>>>>>glance_store._drivers.filesystem.ChunkedFile.__iter__().
>>>>
>>>>Actually, to be honest, I was quite confused by the email :P
>>>>
>>>>Correct me if I still didn't understand what you're asking.
>>>>
>>>>You ran out of space on the Nova side while downloading the image and
>>>>there's a file descriptor leak somewhere either in that lovely (sarcasm)
>>>>glance wrapper or in glanceclient.
>>>
>>>The first part is correct, but the file descriptor is actually held by
>>>glance-api.
>>>
>>>>Just by reading your email and glancing at your patch, I believe the bug
>>>>might be in glanceclient but I'd need to dive into this. The piece of
>>>>code you'll need to look into is [0].
>>>>
>>>>glance_store is just used server side. If that's what you meant -
>>>>glance is keeping the request and the ChunkedFile around - then yes,
>>>>glance_store is the place to look into.
>>>>
>>>>[0]
>>>>https://github.com/openstack/python-glanceclient/blob/master/glanceclient/v1/images.py#L152
>>>
>>>I believe what's happening is that the ChunkedFile code opens the file
>>>and creates the iterator.  Nova then starts iterating through the
>>>file.
>>>
>>>If nova (or any other user of glance) iterates all the way through the
>>>file then the ChunkedFile code will hit the "finally" clause in
>>>__iter__() and close the file descriptor.
>>>
>>>If nova starts iterating through the file and then stops (due to
>>>running out of room, for example), the ChunkedFile.__iter__() routine
>>>is left with an open file descriptor.  At this point deleting the
>>>image will not actually free up any space.
>>>
>>>I'm not a glance guy so I could be wrong about the code.  The
>>>externally-visible data are:
>>>1) glance-api is holding an open file descriptor to a deleted image file
>>>2) If I kill glance-api the disk space is freed up.
>>>3) If I modify nova to always finish iterating through the file the
>>>problem doesn't occur in the first place.
>>
>>Gotcha, thanks for explaining. I think the problem is that there might
>>be a reference leak and therefore the FD is kept open. Probably the
>>request interruption is not getting to the driver. I've filed this
>>bug[0] so we can look into it.
>>
>>[0] https://bugs.launchpad.net/glance-store/+bug/1462235
>>
>>Flavio
>>
>>--
>>@flaper87
>>Flavio Percoco
>
> So the problem is with how we use ResponseSerializer and the ChunkedFile
> (https://git.openstack.org/cgit/openstack/glance/tree/glance/api/v2/image_data.py#n222).
> I think the problem we'll have is that webob provides nothing on a Response
> (https://webob.readthedocs.org/en/latest/modules/webob.html#response) to
> hook into so we can close the ChunkedFile.
>
> I wonder whether webob would close the file when the response is closed
> if we used the body_file attribute (I'm assuming that nova/glanceclient
> close the response they use to download the data).

But the maximum leak time is a single GC run, which we don't expect to
be long, unless the server is super quiet (and if it is, the temporary
leak is less likely to be an issue, no?).
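
To make the lifetime question concrete, here is a minimal sketch,
assuming the iterator is a plain Python generator wrapped in
try/finally as Chris described. ChunkedReader and make_image are
hypothetical names for illustration, not the glance_store code. It
shows that a partially consumed generator keeps the file open, and that
closing the generator (which is also what the interpreter does when it
finalizes an unreferenced one) runs the "finally" clause.

```python
import os
import tempfile


class ChunkedReader:
    """Hypothetical stand-in for a chunked file iterator (not glance_store)."""

    def __init__(self, path, chunk_size=4):
        self.fp = open(path, "rb")
        self.chunk_size = chunk_size

    def __iter__(self):
        try:
            while True:
                chunk = self.fp.read(self.chunk_size)
                if not chunk:
                    break
                yield chunk
        finally:
            # Runs on exhaustion, on generator.close(), or when the
            # suspended generator is finalized by the interpreter.
            self.fp.close()


def make_image(size=16):
    """Write a small throwaway file to play the role of an image."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"x" * size)
    return path


path = make_image()

# Case 1: the consumer reads everything, so "finally" runs.
full = ChunkedReader(path)
chunks = list(full)
closed_after_full_read = full.fp.closed

# Case 2: the consumer stops after one chunk but the iterator survives.
partial = ChunkedReader(path)
it = iter(partial)
first = next(it)
closed_while_abandoned = partial.fp.closed

# Closing the generator raises GeneratorExit at the suspended yield,
# which triggers the "finally" block and closes the file.
it.close()
closed_after_close = partial.fp.closed

os.remove(path)
```

If something keeps a reference to the suspended generator, neither
close() nor finalization happens, and the descriptor stays open for as
long as that reference lives.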

I wonder if there's actually something else going on here. E.g. a
broken LB in front of glance-api which is preventing the HTTP
connection termination from being detected, and the thread is staying
open-and-stalled. That would explain the symptoms just as well - and
is deployer specific so could also explain the (perceived) trickiness
in reproduction.

-Rob

-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Converged Cloud


