[openstack-dev] Improvement of Cinder API wrt https://bugs.launchpad.net/nova/+bug/1213953

Avishay Traeger AVISHAY at il.ibm.com
Tue Nov 5 07:27:23 UTC 2013


So while doubling the timeout will fix some cases, there will be cases with
larger volumes and/or slower systems where the bug will still hit.  Even
timing out on the download progress can lead to unnecessary timeouts (if
it's really slow, or the volume is really big, it can stay at 5% for some
time).
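To make the progress-based variant concrete (reset the timer whenever the copied byte count changes, as with the 'status_detail' idea in the quoted proposal), here is a minimal Python sketch. `get_progress` is a hypothetical callable returning `(status, bytes_copied)` that stands in for the proposed field. The clock and sleep functions are injectable only so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable, Tuple


class StallTimeout(Exception):
    """Raised when reported progress has not changed within the stall window."""


def wait_with_stall_timeout(
    get_progress: Callable[[], Tuple[str, int]],
    stall_window: float,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll while the volume is 'downloading', timing out only on a stall.

    get_progress is HYPOTHETICAL: it stands in for the proposed
    'status_detail' field and returns (status, bytes_copied).
    """
    status, last = get_progress()
    last_change = clock()
    while status == "downloading":
        # Only time out if the byte count has been flat for too long.
        if clock() - last_change > stall_window:
            raise StallTimeout(f"no progress for {stall_window}s at {last} bytes")
        sleep(interval)
        status, copied = get_progress()
        if copied != last:
            # Any movement resets the stall timer.
            last, last_change = copied, clock()
    return status
```

This still misfires in the case described above: if a slow but healthy copy sits at the same byte count for longer than `stall_window`, `StallTimeout` is raised even though the download would eventually have finished.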

I think the proper fix is to make sure that Cinder moves the volume into
the 'error' state in every case where there is an error.  Nova can then
poll for as long as the volume is in the 'downloading' state, until it
becomes 'available' or 'error'.  Is there a reason why Cinder would
legitimately get stuck in 'downloading'?
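A minimal Python sketch of that polling loop, assuming a hypothetical `get_status` callable that returns the volume's current status string (in Nova this would wrap a Cinder API lookup; that wiring is not shown). The set of statuses treated as "still in progress" is also an assumption. `max_wait` is an optional safety cap added only for illustration and is off by default, which is safe only if Cinder really does move every failed operation to 'error'.

```python
import time
from typing import Callable, Optional

# Assumed set of statuses that mean "keep waiting".
TRANSITIONAL = {"creating", "downloading"}


class VolumeError(Exception):
    """Raised when the volume ends up in the 'error' state."""


def wait_for_volume(
    get_status: Callable[[], str],  # HYPOTHETICAL status lookup
    interval: float = 1.0,
    max_wait: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    start = clock()
    status = get_status()
    # Wait on state transitions, not on a fixed deadline.
    while status in TRANSITIONAL:
        if max_wait is not None and clock() - start > max_wait:
            raise TimeoutError(f"volume still {status!r} after {max_wait}s")
        sleep(interval)
        status = get_status()
    if status == "error":
        raise VolumeError("volume creation failed")
    if status != "available":
        raise ValueError(f"unexpected status {status!r}")
    return status
```

With this shape, a long download is never cut off as long as it eventually reaches a terminal state; the only remaining failure mode is a volume that stays in 'downloading' forever, which is the question above.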

Thanks,
Avishay



From:	John Griffith <john.griffith at solidfire.com>
To:	"OpenStack Development Mailing List (not for usage questions)"
            <openstack-dev at lists.openstack.org>,
Date:	11/05/2013 07:41 AM
Subject:	Re: [openstack-dev] Improvement of Cinder API wrt
            https://bugs.launchpad.net/nova/+bug/1213953



On Tue, Nov 5, 2013 at 7:27 AM, John Griffith
<john.griffith at solidfire.com> wrote:
> On Tue, Nov 5, 2013 at 6:29 AM, Chris Friesen
> <chris.friesen at windriver.com> wrote:
>> On 11/04/2013 03:49 PM, Solly Ross wrote:
>>>
>>> So, there's currently an outstanding issue with regard to a Nova
>>> shortcut command that creates a volume from an image and then boots
>>> from it in one fell swoop.  The gist of the issue is that a fixed
>>> timeout (designed to catch errors) can expire before volume creation
>>> has finished when the image download or volume creation takes an
>>> extended period of time (e.g. under a Gluster backend for Cinder with
>>> certain network conditions).
>>>
>>> The proposed solution is a modification to the Cinder API to provide
>>> more detail on what exactly is going on, so that we could
>>> programmatically tune the timeout.  My initial thought is to create a
>>> new column in the Volume table called 'status_detail' to provide more
>>> detailed information about the current status.  For instance, for the
>>> 'downloading' status, we could have 'status_detail' be the completion
>>> percentage or JSON containing the total size and the current amount
>>> copied.  This way, at each interval we could check to see if the
>>> amount copied had changed, and trigger the timeout if it had not,
>>> instead of blindly assuming that the operation will complete within a
>>> given amount of time.
>>>
>>> What do people think?  Would there be a better way to do this?
>>
>>
>> The only other option I can think of would be some kind of callback that
>> Cinder could explicitly call to drive updates and/or notifications of
>> faults rather than needing to wait for a timeout.  Possibly a combination
>> of both would be best; that way you could add a --poll option to the
>> "create volume and boot" CLI command.
>>
>> I come from the kernel-hacking world and most things there involve
>> event-driven callbacks.  Looking at the OpenStack code I was kind of
>> surprised to see hardcoded timeouts and RPC casts with no callbacks to
>> indicate completion.
>>
>> Chris
>>
>>
>> _______________________________________________
>> OpenStack-dev mailing list
>> OpenStack-dev at lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

I believe you're referring to [1], which was closed after a patch was
added to Nova to double the timeout length.  Based on the comments, it
sounds like you're still seeing issues on some Gluster (and maybe other)
setups?

Rather than mess with the API in order to debug, why don't you use the
info in the Cinder logs?

[1] https://bugs.launchpad.net/nova/+bug/1213953
