[openstack-dev] [nova][cinder] what are the key errors with volume detach

Matt Riedemann mriedem at linux.vnet.ibm.com
Fri Dec 18 14:25:14 UTC 2015



On 12/17/2015 9:24 AM, Matt Riedemann wrote:
>
>
> On 12/17/2015 8:51 AM, Andrea Rosa wrote:
>>
>>>> The communication with cinder is async; Nova doesn't wait for or check
>>>> whether the detach on the cinder side has completed correctly.
>>>
>>> Yeah, I guess nova gets the 202 back:
>>>
>>> http://logs.openstack.org/18/258118/2/check/gate-tempest-dsvm-full-ceph/7a5290d/logs/screen-n-cpu.txt.gz#_2015-12-16_03_30_43_990
>>>
>>>
>>>
>>> Should nova be waiting for detach to complete before it tries deleting
>>> the volume (in the case that delete_on_termination=True in the bdm)?
>>>
>>> Should nova be waiting (regardless of volume delete) for the volume
>>> detach to complete - or timeout and fail the instance delete if it
>>> doesn't?
>>
>> I'll revisit this change next year trying to look at the problem in a
>> different way.
>> Thank you all for your time and all the suggestions.
>> --
>> Andrea Rosa
>>
>
> I had a quick discussion with hemna this morning and he confirmed that
> nova should be waiting for os-detach to complete before we try to delete
> the volume, because if the volume status isn't 'available' the delete
> will fail.
>
> Also, if nova is hitting a failure to delete the volume it's swallowing
> it by passing raise_exc=False to _cleanup_volumes here [1]. Then we go
> on our merry way and delete the bdms in the nova database [2]. But I'd
> think at that point we're orphaning volumes in cinder that think they
> are still attached.
>
> If this is passing today it's probably just luck that we're getting the
> volume detached fast enough before we try to delete it.
>
> [1]
> https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L2425-L2426
>
> [2]
> https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L909
>

I've confirmed that we definitely race in the gate between detaching the
volume and deleting it: we fail to delete the volume about 28K times
a week in the gate [1].

I've opened a bug [2] to track fixing this.
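
For illustration only, here is a rough sketch of the "wait for detach, then
delete" ordering discussed above. This is not nova's code and not the fix
tracked in the bug. The volume_api object and its detach/get_status/delete
methods, the VolumeDetachTimeout exception, and the timeout and interval
defaults are all hypothetical names and values assumed for the example. The
only thing taken from the thread is that the delete should not be attempted
until the volume status is 'available'.

```python
import time


class VolumeDetachTimeout(Exception):
    """Hypothetical error: the volume never reached 'available' in time."""


def wait_for_volume_available(get_status, volume_id, timeout=60.0,
                              interval=1.0, sleep=time.sleep,
                              clock=time.monotonic):
    """Poll get_status(volume_id) until it returns 'available'.

    get_status is an assumed callable standing in for a volume status
    lookup. Raises VolumeDetachTimeout once the deadline has passed.
    """
    deadline = clock() + timeout
    while True:
        status = get_status(volume_id)
        if status == 'available':
            return
        if clock() >= deadline:
            raise VolumeDetachTimeout(
                'volume %s still %r after %s seconds'
                % (volume_id, status, timeout))
        sleep(interval)


def detach_and_delete(volume_api, volume_id, **wait_kwargs):
    """Detach, wait for the detach to finish, and only then delete.

    volume_api is a hypothetical wrapper; detach() is assumed to return
    as soon as the request is accepted (the async 202 case).
    """
    volume_api.detach(volume_id)
    wait_for_volume_available(volume_api.get_status, volume_id,
                              **wait_kwargs)
    # Reached only if the volume is 'available', so the delete should
    # not be rejected because of the volume status.
    volume_api.delete(volume_id)
```

With this ordering, a timeout surfaces as an exception the caller has to
handle, rather than a delete failure that gets swallowed before the bdms are
removed.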

[1] 
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22Failed%20to%20delete%20volume%5C%22%20AND%20message:%5C%22due%20to%5C%22%20AND%20tags:%5C%22screen-n-cpu.txt%5C%22
[2] https://bugs.launchpad.net/nova/+bug/1527623

-- 

Thanks,

Matt Riedemann



