[openstack-dev] [nova][cinder] what are the key errors with volume detach

Matt Riedemann mriedem at linux.vnet.ibm.com
Wed Dec 16 16:41:24 UTC 2015



On 12/14/2015 11:24 AM, Andrea Rosa wrote:
>
>
> On 10/12/15 15:29, Matt Riedemann wrote:
>
>>> In a simplified view of a volume detach we can say that the nova code
>>> does:
>>> 1 detach the volume from the instance
>>> 2 inform cinder about the detach and call terminate_connection on
>>> the cinder API
>>> 3 delete the bdm record in the nova DB
>>
>> We actually:
>>
>> 1. terminate the connection in cinder:
>>
>> https://github.com/openstack/nova/blob/c4ca1abb4a49bf0bce765acd3ce906bd117ce9b7/nova/compute/manager.py#L2312
>>
>>
>> 2. detach the volume
>>
>> https://github.com/openstack/nova/blob/c4ca1abb4a49bf0bce765acd3ce906bd117ce9b7/nova/compute/manager.py#L2315
>>
>>
>> 3. delete the volume (if marked for delete_on_termination):
>>
>> https://github.com/openstack/nova/blob/c4ca1abb4a49bf0bce765acd3ce906bd117ce9b7/nova/compute/manager.py#L2348
>>
>>
>> 4. delete the bdm in the nova db:
>>
>> https://github.com/openstack/nova/blob/c4ca1abb4a49bf0bce765acd3ce906bd117ce9b7/nova/compute/manager.py#L908
>>
>>
>
> I am confused here, why are you referring to the _shutdown_instance
> code?

Because that's the code in the compute manager that calls cinder to 
terminate the connection to the storage backend and detaches the volume 
from the instance, which you pointed out in your email as part of 
terminating the instance.
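
To make the ordering concrete, here is a rough sketch of that per-volume 
path (illustrative only, not the actual nova code - the volume_api method 
names follow the nova.volume.cinder.API wrapper):

    def _detach_and_cleanup_volume(volume_api, driver, context, instance, bdm):
        # 1. Tell cinder to tear down the connection to the storage backend.
        connector = driver.get_volume_connector(instance)
        volume_api.terminate_connection(context, bdm.volume_id, connector)

        # 2. Mark the volume as detached on the cinder side.
        volume_api.detach(context, bdm.volume_id)

        # 3. Delete the volume if the bdm asked for it.
        if bdm.delete_on_termination:
            volume_api.delete(context, bdm.volume_id)

        # 4. Finally remove the bdm record from the nova DB.
        bdm.destroy()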

>
>
>> So if terminate_connection fails, we shouldn't get to detach. And if
>> detach fails, we shouldn't get to delete.
>>
>>>
>>> If 2 fails the volume gets stuck in a 'detaching' status and any further
>>> attempt to delete or detach the volume will fail:
>>> "Delete for volume <volume_id> failed: Volume <volume_id> is still
>>> attached, detach volume first. (HTTP 400)"
>>>
>>> And if you try to detach:
>>> "EROR (BadRequest): Invalid input received: Invalid volume: Unable to
>>> detach volume. Volume status must be 'in-use' and attach_status must
>>> be 'attached' to detach. Currently: status: 'detaching',
>>> attach_status: 'attached.' (HTTP 400)"
>>>
>>> At the moment the only way to clean up the situation is to hack the
>>> nova DB to delete the bdm record and to do some hacking on the cinder
>>> side as well.
>>> We wanted a way to clean up the situation that avoids the manual hack
>>> to the nova DB.
>>
>> Can't cinder roll back state somehow if it's bogus or an operation
>> failed? For example, if detach failed, shouldn't we not be in
>> 'detaching' state? This is like auto-reverting task_state on server
>> instances when an operation fails so that we can reset or delete those
>> servers if needed.
>
> I think that is an option, but it is probably part of the redesign of the
> cinder API (see solution proposed #3). It would be nice to get the
> cinder guys commenting here.
>
>>> Solution proposed #3
>>> Ok, so the solution is to fix the Cinder API and make the interaction
>>> between the Nova volume manager and that API robust.
>>> This time I was right (YAY), but as you can imagine this fix is not
>>> going to be an easy one, and after talking with the Cinder guys they
>>> clearly told me that it is going to be a massive change in the
>>> Cinder API and is unlikely to land in the N(utella) or O(melette)
>>> release.
>
>> As Sean pointed out in another reply, I feel like what we're really
>> missing here is some rollback code in the case that delete fails, so we
>> don't get into this stuck state and have to rely on deleting the BDMs
>> manually in the database just to delete the instance.
>>
>> We should roll back on delete failure 1 so that delete request 2 can pass
>> the 'check attach' checks again.
>
> The communication with cinder is async; Nova doesn't wait or check whether
> the detach on the cinder side has been executed correctly.

Yeah, I guess nova gets the 202 back:

http://logs.openstack.org/18/258118/2/check/gate-tempest-dsvm-full-ceph/7a5290d/logs/screen-n-cpu.txt.gz#_2015-12-16_03_30_43_990

Should nova be waiting for the detach to complete before it tries deleting 
the volume (in the case that delete_on_termination=True in the bdm)?

Should nova be waiting (regardless of volume delete) for the volume 
detach to complete - or time out and fail the instance delete if it doesn't?
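
If we did wait, I'm picturing something like the sketch below - a 
hypothetical helper (the name, timeout and polling interval are made up for 
illustration) that polls cinder until the volume actually leaves 
'detaching', so the caller can fail the instance delete instead of leaking 
state:

    import time

    def _await_volume_detach(volume_api, context, volume_id,
                             timeout=300, interval=2):
        # Hypothetical helper, not existing nova code.
        deadline = time.time() + timeout
        while time.time() < deadline:
            volume = volume_api.get(context, volume_id)
            if volume['status'] == 'available':
                return
            if volume['status'] == 'error_detaching':
                raise Exception('Detach failed for volume %s' % volume_id)
            time.sleep(interval)
        raise Exception('Timed out waiting for volume %s to detach'
                        % volume_id)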

>
> Thanks
> --
> Andrea Rosa
>

-- 

Thanks,

Matt Riedemann

