On 12/17/2015 9:24 AM, Matt Riedemann wrote:
>
> On 12/17/2015 8:51 AM, Andrea Rosa wrote:
>>
>>>> The communication with cinder is async; Nova doesn't wait or check
>>>> whether the detach on the cinder side has been executed correctly.
>>>
>>> Yeah, I guess nova gets the 202 back:
>>>
>>> http://logs.openstack.org/18/258118/2/check/gate-tempest-dsvm-full-ceph/7a5290d/logs/screen-n-cpu.txt.gz#_2015-12-16_03_30_43_990
>>>
>>> Should nova be waiting for the detach to complete before it tries
>>> deleting the volume (in the case that delete_on_termination=True in
>>> the bdm)?
>>>
>>> Should nova be waiting (regardless of volume delete) for the volume
>>> detach to complete - or time out and fail the instance delete if it
>>> doesn't?
>>
>> I'll revisit this change next year and try to look at the problem in a
>> different way.
>> Thank you all for your time and all the suggestions.
>> --
>> Andrea Rosa
>>
>> __________________________________________________________________________
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
> I had a quick discussion with hemna this morning and he confirmed that
> nova should be waiting for os-detach to complete before we try to
> delete the volume, because if the volume status isn't 'available' the
> delete will fail.
>
> Also, if nova hits a failure to delete the volume, it swallows it by
> passing raise_exc=False to _cleanup_volumes here [1]. Then we go on our
> merry way and delete the bdms in the nova database [2]. But I'd think
> at that point we're orphaning volumes in cinder that think they are
> still attached.
>
> If this is passing today, it's probably just luck that the volume is
> detached fast enough before we try to delete it.
>
> [1] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L2425-L2426
>
> [2] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L909

I've confirmed that we definitely race in the gate between detaching the
volume and then deleting it; we fail to delete the volume about 28K times
a week in the gate [1]. I've opened a bug [2] to track fixing this.

[1] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22Failed%20to%20delete%20volume%5C%22%20AND%20message:%5C%22due%20to%5C%22%20AND%20tags:%5C%22screen-n-cpu.txt%5C%22

[2] https://bugs.launchpad.net/nova/+bug/1527623

--

Thanks,

Matt Riedemann
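
P.S. Just to illustrate the kind of wait-for-detach discussed above, a
rough sketch might look something like this (the volume_api object, the
status strings and the timeout here are placeholders for illustration;
this isn't the actual nova code or a proposed patch):

    import time

    def _delete_volume_when_detached(volume_api, context, volume_id,
                                     timeout=60, interval=2):
        # Poll cinder until the volume reports 'available', then issue
        # the delete; give up after 'timeout' seconds instead of racing
        # the detach like we do today.
        deadline = time.time() + timeout
        while time.time() < deadline:
            volume = volume_api.get(context, volume_id)
            status = volume['status']
            if status == 'available':
                volume_api.delete(context, volume_id)
                return
            if status == 'error_detaching':
                raise RuntimeError('detach of volume %s failed'
                                   % volume_id)
            time.sleep(interval)
        raise RuntimeError('timed out waiting for volume %s to become '
                           'available' % volume_id)

The real change would presumably live in the compute manager's cleanup
path, and would still need to decide what to do with the bdms and the
instance delete if the wait times out.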