Cinder snapshot delete successful when expected to fail
Tony Pearce
tony.pearce at cinglevue.com
Mon Jan 20 08:57:19 UTC 2020
Hi all,
I made some progress on this, but I am unsure how to make it better. Long
story short:
- queens issue = the snapshot is deleted from openstack but shouldn't be,
because the snapshot cannot be deleted on the storage side
- compared the pike / queens / stein "nimble.py" and found all three are
different, with each newer version of openstack adding code
- did some trial-and-error tests and decided to use the stein nimble.py
and modify it
- found the 'delete snapshot' section and saw it is written not to halt on
error
- added a 'raise' into the function and re-tested the 'delete snapshot'
scenario
- RESULT = the snapshot is now NOT deleted but goes to "error deleting"
instead :)
So after making that change, the snapshot is now in an unavailable status.
I am looking for a way to do something other than putting the snapshot into
that unavailable condition, such as displaying a message while keeping the
snapshot "available", because it can still be used.
Short story long:
The "nimble.py" driver changes between pike,queens,stein versions (though
within the file it has "driver version 4.0.1" on all). Pike has around 1700
lines. Queens has 1900 and Stein has 1910 approx.
I confirmed the issue with the driver by copying the nimble.py driver (and
the other 2 files named nimble.pyc and nimble.pyo) from Pike into the
Queens test env. to test if the snapshot still gets deleted under Queens or
shows an error instead. The snapshot was not deleted and it goes error
status as expected.
note: Initially I copied only the text of nimble.py, and the update to the
text file appeared to be ignored. It looks like Python loads one of the
compiled .pyc or .pyo files instead of the .py when they are present. So if
I make any changes to the nimble.py file I need to re-generate those .pyc
and .pyo files from the .py.
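For reference, this is a minimal sketch of how I regenerate them on Python
2.7 after editing the driver. The path is only an example based on the
usual RDO layout on my host, so adjust it as needed; the .pyo is only
produced when the interpreter runs with -O:

import compileall

# Rebuild nimble.pyc from the edited nimble.py; force=True skips the
# timestamp check. Run the same call under "python -O" to rebuild
# nimble.pyo as well.
compileall.compile_file(
    '/usr/lib/python2.7/site-packages/cinder/volume/drivers/nimble.py',
    force=True)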
So what is happening here is: I try to delete a snapshot that has a clone.
The expected outcome is that the snapshot is not deleted in Openstack.
What currently happens is that Openstack deletes the snapshot from the
volume snapshots, leaving the snapshot behind on the array storage side.
In the volume.log I can see the array sends back an error 409 with a "has a
clone" response. I managed to find which section of the nimble.py driver
prints that error into the volume.log, so I edited the text that gets
printed and re-ran the test. The volume.log now shows the 'DELETE' and
'Response' words I added:
: NimbleAPIException: DELETE Failed to execute api
snapshots/0464a5fd65d75fcfe1000000000000011d00001b1d: Response Error Code:
409 Message: Snapshot snapshot-4ee076ad-2e14-4d0d-bc20-64c17c741f8c for
volume volume-7dd64cf1-1d56-4f21-a153-a8137b68c557 has a clone.
This is the python code where I made those changes: basically, as I
understand it, if the response code is not 200 or 201 then it raises the
error.
> def delete(self, api):
>     url = self.uri + api
>     r = requests.delete(url, headers=self.headers, verify=self.verify)
>     if r.status_code != 201 and r.status_code != 200:
>         base = "DELETE Failed to execute api %(api)s: Response Error Code: %(code)s" % {
>             'api': api,
>             'code': r.status_code}
>         LOG.debug("Base error : %(base)s", {'base': base})
>         try:
>             msg = _("%(base)s Message: %(msg)s") % {
>                 'base': base,
>                 'msg': r.json()['messages'][1]['text']}
>         except IndexError:
>             msg = _("%(base)s Message: %(msg)s") % {
>                 'base': base,
>                 'msg': six.text_type(r.json())}
>         raise NimbleAPIException(msg)
>     return r.json()
However, slightly before the above in nimble.py I think I have found the
function that is causing the initial problem:
> def delete_snap(self, volume_name, snap_name):
>     snap_info = self.get_snap_info(snap_name, volume_name)
>     api = "snapshots/" + six.text_type(snap_info['id'])
>     try:
>         self.delete(api)
>     except NimbleAPIException as ex:
>         LOG.debug("delete snapshot exception: %s", ex)
>         if SM_OBJ_HAS_CLONE in six.text_type(ex):
>             # if snap has a clone log the error and continue ahead
>             LOG.warning('Snapshot %(snap)s : %(state)s',
>                         {'snap': snap_name,
>                          'state': SM_OBJ_HAS_CLONE})
>         else:
>             raise
SM_OBJ_HAS_CLONE matches the text "has a clone" and is defined at the
beginning of the file: SM_OBJ_HAS_CLONE = "has a clone". I can see that
same "has a clone" text in the log file with the 409 response.
My problem is literally "# if snap has a clone log the error and continue
ahead" - it shouldn't be continuing, because by continuing it deletes the
snapshot on the Openstack side while being unable to do the same on the
storage side because of the dependency issue.
So what I did next was to look at the "delete volume" section for some
help, because a similar condition can occur there. To explain: if volume2
is a clone of volume1, then we can't delete volume1 until we first delete
volume2.
What I notice is that there is a "raise" at the end of that section, which
I understand to mean the exception is thrown up to openstack, i.e. to cause
an error condition (see the note after the code below).
Here's the delete volume section from the driver:
def delete_volume(self, volume):
    """Delete the specified volume."""
    backup_snap_name, backup_vol_name = self.is_volume_backup_clone(volume)
    eventlet.sleep(DEFAULT_SLEEP)
    self.APIExecutor.online_vol(volume['name'], False)
    LOG.debug("Deleting volume %(vol)s", {'vol': volume['name']})

    @utils.retry(NimbleAPIException, retries=3)
    def _retry_remove_vol(volume):
        self.APIExecutor.delete_vol(volume['name'])
    try:
        _retry_remove_vol(volume)
    except NimbleAPIException as ex:
        LOG.debug("delete volume exception: %s", ex)
        if SM_OBJ_HAS_CLONE in six.text_type(ex):
            LOG.warning('Volume %(vol)s : %(state)s',
                        {'vol': volume['name'],
                         'state': SM_OBJ_HAS_CLONE})
            # set the volume back to be online and raise busy exception
            self.APIExecutor.online_vol(volume['name'], True)
            raise exception.VolumeIsBusy(volume_name=volume['name'])
        raise
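As far as I can tell, the two raises are handled differently one level up
in the Cinder volume manager: a "busy" exception puts the volume back to
"available", while any other driver exception marks it "error_deleting".
The sketch below is only my paraphrase of that handling, not the actual
manager code, so treat the details as an assumption:

# Rough paraphrase of the manager-side handling of a volume delete
# (assumption, not the exact cinder/volume/manager.py code):
try:
    self.driver.delete_volume(volume)
except exception.VolumeIsBusy:
    # the driver says the volume still has dependents: keep it usable
    volume.status = 'available'
    volume.save()
except Exception:
    # any other driver exception: mark the delete as failed
    volume.status = 'error_deleting'
    volume.save()
    raise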
So with the above, I modified the delete snapshot section and put in a
simple "raise" like this (the added raise is the line right after the
LOG.warning call):
> def delete_snap(self, volume_name, snap_name):
>     snap_info = self.get_snap_info(snap_name, volume_name)
>     api = "snapshots/" + six.text_type(snap_info['id'])
>     try:
>         self.delete(api)
>     except NimbleAPIException as ex:
>         LOG.debug("delete snapshot exception: %s", ex)
>         if SM_OBJ_HAS_CLONE in six.text_type(ex):
>             # if snap has a clone log the error and continue ahead
>             LOG.warning('Snapshot %(snap)s : %(state)s',
>                         {'snap': snap_name,
>                          'state': SM_OBJ_HAS_CLONE})
>             raise
>         else:
>             raise
And now when I test, the snapshot is not deleted but instead goes into
"error_deleting". It's not perfect, but at least I can set the snapshot
back to "available" from the admin section within Openstack.
Would anyone be able to give me some pointers on how to surface this error
without putting the snapshot into "error"? Do I need to create an exception
class for this?
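To make the question concrete, here is a minimal sketch of what I imagine
the fix looks like, modelled on the VolumeIsBusy handling in delete_volume
above. It assumes the Cinder volume manager handles
exception.SnapshotIsBusy the same way, i.e. it logs the failure and puts
the snapshot back to "available" rather than "error_deleting" - I have not
verified that path myself, so treat it as an assumption:

def delete_snap(self, volume_name, snap_name):
    snap_info = self.get_snap_info(snap_name, volume_name)
    api = "snapshots/" + six.text_type(snap_info['id'])
    try:
        self.delete(api)
    except NimbleAPIException as ex:
        LOG.debug("delete snapshot exception: %s", ex)
        if SM_OBJ_HAS_CLONE in six.text_type(ex):
            # snapshot still has a clone: tell the volume manager it is
            # busy instead of swallowing the error, so the snapshot stays
            # usable (assumption: the manager resets a busy snapshot to
            # "available")
            raise exception.SnapshotIsBusy(snapshot_name=snap_name)
        raise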
regards
*Tony Pearce* |
*Senior Network Engineer / Infrastructure Lead**Cinglevue International
<https://www.cinglevue.com>*
Email: tony.pearce at cinglevue.com
Web: http://www.cinglevue.com
*Australia*
1 Walsh Loop, Joondalup, WA 6027 Australia.
Direct: +61 8 6202 0036 | Main: +61 8 6202 0024
Note: This email and all attachments are the sole property of Cinglevue
International Pty Ltd. (or any of its subsidiary entities), and the
information contained herein must be considered confidential, unless
specified otherwise. If you are not the intended recipient, you must not
use or forward the information contained in these documents. If you have
received this message in error, please delete the email and notify the
sender.
On Sat, 18 Jan 2020 at 09:44, Tony Pearce <tony.pearce at cinglevue.com> wrote:
> Thank you. That really helps.
>
> I am going to diff the nimble.py files between Pike and Queens and see
> what's changed.
>
> On Fri, 17 Jan 2020, 22:18 Alan Bishop, <abishop at redhat.com> wrote:
>
>>
>>
>> On Fri, Jan 17, 2020 at 2:01 AM Tony Pearce <tony.pearce at cinglevue.com>
>> wrote:
>>
>>> Could anyone help by pointing me where to go to be able to dig into this
>>> issue further?
>>>
>>> I have installed a test Openstack environment using RDO Packstack. I
>>> wanted to install the same version that I have in Production (Pike) but
>>> it's not listed in the CentOS repo via yum search. So I installed Queens. I
>>> am using nimble.py Cinder driver. Nimble Storage is a storage array
>>> accessed via iscsi from the Openstack host, and is controlled from
>>> Openstack by the driver and API.
>>>
>>> *What I expected to happen:*
>>> 1. create an instance with volume (the volume is created on the storage
>>> array successfully and instance boots from it)
>>> 2. take a snapshot (snapshot taken on the volume on the array
>>> successfully)
>>> 3. create a new instance from the snapshot (the api tells the array to
>>> clone the snapshot into a new volume on the array and use that volume for
>>> the instance)
>>> 4. try and delete the snapshot
>>> Expected Result - Openstack gives the user a message like "you're not
>>> allowed to do that".
>>>
>>> Note: Step 3 above creates a child volume from the parent snapshot.
>>> It's impossible to delete the parent snapshot because IO READ is sent to
>>> that part of the original volume (as I understand it).
>>>
>>> *My production problem is this: *
>>> 1. create an instance with volume (the volume is created on the storage
>>> array successfully)
>>> 2. take a snapshot (snapshot taken on the volume on the array
>>> successfully)
>>> 3. create a new instance from the snapshot (the api tells the array to
>>> clone the snapshot into a new volume on the array and use that volume for
>>> the instance)
>>> 4. try and delete the snapshot
>>> Result - snapshot goes into error state and later, all Cinder operations
>>> fail such as new instance/create volume etc. until the correct service is
>>> restarted. Then everything works once again.
>>>
>>>
>>> To troubleshoot the above, I installed RDO Packstack Queens (because
>>> I couldn't get Pike). I tested the above and now the result is that the
>>> snapshot is successfully deleted from openstack but not deleted on the
>>> array. The log is below for reference, but I can see in the log that the
>>> array sends back info to openstack saying the snapshot has a clone and the
>>> delete cannot be done because of that. Also response code 409.
>>>
>>> *Some info about why the problem with Pike started in the first place*
>>> 1. Vendor is Nimble Storage which HPE purchased
>>> 2. HPE/Nimble have dropped support for openstack. Latest supported
>>> version is Queens and Nimble array version v4.x. The current Array version
>>> is v5.x. Nimble say there are no guarantees with openstack, the driver and
>>> the array version v5.x
>>> 3. I was previously advised by Nimble that array version v5.x would work
>>> fine and so left our DR array on v5.x, with a pending upgrade that had a
>>> blocker due to an issue. That issue was resolved in December, and the
>>> pending upgrade to match the DR array took place around 30 days ago.
>>>
>>>
>>> With regards to the production issue, I assumed that the array API has
>>> some changes between v4.x and v5.x and that this is causing an issue with
>>> Cinder due to the API response. However, I have not been able to find out
>>> if or what changes may have occurred after the array upgrade, as the
>>> documentation for this is Nimble internal-only.
>>>
>>>
>>> *So with that - some questions if I may:*
>>> When Openstack got the 409 error response from the API (as seen in the
>>> log below), why would Openstack then proceed to delete the snapshot on the
>>> Openstack side? How could I debug this further? I'm not sure what Openstack
>>> Cinder is acting on in terms of the response as yet. Maybe Openstack is not
>>> specifically looking for the error code in the response?
>>>
>>> The snapshot that got deleted on the openstack side is a problem. Would
>>> this be related to the driver? Could it be possible that the driver did not
>>> pass the error response to Cinder?
>>>
>>
>> Hi Tony,
>>
>> This is exactly what happened, and it appears to be a driver bug
>> introduced in queens by [1]. The code in question [2] logs the error, but
>> fails to propagate the exception. As far as the volume manager is
>> concerned, the snapshot deletion was successful.
>>
>> [1] https://review.opendev.org/601492
>> [2]
>> https://opendev.org/openstack/cinder/src/branch/stable/queens/cinder/volume/drivers/nimble.py#L1815
>>
>> Alan
>>
>> Thanks in advance. Just for reference, the log snippet is below.
>>>
>>>
>>> ==> volume.log <==
>>>> 2020-01-17 16:53:23.718 24723 WARNING py.warnings
>>>> [req-60fe4335-af66-4c46-9bbd-2408bf4d6f07 df7548ecad684f26b8bc802ba63a9814
>>>> 87e34c89e6fb41d2af25085b64011a55 - default default]
>>>> /usr/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py:852:
>>>> InsecureRequestWarning: Unverified HTTPS request is being made. Adding
>>>> certificate verification is strongly advised. See:
>>>> https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
>>>> InsecureRequestWarning)
>>>> : NimbleAPIException: Failed to execute api
>>>> snapshots/0464a5fd65d75fcfe1000000000000011100001a41: Error Code: 409
>>>> Message: Snapshot snapshot-3806efc5-65ca-495a-a87a-baaddc9607d9 for volume
>>>> volume-5b02db35-8d5c-4ef6-b0e7-2f9b62cac57e has a clone.
>>>> ==> api.log <==
>>>> 2020-01-17 16:53:23.769 25242 INFO cinder.api.openstack.wsgi
>>>> [req-bfcbff34-134b-497e-82b1-082d48f8767f df7548ecad684f26b8bc802ba63a9814
>>>> 87e34c89e6fb41d2af25085b64011a55 - default default]
>>>> http://192.168.53.45:8776/v3/87e34c89e6fb41d2af25085b64011a55/volumes/detail
>>>> returned with HTTP 200
>>>> 2020-01-17 16:53:23.770 25242 INFO eventlet.wsgi.server
>>>> [req-bfcbff34-134b-497e-82b1-082d48f8767f df7548ecad684f26b8bc802ba63a9814
>>>> 87e34c89e6fb41d2af25085b64011a55 - default default] 192.168.53.45 "GET
>>>> /v3/87e34c89e6fb41d2af25085b64011a55/volumes/detail HTTP/1.1" status: 200
>>>> len: 4657 time: 0.1152730
>>>> ==> volume.log <==
>>>> 2020-01-17 16:53:23.811 24723 WARNING py.warnings
>>>> [req-60fe4335-af66-4c46-9bbd-2408bf4d6f07 df7548ecad684f26b8bc802ba63a9814
>>>> 87e34c89e6fb41d2af25085b64011a55 - default default]
>>>> /usr/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py:852:
>>>> InsecureRequestWarning: Unverified HTTPS request is being made. Adding
>>>> certificate verification is strongly advised. See:
>>>> https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
>>>> InsecureRequestWarning)
>>>> : NimbleAPIException: Failed to execute api
>>>> snapshots/0464a5fd65d75fcfe1000000000000011100001a41: Error Code: 409
>>>> Message: Snapshot snapshot-3806efc5-65ca-495a-a87a-baaddc9607d9 for volume
>>>> volume-5b02db35-8d5c-4ef6-b0e7-2f9b62cac57e has a clone.
>>>> 2020-01-17 16:53:23.902 24723 ERROR cinder.volume.drivers.nimble
>>>> [req-60fe4335-af66-4c46-9bbd-2408bf4d6f07 df7548ecad684f26b8bc802ba63a9814
>>>> 87e34c89e6fb41d2af25085b64011a55 - default default] Re-throwing Exception
>>>> Failed to execute api snapshots/0464a5fd65d75fcfe1000000000000011100001a41:
>>>> Error Code: 409 Message: Snapshot
>>>> snapshot-3806efc5-65ca-495a-a87a-baaddc9607d9 for volume
>>>> volume-5b02db35-8d5c-4ef6-b0e7-2f9b62cac57e has a clone.:
>>>> NimbleAPIException: Failed to execute api
>>>> snapshots/0464a5fd65d75fcfe1000000000000011100001a41: Error Code: 409
>>>> Message: Snapshot snapshot-3806efc5-65ca-495a-a87a-baaddc9607d9 for volume
>>>> volume-5b02db35-8d5c-4ef6-b0e7-2f9b62cac57e has a clone.
>>>> 2020-01-17 16:53:23.903 24723 WARNING cinder.volume.drivers.nimble
>>>> [req-60fe4335-af66-4c46-9bbd-2408bf4d6f07 df7548ecad684f26b8bc802ba63a9814
>>>> 87e34c89e6fb41d2af25085b64011a55 - default default] Snapshot
>>>> snapshot-3806efc5-65ca-495a-a87a-baaddc9607d9 : has a clone:
>>>> NimbleAPIException: Failed to execute api
>>>> snapshots/0464a5fd65d75fcfe1000000000000011100001a41: Error Code: 409
>>>> Message: Snapshot snapshot-3806efc5-65ca-495a-a87a-baaddc9607d9 for volume
>>>> volume-5b02db35-8d5c-4ef6-b0e7-2f9b62cac57e has a clone.
>>>> 2020-01-17 16:53:23.964 24723 WARNING cinder.quota
>>>> [req-60fe4335-af66-4c46-9bbd-2408bf4d6f07 df7548ecad684f26b8bc802ba63a9814
>>>> 87e34c89e6fb41d2af25085b64011a55 - default default] Deprecated: Default
>>>> quota for resource: snapshots_Nimble-DR is set by the default quota flag:
>>>> quota_snapshots_Nimble-DR, it is now deprecated. Please use the default
>>>> quota class for default quota.
>>>> 2020-01-17 16:53:24.054 24723 INFO cinder.volume.manager
>>>> [req-60fe4335-af66-4c46-9bbd-2408bf4d6f07 df7548ecad684f26b8bc802ba63a9814
>>>> 87e34c89e6fb41d2af25085b64011a55 - default default] Delete snapshot
>>>> completed successfully.
>>>
>>>
>>>
>>> Regards,
>>>
>>> *Tony Pearce* |
>>> *Senior Network Engineer / Infrastructure Lead**Cinglevue International
>>> <https://www.cinglevue.com>*
>>>
>>> Email: tony.pearce at cinglevue.com
>>> Web: http://www.cinglevue.com
>>>
>>> *Australia*
>>> 1 Walsh Loop, Joondalup, WA 6027 Australia.
>>>
>>> Direct: +61 8 6202 0036 | Main: +61 8 6202 0024
>>>
>>>
>>>
>>>
>>