Cinder snapshot delete successful when expected to fail

Rajat Dhasmana rdhasman at redhat.com
Mon Jan 20 09:24:48 UTC 2020


Hi Tony,

The 'raise' you used re-raises 'NimbleAPIException', which isn't handled
specifically anywhere and so falls through to the generic exception block
here[1], which sets your snapshot to the error_deleting state.

My suggestion is to raise exception.SnapshotIsBusy instead, which will be
caught here[2] and will set your snapshot back to the available state.
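
Something like this, for example - a rough sketch based on the delete_snap()
you pasted (the driver already imports cinder's exception module, since
delete_volume uses exception.VolumeIsBusy; I'd expect SnapshotIsBusy to take
a snapshot_name keyword, but please double-check against your Cinder tree):

    def delete_snap(self, volume_name, snap_name):
        snap_info = self.get_snap_info(snap_name, volume_name)
        api = "snapshots/" + six.text_type(snap_info['id'])
        try:
            self.delete(api)
        except NimbleAPIException as ex:
            LOG.debug("delete snapshot exception: %s", ex)
            if SM_OBJ_HAS_CLONE in six.text_type(ex):
                # snap has a clone: report it as busy so the manager puts
                # the snapshot back to 'available' instead of 'error_deleting'
                LOG.warning('Snapshot %(snap)s : %(state)s',
                            {'snap': snap_name,
                             'state': SM_OBJ_HAS_CLONE})
                raise exception.SnapshotIsBusy(snapshot_name=snap_name)
            raise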

Let me know if it works.

[1]
https://github.com/openstack/cinder/blob/master/cinder/volume/manager.py#L1242
[2]
https://github.com/openstack/cinder/blob/master/cinder/volume/manager.py#L1228
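
For context, the handling around [1] and [2] is roughly shaped like this
(paraphrased for illustration, not the exact manager.py code):

    try:
        self.driver.delete_snapshot(snapshot)
    except exception.SnapshotIsBusy:
        # [2] a busy snapshot is put back to 'available'
        snapshot.status = fields.SnapshotStatus.AVAILABLE
        snapshot.save()
    except Exception:
        # [1] any other failure leaves the snapshot in 'error_deleting'
        with excutils.save_and_reraise_exception():
            snapshot.status = fields.SnapshotStatus.ERROR_DELETING
            snapshot.save()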

Regards
Rajat Dhasmana

On Mon, Jan 20, 2020 at 2:35 PM Tony Pearce <tony.pearce at cinglevue.com>
wrote:

> Hi all,
> I made some progress on this, but I am unsure how to make it better. Long
> story short:
>
>    - queens issue = the snapshot is deleted from openstack but shouldn't
>    be, because the snapshot cannot be deleted on the storage side
>    - compared the pike / queens / stein "nimble.py" and found they are all
>    different, with each newer version of openstack having additions in the code
>    - did some trial-and-error tests and decided to use the stein
>    nimble.py and modify it
>    - found the 'delete snapshot' section and saw that it is set not to halt
>    on error
>    - added 'raise' into the function and re-tested the 'delete snapshot'
>    scenario
>    - RESULT = now the snapshot is NOT deleted but goes to "error deleting"
>    instead :)
>
> So now, after making that change, the snapshot ends up in an unavailable
> status. I am looking at how I can do something other than put the snapshot
> into an unavailable condition, such as displaying a message while keeping
> the snapshot "available", because it can still be used.
>
> Short story long:
>
> The "nimble.py" driver changes between pike,queens,stein versions (though
> within the file it has "driver version 4.0.1" on all). Pike has around 1700
> lines. Queens has 1900 and Stein has 1910 approx.
>
> I confirmed the issue with the driver by copying the nimble.py driver (and
> the other 2 files, nimble.pyc and nimble.pyo) from Pike into the Queens test
> env. to check whether the snapshot still gets deleted under Queens or shows
> an error instead. The snapshot was not deleted and it went into error
> status as expected.
> Note: Initially, I only copied the text data from nimble.py and the update
> to the text file appeared to be ignored. It looks to me like openstack uses
> one of those .pyc or .pyo files instead. I googled this and they are
> compiled bytecode files that are used in some situations. If I make any
> changes to the nimble.py file then I need to re-generate those .pyc and
> .pyo files from the .py.
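>
> For reference, one way those files could be regenerated (a sketch, assuming
> the stock CPython 2.7 on the host; the path below is illustrative, not the
> real install location):
>
>>     import py_compile
>>     # writes nimble.pyc next to the .py; running the same two lines under
>>     # "python -O" writes nimble.pyo instead
>>     py_compile.compile('/path/to/cinder/volume/drivers/nimble.py')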
>
> So what is happening here is: I want to try and delete a snapshot that has
> a clone. The expected outcome is that the snapshot is not deleted in Openstack.
> What currently happens is that Openstack deletes the snapshot from the volume
> snapshots, leaving the snapshot behind on the array storage side.
>
> In the volume.log, I see the array sends back an error 409 with a "has a
> clone" response. I managed to find which section of the nimble.py driver
> file prints the error in the volume.log, so I edited the text that gets
> printed and re-ran the test. The volume.log now shows the new text I added
> (the words 'DELETE' and 'Response'):
>
> : NimbleAPIException: DELETE Failed to execute api
> snapshots/0464a5fd65d75fcfe1000000000000011d00001b1d: Response Error Code:
> 409 Message: Snapshot snapshot-4ee076ad-2e14-4d0d-bc20-64c17c741f8c for
> volume volume-7dd64cf1-1d56-4f21-a153-a8137b68c557 has a clone.
>
> This is the Python code where I made those changes: basically, as I
> understand it, if the status code is not 200 or 201 then it raises the
> error.
>
>>     def delete(self, api):
>>         url = self.uri + api
>>         r = requests.delete(url, headers=self.headers, verify=self.verify)
>>         if r.status_code != 201 and r.status_code != 200:
>>             base = ("DELETE Failed to execute api %(api)s: "
>>                     "Response Error Code: %(code)s") % {
>>                 'api': api,
>>                 'code': r.status_code}
>>             LOG.debug("Base error : %(base)s", {'base': base})
>>             try:
>>                 msg = _("%(base)s Message: %(msg)s") % {
>>                     'base': base,
>>                     'msg': r.json()['messages'][1]['text']}
>>             except IndexError:
>>                 msg = _("%(base)s Message: %(msg)s") % {
>>                     'base': base,
>>                     'msg': six.text_type(r.json())}
>>             raise NimbleAPIException(msg)
>>         return r.json()
>
>
>
> However, slightly earlier in nimble.py, I think I have found
> the function that is causing the initial problem:
>
>>     def delete_snap(self, volume_name, snap_name):
>>         snap_info = self.get_snap_info(snap_name, volume_name)
>>         api = "snapshots/" + six.text_type(snap_info['id'])
>>         try:
>>             self.delete(api)
>>         except NimbleAPIException as ex:
>>             LOG.debug("delete snapshot exception: %s", ex)
>>             if SM_OBJ_HAS_CLONE in six.text_type(ex):
>>                 # if snap has a clone log the error and continue ahead
>>                 LOG.warning('Snapshot %(snap)s : %(state)s',
>>                             {'snap': snap_name,
>>                              'state': SM_OBJ_HAS_CLONE})
>>             else:
>>                 raise
>
>
>  SM_OBJ_HAS_CLONE is looking for "has a clone". It's defined at the
> beginning of the file as "SM_OBJ_HAS_CLONE = "has a clone"", and I can see
> "has a clone" in the log file as part of the 409 response.
>
> My problem is literally "# if snap has a clone log the error and continue
> ahead" - it shouldn't be continuing, because by continuing it deletes the
> snapshot on the Openstack side but is unable to do the same on the
> storage side because of the dependency issue.
>
> So what I did next was to look into the "delete volume" section for some
> help, because a similar condition can occur there. To explain: if volume2
> is a clone of volume1, then we can't delete volume1 until we first delete
> volume2.
> What I notice is that there is a "raise" in that section at the end - I
> understand this to be throwing an exception back to Openstack, i.e. to
> cause an error condition.
>
> Here's the delete volume section from the driver:
>
>     def delete_volume(self, volume):
>         """Delete the specified volume."""
>         backup_snap_name, backup_vol_name = \
>             self.is_volume_backup_clone(volume)
>         eventlet.sleep(DEFAULT_SLEEP)
>         self.APIExecutor.online_vol(volume['name'], False)
>         LOG.debug("Deleting volume %(vol)s", {'vol': volume['name']})
>
>         @utils.retry(NimbleAPIException, retries=3)
>         def _retry_remove_vol(volume):
>             self.APIExecutor.delete_vol(volume['name'])
>         try:
>             _retry_remove_vol(volume)
>         except NimbleAPIException as ex:
>             LOG.debug("delete volume exception: %s", ex)
>             if SM_OBJ_HAS_CLONE in six.text_type(ex):
>                 LOG.warning('Volume %(vol)s : %(state)s',
>                             {'vol': volume['name'],
>                              'state': SM_OBJ_HAS_CLONE})
>                 # set the volume back to be online and raise busy exception
>                 self.APIExecutor.online_vol(volume['name'], True)
>                 raise exception.VolumeIsBusy(volume_name=volume['name'])
>             raise
>
>
> So, with the above in mind, I modified the delete snapshot section and put
> in a simple "raise" like this (the extra "raise" after the LOG.warning
> call):
>
>>
>>     def delete_snap(self, volume_name, snap_name):
>>         snap_info = self.get_snap_info(snap_name, volume_name)
>>         api = "snapshots/" + six.text_type(snap_info['id'])
>>         try:
>>             self.delete(api)
>>         except NimbleAPIException as ex:
>>             LOG.debug("delete snapshot exception: %s", ex)
>>             if SM_OBJ_HAS_CLONE in six.text_type(ex):
>>                 # if snap has a clone log the error and continue ahead
>>                 LOG.warning('Snapshot %(snap)s : %(state)s',
>>                             {'snap': snap_name,
>>                              'state': SM_OBJ_HAS_CLONE})
>>                 raise
>>             else:
>>                 raise
>
>
>
> And now when I test, the snapshot is not deleted but instead goes into
> ERROR-DELETING. It's not perfect, but at least I can set the snapshot back
> to "available" from the admin section within Openstack.
>
> Would anyone be able to give me some pointers on how to accept this error
> without the snapshot going into "error"? I think that I may need to create
> a class?
>
> regards
>
> *Tony Pearce*   |
> *Senior Network Engineer / Infrastructure Lead**Cinglevue International
> <https://www.cinglevue.com>*
>
> Email: tony.pearce at cinglevue.com
> Web: http://www.cinglevue.com
>
> *Australia*
> 1 Walsh Loop, Joondalup, WA 6027 Australia.
>
> Direct: +61 8 6202 0036 | Main: +61 8 6202 0024
>
> Note: This email and all attachments are the sole property of Cinglevue
> International Pty Ltd. (or any of its subsidiary entities), and the
> information contained herein must be considered confidential, unless
> specified otherwise.   If you are not the intended recipient, you must not
> use or forward the information contained in these documents.   If you have
> received this message in error, please delete the email and notify the
> sender.
>
>
>
>
> On Sat, 18 Jan 2020 at 09:44, Tony Pearce <tony.pearce at cinglevue.com>
> wrote:
>
>> Thank you. That really helps.
>>
>> I am going to diff the nimble.py files between Pike and Queens and see
>> what's changed.
>>
>> On Fri, 17 Jan 2020, 22:18 Alan Bishop, <abishop at redhat.com> wrote:
>>
>>>
>>>
>>> On Fri, Jan 17, 2020 at 2:01 AM Tony Pearce <tony.pearce at cinglevue.com>
>>> wrote:
>>>
>>>> Could anyone help by pointing me where to go to be able to dig into
>>>> this issue further?
>>>>
>>>> I have installed a test Openstack environment using RDO Packstack. I
>>>> wanted to install the same version that I have in Production (Pike) but
>>>> it's not listed in the CentOS repo via yum search. So I installed Queens. I
>>>> am using nimble.py Cinder driver. Nimble Storage is a storage array
>>>> accessed via iscsi from the Openstack host, and is controlled from
>>>> Openstack by the driver and API.
>>>>
>>>> *What I expected to happen:*
>>>> 1. create an instance with volume (the volume is created on the storage
>>>> array successfully and instance boots from it)
>>>> 2. take a snapshot  (snapshot taken on the volume on the array
>>>> successfully)
>>>> 3. create a new instance from the snapshot (the api tells the array to
>>>> clone the snapshot into a new volume on the array and use that volume for
>>>> the instance)
>>>> 4. try and delete the snapshot
>>>> Expected Result - Openstack gives the user a message like "you're not
>>>> allowed to do that".
>>>>
>>>>  Note: Step 3 above creates a child volume from the parent snapshot.
>>>> It's impossible to delete the parent snapshot because read I/O is still
>>>> served from that part of the original volume (as I understand it).
>>>>
>>>> *My production problem is this: *
>>>> 1. create an instance with volume (the volume is created on the storage
>>>> array successfully)
>>>> 2. take a snapshot  (snapshot taken on the volume on the array
>>>> successfully)
>>>> 3. create a new instance from the snapshot (the api tells the array to
>>>> clone the snapshot into a new volume on the array and use that volume for
>>>> the instance)
>>>> 4. try and delete the snapshot
>>>> Result - the snapshot goes into an error state and, later, all Cinder
>>>> operations (such as creating a new instance or volume) fail until the
>>>> relevant service is restarted. Then everything works once again.
>>>>
>>>>
>>>> To troubleshoot the above, I installed the RDO Packstack Queens
>>>> (because I couldn't get Pike). I tested the above and now the result is that the
>>>> snapshot is successfully deleted from openstack but not deleted on the
>>>> array. The log is below for reference. But I can see in the log that
>>>> the array sends back info to openstack saying the snapshot has a clone and
>>>> the delete cannot be done because of that. Also response code 409.
>>>>
>>>> *Some info about why the problem with Pike started in the first place*
>>>> 1. Vendor is Nimble Storage which HPE purchased
>>>> 2. HPE/Nimble have dropped support for openstack. The latest supported
>>>> combination is Queens with Nimble array version v4.x. The current array
>>>> version is v5.x. Nimble say there are no guarantees with openstack, the
>>>> driver and array version v5.x.
>>>> 3. I was previously advised by Nimble that array version v5.x would work
>>>> fine, and so left our DR array on v5.x. The other array had a pending
>>>> upgrade (to match the DR array) that was blocked by an issue; that issue
>>>> was resolved in December and the upgrade took place around 30 days ago.
>>>>
>>>>
>>>> With regards to the production issue, I assumed that the array API has
>>>> some changes between v4.x and v5.x and it's causing an issue with Cinder
>>>> due to the API response. Although I have not been able to find out if or
>>>> what changes there are that may have occurred after the array upgrade, as
>>>> the documentation for this is Nimble internal-only.
>>>>
>>>>
>>>> *So with that - some questions if I may:*
>>>>  When Openstack got the 409 error response from the API (as seen in the
>>>> log below), why would Openstack then proceed to delete the snapshot on the
>>>> Openstack side? How could I debug this further? I'm not sure yet what Openstack
>>>> Cinder is acting on in terms of the response. Maybe Openstack is not
>>>> specifically looking for the error code in the response?
>>>>
>>>> The snapshot that got deleted on the openstack side is a problem. Would
>>>> this be related to the driver? Could it be possible that the driver did not
>>>> pass the error response to Cinder?
>>>>
>>>
>>> Hi Tony,
>>>
>>> This is exactly what happened, and it appears to be a driver bug
>>> introduced in queens by [1]. The code in question [2] logs the error, but
>>> fails to propagate the exception. As far as the volume manager is
>>> concerned, the snapshot deletion was successful.
>>>
>>> [1] https://review.opendev.org/601492
>>> [2]
>>> https://opendev.org/openstack/cinder/src/branch/stable/queens/cinder/volume/drivers/nimble.py#L1815
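>>>
>>> A tiny generic illustration of the failure mode (not the driver code, just
>>> the pattern):
>>>
>>>     import logging
>>>     LOG = logging.getLogger(__name__)
>>>
>>>     def delete_snapshot():
>>>         try:
>>>             raise RuntimeError("has a clone")      # backend refuses
>>>         except RuntimeError as ex:
>>>             LOG.warning("delete failed: %s", ex)   # logged...
>>>             # ...but not re-raised, so the caller sees a normal return
>>>
>>>     delete_snapshot()  # returns None; the caller assumes the delete worked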
>>>
>>> Alan
>>>
>>>> Thanks in advance. Just for reference, the log snippet is below.
>>>>
>>>>
>>>> ==> volume.log <==
>>>>> 2020-01-17 16:53:23.718 24723 WARNING py.warnings
>>>>> [req-60fe4335-af66-4c46-9bbd-2408bf4d6f07 df7548ecad684f26b8bc802ba63a9814
>>>>> 87e34c89e6fb41d2af25085b64011a55 - default default]
>>>>> /usr/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py:852:
>>>>> InsecureRequestWarning: Unverified HTTPS request is being made. Adding
>>>>> certificate verification is strongly advised. See:
>>>>> https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
>>>>>   InsecureRequestWarning)
>>>>> : NimbleAPIException: Failed to execute api
>>>>> snapshots/0464a5fd65d75fcfe1000000000000011100001a41: Error Code: 409
>>>>> Message: Snapshot snapshot-3806efc5-65ca-495a-a87a-baaddc9607d9 for volume
>>>>> volume-5b02db35-8d5c-4ef6-b0e7-2f9b62cac57e has a clone.
>>>>> ==> api.log <==
>>>>> 2020-01-17 16:53:23.769 25242 INFO cinder.api.openstack.wsgi
>>>>> [req-bfcbff34-134b-497e-82b1-082d48f8767f df7548ecad684f26b8bc802ba63a9814
>>>>> 87e34c89e6fb41d2af25085b64011a55 - default default]
>>>>> http://192.168.53.45:8776/v3/87e34c89e6fb41d2af25085b64011a55/volumes/detail
>>>>> returned with HTTP 200
>>>>> 2020-01-17 16:53:23.770 25242 INFO eventlet.wsgi.server
>>>>> [req-bfcbff34-134b-497e-82b1-082d48f8767f df7548ecad684f26b8bc802ba63a9814
>>>>> 87e34c89e6fb41d2af25085b64011a55 - default default] 192.168.53.45 "GET
>>>>> /v3/87e34c89e6fb41d2af25085b64011a55/volumes/detail HTTP/1.1" status: 200
>>>>>  len: 4657 time: 0.1152730
>>>>> ==> volume.log <==
>>>>> 2020-01-17 16:53:23.811 24723 WARNING py.warnings
>>>>> [req-60fe4335-af66-4c46-9bbd-2408bf4d6f07 df7548ecad684f26b8bc802ba63a9814
>>>>> 87e34c89e6fb41d2af25085b64011a55 - default default]
>>>>> /usr/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py:852:
>>>>> InsecureRequestWarning: Unverified HTTPS request is being made. Adding
>>>>> certificate verification is strongly advised. See:
>>>>> https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
>>>>>   InsecureRequestWarning)
>>>>> : NimbleAPIException: Failed to execute api
>>>>> snapshots/0464a5fd65d75fcfe1000000000000011100001a41: Error Code: 409
>>>>> Message: Snapshot snapshot-3806efc5-65ca-495a-a87a-baaddc9607d9 for volume
>>>>> volume-5b02db35-8d5c-4ef6-b0e7-2f9b62cac57e has a clone.
>>>>> 2020-01-17 16:53:23.902 24723 ERROR cinder.volume.drivers.nimble
>>>>> [req-60fe4335-af66-4c46-9bbd-2408bf4d6f07 df7548ecad684f26b8bc802ba63a9814
>>>>> 87e34c89e6fb41d2af25085b64011a55 - default default] Re-throwing Exception
>>>>> Failed to execute api snapshots/0464a5fd65d75fcfe1000000000000011100001a41:
>>>>> Error Code: 409 Message: Snapshot
>>>>> snapshot-3806efc5-65ca-495a-a87a-baaddc9607d9 for volume
>>>>> volume-5b02db35-8d5c-4ef6-b0e7-2f9b62cac57e has a clone.:
>>>>> NimbleAPIException: Failed to execute api
>>>>> snapshots/0464a5fd65d75fcfe1000000000000011100001a41: Error Code: 409
>>>>> Message: Snapshot snapshot-3806efc5-65ca-495a-a87a-baaddc9607d9 for volume
>>>>> volume-5b02db35-8d5c-4ef6-b0e7-2f9b62cac57e has a clone.
>>>>> 2020-01-17 16:53:23.903 24723 WARNING cinder.volume.drivers.nimble
>>>>> [req-60fe4335-af66-4c46-9bbd-2408bf4d6f07 df7548ecad684f26b8bc802ba63a9814
>>>>> 87e34c89e6fb41d2af25085b64011a55 - default default] Snapshot
>>>>> snapshot-3806efc5-65ca-495a-a87a-baaddc9607d9 : has a clone:
>>>>> NimbleAPIException: Failed to execute api
>>>>> snapshots/0464a5fd65d75fcfe1000000000000011100001a41: Error Code: 409
>>>>> Message: Snapshot snapshot-3806efc5-65ca-495a-a87a-baaddc9607d9 for volume
>>>>> volume-5b02db35-8d5c-4ef6-b0e7-2f9b62cac57e has a clone.
>>>>> 2020-01-17 16:53:23.964 24723 WARNING cinder.quota
>>>>> [req-60fe4335-af66-4c46-9bbd-2408bf4d6f07 df7548ecad684f26b8bc802ba63a9814
>>>>> 87e34c89e6fb41d2af25085b64011a55 - default default] Deprecated: Default
>>>>> quota for resource: snapshots_Nimble-DR is set by the default quota flag:
>>>>> quota_snapshots_Nimble-DR, it is now deprecated. Please use the default
>>>>> quota class for default quota.
>>>>> 2020-01-17 16:53:24.054 24723 INFO cinder.volume.manager
>>>>> [req-60fe4335-af66-4c46-9bbd-2408bf4d6f07 df7548ecad684f26b8bc802ba63a9814
>>>>> 87e34c89e6fb41d2af25085b64011a55 - default default] Delete snapshot
>>>>> completed successfully.
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> *Tony Pearce*   |
>>>> *Senior Network Engineer / Infrastructure Lead**Cinglevue
>>>> International <https://www.cinglevue.com>*
>>>>
>>>> Email: tony.pearce at cinglevue.com
>>>> Web: http://www.cinglevue.com
>>>>
>>>> *Australia*
>>>> 1 Walsh Loop, Joondalup, WA 6027 Australia.
>>>>
>>>> Direct: +61 8 6202 0036 | Main: +61 8 6202 0024
>>>>
>>>>
>>>>
>>>>
>>>