Hi Tony, The 'raise' you used raises 'NimbleAPIException' which isn't handled anywhere and is caught by the generic exception block here[1] which sets your snapshot to error_deleting state. My suggestion to try raise exception.SnapshotIsBusy exception which will be caught here[2] and will set your snapshot to available state. Let me know if it works. [1] https://github.com/openstack/cinder/blob/master/cinder/volume/manager.py#L12... [2] https://github.com/openstack/cinder/blob/master/cinder/volume/manager.py#L12... Regards Rajat Dhasmana On Mon, Jan 20, 2020 at 2:35 PM Tony Pearce <tony.pearce@cinglevue.com> wrote:
Hi all, I made some progress on this, but I am unsure how to make it better. Long story short:
- queens issue = snapshot is deleted from openstack but shouldnt be because the snapshot is unable to be deleted on the storage side - compared pike / queens / stein "nimble.py" and found all are different, with each newer version of openstack having additions in code - done some trial and error tests and decided to use the stein nimble.py and modify it - found the 'delete snapshot' section and found it is set to not halt on error - added 'raise' into the function and re-tested the 'delete snapshot' scenario - RESULT = now the snapshot is NOT deleted but goes "error deleting" instead :)
So now after making that change, the snapshot is now in an unavailable status. I am looking as to how I can do something else other than make this snapshot go into an unavailable condition. Such as display a message while keeping the snapshot "available" because it can still be used
Short story long:
The "nimble.py" driver changes between pike,queens,stein versions (though within the file it has "driver version 4.0.1" on all). Pike has around 1700 lines. Queens has 1900 and Stein has 1910 approx.
I confirmed the issue with the driver by copying the nimble.py driver (and the other 2 files named nimble.pyc and nimble.pyo) from Pike into the Queens test env. to test if the snapshot still gets deleted under Queens or shows an error instead. The snapshot was not deleted and it goes error status as expected. note: Initially, I only copied the text data from nimble.py and it appears as though the update to the text file was ignored. It looks to me like, openstack uses one of those .pyc or .pyo files instead. I googled on this and they are binaries that are used in some situations. If I make any changes to the nimble.py file then I need to re-generate those .pyc and .pyo files from the .py.
So what is happening here is; I want to try and delete a snapshot that has a clone. The expected outcome is the snapshot is not deleted in Openstack. Current experience is that Openstack deletes the snapshot from the volume snapshots, leaving the snapshot behind on the array storage side.
In the volume.log, I see the array sends back an error 409 with "has a clone" response. I managed to find which section is printing the error in the volume.log from the nimble.py driver file and so I edited the text section that gets printed and re-run the test. The volume.log now gets printed with the new text additions I added 'DELETE' and 'Response' words:
: NimbleAPIException: DELETE Failed to execute api snapshots/0464a5fd65d75fcfe1000000000000011d00001b1d: Response Error Code: 409 Message: Snapshot snapshot-4ee076ad-2e14-4d0d-bc20-64c17c741f8c for volume volume-7dd64cf1-1d56-4f21-a153-a8137b68c557 has a clone.
This is the python code where I made those changes: basically, because the error code is not 200 or 201 then it throws the error from what I understand.
def delete(self, api): url = self.uri + api r = requests.delete(url, headers=self.headers, verify=self.verify) if r.status_code != 201 and r.status_code != 200: base = "DELETE Failed to execute api %(api)s: Response Error Code: %(code)s" % { 'api': api, 'code': r.status_code} LOG.debug("Base error : %(base)s", {'base': base}) try: msg = _("%(base)s Message: %(msg)s") % { 'base': base, 'msg': r.json()['messages'][1]['text']} except IndexError: msg = _("%(base)s Message: %(msg)s") % { 'base': base, 'msg': six.text_type(r.json())} raise NimbleAPIException(msg) return r.json()
However, slightly before the above within nimble.py I think I have found the function that is causing the initial problem:
def delete_snap(self, volume_name, snap_name):
snap_info = self.get_snap_info(snap_name, volume_name) api = "snapshots/" + six.text_type(snap_info['id']) try: self.delete(api) except NimbleAPIException as ex: LOG.debug("delete snapshot exception: %s", ex) if SM_OBJ_HAS_CLONE in six.text_type(ex): # if snap has a clone log the error and continue ahead LOG.warning('Snapshot %(snap)s : %(state)s', {'snap': snap_name, 'state': SM_OBJ_HAS_CLONE}) else: raise
SM_OBJ_HAS_CLONE is looking for "has a clone" and it's defined in the beginning of the file: "SM_OBJ_HAS_CLONE = "has a clone"" and I can see this in the log file "has a clone" as a response 409.
My problem is literally " # if snap has a clone log the error and continue ahead" - it shouldnt be continuing, because by continuing it is deleting the snapshot on the Openstack side but is unable to do the same on the storage side because of the dependency issue.
So what I did next was to look into the different "delete volume" section for some help - because a similar condition can occur there -> to explain; if volume2 is a clone of volume1 then we can't delete volume1 until we first delete volume2. What I notice is that there is a "raise" in that section at the end - I think I understand this to be throwing an exception to openstack. ie to cause an error condition.
Here's the delete volume section from the driver:
def delete_volume(self, volume): """Delete the specified volume.""" backup_snap_name, backup_vol_name = self .is_volume_backup_clone(volume) eventlet.sleep(DEFAULT_SLEEP) self.APIExecutor.online_vol(volume['name'], False) LOG.debug("Deleting volume %(vol)s", {'vol': volume['name']})
@utils.retry(NimbleAPIException, retries=3) def _retry_remove_vol(volume): self.APIExecutor.delete_vol(volume['name']) try: _retry_remove_vol(volume) except NimbleAPIException as ex: LOG.debug("delete volume exception: %s", ex) if SM_OBJ_HAS_CLONE in six.text_type(ex): LOG.warning('Volume %(vol)s : %(state)s', {'vol': volume['name'], 'state': SM_OBJ_HAS_CLONE})
# set the volume back to be online and raise busy exception self.APIExecutor.online_vol(volume['name'], True) raise exception.VolumeIsBusy(volume_name=volume['name']) raise
So with the above, I modified the delete snapshot section and put in a simple "raise" like this (highlighted in yellow)
def delete_snap(self, volume_name, snap_name): snap_info = self.get_snap_info(snap_name, volume_name) api = "snapshots/" + six.text_type(snap_info['id']) try: self.delete(api) except NimbleAPIException as ex: LOG.debug("delete snapshot exception: %s", ex) if SM_OBJ_HAS_CLONE in six.text_type(ex): # if snap has a clone log the error and continue ahead LOG.warning('Snapshot %(snap)s : %(state)s', {'snap': snap_name, 'state': SM_OBJ_HAS_CLONE}) raise else: raise
And now when I test, the snapshot is not deleted but it instead goes into ERROR-DELETING. It's not perfect but at least I can make the snapshot back to "available" from the admin section within Openstack.
Would anyone be able to if possible, give me some pointers how to accept this error but not cause the snapshot to go into "error" ? I think that I need to create a class?
regards
*Tony Pearce* | *Senior Network Engineer / Infrastructure Lead**Cinglevue International <https://www.cinglevue.com>*
Email: tony.pearce@cinglevue.com Web: http://www.cinglevue.com
*Australia* 1 Walsh Loop, Joondalup, WA 6027 Australia.
Direct: +61 8 6202 0036 | Main: +61 8 6202 0024
Note: This email and all attachments are the sole property of Cinglevue International Pty Ltd. (or any of its subsidiary entities), and the information contained herein must be considered confidential, unless specified otherwise. If you are not the intended recipient, you must not use or forward the information contained in these documents. If you have received this message in error, please delete the email and notify the sender.
On Sat, 18 Jan 2020 at 09:44, Tony Pearce <tony.pearce@cinglevue.com> wrote:
Thank you. That really helps.
I am going to diff the nimble.py files between Pike and Queens and see what's changed.
On Fri, 17 Jan 2020, 22:18 Alan Bishop, <abishop@redhat.com> wrote:
On Fri, Jan 17, 2020 at 2:01 AM Tony Pearce <tony.pearce@cinglevue.com> wrote:
Could anyone help by pointing me where to go to be able to dig into this issue further?
I have installed a test Openstack environment using RDO Packstack. I wanted to install the same version that I have in Production (Pike) but it's not listed in the CentOS repo via yum search. So I installed Queens. I am using nimble.py Cinder driver. Nimble Storage is a storage array accessed via iscsi from the Openstack host, and is controlled from Openstack by the driver and API.
*What I expected to happen:* 1. create an instance with volume (the volume is created on the storage array successfully and instance boots from it) 2. take a snapshot (snapshot taken on the volume on the array successfully) 3. create a new instance from the snapshot (the api tells the array to clone the snapshot into a new volume on the array and use that volume for the instance) 4. try and delete the snapshot Expected Result - Openstack gives the user a message like "you're not allowed to do that".
Note: Step 3 above creates a child volume from the parent snapshot. It's impossible to delete the parent snapshot because IO READ is sent to that part of the original volume (as I understand it).
*My production problem is this: * 1. create an instance with volume (the volume is created on the storage array successfully) 2. take a snapshot (snapshot taken on the volume on the array successfully) 3. create a new instance from the snapshot (the api tells the array to clone the snapshot into a new volume on the array and use that volume for the instance) 4. try and delete the snapshot Result - snapshot goes into error state and later, all Cinder operations fail such as new instance/create volume etc. until the correct service is restarted. Then everything works once again.
To troubleshoot the above, I installed the RDP Packstack Queens (because I couldnt get Pike). I tested the above and now, the result is the snapshot is successfully deleted from openstack but not deleted on the array. The log is below for reference. But I can see the in the log that the array sends back info to openstack saying the snapshot has a clone and the delete cannot be done because of that. Also response code 409.
*Some info about why the problem with Pike started in the first place* 1. Vendor is Nimble Storage which HPE purchased 2. HPE/Nimble have dropped support for openstack. Latest supported version is Queens and Nimble array version v4.x. The current Array version is v5.x. Nimble say there are no guarantees with openstack, the driver and the array version v5.x 3. I was previously advised by Nimble that the array version v5.x will work fine and so left our DR array on v5.x with a pending upgrade that had a blocker due to an issue. This issue was resolved in December and the pending upgrade completed to match the DR array took place around 30 days ago.
With regards to the production issue, I assumed that the array API has some changes between v4.x and v5.x and it's causing an issue with Cinder due to the API response. Although I have not been able to find out if or what changes there are that may have occurred after the array upgrade, as the documentation for this is Nimble internal-only.
*So with that - some questions if I may:* When Openstack got the 409 error response from the API (as seen in the log below), why would Openstack then proceed to delete the snapshot on the Openstack side? How could I debug this further? I'm not sure what Openstack Cinder is acting on in terns of the response as yet. Maybe Openstack is not specifically looking for the error code in the response?
The snapshot that got deleted on the openstack side is a problem. Would this be related to the driver? Could it be possible that the driver did not pass the error response to Cinder?
Hi Tony,
This is exactly what happened, and it appears to be a driver bug introduced in queens by [1]. The code in question [2] logs the error, but fails to propagate the exception. As far as the volume manager is concerned, the snapshot deletion was successful.
[1] https://review.opendev.org/601492 [2] https://opendev.org/openstack/cinder/src/branch/stable/queens/cinder/volume/...
Alan
Thanks in advance. Just for reference, the log snippet is below.
==> volume.log <==
2020-01-17 16:53:23.718 24723 WARNING py.warnings [req-60fe4335-af66-4c46-9bbd-2408bf4d6f07 df7548ecad684f26b8bc802ba63a9814 87e34c89e6fb41d2af25085b64011a55 - default default] /usr/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning) : NimbleAPIException: Failed to execute api snapshots/0464a5fd65d75fcfe1000000000000011100001a41: Error Code: 409 Message: Snapshot snapshot-3806efc5-65ca-495a-a87a-baaddc9607d9 for volume volume-5b02db35-8d5c-4ef6-b0e7-2f9b62cac57e has a clone. ==> api.log <== 2020-01-17 16:53:23.769 25242 INFO cinder.api.openstack.wsgi [req-bfcbff34-134b-497e-82b1-082d48f8767f df7548ecad684f26b8bc802ba63a9814 87e34c89e6fb41d2af25085b64011a55 - default default] http://192.168.53.45:8776/v3/87e34c89e6fb41d2af25085b64011a55/volumes/detail returned with HTTP 200 2020-01-17 16:53:23.770 25242 INFO eventlet.wsgi.server [req-bfcbff34-134b-497e-82b1-082d48f8767f df7548ecad684f26b8bc802ba63a9814 87e34c89e6fb41d2af25085b64011a55 - default default] 192.168.53.45 "GET /v3/87e34c89e6fb41d2af25085b64011a55/volumes/detail HTTP/1.1" status: 200 len: 4657 time: 0.1152730 ==> volume.log <== 2020-01-17 16:53:23.811 24723 WARNING py.warnings [req-60fe4335-af66-4c46-9bbd-2408bf4d6f07 df7548ecad684f26b8bc802ba63a9814 87e34c89e6fb41d2af25085b64011a55 - default default] /usr/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning) : NimbleAPIException: Failed to execute api snapshots/0464a5fd65d75fcfe1000000000000011100001a41: Error Code: 409 Message: Snapshot snapshot-3806efc5-65ca-495a-a87a-baaddc9607d9 for volume volume-5b02db35-8d5c-4ef6-b0e7-2f9b62cac57e has a clone. 2020-01-17 16:53:23.902 24723 ERROR cinder.volume.drivers.nimble [req-60fe4335-af66-4c46-9bbd-2408bf4d6f07 df7548ecad684f26b8bc802ba63a9814 87e34c89e6fb41d2af25085b64011a55 - default default] Re-throwing Exception Failed to execute api snapshots/0464a5fd65d75fcfe1000000000000011100001a41: Error Code: 409 Message: Snapshot snapshot-3806efc5-65ca-495a-a87a-baaddc9607d9 for volume volume-5b02db35-8d5c-4ef6-b0e7-2f9b62cac57e has a clone.: NimbleAPIException: Failed to execute api snapshots/0464a5fd65d75fcfe1000000000000011100001a41: Error Code: 409 Message: Snapshot snapshot-3806efc5-65ca-495a-a87a-baaddc9607d9 for volume volume-5b02db35-8d5c-4ef6-b0e7-2f9b62cac57e has a clone. 2020-01-17 16:53:23.903 24723 WARNING cinder.volume.drivers.nimble [req-60fe4335-af66-4c46-9bbd-2408bf4d6f07 df7548ecad684f26b8bc802ba63a9814 87e34c89e6fb41d2af25085b64011a55 - default default] Snapshot snapshot-3806efc5-65ca-495a-a87a-baaddc9607d9 : has a clone: NimbleAPIException: Failed to execute api snapshots/0464a5fd65d75fcfe1000000000000011100001a41: Error Code: 409 Message: Snapshot snapshot-3806efc5-65ca-495a-a87a-baaddc9607d9 for volume volume-5b02db35-8d5c-4ef6-b0e7-2f9b62cac57e has a clone. 2020-01-17 16:53:23.964 24723 WARNING cinder.quota [req-60fe4335-af66-4c46-9bbd-2408bf4d6f07 df7548ecad684f26b8bc802ba63a9814 87e34c89e6fb41d2af25085b64011a55 - default default] Deprecated: Default quota for resource: snapshots_Nimble-DR is set by the default quota flag: quota_snapshots_Nimble-DR, it is now deprecated. Please use the default quota class for default quota. 2020-01-17 16:53:24.054 24723 INFO cinder.volume.manager [req-60fe4335-af66-4c46-9bbd-2408bf4d6f07 df7548ecad684f26b8bc802ba63a9814 87e34c89e6fb41d2af25085b64011a55 - default default] Delete snapshot completed successfully.
Regards,
*Tony Pearce* | *Senior Network Engineer / Infrastructure Lead**Cinglevue International <https://www.cinglevue.com>*
Email: tony.pearce@cinglevue.com Web: http://www.cinglevue.com
*Australia* 1 Walsh Loop, Joondalup, WA 6027 Australia.
Direct: +61 8 6202 0036 | Main: +61 8 6202 0024
Note: This email and all attachments are the sole property of Cinglevue International Pty Ltd. (or any of its subsidiary entities), and the information contained herein must be considered confidential, unless specified otherwise. If you are not the intended recipient, you must not use or forward the information contained in these documents. If you have received this message in error, please delete the email and notify the sender.