cron triggers execution fails with cinder.volume_snapshots_create

Francois Scheurer francois.scheurer at everyware.ch
Tue Sep 24 08:55:20 UTC 2019


Hi Gorka and Renat


Thank you for your suggestions, and sorry for having forgotten the 
[mistral] subject prefix.


 >Renat:
 >A workflow should probably be responsible for tracking the status of an 
operation.

 >Gorka:
 >Instead of a sleep, which may get you through this issue but fall into a
 >different one and won't return the right status code, you should
 >probably have a loop checking the status of the backup and return a non
 >zero status code if it ends up in "error" state.

Gorka's idea sounds good; a rough sketch of how it could look for the 
backup case follows the excerpt below.

If you look at Jose Castro's snapshot workflow, you will find a similar 
pattern:

#https://techblog.web.cern.ch/techblog/post/scheduled-snapshots/
#https://gitlab.cern.ch/cloud-infrastructure/mistral-workflows/raw/master/workflows/instance_snapshot.yaml
| sed -e 's%action_region: "cern"%action_region: "ch-zh1"%' > instance_snapshot.yaml

     stop_instance:
       description: 'Stops the instance for consistency'
       action: nova.servers_stop
       input:
         server: <% $.instance %>
         action_region: <% $.action_region %>
       on-success:
         - wait_for_stop_instance
       on-error:
         - error_task

     wait_for_stop_instance:
       description: 'Waits until the instance is shutoff to continue'
       action: nova.servers_find
       input:
         id: <% $.instance %>
         status: 'SHUTOFF'
         action_region: <% $.action_region %>
       retry:
         delay: 5
         count: 40
       on-success:
         - check_boot_source
       on-error:
         - error_task
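
Applying that pattern to the backup case, a wait task along the following 
lines could replace a sleep. This is only a rough, untested sketch: it 
assumes Mistral generates a cinder.backups_find action analogous to the 
nova.servers_find used above (which I have not verified), and the names 
wait_for_backup_available, $.backup_id, next_task and error_task are 
placeholders.

     wait_for_backup_available:
       description: 'Waits until the backup reaches the "available" status'
       # assumption: a generated cinder.backups_find action exists,
       # analogous to nova.servers_find above
       action: cinder.backups_find
       input:
         id: <% $.backup_id %>    # placeholder: id returned by the backup creation task
         status: 'available'
       retry:
         delay: 30
         count: 40
       on-success:
         - next_task              # placeholder for whatever comes next
       on-error:
         - error_task             # reached once the retries are exhausted

If the backup ends up in the "error" state, the find never matches, the 
retries are exhausted and the task fails, so the execution result would 
reflect the real outcome instead of reporting "success".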


 >We’ve discussed a more generic solution in the past for similar 
situations but it seems to be virtually impossible to find it.

OK, so it looks like this issue cannot be fixed with a small bugfix; it 
would require a feature extension.

I can imagine that quite a few API calls from the different OpenStack 
modules/services are asynchronous and would require Mistral to check 
their progress status in a different, ad hoc manner each time.
That would make such a new feature in Mistral quite expensive to 
implement.

It would be great if every asynchronous call returned a job_id in a form 
standardized across all services, so that Mistral could track them in a 
uniform way.
This would also allow the OpenStack client to run in sync or async mode, 
according to the user's needs.

But such a design requirement would have been better set on day one; it is 
likely too late to change all OpenStack services... (a purely hypothetical 
sketch of the idea follows below).
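
Purely as an illustration of that wish (nothing like this exists today), a 
single generic wait task could then cover every service. All names below, 
including the jobs_find action, the $.job_id variable and the "finished" 
status, are hypothetical:

     wait_for_job:
       description: 'Generic wait for any asynchronous OpenStack operation'
       # hypothetical: every service would expose the same generated
       # jobs_find action returning a standard status field
       action: cinder.jobs_find
       input:
         id: <% $.job_id %>       # hypothetical standard job id returned by the async call
         status: 'finished'
       retry:
         delay: 10
         count: 60
       on-success:
         - next_task
       on-error:
         - error_task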


However, there is a minor enhancement that could be made:
let the user specify whether a cron trigger should auto-delete itself 
after its last execution or not.

Keeping expired cron triggers could be nice for:
- avoiding race conditions such as the swift/radosgw one described below
- allowing the user to edit and reschedule an expired cron trigger

What do you think?


Best Regards

Francois









On 9/24/19 8:36 AM, Renat Akhmerov wrote:
> Hi!
>
> I would kindly ask you to add [mistral] into the subject of the emails 
> related to Mistral. I just saw this thread accidentally (since I can’t 
> read everything) and missed it in the first place.
>
> On the issue itself… So yes, the discovery you made makes perfect 
> sense. I agree that a workflow should probably be responsible for 
> tracking a status of an operation. We’ve discussed a more generic 
> solution in the past for similar situations but it seems to be 
> virtually impossible to find it. If you have some ideas, please share. 
> We can discuss it.
>
>
> Thanks
>
> Renat Akhmerov
> @Nokia
> On 23 Sep 2019, 14:41 +0700, Gorka Eguileor <geguileo at redhat.com>, wrote:
>> On 20/09, Francois Scheurer wrote:
>>> Hi Gorka
>>>
>>>
>>>> Then I assume you prefer the Swift backup driver over the Ceph one
>>>> because you are using one of the OpenStack releases that had trouble
>>>> with Incremental Backups on the Ceph backup driver.
>>>
>>>
>>> You are probably right. But I cannot answer that because I was not 
>>> involved in that decision.
>>>
>>>
>>> Ok in the radosgw logs I see this:
>>>
>>>
>>> 2019-09-20 15:40:06.805529 7f19edb9b700 20 
>>> token_id=gAAAAABdhNauRvNev5P90ovX7_cb5_4MkY1tg5JHFpAH8JL-_0vDs06lHW5F9Iphua7fxCWTxxdL-0fRzhR8We_nN6Hx9z3FTWcTXLUMtIUPe0WMKQgW6JkUTP8RwSjAfF4W04OztEg3VAUGN_5gWRlBX-KT9uypnEszadG1yA7gpjkCokNnD8oaIeE6arvs_EjfJib51rao
>>> 2019-09-20 15:40:06.805664 7f19edb9b700 20 sending request to
>>> https://keystone.service.stage.ewcs.ch/v3/auth/tokens
>>> 2019-09-20 15:40:06.805803 7f19edb9b700 20 ssl verification is set 
>>> to off
>>> 2019-09-20 15:40:07.235356 7f19edb9b700 20 sending request to
>>> https://keystone.service.stage.ewcs.ch/v3/auth/tokens
>>> 2019-09-20 15:40:07.235404 7f19edb9b700 20 ssl verification is set 
>>> to off
>>> 2019-09-20 15:40:07.267091 7f19edb9b700  5 Failed keystone auth from
>>> https://keystone.service.stage.ewcs.ch/v3/auth/tokens with 404
>>> BTW: our radosgw is configured to delegate user authentication to 
>>> keystone.
>>>
>>> In keystone logs I see this:
>>>
>>> 2019-09-20 15:40:07.218 24 INFO keystone.token.provider
>>> [req-21b2f11c-9e67-4487-af05-420acfb65ace - - - - -] Token being 
>>> processed:
>>> token.user_id [f7c7296949f84a4387c5172808a0965b],
>>> token.expires_at[2019-09-21T13:40:07.000000Z],
>>> token.audit_ids[[u'hFweMPCrSO2D00rNcRNECw']], 
>>> token.methods[[u'password']],
>>> token.system[None], token.domain_id[None],
>>> token.project_id[4120792f50bc4cf2b4f97c4546462f06], 
>>> token.trust_id[None],
>>> token.federated_groups[None], token.identity_provider_id[None],
>>> token.protocol_id[None],
>>> token.access_token_id[None],token.application_credential_id[None].
>>> 2019-09-20 15:40:07.257 21 INFO keystone.common.wsgi
>>> [req-9f858abb-68f9-42cf-b71a-f1cafca91844 
>>> f7c7296949f84a4387c5172808a0965b
>>> 4120792f50bc4cf2b4f97c4546462f06 - default default] GET
>>> http://keystone.service.stage.ewcs.ch/v3/auth/tokens
>>> 2019-09-20 15:40:07.265 21 WARNING keystone.common.wsgi
>>> [req-9f858abb-68f9-42cf-b71a-f1cafca91844 
>>> f7c7296949f84a4387c5172808a0965b
>>> 4120792f50bc4cf2b4f97c4546462f06 - default default] Could not find 
>>> trust:
>>> 934ed82d2b14413899023da0bee6a953.: TrustNotFound: Could not find trust:
>>> 934ed82d2b14413899023da0bee6a953.
>>>
>>>
>>> So what happens is the following:
>>>
>>> 1. when the user creates the cron trigger, mistral creates a trust
>>> 2. when the cron trigger executes the workflow, openstack creates a
>>> volume snapshot (an rbd image), then copies it to swift (rgw), then
>>> deletes the snapshot
>>> 3. when the execution finishes, if the cron trigger has no remaining
>>> executions scheduled, then mistral removes the cron trigger and the trust
>>>
>>> The problem is a race condition: apparently the copying of the snapshot to
>>> swift runs in the background and mistral removes the trust before the
>>> operation completes...
>>>
>>> That explains the error in keystone and also the cron trigger execution
>>> result which is "success" even if the resulting backup is actually 
>>> "failed".
>>>
>>>
>>> To test this theory I set up the same cron trigger with more than one
>>> scheduled execution and the backups were suddenly created correctly ;-).
>>>
>>>
>>> So something needs to be done in the code to deal with this race condition.
>>>
>>> In the meantime, I will try to put a sleep action after the 'create 
>>> backup'
>>> action.
>>>
>>
>> Hi,
>>
>> Congrats on figuring out the issue. :-)
>>
>> Instead of a sleep, which may get you through this issue but fall into a
>> different one and won't return the right status code, you should
>> probably have a loop checking the status of the backup and return a non
>> zero status code if it ends up in "error" state.
>>
>> Cheers,
>> Gorka.
>>
>>>
>>> Best Regards
>>>
>>> Francois
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 9/20/19 4:02 PM, Gorka Eguileor wrote:
>>>> On 20/09, Francois Scheurer wrote:
>>>>> Hi Gorka
>>>>>
>>>>>
>>>>> We have a swift endpoint set up on openstack, which points to our
>>>>> ceph radosgw backend.
>>>>>
>>>>> Radosgw provides s3 & swift.
>>>>>
>>>>> So the swift logs are here actually the radosgw logs.
>>>>>
>>>> Hi,
>>>>
>>>> OK, thanks for the clarification.
>>>>
>>>> Then I assume you prefer the Swift backup driver over the Ceph one
>>>> because you are using one of the OpenStack releases that had trouble
>>>> with Incremental Backups on the Ceph backup driver.
>>>>
>>>> Cheers,
>>>> Gorka.
>>>>
>>>>
>>>>> Cheers
>>>>>
>>>>> Francois
>>>>>
>>>>>
>>>>>
>>>>> On 9/20/19 2:46 PM, Gorka Eguileor wrote:
>>>>>> On 20/09, Francois Scheurer wrote:
>>>>>>> Dear Gorka and Hervé
>>>>>>>
>>>>>>>
>>>>>>> Thanks for your hints.
>>>>>>>
>>>>>>> I have set the debug log level on radosgw.
>>>>>>>
>>>>>>> I will retest now and post here the results.
>>>>>>>
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> Francois
>>>>>> Hi,
>>>>>>
>>>>>> Sorry, I may have missed something in the conversation, weren't you
>>>>>> using Swift?
>>>>>>
>>>>>> I think you need to see the Swift logs as well, since that's the API
>>>>>> service that complained about the authorization.
>>>>>>
>>>>>> Cheers,
>>>>>> Gorka.
>>>>>>
>>>>>>>
>>>>>>>
>>>>
>>>
>>
>>
>>
-- 


EveryWare AG
François Scheurer
Senior Systems Engineer
Zurlindenstrasse 52a
CH-8003 Zürich

tel: +41 44 466 60 00
fax: +41 44 466 60 10
mail: francois.scheurer at everyware.ch
web: http://www.everyware.ch
