[openstack-dev] [manila] Barcelona Design Summit summary

Ben Swartzlander ben at swartzlander.org
Thu Nov 10 02:23:19 UTC 2016


On 11/04/2016 02:00 PM, Joshua Harlow wrote:
> Ben Swartzlander wrote:
>> Thanks to gouthamr for doing these writeups and for recording!
>>
>> We had a great turnout at the manila Fishbowl and working sessions.
>> Important notes and Action Items are below:
>>
>> ===========================
>> Fishbowl 1: Race Conditions
>> ===========================
>> Thursday 27th Oct / 11:00 - 11:40 / AC Hotel -Salon Barcelona - P1
>> Etherpad: https://etherpad.openstack.org/p/ocata-manila-race-conditions
>> Video: https://www.youtube.com/watch?v=__P7zQobAQw
>>
>> Gist:
>> * We have some race conditions that have worsened over time:
>> * Deleting a share while snapshotting the share
>> * Two simultaneous delete-share calls
>> * Two simultaneous create-snapshot calls
>> * Though the end result of these race conditions is not terrible, they can
>> leave resources in untenable states, requiring administrative cleanup in
>> the worst case
>> * Any type of resource interaction must be protected in the database
>> with a test-and-set using the appropriate status fields
>> * Any test-and-set must be protected with a lock
>> * Locks must not be held across long-running tasks, e.g. RPC casts, driver
>> invocations, etc.
>> * We need more granular state transitions: micro/transitional states
>> must be added per resource and judiciously used for state locking
>> * Ex: Shares need a 'snapshotting' state
>> * Ex: Share servers need states to signify setup phases, a la nova
>> compute instances
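As a purely illustrative sketch of the transitional-state idea (the state
names and transitions below are hypothetical, not manila's actual schema),
the share manager could validate state changes against an explicit table of
allowed transitions before attempting the atomic update:

    # Hypothetical sketch -- states and transitions are illustrative only.
    # The actual change must still be applied atomically in the database
    # (see the test-and-set sketch further down).

    VALID_TRANSITIONS = {
        'available':    {'snapshotting', 'deleting', 'extending'},
        'snapshotting': {'available', 'error'},
        'deleting':     {'deleted', 'error'},
    }

    def is_valid_transition(current_state, new_state):
        # True if a resource may move from current_state to new_state.
        return new_state in VALID_TRANSITIONS.get(current_state, set())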
>
> Just something that I've always wondered, and I know it's not an easy
> answer, but are there any ideas on why such concurrency issues keep
> getting discovered so late in the software lifecycle, instead of at
> design time? Probably not just a manila question, but it strikes me as
> somewhat confusing that this keeps popping up.

In the case of Manila the reason is historical. Manila forked from 
Cinder, and Cinder forked from Nova-Volume. Each inherited 
infrastructure and design choices, as well as design *assumptions* which 
didn't always remain true after the forks.

The basic problem is that the people who wrote (some of) the original 
code are no longer around and new people often assume that old stuff 
isn't broken, even when it is. Issues like concurrency problems can lie
dormant for a long time before they pop up because they're hard to test.

>> Discussion Item:
>> * Locks in the manila-api service (or, more generally, extending the use
>> of locks across all manila services)
>> * Desirable because:
>> * Adding test-and-set logic at the database layer may render the code
>> unmaintainably complicated compared to using locking abstractions
>> (oslo.concurrency / tooz)
>> * Cinder has evolved an elegant test-and-set solution, but we may not be
>> able to benefit from that implementation because we cannot do multi-table
>> updates and because the code relies on OVO (oslo.versionedobjects), which
>> manila doesn't yet support.
>> * Un-desirable because:
>> * Most distributors (RedHat/Suse/Kubernetes-based/MOS) want to run more
>> than one API service in active-active H/A.
>> * If a true distributed locking mechanism isn't used/supported, the
>> current file-locks would be useless in the above scenario.
>> * Running file locks on shared file systems is a possibility, but it
>> adds configuration/set-up burden
>> * Having all the locks in the share service would allow the API service to
>> scale out, and the share manager is really the place where things go
>> wrong
>> * With a limited form of test-and-set, atomic state changes can still be
>> achieved for the API service.
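As a minimal sketch of that "limited form of test-and-set" (the table object,
column names and states below are made up for illustration; real code would
sit behind manila's db API layer), an atomic status change boils down to a
conditional UPDATE:

    # Hypothetical sketch only.  Of two racing callers, exactly one sees
    # rowcount == 1; the loser gets a clean failure instead of silently
    # clobbering the winner's state change.
    import sqlalchemy as sa

    def try_set_status(engine, shares_table, share_id, expected, new):
        stmt = (sa.update(shares_table)
                .where(shares_table.c.id == share_id)
                .where(shares_table.c.status == expected)
                .values(status=new))
        with engine.begin() as conn:
            result = conn.execute(stmt)
        return result.rowcount == 1   # False: another caller won the race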
>>
>> Agreed:
>> * File locks will not help
>>
>> Action Items:
>> (bswartz): Will propose a spec for the locking strategy
>> (volunteers): Act on the spec ^ and help add more transitional states
>> and locks (or test-and-set where needed)
>> (gouthamr): state transition diagrams for shares/share
>> instances/replicas, access rules / instance access rules
>> (volunteers): Review ^ and add state transition diagrams for
>> snapshots/snapshot instances, share servers
>> (mkoderer): will help with determining race conditions within
>> manila-share with tests
>>
>> =====================================
>> Fishbowl 2: Data Service / Jobs Table
>> =====================================
>> Thursday 27th Oct / 11:50 - 12:30 / AC Hotel - Salon Barcelona - P1
>> Etherpad:
>> https://etherpad.openstack.org/p/ocata-manila-data-service-jobs-table
>> Video: https://www.youtube.com/watch?v=Sajy2Qjqbmk
>
> Will https://review.openstack.org/#/c/260246/ help here instead?
>
> It's the equivalent of:
>
> http://docs.openstack.org/developer/taskflow/jobs.html
>
> Something to think about...
>
>>
>> Gist:
>> * Currently, the API makes a synchronous RPC call to the
>> share-manager/data-service that's performing a migration in order to get
>> the migration's progress
>> * We need a way to record progress of long running tasks: migration,
>> backup, data copy etc.
>> * We need to introduce a jobs table so that the service performing the
>> long-running task can write its progress to the database and the API can
>> read it from there instead
>>
>> Discussion Items:
>> * There was a suggestion to extend the jobs table to all tasks on the
>> share: snapshotting, creating share from snapshot, extending, shrinking,
>> etc.
>> * We agreed not to do this because the table could easily grow out of
>> control, and there isn't a solid use case for registering all jobs.
>> Asynchronous user messages may be a better answer to this feature request
>> * "restartable" jobs would benefit from the jobs table
>> * Service heartbeats could be used to react to services dying while
>> running long-running jobs
>> * When running the data service in active-active mode, the jobs of a
>> service that goes down can be taken over by another data service
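Purely to illustrate the kind of information such a table might record (every
column below is hypothetical; the actual model is for ganso's spec to define),
a minimal SQLAlchemy sketch:

    # Hypothetical jobs table -- columns are illustrative, not a proposal.
    import sqlalchemy as sa

    metadata = sa.MetaData()

    jobs = sa.Table(
        'jobs', metadata,
        sa.Column('id', sa.String(36), primary_key=True),
        sa.Column('resource_id', sa.String(36), nullable=False),
        sa.Column('type', sa.String(32)),      # e.g. 'migration', 'data_copy'
        sa.Column('state', sa.String(32)),     # e.g. 'running', 'error', 'done'
        sa.Column('progress', sa.Integer),     # percent complete, read by the API
        sa.Column('host', sa.String(255)),     # service instance owning the job
        sa.Column('updated_at', sa.DateTime),  # lets heartbeats spot dead owners
    )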
>>
>> Action Items:
>> (ganso): Will determine the structure of the jobs table model in his spec
>> (ganso): Will determine the benefit of the data service reacting to
>> additions in the database rather than acting upon RPC requests
>>
>> =====================================
>> Working Sessions 1: High Availability
>> =====================================
>> Thursday 27th Oct / 14:40 - 15:20 / CCIB - Centre de Convencions
>> Internacional de Barcelona - P1 - Room 130
>> Etherpad: https://etherpad.openstack.org/p/ocata-manila-high-availability
>> Video: https://www.youtube.com/watch?v=xFk8ShK6qxU
>>
>> Gist:
>> * We have a patch to introduce the tooz abstraction library to manila; it
>> currently creates a tooz coordinator for the manila-share service and
>> demonstrates replacing oslo.concurrency locks with tooz locks:
>> https://review.openstack.org/#/c/318336/
>> * The heartbeat seems to have issues, needs debugging
>> * The owner/committer has tested this patch with both FileDriver and
>> Kazoo/ZooKeeper as tooz backends. We need to test other tooz backends
>> * Distributors do not package dependencies for all tooz backends
>> * We plan to introduce leader election via tooz, and to use it for
>> cleanups and for designating the service that performs polling (migration,
>> replication of shares and snapshots, share server cleanup)
>> * Code needs to be written to integrate the use of tooz/dlm via the
>> manila devstack plugin so it can be gate tested
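For reference, the basic tooz pattern the patch exercises looks roughly like
this (the ZooKeeper URL, member id and lock name are placeholders, not what
the devstack plugin will actually configure):

    # Minimal tooz usage sketch -- backend URL, member id and lock name are
    # placeholders for illustration only.
    from tooz import coordination

    coordinator = coordination.get_coordinator(
        'kazoo://127.0.0.1:2181', b'manila-share@host1')
    coordinator.start(start_heart=True)   # heartbeat keeps membership alive

    lock = coordinator.get_lock(b'manila-share-12345')
    with lock:
        # critical section: serialized across all manila-share instances
        pass

    coordinator.stop()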
>>
>> Action Items:
>> (gouthamr): Will document how to set up tooz with 2 or more share
>> services
>> (bswartz): Will set up a sub group of contributors to code/test H/A
>> solutions in this release
>>
>
> <cut>
>
> -Josh
>



