[openstack-dev] [manila] Barcelona Design Summit summary

Joshua Harlow harlowja at fastmail.com
Fri Nov 4 18:00:07 UTC 2016

Ben Swartzlander wrote:
> Thanks to gouthamr for doing these writeups and for recording!
> We had a great turn out at the manila Fishbowl and working sessions.
> Important notes and Action Items are below:
> ===========================
> Fishbowl 1: Race Conditions
> ===========================
> Thursday 27th Oct / 11:00 - 11:40 / AC Hotel -Salon Barcelona - P1
> Etherpad: https://etherpad.openstack.org/p/ocata-manila-race-conditions
> Video: https://www.youtube.com/watch?v=__P7zQobAQw
> Gist:
> * We've some race conditions that have worsened over time:
> * Deleting a share while snapshotting the share
> * Two simultaneous delete-share calls
> * Two simultaneous create-snapshot calls
> * Though the end result of the race conditions is not terrible, we can
> leave resources in untenable states, requiring administrative cleanup in
> the worst scenario
> * Any type of resource interaction must be protected in the database
> with a test-and-set using the appropriate status fields
> * Any test-and-set must be protected with a lock
> * Locks must not be held over long running tasks: e.g., RPC casts, driver
> invocations etc.
> * We need more granular state transitions: micro/transitional states
> must be added per resource and judiciously used for state locking
> * Ex: Shares need a 'snapshotting' state
> * Ex: Share servers need states to signify setup phases, a la nova
> compute instances

Just something that I've always wondered, and I know it's not an easy
answer, but are there any ideas on why such concurrency issues keep on
getting discovered so late in the software lifecycle, instead of at
design time? Probably not just a manila question, but it strikes me as
somewhat confusing that this keeps on popping up.

> Discussion Item:
> * Locks in the manila-api service (or specifically, extending usage of
> locks across all manila services)
> * Desirable because:
> * Adding test-and-set logic at the database layer may render code
> unmaintainably complicated as opposed to using locking abstractions
> (oslo.concurrency / tooz)
> * Cinder has evolved an elegant test-and-set solution but we may not be
> able to benefit from that implementation because of the lack of being
> able to do multi-table updates and because the code references OVO which
> manila doesn't yet support.
> * Un-desirable because:
> * Most distributors (Red Hat/SUSE/Kubernetes-based/MOS) want to run more
> than one API service in active-active H/A.
> * If a true distributed locking mechanism isn't used/supported, the
> current file-locks would be useless in the above scenario.
> * Running file locks on shared file systems is a possibility, but
> imposes a configuration/set-up burden
> * Having all the locks on the share service would allow scale out of the
> API service and the share manager is really the place where things are
> going wrong
> * With a limited form of test-and-set, atomic state changes can still be
> achieved for the API service.
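
One possible reading of a "limited form of test-and-set" is a single
conditional UPDATE, where the database does the compare and the write
atomically and no external lock is needed. A minimal sketch, assuming a
one-table case; the `shares` schema and the `conditional_update` helper
are hypothetical, and sqlite3 is used only so the example is
self-contained:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shares (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO shares VALUES ('share-1', 'available')")
conn.commit()


def conditional_update(conn, share_id, expected, new):
    """Set status to `new` only if it is still `expected`."""
    cur = conn.execute(
        "UPDATE shares SET status = ? WHERE id = ? AND status = ?",
        (new, share_id, expected),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows means another caller won the race
```

With this, two simultaneous delete-share calls would both issue the same
UPDATE, but only one would see a row count of 1 and go on to cast to the
share manager.
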
> Agreed:
> * File locks will not help
> Action Items:
> (bswartz): Will propose a spec for the locking strategy
> (volunteers): Act on the spec ^ and help add more transitional states
> and locks (or test-and-set if any)
> (gouthamr): state transition diagrams for shares/share
> instances/replicas, access rules / instance access rules
> (volunteers): Review ^ and add state transition diagrams for
> snapshots/snapshot instances, share servers
> (mkoderer): will help with determining race conditions within
> manila-share with tests
> =====================================
> Fishbowl 2: Data Service / Jobs Table
> =====================================
> Thursday 27th Oct / 11:50 - 12:30 / AC Hotel - Salon Barcelona - P1
> Etherpad:
> https://etherpad.openstack.org/p/ocata-manila-data-service-jobs-table
> Video: https://www.youtube.com/watch?v=Sajy2Qjqbmk

Will https://review.openstack.org/#/c/260246/ help here instead?

It's the equivalent of:


Something to think about...

> Gist:
> * Currently, a synchronous RPC call is made from the API to the
> share-manager/data-service that's performing a migration to get the
> progress of a migration
> * We need a way to record progress of long running tasks: migration,
> backup, data copy etc.
> * We need to introduce a jobs table so that the respective service
> performing the long running task can write to the database and the API
> relies on the database
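
For what it's worth, the jobs table idea above could look roughly like
the following. The column names and the three helper functions are my own
guesses, not anything from ganso's spec, and sqlite3 stands in for the
real database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    " id TEXT PRIMARY KEY,"
    " share_id TEXT NOT NULL,"
    " task TEXT NOT NULL,"
    " progress INTEGER NOT NULL DEFAULT 0)"
)
conn.commit()


def start_job(job_id, share_id, task):
    # Called by the service that picks up the long-running task.
    conn.execute(
        "INSERT INTO jobs (id, share_id, task) VALUES (?, ?, ?)",
        (job_id, share_id, task),
    )
    conn.commit()


def report_progress(job_id, percent):
    # Called periodically by the service doing the work.
    conn.execute("UPDATE jobs SET progress = ? WHERE id = ?", (percent, job_id))
    conn.commit()


def get_progress(job_id):
    # Called by the API; a plain database read, no synchronous RPC.
    row = conn.execute(
        "SELECT progress FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
    return None if row is None else row[0]
```

The API then answers a progress query even when the share manager or data
service is busy or restarting.
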
> Discussion Items:
> * There was a suggestion to extend the jobs table to all tasks on the
> share: snapshotting, creating share from snapshot, extending, shrinking,
> etc.
> * We agreed not to do this because the table can easily grow out of
> control, and there isn't a solid use case to register all jobs. Maybe
> asynchronous user messages are a better answer to this feature request
> * "restartable" jobs would benefit from the jobs table
> * service heartbeats could be used to react to services dying while
> running long running jobs
> * When running the data service in active-active mode, a service going
> down can pass on its jobs to the other data service
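
The heartbeat-driven handover described above might be sketched like
this. Everything here is assumed for illustration: the 60 second timeout,
the plain dicts standing in for the jobs and services tables, and the
round-robin choice of a new owner.

```python
HEARTBEAT_TIMEOUT = 60  # seconds; hypothetical value


def reassign_stale_jobs(jobs, heartbeats, now, timeout=HEARTBEAT_TIMEOUT):
    """Hand jobs owned by dead services to live ones.

    jobs: dict of job id -> owning service name (updated in place)
    heartbeats: dict of service name -> last heartbeat timestamp
    Returns a dict of job id -> new owner for every job that moved.
    """
    alive = sorted(s for s, seen in heartbeats.items() if now - seen <= timeout)
    if not alive:
        return {}  # nobody left to take the jobs over
    orphaned = sorted(j for j, owner in jobs.items() if owner not in alive)
    moved = {}
    for i, job_id in enumerate(orphaned):
        new_owner = alive[i % len(alive)]  # spread jobs round-robin
        jobs[job_id] = new_owner
        moved[job_id] = new_owner
    return moved
```

Whether the new owner can actually resume a job depends on the job being
"restartable", which is where the progress recorded in the jobs table
would come in.
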
> Action Items:
> (ganso): Will determine the structure of the jobs table model in his spec
> (ganso): Will determine the benefit of the data service reacting to
> additions in the database rather than acting upon RPC requests
> =====================================
> Working Sessions 1: High Availability
> =====================================
> Thursday 27th Oct / 14:40 - 15:20 / CCIB - Centre de Convencions
> Internacional de Barcelona - P1 - Room 130
> Etherpad: https://etherpad.openstack.org/p/ocata-manila-high-availability
> Video: https://www.youtube.com/watch?v=xFk8ShK6qxU
> Gist:
> * We have a patch to introduce the tooz abstraction library to manila;
> it currently creates a tooz coordinator for the manila-share service and
> demonstrates replacing oslo.concurrency locks with tooz locks:
> https://review.openstack.org/#/c/318336/
> * The heartbeat seems to have issues, needs debugging
> * The owner/committer has tested this patch with both FileDriver and
> Kazoo/Zookeeper as tooz backends. We need to test other tooz backends
> * Distributors do not package dependencies for all tooz backends
> * We plan to introduce leader election via tooz. We plan to use this in
> cleanups, designate the service that performs polling (migration,
> replication of shares and snapshots, share server cleanup)
> * Code needs to be written to integrate the use of tooz/dlm via the
> manila devstack plugin so it can be gate tested
> Action Items:
> (gouthamr): Will document how to set up tooz with 2 or more share services
> (bswartz): Will set up a sub group of contributors to code/test H/A
> solutions in this release



