[openstack-dev] [Manila] Design summit notes
Ben Swartzlander
ben at swartzlander.org
Tue Nov 10 21:25:55 UTC 2015
I wasn't going to write a wrap-up email this time around since so many
people were able to attend in person, but enough people missed it that I
changed my mind and decided to write down my own impressions of the
sessions.
Wednesday working session: Migration Improvements
-------------------------------------------------
During this session we covered the status of the migration feature so
far (it's merged but experimental) and the existing gaps:
1) Fails for shares on most backends that use share servers
2) Controller node in the data path -- needs data copy service
3) No implementations of optimized migration yet
4) Confusion around task_state vs state
5) Need to disable most existing operations during a migration
6) Possibly need to change driver interface for getting mount info
Basically there is a lot of work left to be done on migration but we're
happy with the direction it's going. If we can address the gaps we could
make the APIs supported in Mitaka. We're eager to get to building the
valuable APIs on top of migration, but we can't do that until migration
itself is solid.
I also suggested that migration might benefit from an API change to
allow a 2-phase migration, which would let the user (the admin in this
case) control when the final cutover happens instead of letting it
happen by surprise.
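To make the idea concrete, here is a rough sketch of what a 2-phase
migration interface could look like. The method names and signatures
below are made up for illustration, not a settled design:

    # Hypothetical two-phase migration interface -- names are illustrative.
    class TwoPhaseMigrationDriver(object):

        def migration_start(self, context, source_share, destination_host):
            """Phase 1: start copying data to the destination.

            The source share stays available while the driver (or the
            data copy service) brings the destination in sync.
            """
            raise NotImplementedError()

        def migration_get_progress(self, context, source_share):
            """Report copy progress so the admin can decide when to cut over."""
            raise NotImplementedError()

        def migration_complete(self, context, source_share, destination_share):
            """Phase 2: the final cutover, triggered explicitly by the admin.

            Do a last incremental sync, switch export locations, and then
            remove the source copy.
            """
            raise NotImplementedError()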
Wednesday working session: Access Allow/Deny Driver Interface
-------------------------------------------------------------
During this session I proposed a new driver interface for
allowing/denying access to shares which is a single "sync access" API
that the manager would call and pass all of the rules to. The main
benefits of the change would be:
1) More reliable cleanup of errors
2) Support for atomically updating multiple rules
3) Simpler/more efficient implementation on some backends
Most vendors agreed that the new interface would be superior, and that
it would be simpler and more efficient to implement than the existing
one.
There were some who were unsure and one who specifically said an access
sync would be inefficient compared to the current allow/deny semantics.
We need to see if we can provide enough information in the new interface
to keep those backends efficient (such as providing the new rules AND
the diff against the old rules).
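As a strawman, the new call could look something like the sketch below.
The method and parameter names are assumptions on my part, not a final
interface; the point is that the driver gets the full rule set plus the
diff in a single call:

    # Strawman "sync access" driver call -- names are illustrative only.
    def update_access(self, context, share, access_rules, add_rules,
                      delete_rules, share_server=None):
        """Apply the complete set of access rules to a share in one call.

        access_rules: the full list of rules that should be in effect,
            e.g. [{'access_type': 'ip', 'access_to': '10.0.0.0/24',
                   'access_level': 'rw'}]
        add_rules: rules added since the last call (the diff), for
            backends that can apply incremental changes more cheaply.
        delete_rules: rules removed since the last call.
        """
        if add_rules or delete_rules:
            # Normal path: apply only the incremental changes.
            for rule in add_rules:
                self._allow_on_backend(share, rule)
            for rule in delete_rules:
                self._deny_on_backend(share, rule)
        else:
            # Cleanup/recovery path: resynchronize the full rule set.
            self._replace_all_rules_on_backend(share, access_rules)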
It was also pointed out that error reporting would be different using
this new interface, because errors applying rules couldn't be associated
with the specific rule that caused them. We need a solution to that problem.
Thursday fishbowl session: Share Replication
--------------------------------------------
There was a demo of the POC code from NetApp and general education on
the new design. Since the idea was new to some and the implementation
was new to everyone, there was not a lot of feedback.
We did discuss a few issues, such as whether multiple replicas of a
share should be allowed in a single AZ.
We agreed that this replication feature will be exposed to the same kind
of race conditions that exist in active/active HA, so there is
additional pressure to solve the distributed locking problem.
Fortunately the community seems to be converging on a solution to that
problem -- the tooz library.
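For context, a tooz-based lock around a racy operation looks roughly
like this (the coordination backend URL and names are placeholders):

    # Minimal tooz sketch: serialize a critical section across
    # manila-share instances. Backend URL and names are placeholders.
    from tooz import coordination

    coordinator = coordination.get_coordinator(
        'zookeeper://127.0.0.1:2181', b'manila-share-host-1')
    coordinator.start()

    lock = coordinator.get_lock(b'share-1234-replica-promote')
    with lock:
        # Only one manila-share process at a time runs the racy
        # operation, e.g. promoting a replica.
        pass

    coordinator.stop()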
We agreed that support for replication in a first party driver is
essential for the feature to be accepted -- otherwise developers who
don't have proprietary storage systems would be unable to develop/test
on the feature.
Thursday fishbowl session: Alternative Snapshot Semantics
---------------------------------------------------------
During this session I proposed 2 new things you can do with snapshots:
1) Revert a share to a snapshot
2) Exporting snapshots directly as readonly shares
For reverting snapshots, we agreed that the operation should preserve
all existing snapshots. If a backend is unable to revert without
deleting snapshots, it should not advertise the capability.
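A backend could advertise and implement this along the lines of the
sketch below; the method and capability names are illustrative, not a
final interface:

    # Illustrative only -- not a settled driver interface.
    def _update_share_stats(self):
        # Backends that cannot revert without destroying newer snapshots
        # simply do not advertise the capability.
        return {
            'share_backend_name': 'examplebackend',
            'revert_to_snapshot_support': True,
        }

    def revert_to_snapshot(self, context, share, snapshot,
                           share_server=None):
        """Roll the share's contents back to the given snapshot.

        All other snapshots of the share must survive the revert; if the
        backend cannot guarantee that, it must not advertise the
        capability.
        """
        self._backend_rollback(share['id'], snapshot['id'])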
For mounting snapshots, it was pointed out that we need to define the
access rules for the share. I proposed simply inheriting the rules for
the parent share, with rw rules squashed to ro. That approach has
downsides though, because it ties the access on the snapshot to the
access on the share (which may not be desired) and also forces us to
pass a list of snapshots into the access calls so the driver can update
snapshot access when updating share access.
Sage proposed creating a new concept of a readonly share and simply
overloading the existing create-share-from-snapshot logic with a
readonly flag, which gives us the semantics we want with much less
complexity. The downside to that approach is that we will have to add
code to handle readonly shares.
There was an additional proposal to allow the create/delete snapshot
calls without any other snapshot-related calls because some backends
have in-band snapshot semantics. This is probably unnecessary because
every backend that has snapshots is likely to support at least one of
the proposed semantics so we don't need another mode.
Thursday working session: Export Location Metadata
--------------------------------------------------
In this session we discussed the idea Jason proposed back in the winter
midcycle meetup to allow drivers to tag export locations with various
types of metadata which is meaningful to clients of Manila. There were
several proposed use cases discussed.
The main thing we agreed to was that the metadata itself seems like a
good thing as long as the metadata keys and values are standardized. We
didn't like the possibility of vendor-defined or admin-defined metadata
appearing in the export location API.
One use case discussed was the idea of preferred/non-preferred export
locations. This use case makes sense and nobody was against it.
Another use case discussed was "readonly" export locations which might
allow certain drivers to use different export locations for writing and
reading. There was debate about how much sense this made.
The third discussed use case was Jason's original suggestion: locality
information of different export locations to enable clients to determine
which export location is closer to them. We were generally opposed to
this idea because it's not clear how locality information would be
expressed. We didn't like admin-defined values, and the only standard
thing we have is AZs, which are already exposed through a different
mechanism.
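To illustrate what standardized metadata might look like in the export
location API (the key names here are examples, not agreed standards):

    # Hypothetical export-location payload with standardized metadata.
    export_locations = [
        {
            'path': '192.168.1.10:/shares/share-1234',
            'preferred': True,       # clients should try this path first
            'is_read_only': False,
        },
        {
            'path': '192.168.1.11:/shares/share-1234',
            'preferred': False,      # fallback path
            'is_read_only': True,    # e.g. a readonly export
        },
    ]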
Some time was spent in this session discussing the forthcoming Ceph
driver and its special requirements. We discussed the possibility of
dual-protocol access to shares (Ceph+NFS in this case). Dual protocol
access was previously rejected (NFS+CIFS) due to concerns about
interoperability. We still need to decide if we want to allow Ceph+NFS
as a special case based on the idea that Ceph shares would always
support NFS and there's unlikely to ever be a second Ceph driver. If we
allow this, then the share protocol would make sense to expose as a
metadata value.
Lastly we discussed the Ceph key-sharing requirement where people want
to use Manila to discover the Ceph secret. That would require adding
some new metadata, but on the access rules, not on the export locations.
Thursday working session: Interactions Between New Features
-----------------------------------------------------------
In this session we considered the possible interactions between share
migration, consistency groups, and share replication (the 3 new
experimental features).
We quickly concluded that in the current code, bad things are likely to
happen if you use 2 of these features at the same time, so as a top
priority we must prevent that and return a sensible error message
instead of allowing undefined behavior.
We spent the rest of the session discussing how these features should
interact and concluded that enabling the behaviors we want requires
significant new code for all of the pairings.
Migration+CGs: requires the concept of a CG instance or some way of
tracking which share instances make up the original CG and the migrated
CG. Alternatively, requires the ability to disband and reconstruct CGs.
A blueprint with an actual design is needed.
Migration+Replication: can probably be implemented by simply migrating
the primary (active) replica to the destination and re-replicating from
there. This requires significant new code in the migration paths though
because they'll need to rebuild the replication topology on the
destination side. Also, for safety the migration should not complete
until the destination side is fully replicated to avoid the chance of a
failure immediately after migration causing a loss of access. There may
be opportunities for optimized versions of the above, especially when
cross-AZ bandwidth is limited and the migration is within an AZ. More
thought is needed, and a blueprint should spell out the design.
Replication+CGs: it doesn't make a lot of sense to replicate individual
shares from a CG -- more likely users will want to replicate the whole
CG. This is an assumption though and we have no supporting data. Either
way, replication at the granularity of a CG would require more logic to
schedule the CG replicas before scheduling the individual share
replicas. This is
likely to be significant new code and need a blueprint.
Replication+CGs+Migration: this was proposed as a joke, but it's a
serious concern. The above designs should consider what happens if we
have a replicated CG and we wish to migrate it. If the above designs are
done carefully we should get correct behavior here for free.
Friday contributor meetup
-------------------------
On Friday we quickly reviewed the above topics for those that missed
earlier sessions, then we launched into the laundry list of topics that
we weren't able to fit into design sessions/working sessions.
QoS: Zhongjun proposed a QoS implementation similar to Cinder's. After a
brief discussion, there was not agreement that Cinder's model was a good
one to follow, as QoS was introduced to Cinder before some other later
enhancements such as standardized extra specs. We're inclined to use
standardized extra specs instead of QoS policies in Manila. We still
need to agree on which QoS-related extra specs we should standardize.
There are 2 criteria for standardizing on QoS-related extra specs: (1)
they should mean the same thing across vendors, e.g.
max-bytes-per-second, and (2) they should be something that's widely
implemented. I expect we'll see lots of vendor-specific QoS-related
extra specs, and we need to make sure it's possible to mix vendors in
the same share type and assign equivalent QoS policies across them.
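As an illustration of the distinction (the key names below are
examples, not agreed standards):

    # Example only -- key names are illustrative.
    share_type_extra_specs = {
        # Standardized: same meaning on every backend, widely implemented.
        'qos:max_bytes_per_second': '104857600',   # 100 MiB/s cap
        'qos:max_iops': '5000',

        # Vendor-specific: only one backend understands it; mixing
        # vendors in one share type means each vendor needs an
        # equivalent policy.
        'examplevendor:qos_policy_group': 'gold',
    }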
Minimum required features: The main open question about minimum features
was access control types for each protocol. We agreed that for CIFS,
user-based access control is required, and IP-based access control is
not. Furthermore, support for read-only access rules is required.
Improving gate performance/stability: We briefly discussed the gate
issues that plagued us during Liberty and our plan to address them,
which is to add more first party drivers that don't depend on
quickly-evolving projects. The big offender during Liberty was Neutron,
although Nova and Cinder have both bitten us in the past. To be clear,
the existing Generic driver is not going away, and will still be QA'd,
but we would rather not make it the gating driver.
Manila HA: We briefly discussed tooz and agreed that we will use it in
Manila to address race conditions that exist in active-active HA
scenarios, as well as race conditions introduced by the replication code.
Multiple share servers: There was some confusion about what a share
server is. Some assumed that it would require multiple Manila "share
servers" to implement a highly available shared storage system. In
reality, the "share server" in Manila is a collection of resources which
can include an arbitrary number of underlying physical or virtual
resources -- enough to implement a highly available solution with a
single share server.
Rolling Upgrades: We punted on this again. There was a discussion about
how slow the progress in Cinder appears to be and we don't want to get
trapped in limbo for multiple cycles. However if the work will
unavoidably take multiple cycles then we need to know that so we can
plan accordingly. Support for rolling upgrades is still viewed as
desirable, but the team is worried about the apparent implementation cost.
More mount automation: We covered what was done during Liberty (2
approaches) and briefly discussed the new approach Sage suggested. The
nova-metadata approach that Clinton proposed in Vancouver, which was not
accepted by the Nova team, will probably get a warmer reception based on
the additional integration Sage is proposing. We should re-propose it
and continue with our existing plans. There was also broad support for
Sage's NFS-over-vsock proposal and Nova share-attach semantics.
Key-based authentication: We went into more detail on John's Ceph-driver
requirements, and why he thinks it makes sense to communicate secrets
from the Ceph backend to the end user through Manila. We didn't
really reach a decision, but nobody was strongly against the change John
proposed. I think we're still interested in alternative proposals if
anyone has one.
Make all deleted columns booleans: We discussed soft deletes and the
fact that they're not implemented consistently in Manila today. We also
questioned the value of soft deletes and the reasoning for why we use them.
Some believed there was a performance benefit, and others suggested that
it had more to do with preservation of history for auditing and
troubleshooting. Depending on the real motivation, this proposal may
need to be scrapped or modified.
Replication 2.0: We discussed the remaining work for the share
replication feature before it could be accepted. There were 2 main
issues: support for share servers, and a first party driver
implementation of the feature. There was some dispute about the value of
a first party implementation but the strongest argument in favor was the
need for community members who didn't have access to proprietary
hardware to be able to maintain the replication code.
Functional tests for network plugins: We discussed the fact that the
existing network plugins aren't covered by tests in the gate. For
nova-network, we decided that was acceptable since that functionality is
deprecated. For the neutron network plugin and the standalone network
plugin, we need additional test jobs that exercise them.
Capability lists: The addition of standardized extra specs pointed out a
gap, which is that with some extra specs a backend should be able to
advertise both the positive and negative capability. In the past vendors
have achieved that with gross pairs of vendor-specific extra specs (e.g.
netapp_dedupe and netapp_no_dedupe). We agreed it would make more sense
for the backend to simply advertise dedupe=[False,True]. Changes to the
filter scheduler are needed to allow capability lists, so Clinton
volunteered to implement those changes.
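The scheduler change is conceptually small -- a list-valued capability
should match if any element matches. A rough sketch (not the actual
scheduler code):

    # Rough sketch of list-capability matching for the filter scheduler.
    def capability_satisfies(requested, capability):
        """Return True if the requested value matches the capability.

        A backend may advertise a list, e.g. dedupe=[False, True],
        meaning it can provision either way; the request matches if any
        element of the list matches.
        """
        if isinstance(capability, (list, tuple)):
            return any(capability_satisfies(requested, c)
                       for c in capability)
        return str(requested).lower() == str(capability).lower()

    capability_satisfies('True', [False, True])   # True
    capability_satisfies('True', False)           # False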
Client microversions: There wasn't much to discuss on the topic of
microversions. The client patches are in progress. Cinder seems to be
headed down a similar path. We are happy with the feature so far.
Fix delete operations: We discussed what the force-delete APIs should do
and what they shouldn't do. It was agreed that force-delete means
"remove it from the manila DB no matter what, and make a best effort
attempt to delete it off the backend". Some changes are needed to
support that semantic. We also discussed the common problem of "stuck"
shares and how to clean them up. We agreed that admins should typically
use the reset-state API and retry an ordinary delete. The force-delete
approach can leave garbage behind on the backend. The reset-state and
retry-delete approach should never leave garbage behind, so it's safer
to use.
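In pseudo-code, the difference between the two approaches is roughly
this (illustrative only, not the actual manila code):

    # Illustrative sketch of the agreed delete semantics.
    def force_delete(context, share, driver, db):
        """Remove the share from the DB no matter what; backend cleanup
        is only best-effort, so garbage may be left behind."""
        try:
            driver.delete_share(context, share)
        except Exception:
            pass   # best effort -- ignore backend failures
        db.share_delete(context, share['id'])

    def reset_and_retry_delete(context, share, driver, db):
        """Preferred admin path for 'stuck' shares: reset the state and
        retry an ordinary delete, which never leaves backend garbage."""
        db.share_update(context, share['id'], {'status': 'available'})
        driver.delete_share(context, share)   # raises if cleanup fails
        db.share_delete(context, share['id'])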
Remove all extensions: We discussed the current effort to move
extensions into core. This work is mostly done and not much discussion
was needed.
Removing task state: We agreed to remove the task-state column
introduced by migration and use ordinary states for migration.
Interaction with Nova attach file system API: We went into more detail
on Sage's Nova file-system-attach proposal and concluded that it should
"just work" without changes from Manila.
-Ben Swartzlander