[openstack-dev] [trove][all][tc] A proposal to rearchitect Trove
Amrith Kumar
amrith.kumar at gmail.com
Wed Jul 12 11:14:28 UTC 2017
All:
First, let me thank all of you who responded and provided feedback
on what I wrote. I've summarized what I heard below and am posting
it as one consolidated response rather than responding to each
of your messages and making this thread even deeper.
As I say at the end of this email, I will be setting up a session at
the Denver PTG to specifically continue this conversation and hope
you will all be able to attend. As soon as time slots for PTG are
announced, I will try and pick this slot and request that you please
attend.
----
Thierry: naming issue; call it Hoard if it does not have a migration
path.
----
Kevin: use a container approach with k8s as the orchestration
mechanism, addresses multiple issues including performance. Trove to
provide containers for multiple components which cooperate to provide
a single instance of a database or cluster. Don't put all components
(agent, monitoring, database) in a single VM, decoupling makes
migration and upgrades easier and allows trove to reuse database
vendor supplied containers. Performance of databases in VMs is poor
compared to databases on bare-metal.
----
Doug Hellmann:
> Does "service VM" need to be a first-class thing? Akanda creates
> them, using a service user. The VMs are tied to a "router" which is
> the billable resource that the user understands and interacts with
> through the API.
Amrith: Doug, yes, because we're looking not just for service VMs but all
resources provisioned by a service. So, to Matt's comment about a
blackbox DBaaS, the VMs, storage, snapshots, ... they should all be
owned by the service, charged to a user's quota but not visible to the
user directly.
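A rough sketch of that ownership model, in case it helps the discussion.
Everything here is hypothetical (the names Resource, SERVICE_OWNER,
visible_to and quota_used are made up for illustration and are not
Trove or Nova APIs); it only shows the idea that a resource can be
owned by the service, counted against a tenant's quota, and still be
hidden from that tenant.

```python
from dataclasses import dataclass

# Hypothetical identity of the service that owns provisioned resources.
SERVICE_OWNER = "dbaas-service"


@dataclass(frozen=True)
class Resource:
    kind: str        # e.g. "vm", "volume", "snapshot"
    owner: str       # who can see and manage it via infrastructure APIs
    billed_to: str   # tenant whose quota is charged
    units: int = 1   # quota units consumed


def visible_to(tenant, resources):
    """Resources a tenant can list directly: only the ones it owns."""
    return [r for r in resources if r.owner == tenant]


def quota_used(tenant, resources):
    """Quota charged to a tenant, including service-owned resources."""
    return sum(r.units for r in resources if r.billed_to == tenant)


resources = [
    Resource("vm", SERVICE_OWNER, "tenant-a"),             # database VM
    Resource("volume", SERVICE_OWNER, "tenant-a", units=10),
    Resource("vm", "tenant-a", "tenant-a"),                # tenant's own VM
]
```

With this data, tenant-a sees only its own VM when listing resources,
while its quota reflects all three.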
----
Jay:
> Frankly, I believe all of these types of services should be built
> as applications that run on OpenStack (or other)
> infrastructure. In other words, they should not be part of the
> infrastructure itself.
>
> There's really no need for a user of a DBaaS to have access to the
> host or hosts the DB is running on. If the user really wanted
> that, they would just spin up a VM/baremetal server and install
> the thing themselves.
and subsequently in follow-up with Zane:
> Think only in terms of what a user of a DBaaS really wants. At the
> end of the day, all they want is an address in the cloud where they
> can point their application to write and read data from.
> ...
> At the end of the day, I think Trove is best implemented as a hosted
> application that exposes an API to its users that is entirely
> separate from the underlying infrastructure APIs like
> Cinder/Nova/Neutron.
Amrith: Yes, I agree, +1000
----
Clint (in response to Jay's proposal regarding the service making all
resources multi-tenant) raised a concern about having multi-tenant
shared resources. The issue is with ensuring separation between
tenants (don't want to use the word isolation because this is database
related).
Amrith: yes, definitely a concern and one that we don't have today
because each DB is a VM of its own. Personally, I'd rather stick with
that construct, one DB per VM/container/baremetal and leave that be
the separation boundary.
----
Zane: Discomfort over throwing out working code, grass is greener on
the other side, is there anything to salvage?
Amrith: Yes, there is certainly a 'grass is greener with a rewrite'
fallacy. But, there is stuff that can be salvaged. The elements are
still good, they are separable and can be used with the new
project. Much of the controller logic however will fall by the
wayside.
In a similar vein, Clint asks about the elements that Trove provides,
"how has that worked out".
Amrith: Honestly, not well. Trove only provided reference elements
suitable for development use. Never really production hardened
ones. For example, the image elements trove provides don't bake the
guest agent in; they assume that at VM launch, the guest agent code
will be slurped (technical term) from the controller and
launched. Great for debugging, not great for production. That is
something that should change. But, equally, I've heard disagreements
saying that slurping the guest agent at runtime is clever and good
in production.
----
Zane: consider using Mistral for workflow.
> The disadvantage, obviously, is that it requires the cloud to offer
> Mistral as-a-Service, which currently doesn't include nearly as many
> clouds as I'd like.
Amrith: Yes, as we discussed, we are in agreement with both parts of
this recommendation.
Zane, Jay and Dims: a subtle distinction between Tessmaster and Magnum
(I want a database, figure out the lower layers, vs. I want a k8s
cluster).
----
Zane: Fun fact: Trove started out as a *complete fork* of Nova(!).
Amrith: Not fun at all :) Never, ever, ever, ever f5g do that
again. Yeah, sure, if you can have i18n, and k8s, I can have f5g :)
----
Thierry:
> We generally need to be very careful about creating dependencies
> between OpenStack projects.
> ...
> I understand it's a hard trade-off: you want to reuse functionality
> rather than reinvent it in every project... we just need to
> recognize the cost of doing that.
Amrith: Yes, this is part of my concern re: Mistral, and earlier in
trove's life on depending on Manila for Oracle RAC. Clint raised a
similar concern about the dependency on Heat.
In response, Kevin:
> That view of dependencies is why Kubernetes development is outpacing
> OpenStacks and some users are leaving IMO. Not trying to be mean
> here but trying to shine some light on this issue.
I disagree, but that's a topic for another email thread and maybe not
even an email thread but an in-person conversation with suitable
beverages. It is a religious discussion which is best handled in a
different forum; such as the emacs-vi forum.
----
I wrote:
> - A guest agent running on a tenant instance, with connectivity to a
> shared management message bus is a security loophole; encrypting
> traffic, per-tenant-passwords, and any other scheme is merely
> lipstick on a security hole
Clint asks:
> This is a broad statement, and I'm not sure I understand the actual
> risk you're presenting here as "a security loophole".
> How else would you administer a database server than through some
> kind of agent? Whether that agent is a python daemon of our making,
> sshd, or whatever kubernetes component lets you change things,
> they're all administrative pieces that sit next to the resource.
Amrith: The issue is that the guest agent (currently) running in a
tenant's context needs to establish a connection to a shared rabbitmq
server running in the service (control plane) context. I am fine with
a guest agent running in the control plane establishing a connection
into a guest VM if required, not the other way around.
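To make the direction rule concrete, here is a toy sketch. It is not
how Trove, oslo.messaging or rabbitmq are configured; Zone and
connection_allowed are hypothetical names, and the only thing modeled
is which side is permitted to initiate a connection.

```python
from enum import Enum


class Zone(Enum):
    CONTROL_PLANE = "control-plane"  # service context (API, taskmanager, bus)
    TENANT = "tenant"                # guest VM running in a tenant's context


def connection_allowed(initiator: Zone, target: Zone) -> bool:
    """Deny connections initiated from a tenant into the control plane.

    The control plane may reach into a guest; a guest may not reach
    back into the shared management infrastructure.
    """
    if initiator is Zone.TENANT and target is Zone.CONTROL_PLANE:
        return False
    return True


# Today's model: the guest agent dials out to the shared message bus.
current_model_ok = connection_allowed(Zone.TENANT, Zone.CONTROL_PLANE)

# Preferred model: an agent in the control plane connects into the guest.
preferred_model_ok = connection_allowed(Zone.CONTROL_PLANE, Zone.TENANT)
```

Under this rule the current model is rejected and the preferred one is
accepted, which is the whole point of the objection above.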
----
Clint makes a distinction between a database cluster within an
OpenStack deployment and an uber database cluster spanning clouds,
recommending that the latter is best left to a tertiary
orchestrator. Further, these are two distinct things, pick one and do
it well.
Amrith: A valid approach and one that will allow Trove to focus on the
high value single OpenStack deployment of a db cluster (and to Jay's
point, do it well).
----
Consensus:
Trove should expose (what Matt Fischer calls) BlackBox DB, not storage +
compute.
Address rabbitmq security concerns differently; move guest agent off
instance.
Don't reinvent the orchestration piece.
Fewer DBs, better support
Clusters are first class citizens, not an afterthought
Clusters spanning regions and openstack deployments
Restart the service VMs discussion:
https://review.openstack.org/#/c/438134/
----
Several people emailed me privately and said they (or their companies)
would like to invest resources in Trove. Some indicated that they (or
their companies) would like to invest resources in Trove if the
commitment was towards a certain direction or technology choice.
Others have offered resources if the direction would be to provide
an AWS compatible API.
To anyone who wants to contribute resources to a project, please do
it. Big companies considering contributing one or two people to a
project and making it seem like a big decision is really an indication
of a lack of seriousness. If the project is really valuable to you,
you'd have put people on it already. The fact that you haven't speaks
volumes.
To those who want to place pre-conditions on technology choice, I have
no (good) words for you.
Thanks to all who participated, I appreciate all the input. I will be
setting up a session at the Denver PTG to specifically continue this
conversation and hope you will all be able to attend. As soon as time
slots for PTG are announced, I will try and pick this slot and request
that you please attend.
Thanks,
-amrith
On Sun, Jun 18, 2017 at 6:35 AM, Amrith Kumar <amrith.kumar at gmail.com>
wrote:
> Trove has evolved rapidly over the past several years, since integration
> in IceHouse when it only supported single instances of a few databases.
> Today it supports a dozen databases including clusters and replication.
>
> The user survey [1] indicates that while there is strong interest in the
> project, there are few large production deployments that are known of (by
> the development team).
>
> Recent changes in the OpenStack community at large (company realignments,
> acquisitions, layoffs) and the Trove community in particular, coupled with
> a mounting burden of technical debt have prompted me to make this proposal
> to re-architect Trove.
>
> This email summarizes several of the issues that face the project, both
> structurally and architecturally. This email does not claim to include a
> detailed specification for what the new Trove would look like, merely the
> recommendation that the community should come together and develop one so
> that the project can be sustainable and useful to those who wish to use it
> in the future.
>
> TL;DR
>
> Trove, with support for a dozen or so databases today, finds itself in a
> bind because there are few developers, and a code-base with a significant
> amount of technical debt.
>
> Some architectural choices which the team made over the years have
> consequences which make the project less than ideal for deployers.
>
> Given that there are no major production deployments of Trove at present,
> this provides us an opportunity to reset the project, learn from our v1 and
> come up with a strong v2.
>
> An important aspect of making this proposal work is that we seek to
> eliminate the effort (planning, and coding) involved in migrating existing
> Trove v1 deployments to the proposed Trove v2. Effectively, with work
> beginning on Trove v2 as proposed here, Trove v1 as released with Pike will
> be marked as deprecated and users will have to migrate to Trove v2 when it
> becomes available.
>
> While I would very much like to continue to support the users on Trove v1
> through this transition, the simple fact is that absent community
> participation this will be impossible. Furthermore, given that there are no
> production deployments of Trove at this time, it seems pointless to build
> that upgrade path from Trove v1 to Trove v2; it would be the proverbial
> bridge from nowhere.
>
> This (previous) statement is, I realize, contentious. There are those who
> have told me that an upgrade path must be provided, and there are those who
> have told me of unnamed deployments of Trove that would suffer. To this,
> all I can say is that if an upgrade path is of value to you, then please
> commit the development resources to participate in the community to make
> that possible. But equally, preventing a v2 of Trove or delaying it will
> only make the v1 that we have today less valuable.
>
> We have learned a lot from v1, and the hope is that we can address that in
> v2. Some of the more significant things that I have learned are:
>
> - We should adopt a versioned front-end API from the very beginning;
> making the REST API versioned is not a ‘v2 feature’
>
> - A guest agent running on a tenant instance, with connectivity to a
> shared management message bus is a security loophole; encrypting traffic,
> per-tenant-passwords, and any other scheme is merely lipstick on a security
> hole
>
> - Reliance on Nova for compute resources is fine, but dependence on Nova
> VM specific capabilities (like instance rebuild) is not; it makes things
> like containers or bare-metal second class citizens
>
> - A fair portion of what Trove does is resource orchestration; don’t
> reinvent the wheel, there’s Heat for that. Admittedly, Heat wasn’t as far
> along when Trove got started but that’s not the case today and we have an
> opportunity to fix that now
>
> - A similarly significant portion of what Trove does is to implement a
> state-machine that will perform specific workflows involved in implementing
> database specific operations. This makes the Trove taskmanager a stateful
> entity. Some of the operations could take a fair amount of time. This is a
> serious architectural flaw
>
> - Tenants should not ever be able to directly interact with the underlying
> storage and compute used by database instances; that should be the default
> configuration, not an untested deployment alternative
>
> - The CI should test all databases that are considered to be ‘supported’
> without excessive use of resources in the gate; better code modularization
> will help determine the tests which can safely be skipped in testing changes
>
> - Clusters should be first class citizens not an afterthought, single
> instance databases may be the ‘special case’, not the other way around
>
> - The project must provide guest images (or at least complete tooling for
> deployers to build these); while the project can’t distribute operating
> systems and database software, the current deployment model merely impedes
> adoption
>
> - Clusters spanning OpenStack deployments are a real thing that must be
> supported
>
> This might sound harsh, that isn’t the intent. Each of these is the
> consequence of one or more perfectly rational decisions. Some of those
> decisions have had unintended consequences, and others were made knowing
> that we would be incurring some technical debt; debt we have not had the
> time or resources to address. Fixing all these is not impossible, it just
> takes the dedication of resources by the community.
>
> I do not have a complete design for what the new Trove would look like.
> For example, I don’t know how we will interact with other projects (like
> Heat). Many questions remain to be explored and answered.
>
> Would it suffice to just use the existing Heat resources and build
> templates around those, or will it be better to implement custom Trove
> resources and then orchestrate things based on those resources?
>
> Would Trove implement the workflows required for multi-stage database
> operations by itself, or would it rely on some other project (say Mistral)
> for this? Is Mistral really a workflow service, or just cron on steroids? I
> don’t know the answer but I would like to find out.
>
> While we don’t have the answers to these questions, I think this is a
> conversation that we must have, one that we must decide on, and then as a
> community commit the resources required to make a Trove v2 which delivers
> on the mission of the project; “To provide scalable and reliable Cloud
> Database as a Service provisioning functionality for both relational and
> non-relational database engines, and to continue to improve its
> fully-featured and extensible open source framework.”[2]
>
> Thanks,
>
> -amrith
>
>
> [1] https://www.openstack.org/assets/survey/April2017SurveyReport.pdf
> [2] https://wiki.openstack.org/wiki/Trove#Mission_Statement
>
>
>