[openstack-dev] [trove][all][tc] A proposal to rearchitect Trove

Clint Byrum clint at fewbar.com
Thu Jun 22 18:17:01 UTC 2017


tl;dr - I think Trove's successor has a future, but there are two
conflicting ideas presented and Trove should pick one or the other.

Excerpts from Amrith Kumar's message of 2017-06-18 07:35:49 -0400:
> 
> We have learned a lot from v1, and the hope is that we can address that in
> v2. Some of the more significant things that I have learned are:
> 
> - We should adopt a versioned front-end API from the very beginning; making
> the REST API versioned is not a ‘v2 feature’
> 

+1

> - A guest agent running on a tenant instance, with connectivity to a shared
> management message bus is a security loophole; encrypting traffic,
> per-tenant-passwords, and any other scheme is merely lipstick on a security
> hole
>

This is a broad statement, and I'm not sure I understand the actual risk
you're presenting here as "a security loophole".

How else would you administer a database server than through some kind
of agent? Whether that agent is a python daemon of our making, sshd, or
whatever kubernetes component lets you change things, they're all
administrative pieces that sit next to the resource.
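
For concreteness, that administrative piece is nothing exotic: it's an
RPC server consuming its own topic on the management bus. A minimal
sketch, assuming oslo.messaging; the topic, credentials, and
GuestEndpoint method are invented for illustration:

    import oslo_messaging
    from oslo_config import cfg


    class GuestEndpoint(object):
        """Hypothetical administrative operations living next to the DB."""

        def restart_database(self, ctxt):
            # A real agent would shell out to systemd or use the
            # database's own control interface here.
            print('restarting database server')


    def main():
        # Each guest consumes only its own topic. The security question
        # is who may publish to this transport, not whether an agent
        # exists at all.
        transport = oslo_messaging.get_rpc_transport(
            cfg.CONF, url='rabbit://guest:guest@management-bus:5672/')
        target = oslo_messaging.Target(topic='guestagent.instance-1234',
                                       server='instance-1234')
        server = oslo_messaging.get_rpc_server(transport, target,
                                               [GuestEndpoint()])
        server.start()
        server.wait()


    if __name__ == '__main__':
        main()

Whether that is a "security loophole" comes down to who can publish to
that transport and topic, not to the existence of the agent itself.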

> - Reliance on Nova for compute resources is fine, but dependence on Nova VM
> specific capabilities (like instance rebuild) is not; it makes things like
> containers or bare-metal second class citizens
> 

I wholeheartedly agree that rebuild is a poor choice for database
servers. In fact, I believe it is a completely non-scalable feature that
should not even exist in Nova.

This is kind of a "what we shouldn't do", though, without the "what we
should". What should we be running database clusters on?

> - A fair portion of what Trove does is resource orchestration; don’t
> reinvent the wheel, there’s Heat for that. Admittedly, Heat wasn’t as far
> along when Trove got started but that’s not the case today and we have an
> opportunity to fix that now
> 

Yeah. You can do that. I'm not really sure what it gets you at this
level. There was an effort a few years ago to use Heat for Trove and
some other pieces, but it fell short at the point where Heat was missing
a few features, like, oddly enough, rebuild confirmation after test.
Also, requiring Heat in a cloud adds friction to your project: that's a
whole new service an operator has to manage, and to decide whether or
not to expose to users, just for Trove. A dependency that massive should
come with something significant in return, and I don't see what it
actually gets you when you already have to keep track of your resources
for cluster membership purposes anyway.
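
For reference, "there's Heat for that" in practice means something like
the following sketch, using python-heatclient; the credentials, stack
name, and template body here are all invented:

    from keystoneauth1 import session
    from keystoneauth1.identity import v3
    from heatclient import client as heat_client

    auth = v3.Password(auth_url='http://keystone:5000/v3',
                       username='trove', password='secret',
                       project_name='service',
                       user_domain_name='Default',
                       project_domain_name='Default')
    heat = heat_client.Client('1', session=session.Session(auth=auth))

    # A trivial HOT template: one server that would host a database.
    template = {
        'heat_template_version': '2016-10-14',
        'resources': {
            'db_server': {
                'type': 'OS::Nova::Server',
                'properties': {'image': 'trove-mysql-guest',
                               'flavor': 'db.medium'},
            },
        },
    }

    heat.stacks.create(stack_name='trove-db-1', template=template)

It works, but Trove still has to track which stack belongs to which
cluster, so the bookkeeping doesn't go away.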

> - A similarly significant portion of what Trove does is to implement a
> state-machine that will perform specific workflows involved in implementing
> database specific operations. This makes the Trove taskmanager a stateful
> entity. Some of the operations could take a fair amount of time. This is a
> serious architectural flaw
> 

A state driven workflow is unavoidable if you're going to do cluster
manipulation. So you can defer this off to Mistral or some other
workflow engine, but I don't think it's an architectural flaw _that
Trove does it_. Clusters have states. They have to be tracked. Do that
well and your users will be happy.
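
To make that concrete: the heart of it is nothing more than explicit,
persisted states and a table of legal transitions. An illustrative
sketch (the states and transitions here are made up, not Trove's actual
ones):

    import enum


    class ClusterState(enum.Enum):
        BUILDING = 'building'
        ACTIVE = 'active'
        GROWING = 'growing'
        SHRINKING = 'shrinking'
        ERROR = 'error'


    # Legal transitions; anything else is a bug, not a retry.
    TRANSITIONS = {
        ClusterState.BUILDING: {ClusterState.ACTIVE, ClusterState.ERROR},
        ClusterState.ACTIVE: {ClusterState.GROWING,
                              ClusterState.SHRINKING},
        ClusterState.GROWING: {ClusterState.ACTIVE, ClusterState.ERROR},
        ClusterState.SHRINKING: {ClusterState.ACTIVE, ClusterState.ERROR},
        ClusterState.ERROR: set(),
    }


    def advance(cluster, new_state):
        """Move a cluster to new_state, refusing illegal transitions."""
        if new_state not in TRANSITIONS[cluster.state]:
            raise ValueError('illegal transition %s -> %s'
                             % (cluster.state, new_state))
        cluster.state = new_state
        cluster.save()  # hypothetical: must be durable, since the
                        # operations around it can take a long time

Whether that table lives in Trove's taskmanager or gets handed to
Mistral, somebody has to own it and persist it.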

> - Tenants should not ever be able to directly interact with the underlying
> storage and compute used by database instances; that should be the default
> configuration, not an untested deployment alternative
> 

Agreed. There's no point in having an "inside the cloud" service if
you're just going to hand them the keys to the VMs and volumes anyway.

The point of something like Trove is to retain control at the operator
level, and to give users only the interface you promised, optimized
without the limitations of the general-purpose cloud.
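
Concretely, "retain control at the operator level" usually means
provisioning with the service's own credentials under a service
project, so the VM and volume never show up in the tenant's quota or
server list. A sketch with python-novaclient; every name and credential
below is invented:

    from keystoneauth1 import session
    from keystoneauth1.identity import v3
    from novaclient import client as nova_client

    # Trove's credentials, not the tenant's: the resulting server
    # belongs to 'trove-service' and is invisible to the end user.
    auth = v3.Password(auth_url='http://keystone:5000/v3',
                       username='trove', password='secret',
                       project_name='trove-service',
                       user_domain_name='Default',
                       project_domain_name='Default')
    nova = nova_client.Client('2.1', session=session.Session(auth=auth))

    image = nova.glance.find_image('trove-mysql-guest')
    flavor = nova.flavors.find(name='db.medium')
    nova.servers.create(name='db-instance-1234',
                        image=image, flavor=flavor)

The tenant only ever sees the database endpoint Trove hands back, which
is exactly the point.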

> - The CI should test all databases that are considered to be ‘supported’
> without excessive use of resources in the gate; better code modularization
> will help determine the tests which can safely be skipped in testing changes
> 

Take the same approach as the other driver-hosting projects: if it's
in-tree, it has to have a gate test.

> - Clusters should be first class citizens not an afterthought, single
> instance databases may be the ‘special case’, not the other way around
> 

+1

> - The project must provide guest images (or at least complete tooling for
> deployers to build these); while the project can’t distribute operating
> systems and database software, the current deployment model merely impedes
> adoption
> 

IIRC the project provides dib elements and a basic command line to build
images for your cloud, yes? Has that not worked out?

> - Clusters spanning OpenStack deployments are a real thing that must be
> supported
> 

This is the most problematic thing you asserted. There are two basic
desires I see that drive a Trove adoption:

1) I need database clusters and I don't know how to do them right.
2) I need _high performance/availability/capacity_ databases and my
cloud's standard VM flavors/hosts/networks/disks/etc. stand in the way
of that.

For the OpenStack-spanning cluster thing, (1) is fine. But (1) can and
probably should be handled by things like Helm, Juju, Ansible, Habitat,
Docker Compose, etc.

(2) is much more likely to draw people into an official "inside the cloud"
Trove deployment. Let the operators install Ironic, wire up some baremetal
with huge disks or powerful RAID controllers or an infiniband mesh,
and build their own images with tuned kernels and tightly controlled
builds of MySQL/MariaDB/Postgres/MongoDB/etc.

Don't let the users know anything about the computers their database
cluster runs on. They get cluster access details, and knobs that
are workload specific. But not all the knobs, just the knobs that an
operator can't possibly know. And in return you give them highly capable
databases.

But (2) is directly counter to (1). I would say pick one, and focus on
that for Trove. To me, (2) is the more interesting story. (1) is a place
to let 1000 flowers bloom (in many cases they already have, and just
need porting from AWS/GCE/Azure/DigitalOcean to OpenStack). If you want
to run cross-cloud, you are accepting the limitations of multi-cloud,
and should likely be building cloud-native apps that don't rely on a
beefy database cluster.



