Re: [openstack-dev] [TripleO][Edge] Reduce base layer of containers for security and size of images (maintenance) sakes

29 Nov 2018

On 11/29/2018 04:28 AM, Bogdan Dobrelya wrote:
...
On 11/28/18 8:55 PM, Doug Hellmann wrote:
...
I thought the preferred solution for more complex settings was config 
maps. Did that approach not work out?
Regardless, now that the driver work is done if someone wants to take 
another stab at etcd integration it’ll be more straightforward today.
Doug
While sharing configs is a feasible option to consider for large-scale 
configuration management, Etcd provides only strong consistency, which 
the Jepsen consistency map classes as "Unavailable" [0]. For edge 
scenarios, to configure 40,000 remote computes over WAN connections, we 
would rather have weaker consistency models, like "Sticky Available" 
[0]. That would allow services to fetch their configuration from a 
central "uplink", or locally when the former is not accessible from 
remote edge sites. I'm afraid Etcd cannot provide 40,000 local endpoints 
to fit that case, even if those would be read-only replicas. That is 
also something I'm highlighting in the paper [1] drafted for ICFC-2019.
But if we had such a sticky-available key-value storage solution, we 
would indeed have solved the problem of executing the configuration 
management system separately on thousands of nodes, as James describes it.
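
A minimal sketch of the "central uplink or local copy" read path described 
above, not anything that exists in TripleO or etcd. The class name, the 
fetch_central callable and the JSON cache file are all hypothetical; the 
only point is that a read still succeeds while the WAN link is down.

```python
import json


class StickyConfigReader:
    """Read config from a central source, falling back to a local copy."""

    def __init__(self, fetch_central, cache_path):
        # fetch_central: hypothetical callable returning a dict and raising
        # ConnectionError when the WAN uplink is unreachable.
        self.fetch_central = fetch_central
        self.cache_path = cache_path

    def read(self):
        try:
            config = self.fetch_central()
        except ConnectionError:
            # Uplink is down: serve the last copy seen at this site.
            with open(self.cache_path) as f:
                return json.load(f), "local"
        # Uplink is up: refresh the local copy for later partitions.
        with open(self.cache_path, "w") as f:
            json.dump(config, f)
        return config, "central"
```

A site that has never reached the uplink has no cached copy, so the first 
read there fails; seeding the cache at deploy time is left out of the sketch.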
It's not that etcd is incapable of providing something like this. It's 
that a *single* etcd KVS used by 40K compute nodes across a 
disaggregated control plane would not be available to all of those nodes 
simultaneously.

But you could certainly use etcd as the data store to build a sticky 
available configuration data store. If, for example, you had many local 
[1] etcd KVS that stored local data and synchronized the local data set 
with other etcd KVS endpoints when a network partition was restored, you 
could get such a system that was essentially "sticky available" for all 
intents and purposes.
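
As a rough illustration of that idea, here is a sketch of two local stores 
that accept writes independently and reconcile when the partition heals. 
It uses plain dicts rather than etcd, and the last-writer-wins rule with 
(counter, node_id) versions is an assumption picked for brevity; a real 
system would need a conflict policy suited to its data.

```python
def merge(local, remote):
    """Merge two replicas; each maps key -> (version, value).

    version is a (counter, node_id) tuple, so ties between nodes break
    deterministically. The highest version wins (last-writer-wins).
    """
    merged = dict(local)
    for key, (version, value) in remote.items():
        if key not in merged or version > merged[key][0]:
            merged[key] = (version, value)
    return merged


class LocalStore:
    """Hypothetical stand-in for one local key-value endpoint."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0
        self.data = {}

    def put(self, key, value):
        # Writes never leave the site, so they work during a partition.
        self.counter += 1
        self.data[key] = ((self.counter, self.node_id), value)

    def get(self, key):
        return self.data[key][1]

    def sync_with(self, other):
        # Called once the link between the two endpoints is back.
        merged = merge(self.data, other.data)
        self.data = dict(merged)
        other.data = dict(merged)
        # Advance both counters so later writes outrank merged ones.
        top = max((v[0][0] for v in merged.values()), default=0)
        self.counter = max(self.counter, top)
        other.counter = max(other.counter, top)
```

Both stores stay readable and writable while disconnected, and they 
converge to the same contents after sync_with().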

Come to think of it, you could do the same with a SQLite DB, à la Swift's 
replication of SQLite DBs via rsync.

But, at the risk of sounding like a broken record, at the end of the 
day, many of OpenStack's core services -- notably Nova -- were not 
designed for disaggregated control planes. They were designed for the 
datacenter, with tightly-packed compute resources and low-latency links 
for the control plane.

The entire communication bus and state management system would need to 
be redesigned from the nova-compute to the nova-conductor for (far) edge 
case clouds to be a true reality.

Instead of sending all data updates synchronously from each nova-compute 
to nova-conductor, the communication bus needs to be radically 
redesigned so that the nova-compute uses a local data store *as its 
primary data storage* and then asynchronously sends batched updates to 
known control plane endpoints when those regular network partitions 
correct themselves.
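
A sketch of that "local store first, batched updates later" pattern, using 
SQLite as the local store. This is not nova code: LocalOutbox and the 
send_batch callable are hypothetical, and send_batch is assumed to raise 
ConnectionError while the uplink is partitioned.

```python
import json
import sqlite3


class LocalOutbox:
    """Queue state updates locally; ship them in batches when possible."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )

    def record(self, update):
        # Succeeds locally whether or not the uplink is up.
        with self.db:
            self.db.execute(
                "INSERT INTO outbox (payload) VALUES (?)",
                (json.dumps(update),),
            )

    def pending(self):
        return self.db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]

    def flush(self, send_batch, batch_size=100):
        """Send queued updates oldest-first; stop at the first failure."""
        sent = 0
        while True:
            rows = self.db.execute(
                "SELECT id, payload FROM outbox ORDER BY id LIMIT ?",
                (batch_size,),
            ).fetchall()
            if not rows:
                return sent
            try:
                send_batch([json.loads(payload) for _, payload in rows])
            except ConnectionError:
                return sent  # still partitioned; try again later
            with self.db:
                # Delete only what was just sent.
                self.db.execute(
                    "DELETE FROM outbox WHERE id <= ?", (rows[-1][0],)
                )
            sent += len(rows)
```

Rows are deleted only after a batch is sent, so delivery is at-least-once 
and the receiving side would have to tolerate duplicates.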

The nova-compute manager will need to be substantially hardened to keep 
itself up and running (and writing to that local state storage) for long 
periods of time and contain all the logic to resync itself when network 
uplinks become available again.

Finally, if those local nova-computes need to actually *do* anything 
other than keep existing VMs/baremetal machines up and running, then a 
local Compute API service needs to be made available in the far edge 
sites themselves -- offering some subset of Compute API functionality to 
control the VMs in that local site. Otherwise, the whole "multiple 
department stores running an edge OpenStack site that can tolerate the 
Mother Ship being down" isn't a thing that will work.

Like I said, pretty much a complete redesign of the nova control plane...

Best,
-jay

[1] or local-ish, think POPs or even local to the compute node itself...
...
[0] https://jepsen.io/consistency
[1] https://github.com/bogdando/papers-ieee/blob/master/ICFC-2019/LaTeX/position...
On 11/28/18 11:22 PM, Dan Prince wrote:
...
On Wed, 2018-11-28 at 13:28 -0500, James Slagle wrote:
...
On Wed, Nov 28, 2018 at 12:31 PM Bogdan Dobrelya <bdobreli@redhat.com
...
wrote:
Long story short, we cannot shoot both rabbits with a single shot, not
with puppet :) Maybe we could with ansible replacing puppet fully...
So splitting config and runtime images is the only choice yet to address
the raised security concerns. And let's forget about edge cases for now.
Tossing around a pair of extra bytes over 40,000 WAN-distributed
computes ain't gonna be our biggest problem for sure.
I think it's this last point that is the crux of this discussion. We
can agree to disagree about the merits of this proposal and whether
it's a pre-optimization or micro-optimization, which I admit are
somewhat subjective terms. Ultimately, the reason the conversation
seems to be going in circles a bit is the "why": why do we need to do
this?
I'm all for reducing container image size, but the reality is that
this proposal doesn't necessarily help us with the Edge use cases we
are talking about trying to solve.
Why would we even run the exact same puppet binary + manifest
individually 40,000 times so that we can produce the exact same set of
configuration files that differ only by things such as IP address,
hostnames, and passwords? Maybe we should instead be thinking about
how we can do that *1* time centrally, and produce a configuration
that can be reused across 40,000 nodes with little effort. The
opportunity for a significant impact in terms of how we can scale
TripleO is much larger if we consider approaching these problems with
a wider net of what we could do. There's opportunity for a lot of
better reuse in TripleO; configuration is just one area. The plan and
Heat stack (within the ResourceGroup) are some other areas.
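
To make the "generate once, reuse everywhere" idea concrete, here is a
small sketch in which the shared part of a config file is produced a
single time and only the per-node values are substituted afterwards.
The template contents, the node data and the render() helper are
illustrative assumptions, not how TripleO generates configuration today.

```python
from string import Template

# Produced once, centrally. Only the $placeholders differ per node.
NOVA_CONF = Template(
    "[DEFAULT]\n"
    "my_ip = $ip\n"
    "host = $hostname\n"
    "transport_url = rabbit://nova:$rabbit_password@$controller:5672/\n"
)


def render(node):
    """Fill in the per-node values; raises KeyError if one is missing."""
    return NOVA_CONF.substitute(node)


# Hypothetical per-node data; in practice this would come from inventory.
nodes = [
    {"ip": "10.0.0.5", "hostname": "edge-0001",
     "rabbit_password": "s3cret", "controller": "10.0.0.1"},
    {"ip": "10.0.1.5", "hostname": "edge-0002",
     "rabbit_password": "s3cret", "controller": "10.0.0.1"},
]

configs = {node["hostname"]: render(node) for node in nodes}
```

The per-node step is plain string substitution, so it would not need a
configuration management tool on each of the 40,000 nodes.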
We run Puppet for configuration because that is what we did on
baremetal and we didn't break backwards compatibility for our
configuration options for upgrades. Our Puppet model relies on being
executed on each local host in order to splice in the correct IP
address and hostname. It executes in a distributed fashion, and works
fairly well considering the history of the project. It is robust,
guarantees no duplicate configs are being set, and is backwards
compatible with all the options TripleO supported on baremetal. Puppet
is arguably better for configuration than Ansible (which is what I hear
people most often suggest we replace it with). It suits our needs fine,
but it is perhaps a bit overkill considering we are only generating
config files.
I think the answer here is moving to something like Etcd. Perhaps
Not Etcd I think, see my comment above. But you're absolutely right Dan.
...
skipping over Ansible entirely as a config management tool (it is
arguably less capable than Puppet in this category anyway). Or we could
use Ansible for "legacy" services only, switch to Etcd for a majority
of the OpenStack services, and drop Puppet entirely (my favorite
option). Consolidating our technology stack would be wise.
We've already put some work and analysis into the Etcd effort. Just
need to push on it some more. Looking at the previous Kubernetes
prototypes for TripleO would be the place to start.
Config management migration is going to be tedious. It's technical debt
that needs to be handled at some point anyway. I think it is a general
TripleO improvement that could benefit all clouds, not just Edge.
Dan
...
At the same time, if some folks want to work on smaller optimizations
(such as container image size), with an approach that can be agreed
upon, then they should do so. We just ought to be careful about how we
justify those changes so that we can carefully weigh the effort vs the
payoff. In this specific case, I don't personally see this proposal
helping us with Edge use cases in a meaningful way given the scope of
the changes. That's not to say there aren't other use cases that could
justify it though (such as the security points brought up earlier).

Jay Pipes