On 11/29/18 2:00 PM, Jay Pipes wrote:
On 11/29/2018 04:28 AM, Bogdan Dobrelya wrote:
On 11/28/18 8:55 PM, Doug Hellmann wrote:
I thought the preferred solution for more complex settings was config maps. Did that approach not work out?
Regardless, now that the driver work is done, if someone wants to take another stab at etcd integration, it'll be more straightforward today.
Doug
While sharing configs is a feasible option to consider for large scale configuration management, Etcd provides only strong consistency, which is also known as "Unavailable" [0]. For edge scenarios, to configure 40,000 remote computes over WAN connections, we would instead want weaker consistency models, like "Sticky Available" [0]. That would allow services to fetch their configuration either from a central "uplink" or locally, when the former is not accessible from remote edge sites. Etcd cannot provide 40,000 local endpoints to fit that case, I'm afraid, even if those were read-only replicas. That is also something I'm highlighting in the paper [1] drafted for ICFC-2019.
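To illustrate what I mean by sticky availability for configuration fetching, here is a rough sketch (the endpoint URL, the cache path and the JSON layout are made-up assumptions, not a real interface): prefer the central uplink, but keep serving the last locally cached copy whenever the uplink is unreachable from the edge site:

    # A minimal illustration (not real OpenStack code): prefer the central
    # config endpoint, fall back to the last locally cached copy when the
    # WAN uplink to the central site is down.
    import json
    import urllib.request

    CENTRAL_URL = "https://central.example.org/config/nova-compute"  # assumed
    LOCAL_CACHE = "/var/lib/nova/config-cache.json"                  # assumed

    def fetch_config():
        try:
            with urllib.request.urlopen(CENTRAL_URL, timeout=5) as resp:
                cfg = json.load(resp)
            # Refresh the local cache while the uplink is reachable.
            with open(LOCAL_CACHE, "w") as f:
                json.dump(cfg, f)
            return cfg
        except OSError:
            # Uplink unreachable: stay "sticky" on the local copy.
            with open(LOCAL_CACHE) as f:
                return json.load(f)

The service keeps working off its local copy for as long as the partition lasts, which is exactly what a single strongly consistent Etcd cluster cannot give us at that scale.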
But if we had such a sticky-available key-value storage solution, we would indeed have solved the problem of running configuration management for thousands of nodes, as James describes it.
It's not that etcd is incapable of providing something like this. It's that a *single* etcd KVS used by 40K compute nodes across a disaggregated control plane would not be available to all of those nodes simultaneously.
But you could certainly use etcd as the data store to build a sticky available configuration data store. If, for example, you had many local [1] etcd KVS that stored local data and synchronized the local data set with other etcd KVS endpoints when a network partition was restored, you could get such a system that was essentially "sticky available" for all intents and purposes.
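A rough sketch of that idea, assuming the python-etcd3 client library and purely hypothetical endpoints: writes always land on the local etcd member first, and a journal of not-yet-replicated keys gets flushed to the central/peer etcd whenever the WAN link happens to be up:

    # Rough sketch, assuming the python-etcd3 client; the endpoint names
    # are hypothetical. Writes go to the local etcd first, and a journal
    # of pending keys is pushed to the peer etcd when the link is back.
    import etcd3

    local = etcd3.client(host="127.0.0.1", port=2379)
    peer = etcd3.client(host="central.example.org", port=2379)  # assumed

    pending = set()  # keys written locally but not yet replicated

    def put(key, value):
        local.put(key, value)
        pending.add(key)
        try_flush()

    def try_flush():
        try:
            for key in list(pending):
                value, _meta = local.get(key)
                peer.put(key, value)
                pending.discard(key)
        except Exception:
            # Partitioned from the peer: keep the journal and retry later.
            pass

Reads and writes stay local and available; only the replication step depends on the uplink.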
Come to think of it, you could do the same with a SQLite DB, a la Swift's replication of SQLite DBs via rsync.
But, at the risk of sounding like a broken record, at the end of the day, many of OpenStack's core services -- notably Nova -- were not designed for disaggregated control planes. They were designed for the datacenter, with tightly-packed compute resources and low-latency links for the control plane.
The entire communication bus and state management system, from nova-compute to nova-conductor, would need to be redesigned for (far) edge clouds to become a true reality.
Instead of sending all data updates synchronously from each nova-compute to nova-conductor, the communication bus needs to be radically redesigned so that the nova-compute uses a local data store *as its primary data storage* and then asynchronously sends batched updates to known control plane endpoints when those regular network partitions correct themselves.
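In rough pseudo-Python terms (none of these names exist in Nova today; this is only an illustration of the flow): every state change commits to a local store first, and a periodic task drains the accumulated batch to the conductor only when the uplink allows it:

    # Illustrative only; none of these names are real Nova interfaces.
    # State changes commit to a local SQLite journal first (the primary
    # store), then drain in batches to the conductor when the WAN link
    # allows it.
    import sqlite3

    db = sqlite3.connect("/var/lib/nova/local-state.db")
    db.execute("CREATE TABLE IF NOT EXISTS updates "
               "(id INTEGER PRIMARY KEY, payload TEXT, synced INTEGER DEFAULT 0)")

    def record_update(payload):
        # The local commit is the source of truth; no conductor round-trip.
        db.execute("INSERT INTO updates (payload) VALUES (?)", (payload,))
        db.commit()

    def flush_to_conductor(send_batch):
        # send_batch stands in for whatever RPC the redesigned conductor
        # would expose (hypothetical). A failure simply leaves the rows
        # unsynced until the next periodic attempt.
        rows = db.execute(
            "SELECT id, payload FROM updates WHERE synced = 0").fetchall()
        if not rows:
            return
        try:
            send_batch([payload for _id, payload in rows])
        except Exception:
            return  # still partitioned; retry on the next tick
        db.executemany("UPDATE updates SET synced = 1 WHERE id = ?",
                       [(row_id,) for row_id, _ in rows])
        db.commit()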
The nova-compute manager will need to be substantially hardened to keep itself up and running (and writing to that local state storage) for long periods of time, and to contain all the logic needed to resync itself when network uplinks become available again.
Finally, if those local nova-computes need to actually *do* anything other than keep existing VMs/baremetal machines up and running, then a local Compute API service needs to be made available in the far edge sites themselves -- offering some subset of Compute API functionality to control the VMs in that local site. Otherwise, the whole "multiple department stores running an edge OpenStack site that can tolerate the Mother Ship being down" isn't a thing that will work.
Like I said, pretty much a complete redesign of the nova control plane...
We drifted a little bit off topic... but all of that is valid for the post-MVP Edge architecture phases [0] targeted at multiple (aka disaggregated/autonomous/local vs central) control planes, indeed. Although there are more options than that complete redesign. IIUC, does the latter assume supporting alternatives to SQL/AMQP-ish data/messaging backends for Nova and OpenStack in general? That is only one option (see examples of such backends [1][2]), though I love it the most :)

Other options may be creating client libraries that act on top of the APIs or the existing DB/MQ backends and perform low-level data synchronization, or act as API re-translators, over multiple control planes. And AFAICT that would *not* require a complete redesign of the supported backends nor of the types of transactions in Nova et al. And for MQ, a brokerless qdr or something like it (there was a nice presentation at the summit)...

But in the end, indeed, it is kinda proven in multiple R&D papers, like [3][4], that only causal, sticky-consistent synchronization with advanced conflict resolution [5] is the best Edge-y/Fog-y choice, both for such client libraries and for causal-consistent DB/KVS/MQ backends. I think that is similar to what you (Jay) described for multiple Etcd clusters exchanging their data? So for that example, such client libraries should maintain sticky sessions to groups of those Etcd clusters and replicate data around in the best causal-consistent way (see the toy sketch after the references below).

PS. That nice SQLite & rsync combo would not give us the best of the eventual consistency world; no, it would rather be something of a "Total Available" [6] thing, at the lowest of it, like Read Uncommitted or Monotonic Writes, and would be a very (very) poor choice IMO.

[0] https://wiki.openstack.org/w/index.php?title=OpenStack_Edge_Discussions_Dubl...
[1] https://www.ronpub.com/OJDB_2015v2i1n02_Elbushra.pdf
[2] http://rainbowfs.lip6.fr/data/RainbowFS-2016-04-12.pdf
[3] https://www.cs.cmu.edu/~dga/papers/cops-sosp2011.pdf
[4] http://www.cs.cornell.edu/lorenzo/papers/cac-tr.pdf
[5] https://ieeexplore.ieee.org/document/8104644
[6] https://jepsen.io/consistency
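To make the "causal, sticky-consistent with conflict resolution" part a bit more concrete, here is a toy vector-clock sketch (nothing here maps to a real Etcd or OpenStack API; all names are made up). Each site bumps its own clock entry on a local write; when two sites sync, causally ordered updates win outright, while concurrent ones are handed to an application-level conflict resolver:

    # Toy illustration of causal ordering with vector clocks; not a real
    # Etcd or OpenStack interface. Each site bumps its own entry on a
    # local write and merges clocks on sync; concurrent writes are
    # detected and handed to an application-level conflict resolver.
    def bump(clock, site):
        clock = dict(clock)
        clock[site] = clock.get(site, 0) + 1
        return clock

    def happened_before(a, b):
        return all(a.get(k, 0) <= b.get(k, 0) for k in a) and a != b

    def merge(local, remote, resolve):
        l_clock, l_val = local
        r_clock, r_val = remote
        if happened_before(l_clock, r_clock):
            return remote                      # remote is causally newer
        if happened_before(r_clock, l_clock):
            return local                       # local is causally newer
        # Concurrent updates: merge the clocks, let the app resolve values.
        merged = {k: max(l_clock.get(k, 0), r_clock.get(k, 0))
                  for k in set(l_clock) | set(r_clock)}
        return merged, resolve(l_val, r_val)

The point is only that the resolver, not the storage layer, decides what a "merge" of two concurrent edge-site updates means.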
Best, -jay
[1] or local-ish, think POPs or even local to the compute node itself...
[0] https://jepsen.io/consistency
[1] https://github.com/bogdando/papers-ieee/blob/master/ICFC-2019/LaTeX/position...
-- Best regards, Bogdan Dobrelya, Irc #bogdando