[openstack-dev] [TripleO]Addressing Edge/Multi-site/Multi-cloud deployment use cases (new squad)
Dmitry Tantsur
dtantsur at redhat.com
Mon Aug 27 09:03:19 UTC 2018
Hi,
Some additions inline.
On 08/20/2018 10:47 PM, James Slagle wrote:
> As we start looking at how TripleO will address next generation deployment
> needs such as Edge, multi-site, and multi-cloud, I'd like to kick off a
> discussion around how TripleO can evolve and adapt to meet these new
> challenges.
>
> What are these challenges? I think the OpenStack Edge Whitepaper does a good
> job summarizing some of them:
>
> https://www.openstack.org/assets/edge/OpenStack-EdgeWhitepaper-v3-online.pdf
>
> They include:
>
> - management of distributed infrastructure
> - massive scale (thousands instead of hundreds)
> - limited network connectivity
> - isolation of distributed sites
> - orchestration of federated services across multiple sites
>
> We already have a lot of ongoing work that directly or indirectly starts to
> address some of these challenges. That work includes things like
> config-download, split-controlplane, metalsmith integration, validations,
> all-in-one, and standalone.
>
> I laid out some initial ideas in a previous message:
>
> http://lists.openstack.org/pipermail/openstack-dev/2018-July/132398.html
>
> I'll be reviewing some of that here and going into a bit more detail.
>
> These are some of the high level ideas I'd like to see TripleO start to
> address:
>
> - More separation between planning and deploying (likely to be further defined
> in spec discussion). We've had these concepts for a while, but we need to do
> a better job of surfacing them to users as deployments grow in size and
> complexity.
>
> With config-download, we can more easily separate the phases of rendering,
> downloading, validating, and applying the configuration. As we scale up to
> managing many deployments, we should take advantage of what each of those
> phases offers.
>
> The separation also makes the deployment more portable, since it lets us
> eliminate any restrictions that force the undercloud to be the control node
> applying the configuration.
>
> - Management of multiple deployments from a single undercloud. This is of
> course already possible today, but we need better docs, more polish, and more
> testing to flush out any bugs.
>
> - Plan and template management in git.
>
> This could be an iterative step towards eliminating Swift in the undercloud.
> Swift seemed like a natural choice at the time because it was an existing
> OpenStack service. However, I think git would do a better job at tracking
> history and comparing changes and is much more lightweight than Swift. We've
> been managing the config-download directory as a git repo, and I like this
> direction. For now, we are just putting the whole git repo in Swift, but I
> wonder if it makes sense to consider eliminating Swift entirely. We need to
> consider the scale of managing thousands of plans for separate edge
> deployments.
>
> I also think this would be a step towards undercloud simplification.
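To make the git idea a bit more concrete, here is a rough sketch of what a
git-backed plan store could look like (purely illustrative, not an existing
interface; it assumes the rendered plan already lives in plan_dir):

# Sketch: track a rendered deployment plan in a local git repo instead of Swift.
# Nothing here is an existing TripleO interface; it only shows the kind of
# history/diff operations git gives us for free.
import subprocess

def _git(plan_dir, *args):
    return subprocess.run(['git', '-C', plan_dir] + list(args),
                          check=True, capture_output=True, text=True).stdout

def commit_plan(plan_dir, message):
    """Record the current state of the plan as a new version."""
    _git(plan_dir, 'init')                 # no-op if the repo already exists
    _git(plan_dir, 'add', '--all')
    _git(plan_dir, 'commit', '--allow-empty', '-m', message)

def plan_history(plan_dir):
    """List plan versions, newest first."""
    return _git(plan_dir, 'log', '--oneline')

def diff_last_change(plan_dir):
    """Show what changed in the most recent update (needs two versions)."""
    return _git(plan_dir, 'diff', 'HEAD~1', 'HEAD')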
>
> - Orchestration between plans. I think there's general agreement around scaling
> up the undercloud to be more effective at managing and deploying multiple
> plans.
>
> The plans could be different OpenStack deployments potentially sharing some
> resources. Or, they could be deployments of different software stacks
> (Kubernetes/OpenShift, Ceph, etc).
>
> We'll need to develop some common interfaces for some basic orchestration
> between plans. It could include dependencies, ordering, and sharing parameter
> data (such as passwords or connection info). There is already some ongoing
> discussion about some of this work:
>
> http://lists.openstack.org/pipermail/openstack-dev/2018-August/133247.html
>
> I suspect this would start out as collecting specific use cases and then
> figuring out the right generic interfaces.
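To illustrate the kind of data such an interface would need to model (names
and structure are made up, this is only a sketch):

# Sketch: a minimal model of inter-plan orchestration -- dependencies, ordering
# and shared parameters. No cycle detection, purely illustrative.
from dataclasses import dataclass, field

@dataclass
class Plan:
    name: str
    depends_on: list = field(default_factory=list)   # plans that must deploy first
    exports: dict = field(default_factory=dict)      # e.g. passwords, endpoints

def deployment_order(plans):
    """Return plan names in an order that respects depends_on (simple DFS)."""
    by_name = {p.name: p for p in plans}
    ordered, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in by_name[name].depends_on:
            visit(dep)
        ordered.append(name)

    for plan in plans:
        visit(plan.name)
    return ordered

# Example: a central control plane exporting connection info to an edge plan.
central = Plan('central', exports={'keystone_url': 'https://central.example.com:5000'})
edge = Plan('edge-1', depends_on=['central'])
print(deployment_order([central, edge]))   # ['central', 'edge-1']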
>
> - Multiple deployments of a single plan. This could be useful for doing many
> deployments that are all the same. Of course some information will differ,
> such as network IPs, hostnames, and node-specific details. We could have
> generic input interfaces for those sorts of things without having to create
> new Heat stacks, which would allow re-using the same plan/stack for multiple
> deployments. When scaling to hundreds/thousands of edge deployments, this
> could be really effective at side-stepping the management of
> hundreds/thousands of Heat stacks.
>
> We may also need further separation between a plan and its deployment state
> to achieve this modularity.
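As an illustration of what a generic input interface could look like (the
parameter names below are just examples, not a proposed schema):

# Sketch: one plan, many deployments. The plan's parameters stay constant; each
# deployment only supplies the values that differ (IPs, hostnames, node details).
import copy

PLAN_DEFAULTS = {
    'NtpServer': 'pool.ntp.org',
    'NeutronNetworkType': 'vxlan',
}

def deployment_inputs(site_overrides):
    """Merge per-site values over the shared plan defaults."""
    params = copy.deepcopy(PLAN_DEFAULTS)
    params.update(site_overrides)
    return params

edge_sites = {
    'edge-1': {'ControlPlaneIp': '192.0.2.10', 'CloudName': 'edge-1.example.com'},
    'edge-2': {'ControlPlaneIp': '198.51.100.10', 'CloudName': 'edge-2.example.com'},
}

for site, overrides in edge_sites.items():
    print(site, deployment_inputs(overrides))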
>
> - Distributed management/application of configuration. Even though the
> configuration is portable (config-download), we may still want some
> automation around applying the deployment when not using the undercloud as a
> control node. I think things like ansible-runner or Ansible AWX could help
> here, or perhaps mistral-executor agents, or "mistral as a library". This
> would also make our workflows more portable.
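For instance, applying an already-downloaded config from an arbitrary control
node with ansible-runner might look roughly like this (a sketch: the playbook
name is what config-download renders today, but the directory layout is an
assumption, since ansible-runner expects playbooks under
<private_data_dir>/project/):

# Sketch: apply a config-download'ed deployment from any control node using
# ansible-runner as a library instead of invoking ansible-playbook by hand.
# Assumes the downloaded config has been placed under
# /var/lib/overcloud-config/project/.
import ansible_runner

result = ansible_runner.run(
    private_data_dir='/var/lib/overcloud-config',
    playbook='deploy_steps_playbook.yaml',
    inventory='/var/lib/overcloud-config/project/inventory.yaml',
)
print(result.status, result.rc)   # e.g. 'successful', 0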
>
> - New documentation highlighting some or all of the above features and how to
> take advantage of it for new use cases (thousands of edge deployments, etc).
> I see this as a sort of "TripleO Edge Deployment Guide" that would highlight
> how to take advantage of TripleO for Edge/multi-site use cases.
I would also like to consider a distributed undercloud. For example, we have a
central management node at Location0 where it all starts. Then we have more
management nodes at Location1 and Location2. We deploy the undercloud on all three:
1. The one at Location0 is a typical undercloud.
2. The two at Location{1,2} contain ironic-api, ironic-conductor,
ironic-inspector and neutron-dhcp-agent. The conductors have their
conductor_group [*] set to Location1 and Location2 respectively. The conductor in
Location0 is left with the default (empty string).
Then we can deploy at each location. We enroll nodes in ironic using the
conductor_group matching their location, so the TFTP, iPXE, DHCP and
IPMI/Redfish traffic stays contained within that location. I think the routed
ctlplane feature from Queens will let us get the rest of the networking right.
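For example, enrolling a node into its location's conductor group could look
roughly like this with openstacksdk (just a sketch; driver details and the
exact microversion handling are glossed over, and all values are placeholders):

# Sketch: enroll an edge node so that only the conductors at its location
# manage it. Assumes a bare metal API new enough to expose conductor_group
# (Rocky, API 1.46+); driver and address details are placeholders.
import openstack

conn = openstack.connect(cloud='undercloud')
node = conn.baremetal.create_node(
    name='edge1-compute-0',
    driver='ipmi',
    conductor_group='Location1',        # matches the conductors at Location1
    driver_info={'ipmi_address': '192.0.2.50',
                 'ipmi_username': 'admin',
                 'ipmi_password': 'secret'},
)
print(node.id, node.conductor_group)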
With the metalsmith switch (if we ever move forward with it, wink-wink) we will
not have the problem of explaining the notion of locations to Nova. We can just
extend metalsmith to understand conductor_group as a valid scheduling hint.
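Hypothetically, something like this (the conductor_group argument is exactly
the extension being proposed; it does not exist in metalsmith today):

# Hypothetical sketch: scheduling by location if metalsmith learned to treat
# conductor_group as a hint. The rest follows the current metalsmith API, but
# treat the details as assumptions.
import openstack
from metalsmith import Provisioner

provisioner = Provisioner(
    cloud_region=openstack.config.get_cloud_region(cloud='undercloud'))

node = provisioner.reserve_node(
    resource_class='baremetal',
    conductor_group='Location1',    # proposed: only consider nodes at Location1
)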
Any thoughts?
[*] Introduced in the Rocky cycle, the conductor group feature allows defining
affinity between nodes and ironic-conductor instances, so that nodes with a
conductor_group set are only managed by conductors with the same conductor_group.
>
> Obviously all the ideas are a lot of work, and not something I think we'll
> complete in a single cycle.
>
> I'd like to pull a squad together focused on Edge/multi-site/multi-cloud and
> TripleO. On that note, this squad could also work together with other
> deployment projects that are looking at similar use cases and look to
> collaborate.
>
> If you're interested in working on this squad, I'd see our first tasks as
> being:
>
> - Brainstorming additional ideas to the above
> - Breaking down ideas into actionable specs/blueprints for Stein (and possibly
> future releases).
> - Coming up with a consistent message around direction and vision for solving
> these deployment challenges.
> - Bringing together ongoing work that relates to these use cases, so that
> we're all collaborating with a shared vision and purpose and can help
> prioritize reviews/CI/etc.
> - Identifying any discussion items we need to work through in person at the
> upcoming Denver PTG.
Count me in (modulo the PTG).
Dmitry
>
> I'm happy to help facilitate the squad. If you have any feedback on these ideas
> or would like to join the squad, reply to the thread or sign up in the
> etherpad:
>
> https://etherpad.openstack.org/p/tripleo-edge-squad-status
>
> I'm just referring to the squad as "Edge" for now, but we can also pick a
> cooler owl themed name :).
>