[TripleO] Scaling node counts with only Ansible (N=1)
There's been a fair amount of recent work around simplifying our Heat templates and migrating the software configuration part of our deployment entirely to Ansible. As part of this effort, it became apparent that we could render much of the data that we need out of Heat in a way that is generic per node, and then have Ansible render the node-specific data during config-download runtime.

To illustrate the point, consider that when we specify ComputeCount: 10 in our templates, much of the work that Heat does across those 10 sets of resources for each Compute node is duplication. However, it's been necessary so that Heat can render data structures such as lists of IPs, lists of hostnames, contents of /etc/hosts files, etc. If all of that were driven by Ansible using host facts, then Heat wouldn't need those 10 sets of resources to begin with.

The goal is to get to a point where we can deploy the Heat stack with a count of 1 for each role, and then deploy any number of nodes per role using Ansible. To that end, I've been referring to this effort as N=1.

The value in this work is that it directly addresses our scaling issues with Heat (by deploying a much smaller stack). Obviously we'd still be relying heavily on Ansible to scale to the required levels, but I feel that is a much better understood challenge at this point in the evolution of configuration tools.

With the patches that we've been working on recently, I've got a POC running where I can deploy additional compute nodes with just Ansible. This is done by adding the additional nodes to the Ansible inventory with a small set of facts that include IP addresses on each enabled network and a hostname.

These patches are at https://review.opendev.org/#/q/topic:bp/reduce-deployment-resources and reviews/feedback are welcome.

Other points:

- Baremetal provisioning and port creation are presently handled by Heat. With the ongoing efforts to migrate baremetal provisioning out of Heat (nova-less deploy), I think these efforts are very complementary. Eventually, we get to a point where Heat is not actually creating any other OpenStack API resources. For now, the patches only work when using pre-provisioned nodes.

- We need to consider how we'd manage the Ansible inventory going forward if we open up an interface for operators to manipulate it directly. That's something we'd want to manage and preserve (version control), as it's critical data for the deployment.

Given the progress that we've made with the POC, my sense is that we'll keep pushing in this overall direction. I'd like to get some feedback on the approach. We have an etherpad we are using to track some of the work at a high level:

https://etherpad.openstack.org/p/tripleo-reduce-deployment-resources

I'll be adding some notes on how I set up the POC to that etherpad if others would like to try it out.

--
James Slagle
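For readers who want to picture what the POC does, here is a rough sketch of what such an inventory entry could look like in Ansible's YAML inventory format. The group name, hostname, variable names, and addresses below are illustrative assumptions, not the exact config-download schema:

    # Hypothetical scale-out entry: one new compute node with a hostname
    # and an IP on each enabled network (variable names are assumptions).
    Compute:
      hosts:
        overcloud-novacompute-3:
          ansible_host: 192.168.24.23
          ctlplane_ip: 192.168.24.23
          internal_api_ip: 172.16.2.23
          storage_ip: 172.16.1.23
          tenant_ip: 172.16.0.23
          canonical_hostname: overcloud-novacompute-3.example.com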
On Wed, Jul 10, 2019 at 4:24 PM James Slagle <james.slagle@gmail.com> wrote:
There's been a fair amount of recent work around simplifying our Heat templates and migrating the software configuration part of our deployment entirely to Ansible.
As part of this effort, it became apparent that we could render much of the data that we need out of Heat in a way that is generic per node, and then have Ansible render the node specific data during config-download runtime.
To illustrate the point, consider that when we specify ComputeCount: 10 in our templates, much of the work that Heat does across those 10 sets of resources for each Compute node is duplication. However, it's been necessary so that Heat can render data structures such as lists of IPs, lists of hostnames, contents of /etc/hosts files, etc. If all of that were driven by Ansible using host facts, then Heat wouldn't need those 10 sets of resources to begin with.
The goal is to get to a point where we can deploy the Heat stack with a count of 1 for each role, and then deploy any number of nodes per role using Ansible. To that end, I've been referring to this effort as N=1.
The value in this work is that it directly addresses our scaling issues with Heat (by deploying a much smaller stack). Obviously we'd still be relying heavily on Ansible to scale to the required levels, but I feel that is a much better understood challenge at this point in the evolution of configuration tools.
With the patches that we've been working on recently, I've got a POC running where I can deploy additional compute nodes with just Ansible. This is done by adding the additional nodes to the Ansible inventory with a small set of facts that include IP addresses on each enabled network and a hostname.
These patches are at https://review.opendev.org/#/q/topic:bp/reduce-deployment-resources and reviews/feedback are welcome.
This is a fabulous proposal in my opinion. I've added (and will continue to add) TODO ideas in the etherpad. Anyone willing to help, please ping us if needed.

Another point, somewhat related: I took the opportunity of this work to reduce the complexity around the number of hieradata files. I would like to investigate whether we can generate one data file which would be loaded by both Puppet and Ansible for doing the configuration management. I'll create a separate thread on that effort very soon.
Other points:
- Baremetal provisioning and port creation are presently handled by Heat. With the ongoing efforts to migrate baremetal provisioning out of Heat (nova-less deploy), I think these efforts are very complementary. Eventually, we get to a point where Heat is not actually creating any other OpenStack API resources. For now, the patches only work when using pre-provisioned nodes.
- We need to consider how we'd manage the Ansible inventory going forward if we open up an interface for operators to manipulate it directly. That's something we'd want to manage and preserve (version control) as it's critical data for the deployment.
Given the progress that we've made with the POC, my sense is that we'll keep pushing in this overall direction. I'd like to get some feedback on the approach. We have an etherpad we are using to track some of the work at a high level:
https://etherpad.openstack.org/p/tripleo-reduce-deployment-resources
I'll be adding some notes on how I set up the POC to that etherpad if others would like to try it out.
--
James Slagle
-- Emilien Macchi
Hi James, On Wed, Jul 10, 2019 at 4:20 PM James Slagle <james.slagle@gmail.com> wrote:
There's been a fair amount of recent work around simplifying our Heat templates and migrating the software configuration part of our deployment entirely to Ansible.
As part of this effort, it became apparent that we could render much of the data that we need out of Heat in a way that is generic per node, and then have Ansible render the node specific data during config-download runtime.
I find this endeavour very exciting. Do you have any early indications of performance gains that you can share? Cheers, David
On Fri, Jul 12, 2019 at 9:46 AM David Peacock <dpeacock@redhat.com> wrote:
Hi James,
On Wed, Jul 10, 2019 at 4:20 PM James Slagle <james.slagle@gmail.com> wrote:
There's been a fair amount of recent work around simplifying our Heat templates and migrating the software configuration part of our deployment entirely to Ansible.
As part of this effort, it became apparent that we could render much of the data that we need out of Heat in a way that is generic per node, and then have Ansible render the node specific data during config-download runtime.
I find this endeavour very exciting. Do you have any early indications of performance gains that you can share?
No hard numbers yet, but I can say that I can get to the Ansible stage of the deployment with any number of nodes with an undercloud that just meets the minimum requirements. This is significant because previously we could not get to this stage without first deploying a huge Heat stack, which required a lot of physical resources, tuning, tweaking, or going the undercloud minion route.

Also, it's less about performance and more about scale. Certainly the Heat stack operation will be much faster as the number of nodes in the deployment increases. The stack operation time will in fact be constant in relation to the number of nodes in the deployment. It will depend on the number of *roles*, but typically those are ~5 or fewer per deployment, and the most I've seen is 12.

The total work done by Ansible does increase as we move more logic into roles and tasks. However, I expect the total Ansible run time to be roughly equivalent to what we have today, since the sum of all that Ansible applies is roughly equal.

In terms of scale however, it allows us to move beyond the ~300 node limit we're at today. And it keeps the Heat performance constant, as opposed to increasing with the node count.

--
James Slagle
On Wed, 2019-07-10 at 16:17 -0400, James Slagle wrote:
There's been a fair amount of recent work around simplifying our Heat templates and migrating the software configuration part of our deployment entirely to Ansible.
As part of this effort, it became apparent that we could render much of the data that we need out of Heat in a way that is generic per node, and then have Ansible render the node specific data during config-download runtime.
To illustrate the point, consider that when we specify ComputeCount: 10 in our templates, much of the work that Heat does across those 10 sets of resources for each Compute node is duplication. However, it's been necessary so that Heat can render data structures such as lists of IPs, lists of hostnames, contents of /etc/hosts files, etc. If all of that were driven by Ansible using host facts, then Heat wouldn't need those 10 sets of resources to begin with.
The goal is to get to a point where we can deploy the Heat stack with a count of 1 for each role, and then deploy any number of nodes per role using Ansible. To that end, I've been referring to this effort as N=1.
The value in this work is that it directly addresses our scaling issues with Heat (by deploying a much smaller stack). Obviously we'd still be relying heavily on Ansible to scale to the required levels, but I feel that is a much better understood challenge at this point in the evolution of configuration tools.
With the patches that we've been working on recently, I've got a POC running where I can deploy additional compute nodes with just Ansible. This is done by adding the additional nodes to the Ansible inventory with a small set of facts that include IP addresses on each enabled network and a hostname.
These patches are at https://review.opendev.org/#/q/topic:bp/reduce-deployment-resources and reviews/feedback are welcome.
Other points:
- Baremetal provisioning and port creation are presently handled by Heat. With the ongoing efforts to migrate baremetal provisioning out of Heat (nova-less deploy), I think these efforts are very complementary. Eventually, we get to a point where Heat is not actually creating any other OpenStack API resources. For now, the patches only work when using pre-provisioned nodes.
I've said this before, but I think we should turn this nova-less around. Now with nova-less we create a bunch of servers, and write up the parameters file to use the deployed-server approach. Effectively we still need to have the resource group in heat making a server resource for every server. Creating the fake server resource is fast, because Heat doesn't call Nova/Ironic to create any resources. But the stack is equally big, with a stack for every node, i.e. not N=1.

What you are doing here is essentially to say we don't create a resource group that then creates N number of role stacks, one for each overcloud node. You are creating a single generic "server" definition per Role. So we drop the resource group and create OS::Triple::{{Role}}.Server one time (once). To me it's backwards to push a large struct with properties for N=many nodes into the creation of that stack.

Currently the puppet/role-role.yaml creates all the network ports etc. As you only want to create it once, it could instead simply output the UUIDs of the networks+subnets. These are identical for all servers in the role. So we end up with a small heat stack (see the sketch below).

Once the stack is created we could use that generic "server" role data to feed into something (ansible? python? mistral?) that calls metalsmith to build the servers, then create ports for each server in neutron, one port for each network+subnet defined in the role. Then feed that output into the json (hieradata) that is pushed to each node and used during service configuration: all the things we need to configure network interfaces, /etc/hosts and so on. We need a way to keep track of which ports belong to which node, but I guess something simple like using the node's ironic UUID in either the name, description or tag field of the neutron port will work. There is also the extra field in Ironic, which is json type, so we could place a map of network->port_uuid in there as well.

Another idea I've been pondering is if we put credentials on the overcloud nodes so that the node itself could make the call to neutron on the undercloud to create ports in neutron. I.e. we just push the UUID of the correct network and subnet where the resource should be created, and let the overcloud node do the create. The problem with this is that we wouldn't have a way to build the /etc/hosts and probably other things that include ips etc. for all the nodes. Maybe if all the nodes were part of an etcd cluster, and pushed their data there?

I think the creation of the actual Networks and Subnets can be left in heat; it's typically 5-6 networks and 5-6 subnets so it's not a lot of resources. Even in a large DCN deployment having 50-100 subnets per network, or even 50-100 networks, I think this isn't a problem.
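A minimal sketch of what such a per-role output could look like, assuming the networks/subnets are available as resources in the same template; the resource and output names here are made up for illustration:

    # Sketch: expose the role's network/subnet UUIDs once, instead of
    # creating per-node port resources. Names are illustrative only.
    outputs:
      role_network_data:
        description: Networks and subnets shared by every server in this role
        value:
          internal_api:
            network: {get_resource: InternalApiNetwork}
            subnet: {get_resource: InternalApiSubnet}
          storage:
            network: {get_resource: StorageNetwork}
            subnet: {get_resource: StorageSubnet}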
- We need to consider how we'd manage the Ansible inventory going forward if we open up an interface for operators to manipulate it directly. That's something we'd want to manage and preserve (version control) as it's critical data for the deployment.
Given the progress that we've made with the POC, my sense is that we'll keep pushing in this overall direction. I'd like to get some feedback on the approach. We have an etherpad we are using to track some of the work at a high level:
https://etherpad.openstack.org/p/tripleo-reduce-deployment-resources
I'll be adding some notes on how I set up the POC to that etherpad if others would like to try it out.
On Fri, Jul 12, 2019 at 3:59 PM Harald Jensås <hjensas@redhat.com> wrote:
I've said this before, but I think we should turn this nova-less around. Now with nova-less we create a bunch of servers, and write up the parameters file to use the deployed-server approach. Effectively we still need to have the resource group in heat making a server resource for every server. Creating the fake server resource is fast, because Heat doesn't call Nova/Ironic to create any resources. But the stack is equally big, with a stack for every node, i.e. not N=1.
What you are doing here is essentially to say we don't create a resource group that then creates N number of role stacks, one for each overcloud node. You are creating a single generic "server" definition per Role. So we drop the resource group and create OS::Triple::{{Role}}.Server one time (once). To me it's backwards to push a large struct with properties for N=many nodes into the creation of that stack.
I'm not entirely following what you're saying is backwards. What I've proposed is that we *don't* have any node specific data in the stack. It sounds like you're saying the way we do it today is backwards. It's correct that what's been proposed with metalsmith currently still requires the full ResourceGroup with a member for each node. With the template changes I'm proposing, that wouldn't be required, so we could actually do the Heat stack first, then metalsmith.
Currently the puppet/role-role.yaml creates all the network ports etc. As you only want to create it once, it instead could simply output the UUID of the networks+subnets. These are identical for all servers in the role. So we end up with a small heat stack.
Once the stack is created we could use that generic "server" role data to feed into something (ansible? python? mistral?) that calls metalsmith to build the servers, then create ports for each server in neutron, one port for each network+subnet defined in the role. Then feed that output into the json (hieradata) that is pushed to each node and used during service configuration: all the things we need to configure network interfaces, /etc/hosts and so on. We need a way to keep track of which ports belong to which node, but I guess something simple like using the node's ironic UUID in either the name, description or tag field of the neutron port will work. There is also the extra field in Ironic, which is json type, so we could place a map of network->port_uuid in there as well.
It won't matter whether we do baremetal provisioning before or after the Heat stack. Heat won't care, as it won't have any expectation to create any servers or that they are already created. We can define where we end up calling the metalsmith piece as it should be independent of the Heat stack if we make these template changes.
Another idea I've been pondering is if we put credentials on the overcloud nodes so that the node itself could make the call to neutron on the undercloud to create ports in neutron. I.e. we just push the UUID of the correct network and subnet where the resource should be created, and let the overcloud node do the create. The problem with this is that
I don't think it would be a good idea to have undercloud credentials on the overcloud nodes.
I think the creation of the actual Networks and Subnets can be left in heat; it's typically 5-6 networks and 5-6 subnets so it's not a lot of resources. Even in a large DCN deployment having 50-100 subnets per network, or even 50-100 networks, I think this isn't a problem.
Agreed, I'm not specifically proposing we move those pieces at this time.

--
James Slagle
On Sat, 2019-07-13 at 16:19 -0400, James Slagle wrote:
On Fri, Jul 12, 2019 at 3:59 PM Harald Jensås <hjensas@redhat.com> wrote:
I've said this before, but I think we should turn this nova-less around. Now with nova-less we create a bunch of servers, and write up the parameters file to use the deployed-server approach. Effectively we still need to have the resource group in heat making a server resource for every server. Creating the fake server resource is fast, because Heat doesn't call Nova/Ironic to create any resources. But the stack is equally big, with a stack for every node, i.e. not N=1.
What you are doing here is essentially to say we don't create a resource group that then creates N number of role stacks, one for each overcloud node. You are creating a single generic "server" definition per Role. So we drop the resource group and create OS::Triple::{{Role}}.Server one time (once). To me it's backwards to push a large struct with properties for N=many nodes into the creation of that stack.
I'm not entirely following what you're saying is backwards. What I've proposed is that we *don't* have any node specific data in the stack. It sounds like you're saying the way we do it today is backwards.
What I mean to say is that the way we are integrating nova-less today (first deploying the servers, then providing the data to Heat to create the resource groups) becomes backwards once your N=1 work is introduced.
It's correct that what's been proposed with metalsmith currently still requires the full ResourceGroup with a member for each node. With the template changes I'm proposing, that wouldn't be required, so we could actually do the Heat stack first, then metalsmith.
Yes, this is what I think we should do. Especially if your changes here remove the resource group entirely. It makes more sense to create the stack, and once that is created we can do deployment, scaling etc. without updating the stack again.
Currently the puppet/role-role.yaml creates all the network ports etc. As you only want to create it once, it instead could simply output the UUID of the networks+subnets. These are identical for all servers in the role. So we end up with a small heat stack.
Once the stack is created we could use that generic "server" role data to feed into something (ansible? python? mistral?) that calls metalsmith to build the servers, then create ports for each server in neutron, one port for each network+subnet defined in the role. Then feed that output into the json (hieradata) that is pushed to each node and used during service configuration: all the things we need to configure network interfaces, /etc/hosts and so on. We need a way to keep track of which ports belong to which node, but I guess something simple like using the node's ironic UUID in either the name, description or tag field of the neutron port will work. There is also the extra field in Ironic, which is json type, so we could place a map of network->port_uuid in there as well.
It won't matter whether we do baremetal provisioning before or after the Heat stack. Heat won't care, as it won't have any expectation to create any servers or that they are already created. We can define where we end up calling the metalsmith piece as it should be independent of the Heat stack if we make these template changes.
This is true. But in your previous mail in this thread you wrote:

"""
Other points:

- Baremetal provisioning and port creation are presently handled by Heat. With the ongoing efforts to migrate baremetal provisioning out of Heat (nova-less deploy), I think these efforts are very complementary. Eventually, we get to a point where Heat is not actually creating any other OpenStack API resources. For now, the patches only work when using pre-provisioned nodes.
"""

IMO "baremetal provision and port creation" fit together. (I read the above statement that way as well.) Currently nova-less creates the ctlplane port and provisions the baremetal node. If we want to do both baremetal provisioning and port creation together (I think this makes sense), we have to do it after the stack has created the networks.

What I envision is to have one method that creates all the ports, ctlplane + composable networks, in a unified way. Today these are created differently: the ctlplane port is part of the server resource (or metalsmith in the nova-less case) and the other ports are created by heat.
I think the creation of the actual Networks and Subnets can be left in heat; it's typically 5-6 networks and 5-6 subnets so it's not a lot of resources. Even in a large DCN deployment having 50-100 subnets per network, or even 50-100 networks, I think this isn't a problem.
Agreed, I'm not specifically proposing we move those pieces at this time.
+1
On Mon, Jul 15, 2019 at 2:13 AM Harald Jensås <hjensas@redhat.com> wrote:
On Sat, 2019-07-13 at 16:19 -0400, James Slagle wrote:
On Fri, Jul 12, 2019 at 3:59 PM Harald Jensås <hjensas@redhat.com> wrote:
I've said this before, but I think we should turn this nova-less around. Now with nova-less we create a bunch of servers, and write up the parameters file to use the deployed-server approach. Effectively we still need to have the resource group in heat making a server resource for every server. Creating the fake server resource is fast, because Heat doesn't call Nova/Ironic to create any resources. But the stack is equally big, with a stack for every node, i.e. not N=1.
What you are doing here is essentially to say we don't create a resource group that then creates N number of role stacks, one for each overcloud node. You are creating a single generic "server" definition per Role. So we drop the resource group and create OS::Triple::{{Role}}.Server one time (once). To me it's backwards to push a large struct with properties for N=many nodes into the creation of that stack.
I'm not entirely following what you're saying is backwards. What I've proposed is that we *don't* have any node specific data in the stack. It sounds like you're saying the way we do it today is backwards.
What I mean to say is that the way we are integrating nova-less today (first deploying the servers, then providing the data to Heat to create the resource groups) becomes backwards once your N=1 work is introduced.
It's correct that what's been proposed with metalsmith currently still requires the full ResourceGroup with a member for each node. With the template changes I'm proposing, that wouldn't be required, so we could actually do the Heat stack first, then metalsmith.
Yes, this is what I think we should do. Especially if your changes here remove the resource group entirely. It makes more sense to create the stack, and once that is created we can do deployment, scaling etc. without updating the stack again.
Currently the puppet/role-role.yaml creates all the network ports etc. As you only want to create it once, it instead could simply output the UUID of the networks+subnets. These are identical for all servers in the role. So we end up with a small heat stack.
Once the stack is created we could use that generic "server" role data to feed into something (ansible? python? mistral?) that calls metalsmith to build the servers, then create ports for each server in neutron, one port for each network+subnet defined in the role. Then feed that output into the json (hieradata) that is pushed to each node and used during service configuration: all the things we need to configure network interfaces, /etc/hosts and so on. We need a way to keep track of which ports belong to which node, but I guess something simple like using the node's ironic UUID in either the name, description or tag field of the neutron port will work. There is also the extra field in Ironic, which is json type, so we could place a map of network->port_uuid in there as well.
It won't matter whether we do baremetal provisioning before or after the Heat stack. Heat won't care, as it won't have any expectation to create any servers or that they are already created. We can define where we end up calling the metalsmith piece as it should be independent of the Heat stack if we make these template changes.
This is true. But, in your previous mail in this thread you wrote:
""" Other points:
- Baremetal provisioning and port creation are presently handled by Heat. With the ongoing efforts to migrate baremetal provisioning out of Heat (nova-less deploy), I think these efforts are very complementary. Eventually, we get to a point where Heat is not actually creating any other OpenStack API resources. For now, the patches only work when using pre-provisioned nodes. """
IMO "baremetal provision and port creation" fit together. (I read the above statement that way as well.) Currently nova-less creates the ctlplane port and provisions the baremetal node. If we want to do both baremetal provisioning and port creation together (I think this makes sense), we have to do it after the stack has created the networks.
What I envision is to have one method that creates all the ports, ctlplane + composable networks, in a unified way. Today these are created differently: the ctlplane port is part of the server resource (or metalsmith in the nova-less case) and the other ports are created by heat.
This is my main question about this proposal. When TripleO was in its infancy, there wasn't a mechanism to create Neutron ports separately from the server, so we created a Nova Server resource that specified which network the port was on (originally there was only one port created, now we create additional ports in Neutron). This can be seen in the puppet/<role>-role.yaml file, for example:

    resources:
      Controller:
        type: OS::TripleO::ControllerServer
        deletion_policy: {get_param: ServerDeletionPolicy}
        metadata:
          os-collect-config:
            command: {get_param: ConfigCommand}
            splay: {get_param: ConfigCollectSplay}
        properties:
          [...]
          networks:
            - if:
                - ctlplane_fixed_ip_set
                - network: ctlplane
                  subnet: {get_param: ControllerControlPlaneSubnet}
                  fixed_ip:
                    yaql:
                      expression: $.data.where(not isEmpty($)).first()
                      data:
                        - get_param: [ControllerIPs, 'ctlplane', {get_param: NodeIndex}]
                - network: ctlplane
                  subnet: {get_param: ControllerControlPlaneSubnet}

This has the side-effect that the ports are created by Nova calling Neutron rather than by Heat calling Neutron for port creation. We have maintained this mechanism even in the latest versions of THT for backwards compatibility. This would all be easier if we were creating the Neutron ctlplane port and then assigning it to the server, but that breaks backwards compatibility.

How would the creation of the ctlplane port be handled in this proposal? If metalsmith is creating the ctlplane port, do we still need a separate Server resource for every node? If so, I imagine it would have a much smaller stack than what we currently create for each server. If not, would metalsmith create a port on the ctlplane as part of the provisioning steps, and then pass this port back? We still need to be able to support fixed IPs for ctlplane ports, so we need to be able to pass a specific IP to metalsmith.
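For comparison, a minimal sketch of the port-first pattern described above (create the Neutron port, then hand it to the server) might look like the following. It is illustrative only, since adopting it is exactly what would break backwards compatibility; the resource names are assumptions:

    resources:
      ControllerCtlplanePort:
        type: OS::Neutron::Port
        properties:
          network: ctlplane
          fixed_ips:
            # A specific ip_address could be added here to pin the ctlplane IP
            - subnet: {get_param: ControllerControlPlaneSubnet}
      Controller:
        type: OS::Nova::Server
        properties:
          [...]
          networks:
            # Heat, not Nova, now owns the port's lifecycle
            - port: {get_resource: ControllerCtlplanePort}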
I think the creation of the actual Networks and Subnets can be left in heat; it's typically 5-6 networks and 5-6 subnets so it's not a lot of resources. Even in a large DCN deployment having 50-100 subnets per network, or even 50-100 networks, I think this isn't a problem.
Agreed, I'm not specifically proposing we move those pieces at this time.
+1
--
Dan Sneddon | Senior Principal Software Engineer
dsneddon@redhat.com | redhat.com/cloud
dsneddon:irc | @dxs:twitter
On Mon, 2019-07-15 at 11:25 -0700, Dan Sneddon wrote:
This is my main question about this proposal. When TripleO was in its infancy, there wasn't a mechanism to create Neutron ports separately from the server, so we created a Nova Server resource that specified which network the port was on (originally there was only one port created, now we create additional ports in Neutron). This can be seen in the puppet/<role>-role.yaml file, for example:
    resources:
      Controller:
        type: OS::TripleO::ControllerServer
        deletion_policy: {get_param: ServerDeletionPolicy}
        metadata:
          os-collect-config:
            command: {get_param: ConfigCommand}
            splay: {get_param: ConfigCollectSplay}
        properties:
          [...]
          networks:
            - if:
                - ctlplane_fixed_ip_set
                - network: ctlplane
                  subnet: {get_param: ControllerControlPlaneSubnet}
                  fixed_ip:
                    yaql:
                      expression: $.data.where(not isEmpty($)).first()
                      data:
                        - get_param: [ControllerIPs, 'ctlplane', {get_param: NodeIndex}]
                - network: ctlplane
                  subnet: {get_param: ControllerControlPlaneSubnet}
This has the side-effect that the ports are created by Nova calling Neutron rather than by Heat calling Neutron for port creation. We have maintained this mechanism even in the latest versions of THT for backwards compatibility. This would all be easier if we were creating the Neutron ctlplane port and then assigning it to the server, but that breaks backwards-compatibility.
This is indeed an issue that both nova-less and N=1 need to find a solution for.

As soon as the nova server resources are removed from a stack, the server and ctlplane port will be deleted. We lose track of which IP was assigned to which server at that point.

I believe the plan in nova-less is to use the "protected" flag for Ironic nodes to ensure the baremetal node is not unprovisioned (destroyed). So the overcloud node will keep running. This however doesn't solve the problem with the ctlplane port being deleted. We need to ensure that the port is either not deleted, or that a new port is immediately created using the same IP address as before. If we don't, we will very likely have duplicate IP issues on the next scale out.
How would the creation of the ctlplane port be handled in this proposal? If metalsmith is creating the ctlplane port, do we still need a separate Server resource for every node? If so, I imagine it would have a much smaller stack than what we currently create for each server. If not, would metalsmith create a port on the ctlplane as part of the provisioning steps, and then pass this port back? We still need to be able to support fixed IPs for ctlplane ports, so we need to be able to pass a specific IP to metalsmith.
The way nova-less works is that "openstack overcloud node provision" calls metalsmith to create a port and deploy the server. Once done, the data for the servers is placed in a heat environment file defining the 'DeployedServerPortMap' parameter etc. so that the already existing pre-deployed-server workflow[1] can be utilized.

Using fixed IPs for ctlplane ports is possible with nova-less, but the interface to do so is changed, see [2].

[1] https://docs.openstack.org/tripleo-docs/latest/install/advanced_deployment/d...
[2] https://specs.openstack.org/openstack/tripleo-specs/specs/stein/nova-less-de...
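For reference, a DeployedServerPortMap entry in an environment file looks roughly like the sketch below; the hostname key and addresses are examples, and the exact schema is in the deployed-server documentation linked above:

    parameter_defaults:
      DeployedServerPortMap:
        # One entry per node, keyed "<hostname>-ctlplane"
        controller-0-ctlplane:
          fixed_ips:
            - ip_address: 192.168.24.9
          subnets:
            - cidr: 24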
On Tue, Jul 16, 2019 at 7:29 AM Harald Jensås <hjensas@redhat.com> wrote:
As soon as the nova server resources are removed from a stack, the server and ctlplane port will be deleted. We lose track of which IP was assigned to which server at that point.
I believe the plan in nova-less is to use the "protected" flag for Ironic nodes to ensure the baremetal node is not unprovisioned (destroyed). So the overcloud node will keep running. This however doesn't solve the problem with the ctlplane port being deleted. We need to ensure that the port is either not deleted, or that a new port is immediately created using the same IP address as before. If we don't, we will very likely have duplicate IP issues on the next scale out.
Heat provides a supported interface mechanism to override its built-in resource types. We in fact already do this for both OS::Nova::Server and OS::Neutron::Port. When we manage resources of those types in our stack, we are actually using our custom plugins from tripleo-common. We can add additional logic there to handle this case and define whatever we want to have happen when the resources are deleted from the stack. This would address that issue for both N=1 and nova-less.

--
James Slagle
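As a sketch of that override mechanism: a Heat environment can also map a built-in type to a custom implementation via the resource_registry. TripleO's actual OS::Nova::Server and OS::Neutron::Port overrides ship as Python resource plugins in tripleo-common; the template path below is hypothetical:

    resource_registry:
      # Replace the built-in server type with a custom definition that can,
      # for example, preserve the port/IP mapping on delete (path is made up).
      OS::Nova::Server: /usr/share/openstack-tripleo-heat-templates/custom/server-override.yaml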
On Mon, Jul 15, 2019 at 2:25 PM Dan Sneddon <dsneddon@redhat.com> wrote:
On Mon, Jul 15, 2019 at 2:13 AM Harald Jensås <hjensas@redhat.com> wrote:
On Sat, 2019-07-13 at 16:19 -0400, James Slagle wrote:
On Fri, Jul 12, 2019 at 3:59 PM Harald Jensås <hjensas@redhat.com> wrote:
I've said this before, but I think we should turn this nova-less around. Now with nova-less we create a bunch of servers, and write up the parameters file to use the deployed-server approach. Effectively we still need to have the resource group in heat making a server resource for every server. Creating the fake server resource is fast, because Heat doesn't call Nova/Ironic to create any resources. But the stack is equally big, with a stack for every node, i.e. not N=1.
What you are doing here is essentially to say we don't create a resource group that then creates N number of role stacks, one for each overcloud node. You are creating a single generic "server" definition per Role. So we drop the resource group and create OS::Triple::{{Role}}.Server one time (once). To me it's backwards to push a large struct with properties for N=many nodes into the creation of that stack.
I'm not entirely following what you're saying is backwards. What I've proposed is that we *don't* have any node specific data in the stack. It sounds like you're saying the way we do it today is backwards.
What I mean to say is that the way we are integrating nova-less today (first deploying the servers, then providing the data to Heat to create the resource groups) becomes backwards once your N=1 work is introduced.
It's correct that what's been proposed with metalsmith currently still requires the full ResourceGroup with a member for each node. With the template changes I'm proposing, that wouldn't be required, so we could actually do the Heat stack first, then metalsmith.
Yes, this is what I think we should do. Especially if your changes here remove the resource group entirely. It makes more sense to create the stack, and once that is created we can do deployment, scaling etc. without updating the stack again.
Currently the puppet/role-role.yaml creates all the network ports etc. As you only want to create it once, it instead could simply output the UUID of the networks+subnets. These are identical for all servers in the role. So we end up with a small heat stack.
Once the stack is created we could use that generic "server" role data to feed into something (ansible? python? mistral?) that calls metalsmith to build the servers, then create ports for each server in neutron, one port for each network+subnet defined in the role. Then feed that output into the json (hieradata) that is pushed to each node and used during service configuration: all the things we need to configure network interfaces, /etc/hosts and so on. We need a way to keep track of which ports belong to which node, but I guess something simple like using the node's ironic UUID in either the name, description or tag field of the neutron port will work. There is also the extra field in Ironic, which is json type, so we could place a map of network->port_uuid in there as well.
It won't matter whether we do baremetal provisioning before or after the Heat stack. Heat won't care, as it won't have any expectation to create any servers or that they are already created. We can define where we end up calling the metalsmith piece as it should be independent of the Heat stack if we make these template changes.
This is true. But, in your previous mail in this thread you wrote:
""" Other points:
- Baremetal provisioning and port creation are presently handled by Heat. With the ongoing efforts to migrate baremetal provisioning out of Heat (nova-less deploy), I think these efforts are very complementary. Eventually, we get to a point where Heat is not actually creating any other OpenStack API resources. For now, the patches only work when using pre-provisioned nodes. """
IMO "baremetal provision and port creation" fit together. (I read the above statement that way as well.) Currently nova-less creates the ctlplane port and provisions the baremetal node. If we want to do both baremetal provisioning and port creation together (I think this makes sense), we have to do it after the stack has created the networks.
What I envision is to have one method that creates all the ports, ctlplane + composable networks, in a unified way. Today these are created differently: the ctlplane port is part of the server resource (or metalsmith in the nova-less case) and the other ports are created by heat.
This is my main question about this proposal. When TripleO was in its infancy, there wasn't a mechanism to create Neutron ports separately from the server, so we created a Nova Server resource that specified which network the port was on (originally there was only one port created, now we create additional ports in Neutron). This can be seen in the puppet/<role>-role.yaml file, for example:
    resources:
      Controller:
        type: OS::TripleO::ControllerServer
        deletion_policy: {get_param: ServerDeletionPolicy}
        metadata:
          os-collect-config:
            command: {get_param: ConfigCommand}
            splay: {get_param: ConfigCollectSplay}
        properties:
          [...]
          networks:
            - if:
                - ctlplane_fixed_ip_set
                - network: ctlplane
                  subnet: {get_param: ControllerControlPlaneSubnet}
                  fixed_ip:
                    yaql:
                      expression: $.data.where(not isEmpty($)).first()
                      data:
                        - get_param: [ControllerIPs, 'ctlplane', {get_param: NodeIndex}]
                - network: ctlplane
                  subnet: {get_param: ControllerControlPlaneSubnet}
This has the side-effect that the ports are created by Nova calling Neutron rather than by Heat calling Neutron for port creation. We have maintained this mechanism even in the latest versions of THT for backwards compatibility. This would all be easier if we were creating the Neutron ctlplane port and then assigning it to the server, but that breaks backwards-compatibility.
How would the creation of the ctlplane port be handled in this proposal? If metalsmith is creating the ctlplane port, do we still need a separate Server resource for every node? If so, I imagine it would have a much smaller stack than what we currently create for each server. If not, would metalsmith create a port on the ctlplane as part of the provisioning steps, and then pass this port back? We still need to be able to support fixed IPs for ctlplane ports, so we need to be able to pass a specific IP to metalsmith.
I think most of your questions pertain to defining the right interface for baremetal provisioning with metalsmith. We more or less have a clean slate there in terms of how we want that to look going forward. Given that it won't use Nova, my understanding is that the port(s) will be created via Neutron directly.

We won't need separate server resources in the stack for every node once provisioning is not part of the stack. We will need to look at how we are creating the other network isolation ports per server, however. It's something that we'll need to look at to see if we want to keep using Neutron just for IPAM. It seems a little wasteful to me, but perhaps it's not an issue even with thousands of ports.

Initially, you'd be able to scale with just Ansible as long as the operator does not mistakenly use overlapping IP's. We could also add ansible tasks that create the ports in Neutron (or verify they were already created) so that the actual IPAM usage is properly reflected in Neutron (see the sketch below).

--
James Slagle
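A rough sketch of such a task, using the os_port module that existed in Ansible at the time; the network name and variables are assumptions, not actual config-download tasks:

    # Sketch: reflect the Ansible-assigned IP back into Neutron so IPAM
    # stays accurate even when scaling happens outside of Heat.
    - name: Ensure a Neutron port exists for this node's internal_api IP
      os_port:
        state: present
        name: "{{ inventory_hostname }}-internal_api"
        network: internal_api
        fixed_ips:
          - ip_address: "{{ internal_api_ip }}"
      delegate_to: localhost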
On 15/07/19 9:12 PM, Harald Jensås wrote:
On Sat, 2019-07-13 at 16:19 -0400, James Slagle wrote:
On Fri, Jul 12, 2019 at 3:59 PM Harald Jensås <hjensas@redhat.com> wrote:
I've said this before, but I think we should turn this nova-less around. Now with nova-less we create a bunch of servers, and write up the parameters file to use the deployed-server approach. Effectively we still need to have the resource group in heat making a server resource for every server. Creating the fake server resource is fast, because Heat doesn't call Nova/Ironic to create any resources. But the stack is equally big, with a stack for every node, i.e. not N=1.
What you are doing here is essentially to say we don't create a resource group that then creates N number of role stacks, one for each overcloud node. You are creating a single generic "server" definition per Role. So we drop the resource group and create OS::Triple::{{Role}}.Server one time (once). To me it's backwards to push a large struct with properties for N=many nodes into the creation of that stack.
I'm not entirely following what you're saying is backwards. What I've proposed is that we *don't* have any node specific data in the stack. It sounds like you're saying the way we do it today is backwards.
What I mean to say is that the way we are integrating nova-less today (first deploying the servers, then providing the data to Heat to create the resource groups) becomes backwards once your N=1 work is introduced.
It's correct that what's been proposed with metalsmith currently still requires the full ResourceGroup with a member for each node. With the template changes I'm proposing, that wouldn't be required, so we could actually do the Heat stack first, then metalsmith.
Yes, this is what I think we should do. Especially if your changes here remove the resource group entirely. It makes more sense to create the stack, and once that is created we can do deployment, scaling etc. without updating the stack again.
I think this is something we can move towards after James has finished this work. It would probably mean deprecating "openstack overcloud node provision" and providing some other way of running the baremetal provisioning in isolation after the heat stack operation, like an equivalent to "openstack overcloud deploy --config-download-only"
On 7/16/19 12:26 AM, Steve Baker wrote:
On 15/07/19 9:12 PM, Harald Jensås wrote:
On Sat, 2019-07-13 at 16:19 -0400, James Slagle wrote:
On Fri, Jul 12, 2019 at 3:59 PM Harald Jensås <hjensas@redhat.com> wrote:
I've said this before, but I think we should turn this nova-less around. Now with nova-less we create a bunch of servers, and write up the parameters file to use the deployed-server approach. Effectively we still need to have the resource group in heat making a server resource for every server. Creating the fake server resource is fast, because Heat doesn't call Nova/Ironic to create any resources. But the stack is equally big, with a stack for every node, i.e. not N=1.
What you are doing here is essentially to say we don't create a resource group that then creates N number of role stacks, one for each overcloud node. You are creating a single generic "server" definition per Role. So we drop the resource group and create OS::Triple::{{Role}}.Server one time (once). To me it's backwards to push a large struct with properties for N=many nodes into the creation of that stack.
I'm not entirely following what you're saying is backwards. What I've proposed is that we *don't* have any node specific data in the stack. It sounds like you're saying the way we do it today is backwards.
What I mean to say is that the way we are integrating nova-less today (first deploying the servers, then providing the data to Heat to create the resource groups) becomes backwards once your N=1 work is introduced.
It's correct that what's been proposed with metalsmith currently still requires the full ResourceGroup with a member for each node. With the template changes I'm proposing, that wouldn't be required, so we could actually do the Heat stack first, then metalsmith.
Yes, this is what I think we should do. Especially if your changes here remove the resource group entirely. It makes more sense to create the stack, and once that is created we can do deployment, scaling etc. without updating the stack again.
I think this is something we can move towards after James has finished this work. It would probably mean deprecating "openstack overcloud node provision" and providing some other way of running the baremetal provisioning in isolation after the heat stack operation, like an equivalent to "openstack overcloud deploy --config-download-only"
I'm very much against deprecating "openstack overcloud node provision"; it's one of the reasons for this whole effort. I'm equally -2 on making the bare metal provisioning depend on heat in any way, for the same reason.

Dmitry
On Tue, Jul 16, 2019 at 12:19 AM Dmitry Tantsur <dtantsur@redhat.com> wrote:
On 7/16/19 12:26 AM, Steve Baker wrote:
On 15/07/19 9:12 PM, Harald Jensås wrote:
On Sat, 2019-07-13 at 16:19 -0400, James Slagle wrote:
On Fri, Jul 12, 2019 at 3:59 PM Harald Jensås <hjensas@redhat.com> wrote:
I've said this before, but I think we should turn this nova-less around. Now with nova-less we create a bunch of servers, and write up the parameters file to use the deployed-server approach. Effectively we still need to have the resource group in heat making a server resource for every server. Creating the fake server resource is fast, because Heat doesn't call Nova/Ironic to create any resources. But the stack is equally big, with a stack for every node, i.e. not N=1.
What you are doing here is essentially to say we don't create a resource group that then creates N number of role stacks, one for each overcloud node. You are creating a single generic "server" definition per Role. So we drop the resource group and create OS::Triple::{{Role}}.Server one time (once). To me it's backwards to push a large struct with properties for N=many nodes into the creation of that stack.
I'm not entirely following what you're saying is backwards. What I've proposed is that we *don't* have any node specific data in the stack. It sounds like you're saying the way we do it today is backwards.
What I mean to say is that the way we are integrating nova-less today (first deploying the servers, then providing the data to Heat to create the resource groups) becomes backwards once your N=1 work is introduced.
It's correct that what's been proposed with metalsmith currently still requires the full ResourceGroup with a member for each node. With the template changes I'm proposing, that wouldn't be required, so we could actually do the Heat stack first, then metalsmith.
Yes, this is what I think we should do. Especially if your changes here remove the resource group entirely. It makes more sense to create the stack, and once that is created we can do deployment, scaling etc. without updating the stack again.
I think this is something we can move towards after James has finished this work. It would probably mean deprecating "openstack overcloud node provision" and providing some other way of running the baremetal provisioning in isolation after the heat stack operation, like an equivalent to "openstack overcloud deploy --config-download-only"
I'm very much against deprecating "openstack overcloud node provision"; it's one of the reasons for this whole effort. I'm equally -2 on making the bare metal provisioning depend on heat in any way, for the same reason.
Dmitry
My concerns about network ports boil down to technical debt with Heat. It would be great if we can make the individual nodes completely independent of Heat, and somehow migrate from the old Heat-based definition for upgrades. -- Dan Sneddon
On 16.07.2019 9:34, Dan Sneddon wrote:
On Tue, Jul 16, 2019 at 12:19 AM Dmitry Tantsur <dtantsur@redhat.com> wrote:
On 7/16/19 12:26 AM, Steve Baker wrote:
I think this is something we can move towards after James has finished this work. It would probably mean deprecating "openstack overcloud node provision" and providing some other way of running the baremetal provisioning in isolation after the heat stack operation, like an equivalent to "openstack overcloud deploy --config-download-only"
I'm very much against deprecating "openstack overcloud node provision"; it's one of the reasons for this whole effort. I'm equally -2 on making the bare metal provisioning depend on heat in any way, for the same reason.
Dmitry
My concerns about network ports boil down to technical debt with Heat. It would be great if we can make the individual nodes completely independent of Heat, and somehow migrate from the old Heat-based definition for upgrades.
As was mentioned earlier in the thread, we'll very likely need some external data store to migrate/upgrade things out of Heat smoothly. That should probably be etcd? I don't think a clever ansible inventory alone could fully replace such a data store.
-- Dan Sneddon
--
Best regards,
Bogdan Dobrelya
Irc #bogdando
On Tue, Jul 16, 2019 at 3:23 AM Dmitry Tantsur <dtantsur@redhat.com> wrote:
On 7/16/19 12:26 AM, Steve Baker wrote:
On 15/07/19 9:12 PM, Harald Jensås wrote:
On Sat, 2019-07-13 at 16:19 -0400, James Slagle wrote:
On Fri, Jul 12, 2019 at 3:59 PM Harald Jensås <hjensas@redhat.com> wrote:
I've said this before, but I think we should turn this nova-less around. Now with nova-less we create a bunch of servers, and write up the parameters file to use the deployed-server approach. Effectively we still need to have the resource group in heat making a server resource for every server. Creating the fake server resource is fast, because Heat doesn't call Nova/Ironic to create any resources. But the stack is equally big, with a stack for every node, i.e. not N=1.
What you are doing here is essentially to say we don't create a resource group that then creates N number of role stacks, one for each overcloud node. You are creating a single generic "server" definition per Role. So we drop the resource group and create OS::Triple::{{Role}}.Server one time (once). To me it's backwards to push a large struct with properties for N=many nodes into the creation of that stack.
I'm not entirely following what you're saying is backwards. What I've proposed is that we *don't* have any node specific data in the stack. It sounds like you're saying the way we do it today is backwards.
What I mean to say is that the way we are integrating nova-less today (first deploying the servers, then providing the data to Heat to create the resource groups) becomes backwards once your N=1 work is introduced.
It's correct that what's been proposed with metalsmith currently still requires the full ResourceGroup with a member for each node. With the template changes I'm proposing, that wouldn't be required, so we could actually do the Heat stack first, then metalsmith.
Yes, this is what I think we should do. Especially if your changes here remove the resource group entirely. It makes more sense to create the stack, and once that is created we can do deployment, scaling etc. without updating the stack again.
I think this is something we can move towards after James has finished this work. It would probably mean deprecating "openstack overcloud node provision" and providing some other way of running the baremetal provisioning in isolation after the heat stack operation, like an equivalent to "openstack overcloud deploy --config-download-only"
I'm very much against deprecating "openstack overcloud node provision"; it's one of the reasons for this whole effort. I'm equally -2 on making the bare metal provisioning depend on heat in any way, for the same reason.
I think what's being proposed here is just that we'd change the ordering of the workflow in that we'd do the Heat stack first.

That being said, I see the lack of dependency working both ways. Baremetal provisioning should not depend on Heat, and Heat should not depend on baremetal provisioning. You should be able to create the Heat stack without the servers actually existing (same as you can do today with pre-provisioned nodes).

--
James Slagle
On 7/16/19 2:25 PM, James Slagle wrote:
On Tue, Jul 16, 2019 at 3:23 AM Dmitry Tantsur <dtantsur@redhat.com> wrote:
On 7/16/19 12:26 AM, Steve Baker wrote:
On 15/07/19 9:12 PM, Harald Jensås wrote:
On Sat, 2019-07-13 at 16:19 -0400, James Slagle wrote:
On Fri, Jul 12, 2019 at 3:59 PM Harald Jensås <hjensas@redhat.com> wrote:
I've said this before, but I think we should turn this nova-less around. Now with nova-less we create a bunch of servers, and write up the parameters file to use the deployed-server approach. Effectively we still need to have the resource group in heat making a server resource for every server. Creating the fake server resource is fast, because Heat doesn't call Nova/Ironic to create any resources. But the stack is equally big, with a stack for every node, i.e. not N=1.
What you are doing here is essentially to say we don't create a resource group that then creates N number of role stacks, one for each overcloud node. You are creating a single generic "server" definition per Role. So we drop the resource group and create OS::Triple::{{Role}}.Server one time (once). To me it's backwards to push a large struct with properties for N=many nodes into the creation of that stack.
I'm not entirely following what you're saying is backwards. What I've proposed is that we *don't* have any node specific data in the stack. It sounds like you're saying the way we do it today is backwards.
What I mean to say is that the way we are integrating nova-less today (first deploying the servers, then providing the data to Heat to create the resource groups) becomes backwards once your N=1 work is introduced.
It's correct that what's been proposed with metalsmith currently still requires the full ResourceGroup with a member for each node. With the template changes I'm proposing, that wouldn't be required, so we could actually do the Heat stack first, then metalsmith.
Yes, this is what I think we should do. Especially if your changes here remove the resource group entirely. It makes more sense to create the stack, and once that is created we can do deployment, scaling etc. without updating the stack again.
I think this is something we can move towards after James has finished this work. It would probably mean deprecating "openstack overcloud node provision" and providing some other way of running the baremetal provisioning in isolation after the heat stack operation, like an equivalent to "openstack overcloud deploy --config-download-only"
I'm very much against deprecating "openstack overcloud node provision"; it's one of the reasons for this whole effort. I'm equally -2 on making the bare metal provisioning depend on heat in any way, for the same reason.
I think what's being proposed here is just that we'd change the ordering of the workflow in that we'd do the Heat stack first.
That being said, I see the lack of dependency working both ways. Baremetal provisioning should not depend on Heat, and Heat should not depend on baremetal provisioning. You should be able to create the Heat stack without the servers actually existing (same as you can do today with pre-provisioned nodes).
Right, and we should be able to provision bare metal nodes without pre-creating the Heat stack. And then I don't understand why we want to change the current proposal.
participants (8)
- Bogdan Dobrelya
- Dan Sneddon
- David Peacock
- Dmitry Tantsur
- Emilien Macchi
- Harald Jensås
- James Slagle
- Steve Baker