<p dir="ltr">The issue with the availability zone solution is that it forces availability zones in Nova to follow the network configuration. In the L3 ToR/no-overlay configuration, this means every rack is its own availability zone. That is pretty annoying for users, who have to choose from potentially hundreds of availability zones, and it rules out defining AZs based on other things (e.g. current phase, cooling systems, etc.).</p>
<p dir="ltr">I may be misunderstanding, and you could be suggesting that this availability zone not be exposed to the end user and only be made available to the scheduler. However, that defeats one of the purposes of availability zones, which is to let users spread their instances across failure domains by selecting different AZs.</p>
<div class="gmail_quote">On Jul 22, 2015 2:41 PM, "Assaf Muller" <<a href="mailto:amuller@redhat.com">amuller@redhat.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I added a summary of my thoughts about the enhancements I think we could<br>
make to the Nova scheduler in order to better support the Neutron provider<br>
networks use case.<br>
<br>
----- Original Message -----<br>
> On Tue, Jul 21, 2015 at 1:11 PM, John Belamaric <<a href="mailto:jbelamaric@infoblox.com">jbelamaric@infoblox.com</a>><br>
> wrote:<br>
> > Wow, a lot to digest in these threads. Let me summarize my understanding<br>
> > of the two proposals; let me know whether I get this right. There are a<br>
> > couple of problems that need to be solved:<br>
> ><br>
> > a. Scheduling based on host reachability to the segments<br>
> > b. Floating IP functionality across the segments. I am not sure I am clear<br>
> > on this one but it sounds like you want the routers attached to the<br>
> > segments<br>
> > to advertise routes to the specific floating IPs. Presumably then they<br>
> > would<br>
> > do NAT or the instance would assign both the fixed IP and the floating IP<br>
> > to<br>
> > its interface?<br>
> ><br>
> > In Proposal 1, (a) is solved by associating segments to the front network<br>
> > via a router - that association is used to provide a single hook into the<br>
> > existing API that limits the scope of segment selection to those associated<br>
> > with the front network. (b) is solved by tying the floating IP ranges to<br>
> > the<br>
> > same front network and managing the reachability with dynamic routing.<br>
> ><br>
> > In Proposal 2, (a) is solved by tagging each network with some meta-data<br>
> > that the IPAM system uses to make a selection. This implies an IP<br>
> > allocation<br>
> > request that passes something other than a network/port to the IPAM<br>
> > subsystem. This is fine from the IPAM point of view but there is no<br>
> > corresponding API for this right now. To solve (b) either the IPAM system<br>
> > has to publish the routes or the higher level management has to ALSO be<br>
> > aware of the mappings (rather than just IPAM).<br>
><br>
> John, from your summary above, you seem to have the best understanding<br>
> of the whole of what I was weakly attempting to communicate. Thank<br>
> you for summarizing.<br>
><br>
> > To throw some fuel on the fire, I would argue also that (a) is not<br>
> > sufficient and address availability needs to be considered as well (as<br>
> > described in [1]). Selecting a host based on reachability alone will fail<br>
> > when addresses are exhausted. Similarly, with (b) I think there needs to be<br>
> > consideration during association of a floating IP to the effect on routing.<br>
> > That is, rather than a huge number of host routes it would be ideal to<br>
> > allocate the floating IPs in blocks that can be associated with the backing<br>
> > networks (though we would want to be able to split these blocks as small as<br>
> > a /32 if necessary - but avoid it/optimize as much as possible).<br>
><br>
> Yes, address availability is a factor and must be considered in either<br>
> case. My email was getting long already and I thought that could be<br>
> considered separately since I believe it applies regardless of the<br>
> outcome of this thread. But, since it seems to be an essential part<br>
> of this conversation, let me say something about it.<br>
><br>
> Ultimately, we need to match up the host scheduled by Nova to the<br>
> addresses available to that host. We could do this by delaying<br>
> address assignment until after host binding or we could do it by<br>
> including segment information from Neutron during scheduling. The<br>
> latter has the advantage that we can consider IP availability during<br>
> scheduling. That is why GoDaddy implemented it that way.<br>
><br>
> > In fact, I think that these proposals are more or less the same - it's just<br>
> > in #1 the meta-data used to tie the backing networks together is another<br>
> > network. This allows it to fit in neatly with the existing APIs. You would<br>
> > still need to implement something prior to IPAM or within IPAM that would<br>
> > select the appropriate backing network.<br>
><br>
> They are similar but to say they're the same is going a bit too far.<br>
> If they were the same then we'd be done with this conversation. ;)<br>
><br>
> > As a (gulp) third alternative, we should consider that the front network<br>
> > here is in essence a layer 3 domain, and we have modeled layer 3 domains as<br>
> > address scopes in Liberty. The user is essentially saying "give me an<br>
> > address that is routable in this scope" - they don't care which actual<br>
> > subnet it gets allocated on. This is conceptually more in-line with [2] -<br>
> > modeling L3 domain separately from the existing Neutron concept of a<br>
> > network<br>
> > being a broadcast domain.<br>
><br>
> I will consider this some more. This is an interesting thought.<br>
> Address scopes and subnet pools could play a role here. I don't yet<br>
> see how it can all fit together but it is worth some thought.<br>
><br>
> One nit: the Neutron network might have been conceived as being just<br>
> "a broadcast domain" but, in practice, it is L2 and L3. The Neutron<br>
> subnet is not really an L3 construct; it is just a CIDR and doesn't<br>
> make sense on its own without considering its association with a<br>
> network and the other subnets associated with the same network.<br>
><br>
> > Fundamentally, however we associate the segments together, this comes down<br>
> > to a scheduling problem. Nova needs to be able to incorporate data from<br>
> > Neutron in its scheduling decision. Rather than solving this with a single<br>
> > piece of meta-data like network_id as described in proposal 1, it probably<br>
> > makes more sense to build out the general concept of utilizing network data<br>
> > for nova scheduling. We could still model this as in #1, or using address<br>
> > scopes, or some arbitrary data as in #2. But the harder problem to solve is<br>
> > the scheduling, not how we tag these things to inform that scheduling.<br>
><br>
> Yet how we tag these things seems to be a significant point of<br>
> interest. Maybe not with you but with Ian and Assaf it certainly is.<br>
><br>
> As I said above, I agree that the scheduling part is very important<br>
> and needs to be discussed but I still separate them in my mind from<br>
> this question.<br>
<br>
I'm basing these ideas on my understanding of the GoDaddy, YY and Yahoo requirements in<br>
<a href="https://etherpad.openstack.org/p/Network_Segmentation_Usecases" rel="noreferrer" target="_blank">https://etherpad.openstack.org/p/Network_Segmentation_Usecases</a>. I am purposely not looking<br>
at the problems being presented by Calico or similar /32 BGP advertising implementations,<br>
nor the idea of injecting floating IPs, as I believe those to be separate problems, and conflating<br>
them with everything else presented in that Etherpad would be a mistake. In other words I'm not<br>
trying to solve all of the problems that have ever existed, just some of them :) I'd love to get<br>
feedback from the authors of that Etherpad to see how much progress we'd be making here and if it's in the right direction.<br>
<br>
Context:<br>
Neutron supports self-service networking, often implemented with overlay networks. An overlay-based (GRE, VXLAN)<br>
network is not location-sensitive; that is, all compute nodes have access to such a network as long as<br>
the compute nodes can ping each other (which may be realized over layer 2 or via routing in your data center).<br>
Some deployments opt out of this type of solution, and instead an admin pre-creates and shares one or more networks that are<br>
realized via VLANs. Tenants connect their VMs to these pre-created networks and don't create networks of their own.<br>
In that case, not all compute nodes necessarily have access to a given network. Here are some pretty graphics:<br>
<a href="http://i.imgur.com/bHPgcTw.png" rel="noreferrer" target="_blank">http://i.imgur.com/bHPgcTw.png</a><br>
<br>
Problem 1:<br>
In this example, VLANs 11 and 12 and subnets 10.0.1.0/24 and 10.0.2.0/24 are only available in rack 1.<br>
In this case the admin would create four Neutron networks (VLANs 11, 12, 13 and 14 with their respective subnets).<br>
However, the Nova scheduler is not exposed to this information. This means that if a VM is booted on network 1<br>
(and no AZ is specified), Nova may try to start it in rack 2, where network 1 is not available.<br>
Neutron port binding would fail in this case and the VM would end up in the error state.<br>
<br>
Solution:<br>
Tag Neutron networks with an AZ as detailed here <a href="http://specs.openstack.org/openstack/neutron-specs/specs/liberty/availability-zone.html" rel="noreferrer" target="_blank">http://specs.openstack.org/openstack/neutron-specs/specs/liberty/availability-zone.html</a>.<br>
This means that when the admin creates network 1, they put it in AZ 1. When a VM is booted on network 1,<br>
the Nova scheduler will only consider hosts in AZ 1. If an AZ is specified, then Nova will fail fast and yell<br>
if the specified AZ doesn't match the AZ the Neutron network is in. If a network is not in an AZ (AZ == None),<br>
then the behavior is backwards compatible. Currently it is assumed that all hosts have access to all networks;<br>
with this change, the assumption becomes that all hosts in the same AZ have access to all networks in that AZ.<br>
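<br>
A minimal sketch of the host filtering described in this solution, written as plain Python rather than real Nova scheduler code.<br>
The names Host, Network, AZMismatch and candidate_hosts are hypothetical and exist only for this illustration:<br>

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Host:
    name: str
    az: str


@dataclass(frozen=True)
class Network:
    name: str
    az: Optional[str] = None  # None: the network is not tied to any AZ


class AZMismatch(Exception):
    """The AZ requested at boot conflicts with the network's AZ."""


def candidate_hosts(hosts: List[Host], network: Network,
                    requested_az: Optional[str] = None) -> List[Host]:
    """Return the hosts the scheduler may consider for this network."""
    if network.az is not None:
        if requested_az is not None and requested_az != network.az:
            # Fail fast instead of letting port binding fail later.
            raise AZMismatch(
                f"{network.name} is in {network.az}, not {requested_az}")
        wanted_az = network.az
    else:
        # Untagged network: today's behavior, every host is assumed
        # to reach it, so only a user-requested AZ narrows the list.
        wanted_az = requested_az
    return [h for h in hosts if wanted_az is None or h.az == wanted_az]


hosts = [Host("host1", "AZ1"), Host("host2", "AZ2")]
network1 = Network("network 1", az="AZ1")
print([h.name for h in candidate_hosts(hosts, network1)])  # ['host1']
```

Booting on network 1 while explicitly requesting AZ2 raises AZMismatch in this sketch, which corresponds to the fail-fast case.<br>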
<br>
Problem & Solution 2:<br>
With tenant networking it makes sense to select the network a VM would boot on.<br>
For example, if a tenant created three networks (say DB, backend and web tiers, each with its own network and security group),<br>
then each VM would need to go on a specific network according to its role.<br>
With provider networking, you may want to boot a VM and let Nova select the appropriate network for you.<br>
To clarify, you would not specify a network_id or port_id when booting a VM. Nova would schedule the VM to host 1 in rack 1,<br>
and then randomly select network 1 or 2 (because those are available to the AZ that host 1 is in, and the Nova scheduler would know this with problem 1 solved).<br>
<br>
Problem 3:<br>
If the 'nova selects a network for you' proposal (marked as problem/solution 2) is implemented,<br>
you could run into an issue where the IP addresses on the Nova-selected network are exhausted<br>
and another network available in that rack/AZ should have been chosen instead.<br>
<br>
Solution:<br>
The nova scheduler could depend on: <a href="https://review.openstack.org/#/c/180803/" rel="noreferrer" target="_blank">https://review.openstack.org/#/c/180803/</a> - A new API to report IP availability per network.<br>
Then, as an additional built-in scheduling filter, when choosing a network, make sure that the network has an IP address available.<br>
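<br>
Solutions 2 and 3 together could look roughly like the sketch below. The free-IP counts are passed in as a plain dict,<br>
because the response format of the proposed IP availability API is not described here; pick_network and its arguments are hypothetical names:<br>

```python
import random
from typing import Dict, List, Optional


def pick_network(networks_in_az: List[str],
                 free_ips: Dict[str, int],
                 rng: Optional[random.Random] = None) -> Optional[str]:
    """Randomly pick a network that still has a free IP address.

    networks_in_az: networks reachable from the chosen host's AZ.
    free_ips: free address count per network (assumed input).
    Returns None when every candidate network is exhausted.
    """
    if rng is None:
        rng = random.Random()
    # Drop networks with no free addresses before the random choice.
    usable = [n for n in networks_in_az if free_ips.get(n, 0) > 0]
    if not usable:
        return None
    return rng.choice(usable)


# Network 1 is exhausted, so only network 2 can be chosen.
print(pick_network(["network 1", "network 2"],
                   {"network 1": 0, "network 2": 12}))  # network 2
```

A None result would tell the scheduler that no network in that rack/AZ can take the VM, so it can try another host instead of failing at port creation.<br>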
<br>
Problem 4 (Disclaimer: This one isn't as well thought out):<br>
When Nova selects a network (either from a specific AZ or not), solution 2 suggests that it'll essentially be a random choice,<br>
apart from IP availability as defined in the solution to problem 3. The user may not want to specify which<br>
network the VM will be connected to, but may want to specify a property that the network must satisfy, such as the security zone<br>
(this is taken straight from the Etherpad).<br>
<br>
Solution:<br>
A new 'tags' property will be added to the network model, as a list of strings (or perhaps pairs consisting of a tag and its description).<br>
When creating a network you could specify arbitrary data to be placed in those tags. When booting a VM, a user could specify a tag (or tags)<br>
instead of a network/port_id, and the Nova scheduler will filter out any networks that do not have those tags.<br>
Tags will be writable by the owner of the network (so an admin in the case of provider networks) and readable by anyone else.<br>
Here are some more pretty graphics that may explain things better: <a href="http://i.imgur.com/89apoA8.png" rel="noreferrer" target="_blank">http://i.imgur.com/89apoA8.png</a>.<br>
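<br>
The tag matching in this solution amounts to a subset check. A small sketch, assuming tags are plain strings;<br>
the function name and the example tag values are made up for illustration:<br>

```python
from typing import Dict, Iterable, List, Set


def networks_with_tags(network_tags: Dict[str, Set[str]],
                       required: Iterable[str]) -> List[str]:
    """Keep only the networks that carry every requested tag."""
    wanted = set(required)
    return sorted(name for name, tags in network_tags.items()
                  if wanted.issubset(tags))


# Hypothetical tags, set by the admin who owns the provider networks.
network_tags = {
    "network 1": {"security-zone:dmz"},
    "network 2": {"security-zone:internal", "pci"},
    "network 3": {"security-zone:internal"},
}
print(networks_with_tags(network_tags, ["security-zone:internal"]))
# ['network 2', 'network 3']
```

The surviving networks could then go through the IP availability check from solution 3 before one of them is picked.<br>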
<br>
><br>
> > The optimization of routing for floating IPs is also a scheduling problem,<br>
> > though one that would require a lot more changes to how FIP are allocated<br>
> > and associated to solve.<br>
> ><br>
> > John<br>
> ><br>
> > [1] <a href="https://review.openstack.org/#/c/180803/" rel="noreferrer" target="_blank">https://review.openstack.org/#/c/180803/</a><br>
> > [2] <a href="https://bugs.launchpad.net/neutron/+bug/1458890/comments/7" rel="noreferrer" target="_blank">https://bugs.launchpad.net/neutron/+bug/1458890/comments/7</a><br>
><br>
> __________________________________________________________________________<br>
> OpenStack Development Mailing List (not for usage questions)<br>
> Unsubscribe: <a href="http://OpenStack-dev-request@lists.openstack.org?subject:unsubscribe" rel="noreferrer" target="_blank">OpenStack-dev-request@lists.openstack.org?subject:unsubscribe</a><br>
> <a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" rel="noreferrer" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>
><br>
<br>
</blockquote></div>