On Thu, Nov 21, 2019 at 17:28, Eric Fried openstack@fried.cc wrote:
Hi all. Sean mentioned a "downstream bug" in IRC today [1], which we discussed a little bit, but without gibi; and then Matt and I discussed it more later [2], but without gibi *or* Sean. Since I don't know if there's a bug report I can comment on, I wanted to summarize here for now so I don't forget.
The Problem
Neutron needs to know the compute node resource provider on which to hang the child providers for QoS bandwidth. Today it assumes CONF.host is the name of that provider.
That's wrong.
The name of the provider is `hypervisor_hostname`, which for libvirt [3] happens to match the *default* value of CONF.host [4].
Per the bug Sean describes, if you override CONF.host, neutron won't find the compute node provider, and things break.
The problem will be the same for any non-nova service wishing to discover the compute node RP -- e.g. cyborg for purposes of creating child providers for accelerators.
As background, I'm pretty sure I made this mistake by looking at how nova and neutron do port binding. There nova sends instance.host (which comes from CONF.host) to neutron in the binding:host_id field of the neutron port. The difference between the binding and the inventory handling is that port.binding:host_id is used by neutron to figure out which service host runs the nova-compute service as well as the neutron agent, so there CONF.host is enough. Libvirt also runs on the same physical host as nova-compute and the neutron agent. BUT while for nova and neutron the hostname can be overridden by setting CONF.host to an arbitrary string, no such configuration can be applied to libvirt. Libvirt will always use Linux's gethostname() call [6].
The Right Solution
Neutron (and any $service) should look up the compute node provider by its UUID. That's returned by the /os-hypervisors APIs after microversion 2.53, e.g. [5], but, Catch-22, you can currently only filter those results on hypervisor_hostname. So you would have to e.g. GET /os-hypervisors/detail and then walk the list looking for service.host matching CONF.host. That's way heavy for your CERNs.
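To make the "heavy" lookup concrete, here is a minimal sketch (not actual neutron code) of walking a GET /os-hypervisors/detail response (microversion >= 2.53, where the hypervisor id is the compute node resource provider UUID) to find the entry whose service.host matches CONF.host. The function name, payload trimming, and UUIDs are all illustrative:

```python
def find_compute_node_rp_uuid(hypervisors, conf_host):
    """Return the resource provider UUID of the hypervisor whose
    nova-compute service host matches conf_host, or None."""
    for hyp in hypervisors:
        if hyp.get("service", {}).get("host") == conf_host:
            # With microversion >= 2.53 the hypervisor id *is* the
            # compute node resource provider UUID in placement.
            return hyp["id"]
    return None


# Trimmed, made-up body shaped like GET /os-hypervisors/detail (2.53+):
detail = [
    {"id": "b1e43b5f-eabe-4401-b02f-b1c6e21e3e13",
     "hypervisor_hostname": "node1.example.com",
     "service": {"host": "compute-custom-name"}},   # CONF.host overridden
    {"id": "9c2566e7-7e5a-4cd8-b52e-0a3ebb47e1da",
     "hypervisor_hostname": "node2.example.com",
     "service": {"host": "node2.example.com"}},
]

print(find_compute_node_rp_uuid(detail, "compute-custom-name"))
# -> b1e43b5f-eabe-4401-b02f-b1c6e21e3e13
```

Note that this only works at all because service.host is in the detail payload; it still means fetching and scanning every hypervisor, which is the "way heavy for your CERNs" part.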
So the proposal moving forward is to add (in a new microversion) a ?service_host=XXX qparam to those APIs to let you filter down to just the one entry for your CONF.host. The UUID of that entry will also be the UUID of the compute node resource provider. (At that point you don't even need to ask Placement for that provider; you can just use that UUID directly in the APIs that create the child providers. Yay, you got your extra API call back.)
The result of ?service_host=XXX cannot be a single compute node. It needs to be a list of compute nodes (because one ironic compute service manages many nodes), and neutron needs to *assume* that in any case where it does this query the returned list has no more than a single item (i.e. this neutron feature cannot be mixed with ironic).
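A hedged sketch of how a consumer might handle the result of the proposed (not yet merged) ?service_host=XXX filter: since the API would return a list, a neutron-style consumer has to insist on exactly one entry. The function name and error type are illustrative:

```python
def single_compute_node_rp_uuid(hypervisors):
    """Return the one compute node RP UUID for a service host, or raise
    if the host manages zero or multiple nodes (e.g. ironic)."""
    if len(hypervisors) != 1:
        raise ValueError(
            "expected exactly one compute node for this service host, "
            "got %d" % len(hypervisors))
    # This UUID can be used directly against placement to create the
    # child providers -- no extra provider lookup needed.
    return hypervisors[0]["id"]
```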
Now, that's not backportable, and this problem exists in stable releases (at least those that support QoS bandwidth). So we should totally do it, but we also need...
The Backportable Solution
Neutron should use `gethostname()` rather than CONF.host to discover the compute node resource provider.
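As a minimal sketch of that backportable lookup (illustrative names, not the actual neutron patch): name the compute node RP after the kernel hostname, which is what libvirt reports as hypervisor_hostname, deliberately ignoring the configured service host:

```python
import socket


def compute_node_rp_name(conf_host=None):
    """Name of the compute node resource provider to look up.

    conf_host (the CONF.host value) is intentionally unused: for libvirt
    the RP is named after the canonical hostname, which matches CONF.host
    only by default, not after an operator override.
    """
    return socket.gethostname()
```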
In theory this sounds like a good solution, as it aligns the nova + libvirt behavior with the neutron behavior regarding placement RP naming.
Where the whole thing goes sideways is that neutron does the inventory handling up in the neutron server. Neutron does the compute RP lookup, child RP creation, and inventory reporting based on the information the agents send up to the neutron server. Since neutron makes no distinction between the compute/agent service host and the hypervisor node, the agents send up a single hostname [7][9] based on the CONF.host configuration [8]. So adding a new piece of information to the agent report is most probably an RPC change between the neutron agents and the neutron server.
@Neutron, @Stable: Is such an RPC change backportable in neutron?
I don't consider this a viable permanent solution because it is tightly coupled to knowing that hypervisor_hostname == `gethostname()`, which happens to be true for libvirt, but not necessarily for other drivers. We can get away with it for stable because we happen to know that we're only supporting QoS bandwidth via Placement for libvirt.
Temporary Workaround
If you configure QoS minimum bandwidth in neutron, then do not change the CONF.host config on your computes. I think we have to document that as a limitation here [10].
Upgrade Concerns
Matt and I didn't nail down whether neutron and compute are allowed to be at different versions on a given host, or what those versions are allowed to be. But things should be sane if neutron (or any $service) applies logic like this in >=ussuri:
    if new_nova_microversion_available:
        do_the_os_hypervisors_thing()
    elif using_new_non_libvirt_feature:
        raise YouCantDoThisWithOldNova()
    else:
        do_the_gethostname_thing()
The 'if' part looks OK to me and when we find a backportable solution, that needs to be in the 'else' branch.
Action Summary
If the above sounds reasonable, it would entail the following actions:
- Neutron(/Cyborg?): backportable patch to
s/CONF.host/socket.gethostname()/
- Nova: GET /os-hypervisors*?service_host=X in a new microversion.
I guess that will be me, starting with a spec that outlines the new API.
- Neutron/Cyborg: master-only patch to do the logic described in
`Upgrade Concerns`_ (though for now without the `elif` branch).
Thanks, efried
[1] http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2...
[2] http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2...
[3] https://opendev.org/openstack/nova/src/commit/1cd5563f2dd2b218db2422397c8aab...
[4] https://opendev.org/openstack/nova/src/commit/1cd5563f2dd2b218db2422397c8aab...
[5] https://docs.openstack.org/api-ref/compute/?expanded=list-hypervisors-detail...
@Sean: thanks for finding the bug! @Eric: thanks for the good write up!
Cheers, gibi
[6] https://libvirt.org/docs/libvirt-appdev-guide-python/en-US/html/libvirt_appl...
[7] https://github.com/openstack/neutron/blob/67b613b795416406fb4fab143b3ec9ba86...
[8] https://github.com/openstack/neutron/blob/67b613b795416406fb4fab143b3ec9ba86...
[9] https://github.com/openstack/neutron/blob/67b613b795416406fb4fab143b3ec9ba86...
[10] https://docs.openstack.org/neutron/latest/admin/config-qos-min-bw.html#limit...