[nova][neutron][cyborg] Bandwidth (and accel) providers are broken if CONF.host is set

Bal√°zs Gibizer balazs.gibizer at est.tech
Fri Nov 22 11:16:58 UTC 2019

On Thu, Nov 21, 2019 at 17:28, Eric Fried <openstack at fried.cc> wrote:
> Hi all. Sean mentioned a "downstream bug" in IRC today [1], which we
> discussed a little bit, but without gibi; and then Matt and I 
> discussed
> it more later [2], but without gibi *or* Sean. Since I don't know if
> there's a bug report I can comment on, I wanted to summarize here for
> now so I don't forget.
> The Problem
> ===========
> Neutron needs to know the compute node resource provider on which to
> hang the child providers for QoS bandwidth. Today it assumes CONF.host
> is the name of that provider.
> That's wrong.
> The name of the provider is `hypervisor_hostname`, which for libvirt 
> [3]
> happens to match the *default* value of CONF.host [4].
> Per the bug Sean describes, if you override CONF.host, neutron won't
> find the compute node provider, and things break.
> The problem will be the same for any non-nova wishing to discover the
> compute node RP -- e.g. cyborg for purposes of creating child 
> providers
> for accelerators.

As a background I'm pretty sure I made this mistake due to looking at 
how nova and neutron does the port binding. There nova sends the 
instance.host (which is coming from the CONF.host) to neutron in the 
binding:host_id field of the neutron port. The difference between the 
binding and the inventory handling is that port.binding:host_id is used 
by neutron to figure out which service host will run the nova-compute 
service as well as the neutron agent. So there the CONF.host is enough. 
Libvirt also runs on the same physical host as nova-compute and the 
neutron agent. BUT while for nova and neutron the hostname can be 
overwritten by setting the CONF.host to an arbitrary string, for 
libvirt such configuration cannot be applied. Libvirt will always use 
the linux's gethostname call [6].

> The Right Solution
> ==================
> Neutron (and any $service) should look up the compute node provider by
> its UUID. That's returned by the /os-hypervisors APIs after 
> microversion
> 2.53, e.g. [5], but, Catch-22, you can currently only filter those
> results on hypervisor_hostname. So you would have to e.g. GET
> /os-hypervisors/detail and then walk the list looking for service.host
> matching CONF.host. That's way heavy for your CERNs.
> So the proposal moving forward is to add (in a new microversion) a
> ?service_host=XXX qparam to those APIs to let you filter down to just
> the one entry for your CONF.host. The UUID of that entry will also be
> the UUID of the compute node resource provider. (At that point you 
> don't
> even need to ask Placement for that provider; you can just use that 
> directly in the APIs that create the child providers. Yay, you got 
> your
> extra API call back.)

The result of ?service_host=XXX cannot be a single compute node. It 
needs to be a list of compute nodes (due to ironic) and neutron needs 
to *assume* that in any case where neutron does this query there is no 
more than a single item in the returned list (e.g. neutron cannot mix 
with ironic).

> Now, that's not backportable, and this problem exists in stable 
> releases
> (at least those that support QoS bandwidth). So we should totally do 
> it,
> but we also need...
> The Backportable Solution
> =========================
> Neutron should use `gethostname()` rather than CONF.host to discover 
> the
> compute node resource provider.

In theory this sounds a good solution as it will align the nova + 
libvirt behavior with the neutron behavior regarding placement RP 

Where the whole thing goes sideways is the fact that neutron does the 
inventory handling up in the neutron server. Neutron does the compute 
RP lookup, child RP creation and inventory reporting based on the 
information the agents are sending up to the neutron server. As in 
neutron there is no differentiation between compute/agent service host 
and hypervisor node the agents sends up a single hostname[7][9] based 
on the CONF.host configuration[8]. So adding a new piece of information 
to the agent report is most probably an RPC change between neutron 
agents and the neutron server.

@Neutron, @Stable: Does such RPC change is backportable in neutron?

> I don't consider this a viable permanent solution because it is 
> tightly
> coupled to knowing that hypervisor_hostname == `gethostname()`, which
> happens to be true for libvirt, but not necessarily for other drivers.
> We can get away with it for stable because we happen to know that 
> we're
> only supporting QoS bandwidth via Placement for libvirt.

Temporary workaround

If you configure bandwidth in neutron then do not change the CONF.host 
config in your computes. I think we have to document that as a 
limitation here [10].

> Upgrade Concerns
> ================
> Matt and I didn't nail down whether neutron and compute are allowed to
> be at different versions on a given host, or what those are allowed to
> be. But things should be sane if neutron (or any $service) logics like
> this in >=ussuri:
> if new_nova_microversion_available:
>     do_the_os_hypervisors_thing()
> elif using_new_non_libvirt_feature:
>     raise YouCantDoThisWithOldNova()
> else:
>     do_the_gethostname_thing()

The 'if' part looks OK to me and when we find a backportable solution, 
that needs to be in the 'else' branch.

> Action Summary
> ==============
> If the above sounds reasonable, it would entail the following actions:
> - Neutron(/Cyborg?): backportable patch to 
> s/CONF.host/socket.gethostname()/
> - Nova: GET /os-hypervisors*?service_host=X in a new microversion.

I guess that will be me, starting with a spec that outlines the new API.

> - Neutron/Cyborg: master-only patch to do the logic described in
> `Upgrade Concerns`_ (though for now without the `elif` branch).
> Thanks,
> efried
> [1]
> http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-11-21.log.html#t2019-11-21T17:59:05
> [2]
> http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-11-21.log.html#t2019-11-21T21:53:29
> [3]
> https://opendev.org/openstack/nova/src/commit/1cd5563f2dd2b218db2422397c8aab394d484626/nova/virt/libvirt/host.py#L955
> [4]
> https://opendev.org/openstack/nova/src/commit/1cd5563f2dd2b218db2422397c8aab394d484626/nova/conf/netconf.py#L56
> [5]
> https://docs.openstack.org/api-ref/compute/?expanded=list-hypervisors-details-detail#id309

@Sean: thanks for finding the bug!
@Eric: thanks for the good write up!



More information about the openstack-discuss mailing list