[nova][neutron][cyborg] Bandwidth (and accel) providers are broken if CONF.host is set
Balázs Gibizer
balazs.gibizer at est.tech
Fri Nov 22 11:16:58 UTC 2019
On Thu, Nov 21, 2019 at 17:28, Eric Fried <openstack at fried.cc> wrote:
> Hi all. Sean mentioned a "downstream bug" in IRC today [1], which we
> discussed a little bit, but without gibi; and then Matt and I
> discussed
> it more later [2], but without gibi *or* Sean. Since I don't know if
> there's a bug report I can comment on, I wanted to summarize here for
> now so I don't forget.
>
> The Problem
> ===========
> Neutron needs to know the compute node resource provider on which to
> hang the child providers for QoS bandwidth. Today it assumes CONF.host
> is the name of that provider.
>
> That's wrong.
>
> The name of the provider is `hypervisor_hostname`, which for libvirt
> [3]
> happens to match the *default* value of CONF.host [4].
>
> Per the bug Sean describes, if you override CONF.host, neutron won't
> find the compute node provider, and things break.
>
> The problem will be the same for any non-nova service wishing to
> discover the compute node RP -- e.g. cyborg, for purposes of creating
> child providers for accelerators.
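To make the failure mode concrete, here is a minimal sketch of the
name-based lookup that breaks. It assumes `placement` is an authenticated
keystoneauth1 session pointed at the Placement endpoint; the name qparam
itself is the real Placement API:

    # What neutron effectively does today: look up the compute RP by name,
    # keyed on CONF.host.
    conf_host = 'my-custom-host-name'  # an overridden CONF.host
    resp = placement.get('/resource_providers?name=%s' % conf_host)
    providers = resp.json()['resource_providers']
    # The compute node RP is named after hypervisor_hostname, so with an
    # overridden CONF.host this list comes back empty and the QoS bandwidth
    # child providers can never be attached.
    assert providers == []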
As background: I'm pretty sure I made this mistake by looking at how nova
and neutron do the port binding. There nova sends the instance.host
(which comes from CONF.host) to neutron in the binding:host_id field of
the neutron port. The difference between the binding and the inventory
handling is that port.binding:host_id is used by neutron to figure out
which service host runs the nova-compute service as well as the neutron
agent, so there CONF.host is enough. Libvirt also runs on the same
physical host as nova-compute and the neutron agent. BUT while for nova
and neutron the hostname can be overridden by setting CONF.host to an
arbitrary string, no such configuration can be applied to libvirt.
Libvirt will always use Linux's gethostname() call [6].
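To see the libvirt side concretely, a quick check with the python
bindings (a minimal sketch; needs libvirt-python and a local libvirt
daemon):

    import socket
    import libvirt  # libvirt-python bindings

    # The equality this whole thread relies on: libvirt has no config knob
    # for its hostname, it reports what the kernel reports, and that same
    # value is only the *default* of nova's CONF.host.
    conn = libvirt.open('qemu:///system')
    print(conn.getHostname())    # hypervisor_hostname as nova records it
    print(socket.gethostname())  # default CONF.host; matches on a typical host
    conn.close()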
>
> The Right Solution
> ==================
> Neutron (and any $service) should look up the compute node provider by
> its UUID. That's returned by the /os-hypervisors APIs after
> microversion
> 2.53, e.g. [5], but, Catch-22, you can currently only filter those
> results on hypervisor_hostname. So you would have to e.g. GET
> /os-hypervisors/detail and then walk the list looking for service.host
> matching CONF.host. That's way heavy for your CERNs.
>
> So the proposal moving forward is to add (in a new microversion) a
> ?service_host=XXX qparam to those APIs to let you filter down to just
> the one entry for your CONF.host. The UUID of that entry will also be
> the UUID of the compute node resource provider. (At that point you
> don't
> even need to ask Placement for that provider; you can just use that
> UUID
> directly in the APIs that create the child providers. Yay, you got
> your
> extra API call back.)
>
The result of ?service_host=XXX cannot be a single compute node. It needs
to be a list of compute nodes (because of ironic, where one service host
manages many nodes), and neutron has to *assume* that in any case where
it does this query there is no more than a single item in the returned
list (i.e. neutron's bandwidth feature cannot be mixed with ironic).
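For illustration, a sketch of how a non-ironic consumer could use the
proposed filter once it exists. The service_host qparam and the '2.XX'
microversion are placeholders for the not-yet-written API change;
`compute` is assumed to be an authenticated keystoneauth1 session and
conf_host the service's CONF.host value:

    resp = compute.get('/os-hypervisors?service_host=%s' % conf_host,
                       headers={'OpenStack-API-Version': 'compute 2.XX'})
    hypervisors = resp.json()['hypervisors']
    # The result is a list (one ironic service host manages many nodes),
    # so a bandwidth QoS consumer has to insist on exactly one entry.
    assert len(hypervisors) == 1
    # With microversion >= 2.53 the hypervisor id doubles as the compute
    # node RP UUID, so child RPs can be created under it directly, without
    # an extra Placement call.
    compute_rp_uuid = hypervisors[0]['id']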
>
> Now, that's not backportable, and this problem exists in stable
> releases
> (at least those that support QoS bandwidth). So we should totally do
> it,
> but we also need...
>
> The Backportable Solution
> =========================
> Neutron should use `gethostname()` rather than CONF.host to discover
> the
> compute node resource provider.
In theory this sounds like a good solution, as it would align the nova +
libvirt behavior with the neutron behavior regarding placement RP naming.
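The lookup itself would be tiny (a sketch, with the same keystoneauth1
session assumption as above):

    import socket

    # Backportable variant: resolve the compute RP in Placement by the
    # kernel hostname instead of CONF.host. Only valid while
    # hypervisor_hostname == gethostname(), i.e. the libvirt driver.
    resp = placement.get(
        '/resource_providers?name=%s' % socket.gethostname())
    providers = resp.json()['resource_providers']
    compute_rp_uuid = providers[0]['uuid'] if providers else None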
Where the whole thing goes sideways is the fact that neutron does the
inventory handling up in the neutron server. Neutron does the compute RP
lookup, child RP creation, and inventory reporting based on the
information the agents send up to the neutron server. As neutron does not
differentiate between the compute/agent service host and the hypervisor
node, the agents send up a single hostname [7][9] based on the CONF.host
configuration [8]. So adding a new piece of information to the agent
report is most probably an RPC change between the neutron agents and the
neutron server.
@Neutron, @Stable: Is such an RPC change backportable in neutron?
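For illustration only, one hypothetical shape of such a change; the new
key is invented here, not an existing neutron field:

    import socket

    # Sketch of an extended report_state payload. 'configurations' is
    # where the agents already put the resource provider data [7][9]; the
    # extra key would let the server derive RP names from the hypervisor
    # hostname instead of CONF.host. Transporting it is exactly the RPC
    # change the question above is about.
    agent_state = {
        'binary': 'neutron-openvswitch-agent',
        'host': 'my-custom-host-name',  # CONF.host, possibly overridden
        'configurations': {
            'resource_provider_bandwidths': {
                'br-physnet0': {'egress': 10000000, 'ingress': 10000000}},
            'resource_provider_hypervisor': socket.gethostname(),  # new
        },
    }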
>
> I don't consider this a viable permanent solution because it is
> tightly
> coupled to knowing that hypervisor_hostname == `gethostname()`, which
> happens to be true for libvirt, but not necessarily for other drivers.
> We can get away with it for stable because we happen to know that
> we're
> only supporting QoS bandwidth via Placement for libvirt.
Temporary workaround
====================
If you configure QoS minimum bandwidth in neutron, then do not change the
CONF.host config on your computes. I think we have to document that as a
limitation here [10].
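In nova.conf on such a compute that means:

    [DEFAULT]
    # Leave 'host' at its default (socket.gethostname()) on computes that
    # serve QoS minimum bandwidth ports; overriding it breaks the RP lookup.
    #host = my-custom-name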
>
> Upgrade Concerns
> ================
> Matt and I didn't nail down whether neutron and compute are allowed to
> be at different versions on a given host, or what those are allowed to
> be. But things should be sane if neutron (or any $service) logics like
> this in >=ussuri:
>
> if new_nova_microversion_available:
>     do_the_os_hypervisors_thing()
> elif using_new_non_libvirt_feature:
>     raise YouCantDoThisWithOldNova()
> else:
>     do_the_gethostname_thing()
>
The 'if' part looks OK to me, and when we find a backportable solution,
it needs to go into the 'else' branch.
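For the 'if' condition, the advertised max microversion can be read from
the compute API version document. A sketch; the microversion that will
add ?service_host= does not exist yet, so the number below is a made-up
placeholder:

    import requests

    def nova_supports_service_host_filter(compute_endpoint):
        # GET on the versioned root returns e.g.
        # {"version": {"version": "2.79", "min_version": "2.1", ...}}
        doc = requests.get(compute_endpoint).json()
        max_mv = tuple(int(p) for p in doc['version']['version'].split('.'))
        return max_mv >= (2, 99)  # hypothetical future microversion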
> Action Summary
> ==============
> If the above sounds reasonable, it would entail the following actions:
> - Neutron(/Cyborg?): backportable patch to
> s/CONF.host/socket.gethostname()/
> - Nova: GET /os-hypervisors*?service_host=X in a new microversion.
I guess that will be me, starting with a spec that outlines the new API.
> - Neutron/Cyborg: master-only patch to do the logic described in
> `Upgrade Concerns`_ (though for now without the `elif` branch).
>
> Thanks,
> efried
>
> [1]
> http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-11-21.log.html#t2019-11-21T17:59:05
> [2]
> http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-11-21.log.html#t2019-11-21T21:53:29
> [3]
> https://opendev.org/openstack/nova/src/commit/1cd5563f2dd2b218db2422397c8aab394d484626/nova/virt/libvirt/host.py#L955
> [4]
> https://opendev.org/openstack/nova/src/commit/1cd5563f2dd2b218db2422397c8aab394d484626/nova/conf/netconf.py#L56
> [5]
> https://docs.openstack.org/api-ref/compute/?expanded=list-hypervisors-details-detail#id309
>
@Sean: thanks for finding the bug!
@Eric: thanks for the good write up!
Cheers,
gibi
[6]
https://libvirt.org/docs/libvirt-appdev-guide-python/en-US/html/libvirt_application_development_guide_using_python-Connections-Host_Info.html
[7]
https://github.com/openstack/neutron/blob/67b613b795416406fb4fab143b3ec9ba8657711f/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L295-L324
[8]
https://github.com/openstack/neutron/blob/67b613b795416406fb4fab143b3ec9ba8657711f/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L171
[9]
https://github.com/openstack/neutron/blob/67b613b795416406fb4fab143b3ec9ba8657711f/neutron/plugins/ml2/drivers/mech_sriov/agent/sriov_nic_agent.py#L161-L175
[10]
https://docs.openstack.org/neutron/latest/admin/config-qos-min-bw.html#limitations