After some digging, I believe we’re experiencing either this bug, or one much like it: https://bugs.launchpad.net/nova/+bug/1542491

With debug logging, I see inconsistent values for the contents of “aggregate.hosts” in the host_manager’s “_update_aggregate” method. https://github.com/openstack/nova/blob/ac1c6a8c7d85502babdf617bc49b5bb2301c2...

The “causative change” we had made that triggers this issue was modifying blazar to add/remove hosts from aggregates in parallel. I can reliably reproduce the issue when moving as few as 3 hosts between aggregates.

Any thoughts? Thanks!
-Mike Sherman

Adding debug logs verified

On 4/16/24, 8:53 AM, "Michael Sherman" <shermanm@uchicago.edu> wrote:

Hey, thank you both for all the info. Pierre is absolutely correct, we are running a fork and I should have stated that up front. I’m still getting up to speed on the inner workings of these components. Host aggregate support for ironic nodes would be of interest to us, and (if I understand correctly) would allow us to retire our fork of nova.

In the meantime, I’ve tested a workaround on our fork by implementing blazar-nova’s aggregate metadata checks as a nova prefilter, rather than in the existing filter plugin. I can confirm that the incorrect aggregate information was only present on the “host_state” objects, while both the nova and placement DBs have correct and current information.

My next steps are to add more debugging to the methods updating the host_state, but I’d also be interested to discuss the merits of scheduler prefilters vs regular filters.

Thank you again!
Mike Sherman

On 4/16/24, 5:23 AM, "smooney@redhat.com" <smooney@redhat.com> wrote:

On Mon, 2024-04-15 at 23:07 +0200, Pierre Riteau wrote:
On Mon, 15 Apr 2024 at 19:09, <smooney@redhat.com> wrote:
On Mon, 2024-04-15 at 14:53 +0000, Michael Sherman wrote:

As per the ironic docs for configuring nova, we’ve had that flag disabled. Right now we’re running with a single ironic-conductor, so I don’t think the hash-ring behavior should affect things?

The hash ring behavior is not related to ironic conductors; it’s related to the peer_list option in the nova compute service config.
Originally the ironic driver only supported a single nova-compute for the entire ironic deployment; at that time it was possible to map all your ironic nodes, via that single compute service, to a host aggregate/AZ. In Newton, https://specs.openstack.org/openstack/nova-specs/specs/newton/implemented/ir... added support for running multiple nova-compute agents with the ironic virt driver. That introduced the concept of a hash ring that balanced compute nodes between compute services at runtime, based on the up state of the compute services. With that change it became impossible to reliably manage ironic compute nodes with host aggregates, as the driver would unconditionally balance across all ironic compute services.
With https://specs.openstack.org/openstack/nova-specs/specs/stein/implemented/iro... the ironic driver was enhanced with awareness of ironic conductor groups. This introduced a peer_list option and a partition_key. In principle it was possible to create a host aggregate that mapped to a conductor group by including all hosts listed in the peer_list in the host aggregate,
i.e. if you had partition_key=conductor_group_1 and peer_list=ironic-1,ironic-2, and you created a host aggregate containing ironic-1 and ironic-2, that can work.
However, it’s not tested or supported in general: without carefully configuring the ironic driver, the hash ring can violate those constraints and move compute nodes between aggregates by balancing them onto compute services not listed in the host aggregate.
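To make that concrete, a minimal sketch of the Stein-era setup using the example names above (all names are placeholders, and per the caveat just mentioned this is not a tested or supported configuration):

    # nova.conf on the nova-compute services ironic-1 and ironic-2
    [ironic]
    partition_key = conductor_group_1
    peer_list = ironic-1,ironic-2

    # matching host aggregate containing exactly the peer_list members
    openstack aggregate create conductor-group-1
    openstack aggregate add host conductor-group-1 ironic-1
    openstack aggregate add host conductor-group-1 ironic-2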
In Antelope/Bobcat we deprecated the hash ring mechanism and introduced a new HA model and a new ironic sharding mechanism. This was finally implemented in Caracal (2024.1):
https://specs.openstack.org/openstack/nova-specs/specs/2024.1/implemented/ir...
With this deployment topology it is guaranteed that the ironic driver will not rebalance compute nodes between compute services, which means you can now statically map compute services (and the ironic shard each manages) to a host aggregate again, without worrying that the ironic driver will violate the aggregate expectations.
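As a rough sketch of what the Caracal model looks like (placeholder names; the spec above has the authoritative details):

    # tag ironic nodes with a shard key (assumes an ironic API and
    # client recent enough to support the shard field)
    openstack baremetal node set node-1 --shard rack1
    openstack baremetal node set node-2 --shard rack1

    # nova.conf on the single nova-compute service that should own
    # that shard
    [ironic]
    shard = rack1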
I’m very surprised to hear that nova aggregates are not supported with ironic; it doesn’t seem to be indicated anywhere that I could find in the docs? We’ve been using this configuration (Ironic + aggregates) since Rocky, and the Blazar project’s support for ironic depends on host aggregates.
We did have this documented at one point, but I agree it’s not something that is widely known. What makes matters worse is that it almost works in some cases.
What people often don’t realise is that the host aggregate API https://docs.openstack.org/api-ref/compute/#host-aggregates-os-aggregates is written in terms of compute services, not compute nodes. So when trying to use it with ironic, they expect to be able to add individual ironic nodes to a host aggregate https://docs.openstack.org/api-ref/compute/#add-host but they can only add the compute services.
That means you can’t use the tenant isolation filter to isolate a subset of ironic nodes for a given tenant. I’m not sure how blazar was trying to use aggregates with ironic, but I suspect their integration was incomplete, if not fundamentally broken, given the limitations of how aggregates function when used with ironic.
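To make the compute-service point concrete, a sketch of what the aggregate API actually accepts (names and the project ID are placeholders):

    # the host added here must be a compute service host name; an
    # ironic node UUID is not accepted
    openstack aggregate create reserved
    openstack aggregate add host reserved ironic-compute-1

    # tenant isolation metadata applies to the whole aggregate, i.e.
    # to every ironic node managed by that compute service at once
    openstack aggregate set --property filter_tenant_id=<project-id> reserved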
If blazar is using placement aggregates rather than nova host aggregates, that might change things, but there is no nova API to request that an instance be created in a placement aggregate. Each ironic node has its own resource provider in placement, and placement aggregates work at the resource provider level; that means you can create aggregates of ironic nodes in placement.
While that is nice, since you can’t use that aggregate in a nova API request, it is really only useful if blazar is going directly to ironic.
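For illustration, a sketch of grouping ironic node resource providers with a placement aggregate (assumes the osc-placement plugin is installed; UUIDs and the generation are placeholders):

    # each ironic node is its own resource provider, so a placement
    # aggregate can group individual nodes
    openstack resource provider aggregate set \
        --aggregate <aggregate-uuid> \
        --generation <provider-generation> \
        <ironic-node-rp-uuid>

    # show the aggregates a node's resource provider belongs to
    openstack resource provider aggregate list <ironic-node-rp-uuid>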
Upstream Blazar is known to be incompatible with Ironic [1]; we don't claim that it would work. There is work in progress to extend the instance reservation plugin, which I hope will eventually support reserving bare metal nodes without involving Nova host aggregates.
The reason that Chameleon has been able to use Nova host aggregates with Ironic is that they are using a fork of Nova with changes such as this one proposed by Jay many years ago: https://review.opendev.org/c/openstack/nova/+/526753
Ah OK, that makes sense. There have been other efforts to enable similar functionality a year or two ago. If changing the host aggregate API to allow mapping compute nodes to aggregates is really something that is desired, it is something we could bring back up for discussion within the nova team. It would be a new feature requiring a new API microversion; it is not a bug. As a result this would not be backportable, but we could discuss if this is a change we want to do. Previously we have said no, but if this is an operator pain point and there is a desire for this to be changed, we should at least consider what that would look like and assess it again.
As for what could have triggered this change of behaviour, do you know if it started happening on both sites around the same time? Can it be correlated with a software change?
Another thought I had was whether a growing number of records in the database could have resulted in some tasks taking longer to complete and reaching a tipping point, with side effects on the scheduler. It's a bit far-fetched though.
[1] https://blueprints.launchpad.net/blazar/+spec/ironic-compatibility