[ironic][nova-scheduler] "host_state" in nova filters has stale aggregate information
Hi all,

We’re running into an issue where, for two sites with 150-250 ironic nodes on a single conductor and nova-compute instance, we’ve started to get “no hosts available” errors from the nova scheduler.

We’re using the blazar-nova filter to match on hosts in specifically tagged aggregates. After adding some debug logs, I found that the “host_state” object passed to the filter seems to have out-of-date aggregate information. Specifically, if I query the system with “openstack aggregate show …” or “openstack allocation candidate list”, I see the correct aggregate for the nodes in question, but the contents of “host_state” reflect a previous state.

This “staleness” does not seem to correct itself over time, but is resolved by restarting the nova-scheduler process (actually restarting the kolla docker container, but the same effect). However the issues return over the course of a couple hours.

We haven’t increased the number of nodes, or otherwise changed the hardware, so I’m not sure what could have triggered this issue.

Any advice on further debugging steps would be greatly appreciated. Thank you!

--
Michael Sherman
Infrastructure Lead – Chameleon
Computer Science, University of Chicago
MCS, Argonne National Lab
On Sat, 2024-04-13 at 12:52 +0000, Michael Sherman wrote:
Hi all,
We’re running into an issue, where for two sites with 150-250 ironic nodes on a single conductor and nova-compute instance, we’ve started to get “no hosts available” errors from nova scheduler.
We’re using the blazar-nova filter to match on hosts in specifically tagged aggregates. After adding some debug logs, I found that the “host_state” object passed to the filter seems to have out-of-date aggregate information.

So the first thing to be aware of is that host aggregates are not supported with the ironic virt driver. Until the Caracal release, the ironic virt driver used a hash ring to balance compute nodes between compute services, which among other things broke host aggregates. From a nova project perspective, using host aggregates with ironic compute services is unsupported.
In Caracal they might now work when ironic sharding is used. Host aggregates are used to map compute services, not compute nodes, to an aggregate, so when using shards you can map a given shard to a host aggregate by mapping the compute service for that shard to an aggregate.
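For illustration only, a minimal sketch of what that could look like on a Caracal nova-compute service, assuming the [ironic] shard option introduced by the sharding work (the shard name here is made up):

```
[ironic]
# This nova-compute service only manages the ironic nodes assigned to this shard.
shard = shard-1
```

The host you would then add to the aggregate is that compute service itself (e.g. openstack aggregate add host <aggregate> <compute-service-host>), since the aggregate API operates on compute services.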
Specifically, if I query the system with “openstack aggregate show …” or “openstack allocation candidate list”, I see the correct aggregate for the nodes in question, but the contents of “host_state” reflect a previous state.
This “staleness” does not seem to correct itself over time, but is resolved by restarting the nova-scheduler process (actually restarting the kolla docker container, but the same effect). However the issues return over the course of a couple hours.
This is likely caused by a combination of the hash ring and the host cache. Again, your current topology is unsupported, as we do not officially support using host aggregates with ironic nodes. With that said, you could try disabling the caching by setting https://docs.openstack.org/nova/latest/configuration/config.html#filter_sche... on the scheduler and all compute services. That may or may not work depending on the cause, but my guess is that the compute nodes that are "stale" have been rebalanced by the hash ring. The other way to work around this might be to ensure you do not use peer_list and have exactly one compute service per conductor group. Again, I'm not sure that will work, because you're trying to use a feature (host aggregates) that is not supported by the ironic virt driver, but it might mitigate the incompatibility in older releases.
As per the ironic docs for configuring nova, we’ve had that flag disabled. Right now we’re running with a single ironic-conductor, so I don’t think the hash-ring behavior should affect things?

I’m very surprised to hear that nova aggregates are not supported with ironic; it doesn’t seem to be indicated anywhere that I could find in the docs. We’ve been using this configuration (Ironic + aggregates) since Rocky, and the Blazar project’s support for ironic depends on host aggregates.

Best,
-Mike Sherman
On Mon, 2024-04-15 at 14:53 +0000, Michael Sherman wrote:

As per the ironic docs for configuring nova, we’ve had that flag disabled. Right now we’re running with a single ironic-conductor, so I don’t think the hash-ring behavior should affect things?

The hash ring behavior is not related to ironic conductors; it's related to the peer_list option in the nova-compute service config.

Originally the ironic virt driver only supported a single nova-compute service for the entire ironic deployment, and at that time it was possible to map all your ironic nodes, via that single compute service, to a host aggregate/AZ. In Newton, support for running multiple nova-compute agents with the ironic virt driver was added (https://specs.openstack.org/openstack/nova-specs/specs/newton/implemented/ir...). That introduced the concept of a hash ring that balanced compute nodes between compute services at runtime, based on the up state of the compute services. With that change it became impossible to reliably manage ironic compute nodes with host aggregates, as the driver would unconditionally balance across all ironic compute services.

With https://specs.openstack.org/openstack/nova-specs/specs/stein/implemented/iro... the ironic driver was enhanced with awareness of ironic conductor groups. This introduced a peer_list option and a partition_key. In principle it was possible to create a host aggregate that mapped to a conductor group by including all hosts listed in the peer_list in the host aggregate, i.e. if you had partition_key=conductor_group_1 and peer_list=ironic-1,ironic-2 and you created a host aggregate with ironic-1 and ironic-2, that could work. However it is not tested or supported in general, as without carefully configuring the ironic driver the hash ring can violate the constraints and move compute nodes between aggregates by balancing them to compute services not listed in the host aggregate.

In Antelope/Bobcat we deprecated the hash ring mechanism and introduced a new HA model and a new ironic sharding mechanism; this was finally implemented in Caracal (2024.1): https://specs.openstack.org/openstack/nova-specs/specs/2024.1/implemented/ir... With this deployment topology it is now guaranteed that the ironic driver will not rebalance compute nodes between compute services, which means you can statically map compute services (and the ironic shard each manages) to a host aggregate again, without worrying that the ironic driver will violate the aggregate expectations.
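As a concrete sketch of the Stein-era setup described above, each of the cooperating nova-compute services would carry something like the following in nova.conf (the group and host names are taken from the example above; placing partition_key and peer_list in the [ironic] section is my assumption about where those options live):

```
[ironic]
# Only manage ironic nodes in this conductor group.
partition_key = conductor_group_1
# Compute services that cooperate to manage this group; the hash ring
# only balances this group's nodes across the services listed here.
peer_list = ironic-1,ironic-2
```

The matching host aggregate would then contain exactly the ironic-1 and ironic-2 compute services, with the caveat above that this is untested and the hash ring can still violate it if misconfigured.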
I’m very surprised to hear that nova aggregates are not supported with ironic, it doesn’t seem to be indicated anywhere that I could find in the docs? We’ve been using this configuration (Ironic + aggregates) since Rocky, and the Blazar project’s support for ironic depends on host aggregates.
We did have this documented at one point, but I agree it's not something that is widely known, and what makes matters worse is that it almost works in some cases. What people often don't realise is that the host aggregate API (https://docs.openstack.org/api-ref/compute/#host-aggregates-os-aggregates) is written in terms of compute services, not compute nodes. So when trying to use it with ironic, people expect to be able to add individual ironic servers to a host aggregate (https://docs.openstack.org/api-ref/compute/#add-host), but they can only add the compute services. That means you can't use the tenant isolation filter to isolate a subset of ironic nodes for a given tenant. I'm not sure how blazar was trying to use aggregates with ironic, but I suspect their integration was incomplete, if not fundamentally broken, by the limitations of how aggregates function when used with ironic.

If blazar is using placement aggregates rather than nova host aggregates, that might change things, but there is no nova API to request that an instance be created in a placement aggregate. Each ironic node has its own resource provider in placement, and placement aggregates work at the resource provider level, which means you can create aggregates of ironic nodes in placement. While that is nice, since you can't use that aggregate in a nova API request it is really only useful if blazar is going directly to ironic.
On Mon, 15 Apr 2024 at 19:09, <smooney@redhat.com> wrote:
That means you can't use the tenant isolation filter to isolate a subset of ironic nodes for a given tenant. I'm not sure how blazar was trying to use aggregates with ironic, but I suspect their integration was incomplete, if not fundamentally broken, by the limitations of how aggregates function when used with ironic.
Upstream Blazar is known to be incompatible with Ironic [1], we don't claim that it would work. There is work in progress to extend the instance reservation plugin, which I hope will eventually support reserving bare metal nodes without involving Nova host aggregates.

The reason that Chameleon has been able to use Nova host aggregates with Ironic is that they are using a fork of Nova with changes such as this one proposed by Jay many years ago: https://review.opendev.org/c/openstack/nova/+/526753

As for what could have triggered this change of behaviour, do you know if it started happening on both sites around the same time? Can it be correlated with a software change?

Another thought I had was whether a growing number of records in the database could have resulted in some tasks taking longer to complete and reaching a tipping point, with side effects on the scheduler. It's a bit far-fetched though.

[1] https://blueprints.launchpad.net/blazar/+spec/ironic-compatibility
On Mon, 2024-04-15 at 23:07 +0200, Pierre Riteau wrote:
Upstream Blazar is known to be incompatible with Ironic [1], we don't claim that it would work. There is work in progress to extend the instance reservation plugin, which I hope will eventually support reserving bare metal nodes without involving Nova host aggregates.
The reason that Chameleon has been able to use Nova host aggregates with Ironic is that they are using a fork of Nova with changes such as this one proposed by Jay many years ago: https://review.opendev.org/c/openstack/nova/+/526753
Ah ok, that makes sense; there have been other efforts to enable similar functionality a year or two ago. If changing the host aggregate API to allow mapping compute nodes to aggregates is really something that is desired, it is something we could bring back up for discussion within the nova team. It would be a new feature requiring a new API microversion, not a bug fix, so it would not be backportable, but we could discuss whether this is a change we want to make. Previously we have said no, but if this is an operator pain point and there is a desire for it to change, we should at least consider what that would look like and assess it again.
Hey, thank you both for all the info. Pierre is absolutely correct, we are running a fork and I should have stated that up front. I’m still getting up to speed on the inner workings of these components.

Host aggregate support for ironic nodes would be of interest to us, and (if I understand correctly) would allow us to retire our fork of nova.

In the meantime, I’ve tested a workaround on our fork by implementing blazar-nova’s aggregate metadata checks as a nova prefilter, rather than in the existing filter plugin. I can confirm that the incorrect aggregate information was only present on the “host_state” objects, while both the nova and placement DBs have correct and current information.

My next steps are to add more debugging to the methods updating the host_state, but I’d also be interested to discuss the merits of scheduler prefilters vs regular filters.

Thank you again!
Mike Sherman
After some digging, I believe we’re experiencing either this bug, or one much like it: https://bugs.launchpad.net/nova/+bug/1542491

With debug logging, I see inconsistent values for the contents of “aggregate.hosts” in the host_manager’s “_update_aggregate” method: https://github.com/openstack/nova/blob/ac1c6a8c7d85502babdf617bc49b5bb2301c2...

The “causative change” we had made that triggers this issue was modifying blazar to add/remove hosts from aggregates in parallel. I can reliably reproduce the issue when moving as few as 3 hosts between aggregates.

Any thoughts?

Thanks!
-Mike Sherman
On Tue, 2024-04-23 at 20:17 +0000, Michael Sherman wrote:
After some digging, I believe we’re experiencing either this bug, or one much like it. https://bugs.launchpad.net/nova/+bug/1542491
With debug logging, I see inconsistent values for the contents of “aggregate.hosts” in the host_manager’s “_update_aggregate” method. https://github.com/openstack/nova/blob/ac1c6a8c7d85502babdf617bc49b5bb2301c2...
The “causative change” we had made that triggers this issue was modifying blazar to add/remove hosts from aggregates in parallel. I can reliably reproduce the issue when moving as few as 3 hosts between aggregates.
Any thoughts?
I would generally say that it's not a race, although it is undefined behavior. Nova is a distributed system which allows many operations to happen in parallel; the race condition in this case is in the client code. Modifying the aggregate membership while there are concurrent scheduling operations that depend on the aggregate membership has no ordering guarantees. This is further compounded by the fact that you are using a fork with a backend that is not supported, i.e. upstream nova does not support mapping hosts to aggregates, only compute services, and we don't support this with ironic. So in your environment you need to serialise these API calls to force the ordering you want. I'm not sure what the correct approach to that is, but there should be no expectation that the scheduler will see consistent results while concurrent changes to the aggregates are being processed in parallel with scheduling requests.
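As a purely illustrative sketch of that client-side serialisation (add_host_to_aggregate here is a stand-in for whatever call blazar actually makes, not a real blazar or nova function), one option is a per-aggregate lock so membership changes for the same aggregate are issued one at a time:

```python
import threading

# One lock per aggregate, so membership changes for the same aggregate are
# issued one at a time instead of racing each other.
_aggregate_locks = {}
_locks_guard = threading.Lock()


def _lock_for(aggregate_id):
    # Create the lock for this aggregate on first use.
    with _locks_guard:
        return _aggregate_locks.setdefault(aggregate_id, threading.Lock())


def add_hosts_serially(add_host_to_aggregate, aggregate_id, hosts):
    """Call the (hypothetical) add-host function for each host, in order."""
    with _lock_for(aggregate_id):
        for host in hosts:
            add_host_to_aggregate(aggregate_id, host)
```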
“I would generally say that it's not a race, although it is undefined behavior.”

Is just calling the “add host to aggregate” API (https://docs.openstack.org/api-ref/compute/#add-host) concurrently undefined behavior?

My sequence of operations is:

1. Add N hosts to an aggregate concurrently
2. Wait a while (minutes to hours), and verify that “aggregate show” lists the correct hosts in the aggregate
3. Then attempt to schedule N compute instances, with a filter that checks host membership in the aggregate
4. Observe scheduling failures

To rule out our code as the issue, I was able to reproduce the behavior using devstack on master, using the nova_fake driver with 10 fake compute services and the aggregate_instance_extra_specs filter instead of ironic and the blazar-nova filter.

So long as the N “add_host_to_aggregate” calls to nova_api are made in parallel, there’s a decent probability that the host_state aggregate info passed to the filters will not agree with the values in the DB. This doesn’t depend on launching instances quickly after making the changes; the inconsistency does not seem to ever resolve until nova-scheduler is restarted.

-Mike
On Wed, 2024-04-24 at 21:07 +0000, Michael Sherman wrote:

“I would generally say that it's not a race, although it is undefined behavior.”

Is just calling the “add host to aggregate” API (https://docs.openstack.org/api-ref/compute/#add-host) concurrently undefined behavior?

The effect that has on instances that are being scheduled is undefined: they may or may not see the host in the aggregate, and we make no API guarantee about what will happen. So you cannot depend on the update being seen by the scheduler immediately. For any requests that were made before the add-host call completed, we cannot guarantee that they will see the new host, but we also cannot guarantee that they won't. This is because a request might be sitting in the rabbit queue for a non-deterministic period of time, so we do not know whether the scheduler will see the old value or the new one when it processes that request.
My sequence of operations is:
1. Add N hosts to an aggregate concurrently
2. Wait a while (minutes to hours), and verify that “aggregate show” lists the correct hosts in the aggregate
3. Then attempt to schedule N compute instances, with a filter that checks host membership in the aggregate
4. Observe scheduling failures
There is some level of caching in the scheduler. For example, if you add new hosts to a cloud and map them to cells, you need to restart the scheduler to clear the cell cache. You may be hitting a similar caching issue, but I don't think we cache the aggregate membership in the same way.
To rule out our code as the issue, I was able to reproduce the behavior using devstack on master, using the nova_fake driver with 10 fake compute services and the aggregate_instance_extra_specs filter instead of ironic and the blazar-nova filter.
So long as the N “add_host_to_aggregate” calls to nova_api are made in parallel, there’s a decent probability that the host_state aggregate info passed to the filters will not agree with the values in the DB.
You might want to set https://docs.openstack.org/nova/latest/configuration/config.html#filter_sche... to false:

```
track_instance_changes

Type: boolean
Default: True

Enable querying of individual hosts for instance information.

The scheduler may need information about the instances on a host in order to evaluate its filters and weighers. The most common need for this information is for the (anti-)affinity filters, which need to choose a host based on the instances already running on a host.

If the configured filters and weighers do not need this information, disabling this option will improve performance. It may also be disabled when the tracking overhead proves too heavy, although this will cause classes requiring host usage data to query the database on each request instead.
```

I did not think that affected the aggregate membership, but it might be worth testing.
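For reference, based on the option documentation quoted above, that would be the following in nova.conf on the scheduler (and compute) hosts:

```
[filter_scheduler]
track_instance_changes = False
```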
This doesn’t depend on launching instances quickly after making the changes, the inconsistency does not seem to ever resolve until nova-scheduler is restarted.
Right, so that sounds like you're hitting a caching issue. We have a fanout RPC that updates all running schedulers with the updated aggregate info (https://github.com/openstack/nova/blob/ca1db54f1bc498528ac3c8601157cb32e5174...), which is called on aggregate create (https://github.com/openstack/nova/blob/ca1db54f1bc498528ac3c8601157cb32e5174...), on aggregate update (https://github.com/openstack/nova/blob/ca1db54f1bc498528ac3c8601157cb32e5174...), and when we add or remove hosts from an aggregate (https://github.com/openstack/nova/blob/ca1db54f1bc498528ac3c8601157cb32e5174..., https://github.com/openstack/nova/blob/ca1db54f1bc498528ac3c8601157cb32e5174...). That updates the cached aggregate associations in the HostManager by calling https://github.com/openstack/nova/blob/ca1db54f1bc498528ac3c8601157cb32e5174..., which calls https://github.com/openstack/nova/blob/ca1db54f1bc498528ac3c8601157cb32e5174...

So that should not require a restart, based on what I'm seeing in the code; the cache in the scheduler will get updated once the scheduler has had time to process that RPC call.
Right, and I see that chain of calls happen. Specifically, what I observe, and can confirm in devstack on the master branch with no modified code, is:

In the last section you linked (https://github.com/openstack/nova/blob/ca1db54f1bc498528ac3c8601157cb32e5174...), the hosts listed in aggregate.hosts in the _update_aggregate method are not consistent, and depending on the order in which the RPCs are processed, the host state and contents of “host_aggregates_map“ may still be incorrect after all RPCs have been resolved.

-Mike
On Thu, 2024-04-25 at 13:02 +0000, Michael Sherman wrote:
Right, and I see that chain of calls happen. Specifically, what I observe, and can confirm in devstack on the master branch with no modified code, is: in the last section you linked (https://github.com/openstack/nova/blob/ca1db54f1bc498528ac3c8601157cb32e5174...), the hosts listed in aggregate.hosts in the _update_aggregate method are not consistent, and depending on the order in which the RPCs are processed, the host state and contents of “host_aggregates_map“ may still be incorrect after all RPCs have been resolved.
I guess the problem might be that if an older update is processed after a newer update, it could leave the cache out of sync. If your aggregate additions are being handled by different API requests, the content in the DB will be synchronised, as we use transactions at the DB level and lock as required. But with the fanout we are passing a list of aggregate objects; we are not doing a DB lookup in the scheduler, so the order of the RPC calls matters but is not enforced. I.e. we are not passing a generation number so that the scheduler can discard any update with an older value; from its perspective, if updates arrive or are processed out of order, it would look like a host was removed.

Can you file a bug for this? I'm not sure whether the better approach is to have the scheduler hit the DB and get the current membership, or whether we need a generation number, or some other healing mechanism like a periodic task or a lifetime for the aggregate cache. There is a bug here, however, if the order is not maintained, so we should do something to address it; let's start with a bug.
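To illustrate the failure mode and the generation-number idea outside of nova (a standalone toy sketch, not nova code): each fanout message carries a full host list for an aggregate, so if the consumer simply overwrites its cache, a delayed older message erases a newer addition, whereas discarding updates with an older generation does not.

```python
import threading


class AggregateCache:
    """Toy cache that applies full-state aggregate updates (not nova code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._hosts = {}        # aggregate name -> set of host names
        self._generation = {}   # aggregate name -> last applied generation

    def apply_update(self, aggregate, hosts, generation=None):
        """Apply a full-state update; optionally guarded by a generation number."""
        with self._lock:
            if generation is not None and generation <= self._generation.get(aggregate, -1):
                return  # stale update that arrived late, ignore it
            self._hosts[aggregate] = set(hosts)
            if generation is not None:
                self._generation[aggregate] = generation

    def hosts(self, aggregate):
        with self._lock:
            return set(self._hosts.get(aggregate, ()))


cache = AggregateCache()

# Two concurrent "add host" API calls each fan out a full snapshot.
newer = {"host-1", "host-2"}   # snapshot taken after both hosts were added
older = {"host-1"}             # snapshot taken after only host-1 was added

# Processed out of order with no generation: the cache ends up missing
# host-2 and nothing ever corrects it.
cache.apply_update("reserved", newer)
cache.apply_update("reserved", older)
print(cache.hosts("reserved"))  # {'host-1'} -> stale

# With generation numbers, the late, older snapshot is discarded.
cache.apply_update("reserved", newer, generation=2)
cache.apply_update("reserved", older, generation=1)
print(cache.hosts("reserved"))  # {'host-1', 'host-2'}
```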
I’ve added my findings and steps to reproduce to this existing bug, as I think it’s the same one: https://bugs.launchpad.net/nova/+bug/1542491

-Mike