I’ve added my findings and steps to reproduce to this existing bug, as I think that it’s the same one.

https://bugs.launchpad.net/nova/+bug/1542491

-Mike

On 4/25/24, 8:18 AM, "smooney@redhat.com" <smooney@redhat.com> wrote:

On Thu, 2024-04-25 at 13:02 +0000, Michael Sherman wrote:
Right, and I see that chain of calls happen.

Specifically, what I observe, and can confirm in devstack on the master branch with no modified code, is:

In the last section you linked: https://github.com/openstack/nova/blob/ca1db54f1bc498528ac3c8601157cb32e5174...

The hosts listed in aggregate.hosts in the _update_aggregate method are not consistent, and depending on the order in which the RPCs are processed, the host state and contents of "host_aggregates_map" may still be incorrect after all RPCs have been resolved.
I guess the problem might be that if an older update is processed after a newer update, it could leave the scheduler out of sync.

If your aggregate additions are being handled by different API requests, the content in the DB will be synchronised, as we use transactions at the DB level and lock as required. But with the fanout we are passing a list of aggregate objects; we are not doing a DB lookup in the scheduler, so the order of the RPC calls matters but is not enforced. I.e. we are not passing a generation number so that the scheduler can discard any update with an older value. From its perspective, if they arrived out of order or are processed out of order, then it would look like a host was removed.

Can you file a bug for this? I'm not sure if the better approach is to have the scheduler hit the DB and get the current membership, or if we need a generation number, or if we should have some other healing mechanism like a periodic task or a lifetime for the aggregate cache. There is a bug here, however, if the order is not maintained, so we should do something to address this. Let's start with a bug.
-Mike
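
To make the generation-number option discussed above concrete, here is a minimal sketch in Python. It is not nova code: AggregateUpdate, AggregateCache, and the generation field are hypothetical names invented for illustration, and it assumes the DB layer would bump the generation on every membership change. It only shows how a receiver could discard an update that arrives after a newer one.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AggregateUpdate:
    """Hypothetical fanout payload: an aggregate snapshot plus a generation."""
    aggregate_id: int
    hosts: frozenset
    generation: int  # assumption: incremented on every membership change


@dataclass
class AggregateCache:
    """Toy stand-in for a scheduler-side aggregate cache (not nova code)."""
    hosts_by_aggregate: dict = field(default_factory=dict)
    generation_by_aggregate: dict = field(default_factory=dict)

    def apply(self, update):
        """Apply an update unless an equal or newer generation was already seen.

        Returns True if the update was applied, False if it was discarded.
        """
        seen = self.generation_by_aggregate.get(update.aggregate_id, -1)
        if update.generation <= seen:
            return False  # stale snapshot: ignore it
        self.hosts_by_aggregate[update.aggregate_id] = set(update.hosts)
        self.generation_by_aggregate[update.aggregate_id] = update.generation
        return True


# Two snapshots of the same aggregate; the newer one includes host2.
older = AggregateUpdate(1, frozenset({"host1"}), generation=1)
newer = AggregateUpdate(1, frozenset({"host1", "host2"}), generation=2)

cache = AggregateCache()
applied_newer = cache.apply(newer)  # the newer snapshot is processed first
applied_older = cache.apply(older)  # the older one arrives late and is dropped
```

Without the generation check, the late "older" snapshot would overwrite the cache and host2 would appear to have been removed, which matches the symptom described in this thread. The other options mentioned (a DB lookup in the scheduler, or a periodic refresh of the cache) would avoid the need for such a field at the cost of extra DB traffic.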