Hi everyone,

I had a discussion a few days ago in #openstack-nova about the possibility of introducing the concept of different failure domains, which would be a component inside of an availability zone. The concept is similar to Azure's availability set feature:

https://learn.microsoft.com/en-us/azure/virtual-machines/availability-set-ov...

I've built a Nova scheduler filter which leverages server groups plus an additional scheduler hint called `different_failure_domain`, so that the instances are actually built inside different failure domains. The failure domains are modeled similarly to AZs, using aggregates, so the filter relies on a `failure_domain` aggregate metadata key.

I've managed to build and validate this functionality with my team here:

https://github.com/vexxhost/nova-scheduler-filters

Now, based on that, I'm wondering:

1. Is this the best approach to take?
2. Is this something we can upstream into Nova easily as an extra filter? I think this is helpful for a lot of the "VMware" world people. 😊

Thanks,
Mohammed
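[Editor's note: the filter described above can be sketched roughly as follows. This is a simplified, standalone model for illustration only, not the actual code from the linked repository; the plain-dict aggregates and the `host_passes` signature are stand-ins for Nova's real filter API.]

```python
# Simplified sketch of a "different failure domain" scheduler filter.
# Failure domains are modeled like AZs: as `failure_domain` metadata on
# host aggregates. A host is rejected when one of its failure domains
# already contains a member of the same server group.

def failure_domains_of(host, aggregates):
    """Return the set of failure_domain values of aggregates containing host."""
    return {
        agg["metadata"]["failure_domain"]
        for agg in aggregates
        if host in agg["hosts"] and "failure_domain" in agg["metadata"]
    }

def host_passes(host, aggregates, group_hosts, hints):
    """If the different_failure_domain hint is set (non-empty), reject hosts
    whose failure domain already holds a server-group member."""
    if not hints.get("different_failure_domain"):
        return True  # hint absent or empty: filter does not apply
    used = set()
    for gh in group_hosts:
        used |= failure_domains_of(gh, aggregates)
    return not (failure_domains_of(host, aggregates) & used)

if __name__ == "__main__":
    aggregates = [
        {"hosts": {"node1", "node2"}, "metadata": {"failure_domain": "rack-1"}},
        {"hosts": {"node3", "node4"}, "metadata": {"failure_domain": "rack-2"}},
    ]
    hints = {"different_failure_domain": ["true"]}
    # One group member already on node1 (rack-1): node2 fails, node3 passes.
    print(host_passes("node2", aggregates, {"node1"}, hints))  # False
    print(host_passes("node3", aggregates, {"node1"}, hints))  # True
```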
> Hi everyone
> I had a discussion a few days ago in the #openstack-nova about the possibility of introducing the concept of different failure domains, which would be a component inside of an availability zone. The concept is similar to Azure's availability set feature:
> https://learn.microsoft.com/en-us/azure/virtual-machines/availability-set-ov...
> I've built a very similar nova scheduler filter which leverages server groups + another scheduler hint called `different_failure_domain` then it will actually build them inside different failure domains. The failure domains are built similar to AZs in terms of being modeled using aggregates, so it would use a `failure_domain` aggregate metadata.
> I've managed to build and validate this functionality with my team here:
> https://github.com/vexxhost/nova-scheduler-filters
> Now, based on that, I'm wondering if:
> 1. Is this the best approach to take based on this?

It's an approach and it works, so it depends on your definition of best. :) It's certainly the best way to do this out of tree. That said, as I mentioned on IRC, I don't think doing this with just a scheduler filter is a good approach.
> 2. Is this something we can upstream into Nova easily as an extra filter? I think this is helpful for a lot of the "VMware" world people. 😊

Upstream, yes; enabled by default, no.
On Wed, 2024-07-10 at 23:50 +0000, Mohammed Naser wrote:

The performance, I suspect, will be too poor given the current implementation, so this cycle, probably not in its current form. The main concerns I would have with this are performance at scale and discoverability/interoperability.

The discoverability/interop issue is nothing new: Nova does not have an end-user-discoverable way to introspect whether a filter is enabled, so users can't really know if the cloud supports fault domains. Also, since the fault domain would just be metadata on a host aggregate, they have no visibility into which fault domains exist, how many there are, etc.

In your unit test example you are just showing different_failure_domain=["true"] as the hint value, but the name implies that you would be able to specify a domain like rack, room, isp-1, or power_supply_diesel_backup. Currently the filter just checks that the hint exists and the value is not empty. If the hint is present, it checks that the failure domain of a given host is not already in the set of aggregates where instances in the same group are scheduled.

This causes some problems when a host is in multiple failure domains, i.e. when a host is in both a rack-level host aggregate and a building-level (power or network failure domain) host aggregate. You can't express "same rack is not OK, but same building is"; the operator has to choose, by creating non-overlapping fault domains, which is potentially an impedance mismatch with the end user. While we don't need the complexity of Ceph's CRUSH map, we may want to support something like different_failure_domain=["rack"] to mean "only consider rack-level anti-affinity", where we would check for failure_domain=rack* in the aggregate metadata but ignore any failure_domain=room-1 metadata.
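[Editor's note: the overlap problem described above can be made concrete with a toy example. The hosts and aggregate names below are hypothetical, and the flat set-membership check mirrors the filter's described behavior, not its actual code: a host sitting in both a rack-level and a building-level aggregate blocks every other host in the same building, even hosts in different racks.]

```python
# Flat anti-affinity over failure domains: no notion of domain levels.

def failure_domains_of(host, aggregates):
    """Return the set of failure_domain values of aggregates containing host."""
    return {
        agg["metadata"]["failure_domain"]
        for agg in aggregates
        if host in agg["hosts"]
    }

def host_passes(host, aggregates, group_hosts):
    """Reject a host sharing *any* failure domain with a group member."""
    used = set()
    for gh in group_hosts:
        used |= failure_domains_of(gh, aggregates)
    return not (failure_domains_of(host, aggregates) & used)

if __name__ == "__main__":
    # node-a and node-b are in different racks but the same building.
    aggregates = [
        {"hosts": {"node-a"}, "metadata": {"failure_domain": "rack-1"}},
        {"hosts": {"node-b"}, "metadata": {"failure_domain": "rack-2"}},
        {"hosts": {"node-a", "node-b"},
         "metadata": {"failure_domain": "building-1"}},
    ]
    # With one instance on node-a, node-b is rejected: the shared
    # building-1 domain blocks it, so "different rack, same building OK"
    # cannot be expressed with overlapping aggregates.
    print(host_passes("node-b", aggregates, {"node-a"}))  # False
```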
If you wanted rack and room anti-affinity: different_failure_domain=["rack","room"]. You may also want to express "different rack but same room": different_failure_domain=["-rack","+room"], or different_failure_domain=["!rack","=room"]. The notation does not really matter, but you get the idea that multi-level affinity/anti-affinity would likely be required.

So, to upstream this, I think we would want to see a spec that covers the use cases, explores how to do hierarchical or overlapping fault domains, considers whether the user should be able to specify the type of domain affinity/anti-affinity, and some other factors.

Some of the problems with this approach come from the fact that it's built on a scheduler hint (making the request a user request, not an operator one) instead of a flavor extra spec. Without an API to expose the types of fault domains to end users, I'm leaning more towards feeling this should be operator-driven and therefore based on flavors. There is some discussion of this in https://docs.openstack.org/nova/latest/reference/scheduler-hints-vs-flavor-e...

I'm not totally against this approach, but I'm not sure how applicable this simplified version is, so it would be good to hear from others whether this filter, even with its performance implications, is enough, or whether a more comprehensive feature is needed. I'm not sure if it meets the MVP bar to be useful in production, so operator input would be helpful. One last comment: we try to avoid experimental features in Nova, but that might be a route, although I'm not sure how others feel.
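[Editor's note: the notation suggested above could be interpreted along these lines. This is purely illustrative: the `+`/`-` prefixes, the level-matching rule, and the convention that `failure_domain=rack-1` encodes its level as the prefix before the dash are assumptions sketched from the examples in the message, not an existing Nova interface.]

```python
def parse_hint(values):
    """Split hint entries like ["-rack", "+room"] into (anti, same) level sets.
    An unprefixed entry such as "rack" is treated as anti-affinity."""
    anti, same = set(), set()
    for v in values:
        if v.startswith(("+", "=")):
            same.add(v[1:])
        elif v.startswith(("-", "!")):
            anti.add(v[1:])
        else:
            anti.add(v)
    return anti, same

def domain_levels(host, aggregates):
    """Map level name -> domain value, assuming metadata like
    failure_domain=rack-1 encodes the level as the prefix before '-'."""
    levels = {}
    for agg in aggregates:
        if host in agg["hosts"]:
            value = agg["metadata"]["failure_domain"]
            levels[value.rsplit("-", 1)[0]] = value
    return levels

def host_passes(host, aggregates, group_hosts, hint):
    """Multi-level check: anti levels must differ, same levels must match."""
    anti, same = parse_hint(hint)
    mine = domain_levels(host, aggregates)
    for gh in group_hosts:
        theirs = domain_levels(gh, aggregates)
        for level in anti:
            if level in mine and mine[level] == theirs.get(level):
                return False  # must differ at this level
        for level in same:
            if level in mine and mine[level] != theirs.get(level):
                return False  # must match at this level
    return True

if __name__ == "__main__":
    aggregates = [
        {"hosts": {"node-a"}, "metadata": {"failure_domain": "rack-1"}},
        {"hosts": {"node-b"}, "metadata": {"failure_domain": "rack-2"}},
        {"hosts": {"node-c"}, "metadata": {"failure_domain": "rack-3"}},
        {"hosts": {"node-a", "node-b"}, "metadata": {"failure_domain": "room-1"}},
        {"hosts": {"node-c"}, "metadata": {"failure_domain": "room-2"}},
    ]
    # "Different rack, but same room": node-b passes, node-c does not.
    print(host_passes("node-b", aggregates, {"node-a"}, ["-rack", "+room"]))  # True
    print(host_passes("node-c", aggregates, {"node-a"}, ["-rack", "+room"]))  # False
```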
> Thanks,
> Mohammed
participants (2)
- Mohammed Naser
- smooney@redhat.com