[openstack-dev] [Nova] Concerns around the Extensible Resource Tracker design - revert maybe?

Sylvain Bauza sbauza at redhat.com
Tue Aug 19 07:56:30 UTC 2014

Le 15/08/2014 15:35, Andrew Laski a écrit :
> On 08/14/2014 03:21 AM, Nikola Đipanov wrote:
>> On 08/13/2014 06:05 PM, Sylvain Bauza wrote:
>>> Le 13/08/2014 12:21, Sylvain Bauza a écrit :
>>>> Le 12/08/2014 22:06, Sylvain Bauza a écrit :
>>>>> Le 12/08/2014 18:54, Nikola Đipanov a écrit :
>>>>>> On 08/12/2014 04:49 PM, Sylvain Bauza wrote:
>>>>>>> (sorry for reposting, missed 2 links...)
>>>>>>> Hi Nikola,
>>>>>>> Le 12/08/2014 12:21, Nikola Đipanov a écrit :
>>>>>>>> Hey Nova-istas,
>>>>>>>> While I was hacking on [1] I was considering how to approach 
>>>>>>>> the fact
>>>>>>>> that we now need to track one more thing (NUMA node utilization)
>>>>>>>> in our
>>>>>>>> resources. I went with - "I'll add it to compute nodes table"
>>>>>>>> thinking
>>>>>>>> it's a fundamental enough property of a compute host that it
>>>>>>>> deserves to
>>>>>>>> be there, although I was considering  Extensible Resource Tracker
>>>>>>>> at one
>>>>>>>> point (ERT from now on - see [2]) but looking at the code - it did
>>>>>>>> not
>>>>>>>> seem to provide anything I desperately needed, so I went with
>>>>>>>> keeping it
>>>>>>>> simple.
>>>>>>>> So fast-forward a few days, and I caught myself solving a problem
>>>>>>>> that I
>>>>>>>> kept thinking ERT should have solved - but apparently hasn't, 
>>>>>>>> and I
>>>>>>>> think it is fundamentally a broken design without it - so I'd 
>>>>>>>> really
>>>>>>>> like to see it re-visited.
>>>>>>>> The problem can be described by the following lemma (if you take
>>>>>>>> 'lemma'
>>>>>>>> to mean 'a sentence I came up with just now' :)):
>>>>>>>> """
>>>>>>>> Due to the way scheduling works in Nova (roughly: pick a host
>>>>>>>> based on
>>>>>>>> stale(ish) data, rely on claims to trigger a re-schedule), _same
>>>>>>>> exact_
>>>>>>>> information that scheduling service used when making a placement
>>>>>>>> decision, needs to be available to the compute service when
>>>>>>>> testing the
>>>>>>>> placement.
>>>>>>>> """
>>>>>>>> This is not the case right now, and the ERT does not propose any
>>>>>>>> way to
>>>>>>>> solve it - (see how I hacked around needing to be able to get
>>>>>>>> extra_specs when making claims in [3], without hammering the 
>>>>>>>> DB). The
>>>>>>>> result will be that any resource that we add and needs user 
>>>>>>>> supplied
>>>>>>>> info for scheduling an instance against it, will need a buggy
>>>>>>>> re-implementation of gathering all the bits from the request that
>>>>>>>> scheduler sees, to be able to work properly.
>>>>>>> Well, ERT does provide a plugin mechanism for testing resources 
>>>>>>> at the
>>>>>>> claim level. This is the plugin responsibility to implement a 
>>>>>>> test()
>>>>>>> method [2.1] which will be called when test_claim() [2.2]
>>>>>>> So, provided this method is implemented, a local host check can be
>>>>>>> done
>>>>>>> based on the host's view of resources.
>>>>>> Yes - the problem is there is no clear API to get all the needed
>>>>>> bits to
>>>>>> do so - especially the user supplied one from image and flavors.
>>>>>> On top of that, in current implementation we only pass a hand-wavy
>>>>>> 'usage' blob in. This makes anyone wanting to use this in 
>>>>>> conjunction
>>>>>> with some of the user supplied bits roll their own
>>>>>> 'extract_data_from_instance_metadata_flavor_image' or similar 
>>>>>> which is
>>>>>> horrible and also likely bad for performance.
>>>>> I see your concern where there is no interface for user-facing
>>>>> resources like flavor or image metadata.
>>>>> I also think indeed that the big 'usage' blob is not a good choice
>>>>> for long-term vision.
>>>>> That said, I don't think as we say in French to throw the bath
>>>>> water... ie. the problem is with the RT, not the ERT (apart the
>>>>> mention of third-party API that you noted - I'll go to it later 
>>>>> below)
>>>>>>>> This is obviously a bigger concern when we want to allow users to
>>>>>>>> pass
>>>>>>>> data (through image or flavor) that can affect scheduling, but
>>>>>>>> still a
>>>>>>>> huge concern IMHO.
>>>>>>> And here is where I agree with you : at the moment, ResourceTracker
>>>>>>> (and
>>>>>>> consequently Extensible RT) only provides the view of the resources
>>>>>>> the
>>>>>>> host is knowing (see my point above) and possibly some other 
>>>>>>> resources
>>>>>>> are missing.
>>>>>>> So, whatever your choice of going with or without ERT, your 
>>>>>>> patch [3]
>>>>>>> still deserves it if we want not to lookup DB each time a claim 
>>>>>>> goes.
>>>>>>>> As I see that there are already BPs proposing to use this IMHO 
>>>>>>>> broken
>>>>>>>> ERT ([4] for example), which will surely add to the 
>>>>>>>> proliferation of
>>>>>>>> code that hacks around these design shortcomings in what is 
>>>>>>>> already a
>>>>>>>> messy, but also crucial (for perf as well as features) bit of Nova
>>>>>>>> code.
>>>>>>> Two distinct implementations of that spec (ie. instances and 
>>>>>>> flavors)
>>>>>>> have been proposed [2.3] [2.4] so reviews are welcome. If you 
>>>>>>> see the
>>>>>>> test() method, it's no-op thing for both plugins. I'm open to 
>>>>>>> comments
>>>>>>> because I have the stated problem : how can we define a limit on
>>>>>>> just a
>>>>>>> counter of instances and flavors ?
>>>>>> Will look at these - but none of them seem to hit the issue I am
>>>>>> complaining about, and that is that it will need to consider other
>>>>>> request data for claims, not only data available for on instances.
>>>>>> Also - the fact that you don't implement test() in flavor ones 
>>>>>> tells me
>>>>>> that the implementation is indeed racy (but it is racy atm as 
>>>>>> well) and
>>>>>> two requests can indeed race for the same host, and since no 
>>>>>> claims are
>>>>>> done, both can succeed. This is I believe (at least in case of 
>>>>>> single
>>>>>> flavor hosts) unlikely to happen in practice, but you get the idea.
>>>>> Agreed, these 2 patches probably require another iteration, in
>>>>> particular how we make sure that it won't be racy. So I need another
>>>>> run to think about what to test() for these 2 examples.
>>>>> Another patch has to be done for aggregates, but it's still WIP so
>>>>> not mentioned here.
>>>>> Anyway, as discussed during today's meeting, these 2 patches will not
>>>>> be based on ERT because of the risk it goes for Juno so let's scope
>>>>> them out of this thread.
>>>>>>>> I propose to revert [2] ASAP since it is still fresh, and see how
>>>>>>>> we can
>>>>>>>> come up with a cleaner design.
>>>>>>>> Would like to hear opinions on this, before I propose the patch 
>>>>>>>> tho!
>>>>>>> IMHO, I think the problem is more likely that the regular RT misses
>>>>>>> some
>>>>>>> information for each host so it requires to handle it on a
>>>>>>> case-by-case
>>>>>>> basis, but I don't think ERT either increases complexity or creates
>>>>>>> another issue.
>>>>>> RT does not miss info about the host, but about the particular 
>>>>>> request
>>>>>> which we have to fish out of different places like image_metadata
>>>>>> extra_specs etc, yet - it can't really work without them. This is
>>>>>> definitely a RT issue that is not specific to ERT.
>>>>> +1, I agree with you, that's an API issue for RT : how do we pass out
>>>>> user-defined metrics ?
>>>>> I still need to figure out which kind of usecases are requiring such
>>>>> examples, albeit the NUMA usecase of course.
>>>> After a night and morning of thinking about this problem, I'm writing
>>>> down some ideas that could help fixing the problem (although I don't
>>>> think it's a long-term scenario, rather a workaround) :
>>>> Considering you need the metadata of the flavor asked by user when
>>>> claiming a resource :
>>>> 1. Compute Manager knows request_spec when building an instance, so it
>>>> could pass out request_spec as paramater (optional or not) to
>>>> rt.instance_claim() in the claim context
>>>> 2. As ERT does, when claiming, it calls the ERT plugin .test() method
>>>> so with request_spec as extra parameter
>>>> 3. On the ERT side, a plugin is responsible for calling (and caching)
>>>> all flavors metadata, or even only the key you need to compare
>>>> 4. Scheduler knows flavors metadata thanks to ERT plugin and blueprint
>>>> isolate-scheduler-db so it makes decisions on the same subset as 
>>>> the RT
>>>> 5. When claiming in ERT plugin test(), it extracts the flavor_id,
>>>> looks at the internal in-memory representation of flavors to get the
>>>> metadata, and calls the NUMA _test_numa_topology() method for checking
>>>> wrt this flavor metadata
>>>> The idea is that so filters and claims are doing separate decisions on
>>>> the same source of knowledge, here being ERT.
>>>> -Sylvain
>>> Just made a draft over my thoughts here :
>>> https://review.openstack.org/113936
>> I have commented on the review itself in more detail, but in short -
>> this solution won't work for me, and thus in the general case as 
>> well, as
>> 1) we need a way to persist the request data (and not make it a horrible
>> performance bottleneck), as caching won't cut it.
> There's a proposal to persist some data already at 
> http://git.openstack.org/cgit/openstack/nova-specs/tree/specs/juno/persistent-resource-claim.rst 
> .  Based on this discussion it seems like we might want to revisit 
> that to ensure it's going to do the right thing, or can be extended to 
> later.

Well, correct me if I'm wrong but this blueprint you mentioned is about 
reducing the number of DB calls when the compute node is starting by 
persisting existing claims into a local storage.

I had many concerns about this bp and how it was designed so maybe I'm 
not a good advocate for it, but even that said, IMHO I think it covers a 
separate problem (although it can help wrt restarts)
Here, the problem is that we need an hacky thing to get metadata 
associated to the flavor passed in the request data, so we need to 
figure out how RT can decide OK or KO based on the same metrics as for 
the scheduler (but with the slight but not minor difference that Compute 
has a better view of its own resources)

My personal feeling is that we need to update Scheduler host state by 
posting metrics updates from the Compute, so we change how the Scheduler 
decides from a polling model to an async model with pulls, where the 
single source has to be the Compute ResourceTracker. Of course, that 
deserves to improve how updates are done from the RT to the Scheduler 
(eg. not waiting 60 secs for sending updates about flavors or instances 
related to/running into the host, but rather notify the Scheduler 
immediatly), and that also needs to think about passing request data 
into claims.

>> 2) We also need a way to use it in every periodic task run.
>> None of this is only ERT problem, they are problems of the RT itself,
>> but as I have said in the thread already - adding a plugin API on top of
>> something that is broken is just asking for all sorts of trouble.
>> Still leaning towards revert really - anything else would be short
>> sighted. I would be happy to help come up with a baseline of things to
>> fix before we can add it back, to avoid deferring it needlesly due to
>> scope creep.
>> N.
>>> -S
>>>>>> However, I still see several issues with the current ERT
>>>>>> implementation,
>>>>>> but the one I am getting at here, and why I think we should 
>>>>>> revert it,
>>>>>> is that we are building a 3rd party plugin API that is tightly 
>>>>>> coupled
>>>>>> to an obviously flawed internal API (RT and Claims).
>>>>>> We have no policy AFIK about what guarantees we provide to 3rd party
>>>>>> plugins, but if we are going to be breaking them all the time, or in
>>>>>> this case providing very limited usefulness - I see little value 
>>>>>> in the
>>>>>> current implementation of ERT, and see issues with it staying, which
>>>>>> means future work will be based on it.
>>>>>> N.
>>>>> *This* is to me IMHO the biggest problem with ERT : if we say that we
>>>>> externalize an API for 3rd-party plugins, we need to make sure that
>>>>> everything is handled.
>>>>> That said, instead of reverting the whole patch, could we just
>>>>> consider to only accept changes that wouldn't require user-facing
>>>>> metrics ?
>>>>> The existing VCPU implementation still sounds good to me, so we can
>>>>> just consider a clear communication on what is acceptable and what
>>>>> not (ie. a docstring in the plugin API or so., plus -2/-1 reviews if
>>>>> needed)
>>>>> -Sylvain
>>>>>>> Thanks,
>>>>>>> -Sylvain
>>>>>>>> Thanks all,
>>>>>>>> Nikola
>>>>>>>> [1]
>>>>>>>> https://blueprints.launchpad.net/nova/+spec/virt-driver-numa-placement 
>>>>>>>> [2] https://review.openstack.org/#/c/109643/
>>>>>>>> [3] https://review.openstack.org/#/c/111782/
>>>>>>>> [4] https://review.openstack.org/#/c/89893
>>>>>>>> _______________________________________________
>>>>>>>> OpenStack-dev mailing list
>>>>>>>> OpenStack-dev at lists.openstack.org
>>>>>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>>>>> [2.1]
>>>>>>> https://github.com/openstack/nova/blob/master/nova/compute/resources/__init__.py#L75 
>>>>>>> [2.2]
>>>>>>> https://github.com/openstack/nova/blob/master/nova/compute/claims.py#L134 
>>>>>>> [2.3] https://review.openstack.org/112578
>>>>>>> [2.4] https://review.openstack.org/113373
>>>>>>> _______________________________________________
>>>>>>> OpenStack-dev mailing list
>>>>>>> OpenStack-dev at lists.openstack.org
>>>>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>>>> _______________________________________________
>>>>>> OpenStack-dev mailing list
>>>>>> OpenStack-dev at lists.openstack.org
>>>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>>> _______________________________________________
>>>>> OpenStack-dev mailing list
>>>>> OpenStack-dev at lists.openstack.org
>>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>> _______________________________________________
>>>> OpenStack-dev mailing list
>>>> OpenStack-dev at lists.openstack.org
>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>> _______________________________________________
>>> OpenStack-dev mailing list
>>> OpenStack-dev at lists.openstack.org
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>> _______________________________________________
>> OpenStack-dev mailing list
>> OpenStack-dev at lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

More information about the OpenStack-dev mailing list