[openstack-dev] [Nova] Concerns around the Extensible Resource Tracker design - revert maybe?
Andrew Laski
andrew.laski at rackspace.com
Fri Aug 15 13:35:11 UTC 2014
On 08/14/2014 03:21 AM, Nikola Đipanov wrote:
> On 08/13/2014 06:05 PM, Sylvain Bauza wrote:
>> Le 13/08/2014 12:21, Sylvain Bauza a écrit :
>>> Le 12/08/2014 22:06, Sylvain Bauza a écrit :
>>>> Le 12/08/2014 18:54, Nikola Đipanov a écrit :
>>>>> On 08/12/2014 04:49 PM, Sylvain Bauza wrote:
>>>>>> (sorry for reposting, missed 2 links...)
>>>>>>
>>>>>> Hi Nikola,
>>>>>>
>>>>>> Le 12/08/2014 12:21, Nikola Đipanov a écrit :
>>>>>>> Hey Nova-istas,
>>>>>>>
>>>>>>> While I was hacking on [1] I was considering how to approach the fact
>>>>>>> that we now need to track one more thing (NUMA node utilization)
>>>>>>> in our
>>>>>>> resources. I went with - "I'll add it to compute nodes table"
>>>>>>> thinking
>>>>>>> it's a fundamental enough property of a compute host that it
>>>>>>> deserves to
>>>>>>> be there, although I was considering Extensible Resource Tracker
>>>>>>> at one
>>>>>>> point (ERT from now on - see [2]) but looking at the code - it did
>>>>>>> not
>>>>>>> seem to provide anything I desperately needed, so I went with
>>>>>>> keeping it
>>>>>>> simple.
>>>>>>>
>>>>>>> So fast-forward a few days, and I caught myself solving a problem
>>>>>>> that I
>>>>>>> kept thinking ERT should have solved - but apparently hasn't, and I
>>>>>>> think it is fundamentally a broken design without it - so I'd really
>>>>>>> like to see it re-visited.
>>>>>>>
>>>>>>> The problem can be described by the following lemma (if you take
>>>>>>> 'lemma'
>>>>>>> to mean 'a sentence I came up with just now' :)):
>>>>>>>
>>>>>>> """
>>>>>>> Due to the way scheduling works in Nova (roughly: pick a host
>>>>>>> based on
>>>>>>> stale(ish) data, rely on claims to trigger a re-schedule), _same
>>>>>>> exact_
>>>>>>> information that scheduling service used when making a placement
>>>>>>> decision, needs to be available to the compute service when
>>>>>>> testing the
>>>>>>> placement.
>>>>>>> """
>>>>>>>
>>>>>>> This is not the case right now, and the ERT does not propose any
>>>>>>> way to
>>>>>>> solve it - (see how I hacked around needing to be able to get
>>>>>>> extra_specs when making claims in [3], without hammering the DB). The
>>>>>>> result will be that any resource that we add and needs user supplied
>>>>>>> info for scheduling an instance against it, will need a buggy
>>>>>>> re-implementation of gathering all the bits from the request that
>>>>>>> scheduler sees, to be able to work properly.
>>>>>> Well, ERT does provide a plugin mechanism for testing resources at the
>>>>>> claim level. This is the plugin responsibility to implement a test()
>>>>>> method [2.1] which will be called when test_claim() [2.2]
>>>>>>
>>>>>> So, provided this method is implemented, a local host check can be
>>>>>> done
>>>>>> based on the host's view of resources.
>>>>>>
>>>>>>
>>>>> Yes - the problem is there is no clear API to get all the needed
>>>>> bits to
>>>>> do so - especially the user supplied one from image and flavors.
>>>>> On top of that, in current implementation we only pass a hand-wavy
>>>>> 'usage' blob in. This makes anyone wanting to use this in conjunction
>>>>> with some of the user supplied bits roll their own
>>>>> 'extract_data_from_instance_metadata_flavor_image' or similar which is
>>>>> horrible and also likely bad for performance.
>>>> I see your concern where there is no interface for user-facing
>>>> resources like flavor or image metadata.
>>>> I also think indeed that the big 'usage' blob is not a good choice
>>>> for long-term vision.
>>>>
>>>> That said, I don't think as we say in French to throw the bath
>>>> water... ie. the problem is with the RT, not the ERT (apart the
>>>> mention of third-party API that you noted - I'll go to it later below)
>>>>>>> This is obviously a bigger concern when we want to allow users to
>>>>>>> pass
>>>>>>> data (through image or flavor) that can affect scheduling, but
>>>>>>> still a
>>>>>>> huge concern IMHO.
>>>>>> And here is where I agree with you : at the moment, ResourceTracker
>>>>>> (and
>>>>>> consequently Extensible RT) only provides the view of the resources
>>>>>> the
>>>>>> host is knowing (see my point above) and possibly some other resources
>>>>>> are missing.
>>>>>> So, whatever your choice of going with or without ERT, your patch [3]
>>>>>> still deserves it if we want not to lookup DB each time a claim goes.
>>>>>>
>>>>>>
>>>>>>> As I see that there are already BPs proposing to use this IMHO broken
>>>>>>> ERT ([4] for example), which will surely add to the proliferation of
>>>>>>> code that hacks around these design shortcomings in what is already a
>>>>>>> messy, but also crucial (for perf as well as features) bit of Nova
>>>>>>> code.
>>>>>> Two distinct implementations of that spec (ie. instances and flavors)
>>>>>> have been proposed [2.3] [2.4] so reviews are welcome. If you see the
>>>>>> test() method, it's no-op thing for both plugins. I'm open to comments
>>>>>> because I have the stated problem : how can we define a limit on
>>>>>> just a
>>>>>> counter of instances and flavors ?
>>>>>>
>>>>> Will look at these - but none of them seem to hit the issue I am
>>>>> complaining about, and that is that it will need to consider other
>>>>> request data for claims, not only data available for on instances.
>>>>>
>>>>> Also - the fact that you don't implement test() in flavor ones tells me
>>>>> that the implementation is indeed racy (but it is racy atm as well) and
>>>>> two requests can indeed race for the same host, and since no claims are
>>>>> done, both can succeed. This is I believe (at least in case of single
>>>>> flavor hosts) unlikely to happen in practice, but you get the idea.
>>>> Agreed, these 2 patches probably require another iteration, in
>>>> particular how we make sure that it won't be racy. So I need another
>>>> run to think about what to test() for these 2 examples.
>>>> Another patch has to be done for aggregates, but it's still WIP so
>>>> not mentioned here.
>>>>
>>>> Anyway, as discussed during today's meeting, these 2 patches will not
>>>> be based on ERT because of the risk it goes for Juno so let's scope
>>>> them out of this thread.
>>>>
>>>>
>>>>>>> I propose to revert [2] ASAP since it is still fresh, and see how
>>>>>>> we can
>>>>>>> come up with a cleaner design.
>>>>>>>
>>>>>>> Would like to hear opinions on this, before I propose the patch tho!
>>>>>> IMHO, I think the problem is more likely that the regular RT misses
>>>>>> some
>>>>>> information for each host so it requires to handle it on a
>>>>>> case-by-case
>>>>>> basis, but I don't think ERT either increases complexity or creates
>>>>>> another issue.
>>>>>>
>>>>> RT does not miss info about the host, but about the particular request
>>>>> which we have to fish out of different places like image_metadata
>>>>> extra_specs etc, yet - it can't really work without them. This is
>>>>> definitely a RT issue that is not specific to ERT.
>>>> +1, I agree with you, that's an API issue for RT : how do we pass out
>>>> user-defined metrics ?
>>>> I still need to figure out which kind of usecases are requiring such
>>>> examples, albeit the NUMA usecase of course.
>>>>
>>> After a night and morning of thinking about this problem, I'm writing
>>> down some ideas that could help fixing the problem (although I don't
>>> think it's a long-term scenario, rather a workaround) :
>>>
>>> Considering you need the metadata of the flavor asked by user when
>>> claiming a resource :
>>> 1. Compute Manager knows request_spec when building an instance, so it
>>> could pass out request_spec as paramater (optional or not) to
>>> rt.instance_claim() in the claim context
>>> 2. As ERT does, when claiming, it calls the ERT plugin .test() method
>>> so with request_spec as extra parameter
>>> 3. On the ERT side, a plugin is responsible for calling (and caching)
>>> all flavors metadata, or even only the key you need to compare
>>> 4. Scheduler knows flavors metadata thanks to ERT plugin and blueprint
>>> isolate-scheduler-db so it makes decisions on the same subset as the RT
>>> 5. When claiming in ERT plugin test(), it extracts the flavor_id,
>>> looks at the internal in-memory representation of flavors to get the
>>> metadata, and calls the NUMA _test_numa_topology() method for checking
>>> wrt this flavor metadata
>>>
>>>
>>> The idea is that so filters and claims are doing separate decisions on
>>> the same source of knowledge, here being ERT.
>>>
>>> -Sylvain
>>>
>>>
>> Just made a draft over my thoughts here :
>> https://review.openstack.org/113936
>>
> I have commented on the review itself in more detail, but in short -
> this solution won't work for me, and thus in the general case as well, as
>
> 1) we need a way to persist the request data (and not make it a horrible
> performance bottleneck), as caching won't cut it.
There's a proposal to persist some data already at
http://git.openstack.org/cgit/openstack/nova-specs/tree/specs/juno/persistent-resource-claim.rst
. Based on this discussion it seems like we might want to revisit that
to ensure it's going to do the right thing, or can be extended to later.
>
> 2) We also need a way to use it in every periodic task run.
>
> None of this is only ERT problem, they are problems of the RT itself,
> but as I have said in the thread already - adding a plugin API on top of
> something that is broken is just asking for all sorts of trouble.
>
> Still leaning towards revert really - anything else would be short
> sighted. I would be happy to help come up with a baseline of things to
> fix before we can add it back, to avoid deferring it needlesly due to
> scope creep.
>
> N.
>
>> -S
>>>
>>>>> However, I still see several issues with the current ERT
>>>>> implementation,
>>>>> but the one I am getting at here, and why I think we should revert it,
>>>>> is that we are building a 3rd party plugin API that is tightly coupled
>>>>> to an obviously flawed internal API (RT and Claims).
>>>>>
>>>>> We have no policy AFIK about what guarantees we provide to 3rd party
>>>>> plugins, but if we are going to be breaking them all the time, or in
>>>>> this case providing very limited usefulness - I see little value in the
>>>>> current implementation of ERT, and see issues with it staying, which
>>>>> means future work will be based on it.
>>>>>
>>>>> N.
>>>> *This* is to me IMHO the biggest problem with ERT : if we say that we
>>>> externalize an API for 3rd-party plugins, we need to make sure that
>>>> everything is handled.
>>>>
>>>> That said, instead of reverting the whole patch, could we just
>>>> consider to only accept changes that wouldn't require user-facing
>>>> metrics ?
>>>> The existing VCPU implementation still sounds good to me, so we can
>>>> just consider a clear communication on what is acceptable and what
>>>> not (ie. a docstring in the plugin API or so., plus -2/-1 reviews if
>>>> needed)
>>>>
>>>> -Sylvain
>>>>
>>>>>> Thanks,
>>>>>> -Sylvain
>>>>>>
>>>>>>> Thanks all,
>>>>>>>
>>>>>>> Nikola
>>>>>>>
>>>>>>> [1]
>>>>>>> https://blueprints.launchpad.net/nova/+spec/virt-driver-numa-placement
>>>>>>>
>>>>>>> [2] https://review.openstack.org/#/c/109643/
>>>>>>> [3] https://review.openstack.org/#/c/111782/
>>>>>>> [4] https://review.openstack.org/#/c/89893
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> OpenStack-dev mailing list
>>>>>>> OpenStack-dev at lists.openstack.org
>>>>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>>>> [2.1]
>>>>>> https://github.com/openstack/nova/blob/master/nova/compute/resources/__init__.py#L75
>>>>>>
>>>>>>
>>>>>> [2.2]
>>>>>> https://github.com/openstack/nova/blob/master/nova/compute/claims.py#L134
>>>>>>
>>>>>> [2.3] https://review.openstack.org/112578
>>>>>> [2.4] https://review.openstack.org/113373
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> OpenStack-dev mailing list
>>>>>> OpenStack-dev at lists.openstack.org
>>>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>>> _______________________________________________
>>>>> OpenStack-dev mailing list
>>>>> OpenStack-dev at lists.openstack.org
>>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>>
>>>> _______________________________________________
>>>> OpenStack-dev mailing list
>>>> OpenStack-dev at lists.openstack.org
>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>
>>> _______________________________________________
>>> OpenStack-dev mailing list
>>> OpenStack-dev at lists.openstack.org
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>> _______________________________________________
>> OpenStack-dev mailing list
>> OpenStack-dev at lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
More information about the OpenStack-dev
mailing list