[openstack-dev] [nova] Boston Forum session recap - claims in the scheduler (or conductor)

John Garbutt john at johngarbutt.com
Fri May 19 10:19:16 UTC 2017

On 19 May 2017 at 10:03, Sylvain Bauza <sbauza at redhat.com> wrote:
> Le 19/05/2017 10:02, Sylvain Bauza a écrit :
>> Le 19/05/2017 02:55, Matt Riedemann a écrit :
>>> The etherpad for this session is here [1]. The goal for this session was
>>> to inform operators and get feedback on the plan for what we're doing
>>> with moving claims from the computes to the control layer (scheduler or
>>> conductor).
>>> We mostly talked about retries, which also came up in the cells v2
>>> session that Dan Smith led [2] and will recap later.
>>> Without getting into too many details, in the cells v2 session we came
>>> to a compromise on build retries and said that we could pass hosts down
>>> to the cell so that the cell-level conductor could retry if needed (even
>>> though we expect doing claims at the top will fix the majority of
>>> reasons you'd have a reschedule in the first place).
>> And during that session, we said that given cell-local conductors (when
>> there is a reschedule) can't upcall the global (for all cells)
>> schedulers, that's why we agreed to use the conductor to be calling
>> Placement API for allocations.
>>> During the claims in the scheduler session, a new wrinkle came up which
>>> is the hosts that the scheduler returns to the top-level conductor may
>>> be in different cells. So if we have two cells, A and B, with hosts x
>>> and y in cell A and host z in cell B, we can't send z to A for retries,
>>> or x or y to B for retries. So we need some kind of post-filter/weigher
>>> filtering such that hosts are grouped by cell and then they can be sent
>>> to the cells for retries as necessary.
>> That's already proposed for reviews in
>> https://review.openstack.org/#/c/465175/
>>> There was also some side discussion asking if we somehow regressed
>>> pack-first strategies by using Placement in Ocata. John Garbutt and Dan
>>> Smith have the context on this (I think) so I'm hoping they can clarify
>>> if we really need to fix something in Ocata at this point, or is this
>>> more of a case of closing a loop-hole?
>> The problem is that the scheduler doesn't verify the cells when trying
>> to find a destination for an instance, it's just using weights for packing.
>> So, for example, say I have N hosts and 2 cells, the first weighting
>> host could be in cell1 while the second could be in cell2. Then, even if
>> the operator uses the weighers for packing, for example a RequestSpec
>> with num_instances=2 could push one instance in cell1 and the other in
>> cell2.
>> From a scheduler point of view, I think we could possibly add a
>> CellWeigher that would help to pack instances within the same cell.
>> Anyway, that's not related to the claims series, so we could possibly
>> backport it for Ocata hopefully.
> Melanie actually made a good point about the current logic based on the
> `host_subset_size`config option. If you're leaving it defaulted to 1, in
> theory all instances coming along the scheduler would get a sorted list
> of hosts by weights and only pick the first one (ie. packing all the
> instances onto the same host) which is good for that (except of course
> some user request that fits all the space of the host and where a spread
> could be better by shuffling between multiple hosts).
> So, while I began deprecating that option because I thought the race
> condition would be fixed by conductor claims, I think we should keep it
> for the time being until we clearly identify whether it's still necessary.
> All what I said earlier above remains valid tho. In a world where 2
> hosts are given as the less weighed ones, we could send instances from
> the same user request onto different cells, but that only ties the
> problem to a multi-instance boot problem, which is far less impactful.

FWIW, I think we need to keep this.

If you have *lots* of contention when picking your host, increasing
host_subset_size should help reduce that contention (and maybe help
increase the throughput). I haven't written a simulator to test it
out, but it feels like we will still need to keep the fuzzy select.
That might just be a different way to say the same thing mel was
saying, not sure.


More information about the OpenStack-dev mailing list