[openstack-dev] [nova] Boston Forum session recap - claims in the scheduler (or conductor)

Sylvain Bauza sbauza at redhat.com
Fri May 19 09:03:14 UTC 2017



On 19/05/2017 10:02, Sylvain Bauza wrote:
> 
> 
> On 19/05/2017 02:55, Matt Riedemann wrote:
>> The etherpad for this session is here [1]. The goal for this session was
>> to inform operators and get feedback on the plan for what we're doing
>> with moving claims from the computes to the control layer (scheduler or
>> conductor).
>>
>> We mostly talked about retries, which also came up in the cells v2
>> session that Dan Smith led [2] and will recap later.
>>
>> Without getting into too many details, in the cells v2 session we came
>> to a compromise on build retries and said that we could pass hosts down
>> to the cell so that the cell-level conductor could retry if needed (even
>> though we expect doing claims at the top will fix the majority of
>> reasons you'd have a reschedule in the first place).
>>
> 
> And during that session, we said that because cell-local conductors
> (when there is a reschedule) can't upcall the global (for all cells)
> scheduler, we agreed to have the conductor call the Placement API for
> allocations.
> 
> 
>> During the claims in the scheduler session, a new wrinkle came up which
>> is the hosts that the scheduler returns to the top-level conductor may
>> be in different cells. So if we have two cells, A and B, with hosts x
>> and y in cell A and host z in cell B, we can't send z to A for retries,
>> or x or y to B for retries. So we need some kind of post-filter/weigher
>> filtering such that hosts are grouped by cell and then they can be sent
>> to the cells for retries as necessary.
>>
> 
> That's already proposed for reviews in
> https://review.openstack.org/#/c/465175/
> 
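
Since a reschedule has to stay within the cell-local conductor, the idea
is basically to bucket the scheduler's selected hosts by cell before
sending them down, so a cell only ever gets same-cell alternates. A
minimal sketch of that grouping step (the names are made up for
illustration, this is not the code under review):

    import collections

    def group_hosts_by_cell(selected_hosts):
        # selected_hosts: the hosts chosen by the scheduler; each one is
        # assumed to know which cell it lives in (cell_uuid is a guess
        # at the attribute name, not necessarily the real one).
        hosts_by_cell = collections.defaultdict(list)
        for host in selected_hosts:
            hosts_by_cell[host.cell_uuid].append(host)
        # Each cell conductor then only receives the alternates from its
        # own bucket, so a retry can never target a host in another cell.
        return hosts_by_cell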
> 
>> There was also some side discussion asking if we somehow regressed
>> pack-first strategies by using Placement in Ocata. John Garbutt and Dan
>> Smith have the context on this (I think) so I'm hoping they can clarify
>> if we really need to fix something in Ocata at this point, or is this
>> more of a case of closing a loop-hole?
>>
> 
> The problem is that the scheduler doesn't take cells into account when
> trying to find a destination for an instance; it just uses weights for
> packing.
> 
> So, for example, say I have N hosts and 2 cells: the best-weighed host
> could be in cell1 while the second-best could be in cell2. Then, even
> if the operator uses the weighers for packing, a RequestSpec with
> num_instances=2 could put one instance in cell1 and the other in cell2.
> 
> From a scheduler point of view, I think we could add a CellWeigher
> that would help pack instances within the same cell. That's not
> related to the claims series, though, so we could hopefully backport
> it to Ocata.
> 
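
If we went the CellWeigher route, here is a minimal sketch of what it
could look like, assuming nova's usual weigher interface (the per-cell
metric below is an assumption, not something the host state exposes
today):

    from nova.scheduler import weights


    class CellWeigher(weights.BaseHostWeigher):
        """Prefer hosts in already-populated cells to pack per cell."""

        def _weigh_object(self, host_state, weight_properties):
            # Pack: the more instances already running in this host's
            # cell, the higher the weight. 'instances_in_cell' is
            # hypothetical; a real weigher would have to aggregate that
            # from the host states it is given.
            return getattr(host_state, 'instances_in_cell', 0)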

Melanie actually made a good point about the current logic based on the
`host_subset_size` config option. If you leave it at its default of 1,
in theory every instance coming through the scheduler gets a list of
hosts sorted by weight and only picks the first one (i.e. packing all
the instances onto the same host), which is good for packing (except of
course for a user request that fills all the remaining space on that
host, where spreading across multiple hosts would be better).
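
To make that concrete, the subset selection roughly boils down to this
(a simplified sketch, not the actual FilterScheduler code):

    import random

    def pick_host(weighed_hosts, host_subset_size=1):
        # weighed_hosts is sorted best-first. With the default subset
        # size of 1 every request takes the single best-weighed host
        # (packing); with a larger value we pick randomly among the top
        # N, which spreads concurrent requests and reduces the chance of
        # them racing for the same host.
        subset = weighed_hosts[:max(1, host_subset_size)]
        return random.choice(subset)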

So, while I began deprecating that option because I thought the race
condition would be fixed by conductor claims, I think we should keep it
for the time being until we clearly identify whether it's still necessary.

Everything I said earlier still holds, though. In a world where the two
best-weighed hosts sit in different cells, we could send instances from
the same user request to different cells, but that narrows it down to a
multi-instance boot problem, which is far less impactful.



> 
>> We also spent a good chunk of the session talking about overhead
>> calculations for memory_mb and disk_gb, which happen in the compute and
>> on a per-hypervisor basis. In the absence of automated ways to adjust
>> for overhead, our solution for now is that operators can adjust reserved
>> host resource values (vcpus, memory, disk) via config options and be as
>> conservative or aggressive as they see fit. Chris Dent and I also noted
>> that you can adjust those reserved values via the placement REST API, but
>> they will be overridden by the config in a periodic task - which may be
>> a bug, if not at least a surprise to an operator.
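
For operators reading along, the knobs in question are the reserved host
resource options in nova.conf, e.g. (the values are made up, and
double-check which of these options exist in your release):

    [DEFAULT]
    # Capacity carved out for the hypervisor / host OS so it is never
    # handed to guests by the scheduler or placement.
    reserved_host_memory_mb = 4096
    reserved_host_disk_mb = 10240
    reserved_host_cpus = 2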
>>
>> We didn't really get into this during the forum session, but there are
>> different opinions within the nova dev team on how to do claims in the
>> controller services (conductor vs scheduler). Sylvain Bauza has a series
>> which uses the conductor service, and Ed Leafe has a series using the
>> scheduler. More on that in the mailing list [3].
>>
> 
> Sorry, but I do remember we had a consensus on using the conductor, at
> least during the cells v2 session.
> 
> What I'm a bit afraid of is that we're duplicating efforts on a single
> blueprint even though we all agreed to go that way.
> 
>> Next steps are going to be weighing both options between Sylvain and Ed,
>> picking a path and moving forward, as we don't have a lot of time to sit
>> on this fence if we're going to get it done in Pike.
>>
> 
> There are multiple reasons why we chose to use conductor for that:
>  - as I said earlier, conductors can't upcall a global scheduler when
> rescheduling, and we agreed not to have (for the moment) second-level
> schedulers for cells v2
>  - eventually, in 1 or 2 cycles, nova-scheduler will become a library
> that conductors can use for filtering/weighing. The idea is to stop
> making RPC calls to a separate service that requires its own HA (and
> that we know we have problems with, given schedulers are stateful in
> memory). Instead, we should make the scheduler modules stateless so
> operators would only need to scale out conductors for performance. In
> that model, I think conductors should be the engines responsible for
> making allocations (a rough sketch follows after this list).
>  - the scheduler has no idea whether the instance request is for a move
> operation or a boot, but conductors do know that logic. Instead of
> adding more specific conditionals to the scheduler, which would make
> the future library more nova-centric, it's far better to avoid any
> notion of nova-isms in the scheduler. In that context, there are cases
> where we prefer to leave the conductor responsible for placing
> allocations, because we know that eventually conductors will own the
> "State Machine" (I didn't say Tasks).
>  - last but not least, Andrew Laski and I had a very clear consensus on
> not having the scheduler know the instance UUIDs. If you look at the
> RequestSpec object, you only have a single instance_uuid and a
> num_instances field. That's *by design*. The idea is that when you
> boot with --max_servers=2, for instance, all the instances share the
> same user request, so the scheduler only needs to know how to place
> based on those resource criteria. Pushing the instance UUIDs to the
> scheduler would leak too much semantics into the scheduler (and you
> could imagine some very bad uses of that one day).
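
Regarding the allocations point above, here is a minimal sketch of what
a conductor-side claim against Placement could boil down to (the helper
is hypothetical and the payload shape is from the early microversions,
so check the Placement API reference rather than trusting this
verbatim):

    import requests

    def claim_on_host(placement_url, token, consumer_uuid, rp_uuid,
                      resources):
        # resources is e.g. {"VCPU": 1, "MEMORY_MB": 2048, "DISK_GB": 20}
        payload = {
            "allocations": [
                {"resource_provider": {"uuid": rp_uuid},
                 "resources": resources},
            ],
        }
        resp = requests.put(
            "%s/allocations/%s" % (placement_url, consumer_uuid),
            json=payload,
            headers={"X-Auth-Token": token},
        )
        # 204 means the claim stuck; a conflict means another claim
        # raced us and the conductor should move on to the next
        # candidate host.
        return resp.status_code == 204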
> 
> 
>> As a side request, it would be great if companies that have teams doing
>> performance and scale testing could help out and compare before (Ocata)
>> and after (Pike with claims in the controller) results, because we
>> eventually want to deprecate the caching scheduler, but it currently
>> outperforms the filter scheduler at scale because of the retries
>> involved when using the filter scheduler, which we expect doing claims
>> at the top will fix.
>>
> 
> 
> +1000 to that.
> FWIW, having 2 very distinct series that duplicate work makes reviews
> very hard and possibly confuses people. I'd ask Ed to put his own
> series into at least a separate Gerrit topic, potentially named
> 'placement-claims-alternative' or something like that, so we could
> mention it in the blueprint whiteboard.
> 
> -Sylvain
> 
>> [1]
>> https://etherpad.openstack.org/p/BOS-forum-move-claims-from-compute-to-scheduler
>>
>> [2]
>> https://etherpad.openstack.org/p/BOS-forum-cellsv2-developer-community-coordination
>>
>> [3] http://lists.openstack.org/pipermail/openstack-dev/2017-May/116949.html
>>
> 


