[openstack-dev] [nova] Boston Forum session recap - claims in the scheduler (or conductor)

Sylvain Bauza sbauza at redhat.com
Fri May 19 08:02:41 UTC 2017



On 19/05/2017 02:55, Matt Riedemann wrote:
> The etherpad for this session is here [1]. The goal for this session was
> to inform operators and get feedback on the plan for what we're doing
> with moving claims from the computes to the control layer (scheduler or
> conductor).
> 
> We mostly talked about retries, which also came up in the cells v2
> session that Dan Smith led [2] and will recap later.
> 
> Without getting into too many details, in the cells v2 session we came
> to a compromise on build retries and said that we could pass hosts down
> to the cell so that the cell-level conductor could retry if needed (even
> though we expect doing claims at the top will fix the majority of
> reasons you'd have a reschedule in the first place).
> 

And during that session, we said that since cell-local conductors (when
there is a reschedule) can't upcall the global (shared across all cells)
scheduler, we agreed to have the conductor call the Placement API for
allocations.
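
To make that concrete, the allocation call a conductor would do against
Placement looks roughly like this (a hand-wavy sketch using
python-requests directly, not the actual nova code; the endpoint URL,
token handling and payload shape are assumptions based on my reading of
the placement API, so check the API reference for your microversion):

    import requests

    # Placeholder endpoint and token handling, for illustration only.
    PLACEMENT = 'http://placement.example.com/placement'
    HEADERS = {'X-Auth-Token': '<keystone token>'}

    def claim_resources(consumer_uuid, rp_uuid, resources):
        """Write an allocation (i.e. a claim) for one instance against
        one compute node resource provider.

        'resources' is e.g. {'VCPU': 1, 'MEMORY_MB': 512, 'DISK_GB': 10}.
        Newer placement microversions also want project_id/user_id in
        the body, so check the API reference for your release.
        """
        body = {'allocations': [
            {'resource_provider': {'uuid': rp_uuid},
             'resources': resources},
        ]}
        resp = requests.put(
            '%s/allocations/%s' % (PLACEMENT, consumer_uuid),
            json=body, headers=HEADERS)
        # 204 means the claim stuck; a 409 means the inventory was
        # consumed concurrently, i.e. pick another candidate host.
        return resp.status_code == 204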


> During the claims in the scheduler session, a new wrinkle came up which
> is the hosts that the scheduler returns to the top-level conductor may
> be in different cells. So if we have two cells, A and B, with hosts x
> and y in cell A and host z in cell B, we can't send z to A for retries,
> or x or y to B for retries. So we need some kind of post-filter/weigher
> filtering such that hosts are grouped by cell and then they can be sent
> to the cells for retries as necessary.
> 

That's already proposed for review in
https://review.openstack.org/#/c/465175/
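
For context, the shape of that post-filter step is roughly the
following (just a sketch, not the code under review; the 'cell_uuid'
attribute on the selected hosts is an assumption):

    from collections import defaultdict

    def group_hosts_by_cell(selected_hosts):
        """Group the scheduler's chosen hosts by the cell they live in,
        so each cell conductor only gets alternates it can reach.
        """
        hosts_per_cell = defaultdict(list)
        for host in selected_hosts:
            hosts_per_cell[host.cell_uuid].append(host)
        return hosts_per_cell

    # e.g. {cell_A: [x, y], cell_B: [z]}: cell A's conductor can retry
    # on x or y, cell B's conductor only on z.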


> There was also some side discussion asking if we somehow regressed
> pack-first strategies by using Placement in Ocata. John Garbutt and Dan
> Smith have the context on this (I think) so I'm hoping they can clarify
> if we really need to fix something in Ocata at this point, or is this
> more of a case of closing a loop-hole?
> 

The problem is that the scheduler doesn't take cells into account when
trying to find a destination for an instance; it just uses weights for
packing.

So, for example, say I have N hosts spread across 2 cells: the
best-weighted host could be in cell1 while the second-best could be in
cell2. Then, even if the operator uses the weighers for packing, a
RequestSpec with num_instances=2 could put one instance in cell1 and
the other in cell2.

From a scheduler point of view, I think we could add a CellWeigher that
would help pack instances within the same cell, roughly along the lines
sketched below. Anyway, that's not related to the claims series, so we
could hopefully backport it to Ocata.
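
Something like the toy example below, just to show where such a weigher
would plug in (the 'cell_instance_count' attribute is made up and would
need to be plumbed into HostState; the weigher interface is as I
remember it, so double-check):

    from nova.scheduler import weights

    class CellPackingWeigher(weights.BaseHostWeigher):
        """Toy weigher: favour hosts in cells that already run more
        instances, so a multi-instance request tends to land in a
        single cell."""

        def _weigh_object(self, host_state, weight_properties):
            # 'cell_instance_count' is made up for illustration; a real
            # implementation would need that data on the HostState.
            return float(getattr(host_state, 'cell_instance_count', 0))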


> We also spent a good chunk of the session talking about overhead
> calculations for memory_mb and disk_gb which happens in the compute and
> on a per-hypervisor basis. In the absence of automating ways to adjust
> for overhead, our solution for now is operators can adjust reserved host
> resource values (vcpus, memory, disk) via config options and be
> conservative or aggressive as they see fit. Chris Dent and I also noted
> that you can adjust those reserved values via the placement REST API but
> they will be overridden by the config in a periodic task - which may be
> a bug, if not at least a surprise to an operator.
> 
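
A side note to illustrate the surprise Matt and Chris mention: adjusting
'reserved' directly in Placement looks roughly like the sketch below
(URLs, token handling and payload fields are assumptions from my reading
of the placement API docs), and the resource tracker's periodic task
will then put it back to whatever the nova.conf reserved_host_* options
say:

    import requests

    PLACEMENT = 'http://placement.example.com/placement'  # placeholder
    HEADERS = {'X-Auth-Token': '<keystone token>'}

    def set_reserved_memory(rp_uuid, reserved_mb):
        """Set the 'reserved' value on a compute node's MEMORY_MB
        inventory via the placement REST API. Note that nova's resource
        tracker periodically rewrites this inventory from
        reserved_host_memory_mb in nova.conf, so the change gets
        clobbered unless the config agrees.
        """
        url = '%s/resource_providers/%s/inventories/MEMORY_MB' % (
            PLACEMENT, rp_uuid)
        inv = requests.get(url, headers=HEADERS).json()
        body = {
            # The generation guards against concurrent updates.
            'resource_provider_generation':
                inv['resource_provider_generation'],
            'total': inv['total'],
            'reserved': reserved_mb,
        }
        return requests.put(url, json=body, headers=HEADERS)
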
> We didn't really get into this during the forum session, but there are
> different opinions within the nova dev team on how to do claims in the
> controller services (conductor vs scheduler). Sylvain Bauza has a series
> which uses the conductor service, and Ed Leafe has a series using the
> scheduler. More on that in the mailing list [3].
> 

Sorry, but I do remember we had a consensus on using the conductor, at
least during the cells v2 session.

What I'm a bit afraid of is that we're duplicating efforts on a single
blueprint when we all agreed to go that way.

> Next steps are going to be weighing both options between Sylvain and Ed,
> picking a path and moving forward, as we don't have a lot of time to sit
> on this fence if we're going to get it done in Pike.
> 

There are multiple reasons why we chose the conductor for that:
 - as I said earlier, conductors can't upcall a global scheduler when
rescheduling, and we agreed not to have (for the moment) second-level
schedulers for cells v2
 - eventually, in 1 or 2 cycles, nova-scheduler will become a library
that conductors can use for filtering/weighing. The idea is to stop
making RPC calls to a separate service that needs its own HA story
(which we know is problematic, given schedulers keep state in memory).
Instead, we should make the scheduler modules stateless so operators
would only need to scale out conductors for performance. In that model,
I think conductors should be the engines responsible for making
allocations.
 - the scheduler has no idea whether an instance request is for a move
operation or a boot, but conductors do know that logic. Instead of
adding more operation-specific conditionals to the scheduler, which
would turn it into a more nova-centric library, it's far better to
avoid any notion of nova-ism in the scheduler. In that context, there
are cases where we prefer to leave the conductor responsible for
placing allocations, because we know that eventually conductors will
own the "State Machine" (I didn't say Tasks).
 - last but not least, Andrew Laski and I had a very clear consensus
on not letting the scheduler know the instance UUIDs. If you look at
the RequestSpec object, it only has a single instance_uuid and a
num_instances field. That's *by design*. The idea is that when you
boot with --max_servers=2, for instance, all the instances share the
same user request, so the scheduler only needs to know how to place
based on those resource criteria (see the rough illustration after
this list). Pushing the instance UUIDs to the scheduler would leak too
much semantics into the scheduler (and you could imagine some very bad
uses of that one day).
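
To illustrate that last point, here is roughly what the scheduler
receives for a multi-create boot (shown as a plain dict with made-up
values; the real object is a RequestSpec):

    # Roughly what the scheduler receives for "boot --max_servers=2"
    # (a plain dict stand-in for the RequestSpec object):
    request_spec = {
        'instance_uuid': '11111111-...',  # a single UUID, by design
        'num_instances': 2,               # how many instances to place
        'flavor': {'vcpus': 1, 'memory_mb': 512, 'root_gb': 10},
        # ... image, scheduler_hints, etc., shared by both instances
    }
    # No per-instance UUID list: the conductor keeps the mapping of
    # which created instance goes to which selected host.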


> As a side request, it would be great if companies that have teams doing
> performance and scale testing could help out and compare before (Ocata)
> and after (Pike with claims in the controller) results, because we
> eventually want to deprecate the caching scheduler but that currently
> outperforms the filter scheduler at scale because of the retries
> involved when using the filter scheduler, and which we expect doing
> claims at the top will fix.
> 


+1000 to that.
FWIW, having two very distinct series duplicating the work makes
reviews very hard and possibly confuses people. I'd ask Ed to at least
put his own series into a separate Gerrit topic, potentially named
'placement-claims-alternative' or something like that, so we could
mention it in the blueprint whiteboard.

-Sylvain

> [1]
> https://etherpad.openstack.org/p/BOS-forum-move-claims-from-compute-to-scheduler
> 
> [2]
> https://etherpad.openstack.org/p/BOS-forum-cellsv2-developer-community-coordination
> 
> [3] http://lists.openstack.org/pipermail/openstack-dev/2017-May/116949.html
> 


