[openstack-dev] [Nova] [Gantt] Scheduler split status (updated)

Jay Pipes jaypipes at gmail.com
Tue Jul 15 15:04:41 UTC 2014


Hi Paul, thanks for your reply. Comments inline.

BTW, is there any way to reply inline instead of top-posting? On these
longer emails, it gets hard sometimes to follow your reply to specific
things I mentioned (vs. what John G mentioned).

Death to MS Outlook.

On 07/14/2014 04:40 PM, Murray, Paul (HP Cloud) wrote:
> On extensible resource tracking
>
> Jay, I am surprised to hear you say no one has explained to you why
> there is an extensible resource tracking blueprint. It’s simple,
> there was a succession of blueprints wanting to add data about this
> and that to the resource tracker and the scheduler and the database
> tables used to communicate. These included capabilities, all the
> stuff in the stats, rxtx_factor, the equivalent for cpu (which only
> works on one hypervisor, I think), pci_stats, and more were coming,
> including:
>
> https://blueprints.launchpad.net/nova/+spec/network-bandwidth-entitlement
> https://blueprints.launchpad.net/nova/+spec/cpu-entitlement
>
> So, in short, your claim that there are no operators asking for
> additional stuff is simply not true.

A few things about the above blueprints:

1) Neither of the above blueprints is approved.

2) Neither of the above blueprints, nor the extensible resource tracker
blueprint, contains a single *use case*. They are full of statements
like "We want to extend this model to add a measure of XXX" and "We
propose a unified API to support YYY", but none of them contains a real
use case. A use case takes the form "As a XXX user, I want to be able
to YYY so that my ZZZ can do AAA." Blueprints without use cases are not
necessarily things to be disregarded, but when a blueprint proposes a
significant change in behaviour/design or a new feature without
specifying one or more use cases that the proposed spec satisfies, the
blueprint is suspicious, in my mind.

3) The majority of the feature requests in the CPUEntitlement blueprint
are already enabled by existing host aggregates (with their
cpu_allocation_ratios) and by Dan Berrange's work on adding NUMA
topology aspects to compute nodes and flavours.

4) In my previous emails, I was pretty specific that I had never met a
single operator or admin that was "sitting there tinkering with weight
multipliers" trying to control the placement of VMs in their cloud. When
I talk about the *needless complexity* in the current scheduler design,
I am talking specifically about the weigher multipliers. I can guarantee
you that there isn't a single person out there sitting behind the scenes
going "Oooh, let me change my ram weigher multiplier from 1.0 to .675
and see what happens". It's just not something that is done -- that is
way too low of a level for the Compute scheduler to be thinking at. The
Nova scheduler *is not a process or thread scheduler*. Folks who think
that the Nova scheduler should emulate the Linux kernel scheduling
policies and strategies are thinking on *completely* the wrong level,
IMO. We should be focusing on making the scheduler *simpler*, with admin
users *easily* able to figure out how to control placement decisions for
their host aggregates and, more importantly, allow *tenant-by-tenant
sorting policies* [1] so that scheduling decisions for different classes
of tenants can be controlled distinctly.
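
For the record, here is the knob in question, as a minimal nova.conf
excerpt (option name as I recall it in current Nova; 1.0 is the
default):

    # nova.conf
    [DEFAULT]
    # Multiplier the RAM weigher applies: negative values stack
    # instances onto fewer hosts, positive values spread them out.
    ram_weight_multiplier = 1.0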

> Around about the Icehouse summit (I think) it was suggested that we
> should stop the obvious trend and add a way to make resource
> tracking extensible, similar to metrics, which had just been added as
> an extensible way of collecting ongoing usage data (because that
> was also wanted).

OK, I do understand that. I actually think it would have been more
appropriate to define real models for these new resource types instead
of making it a free-for-all with too much ability to support out-of-tree
custom, non-standard resource types, but I understand the idea behind it.

> The json blob you refer to was down to the bad experience of the
> compute_node_stats table implemented for stats – which had a
> particular performance hit because it required an expensive join.
> This was dealt with by removing the table and adding a string field
> to contain the data as a json blob. A pure performance optimization.

Interesting. This is good to know (and would have been good to note on
the ERT blueprint).

The problem I have with this is that we are muddying the code and the
DB schema unnecessarily: rather than optimizing our DB read code so
that it doesn't pull giant BLOB columns when we don't need them, we
take the easy route and shove everything into a JSON BLOB field.
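
As a sketch of the kind of read optimization I mean, SQLAlchemy can
mark a column as deferred so that ordinary queries never fetch it
(illustrative model and column names, not Nova's actual schema):

    from sqlalchemy import Column, Integer, Text
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import deferred

    Base = declarative_base()

    class ComputeNode(Base):
        __tablename__ = 'compute_nodes'
        id = Column(Integer, primary_key=True)
        memory_mb_used = Column(Integer)
        # The blob is pulled from the DB only when this attribute is
        # actually accessed, not on every SELECT against the model.
        stats = deferred(Column(Text))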

> Clearly there is no need to store things in this way and with Nova
> objects being introduced there is a means to provide strict type
> checking on the data even if it is stored as json blobs in the
> database.

The big problem I have with the ERT implementation is that it does not
model the *resource*. Instead, it provides a plugin interface designed
to take a BLOB of data and pass back a BLOB of data. One of the things
I was going for in my sample PoC code in the lock-free claims stuff [2]
was actually *modeling* resources using objects instead of a dict of
random nested dicts and string values.
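
Something along these lines, as a minimal sketch (class and attribute
names here are my own invention, not the actual PoC code):

    class Resource(object):
        """A typed resource with a quantitatively comparable amount."""
        def __init__(self, total, used=0):
            self.total = total
            self.used = used

        @property
        def available(self):
            return self.total - self.used

        def can_fit(self, requested):
            return requested <= self.available

    class RAMResource(Resource):
        """RAM on a compute node; amounts are in MB."""

    ram = RAMResource(total=32768, used=4096)
    assert ram.can_fit(2048)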

> On scheduler split
>
> I have no particular position on splitting the scheduler. However,
> there was an interesting reaction to the network bandwidth
> entitlement blueprint listed above. The nova community felt it was a
> network thing and so nova should not provide it – neutron should. Of
> course, in nova, the nova scheduler makes placement decisions… can
> you see where this is going…? Nova needs to coordinate its placement
> decision with neutron to decide if a host has sufficient bandwidth
> available. Similar points are made about cinder – nova has no idea
> about cinder, but in some environments the location of a volume
> matters when you come to place an instance.

I understand the above, but don't feel those are reasons to split out
the scheduler. Rather, I feel that they are reasons to make Nova's
internal scheduler interfaces cleaner and more capable. Splitting the
scheduler out for the above reasons just means that the split-out
scheduler will be trying to be all things for all folks, which is a
recipe for disaster, IMO. Better to make the Nova scheduler a
best-in-breed and have Neutron or Cinder's schedulers bring over any
particular improvements that are appropriate.

> I should re-iterate that I have no position on splitting out the
> scheduler, but some way to deal with information from outside nova
> is certainly desirable. Maybe other services have the same dilemma.
>
> On global resource tracker
>
> I have to say I am inclined to be against the idea of turning the
> scheduler into a “global resource tracker”. I do see the benefit of
> obtaining a resource claim up front, we have all seen that the
> scheduler can make incorrect choices because of the delay in
> reflecting resource allocation to the database and so to the
> scheduler – it operates on imperfect information. However, it is
> best to avoid a global service relying on synchronous interaction
> with compute nodes during the process of servicing a request.

There isn't a single thing about my PoC code that relies on a
synchronous interaction with a compute node.

> I have looked at your example code for the scheduler (global
> resource tracker) and it seems to make a choice from local
> information and then interact with the chosen compute node to obtain
> a claim and then try again if the claim fails.

Sorry, that's totally incorrect. Please take a look at it again. The
scheduler does not interact with the compute node at all. You may be
confused by the use of the compute_node *nova.object*, which is simply
the interface to the database for the compute nodes table.

In the PoC code, the main difference is that a collection of Claim
objects is returned by the scheduler to the caller (nova-conductor,
most likely), and the resources controlled by the claim are reserved in
a lock-free manner, using a compare-and-swap strategy. This means that
the coarse-grained semaphore over the entire compute node that is
currently taken during scheduling activities can be avoided.

> I get it – I see that it deals with the same list of hosts on the
> retry. I also see it has no better chance of getting it right.

Again, please take a look at the code; this isn't correct at all. The
Claim object's constructor reserves the resource amounts on the compute
node in question using a compare-and-swap strategy, and retries the
claim logic if and only if the UPDATE SQL statement that changes
resource usage amounts on the compute node record fails to affect any
rows. The WHERE condition of the UPDATE statement is constructed to
match what the scheduler process thought the resource usage amounts of
that compute node were at the time it attempted to consume resources on
the node. The UPDATE therefore affects zero rows if another process or
thread successfully claimed resources on that compute node record
between the time the first scheduler process read the node's usage and
the time it tried to update the database. If no rows are affected, the
Claim object refreshes its view of the compute node record from the DB
and retries the reservation and the UPDATE. If the retry succeeds,
great. If the reservation would put the amount of available resources
on the node below zero, the Claim raises a ValueError, and the
scheduler tries to claim on another compute node.
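
In code, a single claim attempt looks something like the following
SQLAlchemy Core sketch (table and column names are illustrative, not
the PoC's actual schema):

    from sqlalchemy import Column, Integer, MetaData, Table

    meta = MetaData()
    compute_nodes = Table(
        'compute_nodes', meta,
        Column('id', Integer, primary_key=True),
        Column('memory_mb', Integer),       # total RAM on the node
        Column('memory_mb_used', Integer))  # RAM we believe is consumed

    def try_claim(conn, node_id, seen_used, requested, total):
        # seen_used is what this scheduler process last read for the
        # node's memory_mb_used -- the "compare" half of the CAS.
        if seen_used + requested > total:
            raise ValueError("node cannot fit the requested resources")
        result = conn.execute(
            compute_nodes.update()
            .where(compute_nodes.c.id == node_id)
            # The CAS guard: succeed only if usage is unchanged since
            # we read it; otherwise rowcount is 0 and the caller
            # refreshes its view of the node and retries.
            .where(compute_nodes.c.memory_mb_used == seen_used)
            .values(memory_mb_used=seen_used + requested))
        return result.rowcount > 0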

In this way, there is no locking in the form of any SELECT FOR UPDATE 
usage, and no need for taking a semaphore lock during any decisions.

> Your desire to have a claim is borne out by the persistent claims
> spec (I love the spec, I really don't see why they have to be
> persistent).

I've never seen the spec. It was John Garbutt that mentioned that. Would
be easier to tell if you didn't top-post. Sorry, had to mention that.

> I think that is a great idea. Why not let the scheduler make
> placement suggestions (as a global service) and then allow conductors
> to obtain the claim and retry if the claim fails?

Please see my PoC code. It's NOT the conductor that does the retry. It's
the Scheduler in conjunction with the database, which is the final
arbiter of resource usage records.

> Similar process to your code, but the scheduler only does its part
> and the conductors scale out the process by acting more locally and
> with more parallelism. (Of course, you could also be optimistic and
> allow the compute node to do the claim as part of the create as the
> degenerate case).

The way to scale out the scheduler in my proposed PoC is by sharding 
the set of compute nodes that each scheduler process deals with. See 
the comments in the PoC that discuss that.
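
A trivial sketch of the sharding idea (hash-based, with names of my own
invention rather than the PoC's):

    # Each scheduler process claims only against the compute nodes
    # that hash to its shard, so two processes never contend on the
    # same rows.
    def nodes_for_shard(node_ids, shard_index, num_shards):
        return [n for n in node_ids if n % num_shards == shard_index]

    # e.g. scheduler process 1 of 4 handles nodes 1, 5, 9:
    assert nodes_for_shard(range(1, 11), 1, 4) == [1, 5, 9]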

> To emphasize the point further, what would a cells scheduler do?
> Would that also make a synchronous operation to obtain the claim?

It's not a synchronous operation, and it doesn't involve the compute 
worker itself. Please revisit the PoC code.

> My reaction to the global resource tracker idea has been quite
> negative. I want to like the idea because I like the thought of
> knowing I have the resources when I get my answer. It's just that I
> think the persistent claims (without the persistent part :-) ) gives
> us a lot of what we need. But I am still open to be convinced.

I have yet to see the persistent claim blueprint (or any code for 
it). Would very much welcome a link for that.

Best,
-jay

[1]
https://review.openstack.org/#/c/97503/3/specs/juno/policy-based-scheduing-engine.rst
(although I'm -1 on the implementation details, I'm +1 on the idea of
allowing different tenants to be treated with different scheduling
policies.)

[2] https://review.openstack.org/#/c/103598/4/nova/placement/resource.py

(yes, I know that the resources exposed in that file are not
"extensible", in that they aren't loaded from stevedore plugins, but
the idea is to have a model for any resource that can be compared in a
quantitative way -- even for resources like NUMA topologies that have
multiple components to their resource amounts: cores, cells, threads,
etc.)


