[openstack-dev] [Nova] [Gantt] Scheduler split status (updated)

Dugger, Donald D donald.d.dugger at intel.com
Tue Jul 15 16:02:15 UTC 2014


Unfortunately, much as I agree with your sentiment (Death to MS Outlook), my IT overlords have pretty much forced me into using it.  I still top-post, but I try to use some copied context (typically by adding an "in re:" to be explicit) so you know what part of the long email I'm referring to.

--
Don Dugger
"Censeo Toto nos in Kansa esse decisse." - D. Gale
Ph: 303/443-3786

-----Original Message-----
From: Jay Pipes [mailto:jaypipes at gmail.com] 
Sent: Tuesday, July 15, 2014 9:05 AM
To: openstack-dev at lists.openstack.org
Subject: Re: [openstack-dev] [Nova] [Gantt] Scheduler split status (updated)

Hi Paul, thanks for your reply. Comments inline.

BTW, is there any way to reply inline instead of top-posting? On these longer emails, it gets hard sometimes to follow your reply to specific things I mentioned (vs. what John G mentioned).

Death to MS Outlook.

On 07/14/2014 04:40 PM, Murray, Paul (HP Cloud) wrote:
> On extensible resource tracking
>
> Jay, I am surprised to hear you say no one has explained to you why 
> there is an extensible resource tracking blueprint. It's simple: there 
> was a succession of blueprints wanting to add data about this and that 
> to the resource tracker, the scheduler, and the database tables used 
> to communicate. These included capabilities, all the stuff in the 
> stats, rxtx_factor, the equivalent for cpu (which only works on one 
> hypervisor, I think), pci_stats, and more were coming, including:
>
> 
> https://blueprints.launchpad.net/nova/+spec/network-bandwidth-entitlement
> https://blueprints.launchpad.net/nova/+spec/cpu-entitlement
>
> So, in short, your claim that there are no operators asking for 
> additional stuff is simply not true.

A few things about the above blueprints:

1) Neither of the above blueprints is approved.

2) Neither of the above blueprints, nor the extensible resource tracker blueprint, contains a single *use case*. The blueprints are full of statements like "We want to extend this model to add a measure of XXX" and "We propose a unified API to support YYY", but none of them actually contains a real use case. A use case is in the form of "As a XXX user, I want to be able to YYY so that my ZZZ can do AAA." Blueprints without use cases are not necessarily things to be disregarded, but when a blueprint proposes a significant change in behaviour/design or a new feature without specifying one or more use cases that are satisfied by the proposed spec, the blueprint is suspicious, in my mind.

3) The majority of the feature requests in the CPUEntitlement blueprint are already enabled by existing host aggregates and their cpu_allocation_ratios, together with Dan Berrange's work on adding NUMA topology aspects to the compute node and flavours (see the first sketch after this list).

4) In my previous emails, I was pretty specific that I had never met a single operator or admin who was "sitting there tinkering with weight multipliers" trying to control the placement of VMs in their cloud. When I talk about the *needless complexity* in the current scheduler design, I am talking specifically about the weigher multipliers. I can guarantee you that there isn't a single person out there sitting behind the scenes going "Oooh, let me change my ram weigher multiplier from 1.0 to 0.675 and see what happens". It's just not something that is done -- that is way too low a level for the Compute scheduler to be thinking at (see the second sketch after this list).

The Nova scheduler *is not a process or thread scheduler*. Folks who think that the Nova scheduler should emulate the Linux kernel scheduling policies and strategies are thinking on *completely* the wrong level, IMO. We should be focusing on making the scheduler *simpler*, with admin users *easily* able to figure out how to control placement decisions for their host aggregates and, more importantly, allow *tenant-by-tenant sorting policies* [1] so that scheduling decisions for different classes of tenants can be controlled distinctly.
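To illustrate point 3, here is a rough sketch of how an aggregate-level allocation ratio already caps what a host will accept. This is not actual Nova code (the real logic lives in the aggregate core filter); the names and structures below are simplified for illustration:

    # Rough sketch only -- simplified from the idea behind the existing
    # aggregate core filtering; names here are illustrative, not Nova's.

    def host_passes(host_vcpus, host_vcpus_used, requested_vcpus,
                    cpu_allocation_ratio):
        """Return True if the host can accept the requested vCPUs.

        cpu_allocation_ratio would come from the host aggregate's
        metadata (e.g. cpu_allocation_ratio=2.0 set on the aggregate).
        """
        limit = host_vcpus * cpu_allocation_ratio
        return host_vcpus_used + requested_vcpus <= limit

    # e.g. a 16-core host with ratio 2.0 accepts up to 32 vCPUs total:
    assert host_passes(16, 30, 2, 2.0) is True
    assert host_passes(16, 31, 2, 2.0) is False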
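And for point 4, this is roughly the level the weigher multipliers operate at. Again, a simplified sketch, not the real weigher code:

    # Simplified sketch of how a RAM weigher multiplier gets applied;
    # the real code lives under nova.scheduler.weights, this is just to
    # show how low-level the knob is.

    def weigh_hosts(hosts, ram_weight_multiplier=1.0):
        """Order candidate hosts by weighted free RAM (highest first)."""
        def weight(host):
            # host is assumed to be a dict like
            # {'name': ..., 'free_ram_mb': ...}
            return ram_weight_multiplier * host['free_ram_mb']
        return sorted(hosts, key=weight, reverse=True)

    # Tweaking the multiplier only rescales this one weigher's
    # contribution relative to the other weighers -- exactly the kind of
    # low-level knob I'm arguing operators never actually touch.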

> Around about the Icehouse summit (I think) it was suggested that we 
> should stop the obvious trend and add a way to make resource tracking 
> extensible, similar to metrics, which had just been added as an 
> extensible way of collecting ongoing usage data (because that was 
> also wanted).

OK, I do understand that. I actually think it would have been more appropriate to define real models for these new resource types instead of making it a free-for-all with too much ability to support out-of-tree custom, non-standard resource types, but I understand the idea behind it.

> The json blob you refer to was down to the bad experience of the 
> compute_node_stats table implemented for stats - which had a 
> particular performance hit because it required an expensive join.
> This was dealt with by removing the table and adding a string field to 
> contain the data as a json blob. A pure performance optimization.

Interesting. This is good to know (and would have been good to note on the ERT blueprint).

The problem I have with this is that we are muddying the code and the DB schema unnecessarily because we don't want to optimize our DB read code to avoid pulling giant BLOB columns when they aren't needed. Instead, we take the easy route and shove everything into a JSON BLOB field.

> Clearly there is no need to store things in this way and with Nova 
> objects being introduced there is a means to provide strict type 
> checking on the data even if it is stored as json blobs in the 
> database.

The big problem I have with the ERT implementation is that it does not model the *resource*. Instead, it provides a plugin interface that is designed to take a BLOB of data and pass back a BLOB of data, and doesn't actually model the resource in any way. One of the things I was going for in my sample PoC code in the lock-free claims stuff [2] was actually *modeling* resources using objects instead of a dict of random nested dicts and string values.
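For example, what I mean by *modeling* the resource is, very roughly, something like this. A sketch only -- the names are made up for illustration and this is not the PoC code itself:

    # Sketch of the idea: a typed, comparable resource object instead of
    # a blob of nested dicts and string values.

    class RamResource(object):
        def __init__(self, total_mb, used_mb=0):
            self.total_mb = total_mb
            self.used_mb = used_mb

        @property
        def available_mb(self):
            return self.total_mb - self.used_mb

        def can_fit(self, requested_mb):
            """A quantitative comparison the scheduler can rely on."""
            return requested_mb <= self.available_mb

    # ...versus the free-for-all of today, roughly:
    #   {'stats': '{"ram": {"t": 2048, "u": 512}}', ...}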

> On scheduler split
>
> I have no particular position on splitting the scheduler. However, 
> there was an interesting reaction to the network bandwidth entitlement 
> blueprint listed above. The nova community felt it was a network thing 
> and so nova should not provide it - neutron should. Of course, in 
> nova, the nova scheduler makes placement decisions... can you see where 
> this is going...? Nova needs to coordinate its placement decision with 
> neutron to decide if a host has sufficient bandwidth available. 
> Similar points are made about cinder - nova has no idea about cinder, 
> but in some environments the location of a volume matters when you 
> come to place an instance.

I understand the above, but don't feel those are reasons to split out the scheduler. Rather, I feel that they are reasons to make Nova's internal scheduler interfaces cleaner and more capable. Splitting the scheduler out for the above reasons just means that the split-out scheduler will be trying to be all things for all folks, which is a recipe for disaster, IMO. Better to make the Nova scheduler a best-in-breed and have Neutron or Cinder's schedulers bring over any particular improvements that are appropriate.

> I should re-iterate that I have no position on splitting out the 
> scheduler, but some way to deal with information from outside nova is 
> certainly desirable. Maybe other services have the same dilemma.
>
> On global resource tracker
>
> I have to say I am inclined to be against the idea of turning the 
> scheduler into a "global resource tracker". I do see the benefit of 
> obtaining a resource claim up front, we have all seen that the 
> scheduler can make incorrect choices because of the delay in 
> reflecting resource allocation to the database and so to the scheduler 
> - it operates on imperfect information. However, it is best to avoid a 
> global service relying on synchronous interaction with compute nodes 
> during the process of servicing a request.

There isn't a single thing about my PoC code that relies on a synchronous interaction with a compute node.

> I have looked at your example code for the scheduler (global resource 
> tracker) and it seems to make a choice from local information and then 
> interact with the chosen compute node to obtain a claim and then try 
> again if the claim fails.

Sorry, that's totally incorrect. Please take a look at it again. The scheduler does not interact with the compute node at all. You may be confused by the use of the compute_node *nova.object*, which is simply the interface to the database for the compute nodes table.

In the PoC code, the main difference is that a collection of Claim objects is returned by the scheduler to the caller (nova-conductor, most likely), and the resources controlled by the claim are reserved in a lock-free manner, using a compare-and-swap strategy. This means that the coarse-grained semaphore over the entire compute node that is currently taken during scheduling activities can be avoided.

> I get it - I see that it deals with the same list of hosts on the 
> retry. I also see it has no better chance of getting it right.

Again, please take a look at the code; this isn't correct at all.
The Claim object's constructor reserves the resource amounts on the compute node in question using a compare-and-swap strategy, and it retries the claim logic if and only if the UPDATE SQL statement that changes the resource usage amounts on the compute node record affects zero rows. The WHERE condition of the UPDATE statement is constructed to match what the scheduler process thought the resource usage amounts of that compute node were at the time it attempted to consume resources on it, so the UPDATE will affect no rows if another process or thread successfully claimed resources on that compute node record in between the time the first scheduler process reserved resources on the node and tried to update the database. If no rows are affected, the Claim object refreshes its view of the compute node record from the DB and retries the reservation and the UPDATE. If it succeeds, great. If the reservation would put the amount of available resources on the node below zero, then the Claim raises a ValueError and the scheduler tries to claim on another compute node.

In this way, there is no locking in the form of any SELECT FOR UPDATE usage, and no need for taking a semaphore lock during any decisions.
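In rough code terms, the compare-and-swap looks something like the following. This is a simplified sketch using SQLAlchemy Core, not the actual PoC code; the table and column names are illustrative:

    # Simplified sketch of the lock-free claim: no SELECT FOR UPDATE,
    # just an UPDATE whose WHERE clause encodes what we *think* the
    # current usage on the compute node is.
    import sqlalchemy as sa

    def try_claim(conn, compute_nodes, node_id, expected_free_ram_mb,
                  requested_ram_mb):
        """Return True if the claim applied, False if we lost the race."""
        if requested_ram_mb > expected_free_ram_mb:
            raise ValueError("node would go below zero free RAM")
        stmt = (sa.update(compute_nodes)
                .where(compute_nodes.c.id == node_id)
                .where(compute_nodes.c.free_ram_mb == expected_free_ram_mb)
                .values(free_ram_mb=expected_free_ram_mb - requested_ram_mb))
        result = conn.execute(stmt)
        # rowcount == 0 means another scheduler got there first; the
        # caller re-reads the compute node row and retries, or moves on
        # to another node.
        return result.rowcount > 0

If the UPDATE matches nothing, the claim code refreshes its copy of the row and tries again, exactly as described above.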

> Your desire to have a claim is borne out by the persistent claims spec 
> (I love the spec, I really don't see why they have to be 
> persistent).

I've never seen the spec. It was John Garbutt that mentioned that. Would be easier to tell if you didn't top-post. Sorry, had to mention that.

> I think that is a great idea. Why not let the scheduler make placement 
> suggestions (as a global service) and then allow conductors to obtain 
> the claim and retry if the claim fails?

Please see my PoC code. It's NOT the conductor that does the retry. It's the Scheduler in conjunction with the database, which is the final arbiter of resource usage records.

> Similar process to your code, but the scheduler only does its part and 
> the conductors scale out the process by acting more locally and with 
> more parallelism. (Of course, you could also be optimistic and allow 
> the compute node to do the claim as part of the create as the 
> degenerate case).

The way to scale out the scheduler in my proposed PoC is by sharding the set of compute nodes that each scheduler process would deal with. See the comments in the PoC that discuss that.
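By sharding I mean something along these lines. This is purely illustrative; the actual partitioning scheme in the PoC is described in its comments:

    # Illustrative sketch: deterministically partition compute nodes
    # across N scheduler processes so that each process only claims
    # against its own subset and they never contend on the same rows.
    import zlib

    def nodes_for_shard(compute_node_ids, num_schedulers, shard_index):
        return [node_id for node_id in compute_node_ids
                if zlib.crc32(str(node_id).encode()) % num_schedulers
                == shard_index]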

> To emphasize the point further, what would a cells scheduler do?
> Would that also make a synchronous operation to obtain the claim?

It's not a synchronous operation, and it doesn't involve the compute worker itself. Please revisit the PoC code.

> My reaction to the global resource tracker idea has been quite 
> negative. I want to like the idea because I like the thought of 
> knowing I have the resources when I get my answer. It's just that I 
> think the persistent claims (without the persistent part :-) ) gives us 
> a lot of what we need. But I am still open to be convinced.

I've still yet to see the persistent claim blueprint (or any code for it). Would very much welcome a link for that.

Best,
-jay

[1]
https://review.openstack.org/#/c/97503/3/specs/juno/policy-based-scheduing-engine.rst


(Although I'm -1 on the implementation details, I'm +1 on the idea of allowing different tenants to be treated with different scheduling policies.)

[2] https://review.openstack.org/#/c/103598/4/nova/placement/resource.py

(Yes, I know that the resources exposed in that file are not "extensible", in that they aren't loaded from stevedore plugins, but the idea is to have a model for any resource that can be compared in a quantitative way -- even for resources like NUMA topologies that have multiple components for their resource amounts (cores, cells, threads, etc.).)
