[openstack-dev] [nova] Proposal for an Experiment

Joshua Harlow harlowja at outlook.com
Mon Jul 20 22:59:19 UTC 2015

Clint Byrum wrote:
> Excerpts from Chris Friesen's message of 2015-07-20 14:30:53 -0700:
>> On 07/20/2015 02:04 PM, Clint Byrum wrote:
>>> Excerpts from Chris Friesen's message of 2015-07-20 12:17:29 -0700:
>>>> Some questions:
>>>> 1) Could you elaborate a bit on how this would work?  I don't quite understand
>>>> how you would handle a request for booting an instance with a certain set of
>>>> resources--would you queue up a message for each resource?
>>> Please be concrete about what you mean by resource.
>>> I'm suggesting that if you only have flavors, which have cpu, ram, disk,
>>> and rx/tx ratios, then each flavor is a queue. That's the easiest problem
>>> to solve. Then if you have a single special thing that can only have one
>>> VM per host (let's say, a PCI pass-through device), then that's another
>>> iteration of each flavor. So assuming 3 flavors:
>>> 1=tiny cpu=1,ram=1024m,disk=5gb,rxtx=1
>>> 2=medium cpu=2,ram=4096m,disk=100gb,rxtx=2
>>> 3=large cpu=8,ram=16384m,disk=200gb,rxtx=2
>>> This means you have these queues:
>>> reserve
>>> release
>>> compute,cpu=1,ram=1024m,disk=5gb,rxtx=1,pci=1
>>> compute,cpu=1,ram=1024m,disk=5gb,rxtx=1
>>> compute,cpu=2,ram=4096m,disk=100gb,rxtx=2,pci=1
>>> compute,cpu=2,ram=4096m,disk=100gb,rxtx=2
>>> compute,cpu=8,ram=16384m,disk=200gb,rxtx=2,pci=1
>>> compute,cpu=8,ram=16384m,disk=200gb,rxtx=2
>> <snip>
>>> Now, I've made this argument in the past, and people have pointed out
>>> that the permutations can get into the tens of thousands very easily
>>> if you start adding lots of dimensions and/or flavors. I suggest that
>>> is no big deal, but maybe I'm biased because I have done something like
>>> that in Gearman and it was, in fact, no big deal.
>> Yeah, that's what I was worried about.  We have things that can be specified per
>> flavor, and things that can be specified per image, and things that can be
>> specified per instance, and they all multiply together.
> So all that matters is the size of the set of permutations that people
> are using _now_ to request nodes.  It's relatively low-cost to create
> the queues in a distributed manner and just have compute nodes listen to
> a broadcast for new ones that they should try to subscribe to. Even if
> there are 1 million queues possible, it's unlikely there will be 1 million
> legitimate unique boot arguments. This does complicate things quite a
> bit though, so part of me just wants to suggest "don't do that".  ;)
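
To make the queue-naming idea concrete, here's a rough python sketch of
generating that queue list from flavor definitions plus optional
per-host features (the flavor data and the "pci" dimension are just
illustrative here, not any real nova or gearman API):

import itertools

FLAVORS = {
    "tiny":   {"cpu": 1, "ram": "1024m",  "disk": "5gb",   "rxtx": 1},
    "medium": {"cpu": 2, "ram": "4096m",  "disk": "100gb", "rxtx": 2},
    "large":  {"cpu": 8, "ram": "16384m", "disk": "200gb", "rxtx": 2},
}
OPTIONAL_FEATURES = ["pci"]  # each optional feature doubles the queue count

def queue_names():
    names = ["reserve", "release"]
    for spec in FLAVORS.values():
        base = ",".join(f"{k}={v}" for k, v in spec.items())
        # One queue per subset of the optional features (with pci first,
        # then without), matching the listing above.
        for n in range(len(OPTIONAL_FEATURES), -1, -1):
            for combo in itertools.combinations(OPTIONAL_FEATURES, n):
                extras = "".join(f",{feat}=1" for feat in combo)
                names.append(f"compute,{base}{extras}")
    return names

A compute node would subscribe only to the queues whose requirements it
can satisfy, and a boot request becomes a single message on one queue.
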
>>>> 2) How would it handle stuff like weight functions where you could have multiple
>>>> compute nodes that *could* satisfy the requirement but some of them would be
>>>> "better" than others by some arbitrary criteria.
>>> Can you provide a concrete example? Feels like I'm asking for a straw
>>> man to be built. ;)
>> Well, as an example, we have a cluster aimed at high-performance network
>> processing, and so all else being equal the scheduler will choose the
>> compute node with the least network traffic.  You might also try to pack instances together for
>> power efficiency (allowing you to turn off unused compute nodes), or choose the
>> compute node that results in the tightest packing (to minimize unused resources).
> Least-utilized is hard since it requires knowledge of all of the nodes'
> state. It also breaks down and gives 0 benefit when all the nodes are
> fully bandwidth-utilized. However, "Below 20% utilized" is extremely
> easy and achieves the actual goal that the user stated, since each node
> can self-assess whether it is or is not in that group. In this way the user
> gets an error ("I don't have any fully available networking for you")
> instead of unknowingly getting a node that is oversubscribed.
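
The self-assessment could be as simple as a periodic task on each
compute node, something like this (get_nic_utilization() and the queue
name are hypothetical placeholders, not real nova interfaces):

NET_THRESHOLD = 0.20  # fraction of NIC bandwidth in use

def refresh_network_subscription(worker, get_nic_utilization):
    """Run periodically on each compute node; needs no global state."""
    queue = "compute,net_below_20pct"  # hypothetical queue name
    if get_nic_utilization() < NET_THRESHOLD:
        worker.subscribe(queue)
    else:
        # Stop accepting this class of work. If no node is subscribed,
        # the boot request fails fast instead of landing on an
        # oversubscribed host.
        worker.unsubscribe(queue)
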
> Packing is kind of interesting. One can achieve it on an empty cluster
> simply by turning on only one node at a time and, whenever the queue
> has fewer than "safety_margin" workers, turning on more nodes. However,
> once nodes are full and workloads are being deleted, you want to assess
> which ones would be the least cost to migrate off of and turn off. I'm
> inclined to say I would do this from something outside the scheduler,
> as part of a power-reclaimer, but perhaps a centralized scheduler that
> always knows would do a better job here. It would need to be efficient
> enough to outweigh the benefit of not needing global state awareness.
> An external reclaimer can work in an
> eventually consistent manner and thus I would still lean toward that over
> the realtime scheduler, but this needs some experimentation to confirm.
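
For what it's worth, the external reclaimer described above could start
out as dumb as the loop below (list_hosts(), migration_cost(),
live_migrate_off() and power_off() are made-up helpers, not real nova
APIs):

import time

SAFETY_MARGIN = 2        # idle workers to keep available for new boots
RECLAIM_INTERVAL = 300   # seconds between passes; lagging reality is fine

def reclaimer_loop(cloud):
    while True:
        hosts = [h for h in cloud.list_hosts() if h.powered_on]
        idle = [h for h in hosts if h.is_idle()]
        if len(idle) > SAFETY_MARGIN:
            # Drain and power off whichever host is cheapest to empty.
            victim = min(hosts, key=cloud.migration_cost)
            cloud.live_migrate_off(victim)
            cloud.power_off(victim)
        time.sleep(RECLAIM_INTERVAL)
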

From what I've heard (though I don't know how widely this applies in the
industry), actually turning nodes off causes more problems than it solves
in terms of power costs, cooling, and hardware failures (disk, cpu, and
so on), so turning nodes off may not be the best idea. This is all
second-hand, though, so it may not reflect what others actually do.

>>>> 3) The biggest improvement I'd like to see is in group scheduling.  Suppose I
>>>> want to schedule multiple instances, each with their own resource requirements,
>>>> but also with interdependency between them (these ones on the same node, these
>>>> ones not on the same node, these ones with this provider network, etc.)  The
>>>> scheduler could then look at the whole request all at once and optimize it
>>>> rather than looking at each piece separately.  That could also allow relocating
>>>> multiple instances that want to be co-located on the same compute node.
>>> So, if the grouping is arbitrary, then there's no way to pre-calculate the
>>> group size, I agree. I am loath to pursue something like this though, as I
>>> don't really think this is the kind of optimization that cloud workloads
>>> should be built on top of. If you need two processes to have low latency,
>>> why not just boot a bigger machine and do it all in one VM? There are a
>>> few reasons I can think of, but I wonder how many apply in the general
>>> case.
>> It's a fair question. :)  I honestly don't know...I was just thinking that we
>> allow the expression of affinity/anti-affinity policies via server groups, but
>> the scheduler doesn't really do a good job of actually scheduling those groups.
> For anti-affinity I'm still inclined to say "that's what availability
> zones and regions are for". But I know that for smaller clouds, these
> are too heavyweight in their current form to be useful. Perhaps we
> could look at making them less so.
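
On the server-group point, simple (anti-)affinity could still be a
host-local check in a queue-based world, again without global state
(group_id, local_instances and the policy names are just illustrative):

def can_accept(group_id, policy, local_instances):
    """Host-local check: may this host take an instance of this group?"""
    has_member = any(i.get("group_id") == group_id for i in local_instances)
    if policy == "anti-affinity":
        return not has_member  # refuse if a group member already runs here
    if policy == "affinity":
        # Note: the first member of an affinity group needs a bootstrap
        # path (any host may take it while no member exists anywhere).
        return has_member
    return True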
