[openstack-dev] [nova] Proposal for an Experiment

Clint Byrum clint at fewbar.com
Mon Jul 20 20:04:04 UTC 2015


Excerpts from Chris Friesen's message of 2015-07-20 12:17:29 -0700:
> On 07/20/2015 11:40 AM, Clint Byrum wrote:
> 
> > To your earlier point about state being abused in the system, I
> > totally 100% agree. In the past I've wondered a lot if there can be a
> > worker model, where compute hosts all try to grab work off queues if
> > they have available resources. So API requests for boot/delete don't
> > change any state, they just enqueue a message. Queues would be matched
> > up to resources and the more filter choices, the more queues. Each
> > time a compute node completed a task (create vm, destroy vm) it would
> > re-evaluate all of the queues and subscribe to the ones it could satisfy
> > right now. Quotas would simply be the first stop for the enqueued create
> messages, and a final stop for the enqueued delete messages (once it's
> done, release quota). If you haven't noticed, this would agree with Robert
> Collins's suggestion that something like Kafka is a technology more suited
> to this (or my favorite old, often-forgotten solution to this, Gearman ;)
> >
> > This would have no global dynamic state, and very little local dynamic
> > state. API, conductor, and compute nodes simply need to know all of the
> > choices users are offered, and there is no scheduler at runtime, just
> > a predictive queue-list-manager that only gets updated when choices are
> > added or removed. This would relieve a ton of the burden currently put
> > on the database by scheduling since the only accesses would be simple
> > read/writes (that includes 'server-list' type operations since that
> > would read a single index key).
> 
> Some questions:
> 
> 1) Could you elaborate a bit on how this would work?  I don't quite understand 
> how you would handle a request for booting an instance with a certain set of 
> resources--would you queue up a message for each resource?
> 

Please be concrete about what you mean by "resource".

I'm suggesting that if you only have flavors, which have cpu, ram, disk,
and rx/tx ratios, then each flavor is a queue. That's the easiest problem
to solve. Then if you have a single special thing that can only have one
VM per host (let's say, a PCI pass-through device), then that's another
iteration of each flavor. So, assuming 3 flavors:

1=tiny   cpu=1,ram=1024m,disk=5gb,rxtx=1
2=medium cpu=2,ram=4096m,disk=100gb,rxtx=2
3=large  cpu=8,ram=16384m,disk=200gb,rxtx=2

This means you have these queues:

reserve
release
compute,cpu=1,ram=1024m,disk=5gb,rxtx=1,pci=1
compute,cpu=1,ram=1024m,disk=5gb,rxtx=1
compute,cpu=2,ram=4096m,disk=100gb,rxtx=2,pci=1
compute,cpu=2,ram=4096m,disk=100gb,rxtx=2
compute,cpu=8,ram=16384m,disk=200gb,rxtx=2,pci=1
compute,cpu=8,ram=16384m,disk=200gb,rxtx=2

You also have a delete queue per compute node (and migrate, and so on;
RPC stays pretty much unchanged at the single-instance level).
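
To make the enumeration concrete, here's a rough sketch in Python. The
flavor table and queue-name format are just for illustration, not any
real Nova, Gearman, or Kafka convention:

from itertools import combinations

# Toy flavor table matching the example above.
FLAVORS = {
    "tiny":   dict(cpu=1, ram="1024m",  disk="5gb",   rxtx=1),
    "medium": dict(cpu=2, ram="4096m",  disk="100gb", rxtx=2),
    "large":  dict(cpu=8, ram="16384m", disk="200gb", rxtx=2),
}
EXTRAS = ["pci"]  # one-VM-per-host extras; each one doubles the queues

def compute_queues():
    """Yield one queue name per flavor x subset-of-extras."""
    for f in FLAVORS.values():
        base = "compute,cpu=%(cpu)s,ram=%(ram)s,disk=%(disk)s,rxtx=%(rxtx)s" % f
        for n in range(len(EXTRAS), -1, -1):
            for subset in combinations(EXTRAS, n):
                yield base + "".join(",%s=1" % e for e in subset)

for q in compute_queues():
    print(q)
# ...plus reserve, release, and the per-host delete/migrate queues.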

So, compute nodes that have the PCI device boot up, query the flavors
table, and subscribe to the compute queues they can satisfy right now
(which would be _all_ of them, assuming they have 16G of RAM available).

A user asks for a tiny + PCI pass-through instance. The API node injects
a message into the reserve queue; a conductor receives it, checks the
user's quota, bumps usage by 1, and then sends it to the appropriate
compute queue. A compute node receives it, starts the VM, ACKs the job
(so it is dropped from the queue and won't be retried), and then compares
its capabilities against the queues, unsubscribing from all of the pci=1
queues since its one PCI device is now in use.
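
The re-evaluation step is just "compare free resources to every queue
spec". A minimal sketch, with a fake in-memory client standing in for
whatever Gearman/Kafka would actually provide:

class FakeQueueClient:
    """Stand-in for a real queue client; just records subscriptions."""
    def __init__(self):
        self.subscribed = set()
    def subscribe(self, name):
        self.subscribed.add(name)
    def unsubscribe(self, name):
        self.subscribed.discard(name)

def can_satisfy(free, spec):
    """Can this host take one more job from a queue with this spec?"""
    return (free["cpu"] >= spec["cpu"]
            and free["ram_mb"] >= spec["ram_mb"]
            and free["disk_gb"] >= spec["disk_gb"]
            and free["pci"] >= spec.get("pci", 0))

def resubscribe(client, free, queue_specs):
    """Run after every create/destroy: listen only on queues this host
    can satisfy *right now*."""
    for name, spec in queue_specs.items():
        if can_satisfy(free, spec):
            client.subscribe(name)
        else:
            client.unsubscribe(name)

After the tiny + PCI boot above, free["pci"] goes to 0, so the next
resubscribe() call drops every pci=1 queue; a delete puts them back.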

When the user deletes the instance, the compute node receives that on
its delete queue, removes the instance, and then sends a message on the
release queue so the resources can be returned to the user's quota (or
we can talk about whether to release them earlier; when releasing
happens is a sub-topic).
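
The conductor's side could be as dumb as a counter (again just a sketch;
in practice the quota record would live in the database):

QUOTA_LIMIT = {"alice": 10}  # instances allowed per tenant
QUOTA_USED = {"alice": 0}    # instances reserved or running

def handle_reserve(msg, send_to_compute_queue):
    """First stop for create messages."""
    tenant = msg["tenant"]
    if QUOTA_USED[tenant] >= QUOTA_LIMIT[tenant]:
        return  # over quota: reject before it ever hits a compute queue
    QUOTA_USED[tenant] += 1
    send_to_compute_queue(msg)

def handle_release(msg):
    """Final stop for delete messages: the VM is gone, return the quota."""
    QUOTA_USED[msg["tenant"]] -= 1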

Now, I've made this argument in the past, and people have pointed out
that the permutations can get into the tens of thousands very easily
if you start adding lots of dimensions and/or flavors. I suggest that
is no big deal, but maybe I'm biased because I have done something like
that in Gearman and it was, in fact, no big deal.
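
The arithmetic behind that worry: with F flavors and D independent
one-VM-per-host extras like the PCI device, you get F * 2^D compute
queues, so the count does blow up quickly:

flavors, extra_dims = 30, 10
print(flavors * 2 ** extra_dims)  # 30720 compute queues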

> 2) How would it handle stuff like weight functions where you could have multiple 
> compute nodes that *could* satisfy the requirement but some of them would be 
> "better" than others by some arbitrary criteria.
>

Can you provide a concrete example? It feels like I'm asking for a straw
man to be built. ;)

> 3) The biggest improvement I'd like to see is in group scheduling.  Suppose I 
> want to schedule multiple instances, each with their own resource requirements, 
> but also with interdependency between them (these ones on the same node, these 
> ones not on the same node, these ones with this provider network, etc.)  The 
> scheduler could then look at the whole request all at once and optimize it 
> rather than looking at each piece separately.  That could also allow relocating 
> multiple instances that want to be co-located on the same compute node.
> 

So, if the grouping is arbitrary, then there's no way to pre-calculate
the group size, I agree. I am loath to pursue something like this,
though, as I don't really think this is the kind of optimization that
cloud workloads should be built on top of. If you need two processes to
have low latency between them, why not just boot a bigger machine and do
it all in one VM? There are a few reasons I can think of, but I wonder
how many apply in the general case.

Anyway, another way to do group scheduling is to have macro-queues where
the compute nodes all subscribe to the biggest version of each dimension
they could possibly handle, and the group scheduler sends to those. But
that gets very messy and starts to feel like a centralized scheduler
process with omniscience would be a better choice. I also question how
many people's current workloads on OpenStack would stop working if you
didn't have the grouping ability, and how much better those workloads
would be if they simply had more hardware available to them because
there's less control-plane overhead and fewer error states for the
operator to spend money on.


