[openstack-dev] Bare-metal node scheduling

David Kang dkang at isi.edu
Mon Oct 8 20:53:07 UTC 2012


 Mark,

 I think Mark's suggestion has many advantages.
However, I'm not sure how realistic the assumption "each compute node has a homogeneous set of bare-metal nodes" is in the real world.
I'd like to ask for other people's opinions on that, especially people from industry such as Calxeda, NTT, ...
Please let us know your experiences and opinions.

 If we have to change the code according to the review, this would be another fundamental design change.
Let me tell you how our design has evolved.

Our initial design did not change the upstream scheduler at all, except for adding BaremetalHostManager.
BaremetalHostManager (used in place of HostManager) takes care of the special communication with the bare-metal nova-compute.
In that design, the bare-metal nova-compute reports only the largest available resources to BaremetalHostManager.
But Vish and other people did not like that and suggested that the bare-metal nova-compute report multiple entries of capabilities instead.
(We exchanged many emails on this mailing list before settling on the 2nd design.)
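
(To illustrate the 1st design: the bare-metal nova-compute reported a single set of capabilities describing only its largest free bare-metal node, so the existing scheduler could treat it like an ordinary hypervisor. The snippet below is just a rough sketch with made-up field names, not the actual code from our patches.)

def get_host_capabilities(bm_nodes):
    # 1st design (sketch): report the largest (by memory) free bare-metal
    # node's resources as if they were the free capacity of the whole host.
    free = [n for n in bm_nodes if n.get('instance_uuid') is None]
    if not free:
        return {'free_ram_mb': 0, 'free_disk_gb': 0, 'vcpus_free': 0}
    largest = max(free, key=lambda n: (n['memory_mb'], n['local_gb']))
    return {'free_ram_mb': largest['memory_mb'],
            'free_disk_gb': largest['local_gb'],
            'vcpus_free': largest['cpus']}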

2nd design: this is the design that is currently under review.
Now the bare-metal nova-compute reports multiple entries of capabilities, one per bare-metal machine.
We knew that this design change would cause (non-trivial) changes in the upstream scheduler.
After the 2nd design was summarized in that email chain on August 28, our team worked on that basis.
But now the multiple entries of capabilities seem to disrupt the main nova code too much. Sigh~
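
(For comparison, under the 2nd design the bare-metal nova-compute conceptually reports one capability entry per bare-metal node, so the scheduler can pick an exact node rather than just a host. Again, this is only an illustrative sketch with made-up field names, not our actual patch:)

def get_node_capabilities(bm_nodes):
    # 2nd design (sketch): one capability entry per bare-metal node,
    # so the scheduler sees every node's size individually.
    return [{'node_id': n['id'],
             'memory_mb': n['memory_mb'],
             'local_gb': n['local_gb'],
             'cpus': n['cpus'],
             'free': n.get('instance_uuid') is None}
            for n in bm_nodes]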

 We hope to make the right design decision as early as possible to avoid wasted effort.

 Thanks,


----------------------
Dr. Dong-In "David" Kang
Computer Scientist
USC/ISI

----- Original Message -----
> On Mon, 2012-10-08 at 14:21 +0100, John Garbutt wrote:
> > Interesting ideas.
> >
> > > What we're doing is allowing the scheduler to choose a compute node
> > > based on the details of the individual bare-metal nodes available
> > > via the compute node. However, the compute node is still responsible
> > > for choosing which bare-metal node to provision.
> >
> > While I don't like this approach, it could be used for hypervisor
> > pools. We did wonder about this for XenServer pools. However, it just
> > seemed too messy, for example when you want to live migrate between
> > two members of the pool using nova.
> 
> Yeah, I'm not loving the idea of the nova scheduler knowing much, if
> anything, about the details of the resources available to a
> virt-driver-layer scheduler.
> 
> Another example: if there were a virt driver for oVirt, I'd much rather
> nova knew nothing about individual oVirt hosts, and instead the admin
> configured a bunch of compute slots representing the resources which
> nova is allowed to consume from an oVirt cluster.
> 
> > > As for terminology, rather than the scheduler considering "nodes",
> > > I think "slots" would be less confusing.
> > >
> > > You could imagine extending this scheme to other virt drivers to
> > > give providers the option of a much simpler and more predictable
> > > scheduling strategy. You could configure a compute node to have
> > > e.g. 10 medium-size "slots" and the scheduler would only ever
> > > schedule 10 medium-size instances to that node. This could
> > > potentially be a way for providers to simplify their capacity
> > > planning.
> >
> > This sounds like a good idea.
> 
> Cool.
> 
> > I have wondered about an alternative scheduler where each nova-compute
> > node is configured with a supported set of flavours, and it reports to
> > the scheduler how many of each flavour it still has the capacity to
> > run (i.e. a full-ish hypervisor reports: 4 tiny instances or 1 small
> > instance, 0 large instances, etc., but baremetal: 0 tiny, 3 small, 10
> > large, etc.). That seems to unify the two cases.
> 
> Yeah, that's the way I'm thinking.
> 
> The issue with making this about configuring a compute node with a set
> of flavours is that we're working towards having the compute node not
> access the DB at all.
> 
> This means the "compute slots" configuration would need to live in the
> DB. I guess that's pretty nice in a way, because we can have a proper
> admin API for it.
> 
> > For the above, I was thinking about GPU pass-through. You probably
> > don't want to fill up GPU pass-through enabled hypervisors with
> > standard instances, unless there is no other option. So you could use
> > the above information to write such a server. Once you have used the
> > GPUs, you might want to fill up the server with tiny instances to
> > maybe save on power.
> 
> You could use slots for this, but the simple version wouldn't have the
> flexibility around allowing GPU slots to be used for standard instances
> if there was no room elsewhere.
> 
> Cheers,
> Mark.
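
P.S. To make the per-flavour "slots" idea in the quoted discussion above more concrete, here is a very rough sketch of what such reporting and scheduling could look like. None of these names exist in nova today; the numbers just mirror John's example.

# Hypothetical per-flavour capacity report from each compute node.
slots_report = {
    'kvm-host-1':       {'m1.tiny': 4, 'm1.small': 1, 'm1.large': 0},
    'baremetal-host-1': {'m1.tiny': 0, 'm1.small': 3, 'm1.large': 10},
}

def pick_host(flavor, report):
    # Trivial "scheduler": pick any host that still has a free slot
    # for the requested flavour, or None if no host does.
    for host, slots in report.items():
        if slots.get(flavor, 0) > 0:
            return host
    return None

# e.g. pick_host('m1.large', slots_report) -> 'baremetal-host-1'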


