[openstack-dev] Scheduler proposal

Ian Wells ijw.ubuntu at cack.org.uk
Fri Oct 9 01:32:19 UTC 2015

On 8 October 2015 at 13:28, Ed Leafe <ed at leafe.com> wrote:

> On Oct 8, 2015, at 1:38 PM, Ian Wells <ijw.ubuntu at cack.org.uk> wrote:
> > Truth be told, storing that data in MySQL is secondary to the correct
> > functioning of the scheduler.
>
> I have no problem with MySQL (well, I do, but that's not relevant to this
> discussion). My issue is that the current system poorly replicates its data
> from MySQL to the places where it is needed.

Well, the issue is that the data shouldn't be replicated from the database
at all.  There doesn't need to be One True Copy of data here (though I
think the point further down is why we're differing on that).

> > Is there any reason why the duplication (given it's not a huge amount of
> > data - megabytes, not gigabytes) is a problem?  Is there any reason why
> > inconsistency is a problem?
>
> I'm sure that many of the larger deployments may have issues with the
> amount of data that must be managed in-memory by so many different parts of
> the system.

I wonder about that.  If I have a scheduler making a scheduling decision I
don't want it calling out to a database and the database calling out to
offline storage just to find the information, at least not if I can
possibly avoid it.  It's a critical path element in every boot call.

Given that what we're talking about is generally a bunch of resource values
for each host, I'm not sure how big this gets, even in the 100k host range,
but do you have a particularly sizeable structure in mind?
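
For a rough sense of scale, here is a back-of-envelope sketch. The figures
of 20 resource values per host and 8 bytes per value are purely my
assumptions for illustration, and this counts raw packed data only, not the
per-object overhead a real in-memory structure would add.

```python
def estimate_state_bytes(num_hosts, values_per_host=20, bytes_per_value=8):
    """Raw size of per-host resource data, ignoring container overhead.

    values_per_host and bytes_per_value are assumed figures, not
    measurements from any real deployment.
    """
    return num_hosts * values_per_host * bytes_per_value


if __name__ == "__main__":
    size = estimate_state_bytes(100_000)
    print(f"{size / 1e6:.0f} MB for 100k hosts")
```

Under those assumptions 100k hosts comes to 16 MB of raw values, which is
why I'd expect megabytes rather than gigabytes unless the per-host
structure is much larger than this.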

> Inconsistency is a problem, but one that has workarounds. The primary
> issue is scalability: with the current design, increasing the number of
> scheduler processes increases the raciness of the system.

And again, given your point below I see where you're coming from here, but
I think the key here is to make two schedulers considerably *less* likely
to make the same choice on the same information.

> > I do sympathise with your point in the following email where you have 5
> > VMs scheduled by 5 schedulers to the same host, but consider:
> >
> > 1. if only one host suits the 5 VMs this results in the same behaviour:
> > 1 VM runs, the rest don't.  There's more work to discover that but arguably
> > less work than maintaining a consistent database.
>
> True, but in a large scale deployment this is an extremely rare case.

Indeed; I'm trying to get that one out of the way.

> > 2. if many hosts suit the 5 VMs then this is *very* unlucky, because we
> > should be choosing a host at random from the set of suitable hosts and
> > that's a huge coincidence - so this is a tiny corner case that we shouldn't
> > be designing around
>
> Here is where we differ in our understanding. With the current system of
> filters and weighers, 5 schedulers getting requests for identical VMs and
> having identical information are *expected* to select the same host. It is
> not a tiny corner case; it is the most likely result for the current system
> design. By catching this situation early (in the scheduling process) we can
> avoid multiple RPC round-trips to handle the fail/retry mechanism.

And so maybe this would be a different fix - choose, at random, one of the
hosts above a weighting threshold, rather than choosing the top host every
time?  Technically, any host passing the filters is adequate to the task from
the perspective of an API user (and they can't prove whether they got the
highest weighting or not), so if we treat weighting as an operator
preference, and just weaken it slightly, we'd have a few more options.
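
To make that concrete, here is a minimal sketch of the idea. It is not the
existing Nova scheduler code: pick_host, the (host, weight) pair format and
the threshold knob are all hypothetical names of mine, and it assumes
weights are non-negative with higher meaning better.

```python
import random


def pick_host(weighed_hosts, threshold=0.9, rng=random):
    """Pick one host at random from those scoring near the top.

    weighed_hosts: list of (host_name, weight) pairs that have already
        passed the filters (hypothetical format).
    threshold: fraction of the best weight a host must reach to stay in
        the candidate pool (hypothetical knob; 1.0 means "top host only").
    rng: the random module or a random.Random instance.
    """
    if not weighed_hosts:
        raise ValueError("no hosts passed the filters")
    best = max(weight for _, weight in weighed_hosts)
    # Assumes non-negative weights; for best <= 0 only ties with the best
    # weight survive.
    cutoff = best * threshold if best > 0 else best
    candidates = [host for host, weight in weighed_hosts if weight >= cutoff]
    return rng.choice(candidates)


if __name__ == "__main__":
    hosts = [("host-a", 10.0), ("host-b", 9.5), ("host-c", 5.0)]
    print(pick_host(hosts))  # host-a or host-b, never host-c
```

The threshold is the "weaken it slightly" part: the operator's preference
still rules out clearly worse hosts, but schedulers holding identical
information no longer all converge on a single host.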

Again, we want to avoid overscheduling to a host, which will eventually
cause a decline and a reschedule.  But something that on balance probably
won't overschedule is adequate; overscheduling sucks but is not in fact the
end of the world as long as it's not every single time.
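
As a rough way to put numbers on "on balance probably won't", here is a
small sketch. It assumes each scheduler picks uniformly and independently
from the same pool of equally acceptable hosts, which is a simplification
of mine and not how the current weighers behave; the function names are
hypothetical.

```python
def collision_probability(num_schedulers, num_candidates):
    """Chance that at least two schedulers pick the same host.

    Assumes uniform, independent picks from num_candidates hosts.
    """
    if num_schedulers > num_candidates:
        return 1.0
    p_all_distinct = 1.0
    for i in range(num_schedulers):
        p_all_distinct *= (num_candidates - i) / num_candidates
    return 1.0 - p_all_distinct


def all_same_probability(num_schedulers, num_candidates):
    """Chance that every scheduler picks the very same host."""
    if num_schedulers <= 1:
        return 1.0
    return (1.0 / num_candidates) ** (num_schedulers - 1)


if __name__ == "__main__":
    print(collision_probability(5, 100))
    print(all_same_probability(5, 100))
```

With 5 schedulers and 100 candidate hosts, some pair collides a little
under 10% of the time, while all 5 landing on one host is around 1 in
100 million. A pool of one candidate, which is what always taking the top
host amounts to, makes both probabilities 1.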

I'm not averse to the central database if we need the central database, but
I'm not sure how much we do at this point, and a central database will
become a point of contention, I would think, beyond the cost of the above