[openstack-dev] The PCI support blueprint

Bhandaru, Malini K malini.k.bhandaru at intel.com
Mon Jul 22 21:41:08 UTC 2013

Ian, your suggestion of retrieving changes since a timestamp is good.  When a scheduler first comes online (in an HA context), it requests compute node status, passing Null as the timestamp to retrieve everything.

It also paves the way for a full in-memory record of all compute node status, because it requires each scheduler to keep its own copy of the status.

The scheduler could retrieve status every second, or whenever it gets a new VM request. Under heavy load (that is, frequent requests) the timestamps would be closer together and, hopefully, fewer changes would be returned each time. We may want to make the polling frequency a configurable item to tune: polling too infrequently means a large payload (though no worse than today's full load), while polling too often may be moot.
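
To make the tuning question concrete, here is a rough Python sketch of a poller that is asked to refresh on every VM request but never polls more often than a configurable interval. The names ThrottledPoller, min_interval and maybe_poll are made up for illustration and are not part of nova; fetch stands in for whatever call actually retrieves compute node status.

```python
import time


class ThrottledPoller:
    """Run fetch() on demand, but at most once per min_interval seconds.

    Hypothetical helper: 'fetch' is any callable that retrieves the
    compute node status changes; it is not a real nova API.
    """

    def __init__(self, fetch, min_interval=1.0, clock=time.monotonic):
        self.fetch = fetch
        self.min_interval = min_interval  # the configurable polling knob
        self.clock = clock                # injectable so it can be tested
        self._last = None                 # time of the last poll, if any

    def maybe_poll(self):
        """Poll unless the previous poll was too recent. Returns True if polled."""
        now = self.clock()
        if self._last is not None and now - self._last < self.min_interval:
            return False
        self.fetch()
        self._last = now
        return True
```

The scheduler would call maybe_poll() when a VM request arrives; under heavy load most calls return immediately and the cached status is reused.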


-----Original Message-----
From: Ian Wells [mailto:ijw.ubuntu at cack.org.uk] 
Sent: Monday, July 22, 2013 1:56 PM
To: OpenStack Development Mailing List
Subject: Re: [openstack-dev] The PCI support blueprint

On 22 July 2013 21:08, Boris Pavlovic <boris at pavlovic.me> wrote:
> Ian,
> I don't like to write anything personally.
> But I have to write some facts:
> 1) I see tons of hands and only 2 solutions my and one more that is 
> based on code.
> 2) My code was published before session (18. Apr 2013)
> 3) Blueprints from summit were published (03. Mar 2013)
> 4) My Blueprints were published (25. May 2013)
> 5) Patches based on my patch were published only (5. Jul 2013)

Absolutely.  Your patch and our organisation crossed in the mail, and everyone held off work on this because you were working on this.
That's perfectly normal, just unfortunate, and I'm grateful for your work on this, not pointing fingers.

> After making investigations and tests we found that one of the reason 
> why scheduler works slow and has problems with scalability is work with DB.
> JOINS are pretty unscalable and slow thing and if we add one more JOIN 
> that is required by PCI passthrough we will get much worse situation.

Your current PCI passthrough design adds a new database that stores every PCI device in the cluster, and you're thinking of crossing that with the compute node and its friends.  That's certainly unscalable.

I think the issue here is, in fact, more that you're storing every PCI device.  The scheduler doesn't care.  In most cases, devices are equivalent, so instead of storing 1024 devices you can store one single row in the stats table saying pci_device_class_networkcard = 1024.  There may be a handful of these classes, but there won't be 1024 of them per cluster node.  The compute node can take any one of the PCI devices in that class and use it - the scheduler should neither know nor care.
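
As a minimal sketch of that aggregation, assuming each device is reported as a dict with a "device_class" field (a made-up record layout, not the format used by any existing patch), the compute node could collapse its device list into per-class counts before reporting stats:

```python
from collections import Counter


def summarize_pci_devices(devices):
    """Collapse per-device records into one count per device class.

    'devices' is a list of dicts with a 'device_class' key; this layout
    is an assumption for illustration only.
    """
    counts = Counter(dev["device_class"] for dev in devices)
    # One stats entry per class, keyed like pci_device_class_networkcard.
    return {"pci_device_class_%s" % cls: n for cls, n in counts.items()}


# Example: four equivalent network cards and one GPU on a node.
devices = [
    {"address": "0000:%02x:00.0" % i, "device_class": "networkcard"}
    for i in range(4)
]
devices.append({"address": "0000:ff:00.0", "device_class": "gpu"})
stats = summarize_pci_devices(devices)
```

Only the resulting handful of key/value pairs would travel to the stats table; the individual addresses stay on the compute node.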

This drastically reduces the transfer of information from the compute node to host and also reduces the amount of data you need to store in the database - and the scheduler DB doesn't need changing at all.

This seems like a much lower-impact approach for now - it doesn't change the database at all and it doesn't add much to the scheduling problem (indeed, no overhead at all for the non-PCI users) until we solve the scalability issues you're talking about at some later date.

For what it's worth, one way of doing that without drastic database design would be to pass compute_node_get_all a timestamp, return only stats updated since that timestamp, return a new timestamp, and merge that in with what the scheduler already knows about.  There's some refinement to that - since timestamps are not reliable clocks in databases - but it reduces the flow of data from the DB file substantially and works with an eventually consistent system.
(Truthfully, I prefer your in-memory-store idea, there's nothing about these stats that really needs to survive a reboot of the control node, but this might be a quick fix.)
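
A rough Python sketch of the "changes since" idea follows. It is not the real compute_node_get_all signature: FakeStatsStore and SchedulerView are invented stand-ins, and the sketch swaps the wall-clock timestamp for a monotonically increasing version counter, which is one possible way around the unreliable-clock problem mentioned above.

```python
class FakeStatsStore:
    """In-memory stand-in for the compute node stats table (hypothetical)."""

    def __init__(self):
        self._version = 0  # bumped on every write; plays the timestamp role
        self._rows = {}    # host -> (version at last update, stats dict)

    def update(self, host, stats):
        self._version += 1
        self._rows[host] = (self._version, dict(stats))

    def compute_node_get_all(self, since=None):
        """Return (rows changed after 'since', new marker). None means everything."""
        changed = {
            host: stats
            for host, (ver, stats) in self._rows.items()
            if since is None or ver > since
        }
        return changed, self._version


class SchedulerView:
    """The scheduler's local copy of node stats, refreshed by merging deltas."""

    def __init__(self, store):
        self.store = store
        self.since = None  # None on first refresh, so everything is fetched
        self.nodes = {}

    def refresh(self):
        changed, self.since = self.store.compute_node_get_all(self.since)
        self.nodes.update(changed)  # merge with what we already know
        return len(changed)
```

A scheduler that has just started calls refresh() once to load every node, and each later call transfers only the rows that changed in between.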

OpenStack-dev mailing list
OpenStack-dev at lists.openstack.org
