[openstack-dev] The PCI support blueprint

Ian Wells ijw.ubuntu at cack.org.uk
Mon Jul 22 20:55:51 UTC 2013

On 22 July 2013 21:08, Boris Pavlovic <boris at pavlovic.me> wrote:
> Ian,
> I don't like to write anything personally.
> But I have to write some facts:
> 1) I see tons of hands and only 2 solutions my and one more that is based on
> code.
> 2) My code was published before session (18. Apr 2013)
> 3) Blueprints from summit were published (03. Mar 2013)
> 4) My Blueprints were published (25. May 2013)
> 5) Patches based on my patch were published only (5. Jul 2013)

Absolutely.  Your patch and our organisation crossed in the mail, and
everyone held off work on this because you were working on this.
That's perfectly normal, just unfortunate, and I'm grateful for your
work on this, not pointing fingers.

> After making investigations and tests we found that one of the reason why
> scheduler works slow and has problems with scalability is work with DB.
> JOINS are pretty unscalable and slow thing and if we add one more JOIN that
> is required by PCI passthrough we will get much worse situation.

Your current PCI passthrough design adds a new database that stores
every PCI device in the cluster, and you're thinking of crossing that
with the compute node and its friends.  That's certainly unscalable.

I think the issue here is, in fact, more that you're storing every PCI
device.  The scheduler doesn't care.  In most cases, devices are
equivalent, so instead of storing 1024 devices you can store one
single row in the stats table saying pci_device_class_networkcard =
1024.  There may be a handful of these classes, but there won't be
1024 of them per cluster node.  The compute node can take any one of
the PCI devices in that class and use it - the scheduler should
neither know nor care.

This drastically reduces the transfer of information from the compute
node to host and also reduces the amount of data you need to store in
the database - and the scheduler DB doesn't need changing at all.

This seems like a much lower-impact approach for now - it doesn't
change the database at all and it doesn't add much to the
scheduling problem (indeed, no overhead at all for the non-PCI users)
until we solve the scalability issues you're talking about at some
later date.
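
As a rough sketch of the per-class counting described above: the
compute node collapses its device list into one count per class and
reports only those counts as stats. The function name and the "gpu"
class are made up for illustration and are not existing Nova code; the
stat key follows the pci_device_class_networkcard example.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_pci_devices(device_classes: Iterable[str]) -> Dict[str, int]:
    """Collapse a per-device list into one stats entry per device class.

    Hypothetical helper: takes the class name of every free PCI device on
    a compute node and returns {stat_key: count}.
    """
    counts = Counter(device_classes)
    return {f"pci_device_class_{cls}": n for cls, n in counts.items()}


# 1024 equivalent network devices plus a few of an invented second class.
devices = ["networkcard"] * 1024 + ["gpu"] * 4
stats = summarize_pci_devices(devices)
# `stats` holds two entries rather than 1028 per-device rows.
```

The scheduler only ever sees the counts; which physical device gets
used stays a decision for the compute node.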

For what it's worth, one way of doing that without a drastic database
redesign would be to pass compute_node_get_all a timestamp, return only
stats updated since that timestamp, return a new timestamp, and merge
that in with what the scheduler already knows about.  There's some
refinement to that - since timestamps are not reliable clocks in
databases - but it reduces the flow of data from the DB file
substantially and works with an eventually consistent system.
(Truthfully, I prefer your in-memory-store idea; there's nothing about
these stats that really needs to survive a reboot of the control node,
but this might be a quick fix.)
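
A minimal sketch of that incremental fetch, under some loud
assumptions: compute_node_get_all does not take a timestamp today, so
get_stats_since below is a hypothetical stand-in working on an
in-memory list of rows, and the row layout is invented. It compares
with ">=" so that rows sharing the cursor timestamp are fetched again
rather than missed; because the merge just overwrites values, the
re-fetch is harmless. That is one possible answer to the
unreliable-clock refinement, not the only one.

```python
from typing import Dict, List, Tuple

# Hypothetical row layout: (host, stat_key, value, updated_at)
Row = Tuple[str, str, int, float]


def get_stats_since(rows: List[Row], since: float) -> Tuple[List[Row], float]:
    """Return rows updated at or after `since`, plus the next cursor.

    The cursor is the newest updated_at seen, so it comes from the stored
    rows rather than from the caller's own clock.
    """
    changed = [r for r in rows if r[3] >= since]
    new_since = max((r[3] for r in changed), default=since)
    return changed, new_since


def merge_stats(cache: Dict[Tuple[str, str], int], changed: List[Row]) -> None:
    """Overwrite cached values with the changed rows (idempotent)."""
    for host, key, value, _updated_at in changed:
        cache[(host, key)] = value
```

The scheduler would keep the cache and the cursor between calls, so
each poll only moves the rows that changed since the previous one.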
