<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On 7 October 2015 at 16:00, Chris Friesen <span dir="ltr">&lt;<a href="mailto:chris.friesen@windriver.com" target="_blank">chris.friesen@windriver.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">1) Some resources (RAM) only require tracking amounts.  Other resources (CPUs, PCI devices) require tracking allocation of specific individual host resources (for CPU pinning, PCI device allocation, etc.).  Presumably for the latter we would have to actually do the allocation of resources at the time of the scheduling operation in order to update the database with the claimed resources in a race-free way.<br></blockquote><div><br></div><div>The whole process is inherently racy (and this is inevitable, and correct), which is why the scheduler works the way it does:<br><br></div><div>- scheduler guesses at a host based on (guaranteed - hello distributed systems!) outdated information<br></div><div>- VM is scheduled to a host that looks like it might work, and host attempts to run it<br></div><div>- VM run may fail (because the information was outdated or has become outdated), in which case we retry the schedule<br><br></div><div>In fact, with PCI devices the code has been written rather carefully to make sure that they fit into this model.  
There is central per-device tracking (which, fwiw, I argued against back in the day) but that's not how allocation works (or, given how long it is since I looked, how it worked).<br><br>PCI devices are actually allocated from pools of equivalent devices, and allocation works in the same manner as other scheduling: you work out from the nova boot call what constraints a host must satisfy (in this case, the number of PCI devices needed from specific pools), you check your best guess at global host state against those constraints, and you pick one of the hosts that meets the constraints to schedule on.<br><br></div><div>So: yes, there is a central registry of devices, which we try to keep up to date - but this is for admins to refer to, it's not a necessity of scheduling.  The scheduler input is the pool counts, which work much the same way as available memory does for scheduling and updating.<br><br></div><div>No idea on CPUs, sorry, but again I'm not sure why the behaviour would be any different: compare suspected host state against needs, schedule if it fits, hope you got it right and tolerate it if you didn't.<br></div><div><br></div><div>That being the case, it's worth noting that the database can be eventually consistent and doesn't need to be transactional.  It's also worth considering that the database can have multiple (mutually inconsistent) copies.  There's no need to use a central datastore if you don't want to - one theoretical example is to run multiple schedulers and let each scheduler collate cloud state from unreliable messages sent by the compute hosts.  This is not quite what happens today, because the messages we send over Rabbit are reliable and therefore costly.<br></div>-- <br></div><div class="gmail_quote">Ian.<br></div></div></div>
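<div>For illustration only, here is a minimal Python sketch of the guess / attempt / retry loop described above, with pool counts as the scheduler input. It is not Nova code: <code>Host</code>, <code>try_claim</code>, <code>schedule</code> and the "gpu" pool name are hypothetical, and the scheduler's view is just a dict that is allowed to be out of date.</div>

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical compute host holding the real (authoritative) pool counts."""
    name: str
    pools: dict  # pool name -> number of free devices in that pool

    def try_claim(self, request):
        """Claim devices locally; refuse if the real counts fall short."""
        if any(self.pools.get(pool, 0) < n for pool, n in request.items()):
            return False
        for pool, n in request.items():
            self.pools[pool] -= n
        return True


def schedule(request, view, hosts, max_attempts=3):
    """Pick a host from a possibly stale view and retry if its claim fails."""
    tried = set()
    for _ in range(max_attempts):
        # Filter on the scheduler's best guess at host state, not the truth.
        candidates = [
            name for name, pools in view.items()
            if name not in tried
            and all(pools.get(pool, 0) >= n for pool, n in request.items())
        ]
        if not candidates:
            return None
        name = candidates[0]
        tried.add(name)
        if hosts[name].try_claim(request):
            # The claim held on the host, so adjust our own guess to match.
            for pool, n in request.items():
                view[name][pool] -= n
            return name
        # The guess was stale: skip this host and try the next candidate.
    return None


# Host "a" has already run out of devices, but the scheduler has not heard yet.
hosts = {"a": Host("a", {"gpu": 0}), "b": Host("b", {"gpu": 2})}
stale_view = {"a": {"gpu": 1}, "b": {"gpu": 2}}
chosen = schedule({"gpu": 1}, stale_view, hosts)
```

<div>The only place a claim has to be exact is on the host itself; the scheduler's view can be wrong, and the cost of being wrong is one extra attempt on another host.</div>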