<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><br><div><div>On Aug 15, 2012, at 3:17 PM, Michael J Fork &lt;<a href="mailto:mjfork@us.ibm.com">mjfork@us.ibm.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div><p><font size="2" face="sans-serif">I am interested in finding a solution that enables bare-metal and virtualized requests to be serviced through the same scheduler where the compute_nodes table has a full view of schedulable resources.  This would seem to simplify the end-to-end flow while opening up some additional use cases (e.g. dynamic allocation of a node from bare-metal to hypervisor and back).  </font><br>

<br>

<font size="2" face="sans-serif">One approach would be to have a proxy running a single nova-compute daemon fronting the bare-metal nodes.  That nova-compute daemon would report up many HostState objects (1 per bare-metal node) to become entries in the compute_nodes table and accessible through the scheduler HostManager object.</font></p></div></blockquote><blockquote type="cite"><div><p><font size="2" face="sans-serif">The HostState object would set cpu_info, vcpus, memory_mb and local_gb values to be used for scheduling, with the hypervisor_host field holding the bare-metal machine address (e.g. for IPMI-based commands) and hypervisor_type = NONE.  The bare-metal Flavors are created with an extra_spec of hypervisor_type = NONE, and the corresponding compute_capabilities_filter would reduce the available hosts to those bare-metal nodes.  The scheduler would need to understand that hypervisor_type = NONE means you need an exact-fit (or best-fit) host rather than weighting them (perhaps through the multi-scheduler).  The scheduler would cast out the message to the &lt;topic&gt;.&lt;service-hostname&gt; (code today uses the HostState hostname), with the compute driver having to understand if it must be serviced elsewhere (but this does not break any existing implementations since it is 1 to 1).</font></p></div></blockquote><blockquote type="cite"><div><p>

<br>

<font size="2" face="sans-serif">Does this solution seem workable? Anything I missed?</font><br></p></div></blockquote><div>The bare-metal driver is already proxying for the other nodes, so it sounds like we need a few things to make this happen:</div><div><br></div><div>a) Modify driver.get_host_stats to be able to return a list of host stats instead of just one, and report the whole list back to the scheduler. We could modify the receiving end to accept a list as well, or just make multiple calls to </div><div><span class="Apple-tab-span" style="white-space: pre; ">   </span>self.update_service_capabilities(capabilities)</div><div><br></div><div>b) Make a few minor changes to the scheduler to make sure filtering still works. Note that the changes here may be very helpful:</div><div><br></div><div><a href="https://review.openstack.org/10327">https://review.openstack.org/10327</a></div><div><br></div><div>c) Make sure that instances launched on those nodes take up the entire host state somehow. We could probably do this by making sure that the instance_type values (RAM in MB, disk in GB, etc.) match what the node has, but we may want a new boolean field "used" if those aren't sufficient.</div><div><br></div><div>This approach seems pretty good. We could potentially get rid of the shared bare_metal_node table. The only other concern is how to populate the capabilities that the bare-metal nodes are reporting. I guess we could use an API extension that RPCs to a bare-metal node to add the node. Maybe someday this could be autogenerated by the bare-metal host looking in its ARP table for DHCP requests! :)</div><div><br></div><div>Vish</div></div><br></body></html>