<html><body>

<p><tt><font size="2">openstack-bounces+mjfork=us.ibm.com@lists.launchpad.net wrote on 08/27/2012 02:58:56 PM:<br>

<br>

> From: David Kang <dkang@isi.edu></font></tt><br>

<tt><font size="2">> To: Vishvananda Ishaya <vishvananda@gmail.com>, </font></tt><br>

<tt><font size="2">> Cc: OpenStack Development Mailing List <openstack-dev@lists.openstack.org>, <br>

> "openstack@lists.launchpad.net \(openstack@lists.launchpad.net\)" <openstack@lists.launchpad.net></font></tt><br>

<tt><font size="2">> Date: 08/27/2012 03:06 PM</font></tt><br>

<tt><font size="2">> Subject: Re: [Openstack] [openstack-dev] Discussion about where to <br>

> put database for bare-metal provisioning (review 10726)</font></tt><br>

<tt><font size="2">> Sent by: openstack-bounces+mjfork=us.ibm.com@lists.launchpad.net</font></tt><br>

<tt><font size="2">> <br>

> <br>

>  Hi Vish,<br>

> <br>

>  I think I understand your idea.<br>

> One service entry with multiple bare-metal compute_node entries is <br>

> registered at the start of bare-metal nova-compute.<br>

> 'hypervisor_hostname' must be different for each bare-metal machine<br>

> (such as 'bare-metal-0001.xxx.com', 'bare-metal-0002.xxx.com', etc.).<br>

> But their IP addresses must be the IP address of the bare-metal <br>

> nova-compute, such that an instance is cast <br>

> not to a bare-metal machine directly but to the bare-metal nova-compute.<br>

</font></tt><br>

<tt><font size="2">I believe the change here is to cast the message to <topic>.<service-hostname>. The existing code sends it to the compute_node hostname (see line 202 of nova/scheduler/filter_scheduler.py, specifically host=weighted_host.host_state.host). Casting to the service hostname instead would send the message to the bare-metal proxy node and should have no effect on current deployments, since there the service hostname and host_state.host are always equal. This model would also let you keep the bare-metal compute node IP in the compute node table.</font></tt><br>
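
<tt><font size="2">To make the routing idea concrete, here is a minimal sketch, not the actual nova code. The Service and ComputeNode classes and the cast_target helper are hypothetical stand-ins for the 'services' and 'compute_nodes' rows; the only thing taken from the discussion is that the queue name is built from the topic and the owning service's hostname rather than from the compute node's own hostname.</font></tt><br>

<pre>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical stand-in for a row in the 'services' table."""
    host: str   # hostname of the nova-compute process (the proxy)
    topic: str  # e.g. "compute"


@dataclass(frozen=True)
class ComputeNode:
    """Hypothetical stand-in for a row in the 'compute_nodes' table."""
    hypervisor_hostname: str  # unique per bare-metal machine
    service: Service          # many compute nodes may share one service


def cast_target(node: ComputeNode) -> str:
    """Build the queue name from the owning service, not from the node."""
    return f"{node.service.topic}.{node.service.host}"


# Two bare-metal machines fronted by a single proxy service.
proxy = Service(host="bespin101", topic="compute")
nodes = [
    ComputeNode("bare-metal-0001.xxx.com", proxy),
    ComputeNode("bare-metal-0002.xxx.com", proxy),
]
targets = [cast_target(n) for n in nodes]
print(targets)
```
</pre>

<tt><font size="2">Both machines resolve to the same queue, so the proxy receives the message and can decide which bare-metal machine to act on. For an ordinary hypervisor, where one service owns one compute node, the result is the same as casting to the node's host.</font></tt><br>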

<tt><font size="2"><br>

>  One extension we need to make on the scheduler side is to use (host, <br>

> hypervisor_hostname) instead of (host) alone in host_manager.py.<br>

> 'HostManager.service_state' is { <host> : { <service> : { cap k : v }}}.<br>

> It needs to be changed to { <host> : { <service> : { <br>

> <hypervisor_hostname> : { cap k : v }}}}.<br>

> Most functions of HostState need to be changed to use the (host, <br>

> hypervisor_hostname) pair to identify a compute node. <br>

</font></tt><br>
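
<tt><font size="2">For reference, the nested structure David describes could be sketched roughly as follows. This is an illustration under assumptions, not the HostManager implementation: the module-level service_state dictionary and the free function update_service_capabilities are hypothetical, and the capability values are made up.</font></tt><br>

<pre>
```python
from collections import defaultdict

# Hypothetical model of the proposed shape:
# {host: {service: {hypervisor_hostname: capabilities}}}
service_state = defaultdict(lambda: defaultdict(dict))


def update_service_capabilities(service, host, capabilities):
    """Store one capability dict per (host, hypervisor_hostname) pair."""
    node = capabilities["hypervisor_hostname"]
    # Copy so later changes by the caller do not alter the stored state.
    service_state[host][service][node] = dict(capabilities)


# One proxy host reporting two bare-metal machines (illustrative values).
reports = [
    ("bare-metal-0001.xxx.com", 4096),
    ("bare-metal-0002.xxx.com", 8192),
]
for name, ram in reports:
    update_service_capabilities(
        "compute",
        "bespin101",
        {"hypervisor_hostname": name, "memory_mb": ram},
    )

nodes = service_state["bespin101"]["compute"]
print(sorted(nodes))
```
</pre>

<tt><font size="2">Because the innermost key is the hypervisor hostname, the second report no longer overwrites the first, which is the problem seen when capabilities are keyed by host alone.</font></tt><br>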

<tt><font size="2">Would an alternative here be to change the top level "host" to be the hypervisor_hostname and enforce uniqueness?</font></tt><br>
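
<tt><font size="2">A rough sketch of what I mean by keying on hypervisor_hostname and enforcing uniqueness. Everything here is hypothetical (the NodeRegistry class, its method names, the DuplicateHypervisorError exception, and the second host name 'bespin102'); it only illustrates the idea of a flat map in which each hypervisor hostname belongs to exactly one service host.</font></tt><br>

<pre>
```python
class DuplicateHypervisorError(ValueError):
    """Raised when two service hosts claim the same hypervisor_hostname."""


class NodeRegistry:
    """Hypothetical flat map: hypervisor_hostname to (service host, caps)."""

    def __init__(self):
        self._nodes = {}

    def update(self, hypervisor_hostname, service_host, capabilities):
        current = self._nodes.get(hypervisor_hostname)
        # A node may be refreshed by its own service host,
        # but it may not be claimed by a different one.
        if current is not None and current[0] != service_host:
            raise DuplicateHypervisorError(hypervisor_hostname)
        self._nodes[hypervisor_hostname] = (service_host, dict(capabilities))

    def service_host_for(self, hypervisor_hostname):
        """Return the service host that a cast for this node should go to."""
        return self._nodes[hypervisor_hostname][0]


registry = NodeRegistry()
registry.update("bare-metal-0001.xxx.com", "bespin101", {"vcpus": 8})
registry.update("bare-metal-0001.xxx.com", "bespin101", {"vcpus": 16})  # refresh

try:
    registry.update("bare-metal-0001.xxx.com", "bespin102", {"vcpus": 8})
    conflict_rejected = False
except DuplicateHypervisorError:
    conflict_rejected = True

print(conflict_rejected, registry.service_host_for("bare-metal-0001.xxx.com"))
```
</pre>

<tt><font size="2">The trade-off is that the scheduler keeps a single-level lookup, at the cost of having to reject (or otherwise resolve) two services reporting the same hypervisor hostname.</font></tt><br>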

<tt><font size="2"><br>

>  Are we on the same page, now?<br>

> <br>

>  Thanks,<br>

>  David<br>

> <br>

> ----- Original Message -----<br>

> > Hi David,<br>

> > <br>

> > I just checked out the code more extensively and I don't see why you<br>

> > need to create a new service entry for each compute_node entry. The<br>

> > code in host_manager to get all host states explicitly gets all<br>

> > compute_node entries. I don't see any reason why multiple compute_node<br>

> > entries can't share the same service. I don't see any place in the<br>

> > scheduler that is grabbing records by "service" instead of by "compute<br>

> > node", but if there is one that I missed, it should be fairly easy to<br>

> > change it.<br>

> > <br>

> > The compute_node record is created in the compute/resource_tracker.py<br>

> > as of a recent commit, so I think the path forward would be to make<br>

> > sure that one of the records is created for each bare metal node by<br>

> > the bare metal compute, perhaps by having multiple resource_trackers.<br>

> > <br>

> > Vish<br>

> > <br>

> > On Aug 27, 2012, at 9:40 AM, David Kang <dkang@isi.edu> wrote:<br>

> > <br>

> > ><br>

> > >  Vish,<br>

> > ><br>

> > >  I think I don't understand your statement fully.<br>

> > > Unless we use different hostnames, (hostname, hypervisor_hostname)<br>

> > > must be the<br>

> > > same for all bare-metal nodes under a bare-metal nova-compute.<br>

> > ><br>

> > >  Could you elaborate on the following statement a little more?<br>

> > ><br>

> > >> You would just have to use a little more than hostname. Perhaps<br>

> > >> (hostname, hypervisor_hostname) could be used to update the entry?<br>

> > >><br>

> > ><br>

> > >  Thanks,<br>

> > >  David<br>

> > ><br>

> > ><br>

> > ><br>

> > > ----- Original Message -----<br>

> > >> I would investigate changing the capabilities to key off of<br>

> > >> something<br>

> > >> other than hostname. It looks from the table structure like<br>

> > >> compute_nodes could have a many-to-one relationship with<br>

> > >> services.<br>

> > >> You would just have to use a little more than hostname. Perhaps<br>

> > >> (hostname, hypervisor_hostname) could be used to update the entry?<br>

> > >><br>

> > >> Vish<br>

> > >><br>

> > >> On Aug 24, 2012, at 11:23 AM, David Kang <dkang@isi.edu> wrote:<br>

> > >><br>

> > >>><br>

> > >>>  Vish,<br>

> > >>><br>

> > >>>  I've tested your code and done some further testing.<br>

> > >>> There are a couple of problems.<br>

> > >>> 1. The host name must be unique. If it is not, repeated updates of<br>

> > >>> new<br>

> > >>> capabilities under the same host name simply overwrite one another.<br>

> > >>> 2. We cannot generate arbitrary host names on the fly.<br>

> > >>>   The scheduler (I tested filter scheduler) gets host names from<br>

> > >>>   db.<br>

> > >>>   So, if a host name is not in the 'services' table, it is not<br>

> > >>>   considered by the scheduler at all.<br>

> > >>><br>

> > >>> So, to make your suggestions possible, nova-compute should<br>

> > >>> register<br>

> > >>> N different host names in 'services' table,<br>

> > >>> and N corresponding entries in 'compute_nodes' table.<br>

> > >>> Here is an example:<br>

> > >>><br>

> > >>> mysql> select id, host, binary, topic, report_count, disabled, availability_zone from services;<br>

> > >>> +----+-------------+----------------+-----------+--------------+----------+-------------------+<br>

> > >>> | id | host        | binary         | topic     | report_count | disabled | availability_zone |<br>

> > >>> +----+-------------+----------------+-----------+--------------+----------+-------------------+<br>

> > >>> |  1 | bespin101   | nova-scheduler | scheduler |        17145 |        0 | nova              |<br>

> > >>> |  2 | bespin101   | nova-network   | network   |        16819 |        0 | nova              |<br>

> > >>> |  3 | bespin101-0 | nova-compute   | compute   |        16405 |        0 | nova              |<br>

> > >>> |  4 | bespin101-1 | nova-compute   | compute   |            1 |        0 | nova              |<br>

> > >>> +----+-------------+----------------+-----------+--------------+----------+-------------------+<br>

> > >>><br>

> > >>> mysql> select id, service_id, hypervisor_hostname from compute_nodes;<br>

> > >>> +----+------------+------------------------+<br>

> > >>> | id | service_id | hypervisor_hostname    |<br>

> > >>> +----+------------+------------------------+<br>

> > >>> |  1 |          3 | bespin101.east.isi.edu |<br>

> > >>> |  2 |          4 | bespin101.east.isi.edu |<br>

> > >>> +----+------------+------------------------+<br>

> > >>><br>

> > >>>  Then, nova db (compute_nodes table) has entries of all bare-metal<br>

> > >>>  nodes.<br>

> > >>> What do you think of this approach?<br>

> > >>> Do you have any better approach?<br>

> > >>><br>

> > >>>  Thanks,<br>

> > >>>  David<br>

> > >>><br>

> > >>><br>

> > >>><br>

> > >>> ----- Original Message -----<br>

> > >>>> To elaborate, something like the below. I'm not absolutely sure you<br>

> > >>>> need<br>

> > >>>> to<br>

> > >>>> be able to set service_name and host, but this gives you the<br>

> > >>>> option<br>

> > >>>> to<br>

> > >>>> do so if needed.<br>

> > >>>><br>

> > >>>> diff --git a/nova/manager.py b/nova/manager.py<br>

> > >>>> index c6711aa..c0f4669 100644<br>

> > >>>> --- a/nova/manager.py<br>

> > >>>> +++ b/nova/manager.py<br>

> > >>>> @@ -217,6 +217,8 @@ class SchedulerDependentManager(Manager):<br>

> > >>>><br>

> > >>>> def update_service_capabilities(self, capabilities):<br>

> > >>>> """Remember these capabilities to send on next periodic<br>

> > >>>> update."""<br>

> > >>>> + if not isinstance(capabilities, list):<br>

> > >>>> + capabilities = [capabilities]<br>

> > >>>> self.last_capabilities = capabilities<br>

> > >>>><br>

> > >>>> @periodic_task<br>

> > >>>> @@ -224,5 +226,8 @@ class SchedulerDependentManager(Manager):<br>

> > >>>> """Pass data back to the scheduler at a periodic interval."""<br>

> > >>>> if self.last_capabilities:<br>

> > >>>> LOG.debug(_('Notifying Schedulers of capabilities ...'))<br>

> > >>>> - self.scheduler_rpcapi.update_service_capabilities(context,<br>

> > >>>> - self.service_name, self.host, self.last_capabilities)<br>

> > >>>> + for capability_item in self.last_capabilities:<br>

> > >>>> + name = capability_item.get('service_name', self.service_name)<br>

> > >>>> + host = capability_item.get('host', self.host)<br>

> > >>>> + self.scheduler_rpcapi.update_service_capabilities(context,<br>

> > >>>> + name, host, capability_item)<br>

> > >>>><br>

> > >>>> On Aug 21, 2012, at 1:28 PM, David Kang <dkang@isi.edu> wrote:<br>

> > >>>><br>

> > >>>>><br>

> > >>>>>  Hi Vish,<br>

> > >>>>><br>

> > >>>>>  We are trying to change our code according to your comment.<br>

> > >>>>> I want to ask a question.<br>

> > >>>>><br>

> > >>>>>>>> a) modify driver.get_host_stats to be able to return a list<br>

> > >>>>>>>> of<br>

> > >>>>>>>> host<br>

> > >>>>>>>> stats instead of just one. Report the whole list back to the<br>

> > >>>>>>>> scheduler. We could modify the receiving end to accept a list<br>

> > >>>>>>>> as<br>

> > >>>>>>>> well<br>

> > >>>>>>>> or just make multiple calls to<br>

> > >>>>>>>> self.update_service_capabilities(capabilities)<br>

> > >>>>><br>

> > >>>>>  Modifying driver.get_host_stats to return a list of host stats<br>

> > >>>>>  is<br>

> > >>>>>  easy.<br>

> > >>>>> Making multiple calls to<br>

> > >>>>> self.update_service_capabilities(capabilities) doesn't seem to<br>

> > >>>>> work,<br>

> > >>>>> because 'capabilities' is overwritten each time.<br>

> > >>>>><br>

> > >>>>>  Modifying the receiving end to accept a list seems to be easy.<br>

> > >>>>> However, 'capabilities' is assumed to be a dictionary by all other<br>

> > >>>>> scheduler routines,<br>

> > >>>>> so it looks like we would have to change all of them to handle<br>

> > >>>>> 'capabilities' as a list of dictionaries.<br>

> > >>>>><br>

> > >>>>>  If my understanding is correct, it would affect many parts of<br>

> > >>>>>  the<br>

> > >>>>>  scheduler.<br>

> > >>>>> Is that what you recommended?<br>

> > >>>>><br>

> > >>>>>  Thanks,<br>

> > >>>>>  David<br>

> > >>>>><br>

> > >>>>><br>

> > >>>>> ----- Original Message -----<br>

> > >>>>>> This was an immediate goal: the bare-metal nova-compute node<br>

> > >>>>>> could<br>

> > >>>>>> keep an internal database, but report capabilities through nova<br>

> > >>>>>> in<br>

> > >>>>>> the<br>

> > >>>>>> common way with the changes below. Then the scheduler wouldn't<br>

> > >>>>>> need<br>

> > >>>>>> access to the bare metal database at all.<br>

> > >>>>>><br>

> > >>>>>> On Aug 15, 2012, at 4:23 PM, David Kang <dkang@isi.edu> wrote:<br>

> > >>>>>><br>

> > >>>>>>><br>

> > >>>>>>> Hi Vish,<br>

> > >>>>>>><br>

> > >>>>>>> Is this discussion about a long-term goal or about this Folsom<br>

> > >>>>>>> release?<br>

> > >>>>>>><br>

> > >>>>>>> We still believe that a bare-metal database is needed<br>

> > >>>>>>> because there is no automated way for bare-metal nodes to<br>

> > >>>>>>> report<br>

> > >>>>>>> their capabilities<br>

> > >>>>>>> to their bare-metal nova-compute node.<br>

> > >>>>>>><br>

> > >>>>>>> Thanks,<br>

> > >>>>>>> David<br>

> > >>>>>>><br>

> > >>>>>>>><br>

> > >>>>>>>> I am interested in finding a solution that enables bare-metal<br>

> > >>>>>>>> and<br>

> > >>>>>>>> virtualized requests to be serviced through the same<br>

> > >>>>>>>> scheduler<br>

> > >>>>>>>> where<br>

> > >>>>>>>> the compute_nodes table has a full view of schedulable<br>

> > >>>>>>>> resources.<br>

> > >>>>>>>> This<br>

> > >>>>>>>> would seem to simplify the end-to-end flow while opening up<br>

> > >>>>>>>> some<br>

> > >>>>>>>> additional use cases (e.g. dynamic allocation of a node from<br>

> > >>>>>>>> bare-metal to hypervisor and back).<br>

> > >>>>>>>><br>

> > >>>>>>>> One approach would be to have a proxy running a single<br>

> > >>>>>>>> nova-compute<br>

> > >>>>>>>> daemon fronting the bare-metal nodes. That nova-compute<br>

> > >>>>>>>> daemon<br>

> > >>>>>>>> would<br>

> > >>>>>>>> report up many HostState objects (1 per bare-metal node) to<br>

> > >>>>>>>> become<br>

> > >>>>>>>> entries in the compute_nodes table and accessible through the<br>

> > >>>>>>>> scheduler HostManager object.<br>

> > >>>>>>>><br>

> > >>>>>>>><br>

> > >>>>>>>><br>

> > >>>>>>>><br>

> > >>>>>>>> The HostState object would set cpu_info, vcpus, memory_mb and<br>

> > >>>>>>>> local_gb<br>

> > >>>>>>>> values to be used for scheduling with the hypervisor_host<br>

> > >>>>>>>> field<br>

> > >>>>>>>> holding the bare-metal machine address (e.g. for IPMI based<br>

> > >>>>>>>> commands)<br>

> > >>>>>>>> and hypervisor_type = NONE. The bare-metal Flavors are<br>

> > >>>>>>>> created<br>

> > >>>>>>>> with<br>

> > >>>>>>>> an<br>

> > >>>>>>>> extra_spec of hypervisor_type= NONE and the corresponding<br>

> > >>>>>>>> compute_capabilities_filter would reduce the available hosts<br>

> > >>>>>>>> to<br>

> > >>>>>>>> those<br>

> > >>>>>>>> bare_metal nodes. The scheduler would need to understand that<br>

> > >>>>>>>> hypervisor_type = NONE means you need an exact fit (or<br>

> > >>>>>>>> best-fit)<br>

> > >>>>>>>> host<br>

> > >>>>>>>> vs weighting them (perhaps through the multi-scheduler). The<br>

> > >>>>>>>> scheduler<br>

> > >>>>>>>> would cast out the message to the <topic>.<service-hostname><br>

> > >>>>>>>> (code<br>

> > >>>>>>>> today uses the HostState hostname), with the compute driver<br>

> > >>>>>>>> having<br>

> > >>>>>>>> to<br>

> > >>>>>>>> understand if it must be serviced elsewhere (but does not<br>

> > >>>>>>>> break<br>

> > >>>>>>>> any<br>

> > >>>>>>>> existing implementations since it is 1 to 1).<br>

> > >>>>>>>><br>

> > >>>>>>>><br>

> > >>>>>>>><br>

> > >>>>>>>><br>

> > >>>>>>>><br>

> > >>>>>>>> Does this solution seem workable? Anything I missed?<br>

> > >>>>>>>><br>

> > >>>>>>>> The bare metal driver already is proxying for the other nodes<br>

> > >>>>>>>> so<br>

> > >>>>>>>> it<br>

> > >>>>>>>> sounds like we need a couple of things to make this happen:<br>

> > >>>>>>>><br>

> > >>>>>>>><br>

> > >>>>>>>> a) modify driver.get_host_stats to be able to return a list<br>

> > >>>>>>>> of<br>

> > >>>>>>>> host<br>

> > >>>>>>>> stats instead of just one. Report the whole list back to the<br>

> > >>>>>>>> scheduler. We could modify the receiving end to accept a list<br>

> > >>>>>>>> as<br>

> > >>>>>>>> well<br>

> > >>>>>>>> or just make multiple calls to<br>

> > >>>>>>>> self.update_service_capabilities(capabilities)<br>

> > >>>>>>>><br>

> > >>>>>>>><br>

> > >>>>>>>> b) make a few minor changes to the scheduler to make sure<br>

> > >>>>>>>> filtering<br>

> > >>>>>>>> still works. Note the changes here may be very helpful:<br>

> > >>>>>>>><br>

> > >>>>>>>><br>

> > >>>>>>>> <a href="https://review.openstack.org/10327">https://review.openstack.org/10327</a><br>

> > >>>>>>>><br>

> > >>>>>>>><br>

> > >>>>>>>> c) we have to make sure that instances launched on those<br>

> > >>>>>>>> nodes<br>

> > >>>>>>>> take<br>

> > >>>>>>>> up<br>

> > >>>>>>>> the entire host state somehow. We could probably do this by<br>

> > >>>>>>>> making<br>

> > >>>>>>>> sure that the instance_type ram, mb, gb etc. matches what the<br>

> > >>>>>>>> node<br>

> > >>>>>>>> has, but we may want a new boolean field "used" if those<br>

> > >>>>>>>> aren't<br>

> > >>>>>>>> sufficient.<br>

> > >>>>>>>><br>

> > >>>>>>>><br>

> > >>>>>>>> This approach seems pretty good. We could potentially get<br>

> > >>>>>>>> rid<br>

> > >>>>>>>> of<br>

> > >>>>>>>> the<br>

> > >>>>>>>> shared bare_metal_node table. I guess the only other concern<br>

> > >>>>>>>> is<br>

> > >>>>>>>> how<br>

> > >>>>>>>> you populate the capabilities that the bare metal nodes are<br>

> > >>>>>>>> reporting.<br>

> > >>>>>>>> I guess we could use an API extension that RPCs to a bare-metal node to add<br>

> > >>>>>>>> the<br>

> > >>>>>>>> node. Maybe someday this could be autogenerated by the bare<br>

> > >>>>>>>> metal<br>

> > >>>>>>>> host<br>

> > >>>>>>>> looking in its arp table for dhcp requests! :)<br>

> > >>>>>>>><br>

> > >>>>>>>><br>

> > >>>>>>>> Vish<br>

> > >>>>>>>><br>

> > >>>>>>>> _______________________________________________<br>

> > >>>>>>>> OpenStack-dev mailing list<br>

> > >>>>>>>> OpenStack-dev@lists.openstack.org<br>

> > >>>>>>>> <a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>

> > >>>>>>><br>


> <br>

> _______________________________________________<br>

> Mailing list: <a href="https://launchpad.net/~openstack">https://launchpad.net/~openstack</a><br>

> Post to     : openstack@lists.launchpad.net<br>

> Unsubscribe : <a href="https://launchpad.net/~openstack">https://launchpad.net/~openstack</a><br>

> More help   : <a href="https://help.launchpad.net/ListHelp">https://help.launchpad.net/ListHelp</a><br>

> <br>

</font></tt><br>

<font size="2" face="sans-serif">Michael<br>

<br>

-------------------------------------------------<br>

Michael Fork<br>

Cloud Architect, Emerging Solutions<br>

IBM Systems & Technology Group</font></body></html>