[openstack-dev] Discussion about where to put database for bare-metal provisioning (review 10726)

Vishvananda Ishaya vishvananda at gmail.com
Mon Aug 27 17:07:31 UTC 2012


Hi David,

I just checked out the code more extensively and I don't see why you need to create a new service entry for each compute_node entry. The code in host_manager to get all host states explicitly gets all compute_node entries. I don't see any reason why multiple compute_node entries can't share the same service. I don't see any place in the scheduler that is grabbing records by "service" instead of by "compute node", but if there is one that I missed, it should be fairly easy to change it.
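
As a rough illustration of the keying point (a sketch only, not Nova code): the Service and ComputeNode tuples, the two helper functions and the "bm-node-*" names below are all made up for the example. It assumes several compute_node rows point at one service row, and shows why the lookup key needs more than the service host name.

```python
from collections import namedtuple

# Hypothetical, simplified stand-ins for rows in the 'services' and
# 'compute_nodes' tables; the real Nova models carry many more columns.
Service = namedtuple("Service", "id host")
ComputeNode = namedtuple(
    "ComputeNode", "id service_id hypervisor_hostname memory_mb")


def host_states_by_host(services, compute_nodes):
    """Key by service host only: nodes sharing a service overwrite each other."""
    hosts = {s.id: s.host for s in services}
    return {hosts[n.service_id]: n for n in compute_nodes}


def host_states_by_node(services, compute_nodes):
    """Key by (service host, hypervisor_hostname): one entry per compute node."""
    hosts = {s.id: s.host for s in services}
    return {(hosts[n.service_id], n.hypervisor_hostname): n
            for n in compute_nodes}


# One nova-compute service fronting two bare-metal nodes (invented values).
services = [Service(3, "bespin101")]
compute_nodes = [
    ComputeNode(1, 3, "bm-node-0", 8192),
    ComputeNode(2, 3, "bm-node-1", 16384),
]

by_host = host_states_by_host(services, compute_nodes)
by_node = host_states_by_node(services, compute_nodes)
print(len(by_host), len(by_node))
```

With the host-only key the second node replaces the first; with the pair as the key both nodes stay visible to whatever consumes the map.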

The compute_node record is created in the compute/resource_tracker.py as of a recent commit, so I think the path forward would be to make sure that one of the records is created for each bare metal node by the bare metal compute, perhaps by having multiple resource_trackers. 
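
To make the "multiple resource_trackers" idea a bit more concrete, here is a minimal sketch. It is not the actual resource tracker API: the ResourceTracker and BareMetalCompute classes, their method names, and the dict standing in for the compute_nodes table are all assumptions for illustration.

```python
class ResourceTracker:
    """Hypothetical, simplified tracker that owns one compute_node record."""

    def __init__(self, host, nodename, compute_nodes):
        self.host = host
        self.nodename = nodename
        # Stand-in for the compute_nodes table: a dict keyed by (host, nodename).
        self.compute_nodes = compute_nodes

    def update_available_resource(self, resources):
        """Create the record on first use, then refresh it in place."""
        record = self.compute_nodes.setdefault((self.host, self.nodename), {})
        record.update(resources)
        return record


class BareMetalCompute:
    """Hypothetical bare-metal compute service fronting several nodes."""

    def __init__(self, host, compute_nodes):
        self.host = host
        self.compute_nodes = compute_nodes
        self.trackers = {}

    def _tracker_for(self, nodename):
        # Lazily create one tracker per bare-metal node.
        if nodename not in self.trackers:
            self.trackers[nodename] = ResourceTracker(
                self.host, nodename, self.compute_nodes)
        return self.trackers[nodename]

    def report(self, stats_by_node):
        """Push each node's stats through its own tracker."""
        for nodename, resources in stats_by_node.items():
            self._tracker_for(nodename).update_available_resource(resources)


compute_nodes = {}
compute = BareMetalCompute("bespin101", compute_nodes)
compute.report({
    "bm-node-0": {"vcpus": 8, "memory_mb": 8192},
    "bm-node-1": {"vcpus": 16, "memory_mb": 16384},
})
print(sorted(compute_nodes))
```

Each node ends up with its own record while the service host stays the same, which is the many-to-one shape described above.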

Vish

On Aug 27, 2012, at 9:40 AM, David Kang <dkang at isi.edu> wrote:

> 
>  Vish,
> 
>  I don't think I fully understand your statement.
> Unless we use different hostnames, (hostname, hypervisor_hostname) must be the 
> same for all bare-metal nodes under a bare-metal nova-compute.
> 
>  Could you elaborate on the following statement a little more?
> 
>> You would just have to use a little more than hostname. Perhaps
>> (hostname, hypervisor_hostname) could be used to update the entry?
>> 
> 
>  Thanks,
>  David
> 
> 
> 
> ----- Original Message -----
>> I would investigate changing the capabilities to key off of something
>> other than hostname. It looks from the table structure like
>> compute_nodes could have a many-to-one relationship with services.
>> You would just have to use a little more than hostname. Perhaps
>> (hostname, hypervisor_hostname) could be used to update the entry?
>> 
>> Vish
>> 
>> On Aug 24, 2012, at 11:23 AM, David Kang <dkang at isi.edu> wrote:
>> 
>>> 
>>>  Vish,
>>> 
>>>  I've tested your code and done some more testing.
>>> There are a couple of problems.
>>> 1. The host name must be unique. Otherwise, repeated updates of new
>>> capabilities with the same host name simply overwrite one another.
>>> 2. We cannot generate arbitrary host names on the fly.
>>>   The scheduler (I tested the filter scheduler) gets host names from the db.
>>>   So, if a host name is not in the 'services' table, it is not
>>>   considered by the scheduler at all.
>>> 
>>> So, to make your suggestions possible, nova-compute should register
>>> N different host names in the 'services' table,
>>> and N corresponding entries in the 'compute_nodes' table.
>>> Here is an example:
>>> 
>>> mysql> select id, host, binary, topic, report_count, disabled, availability_zone from services;
>>> +----+-------------+----------------+-----------+--------------+----------+-------------------+
>>> | id | host        | binary         | topic     | report_count | disabled | availability_zone |
>>> +----+-------------+----------------+-----------+--------------+----------+-------------------+
>>> |  1 | bespin101   | nova-scheduler | scheduler |        17145 |        0 | nova              |
>>> |  2 | bespin101   | nova-network   | network   |        16819 |        0 | nova              |
>>> |  3 | bespin101-0 | nova-compute   | compute   |        16405 |        0 | nova              |
>>> |  4 | bespin101-1 | nova-compute   | compute   |            1 |        0 | nova              |
>>> +----+-------------+----------------+-----------+--------------+----------+-------------------+
>>> 
>>> mysql> select id, service_id, hypervisor_hostname from compute_nodes;
>>> +----+------------+------------------------+
>>> | id | service_id | hypervisor_hostname    |
>>> +----+------------+------------------------+
>>> |  1 |          3 | bespin101.east.isi.edu |
>>> |  2 |          4 | bespin101.east.isi.edu |
>>> +----+------------+------------------------+
>>> 
>>>  Then the nova db (compute_nodes table) has entries for all bare-metal
>>>  nodes.
>>> What do you think of this approach?
>>> Do you have a better approach?
>>> 
>>>  Thanks,
>>>  David
>>> 
>>> 
>>> 
>>> ----- Original Message -----
>>>> To elaborate, something like the below. I'm not absolutely sure you
>>>> need to be able to set service_name and host, but this gives you the
>>>> option to do so if needed.
>>>> 
>>>> diff --git a/nova/manager.py b/nova/manager.py
>>>> index c6711aa..c0f4669 100644
>>>> --- a/nova/manager.py
>>>> +++ b/nova/manager.py
>>>> @@ -217,6 +217,8 @@ class SchedulerDependentManager(Manager):
>>>> 
>>>>      def update_service_capabilities(self, capabilities):
>>>>          """Remember these capabilities to send on next periodic update."""
>>>> +        if not isinstance(capabilities, list):
>>>> +            capabilities = [capabilities]
>>>>          self.last_capabilities = capabilities
>>>> 
>>>>      @periodic_task
>>>> @@ -224,5 +226,8 @@ class SchedulerDependentManager(Manager):
>>>>          """Pass data back to the scheduler at a periodic interval."""
>>>>          if self.last_capabilities:
>>>>              LOG.debug(_('Notifying Schedulers of capabilities ...'))
>>>> -            self.scheduler_rpcapi.update_service_capabilities(context,
>>>> -                    self.service_name, self.host, self.last_capabilities)
>>>> +            for capability_item in self.last_capabilities:
>>>> +                name = capability_item.get('service_name', self.service_name)
>>>> +                host = capability_item.get('host', self.host)
>>>> +                self.scheduler_rpcapi.update_service_capabilities(context,
>>>> +                        name, host, capability_item)
>>>> 
>>>> On Aug 21, 2012, at 1:28 PM, David Kang <dkang at isi.edu> wrote:
>>>> 
>>>>> 
>>>>>  Hi Vish,
>>>>> 
>>>>>  We are trying to change our code according to your comment.
>>>>> I want to ask a question.
>>>>> 
>>>>>>>> a) modify driver.get_host_stats to be able to return a list of
>>>>>>>> host
>>>>>>>> stats instead of just one. Report the whole list back to the
>>>>>>>> scheduler. We could modify the receiving end to accept a list
>>>>>>>> as
>>>>>>>> well
>>>>>>>> or just make multiple calls to
>>>>>>>> self.update_service_capabilities(capabilities)
>>>>> 
>>>>>  Modifying driver.get_host_stats to return a list of host stats is
>>>>>  easy.
>>>>> Making multiple calls to
>>>>> self.update_service_capabilities(capabilities) doesn't seem to
>>>>> work, because 'capabilities' is overwritten each time.
>>>>> 
>>>>>  Modifying the receiving end to accept a list seems to be easy.
>>>>> However, 'capabilities' is assumed to be a dictionary by all the other
>>>>> scheduler routines, so it looks like we would have to change all of
>>>>> them to handle 'capabilities' as a list of dictionaries.
>>>>> 
>>>>>  If my understanding is correct, it would affect many parts of the
>>>>>  scheduler.
>>>>> Is that what you recommended?
>>>>> 
>>>>>  Thanks,
>>>>>  David
>>>>> 
>>>>> 
>>>>> ----- Original Message -----
>>>>>> This was an immediate goal. The bare-metal nova-compute node could
>>>>>> keep an internal database, but report capabilities through nova in
>>>>>> the common way with the changes below. Then the scheduler wouldn't
>>>>>> need access to the bare metal database at all.
>>>>>> 
>>>>>> On Aug 15, 2012, at 4:23 PM, David Kang <dkang at isi.edu> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> Hi Vish,
>>>>>>> 
>>>>>>> Is this discussion about a long-term goal or about this Folsom
>>>>>>> release?
>>>>>>> 
>>>>>>> We still believe that a bare-metal database is needed
>>>>>>> because there is no automated way for bare-metal nodes to report
>>>>>>> their capabilities
>>>>>>> to their bare-metal nova-compute node.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> David
>>>>>>> 
>>>>>>>> 
>>>>>>>> I am interested in finding a solution that enables bare-metal
>>>>>>>> and
>>>>>>>> virtualized requests to be serviced through the same scheduler
>>>>>>>> where
>>>>>>>> the compute_nodes table has a full view of schedulable
>>>>>>>> resources.
>>>>>>>> This
>>>>>>>> would seem to simplify the end-to-end flow while opening up
>>>>>>>> some
>>>>>>>> additional use cases (e.g. dynamic allocation of a node from
>>>>>>>> bare-metal to hypervisor and back).
>>>>>>>> 
>>>>>>>> One approach would be to have a proxy running a single
>>>>>>>> nova-compute
>>>>>>>> daemon fronting the bare-metal nodes. That nova-compute daemon
>>>>>>>> would
>>>>>>>> report up many HostState objects (1 per bare-metal node) to
>>>>>>>> become
>>>>>>>> entries in the compute_nodes table and accessible through the
>>>>>>>> scheduler HostManager object.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> The HostState object would set cpu_info, vcpus, memory_mb and
>>>>>>>> local_gb
>>>>>>>> values to be used for scheduling with the hypervisor_host field
>>>>>>>> holding the bare-metal machine address (e.g. for IPMI based
>>>>>>>> commands)
>>>>>>>> and hypervisor_type = NONE. The bare-metal Flavors are created
>>>>>>>> with
>>>>>>>> an
>>>>>>>> extra_spec of hypervisor_type = NONE and the corresponding
>>>>>>>> compute_capabilities_filter would reduce the available hosts to
>>>>>>>> those
>>>>>>>> bare_metal nodes. The scheduler would need to understand that
>>>>>>>> hypervisor_type = NONE means you need an exact fit (or
>>>>>>>> best-fit)
>>>>>>>> host
>>>>>>>> vs weighting them (perhaps through the multi-scheduler). The
>>>>>>>> scheduler
>>>>>>>> would cast out the message to the <topic>.<service-hostname>
>>>>>>>> (code
>>>>>>>> today uses the HostState hostname), with the compute driver
>>>>>>>> having
>>>>>>>> to
>>>>>>>> understand if it must be serviced elsewhere (but does not break
>>>>>>>> any
>>>>>>>> existing implementations since it is 1 to 1).
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Does this solution seem workable? Anything I missed?
>>>>>>>> 
>>>>>>>> The bare metal driver already is proxying for the other nodes
>>>>>>>> so
>>>>>>>> it
>>>>>>>> sounds like we need a couple of things to make this happen:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> a) modify driver.get_host_stats to be able to return a list of
>>>>>>>> host
>>>>>>>> stats instead of just one. Report the whole list back to the
>>>>>>>> scheduler. We could modify the receiving end to accept a list
>>>>>>>> as
>>>>>>>> well
>>>>>>>> or just make multiple calls to
>>>>>>>> self.update_service_capabilities(capabilities)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> b) make a few minor changes to the scheduler to make sure
>>>>>>>> filtering
>>>>>>>> still works. Note the changes here may be very helpful:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> https://review.openstack.org/10327
>>>>>>>> 
>>>>>>>> 
>>>>>>>> c) we have to make sure that instances launched on those nodes
>>>>>>>> take
>>>>>>>> up
>>>>>>>> the entire host state somehow. We could probably do this by
>>>>>>>> making
>>>>>>>> sure that the instance_type ram, mb, gb etc. matches what the
>>>>>>>> node
>>>>>>>> has, but we may want a new boolean field "used" if those aren't
>>>>>>>> sufficient.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> This approach seems pretty good. We could potentially get rid of
>>>>>>>> the shared bare_metal_node table. I guess the only other concern is
>>>>>>>> how you populate the capabilities that the bare metal nodes are
>>>>>>>> reporting. I guess an API extension that RPCs to a bare-metal node
>>>>>>>> could add the node. Maybe someday this could be autogenerated by the
>>>>>>>> bare metal host looking in its arp table for dhcp requests! :)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Vish
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> OpenStack-dev mailing list
>>>>>>>> OpenStack-dev at lists.openstack.org
>>>>>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>>>>> 