<div dir="ltr">Thanks for your comments let me explain a bit more about Hadoop topology.<br><br>In Hadoop 1.2 version, 4 level topologies were introduced: all network, rack, node group (represent Hadoop nodes on the same compute host in the simplest case) and node. Usually Hadoop has replication factor 3. In this case Hadoop placement algorithm is trying to put a HDFS block in the local node or local node group, second replica should be placed outside the node group, but on the same rack, and the last replica outside the initial rack. Topology is defined by the path to vm e.g.<br>
<br>/datacenter1/rack1/host1/vm1<br>/datacenter1/rack1/host1/vm2<br>/datacenter1/rack1/host2/vm1<br>/datacenter1/rack1/host2/vm2<br>/datacenter1/rack2/host3/vm1<br>/datacenter1/rack2/host3/vm2<br>....<br><br>This information is also used for job routing, to place mappers as close as possible to the data.<br>
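<br>To make that rule concrete, here is a toy Python sketch (only an illustration of the idea, not the actual HDFS placement code) that picks three replica locations from paths like the ones above:<br>
<br>
# Toy illustration of the 4-level placement rule described above.<br>
import random<br>
<br>
nodes = [<br>
    "/datacenter1/rack1/host1/vm1", "/datacenter1/rack1/host1/vm2",<br>
    "/datacenter1/rack1/host2/vm1", "/datacenter1/rack1/host2/vm2",<br>
    "/datacenter1/rack2/host3/vm1", "/datacenter1/rack2/host3/vm2",<br>
]<br>
<br>
def rack(path):<br>
    return "/".join(path.split("/")[:3])       # e.g. /datacenter1/rack1<br>
<br>
def node_group(path):<br>
    return "/".join(path.split("/")[:4])       # e.g. /datacenter1/rack1/host1<br>
<br>
def pick_replicas(writer):<br>
    first = writer                             # local node (or local node group)<br>
    second = random.choice([n for n in nodes if rack(n) == rack(first)<br>
                            and node_group(n) != node_group(first)])<br>
    third = random.choice([n for n in nodes if rack(n) != rack(first)])<br>
    return [first, second, third]<br>
<br>
print(pick_replicas("/datacenter1/rack1/host1/vm1"))<br>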
<br><br>The main idea is to provide this information to Hadoop. Usually it is a direct mapping between the physical data center structure and the Hadoop node placement, but in the case of a public cloud abstract names are fine, as long as the configuration reflects proximity information for the Hadoop nodes.<br>
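<br>For example, the usual way to hand this mapping to Hadoop is a small topology script that resolves host names or IPs to such paths (wired in through the topology script property in core-site.xml, topology.script.file.name in 1.x if I recall the name correctly, with the node-group-aware classes enabled separately). A minimal sketch, where the mapping file name and its contents are just placeholders:<br>
<br>
#!/usr/bin/env python<br>
# Minimal topology script sketch: Hadoop invokes it with one or more host<br>
# names/IPs and expects one topology path per argument on stdout.<br>
import sys<br>
<br>
MAPPING_FILE = "/etc/hadoop/topology.map"   # lines like: 10.0.0.11 /datacenter1/rack1/host1<br>
<br>
def load_mapping(path):<br>
    mapping = {}<br>
    with open(path) as f:<br>
        for line in f:<br>
            parts = line.split()<br>
            if len(parts) == 2:<br>
                mapping[parts[0]] = parts[1]<br>
    return mapping<br>
<br>
if __name__ == "__main__":<br>
    mapping = load_mapping(MAPPING_FILE)<br>
    for host in sys.argv[1:]:<br>
        print(mapping.get(host, "/default-rack"))<br>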
<br><br>Mike, as I understand it, your holistic scheduler can provide the needed information. Can you give more details about it?</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Sep 13, 2013 at 11:54 AM, John Garbutt <span dir="ltr"><<a href="mailto:john@johngarbutt.com" target="_blank">john@johngarbutt.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Exposing the detailed info in private cloud, sure makes sense. For<br>
public clouds, not so sure. Would be nice to find something that works<br>
for both.<br>
<br>
We let the user express their intent through the instance groups api.<br>
The scheduler will then make a best effort to meet those criteria,<br>
using its private information. At a coarser grain, we have availability<br>
zones, which you could use to express "closeness", and probably often<br>
give you a good measure of closeness anyway.<br>
<br>
So a Hadoop user could request several small groups of VMs, defined<br>
in instance groups to be close, and maybe spread across different<br>
availability zones.<br>
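<br>
As a very rough sketch with python-novaclient (names are illustrative,<br>
and the "group" hint assumes the matching affinity filter is enabled<br>
on the cloud side):<br>
<br>
# Boot one small "close" group of Hadoop VMs into a single availability<br>
# zone, tagged with a group scheduler hint so the scheduler can try to<br>
# keep them together. Credentials, image and flavor IDs are placeholders.<br>
from novaclient.v1_1 import client<br>
<br>
nova = client.Client("user", "password", "tenant",<br>
                     "http://keystone:5000/v2.0")<br>
<br>
for i in range(3):<br>
    nova.servers.create(name="hadoop-group-a-%d" % i,<br>
                        image="IMAGE_ID", flavor="FLAVOR_ID",<br>
                        availability_zone="az-1",<br>
                        scheduler_hints={"group": "hadoop-group-a"})<br>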
<br>
Would that do the trick? Or does Hadoop/HDFS need a bit more<br>
granularity than that? Could it look to auto-detect "closeness" in<br>
some auto-setup phase, given rough user hints?<br>
<span class="HOEnZb"><font color="#888888"><br>
John<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
On 13 September 2013 07:40, Alex Glikson <<a href="mailto:GLIKSON@il.ibm.com">GLIKSON@il.ibm.com</a>> wrote:<br>
> If I understand correctly, what really matters, at least in the case of Hadoop, is<br>
> network proximity between instances.<br>
> Hence, maybe Neutron would be a better fit to provide such information. In<br>
> particular, depending on virtual network configuration, having 2 instances<br>
> on the same node does not guarantee that the network traffic between them<br>
> will be routed within the node.<br>
> Physical layout could be useful for availability-related purposes. But even<br>
> then, it should be abstracted in such a way that it will not reveal details<br>
> that a cloud provider will typically prefer not to expose. Maybe this can be<br>
> done by Ironic -- or a separate/new project (Tuskar sounds related).<br>
><br>
> Regards,<br>
> Alex<br>
><br>
><br>
><br>
><br>
> From: Mike Spreitzer <<a href="mailto:mspreitz@us.ibm.com">mspreitz@us.ibm.com</a>><br>
> To: OpenStack Development Mailing List<br>
> <<a href="mailto:openstack-dev@lists.openstack.org">openstack-dev@lists.openstack.org</a>>,<br>
> Date: 13/09/2013 08:54 AM<br>
> Subject: Re: [openstack-dev] [nova] [savanna] Host information for<br>
> non admin users<br>
> ________________________________<br>
><br>
><br>
><br>
>> From: Nirmal Ranganathan <<a href="mailto:rnirmal@gmail.com">rnirmal@gmail.com</a>><br>
>> ...<br>
>> Well that's left up to the specific block placement policies in HDFS;<br>
>> all we are providing with the topology information is a hint on<br>
>> node/rack placement.<br>
><br>
> Oh, you are looking at the placement of HDFS blocks within the fixed storage<br>
> volumes, not choosing where to put the storage volumes. In that case I<br>
> understand and agree that simply providing identifiers from the<br>
> infrastructure to the middleware (HDFS) will suffice. Coincidentally my<br>
> group is working on this very example right now in our own environment. We<br>
> have a holistic scheduler that is given a whole template to place, and it<br>
> returns placement information. We imagine, as does Hadoop, a general<br>
> hierarchy in the physical layout, and the holistic scheduler returns, for<br>
> each VM, the path from the root to the VM's host.<br>
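><br>
> To give a flavour, the result is essentially a map from each VM to its<br>
> path (the names here are purely illustrative):<br>
><br>
> placement = {<br>
>     "vm1": "/datacenter1/rack1/host1",<br>
>     "vm2": "/datacenter1/rack1/host2",<br>
>     "vm3": "/datacenter1/rack2/host3",<br>
> }<br>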
><br>
> Regards,<br>
><br>
> Mike<br>
><br>
><br>
><br>
<br>
_______________________________________________<br>
OpenStack-dev mailing list<br>
<a href="mailto:OpenStack-dev@lists.openstack.org">OpenStack-dev@lists.openstack.org</a><br>
<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>
</div></div></blockquote></div><br></div>