<div dir="ltr">Marc,<div><br></div><div>I don't believe we have run into this problem so far. Have you tested the network connection between the nodes for stability and throughput? The errors may be caused by network oversaturation.</div>
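<div><br></div><div>As a rough way to check connection stability, the sketch below repeatedly opens a TCP connection to a host and port and records every failure. It is only an illustration: the function name <code>probe_tcp</code> and its parameters are hypothetical, it uses nothing beyond the Python standard library, and it does not measure throughput.</div>
<pre>
```python
import socket
import time


def probe_tcp(host, port, attempts=20, interval=0.5, timeout=3.0):
    """Try to open a TCP connection `attempts` times.

    Returns (ok_count, failures), where failures is a list of
    (attempt_index, exception_class_name, message) tuples.
    """
    ok = 0
    failures = []
    for i in range(attempts):
        try:
            # create_connection resolves the name and performs the TCP handshake.
            with socket.create_connection((host, port), timeout=timeout):
                ok += 1
        except OSError as exc:
            # Covers refused connections, timeouts, and name-resolution errors.
            failures.append((i, type(exc).__name__, str(exc)))
        time.sleep(interval)
    return ok, failures
```
</pre>
<div>Running <code>probe_tcp("test-master2T-001", 8020)</code> from each slave while a job is in progress would show whether the namenode port refuses connections only under load. The host name and port here are taken from the error message quoted below.</div>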
<div><br></div><div>Although the errors point to the network, it might be worth checking with the Hadoop community whether such exceptions can have causes other than a network malfunction.</div><div><br></div><div>Dmitry</div>
<div><br></div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">2013/12/18 Marc Solanas Tarre -X (msolanas - AAP3 INC at Cisco) <span dir="ltr"><<a href="mailto:msolanas@cisco.com" target="_blank">msolanas@cisco.com</a>></span><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div style="font-size:14px;font-family:Calibri,sans-serif;word-wrap:break-word">
<div>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
Hi,</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
I asked this question on Launchpad (<a href="https://answers.launchpad.net/savanna/+question/240969" style="font-family:Calibri,sans-serif;font-size:14px" target="_blank">https://answers.launchpad.net/savanna/+question/240969</a>), but I thought it might reach more people
on the list.</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
My setup is:</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
Ubuntu 12.04<br>
OpenStack Havana with Vanilla Plugin</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
I have deployed a cluster with the following node groups:</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
1 x master:</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
-Uses 1 cinder volume : 2TB</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
-namenode<br>
-secondarynamenode<br>
-oozie<br>
-datanode<br>
-jobtracker<br>
-tasktracker</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
2x slaves:</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
-Uses 1 cinder volume: 2TB</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
-datanode<br>
-tasktracker</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
Both node groups used the following flavor:</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
VCPUs: 32<br>
RAM: 250000<br>
Root disk: 300GB<br>
Ephemeral: 300GB<br>
Swap: 0</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
They also use the default Ubuntu Hadoop Vanilla image downloadable from <a rel="nofollow" href="https://savanna.readthedocs.org/en/latest/userdoc/vanilla_plugin.html" style="color:rgb(0,51,170);text-decoration:none" target="_blank">https://savanna.readthedocs.org/en/latest/userdoc/vanilla_plugin.html</a></p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
The /etc/hosts file in all nodes is:<br>
127.0.0.1 localhost<br>
10.0.0.2 test-master2T-001.novalocal test-master2T-001<br>
10.0.0.3 test-slave2T-001.novalocal test-slave2T-001<br>
10.0.0.4 test-slave2T-002.novalocal test-slave2T-002</p>
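<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
One commonly cited cause of "Connection refused" in Hadoop is a node's own host name being mapped to a loopback address in /etc/hosts. The sketch below parses hosts-file text and lists any such names. The helper names <code>parse_hosts</code> and <code>loopback_names</code> are hypothetical, and the sample text is the file shown above, which passes this particular check.</p>
<pre>
```python
import ipaddress


def parse_hosts(text):
    """Map each host name or alias in hosts-file text to its IP address."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            # Keep the first entry seen for a name.
            mapping.setdefault(name, ip)
    return mapping


def loopback_names(mapping, ignore=("localhost",)):
    """Return names (other than those in `ignore`) mapped to a loopback address."""
    return sorted(
        name for name, ip in mapping.items()
        if name not in ignore and ipaddress.ip_address(ip).is_loopback
    )


HOSTS = """\
127.0.0.1 localhost
10.0.0.2 test-master2T-001.novalocal test-master2T-001
10.0.0.3 test-slave2T-001.novalocal test-slave2T-001
10.0.0.4 test-slave2T-002.novalocal test-slave2T-002
"""

mapping = parse_hosts(HOSTS)
print(mapping["test-master2T-001"])  # prints 10.0.0.2
print(loopback_names(mapping))       # prints []
```
</pre>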
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
Without changing any of the default configuration, the cluster boots correctly.</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
The problem is that when I run a job (for example, teragen with 100GB), the map tasks fail many times and have to be repeated, which increases the job time. They seem to fail randomly, on one slave or the other, depending on the execution.</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
Checking the logs of the datanodes on the slaves, I can see this error:</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: java.net.ConnectException: Call to test-master2T-001/10.0.0.2:8020 failed on connection exception: java.net.ConnectException: Connection refused</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
Full error: <a rel="nofollow" href="http://pastebin.com/DDp39yqt" style="color:rgb(0,51,170);text-decoration:none" target="_blank">http://pastebin.com/DDp39yqt</a></p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
The log of the datanode on the master gives this error:</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception:<br>
java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
Full error: <a rel="nofollow" href="http://pastebin.com/NXYXELQX" style="color:rgb(0,51,170);text-decoration:none" target="_blank">http://pastebin.com/NXYXELQX</a></p>
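<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
To see how often each kind of error occurs on each node, a small tally of exception classes per log can help. The following is a minimal sketch, assuming plain-text logs in which fully qualified Java exception names appear as in the lines above. The function name <code>tally_exceptions</code> is hypothetical, and the sample lines are the ones quoted in this message.</p>
<pre>
```python
import re
from collections import Counter

# Fully qualified Java class names ending in Exception or Error,
# for example java.net.ConnectException.
EXCEPTION_RE = re.compile(r"\b(?:[a-z][a-z0-9_]*\.)+[A-Z]\w*(?:Exception|Error)\b")


def tally_exceptions(lines):
    """Count the first exception class named on each log line."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts


SAMPLE = [
    "WARN org.apache.hadoop.hdfs.server.datanode.DataNode: java.net.ConnectException: "
    "Call to test-master2T-001/10.0.0.2:8020 failed on connection exception: "
    "java.net.ConnectException: Connection refused",
    "WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception:",
    "java.net.SocketException: Original Exception : java.io.IOException: "
    "Connection reset by peer",
]

for name, count in tally_exceptions(SAMPLE).most_common():
    print(f"{count:5d}  {name}")
```
</pre>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
Passing an open log file instead of <code>SAMPLE</code> works the same way, since the function only iterates over lines.</p>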
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
I have tried changing hadoop.tmp.dir to point to the 2TB cinder volume /volumes/disk1/lib/hadoop/hdfs/tmp, but nothing changed.</p>
<p style="line-height:18px;text-align:left;color:rgb(51,51,51);max-width:45em;font-size:12px;width:auto;font-family:Ubuntu,'Bitstream Vera Sans','DejaVu Sans',Tahoma,sans-serif;margin:0px 0px 0.8em;padding:0px">
Thank you in advance.</p><span class="HOEnZb"><font color="#888888">
</font></span></div><span class="HOEnZb"><font color="#888888">
<div>
<div><br>
</div>
<div>Marc</div>
</div>
</font></span></div>
<br></blockquote></div><br></div>