Tuesday, July 5, 2011

Configurations for running Hadoop locally.

This is a follow-up to my earlier post, "How to debug Hadoop locally using eclipse".

In this post I will spell out all the configuration needed to run Hadoop locally in its different modes. I will only cover the local and pseudo-distributed modes. The cluster mode is quite advanced and may be better suited for admins (or maybe I am just not motivated enough to learn about it right now).

As I mentioned in my previous post, there are three modes of running Hadoop.
a) Local mode
b) Pseudo distributed mode
c) Cluster mode

Two of them, local and pseudo-distributed, correspond to running Hadoop locally.

Only local mode is suitable for debugging all your mappers and reducers locally. The reason is that everything runs in a single JVM, which gives eclipse a single process to attach to. This is difficult to do in pseudo-distributed mode, where each task runs in its own child JVM.

The following are the config changes you need to perform for each mode.

In case you are interested in debugging too, you should add the following line to your $HADOOP_HOME/conf/hadoop-env.sh file.
export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5002"
This will put Hadoop into debugging mode, listening for a debugger connection on host localhost and port 5002. Because of suspend=y, the JVM will wait for a debugger to attach before it starts running.
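With that option set, a typical debug session looks roughly like this (the wordcount example and the input/output paths are just illustrative placeholders; any job works the same way):

# 1. Launch a Hadoop command; the JVM suspends and waits for a debugger
bin/hadoop jar hadoop-examples-*.jar wordcount input output
# 2. In eclipse, create a "Remote Java Application" debug configuration
#    with host localhost and port 5002, then hit Debug.
# 3. The JVM resumes, and execution stops at your breakpoints.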

Now, the changes required to run in each mode:

a) Pseudo-distributed mode: change the following properties in the 3 files below.
  1. $HADOOP_HOME/conf/core-site.xml: 
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.</description>
</property>
       This tells Hadoop how to access files. Here it uses HDFS, the distributed file system that ships with Hadoop. The scheme can be changed to ftp or any other implementation of the Hadoop file system interface; HDFS is just one of them.
     2. $HADOOP_HOME/conf/hdfs-site.xml: 

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
This tells Hadoop the number of times it will replicate files in HDFS. For a pseudo-distributed setup the logical value is 1. You can specify another value here, say 2 or 5, but when the Hadoop daemons run they will warn you that only 1 makes sense in this mode. It is smart :)
     3. $HADOOP_HOME/conf/mapred-site.xml: 
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>
This tells Hadoop the host and port that the MapReduce job tracker runs at. If set to "local", jobs are run in-process as a single map and reduce task.

You can check the status of your job tracker at http://localhost:50030/ and your HDFS namenode at http://localhost:50070/.
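Before those pages come up, the daemons have to be running. A rough sketch of the first-time startup, assuming a Hadoop 0.20/1.x-style tarball layout (script names may differ in other versions):

# Format the namenode (only once, before the very first start)
bin/hadoop namenode -format
# Start the HDFS and MapReduce daemons
bin/start-all.sh
# Verify: NameNode, DataNode, JobTracker and TaskTracker should show up
jps
# Quick smoke test against HDFS
bin/hadoop fs -ls /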


b) Local mode:
Change the following properties in the same 3 files.
  1. $HADOOP_HOME/conf/core-site.xml: 
<property>
  <name>fs.default.name</name>
  <value>file:///</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.</description>
</property>
    Files are accessed through the local file system. Remember, no namenode is running in local mode.
     2. $HADOOP_HOME/conf/hdfs-site.xml: 
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
         This is irrelevant now, since HDFS is not being used as the file system.

     3. $HADOOP_HOME/conf/mapred-site.xml: 
<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
</property>
 With the value "local", jobs run in-process; Hadoop starts no job tracker or data node daemons in local mode.

Use the local mode for debugging your jobs in eclipse.
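A local-mode run then looks something like this (the examples jar and the paths are just illustrative):

# With fs.default.name=file:/// and mapred.job.tracker=local,
# the whole job runs in one JVM and reads/writes the local disk
bin/hadoop jar hadoop-examples-*.jar wordcount /tmp/input /tmp/output
# Results land directly in the local output directory
cat /tmp/output/part-*

Since everything is one JVM, this is the process eclipse attaches to when the debug options above are enabled.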
Thanks to Michael for the original post.

3 comments:

  1. Hi, your post is very useful as I am trying to modify, debug, and then run the actual Hadoop source code. I tried to follow this post and everything is fine, but when I click Debug in the 'Debug Configurations' window, it gives the error

    "Failed to connect to remote VM. Connection refused.
    Connection refused"

    Please tell me how to resolve this.
    Thank you

  2. Start your program first, using for instance bin/hadoop jar hadoop-examples-*.jar pi 10 100
    for the pi example. It will then listen on the above port. Next, start the eclipse debugger for this project and it should work.

  3. Can setting up 2 datanodes on the same machine be considered pseudo-distributed mode?
