This is a follow-up to my earlier post, "How to debug Hadoop locally using Eclipse".
In this post I will spell out the configuration needed to run the different modes of Hadoop locally. I will only cover the local and pseudo distributed modes. The cluster mode is quite advanced and may be better suited for admins (or maybe I'm just not motivated enough to learn about it right now).
As I mentioned in my previous post, there are three modes of running Hadoop.
a) Local mode
b) Pseudo distributed mode
c) Cluster mode.
Two of them, Local and Pseudo distributed, correspond to running Hadoop locally.
Only Local mode is suitable for debugging all your mappers and reducers locally. The reason is that every mapper and reducer runs in a single JVM, which gives Eclipse something to attach to. This is difficult to do in Pseudo distributed mode.
The following are the config changes you need to perform for each of the modes.
In case you are interested in debugging too, you should add the following line to your $HADOOP_HOME/conf/hadoop-env.sh file.
export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5002"
This will start Hadoop in debugging mode, listening for a debugger connection on localhost at port 5002.
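Once a Hadoop process starts with this option, you can attach to it through Eclipse's "Remote Java Application" launch configuration, or, as a quick sanity check from the command line, with jdb. A minimal sketch, assuming the port 5002 configured above:

# Attach to the JVM waiting on port 5002 (suspend=y makes it wait until a debugger connects)
jdb -connect com.sun.jdi.SocketAttach:hostname=localhost,port=5002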
Now, the changes required to run in the various modes:
a) Pseudo distributed mode: Change the following properties in the 3 files below.
1. $HADOOP_HOME/conf/core-site.xml:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation.</description>
</property>
This tells Hadoop how to access files. Here it is using HDFS, the file system that ships with Hadoop. The URI can be changed to FTP or any other implementation of the Hadoop file system interface; HDFS is just one of them.

2. $HADOOP_HOME/conf/hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
This tells Hadoop the number of times it will replicate files in HDFS. For a pseudo distributed setup the logical value is 1. You can specify any value here, say 2 or 5, but when the Hadoop daemons run they will warn you that 1 is the only valid value in this mode. It is smart :)
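If you want to verify what replication your files actually ended up with, fsck reports it. A quick check, assuming HDFS has been formatted and the daemons are running:

# Report file, block and replication information for everything under /
hadoop fsck / -files -blocks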
3. $HADOOP_HOME/conf/mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
</property>
This tells Hadoop the host and port that the MapReduce job tracker runs at. If set to "local", jobs are run in-process as a single map and reduce task.
You can check the status of your job tracker and HDFS name node at http://localhost:50030/ and http://localhost:50070/ respectively.
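To put the pseudo distributed configuration to work end to end, a typical session looks roughly like this. This is only a sketch: the examples jar name and the input/output paths are placeholders, and the scripts are assumed to live under $HADOOP_HOME/bin.

# One-time: format the HDFS name node
bin/hadoop namenode -format

# Start the HDFS and MapReduce daemons (name node, data node, job tracker, task tracker)
bin/start-all.sh

# Copy some local input into HDFS and run the bundled wordcount example
bin/hadoop fs -put conf input
bin/hadoop jar hadoop-examples-*.jar wordcount input output

# Inspect the result and shut everything down
bin/hadoop fs -cat output/part-*
bin/stop-all.sh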
b) Local mode:
Change the following properties in the same 3 files.
1. $HADOOP_HOME/conf/core-site.xml:
<property>
<name>fs.default.name</name>
<value>file:///</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation.</description>
</property>
Files are accessed using the local file system. Remember, no name node is running in local mode.
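With fs.default.name pointing at file:///, the hadoop fs commands simply operate on your local disk, which is an easy way to confirm that the setting took effect:

# Lists your local root directory, not an HDFS namespace
hadoop fs -ls /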
2. $HADOOP_HOME/conf/hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
This is irrelevant now, since HDFS is not being used as the file system.
3. $HADOOP_HOME/conf/mapred-site.xml:
<property>
<name>mapred.job.tracker</name>
<value>local</value>
</property>
No job tracker here: Hadoop is now running in local mode, so there is no job tracker or data node.
Use local mode for debugging stuff in Eclipse.
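In local mode a job submitted from the command line (or launched straight from Eclipse) runs entirely in one JVM, so breakpoints in your mappers and reducers are hit directly. A rough sketch, again using the bundled wordcount example with placeholder local paths:

# Everything runs in the client JVM; input and output are plain local directories
bin/hadoop jar hadoop-examples-*.jar wordcount /tmp/wc-input /tmp/wc-output

Combine this with the HADOOP_OPTS debug flag from the beginning of the post and Eclipse can attach and step through the whole job.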
Thanks to Michael for the original post.