Tuesday, July 5, 2011

Configurations for running Hadoop locally.

This is a follow-up to my earlier post, "Debugging Hadoop locally on Eclipse".

In this post I will spell out what configurations are needed to run the different modes of Hadoop locally. I will only cover the local and pseudo-distributed modes. The cluster mode is quite advanced and may be better suited for admins (or maybe I'm just not motivated enough to learn about it right now).

As I mentioned in my previous post, there are three modes of running Hadoop.
a) Local mode
b) Pseudo distributed mode
c) Cluster mode.

Two of them, Local and Pseudo-distributed, correspond to running Hadoop locally.

Only Local mode is suitable for debugging all your mappers and reducers locally. The reason is that every mapper and reducer runs in a single JVM, giving Eclipse something to attach to. This is difficult to do in Pseudo-distributed mode.

The following are the config changes you need to make for each mode.

In case you are interested in debugging mode too, you should add the following line to your $HADOOP_HOME/conf/hadoop-env.sh file.
export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5002"
This will put Hadoop into debugging mode, listening for a debugger connection on host localhost and port 5002.

Now, changes required to run in various mode:

a) Pseudo-distributed mode: Change the following properties in these 3 files.
  1. $HADOOP_HOME/conf/core-site.xml: 
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.</description>
</property>
This tells Hadoop how to access files. Here it uses HDFS, the file system under the hood of Hadoop. The scheme can be changed to FTP or any other implementation of the Hadoop file system interface; HDFS is just one of them. (A small client sketch after this section shows these settings being picked up.)
     2. $HADOOP_HOME/conf/hdfs-site.xml: 

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
This tells Hadoop how many times it should replicate each file in HDFS. For a pseudo-distributed setup the logical value is 1. You can specify another value here, say 2 or 5, but when the Hadoop daemons run they will warn you that only 1 is valid in this mode. It is smart :)
     3. $HADOOP_HOME/conf/mapred-site.xml: 
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>
This tells Hadoop the host and port that the MapReduce job tracker runs at. If set to "local", jobs are run in-process as a single map and reduce task.

You can check the status of your job tracker and HDFS name node at http://localhost:50030/ and http://localhost:50070/ respectively.
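To sanity-check the pseudo-distributed setup, here is a minimal client sketch, assuming the settings above (the class name HdfsCheck and the hard-coded URI are mine for illustration; normally FileSystem.get picks fs.default.name up from core-site.xml). It lists the HDFS root and prints each file's replication factor:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Normally read from core-site.xml; hard-coded here for illustration.
    conf.set("fs.default.name", "hdfs://localhost:54310");
    FileSystem fs = FileSystem.get(conf);
    // Each file should report a replication factor of 1 in this mode.
    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath() + "  (replication: "
          + status.getReplication() + ")");
    }
  }
}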


b) Local mode:
Change the following properties in the same 3 files.
  1. $HADOOP_HOME/conf/core-site.xml: 
<property>
  <name>fs.default.name</name>
  <value>file:///</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.</description>
</property>
Files are accessed using the local file system. Remember, no name node is running in local mode.
     2. $HADOOP_HOME/conf/hdfs-site.xml: 
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
This is irrelevant now, since HDFS is not being used as the file system.

     3. $HADOOP_HOME/conf/mapred-site.xml: 
<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
</property>
No job tracker here: Hadoop is now running in local mode, so there is no job tracker or data node.

Use local mode for debugging stuff in Eclipse; a minimal driver for this is sketched below.
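As a concrete example, here is a minimal driver sketch, assuming the local-mode settings above (the class name LocalDriver and the input/output paths are made up for illustration). It forces local mode programmatically, so you can launch it with Debug As > Java Application in Eclipse without touching the conf files at all:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class LocalDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LocalDriver.class);
    // Force local mode: local file system plus the in-process job runner.
    conf.set("fs.default.name", "file:///");
    conf.set("mapred.job.tracker", "local");
    conf.setJobName("local-debug");
    // conf.setMapperClass(...) and conf.setReducerClass(...) would go here.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(conf, new Path("input"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));
    JobClient.runJob(conf);
  }
}

With these two properties set, everything, including the mappers and reducers, runs in the one JVM that Eclipse launched, so breakpoints just work.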
Thanks to Michael for the original post.

Debugging Hadoop locally on Eclipse


I am now a Cloudera Certified Hadoop Developer (CCDH). Yayy!

After doing all the theoretical work, I had a chance to work on Hadoop at work. We have 3 different remote environments (namely test, qa and prod), and it's pretty hard to debug your job in the MapReduce paradigm.

In this post I will discuss the 3 modes of Hadoop and which one to use to debug stuff locally. In case you are interested in the configuration settings of the 3 modes, read my follow-up post here.

I usually code in Eclipse and like to test my code locally before copying the jar over to any of those machines. Running Hadoop locally is easy, but attaching a debugger to it is hard.

Let's start with some intro:
Hadoop can run in 3 modes (only two of them are local).
  1. Standalone (or local) mode: No daemons run in this mode. If you run jps in your terminal, there will be no Job tracker, Name node or other daemons listed. Hadoop just uses your local file system as a substitute for HDFS.
  2. Pseudo-distributed mode: All daemons run on a single machine, mimicking the behaviour of a cluster. Everything runs locally on your machine, with files accessed through HDFS.
  3. Fully distributed mode: This is the kind of environment you will usually find on test, qa and prod grids: hundreds of machines, with as many cores, and the true power of Hadoop. As an application developer you would not set these machines up; it's usually the admin folks who do.
Now, you would usually use #3 when running your final job with real production data. Some developers also code and test on #3 machines (qa, test, prod). This post, however, is about running Hadoop in modes #1 and #2.
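If you are ever unsure which mode your current configuration selects, a quick sketch like this prints the two properties that decide it (the class name WhichMode is made up; the fallback values shown are Hadoop's own defaults):

import org.apache.hadoop.conf.Configuration;

public class WhichMode {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // "file:///" plus "local" means standalone mode; an hdfs:// URI plus
    // a host:port pair means pseudo- or fully distributed mode.
    System.out.println("fs.default.name    = " + conf.get("fs.default.name", "file:///"));
    System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker", "local"));
  }
}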

As I mentioned earlier, I like to code and test stuff locally in Eclipse before doing the final run.
For this, Hadoop gives you two options: #1 (local mode) and #2 (pseudo-distributed mode).

To debug Hadoop jobs you need the following configuration:

a) In the conf folder of your HADOOP_HOME, add the following line to hadoop-env.sh.
export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5002"
This lets you attach to your code as a Remote Java Application. To do so, use the following steps:
a) Click on Debug Configurations in Eclipse.
b) Select Remote Java Application in the menu on the left.
c) For the host, provide localhost; the port should be the one given in the address variable above (5002 in this case). You can choose any valid port number.

When would you choose #1 over #2? #1 and #2 are identical in that all the code runs locally. However, you cannot debug your mappers and reducers in mode #2. The main reason is that in pseudo-distributed mode, each mapper and reducer runs in its own JVM, and it is impossible to debug them all from one instance of Eclipse. The only way to debug your mappers and reducers to their full potential is local mode (#1). Since all the mappers and reducers run in a single JVM, you can inspect your variables easily.
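For instance, with a mapper like this sketch, written against the old org.apache.hadoop.mapred API (the class name is made up), a breakpoint placed inside map() will be hit in local mode, because the mapper runs inside the same JVM as the driver:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DebuggableMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Set an Eclipse breakpoint here: local mode hits it, but pseudo mode
    // runs this code in a separate child JVM that Eclipse is not attached to.
    for (String word : value.toString().split("\\s+")) {
      output.collect(new Text(word), new IntWritable(1));
    }
  }
}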

You would run in mode #2 if you are interested in seeing how HDFS behaves on your machine. In terms of power, I don't think there is a difference, since you only have one machine. However, local mode is faster, as Hadoop reads files directly from the local file system, whereas pseudo-distributed mode goes through HDFS to read local files.

I will write a follow-up post on what configurations are needed to run in these two modes (#1 and #2).

Please feel free to comment.