Tuesday, July 5, 2011

Debugging Hadoop locally in Eclipse


I am now a Cloudera Certified Hadoop Developer (CCDH). Yayy!

After all the theoretical work, I finally had a chance to work with Hadoop at my job. We have three different remote environments (namely test, qa, and prod), and it is pretty hard to debug your job in the MapReduce paradigm.

In this post I will discuss the three modes of Hadoop and which one you should use to debug stuff locally. In case you are interested in the configuration settings of the three modes, read my follow-up post here.

I usually code in Eclipse and like to test my code locally before copying the jar over to any of those machines. Running Hadoop locally is easy, but attaching a debugger to it is hard.

Let's start with some intro:
Hadoop can run in three modes (only two of them locally).
  1. Standalone (or local) mode: There are no daemons running in this mode. When you run jps in your terminal, there is no JobTracker, NameNode, or any other daemon running. Hadoop just uses the local file system as a substitute for HDFS.
  2. Pseudo-distributed mode: All daemons run on a single machine, mimicking the behaviour of a cluster. Every daemon runs locally on your machine, and files are accessed through HDFS.
  3. Fully distributed mode: This is the kind of environment you will usually find on test, qa, and prod grids: hundreds of machines, each with several cores, and the true power of Hadoop. As an application developer you would not set this environment up yourself; it's usually the admin folks who do.
Now, you would usually use #3 when running your final job with real production data. Some developers also code and test on #3 machines (qa, test, prod). This post, however, is about running Hadoop in modes #1 and #2.

As I mentioned earlier, I like to code and test stuff locally in Eclipse before doing the final run.
For this, Hadoop gives you two options: #1 (local mode) and #2 (pseudo-distributed mode).
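As a starting point, here is a minimal driver sketch, assuming the 0.20-style org.apache.hadoop.mapreduce API; the class names and the map-only setup are my own illustrations, not the only way to do it. It pins the job to standalone mode in code, so you can run it as a plain Java application straight from Eclipse:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalDebugDriver {

    // Pass-through mapper; put a breakpoint inside map() to inspect records.
    public static class DebugMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Standalone (local) mode: local file system instead of HDFS,
        // and an in-process job runner instead of a JobTracker.
        conf.set("fs.default.name", "file:///");
        conf.set("mapred.job.tracker", "local");

        Job job = new Job(conf, "local-debug");
        job.setJarByClass(LocalDebugDriver.class);
        job.setMapperClass(DebugMapper.class);
        job.setNumReduceTasks(0);  // map-only, to keep the sketch short
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Launched this way, you don't even need the remote-debug setup below; the remote setup is for when you launch jobs through the bin/hadoop script instead.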

To debug Hadoop jobs, you need the following configuration:

In the conf folder of your HADOOP_HOME, add the following line to hadoop-env.sh:
export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5002"
This makes the Hadoop JVM listen for a remote debugger; with suspend=y it waits until a debugger attaches before running your code. To attach Eclipse as a Remote Java Application, use the following steps:
a) Open Debug Configurations in Eclipse.
b) Select Remote Java Application in the menu on the left.
c) For the host, provide localhost, and for the port, the one given in the address variable above (5002 in this case). You can choose any valid port number.

When would you choose #1 over #2? #1 and #2 are identical in the sense that all the code runs locally. However, you cannot debug your mappers and reducers in mode #2. The main reason is that in pseudo-distributed mode each mapper and reducer runs in its own JVM, and it is impossible to debug them all from one instance of Eclipse. The only way to debug your mappers and reducers at full potential is local mode (#1): since all the mappers and reducers run in a single JVM, you can step through them and inspect your variables easily.
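You can see the single-JVM-versus-many-JVMs difference for yourself with a mapper that reports which JVM it is running in; ManagementFactory.getRuntimeMXBean().getName() conventionally returns a pid@hostname string. This class is just a sketch of mine, not part of any standard example:

import java.io.IOException;
import java.lang.management.ManagementFactory;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class JvmReportingMapper
        extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void setup(Context context) {
        // Local mode: every task prints the same pid@host, because everything
        // shares one JVM (which is what makes single-session debugging work).
        // Pseudo-distributed mode: each task prints a different pid,
        // one per spawned child JVM.
        System.err.println("Task JVM: "
                + ManagementFactory.getRuntimeMXBean().getName());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value);
    }
}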

You would run in mode #2 if you are interested in seeing how HDFS behaves on your machine. In terms of power, I don't think there is a difference, since you only have one machine either way. However, local mode is faster, because Hadoop reads files directly from the local file system, whereas pseudo-distributed mode reads them through HDFS.
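If you are unsure which of the two file systems your current configuration resolves to, a quick check (class name is my own) is to ask Hadoop for its default file system, which comes from fs.default.name in core-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsCheck {
    public static void main(String[] args) throws Exception {
        // Standalone mode typically prints file:///;
        // pseudo-distributed mode prints something like hdfs://localhost:<port>.
        FileSystem fs = FileSystem.get(new Configuration());
        System.out.println("Default filesystem: " + fs.getUri());
    }
}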

I will write a follow-up post on what configurations are needed to run in the two modes (#1 and #2).

Please feel free to comment.
