
Tuesday, January 24, 2012

Power of HIVE (HQL, SQL) vs procedural languages

Today my manager asked me to perform a simple task. The task is fairly easy if you think about it in terms of your favorite language like Java or Python. My manager asked me to perform the same task in HIVE.

Consider the following columns in a HIVE table

(key1 string,
key2 string,
position int,
value double)


Here the values of the position column range from 0 to 60.

Now, the task was to find, for a given key1,key2 combination, all those rows for which the value at position 1 is less than the value at position 2, the value at position 2 is less than the value at position 3, and so on. Note that the positions must differ by exactly 1, i.e. pos2 - pos1 = 1.

In our product, key1 is the source of the query (say mobile or web), key2 is the category code of the query (say 800 for restaurants, 9001 for pizza, etc.), and value is the CTR (click-through rate) for that source and category code combination at a given position.

So, to explain it again, the task was to find those source and category code combinations for which the CTR value at position 2 is greater than at position 1, at position 3 greater than at position 2, and so on.

In Java this was easy: get all the rows and order them by key1, key2, and position. This gives you the values for a given key combination in increasing order of position. Then write a simple algorithm that finds those combinations for which the value at position 2 is greater than at position 1, and so on (leaving that as an exercise for the reader; a rough sketch follows below). Ta-da! This was easy.
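
For reference, here is a rough sketch of that procedural approach. It is my own illustration, not code from our product: the Row class and the hard-coded sample rows are assumptions made purely for the example.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class IncreasingCtrFinder {

    // One row of the HIVE table: (key1, key2, position, value).
    static class Row {
        final String key1, key2;
        final int position;
        final double value;
        Row(String key1, String key2, int position, double value) {
            this.key1 = key1; this.key2 = key2;
            this.position = position; this.value = value;
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample data; in practice these rows would come from the table.
        List<Row> rows = new ArrayList<Row>();
        rows.add(new Row("mobile", "800", 1, 0.10));
        rows.add(new Row("mobile", "800", 2, 0.15)); // CTR grows from position 1 to 2
        rows.add(new Row("web", "9001", 1, 0.20));
        rows.add(new Row("web", "9001", 2, 0.12));

        // Order by key1, key2, position so adjacent positions of a key pair are neighbors.
        rows.sort(Comparator.comparing((Row r) -> r.key1)
                            .thenComparing(r -> r.key2)
                            .thenComparingInt(r -> r.position));

        // Report (key1, key2, position) where the CTR at position+1 is greater.
        for (int i = 0; i + 1 < rows.size(); i++) {
            Row cur = rows.get(i), next = rows.get(i + 1);
            boolean sameKeys = cur.key1.equals(next.key1) && cur.key2.equals(next.key2);
            if (sameKeys && next.position - cur.position == 1 && next.value > cur.value) {
                System.out.println(cur.key1 + "\t" + cur.key2 + "\t" + cur.position);
            }
        }
    }
}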

Now the major challenge was to implement the same logic in HIVE. I will present the final query here; it is fairly self-explanatory.


select distinct
  f.key1,
  f.key2,
  f.position
from
  (select key1, key2, position, value from table_a) f
join
  (select key1, key2, (position - 1) as position, value from table_a) a
on
  (f.key1 = a.key1 and f.key2 = a.key2 and f.position = a.position)
where a.value > f.value


In the above query the table joins with itself on the same key combination. The trick used here to compare the value at position 1 with the value at position 2, position 2 with position 3, and so on, is to subtract 1 from the position of every row in the second copy of the table and join it back with the first copy. After the join, each output row pairs a position (the f side) with the position immediately after it (the a side), so the where clause simply keeps those pairs where the CTR at the higher position (a.value) is greater than at the lower position (f.value). For example, if (mobile, 800) has CTR 0.10 at position 1 and 0.15 at position 2, the pair survives the filter and (mobile, 800, 1) is printed.

Tuesday, July 5, 2011

Configurations for running Hadoop locally

This is a follow-up to my earlier post, "How to debug Hadoop locally using eclipse".

In this post I will spell out which configurations are needed for running the different modes of Hadoop locally. I will only cover the local and pseudo-distributed modes. The cluster mode is quite advanced and may be better suited for admins (or maybe I am just not motivated enough to dig into it right now).

As I mentioned in my previous post, there are three modes of running Hadoop.
a) Local mode
b) Pseudo distributed mode
c) Cluster.

Two of them, Local and Pseudo-distributed, correspond to running Hadoop locally.

Only Local mode is suitable for debugging all your mappers and reducers locally. The reason is that every mapper and reducer runs in a single JVM, which gives Eclipse something to attach to and step through. This is difficult to do in Pseudo-distributed mode.

The following are the config changes you need to perform for each mode.

In case you are interested in the debugging mode too, you should add the following line to your $HADOOP_HOME/conf/hadoop-env.sh file.
export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5002"
This will put Hadoop into debugging mode, listening for a debugger connection on host localhost and port 5002.

Now, the changes required to run in each mode:

a) Pseudo-distributed mode: Change the following properties in these 3 files.
  1. $HADOOP_HOME/conf/core-site.xml: 
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.</description>
</property>
       This tells Hadoop how to access files. Here it is using HDFS, the distributed file system under the hood of Hadoop. This can be changed to FTP or other implementations of the Hadoop file system interface; HDFS is just one of them.
     2. $HADOOP_HOME/conf/hdfs-site.xml: 

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
This tells Hadoop how many times it should replicate files in HDFS. For a pseudo-distributed setup the logical value is 1. You can specify any value here, say 2 or 5, but when the Hadoop daemons run they will warn you that only 1 is a valid value in this mode. It is smart :)
     3. $HADOOP_HOME/conf/mapred-site.xml: 
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>
This tells Hadoop the host and port that the MapReduce job tracker runs at. If it is set to "local", then jobs are run in-process as a single map and reduce task.

You can check the status of your job tracker and HDFS name node at http://localhost:50030/ and http://localhost:50070/ respectively.
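
If you prefer to sanity-check the pseudo-distributed setup from code rather than from the web UIs, a small client along the lines of the sketch below can list the HDFS root using the same fs.default.name value configured above. This is my own illustration, not part of the original setup instructions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSmokeTest {
    public static void main(String[] args) throws Exception {
        // Point the client at the pseudo-distributed name node configured in core-site.xml.
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:54310");

        FileSystem fs = FileSystem.get(conf);
        // List the HDFS root; if the name node is up, this prints its top-level entries.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
    }
}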


b) Local mode:
Change the following properties in the same 3 files.
  1. $HADOOP_HOME/conf/core-site.xml: 
<property>
  <name>fs.default.name</name>
  <value>file:///</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.</description>
</property>
    Files are accessed using the local file system. Remember, no name node is running in local mode.
     2. $HADOOP_HOME/conf/hdfs-site.xml: 
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
         This is irrelevant now, since HDFS is not being used as the file system.

     3. $HADOOP_HOME/conf/mapred-site.xml: 
<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
</property>
 No job tracker here, as Hadoop is now running in local mode without a job tracker or data node.

Use the local mode for debugging stuff in Eclipse.
Thanks to Michael for the original post.

Debugging Hadoop locally on Eclipse


I am now a Cloudera Certified Hadoop Developer (CCDH). Yayy!

After doing all the theoretical work, I had a chance to work with Hadoop at my job. We have 3 different remote environments (namely test, qa and prod), and it is pretty hard to debug your job in the MapReduce paradigm.

In this post I will discuss the 3 modes of Hadoop and which one you should use to debug stuff locally. In case you are interested in the configuration settings of these modes, read my follow-up post here.

I usually code in Eclipse and like to test my code locally before copying the jar over to any of those machines. Running Hadoop locally is easy, but debugging it with a debugger is harder.

Let's start with some intro:
Hadoop can mainly run in 3 modes(only two of them are locally).
  1. Standalone (or local) mode: There are no daemons running in this mode. When you run jps in your terminal, there will be no JobTracker, NameNode or other daemons listed. Hadoop just uses the local file system as a substitute for HDFS. 
  2. Pseudo-distributed mode: All daemons run on a single machine, mimicking the behaviour of a cluster. All the daemons run locally on your machine, and files are accessed through HDFS. 
  3. Fully distributed mode: This is the kind of environment you will usually find on the test, qa and prod grids. It has hundreds of machines, an equally large number of cores, and the true power of Hadoop. As an application developer you would not set up these machines; it is usually the admin folks who do. 
Now, you would usually use #3 when running your final job with real production data. Some developers also code and test on #3 machines (qa, test, prod). This post, however, is about running Hadoop in modes #1 and #2.

As I mentioned earlier, I like to code and test stuff locally in Eclipse before doing the final run.
To do this, Hadoop gives you two options: #1 (local mode) and #2 (pseudo-distributed mode).

To debug Hadoop jobs you need to make the following configuration change:

In the conf folder of your HADOOP_HOME, add the following line to hadoop-env.sh.
export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5002"
This puts your job under a remote debugging agent that Eclipse can attach to as a Remote Java Application. To attach, use the following steps:
a) Click on Debug Configurations in Eclipse.
b) Select Remote Java Application in the menu on the left.
c) For the host provide localhost, and for the port use the one given in the address variable above, 5002 in this case. You can choose any valid port number.

When would you choose #1 over #2? #1 and #2 are identical in the sense that all the code runs locally. However, you cannot debug your mappers and reducers using mode #2. The main reason is that in pseudo-distributed mode each mapper and reducer runs in its own JVM, and it is impossible to debug them all from one instance of Eclipse. The only way to debug your mappers and reducers to their full potential is local mode (#1): since all the mappers and reducers run in a single JVM, you can step through the code and inspect your variables easily. A minimal driver sketch for this follows below.
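
To make that concrete, here is a minimal driver sketch of my own (the word-count mapper and reducer are just placeholders, not code from this post) that forces local mode programmatically, so the whole job, mappers and reducers included, runs inside the single JVM that Eclipse is debugging:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalDebugDriver {

    // A trivial word-count mapper; set a breakpoint inside map() and it will be
    // hit in the same JVM when the job runs in local mode.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Force standalone/local mode: local file system and in-process job runner,
        // mirroring the core-site.xml and mapred-site.xml settings shown earlier.
        conf.set("fs.default.name", "file:///");
        conf.set("mapred.job.tracker", "local");

        Job job = new Job(conf, "local debug word count");
        job.setJarByClass(LocalDebugDriver.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("input"));    // local input directory
        FileOutputFormat.setOutputPath(job, new Path("output")); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With this, breakpoints set inside map() or reduce() are hit directly, without even attaching a remote debugger.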

You would want to run in #2 mode if you are interested in seeing how HDFS behaves on your machine. In terms of power, I don't think there is a difference, since you only have one machine either way. However, local mode is faster, as Hadoop reads files directly from the local file system, whereas in pseudo-distributed mode it goes through HDFS to read them.

I will write a follow-up post on what configurations are needed to run in the two modes (mainly #1 and #2).

Please feel free to comment.