Tuesday, January 24, 2012

Power of HIVE (HQL, SQL ) vs procedural language

Today my manager asked me to perform a simple task. The nature of task was somewhat easy if you think it in terms of your favorite language like Java, Python etc. My manager asked me to perform the same task in HIVE.

Consider the following columns in a HIVE table

(key1 string,
key2 string ,
position int,
Value double)

Here the values of column position ranges from 0-60.

Now, the task was to find for a given combination of key1,key2 combined all those rows for which the value at pos1 is less than value at pos2, value at pos2 is less than value at pos2, so on and so forth. Note the position should differ by 1 i.e. pos2-pos1 =1.

In our product, key1 is the source of query(say mobile or web) and key2 is the category code of query(say 800 for restaurants, 9001 for pizza etc) and value is the ctr rate for that source and category code combination at a given position.

So, to explain it again, the task was to find out those category codes combination for which the ctr value at position 2 is greater than position 1, position 3 greater than position 2 etc.

In Java this was easy, get all the rows order them by key1, key2, and key3. This will give you the values in an increasing order of a given key combination and in the increasing order of position too. Just write a simple algorithm to find those codes for which value at position 2 is greater than position 1 for a given key1,key2 combination(Leaving that as an exercise for the reader). TADaaa! This was easy.

Now the major problem was to implement the same logic in HIVE. I will present you the final query here. Its very self explanatory.

(select key1,key2,position,values from table_a) f
(select key1,key2,(position-1) as position,values from table_a) a
(f.key1=a.key1 and f.key2=a.key2 and f.position=a.position)
where f.ctr> a.ctr

In the above query the table joins with itself on same key to key combination. The trick used here to compare values at position 1 with values at position 2, position 2 and position 3 is to subtract 1 from all the rows in the second table and join it with the first table. Thus, later on you can compare the ctr values and print the results.