June 03, 2014

HIVE and PIG to simplify Hadoop

[Note] -- Hadoop, IMHO, is history. Rather than waste time on all this, I suggest you check out my blog post on Spark with Python.


When I was doing engineering at IIT, Kharagpur, the computers that we had were not even as powerful as a low-cost non-smart phone of today and, other than the basic concepts of programming, nothing that we learnt is of any relevance now. So when we start teaching a course on Business Analytics, which lies at the bleeding edge of current technology and business practice, there is simply no option but to take the Do-It-Yourself approach of first learning a subject and then teaching it to students. Fortunately, there are many kind and knowledgeable souls on this planet who have taken pains to explain new and difficult concepts to ancients like us and, thanks to Google, it is not too difficult to locate them.

Using this route, I first learnt what Data Science is and then created this compilation of tutorials and training materials that anyone can use to learn about this new subject in greater depth. The next big challenge was to Demystify Hadoop and Map Reduce, as these two key concepts play a very significant role in this area. Writing Map Reduce programs in Java, as is the standard practice, is a non-trivial task and many people have sought to simplify matters by adopting other approaches. One is to use the Hadoop streaming API with a program written in any language that can run as an executable script, such as Python or R. HIVE and PIG are two other products that have evolved to ease and facilitate the use of MR techniques with Hadoop systems.

HIVE provides an SQL-like query engine that sits on top of the data stored in the HDFS file system on Hadoop. Anyone familiar with SQL will immediately feel at home with the DDL, DML (load, insert) and SELECT commands.

PIG (and its humorously named command prompt, grunt> ) is a scripting language that allows one to run queries on data stored on HDFS without writing complex MR programs in Java.

In this post we will

  1. Install HIVE and use SQL commands to load and retrieve data from an HDFS file system.
  2. Install PIG and use it to retrieve the same data.
  3. Do the same task with the usual Java program (already shown in an earlier blog post).

We assume that you have followed the instructions in the earlier blog post and have a single-machine Hadoop cluster installed on an Ubuntu (preferably 14.04) machine.

Varad Meru of Orzota has created a set of four excellent tutorials that we will use to get a grip on PIG and HIVE.

The first one talks about installing Hadoop 1.0.3, but we will ignore that because we have already learnt to install Hadoop 2.2.0.

The data used in the three other tutorials is the Book Crossing Dataset. You can download it as a zip file and then extract ONLY the file called BX-Books.csv, which is all that the next three tutorials need.

From this file we will answer the question of how many books were published in each calendar year. Not really rocket science, but enough to demonstrate how HIVE and PIG work.
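To make the question concrete, here is a minimal Python sketch of the same count written in the map / sort / reduce style that the Hadoop streaming route mentioned earlier would use. It is not taken from any of the tutorials: the function names, the sample rows and the assumption that the year is the fourth semicolon-separated field are all mine, and everything runs in memory rather than on a cluster. In a real streaming job the mapper and the reducer would be two separate scripts reading from standard input.

```python
from itertools import groupby

YEAR_FIELD = 3  # assumption: YearOfPublication is the 4th ';'-separated field


def map_line(line):
    """Mapper step: emit a (year, 1) pair, or None if the line has too few fields."""
    fields = line.rstrip("\n").split(";")
    if len(fields) <= YEAR_FIELD:
        return None
    return fields[YEAR_FIELD].strip('"'), 1


def reduce_pairs(pairs):
    """Reducer step: add up the counts per year; pairs must be sorted by year."""
    for year, group in groupby(pairs, key=lambda pair: pair[0]):
        yield year, sum(count for _, count in group)


def count_by_year(lines):
    """Run map, sort (standing in for Hadoop's shuffle) and reduce in memory."""
    pairs = [pair for pair in map(map_line, lines) if pair is not None]
    return dict(reduce_pairs(sorted(pairs)))


# Made-up rows in the layout of the cleaned file, for illustration only.
sample = [
    '"0001";"Title A";"Author A";"2001";"Pub A";"s";"m";"l"',
    '"0002";"Title B";"Author B";"1999";"Pub B";"s";"m";"l"',
    '"0003";"Title C";"Author C";"2001";"Pub C";"s";"m";"l"',
]
print(count_by_year(sample))  # {'1999': 1, '2001': 2}
```

HIVE and PIG arrive at the same kind of result without our having to write the map and reduce steps ourselves.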

The second tutorial, Hive for Beginners, gives clear, step-by-step instructions to carry out the task. Almost every instruction works perfectly. The following listing shows the shell script used for all three tutorials (HIVE, PIG, Java).
---

========================================
#!/bin/bash

# --- hive and common data cleaning and loading

#hdfs dfs -mkdir /user/hive
#hdfs dfs -mkdir /user/hive/warehouse
#hdfs dfs -chmod g+w /tmp
#hdfs dfs -chmod g+w /user/hive/warehouse
#hdfs dfs -mkdir /user/hduser/BXData-in
#sed 's/\&amp;/\&/g' BX-Books.csv | sed -e '1d' | sed 's/;/$$$/g' | sed 's/"$$$"/";"/g' > BX-BooksCorrected.txt
#hdfs dfs -copyFromLocal /home/hduser/BXData/BX-BooksCorrected.txt /user/hduser/BXData-in

#hive -f goBX2.sql > goBX2.output

# ---- pig
#pig goBX3.pig

# --- java

#rm -rf LocalClasses
#mkdir LocalClasses
# ....
#javac -classpath $HADOOP_INSTALL/share/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/common/lib/commons-cli-1.2.jar -d LocalClasses BookXReducer.java
# .....
#javac -classpath $HADOOP_INSTALL/share/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/common/lib/commons-cli-1.2.jar -d LocalClasses BookXMapper.java
# .....
#javac -classpath $HADOOP_INSTALL/share/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/common/lib/commons-cli-1.2.jar:LocalClasses -d LocalClasses BookXDriver.java && jar -cvf BookXDriver.jar -C LocalClasses/ .

#hadoop jar BookXDriver.jar BookXDriver /user/hduser/BXData-in /user/hduser/BXData-MR-out
========================================

---
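
The sed pipeline in the script is the least obvious line, so here is a rough Python equivalent, offered only as a sketch. It assumes that the first substitution is meant to turn the HTML entity &amp; back into a plain ampersand. The function name and the sample rows are hypothetical. The idea is to hide every semicolon as $$$ and then restore only those that sit between two quoted fields, so that a stray semicolon inside a book title can no longer be mistaken for a field delimiter.

```python
def clean_lines(lines):
    """Mimic the sed pipeline: drop the header, decode &amp;, protect embedded ';'."""
    cleaned = []
    for number, line in enumerate(lines):
        if number == 0:
            continue  # like sed -e '1d': skip the header row
        line = line.replace("&amp;", "&")    # the entity itself contains a ';', so decode it first
        line = line.replace(";", "$$$")      # hide every semicolon
        line = line.replace('"$$$"', '";"')  # restore only the ones between quoted fields
        cleaned.append(line)
    return cleaned


# Made-up rows for illustration; the real file has eight columns.
raw = [
    '"ISBN";"Book-Title";"Book-Author"',
    '"0001";"Tom &amp; Jerry; The Book";"A. Author"',
]
print(clean_lines(raw)[0])  # "0001";"Tom & Jerry$$$ The Book";"A. Author"
```

After this step every remaining ";" is a genuine field separator, which is what the HIVE table definition and the PIG loader both rely on.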

Instead of typing long HIVE commands by hand, we have created a file called goBX2.sql to store the various HIVE commands; by selectively un-commenting lines, we execute the different commands.

---

========================================
-- CREATE TABLE IF NOT EXISTS BXDataSet (ISBN STRING, BookTitle STRING, BookAuthor STRING,YearOfPublication STRING, Publisher STRING, ImageURLS STRING, ImageURLM STRING, ImageURLL STRING) COMMENT 'BX-Books Table' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;' STORED AS TEXTFILE;
--use default;
--show databases;
--show tables;
--LOAD DATA INPATH '/user/hduser/BXData-in/BX-BooksCorrected.txt' OVERWRITE INTO TABLE BXDataSet;
select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;
========================================

---

The only deviation from the instructions is
  1. One error in the CREATE TABLE command. Since ";" marks the end of a statement in HIVE script files, the CREATE TABLE statement from the tutorial failed because its field delimiter is itself a ";" symbol. Changing the delimiter to "\;" solved the problem and execution could proceed.
Also note that the output is stored in the file goBX2.output.

After using HIVE, the same task is performed using PIG by following instructions given in the tutorial PIG for Beginners.

There were three deviations from the instructions
  1. PIG was throwing a fearful error, "ERROR org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl - Error whiletrying to run jobs. java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected", that was causing a major abort. This was tracked down to this StackOverflow thread, and the following command, issued from the $PIG_HOME directory, solved the problem : ant clean jar-all -Dhadoopversion=23 . However, please note that the command takes nearly 25 minutes to execute as it virtually rebuilds many Hadoop, PIG and related jars.
  2. The PIG_CLASSPATH is set to the Hadoop conf directory, which in the case of Hadoop 2.2.0 is $HADOOP_INSTALL/etc/hadoop
  3. Also do note that after HIVE has loaded data into a table, the data is no longer at its original location in the HDFS file system (LOAD DATA INPATH moves the file into the HIVE warehouse rather than copying it). So before PIG can start, the data has to be reloaded from the local file system to HDFS once again ! Simply uncomment the relevant line in the shell script and run it once again.
The PIG commands were stored in a file goBX3.pig and executed from the shell script goBX1.sh
---

========================================
BookXRecords = LOAD '/user/hduser/BXData-in/BX-BooksCorrected.txt' USING PigStorage(';') AS (ISBN:chararray, BookTitle:chararray, BookAuthor:chararray, YearOfPublication:chararray, Publisher:chararray, ImageURLS:chararray, ImageURLM:chararray, ImageURLL:chararray);
GroupByYear = GROUP BookXRecords BY YearOfPublication;
CountByYear = FOREACH GroupByYear GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
STORE CountByYear INTO '/user/hduser/BXData-out-pig/BXDataQueryResult' USING PigStorage('\t');
========================================

---

In this case, the output is stored in the HDFS file system, where it can be accessed through the browser at localhost:50075 and downloaded.



Finally, after using HIVE and then PIG to generate the data, one can use the standard Java route as explained in the fourth and final tutorial. There is really no need to configure Eclipse with the Hadoop plug-in (the version for Hadoop 2.2.0 is not yet ready or stable, as of now). You can simply download the three Java files, BookXDriver, BookXMapper and BookXReducer, and then use the javac command from the Ubuntu prompt as given in the shell script above. Once again the output will be stored in the HDFS directory /user/hduser/BXData-MR-out (as shown in the diagram above) and can be downloaded for comparison with the two other results.
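
Once the three result files have been downloaded, the comparison itself can be automated. The sketch below is only an illustration under assumptions of my own: that the HIVE and Java outputs hold one year<TAB>count pair per line, that the PIG output holds year:count (as produced by the CONCAT in goBX3.pig), and that the year may or may not still be wrapped in double quotes. The function names and the sample lines are hypothetical and the numbers are not real results.

```python
def parse_counts(lines):
    """Turn 'year<TAB>count' or 'year:count' lines into a {year: count} dict."""
    counts = {}
    for line in lines:
        line = line.strip()
        if "\t" in line:
            sep = "\t"
        elif ":" in line:
            sep = ":"
        else:
            continue  # blank line or something that is not a result row
        year, _, count = line.rpartition(sep)
        if not count.isdigit():
            continue  # skip log lines and other noise
        counts[year.strip('"')] = int(count)
    return counts


def differences(first, second):
    """Return {year: (count_in_first, count_in_second)} wherever two runs disagree."""
    return {
        year: (first.get(year), second.get(year))
        for year in sorted(set(first) | set(second))
        if first.get(year) != second.get(year)
    }


# Made-up output lines, for illustration only.
hive_out = ['"1999"\t1', '"2001"\t2']
pig_out = ['"1999":1', '"2001":2']
mr_out = ['1999\t1', '2001\t3']

print(differences(parse_counts(hive_out), parse_counts(pig_out)))  # {}
print(differences(parse_counts(hive_out), parse_counts(mr_out)))   # {'2001': (2, 3)}
```

An empty dictionary means the two runs agree on every year.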

Ok, here is the final screenshot of the applications console, available in the browser at localhost:8088, which shows that all three jobs have executed successfully.


If you find any errors in this post, please leave a comment. If you find it useful, do share it with your friends ... and also check out the Business Analytics program at Praxis Business School.
