[Note] -- Hadoop, IMHO, is history. Rather than waste time with all this, I suggest you check out my blog post on Spark with Python.
In an earlier post on Data Science - A DIY approach, I had explained how one can initiate a career in data science, or data analytics, by using free resources available on the web. Since Hadoop and MapReduce are, respectively, a tool and a technique that are very popular in data science, this post will get you started and help you:
- Install Hadoop 2.2 in single-machine cluster mode on a machine running Ubuntu
- Compile and run the standard WordCount example in Java
- Compile and run another, non WordCount, program in Java
- Use the Hadoop streaming utility to run a WordCount program written in Python, as an example of a non-Java application
- Compile and run a Java program that actually solves a small but representative Predictive Analytics problem
All the information presented here has been gathered from various sites on the web and has been tested on a dual-boot laptop running Ubuntu 14.04. Believe me, it works. Should you still get stuck because of variations in the operating environment, you will need to Google the appropriate error messages and locate your own solutions.
Hadoop is a piece of software, a Java framework, and MapReduce is a technique, an algorithm, that was developed to support "internet-scale" data processing requirements at Yahoo, Google and other internet giants. The primary requirement was to sort through and count vast amounts of data. A quick search on Google will reveal a vast number of tutorials on both Hadoop and MapReduce, like this one on Introduction to the Hadoop Ecosystem by Uwe Seiler, or Programming Hadoop MapReduce by Arun C Murthy of Yahoo. You can also download Tom White's Hadoop - The Definitive Guide, 3rd Edition or read Hadoop Illuminated online.
Slide 5 of Murthy's deck goes to the heart of the MapReduce technique and explains it with a very simple Unix shell script analogy. A Map process takes data and generates a long list of [key, value] pairs, where a key is an alphanumeric string, e.g. a word, a URL, a country, and the value is usually numeric. Once the long list of [key, value] pairs has been generated, the list is sorted (or shuffled) so that all the pairs with the same key are located one after the other in the list. Finally, in the Reduce phase, the multiple values associated with each key are processed (for example, added) to generate a list where each key appears once, along with its single, reduced value.
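To make that analogy concrete, here is a minimal word-count sketch of the same idea written as a Unix pipeline (my own illustration, not taken from Murthy's slides); input.txt stands for any plain text file:
========================================
# Word count as a miniature "MapReduce", expressed as a Unix pipeline
# tr      -> Map     : emit one word per line (the keys)
# sort    -> Shuffle : bring identical keys next to each other
# uniq -c -> Reduce  : add up the occurrences of each key
cat input.txt | tr -s ' ' '\n' | sort | uniq -c | sort -nr | head
========================================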
What Hadoop does is distribute the Map and Reduce processes among multiple computers in a way that is essentially invisible (or, as they say, transparent) to the person executing the MapReduce program. Hence the same MapReduce program can be executed either on a standalone "cluster" of a single machine (as will be the case in our exercise) or on a genuine cluster of appropriately configured machines. However, if the size of the data is large, or "internet-scale", a single-machine cluster will either take a very long time or simply crash.
My good friend Rajiv Pratap, who runs an excellent data analytics company, Abzooba, has a brilliant analogy for Hadoop. Let us assume that a field is covered with thousands of red and green apples and I am asked to determine the number of apples of each colour. I might slowly and painstakingly go through the whole field myself, but better still, I can hire an army of street-urchins who can barely count up to 20 ("low cost commodity machines"). I ask each urchin to pick up an armload of apples, count the number of red and green ones and report back to me with two numbers, say, ([red, 3], [green, 8]). These are my [key, value] pairs, two pairs reported by each urchin. Then I simply add the values corresponding to the red key and I get the total number of red apples in the field. The same for the green apples, and I have my answer. In the process, if one of the urchins throws down his apples and runs away ["a commodity machine has a failure"], the process is not impacted, because some other urchin picks up the apples and reports the data. Hadoop is like a manager who hires the urchins, tells them what to do, shows them where the data is located, sorts, shuffles and collates their results, and replaces them if one or two run away. The urchins simply have to know how to count [Map] and add [Reduce].
Anyway, enough of theory ... let us
1. Install Hadoop 2.2 in single-machine cluster mode on a machine running Ubuntu
I have a Dell laptop with a dual-boot feature that allows me to use either Windows 7 or Ubuntu 14.04. Running Hadoop on Windows 7 is possible, but then you will be seen to be an amateur. As a budding Hadoop professional, it is imperative that you get access to a Linux box. If you are using Ubuntu, you can use the directions given in this blog post to set up the newest Hadoop 2.x (2.2.0) on Ubuntu.
I followed these instructions except for the following deviations:
- the HADOOP_INSTALL directory was /usr/local/hadoop220 and not /usr/local/hadoop to distinguish it from an earlier Hadoop 1.8 install that I had abandoned
- the HDFS file system was located at /home/hduser/HDFS and not at /mydata/hdfs. Hence the directories for the namenode and datanode were located at
- /home/hduser/HDFS/namenode310514 [ not /mydata/hdfs/namenode ]
- /home/hduser/HDFS/datanode310514 [ not /mydata/hdfs/datanode ]
- four XML files, located at $HADOOP_INSTALL/etc/hadoop, have to be updated. The exact syntax is not very clearly given in the instructions, but you can see it here
----
========================================
<!-- These files are located in $HADOOP_INSTALL/etc/hadoop -->

<!-- Contents of file : core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- Contents of file : yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

<!-- Contents of file : mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- Contents of file : hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hduser/HDFS/namenode310514</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/hduser/HDFS/datanode310514</value>
  </property>
</configuration>
========================================
----
- After the namenode is formatted with the command hdfs namenode -format, the two commands that start Hadoop, namely start-dfs.sh and start-yarn.sh, throw a lot of fearful-looking errors and warnings, none of which cause any real harm. To avoid this ugliness on the console screen, I have created two simple shell scripts to start and stop Hadoop.
---
========================================
# -- written prithwis mukerjee
# -- file : $HADOOP_INSTALL/sbin/pm-start-hadoop.sh
# --
cd $HADOOP_INSTALL/sbin
echo 'Using Java at ' $JAVA_HOME
echo 'Starting Hadoop from ' $HADOOP_INSTALL
# --
# formatting HDFS if necessary, uncomment following
#hdfs namenode -format
# --
hadoop-daemon.sh start namenode
hadoop-daemon.sh start datanode
yarn-daemon.sh start resourcemanager
yarn-daemon.sh start nodemanager
mr-jobhistory-daemon.sh start historyserver
jps
cd ~
# -- file : $HADOOP_INSTALL/sbin/pm-stop-hadoop.sh
# --
cd $HADOOP_INSTALL/sbin
hadoop-daemon.sh stop namenode
hadoop-daemon.sh stop datanode
yarn-daemon.sh stop resourcemanager
yarn-daemon.sh stop nodemanager
mr-jobhistory-daemon.sh stop historyserver
========================================
----
Create these two shell scripts and place them in the same sbin directory ($HADOOP_INSTALL/sbin) where the usual Hadoop shell scripts are stored; you can then simply execute pm-start-hadoop.sh or pm-stop-hadoop.sh from any directory to start and stop the Hadoop services. After starting Hadoop, make sure that all 6 processes reported by the jps command are operational.
- If there is a problem with the datanode not starting, delete the HDFS directories, recreate them, redefine them in hdfs-site.xml, reformat the namenode and restart Hadoop.
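A minimal sketch of that recovery sequence is given below; it assumes the HDFS directory names used in the hdfs-site.xml shown earlier and the pm-*-hadoop.sh helper scripts described above, so adjust the paths to your own setup.
========================================
# Sketch of a datanode recovery sequence (paths follow the hdfs-site.xml above -- adjust to your setup)
# If you choose new directory names, update dfs.namenode.name.dir / dfs.datanode.data.dir in hdfs-site.xml first
pm-stop-hadoop.sh
rm -rf /home/hduser/HDFS/namenode310514 /home/hduser/HDFS/datanode310514
mkdir -p /home/hduser/HDFS/namenode310514 /home/hduser/HDFS/datanode310514
hdfs namenode -format
pm-start-hadoop.sh
jps
========================================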
The installation exercise ends with the execution of a trivial MapReduce job that calculates the value of Pi. If this executes without errors (though the value of Pi it calculates is very approximate), then Hadoop has been installed correctly.
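If you want to rerun that smoke test later, the Pi estimator ships in the standard examples jar of the Hadoop 2.2.0 distribution; the invocation is roughly as follows (the jar path below is the usual location under $HADOOP_INSTALL -- verify it on your install):
========================================
# Smoke test: estimate Pi with 2 map tasks and 5 samples each
hadoop jar $HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 5
========================================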
Next we
2. Compile and run the standard WordCount example in Java
In this exercise we will scan through a number of text files and create a list of words along with the number of times each word is found across all three files. For our exercise we will download the text files of these copyright-free books: (a) Outline of Science by Arthur Thomson, (b) Notebooks of Leonardo Da Vinci and (c) Ulysses by James Joyce.
For each book, the txt version is downloaded and kept in the directory /home/hduser/BookText as three *.txt files.
Many sample Java programs for WordCount with Hadoop are available, but you need to find one that works with the Hadoop 2.2.0 APIs. One such program is available in the CodeFusion blog. Copy, paste and save the three Java programs as WordMapper.java, SumReducer.java and WordCount.java on your machine. These three files must be compiled, linked and packaged into a jar file for Hadoop to execute. The commands to do so are given in this shell script, but they can also be executed from the command line in the directory where the Java programs are stored.
---
========================================
rm -rf WC-classes
mkdir WC-classes
# ....
javac -classpath $HADOOP_INSTALL/share/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/common/lib/commons-cli-1.2.jar -d WC-classes WordMapper.java
# .....
javac -classpath $HADOOP_INSTALL/share/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/common/lib/commons-cli-1.2.jar -d WC-classes SumReducer.java
# .....
javac -classpath $HADOOP_INSTALL/share/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/common/lib/commons-cli-1.2.jar:WC-classes -d WC-classes WordCount.java && jar -cvf WordCount.jar -C WC-classes/ .
========================================
---
Note that a directory called WC-classes has been created, and that the last command is quite long and ends with a '.'
Once this has executed, there will be a jar file called WordCount.jar, which is used when we invoke Hadoop through this set of commands stored in a shell script:
---
========================================
hdfs dfs -rm -r /user/hduser/WC-input
hdfs dfs -rm -r /user/hduser/WC-output
hdfs dfs -ls /user/hduser
hdfs dfs -mkdir /user/hduser/WC-input
hdfs dfs -copyFromLocal /home/hduser/BookText/* /user/hduser/WC-input
hdfs dfs -ls /user/hduser
hdfs dfs -ls /user/hduser/WC-input
hadoop jar WordCount.jar WordCount /user/hduser/WC-input /user/hduser/WC-output
========================================
---
- the first two commands delete two directories (if they exist) from the HDFS file system
- the fourth command creates a directory called WC-input in /user/hduser. Note that in the HDFS filesystem, which is different from the normal Ubuntu file system, user hduser has his files stored in the directory /user/hduser and NOT in the usual /home/hduser
- the fifth command copies the three *.txt files from the Ubuntu filesystem (/home/hduser/BookText/) to the HDFS filesystem (/user/hduser/WC-input)
- the last command executes Hadoop, uses the WordCount.jar created in the previous step, reads the data from the HDFS directory WC-input and sends the output to the HDFS directory WC-output, which MUST NOT EXIST before the job is started
This job will take quite a few minutes to run; the machine will slow down and may even freeze for a while. Lots of messages will be dumped on the console. Don't panic unless you see a lot of error messages; if you have done everything correctly, this should not happen. You can follow the progress of the job by pointing your browser to http://localhost:8088 and you will see an image like this
that shows a history of current and past jobs that have run since the last time Hadoop was started.
To see the files in the HDFS file system, point your browser to http://localhost:50070 and you will see a screen like this
Clicking on the link "Browse the filesystem" will lead you to a listing of the directories in HDFS.
If you go inside the WC-output directory, you will see the results generated by the WordCount program and you can "download" the same into your normal Ubuntu file system.
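From the command line, the same output can be inspected and copied back to the local file system with the usual hdfs dfs commands; a minimal sketch follows (part-r-00000 is the conventional reducer output file name, so check what -ls actually reports):
========================================
# Inspect the WordCount results and copy them back to the local file system
hdfs dfs -ls /user/hduser/WC-output
hdfs dfs -cat /user/hduser/WC-output/part-r-00000 | head -20
hdfs dfs -copyToLocal /user/hduser/WC-output/part-r-00000 /home/hduser/wordcount-results.txt
========================================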
WordCount is to Hadoop what HelloWorld is to C, and now that we have copied, compiled, linked, jarred and executed it, let us move on and try to
3. Compile and run another, non WordCount, program in Java
According to Jesper Anderson, a Hadoop expert and also a noted wit, "45% of all Hadoop tutorials count words, 25% count sentences, 20% are about paragraphs. 10% are log parsers. The remainder are helpful."
One such helpful tutorial is a YouTube video by Helen Zeng, in which she explains the problem, demonstrates the solution, and gives explicit instructions on how to execute the program. The actual code for the demo, MarketRatings.java, is available on GitHub, and the data is also available in farm-market-data.csv, which you can download in text format.
Once the code and the data are available on your machine, they can be compiled and executed using the following commands:
--
========================================
# -- compile and create Jar files
javac -classpath $HADOOP_INSTALL/share/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/common/lib/commons-cli-1.2.jar -d classes MarketRatings.java && jar -cvf MarketRatings.jar -C classes/ .
#----
hdfs dfs -mkdir /user/hduser/MR-input
hdfs dfs -copyFromLocal FarmersMarket.txt /user/hduser/MR-input
hdfs dfs -ls /user/hduser/MR-input
hadoop jar MarketRatings.jar MarketRatings /user/hduser/MR-input /user/hduser/MR-output
========================================
---
Do note that the csv file, saved in text format as "FarmersMarket.txt", was moved from the local directory to the HDFS directory and then used by the MarketRatings program in the MarketRatings.jar invoked by Hadoop.
Once again, the progress of the job can be viewed in a browser at http://localhost:8088 and the contents of the HDFS filesystem can be examined at http://localhost:50075
Since Hadoop is a Java framework, Java is the most natural language in which to write MapReduce programs for Hadoop. Unfortunately, Java is neither the easiest language to master (especially if you are from the pre-Java age, or are perplexed by the complexity of things like Eclipse and Maven that all Java programmers seem to be at ease with) nor the simplest language in which to articulate your requirements. Fortunately, the Gods of Hadoop have realized the predicament of the java-challenged community and have provided a way, the "streaming API", that allows programs written in any language -- Python, shell, R, C, C++ or whatever -- to use the MapReduce technique and use Hadoop.
So now we shall see how to
4. Use Hadoop streaming and run a Python application
The process is explained in Michael Noll's "How to run a Hadoop MapReduce program in Python", but once again the example is that of a WordCount application.
Assuming that you have Python installed on your machine, you can copy or download the two programs, mapper.py and reducer.py, given in the instructions.
We will use the book text data that we have already loaded into HDFS in the Java WordCount exercise; it is already available at /user/hduser/WC-input, so we do not need to load it again.
We simply make sure that the output directory does not exist and then call the streaming API as follows:
---
========================================
hdfs dfs -ls
hdfs dfs -rm -r /user/hduser/WCpy-output
hadoop jar $HADOOP_INSTALL/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar -mapper /home/hduser/python/wc-python/mapper.py -reducer /home/hduser/python/wc-python/reducer.py -input /user/hduser/WC-input/* -output /user/hduser/WCpy-output
hdfs dfs -ls
========================================
---
As in the case of the native Java programs, the progress of the job can be viewed in a browser at http://localhost:8088 and the contents of the HDFS filesystem can be examined at http://localhost:50075
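If the streaming job misbehaves, a handy sanity check (in the spirit of Michael Noll's tutorial) is to run the mapper and reducer locally as a plain Unix pipeline, entirely outside Hadoop; the sketch below assumes the scripts live where the streaming command above expects them and that they are executable.
========================================
# Test the Python mapper and reducer locally, without Hadoop
# (paths assume the same locations used in the streaming command above)
chmod +x /home/hduser/python/wc-python/mapper.py /home/hduser/python/wc-python/reducer.py
cat /home/hduser/BookText/*.txt |
  /home/hduser/python/wc-python/mapper.py |
  sort |
  /home/hduser/python/wc-python/reducer.py |
  head -20
========================================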
The jury is still out on whether there is a trade-off between the alleged, or perceived, high performance of a native Java MapReduce program and the simplicity and ease of use of a MapReduce program written in a scripting language. This is similar to the debate on whether programs written in Assembler are "better" than similar programs written in convenient languages like C or VB. Rapid progress in hardware technology has made such debates irrelevant, and developers have preferred convenience over performance, because the performance is made up for by better hardware.
The same debate between native Java and the streaming API in Hadoop / MR is yet to conclude. However, the Amazon Elastic Map Reduce (EMR) service has opened a new vista in using Hadoop. Using the Amazon EMR console (a GUI front end!) one can configure the number and type of machines in the Hadoop cluster, upload the map and reduce programs in any language (generally supported on the EMR machines), specify the input and output directories, and then invoke the Hadoop streaming program and wait for the results.
This eliminates the need for the entire complex process of installing and configuring Hadoop on multiple machines and reduces the MapReduce exercise (almost) to the status of an idiot-proof, online banking operation!
But the real challenge that still remains is how to convert a standard predictive analytics task, like regression, classification or clustering, into the simplistic map-reduce "counter" format and then execute it on Hadoop. This is demonstrated in this exercise that
5. Solves an actual Predictive Analytics problem with Map Reduce and Hadoop.
Regression is the first step in predictive analytics, and this video, MapReduce and R : A short example on Regression and Forecasting, is an excellent introduction both to regression and to how it can be done, first in Excel, then with R and finally with a Java program that uses MapReduce and Hadoop.
The concept is simple. There is a set of 5 y values (the dependent variable) for 5 days (each day being an x value). We need to create a regression equation that shows how y is related to x and then predict the value of y on day 10. From the perspective of regression, this is a trivial problem that can be solved by hand, let alone with Excel or R. The challenge arises when this has to be done a million times, once for, say, each SKU in a retail store or each customer of a telecom company. The challenge becomes even bigger when these million regressions need to be done every day, to predict the value of y 5 days hence on the basis of the data of the trailing 5 days! That is when you need to call in MapReduce and Hadoop.
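For reference, the arithmetic that has to be performed for each individual series is just ordinary simple linear regression over the 5 points; in standard least-squares notation (my sketch of the textbook formula, not code from the linked demo) it is:
========================================
\hat{\beta}_1 = \frac{\sum_{i=1}^{5} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{5} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},
\qquad
\hat{y}_{10} = \hat{\beta}_0 + 10\,\hat{\beta}_1
========================================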
The exercise consists of these four Java programs and the sample data, all of which you can download. Then you can follow the same set of commands given in section 2 above to compile and run the programs. The same application, ported to R and tailored to a retail scenario, is available in another blog post, Forecasting Retail Sales - Linear Regression with R and Hadoop.
========================================
rm -rf REG-classes
mkdir REG-classes
# ....
javac -classpath $HADOOP_INSTALL/share/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/common/lib/commons-cli-1.2.jar -d REG-classes Participant.java
# .....
javac -classpath $HADOOP_INSTALL/share/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/common/lib/commons-cli-1.2.jar:REG-classes -d REG-classes ProjectionMapper.java
# .....
javac -classpath $HADOOP_INSTALL/share/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/common/lib/commons-cli-1.2.jar:REG-classes -d REG-classes ProjectionReducer.java
# .....
javac -classpath $HADOOP_INSTALL/share/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/common/lib/commons-cli-1.2.jar:REG-classes -d REG-classes Projection.java && jar -cvf Projection.jar -C REG-classes/ .
# ............
hdfs dfs -rm -r /user/hduser/REG-input
hdfs dfs -rm -r /user/hduser/REG-output
hdfs dfs -ls /user/hduser
hdfs dfs -mkdir /user/hduser/REG-input
hdfs dfs -copyFromLocal /home/hduser/JavaHadoop/RegScore.txt /user/hduser/REG-input
hdfs dfs -ls /user/hduser
hdfs dfs -ls /user/hduser/REG-input
hadoop jar Projection.jar Projection /user/hduser/REG-input /user/hduser/REG-output
========================================
These five exercises are not meant to be a replacement for the full-time course on MapReduce and Hadoop that is taught at the Praxis Business School, Calcutta, but they will serve as a simple introduction to this very important technology.
If you find any errors, please leave a comment. Otherwise, if you like this post, please share it with your friends.