June 25, 2016

Spark, Python & Data Science -- Tutorial

Hadoop is history and Spark is the new kid on the block, the darling of the Big Data community. Hadoop was unique. It was a pioneer that showed how "easy" it was to replace large, expensive server hardware with a collection, or cluster, of cheap, low-end machines and crunch through gigabytes of data using a new programming style called Map-Reduce that I have explained elsewhere. But "easy" is a relative term. Installing Hadoop or writing the Java code for even simple Map-Reduce tasks was not for the faint-hearted. So we had Hive and Pig to simplify matters. Then came tools like H2O and distributions like Hortonworks to make life even simpler for non-geeks who wanted to focus purely on the data science piece without having to bother about technology. But as I said, with the arrival of Spark, all that is now history!

Spark was developed at the University of California at Berkeley and appeared on the horizon for data scientists in 2013 at an O'Reilly conference. You can see the presentations made there, but the following one will give you a quick overview of what this technology is all about.

But the three real reasons why Spark has become my current heart-throb are these:
  1. It is ridiculously simple to install. Compared to the weeks that it took me to understand, figure out and install Hadoop, this was over in a few minutes. Download a zip, unzip it, define some paths and you are up and running.
  2. Spark is super smart with memory management, so unlike Hadoop, starting Spark on your humble laptop will not kill it. You can keep working with other applications even when Spark is running -- unless of course you are actually crunching through 5 million rows of data. Which is what I actually did, on my laptop.
  3. And this is the killer. Coding is so simple. The 50 lines of Java code -- all that public static void main() crap -- needed in Hadoop reduce to two or three lines of Scala or Python code. Seriously, not joking.
And unlike the Mahout machine learning library of Hadoop that everyone talked about but no one could really use, the Spark machine learning library is something that you can be running by the end of this tutorial itself. So enough of chit-chat; let us get down to action and see how easy it is to get going with Spark and Python.

In this post, we will show how to install and configure Spark, run the famous WordCount program so beloved of the Hadoop community, run a few machine learning programs and finally work our way through a complete data science exercise involving descriptive statistics, logistic regression, decision trees and even SQL -- the whole works.

Though, in principle, Spark should work on Windows, the reality is that it is not worth the trouble. Don't even try it. Spark is based on Hadoop, and Hadoop has never been very comfortable with Windows. If you have access to a Linux machine, either a full machine to yourself or one that dual-boots Windows and Linux, then you may skip section [A] on creating virtual machines and go directly to section [B] on installing Spark.

Also, please understand that you need a basic familiarity with the Linux platform. If you have no clue at all about what "sudo apt-get ..." means, or have never used "vi" or an equivalent text editor, then it may be a good idea to have someone with you who knows these things during the install phase. Please do understand that this is not like downloading an .exe file in Windows and double-clicking on it to install software. But if you have even a rudimentary understanding of Linux and can follow instructions, you should be up and running.

A] Creating a Virtual Machine running Ubuntu on Windows

If your machine has only Windows -- as is the case with most Windows 8 and even Windows 10 users -- then you will have to create a Linux virtual machine and carry out the rest of the exercise on the VM. This exercise was comfortably carried out on an 8GB RAM laptop, but even 6GB should suffice.

  1. Download Oracle VirtualBox [ including Extension pack ] software for Windows and install it on your Windows machine.
  2. Download an Ubuntu image for VirtualBox. Make sure that you get the image for VirtualBox and not the VMware version! This is a big download, nearly 1GB, and may take some time. What you get is a zip file that you can unzip to obtain a .vdi file, a virtual disk image. Note the userid and password of the admin user that will be present in the VM [ usually the userid is osboxes and the password is osboxes.org, but this may be different ]
  3. Start the VirtualBox software and create a new virtual machine using the .vdi file that you have just downloaded and unzipped. You can give the machine any name, but it must be defined as Linux, Ubuntu.
    1. If you are not sure how to create a virtual machine, follow these instructions. Remember to allocate at least 6GB RAM to the virtual machine
    2. If your machine is 64 bit but VirtualBox is only showing 32 bit options, it means that virtualization has been disabled on your machine. Do not panic; simply follow the instructions given here. If you don't know how to boot your machine into the BIOS, see boot-keys.org
    3. Once your Ubuntu virtual machine starts, you will find that it runs in a small window that is quite inconvenient to use. To make the VM occupy the full screen, you need to install Guest Additions to VirtualBox by following the instructions given here [ sudo apt install virtualbox-guest-additions-iso ], followed by loading the CD image as explained here
    4. In the setup options of the VM you can define shared folders between the Windows host OS and the Ubuntu guest OS. However, the shared folder will be visible but not accessible to the Ubuntu userid until you do this
    5. Steps 3 and 4 are not really necessary for Spark, but if you skip them you may find it difficult or uncomfortable to work inside a very cramped window
  4. Strangely enough, the VM image does not come with Java, which is essential for Spark. So please install Java by following these instructions.
Ubuntu is so cool! Who wants Windows?

B] Install Spark

Once we have an Ubuntu machine, whether real or virtual, we can now focus on getting Python and Spark.
  1. Python - The Ubuntu 16.04 virtual machine comes with Python 2.7 already installed, which is adequate if you want to use Spark at the command line. However, if you want to use iPython notebooks [ and our subsequent tutorial needs notebooks ] it is better to install them.
    1. There are many ways to install iPython notebooks, but the easiest way is to download and install Anaconda
      1. Note that this needs to be downloaded inside the Ubuntu guest OS, not the Windows host OS, if you are using a VM.
      2. When the install script asks if Anaconda should be placed in the system path, please say YES
    2. Start python and ipython from the Ubuntu prompt and you should see that Anaconda's version of Python is being loaded.
  2. Spark - the instructions given here have been derived from this page, but with some significant deviations to accommodate the current version of the ipython notebook.
    1. Download the latest version of Spark from here.
      1. For the package type, DO NOT CHOOSE source code, as otherwise you will have to compile it. Choose instead the package pre-built for the latest Hadoop.
      2. Choose direct download, not a mirror.
    2. Unzip the tgz file, move the resultant directory to a convenient location and give it a simple name. In our case it was /home/osboxes/spark16
    3. Add the following lines to the end of the file .profile:

export SPARK_HOME=/home/osboxes/spark16
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH

      To get the correct version of the py4j-n.n-src.zip file, go to $SPARK_HOME/python/lib and check the actual file name. The last two paths are required because in many cases the py4j library is otherwise not found.
    4. To start Spark in command line mode, enter the command "pyspark" and you should see the familiar Spark screen. To quit, enter exit()
    5. To start Spark in the ipython notebook format, enter the command IPYTHON_OPTS="notebook" pyspark. Please note that the older strategy of using profiles for starting the ipython notebook may not work, as the current version of jupyter no longer supports profiles; hence this approach. This will start the server and make the notebook available at port 8888 on localhost. To quit, press ctrl-c twice in quick succession.
    6. An alternative way of starting the notebook, which does not involve IPYTHON_OPTS, is shown here. This is easier.
      1. Start the notebook with ipython notebook (or, alternatively, jupyter notebook)
      2. Execute these two lines in the first cell of the notebook:
        1. from pyspark import SparkContext
        2. sc = SparkContext('local', 'pyspark')
  3. Now that we have Spark running on our Ubuntu machine, check out its status at http://localhost:4040

C] Running Simple programs

If you are not familiar with Python, do go through the first few exercises of Learn Python the Hard Way, and if the concept of a notebook is alien to you, go through this tutorial.

Go to this page and scroll down to the section "Interacting with Spark" and follow the instructions there to run the WordCount application. This needs a txt file as input and any text file will do. If you cannot find one, create a file with vi or gedit, write a few sentences in it, and use that. Enter each of these lines as a command at the pyspark prompt:

text = sc.textFile("datafile.txt")   # load the text file into an RDD
print text
from operator import add
def tokenize(text):                  # split each line into words
    return text.split()
words = text.flatMap(tokenize)       # flatMap: one flat RDD of words
print words
wc = words.map(lambda x: (x, 1))     # map: pair each word with a 1
print wc.toDebugString()
counts = wc.reduceByKey(add)         # reduceByKey: add up the counts
counts.saveAsTextFile("output-dir")  # write the results out

The final output, in Hadoop style, will be stored in a directory called "output-dir". Remember that Hadoop, and hence Spark, does not allow the same output directory to be reused.
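If you want to see what the flatMap / map / reduceByKey pipeline is actually doing before (or without) firing up Spark, the same logic can be mimicked in a few lines of plain Python. This is only an illustration of the logic, with two made-up sentences standing in for the input file; it is not Spark code:

```python
from collections import defaultdict
from operator import add

def tokenize(text):
    return text.split()

# two "lines" standing in for the RDD created by sc.textFile()
lines = ["spark is fast", "spark is simple"]

# flatMap: tokenize every line and flatten into a single list of words
words = [w for line in lines for w in tokenize(line)]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: group by word and combine the counts with add
counts = defaultdict(int)
for word, n in pairs:
    counts[word] = add(counts[word], n)

print(dict(counts))
```

Spark does exactly this, except that the words are spread across the machines of a cluster and the per-word reduction happens in parallel.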

The same commands can also be entered one by one in the ipython notebook and you will get the same result.

This establishes that you have Spark and Python working smoothly on your machine. Now for some real data science.

D] Data Science with Spark

[New 24Jul16] Unlike Hadoop / Mahout, the machine learning library of Spark is quite easy to use. There are tons of samples available, including machine learning samples. These samples, along with sample data, are also available in the Spark home directory that gets created during the installation of Spark as described above. You can run these programs using the spark-submit command, as explained on this page, after making small changes to bring them into the format described there. The basic template for converting these samples to run with spark-submit, and two sample programs for clustering and logistic regression, are available for download here.

To understand the nuances of the MLlib library, read the documentation and then, for example, follow the one on k-means. For more details of the API and the k-means models, follow the links.
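To build some intuition for what k-means actually does before calling the MLlib version, here is a toy, single-machine sketch of Lloyd's algorithm on 1-D points. The data and parameters are made up for illustration; MLlib's KMeans does the same alternating assign/update steps, but distributed across the cluster:

```python
import random

def kmeans(points, k, iterations=20, seed=42):
    """Toy 1-D k-means (Lloyd's algorithm): alternate between
    assigning points to centres and recomputing the centres."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iterations):
        # assignment step: attach each point to its nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # update step: move each centre to the mean of its cluster
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

# two obvious clusters, one around 1.0 and one around 10.0
data = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
centres = kmeans(data, 2)
print(centres)  # approximately [1.0, 10.0]
```

Once this picture is clear, the MLlib version is just the same idea with the assignment and averaging steps running on RDD partitions in parallel.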

Jose A Dianes, a mentor at codementor, has created a very comprehensive tutorial on data science and his ipython notebooks are available for download at github. This uses actual data from a KDD cup competition and will lead the user through

  • Basics of RDDs
  • Exploratory Data Analysis with Descriptive Statistics
  • Logistic Regression
  • Classification with Decision Trees
  • Usage of SQL
After going through this tutorial, one will have a good idea of how Spark and Python can be used to address a full-cycle data science problem, right from data gathering to building models.
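As a warm-up for the logistic regression part of that tutorial, the core idea -- fitting weights by gradient descent on the log-loss -- can be sketched in plain Python. This is a toy, one-feature example with made-up data, not MLlib code; MLlib performs the same optimisation, but distributed over an RDD of labelled points:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(data, lr=0.5, epochs=200):
    """Toy one-feature logistic regression trained with
    stochastic gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(w * x + b)
            # gradient of the log-loss for a single example
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

# points below 0 are class 0, points above 0 are class 1
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b = train_logistic(data)
predict = lambda x: 1 if sigmoid(w * x + b) > 0.5 else 0
print([predict(x) for x, _ in data])  # [0, 0, 1, 1]
```

The trained weight w ends up positive and the bias near zero, so the model reproduces the labels; the tutorial's version does this at scale on the KDD cup data.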

Spark is a part of the curriculum in the Business Analytics program at Praxis Business School, Calcutta. At the request of our students, I have created an Oracle Virtual Appliance that you can download [ 4GB though ], import into your VirtualBox, and go directly to section [D]! No need for any installation and configuration of Ubuntu, Java, Anaconda, Spark, or even creating the demo MLlib code. This VM has been configured with 4GB RAM, which just about suffices. Increase this to 6GB if feasible. -- Updates: [28Aug16] - New Virtual Box (password = "osboxes.org")

I was invited to Cypher2016, where I delivered a lecture and demonstration on Python, Spark, Machine Learning, and running all this on AWS.

