June 25, 2016

Spark, Python & Data Science -- Tutorial

Hadoop is history and Spark is the new kid on the block, the darling of the Big Data community. Hadoop was unique. It was a pioneer that showed how "easy" it was to replace large, expensive server hardware with a collection, or cluster, of cheap, low-end machines and crunch through gigabytes of data using a new programming style called Map-Reduce that I have explained elsewhere. But "easy" is a relative term. Installing Hadoop or writing the Java code for even simple Map-Reduce tasks was not for the faint-hearted. So we had Hive and Pig to simplify matters. Then came tools like H2O and distributions like Hortonworks to make life even simpler for non-geeks who wanted to focus purely on the data science piece without having to bother about technology. But as I said, with the arrival of Spark, all that is now history!

Spark was developed at the University of California at Berkeley and appeared on the horizon for data scientists in 2013 at an O'Reilly conference. You can see the presentations made there, but the following one will give you a quick overview of what this technology is all about.



But the three real reasons why Spark has become my current heart-throb are these:
  1. It is ridiculously simple to install. Compared to the weeks that it took me to understand, figure out and install Hadoop, this was over in a few minutes: download a zip, unzip it, define some paths and you are up and running.
  2. Spark is super smart with memory management, so unlike Hadoop, starting Spark on your humble laptop will not kill it. You can keep working with other applications even when Spark is running -- unless of course you are actually crunching through 5 million rows of data. Which is what I actually did, on my laptop.
  3. And this is the killer. Coding is so simple. The 50 lines of Java code -- all that public static void main() crap -- needed in Hadoop reduce to two or three lines of Scala or Python code. Seriously, not joking.
And unlike the Mahout machine learning library of Hadoop that everyone talked about but no one could really use, the Spark machine learning library, though based on Mahout code, is something that you can have up and running by the end of this tutorial itself. So enough of chit-chat, let us get down to action and see how easy it is to get going with Spark and Python.

In this post, we will show how to install and configure Spark, run the famous WordCount program so beloved of the Hadoop community, run a few machine learning programs and finally work our way through a complete data science exercise involving descriptive statistics, logistic regression, decision trees and even SQL -- the whole works.

Though, in principle, Spark should work on Windows, in reality it is not worth the trouble. Don't even try it. Spark is based on Hadoop and Hadoop is never very comfortable with Windows. If you have access to a Linux machine, either as a full machine to yourself or one that dual boots Windows and Linux, then you may skip section [A] on creating virtual machines and go directly to section [B] on installing Spark.

Also please understand that you need a basic familiarity with the Linux platform. If you have no clue at all what "sudo apt-get ..." means, or have never used the "vi" or an equivalent text editor, then it may be a good idea to have someone with you who knows these things during the install phase. Please do understand that this is not like downloading an .exe file in Windows and double-clicking on it to install software. But even if you have only a rudimentary understanding of Linux and can follow instructions, you should be up and running.


A] Creating a Virtual Machine running Ubuntu on Windows

If your machine has only Windows -- as is the case with most Windows 8 and even Windows 10 users -- then you will have to create a Linux Virtual Machine and carry out the rest of the exercise on the VM. This exercise was comfortably carried out on an 8GB RAM laptop, but even 6GB should suffice.

  1. Download the Oracle VirtualBox software [ including the Extension Pack ] for Windows and install it on your Windows machine.
  2. Download an Ubuntu image for the VirtualBox. Make sure that you get the image for VirtualBox and not the VMware version! This is a big download, nearly 1GB, and may take some time. What you get is a zip file that you can unzip to obtain a .vdi file, a virtual disk image. Note the userid and password of the admin user that will be present in the VM [ usually the userid is osboxes and the password is osboxes.org, but this may differ ].
  3. Start the VirtualBox software and create a new virtual machine using the .vdi file that you have just downloaded and unzipped. You can give the machine any name, but it must be defined as Linux, Ubuntu.
    1. If you are not sure how to create a virtual machine, follow these instructions. Remember to allocate at least 6GB RAM to the virtual machine
    2. If your machine is 64 bit but VirtualBox is only showing 32 bit options then it means that virtualization has been disabled on your machine. Do not panic, simply follow the instructions given here. If you don't know how to boot your machine into the BIOS then see boot-keys.org
    3. Once your Ubuntu virtual machine starts, you will find that it runs in a small window and is quite inconvenient to use. To make the VM occupy the full screen you need to install the Guest Additions for VirtualBox by following the instructions given here [ sudo apt install virtualbox-guest-additions-iso ], followed by loading the CD image as explained here
    4. In the setup options of the VM you can define shared folders between the Windows host OS and the Ubuntu guest OS. However, the shared folder will be visible but not accessible to the Ubuntu userid until you do this
    5. Steps 3 and 4 are not really necessary for Spark but if you skip them you may find it difficult or uncomfortable to work inside a very cramped window
  4. Strangely enough, the VM image does not come with Java, which is essential for Spark. So please install Java by following these instructions.
Ubuntu is so cool! Who wants Windows?


B] Install Spark

Once we have an Ubuntu machine, whether real or virtual, we can now focus on getting Python and Spark.
  1. Python - The Ubuntu 16.04 virtual machine comes with Python 2.7 already installed, which is adequate if you want to use Spark at the command line. However, if you want to use iPython notebooks [ and our subsequent tutorial needs notebooks ] it is better to install them.
    1. There are many ways to install iPython notebooks but the easiest way would be to download and install Anaconda
      1. Note that this needs to be downloaded inside the Ubuntu guest OS and not the Windows host OS if  you are using a VM.
      2. When the install script asks if Anaconda should be placed in the system path, please say YES
    2. Start python and ipython from the Ubuntu prompt and you should see that Anaconda's version of python is being loaded.
  2. Spark - the instructions given here have been derived from this page but there are some significant deviations to accommodate the current version of the ipython notebook.
    1. Download the latest version of Spark from here.
      1. For the package type, DO NOT choose the source code, otherwise you will have to compile it yourself. Choose instead the package pre-built for the latest version of Hadoop.
      2. Choose direct download, not a mirror.
    2. Unzip the tgz file, move the resultant directory to a convenient location and give it a simple name. In our case it was /home/osboxes/spark16
    3. Add the following lines to the end of file .profile
      1. export SPARK_HOME=/home/osboxes/spark16
      2. export PATH=$SPARK_HOME/bin:$PATH
      3. export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
      4. export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
        1. to get the correct version of the py4j-n.n-src.zip file, look in $SPARK_HOME/python/lib and use the actual file name found there
        2. the last two paths are required because in many cases the py4j library is not found
      5. export SPARK_LOCAL_IP=LOCALHOST
    4. For the changes to take effect, restart the VM, or source your .profile file using the following command: source ~/.profile
    5. To start Spark in command line mode, enter the "pyspark" command and you should see the familiar Spark screen. To quit, enter exit()
    6. To start Spark in the ipython notebook format, enter the command $IPYTHON_OPTS="notebook" pyspark. Please note that the older strategy of using ipython profiles for starting the notebook may no longer work, as the current version of jupyter does not support profiles anymore, which is why this approach is used. [ You may skip this part and go to the next step ] This will start the server and make it available at port 8888 on localhost. To quit, press ctrl-c twice in quick succession.

    7. An alternative way of starting the notebook, not involving the IPYTHON_OPTS variable, is shown here. This is easier.
      1. Start notebook with $ipython notebook ( or alternatively, $jupyter notebook)
      2. Execute these two lines from the first cell of the notebook
        1. from pyspark import  SparkContext
        2. sc = SparkContext( 'local', 'pyspark')
  3. Now that we have Spark running on our Ubuntu machine, check out its status at http://localhost:4040. A quick sanity check of the SparkContext is sketched below.
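
To confirm that everything is wired up, here is a minimal sanity check, assuming the SparkContext sc created above (pyspark creates it for you at the command line). Run it at the pyspark prompt or in the first notebook cell:

# distribute the numbers 1..100 as an RDD and run a trivial job on them
nums = sc.parallelize(range(1, 101))
print nums.count()     # should print 100
print nums.sum()       # should print 5050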


C] Running Simple programs

If you are not familiar with Python, do go through the first few exercises of Learning Python the Hard Way, and if the concept of a notebook is alien to you, then go through this tutorial.

Go to this page and scroll down to the section "Interacting with Spark" and follow the instructions there to run the WordCount application. This will need a txt file as input and any text file will do. If you cannot find a file, create one with vi or gedit, write a few sentences in it and use that. Enter each of the following lines as a command at the pyspark prompt:

text = sc.textFile("datafile.txt")      # load the file as an RDD of lines
print text
from operator import add
def tokenize(text):
    return text.split()                 # split a line into words
words = text.flatMap(tokenize)          # RDD of individual words
print words
wc = words.map(lambda x: (x, 1))        # pair each word with a count of 1
print wc.toDebugString()
counts = wc.reduceByKey(add)            # add up the 1s for each word
counts.saveAsTextFile("output-dir")     # write the results to a new directory

The final output, in Hadoop style, will be stored in a directory called "output-dir". Remember that Hadoop, and hence Spark, does not allow an existing output directory to be reused.
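
You can also pull a few results back to the driver and eyeball them without writing anything to disk; a small sketch, run before (or instead of) saveAsTextFile:

# the ten most frequent words, highest count first
top10 = counts.takeOrdered(10, key=lambda x: -x[1])
for word, freq in top10:
    print word, freq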

The same commands can also be entered one by one in the ipython notebook and you would get the same result



This establishes that you have Spark and Python working smoothly on your machine. Now for some real data science


Update 29 June 2017

While it is not too difficult to set up your machine for Spark, there is a far easier way to do so, which was pointed out to me by my student Sanket Maheshwari. Databricks, an online platform built by the folks who actually created Spark, allows you to run Spark + Python (plus R and Scala as well) programs through a Jupyter-like notebook environment on an actual AWS cluster without any fuss. To see how it works, create a free account in the community edition and you are ready to go with a single-machine cluster.

D] Data Science with Spark

[New 24Jul16] Unlike Hadoop / Mahout, the machine learning library of Spark is quite easy to use. There are tons and tons of samples, including machine learning samples, available. These samples, along with the sample data, are also available in the Spark home directory that gets created during the installation of Spark as described above. You can run these programs using the spark-submit command as explained on this page, after making small changes to bring them into the format described there; the general pattern is sketched below. The basic template for converting these samples to run with spark-submit, and two sample programs for clustering and logistic regression, are available for download here.
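
The actual template is in the download linked above; as a rough sketch of the spark-submit pattern (the file name and app name below are just placeholders), a submittable script looks something like this:

# wordcount_submit.py -- hypothetical file name; run with: spark-submit wordcount_submit.py
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("WordCountSubmit")   # name that appears in the Spark UI
    sc = SparkContext(conf=conf)                       # spark-submit does not create sc for you

    text = sc.textFile("datafile.txt")
    counts = text.flatMap(lambda line: line.split()) \
                 .map(lambda w: (w, 1)) \
                 .reduceByKey(lambda a, b: a + b)
    print counts.take(10)

    sc.stop()                                          # clean shutdown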

To understand the nuances of the MLlib library, read the documentation and then, for example, follow the section on k-means. For more details of the API and the k-means model, follow the links.
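
To give a taste of what MLlib code looks like, here is a minimal k-means sketch along the lines of the official examples; the toy data is made up, and the exact arguments of train() vary slightly between Spark versions:

from pyspark.mllib.clustering import KMeans

# four made-up 2-D points forming two obvious clusters
data = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

model = KMeans.train(data, 2, maxIterations=10)   # fit two clusters
print model.clusterCenters                        # the two centroids
print model.predict([0.5, 0.5])                   # cluster id of a new point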

Jose A Dianes, a mentor at codementor, has created a very comprehensive tutorial on data science and his ipython notebooks are available for download at github. This uses actual data from a KDD cup competition and will lead the user through

  • Basics of RDD datasets
  • Exploratory Data Analysis with Descriptive Statistics
  • Logistic Regression
  • Classification with Decision Trees
  • Usage of SQL
After going through this tutorial, one will have a good idea of how Spark and Python can be used to address a full-cycle data science problem, right from data gathering to building models.
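
As a small foretaste of the descriptive statistics part, MLlib's Statistics.colStats gives column-wise summaries of a numeric RDD; a minimal sketch on made-up data:

from pyspark.mllib.stat import Statistics

# three made-up rows with two numeric columns each
rows = sc.parallelize([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

summary = Statistics.colStats(rows)
print summary.mean()       # per-column means
print summary.variance()   # per-column variances
print summary.count()      # number of rows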


Update 13 April 2017 : Version Issues

If you were to download, install and configure Spark with Python today, following the steps given in this tutorial, you would face problems because Spark 2.1 [ the current latest version ] is totally incompatible with Python 3.6+ [ the current latest version ]. However, do not panic, all is not lost! Stackoverflow is always there to solve problems!

First, the instructions that we have given are all valid, please follow them. However, after installing Python 3.6 you need to create a Python 3.5 environment with Conda as explained here. More details about Conda environments are given here.

Before you start Jupyter Notebook you need to issue the following commands :
conda create -n py35 python=3.5 anaconda
source activate py35

This will create an environment where everything will work like a charm.  Once you are through, remember to issue the following command to go back to the normal environment

source deactivate

The other challenge is that Python 3 is a "little" different from Python 2 and some changes are necessary to the programs. One easy change is that all print statements have to be written as functions: print(...), but there are other non-trivial changes as well. In the examples given here, the WordCount and K-Means programs will run without major changes, but the Logistic Regression program will need to be changed as explained in this stackoverflow post.
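
For example, under Python 3 the first few WordCount lines from section [C] become:

text = sc.textFile("datafile.txt")
print(text)                 # print is now a function, so parentheses are required
words = text.flatMap(lambda line: line.split())
print(words.take(5))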


Update : 13 Apr 2017 : This post explains how Spark and Python work on a local machine. My subsequent blog posts explain how Spark+Python+Jupyter works on (a) a single EC2 instance on Amazon AWS and on (b) an EMR cluster on Amazon AWS.

Spark is a part of the curriculum in the Business Analytics program at Praxis Business School, Calcutta. At the request of our students I have created an Oracle Virtual Appliance that you can download [ 4GB though ], import into your VirtualBox and then go directly to section [D]! No need for any installation and configuration of Ubuntu, Java, Anaconda, Spark or even creating the demo MLLIB code. This VM has been configured with 4GB RAM, which just about suffices. Increase this to 6GB if feasible. -- Updates : [28Aug16] - New Virtual Box (password = "osboxes.org")

Update : 22 Apr 2018 : With the arrival of Google Colaboratory, the need to install software has disappeared completely. Now you can have a virtual machine with a GPU and start working on data science right away. To see how you can use Spark for the WordCount program, see these notebooks. We show how to upload data files from a local machine OR read them directly from Google Drive.
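
The notebooks have the details; in outline, uploading a file from the local machine and mounting Google Drive in Colab look roughly like this (the file path is just a placeholder):

# upload a file from the local machine into the Colab VM
from google.colab import files
uploaded = files.upload()            # opens a file picker in the browser

# or mount Google Drive and read a file from it directly
from google.colab import drive
drive.mount('/content/drive')        # asks for an authorization code
text = open('/content/drive/My Drive/datafile.txt').read()   # hypothetical path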

I was invited to Cypher2016 where I delivered a lecture and demonstration on Python, Spark, Machine Learning and running this on AWS
