April 16, 2017

Spark with Python in Jupyter Notebook on a single Amazon EC2 instance

In an earlier post I have explained how to run Python+Spark program with Jupyter on local machine and in a subsequent post, I will explain how the same can be done an AWS EMR cluster of multiple machines.
In this post, I explain how this can be done on a single EC2 machine instance running Ubuntu on Amazon AWS.

The strategy described in this blog post is based on strategies described in posts written by Jose Marcial Portilla and Chris Albon. We assume that you have a basic familiarity with AWS services like EC2 machines, S3 data storage and concept of keypairs and an account with Amazon AWS. You may use your Amazon eCommerce account but you may also create one on the AWS login page. This tutorial is based on Ubuntu and assumes that  you have a basic familiarity with the SSH command and other general Linux file operation commands.

1. Login to AWS

Go to the AWS console ,login with userID and password, then go to the page with EC2 services. Unless you have used AWS before, you should have 0 Instances, 0 keypairs, 0 security groups.

2. Create (or Launch) an EC2 instance and use default options except for
a. Choose Ubuntu Server 16.04 LTS
b. Instance type t2.small
c. Configure a security group - unless you already have a security group, create a new one. Call it pyspju00. Make sure that it has at least these three rules.
d. Review and Launch the instance. At this point you will be asked to use and existing keypair or create a new one. If you create a new one, then  you can will have to download a .pem file into your local machine and use this for all subsequent operations.

Go back to the EC2 instance console and you should see your instance running :


Press the button marked Connect and you will get the instructions on how to connect to the instance using SSH.

3. Connect to your instance

Open a terminal on Ubuntu, move to the directory where the pem file is stored and connect with

ssh -i "xxxxxxx.pem" ubuntu@ec2-54-89-196-90.compute-1.amazonaws.com
you will have a different URL for your instance

From now on you will be issuing commands to the remote EC2 machine

4. Install Python / Anaconda software on remote machine

sudo apt-get update
sudo apt-get install default-jre

wget https://repo.continuum.io/archive/Anaconda3-4.3.1-Linux-x86_64.sh

get the exact URL of the Anaconda download by visiting the download site and copying the download URL

bash Anaconda3-4.3.1-Linux-x86_64.sh
Accept all the default options except on this one, say YES here
Do you wish the installer to prepend the Anaconda3 install location
to PATH in your /home/ubuntu/.bashrc ? [yes|no]
[no] >>> yes

logout of the remote machine and login back again with
ssh -i "xxxxxxx.pem" ubuntu@ec2-54-89-196-90.compute-1.amazonaws.com

5. Install Jupyter Notebook on remote machine

a. Create certificates in directory called certs

mkdir certs
cd certs
sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout pmcert.pem -out pmcert.pem

this creates a certificates file pmcert.pem ( not to be confused with the .pem file downloaded on your machine) and stores it on the remote machine

b. Jupyter configuration file

go back to home directory and execute
jupyter notebook --generate-config

now move to the .jupyter directory and edit the config file

vi jupyter_notebook_config.py
if you are not familiar with the editor, either learn how to use it or use anything else that you may be familiar with

notice that everything is commented out and rather than un-commenting specific lines, just add the following lines at the top of the file
#--------------------------------------------------------------------------------
c = get_config()

# Notebook config this is where you saved your pem cert
c.NotebookApp.certfile = u'/home/ubuntu/certs/pmcert.pem' 
# Run on all IP addresses of your instance
c.NotebookApp.ip = '*'
# Don't open browser by default
c.NotebookApp.open_browser = False  
# Fix port to 8888
c.NotebookApp.port = 8892
#--------------------------------------------------------------------------------

c. Start Jupyter without browser and on port 8892

move to new working directory
mkdir myWork
cd myWork
jupyter notebook

you will get >
Copy/paste this URL into your browser when you connect for the first time,    to login with a token:
        https://localhost:8892/?token=70b8623ec5ecf7d7d2f8b38b45112a92ec036ad3f5ed8a1d

but instead of going to local host, we will go to the EC2 machine URL in a separate browser window
https://ec2-54-89-196-90.compute-1.amazonaws.com:8892
this will throw errors about security but ignore the same and keep going until you reach this screen


in the password area, enter the value of the token that you have got in the previous step and you will see your familiar notebook screen


6. Installation of Spark

Go back to the home directory and download URL of the latest version of spark from this page.

wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
tar -xvf spark-2.1.0-bin-hadoop2.7.tgz 
mv spark-2.1.0-bin-hadoop2.7 spark210

edit the file .profile and add the following lines at the bottom
-----------------------------------------------------
export SPARK_HOME=/home/ubuntu/spark210
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
export SPARK_LOCAL_IP=LOCALHOST
------------------------------------------------------
make sure that you have the correct version of py4j-n-nn-n-src by looking into the directory where it is stored

logout from the remote machine and then login back again

7. Running Spark 2.1 with Python

[The following step may not be necessary if your versions of Spark and Python are compatible. Please see the April 13 update on this blog for an explanation of this]

cd myWork
conda create -n py35 python=3.5 anaconda
logout / login (SSH) back
cd myWork
source activate py35

now run pyspark and note that pyspark is working with Python 3.5.2 so we are all set to start Jupyter again

jupyter notebook
note the new token=12e55cacf8cdcad2f8c77f7959047034b698f4b8f67b679a that you get

The Jupyter Notebook should now be clearly visible again at
https://ec2-54-89-196-90.compute-1.amazonaws.com:8892

Now we upload the notebook containing the WordCount program and the hobbit.txt input file, from the previous blog post.


That we can execute


This completes the exercise, but before  you go, do remember to shut down the notebook, logout of the remote machine and most important terminate the instance

8. Terminate the instance

Go to the EC2 Instance console and Terminate the instance. If  you do not do this, you will continue to be billed!



No comments: