April 16, 2017

Spark with Python in Jupyter Notebook on Amazon EMR Cluster

In the previous post, we saw how to run a Spark - Python program in a Jupyter Notebook on a standalone EC2 instance on Amazon AWS, but the real interesting part would be to run the same program on genuine Spark Cluster consisting of one master and multiple slave machines.

The process is explained pretty well in Tom Zeng's blog post and we follow the same strategy here.

1. Install AWS Command Line services by following these instructions.
2. Configure the AWS CLI with your AWS credentials using these instructions.

in particular, the following is necessary
$ aws configure
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: ENTER

you will have to use your own AWS Access Key ID and AWS Secret Access Key of course!

3. Execute the following command :

aws emr create-cluster --release-label emr-5.2.0 \
  --name 'Praxis - emr-5.2.0 sparklyr + jupyter cli example' \
  --applications Name=Hadoop Name=Spark Name=Tez Name=Ganglia Name=Presto \
  --ec2-attributes KeyName=pmapril2017,InstanceProfile=EMR_EC2_DefaultRole \
  --service-role EMR_DefaultRole \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c3.4xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=c3.4xlarge \
  --region us-east-1 \
  --log-uri s3://yj01/emr-logs/ \
  --bootstrap-actions \
    Name='Install Jupyter notebook',Path="s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh",Args=[--r,--julia,--toree,--torch,--ruby,--ds-packages,--ml-packages,--python-packages,'ggplot nilearn',--port,8880,--password,praxis,--jupyterhub,--jupyterhub-port,8001,--cached-install,--copy-samples]

note the options have been modified a little
a) number of machines is 1+2
b) the S3 bucket used is yj01 in s3://yj01/emr-logs/
c) the password is set as "praxis"
d) the directive to store notebooks on S3 has been removed as this is causing problems. Now the notebooks will be stored in the home directory of the user=hadoop on the master node

this command returns ( or something similar)
    "ClusterId": "j-2LW0S8SAX5OC4"

4. Log in to the AWS console and go to the EMR section.

The cluster will show up as starting

and will then move into Bootstrapping mode

and after about 22 minutes will move into Waiting mode. If that happens earlier then there could have been an error in the bootstrap process. Otherwise you will see this

5. Login to Jupyter hub
Note the URL of the Master Public DNS : ec2-54-82-207-124.compute-1.amazonaws.com
and point your browser to : http://ec2-54-82-207-124.compute-1.amazonaws.com:8001

Login with user = hadoop and password = praxis  ( supplied in the command) and you will get the familiar Notebook interface

There will be samples directory containing sample programs covering a wide range of technologies and data science applications. Extremely useful to cut-and-paste from!

Create a work directory and upload the Wordcount and the Hobbit.txt file, used in the original Spark+Python blog post

Notice the changes necessary for cluster operations

Cells 1 -3 reflect the fact that we are now using a cluster, not a local machine
Cells 4, 12 show that the program is NOT accessing the local file storage on the Master Node but the HDFS file system on the cluster

To explore the HDFS file system, go back to this screen

and then press "View All" ... Click on the HDFS link and take your browser to
and see

and you can browse to the hadoop user home HDFS directory where the "hobbit.txt" file was stored and where the "hobbit-out" directory has been created by the Spark program. In fact, all HDFS operations can be carried out from the Notebook cells like this

!hdfs dfs -put hobbit.txt /user/hadoop/
!hdfs dfs -get /user/hadoop/hobbit-out/part* .
!hdfs dfs -ls hobbit-out/
!hdfs dfs -rm hobbit-out/*
!hdfs dfs -rmr hobbit-out
!hdfs dfs -rm hobbit.txt

You can also see the various Hadoop resources -- including the two active nodes through this interface
After Jupyterhub is started, the notebooks can be accessed by going directly to port 8880 and using the password=praxis

Finally it is time to
6. Terminate the cluster!

Go to the cluster console, choose the active cluster and press the terminate button. If termination protection is in place, you would need to turn it off.

Notes :
1. The same task can be done through the EMR console, without having to use the AWS CLI command line because most of the parameters used in this command can be passed through the console GUI. For example, look at this page.
2. Because of the error with the S3 we are storing our programs and data in the master node where it gets deleted when the cluster is terminated.  Ideally this should be placed in an s3 bucket using this option --s3fs
3. The default security group created by the create-cluster command does not allow SSH into port 22. However if this is added, then standard SSH commands can be used to access and transfer files into the master
4. Tom Zeng's post says that SSH tunnelling is required. However I did not need to use process nor follow any of the complex FoxyProxy business to access. Not sure why. Simple access to port 8001 and 8880 worked fine -- Mystery?

No comments: