February 26, 2014

Data Science - a DIY approach

Everybody wants to become a data scientist but given the huge number of tools and techniques that are available no one really seems to know where to start and how to go about acquiring the right skills.
image taken from http://www.rosebt.com

In the process of teaching data science to students at the Praxis Business School I have realised that there is a huge amount of information that is freely available on the web and, thanks to Google, one can find it very easily. If one is motivated enough, one can learn data science on one's own. In an earlier post I gave a very broad Introduction to Data Science, but in this post, which I plan to update every now and then, I have tried to assemble a collection of tutorials, grouped by topic, that will help anyone get a good grip on this subject by actually doing something practical.

R : Statistical Programming Language

Forget about SAS, SPSS and other proprietary languages that want to extort money from you and simply invest your time and effort in mastering R, a free but excellent tool that is used most widely in data science.

Here is a "Level 0 tutorial for getting started with R" [ alternate link ] by April Galyardt that will get you started. This tutorial will tell you how to install R but you should also consider installing RStudio, a free graphical development environment that makes R easier to use.

Galyardt's tutorial is very comprehensive but if you are the impatient type for whom doing anything in depth is too tiring, and yet you want to be familiar with this cutting edge technology, then I suggest that you go through this 5-part Beginners Guide to R from Computerworld.

Both these tutorials require you to install R on your machine, which is not too difficult, but if even that is too much for you, I suggest that you visit Data Camp, or Try R at the O'Reilly Code School. This will give you a flavour of what it is that people do with R.

Statistics

The ancient science of statistics has suddenly got a new lease of life and celebrity status because of a sudden surge of interest in data science. If you leave aside the Map Reduce and the associated world of Hadoop, then [ statistics + statistical tools ] is virtually synonymous with data science. Hence it is essential to get a good grip of statistics.

Since statistics is an ancient science that is taught in many undergraduate and postgraduate courses, there are many excellent books on it, but today's data scientist must be proficient not only in statistics but also in a tool like R.

An excellent place to start learning Statistics along with R is the open source, copyright free OpenIntro website. Not only will this allow you to download an excellent book in PDF format but it will also offer you a set of lab exercises based on R where you can try out what you have learnt.

Other fairly comprehensive tutorials on statistics with R are available at Chi Yau's r-tutor.com, from Kelly Black of Clarkson University, and from King of Coastal Carolina University. However, if you are seriously interested in a career in data science you should purchase a regular text book on Statistics for reference purposes and for this you can look at either Statistics for Management by Levin or Applied Business Statistics by Black.

A key concept in Statistics is that of "tests of significance" and one needs to have the big picture of when to use which test. This is explained rather well, and concisely, on this page (with the corresponding R code given here).
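To make the idea of a test of significance concrete, here is a minimal sketch of one such test, a two-sided permutation test for a difference in means, written with nothing but the Python standard library. The function name, the sample data and the choice of 10,000 resamples are all assumptions made for illustration and do not come from any of the tutorials linked above.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the fraction of random relabellings whose absolute
    difference in means is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # pretend the group labels mean nothing
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_resamples


# Hypothetical measurements, invented for this example
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
treated = [13.0, 12.9, 13.4, 12.8, 13.1, 13.3]

p_value = permutation_test(control, treated)
print(f"Estimated p-value: {p_value:.4f}")
```

A small p-value says that a gap this large would rarely turn up if the two groups really came from the same population. In real work you would normally reach for a tested library routine instead of rolling your own.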

Traditional statistics is generally restricted to descriptive and interpretive statistics, which are well covered in these links. However, data science is more interested in predictive statistics, which is the domain of Data Mining and Machine Learning, and these will be addressed later. In the meantime, check out this book for a quick Introduction to Data Science and also download a PDF file of the book.

For a look at using R for more advanced tasks you can take a look at this "meta book" that points you to other books that show you how to tackle more complex problems.
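The difference between describing data and predicting from it can be seen in a few lines. The sketch below first summarises a made-up data set and then fits a straight line by ordinary least squares to predict an unseen value. The variable names and the numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from statistics import mean, stdev


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: advertising spend and resulting sales
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

# Descriptive: summarise what has already happened
print("mean sales:", mean(sales), "std dev:", round(stdev(sales), 2))

# Predictive: estimate what would happen at a spend we have not seen
slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 6 + intercept
print("predicted sales at spend 6:", predicted)
```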

Python with Anaconda

Python is another very strong challenger for the position of the best, or most powerful, tool for data science and it makes sense to get the hang of it. This is because it not only supports much of the functionality that R provides, through the Pandas library, but is also quite compatible with Hadoop and the world of Map-Reduce.

A quick way to get the hang of Python without installing it on your machine is to try out the Python tutorial at AfterHoursProgramming or somewhere similar. But a better way to get going would be to download Anaconda, which not only installs Python but also provides a rich GUI to execute Python programs in addition to the standard command line approach. Once you have your Python (with or without Anaconda) in place, you can try out this workshop from Open Tech School free of cost. Or you can jump straight into this very comprehensive tutorial on using Python for data science (alternate) that actually uses data from a Kaggle competition. On the other hand, if you would rather use Python for all statistical work, consider this free, online book by Thomas Haslwanter that contains tons of IPython notebooks with built-in code.

An excellent roadmap to get going with Data Science with Python is available with Elite Data Science. This includes pointers to tutorials on the four key Python libraries, namely Numpy, Pandas, Matplotlib and Scikit.

In the past, choosing between R and Python was difficult because both offered very similar features, but from 2017 onwards Python is the better choice because of its tighter integration with big data technologies like Spark and AI tools like TensorFlow.

Google COLABoratory

But why bother installing Anaconda, Python and lots of other useful software and stress your overloaded laptop when Google offers a free Ubuntu based Virtual Machine in its Colab platform? And that too with a GPU, not just a CPU! Using the familiar Jupyter notebook interface, this allows data scientists to work on the latest Python and many other tools at no cost! To get going on this remarkable platform follow this tutorial.

Visualisation

Visualisation is another key area of data science because a business wants data scientists to tell a good story with the data. My blog on visualisation is a nice starting point for all those who want to get started quickly with free tools like Google Charts. Tableau is an excellent and widely used visualisation tool that has a public version for free download that you can try out with the free tutorials available in my blog post. Other excellent tools are Google Fusion Tables that you can explore here or Chartbuilder that you can try out online here or learn more about in this tutorial. If you are working with R then you must learn ggplot2, an extremely powerful R package.

Since Python is now the darling of the data science community, this tutorial on Seaborn will be pretty useful.


Data Mining / Machine Learning

As we said earlier, traditional statistics is generally limited to descriptive and interpretive work but the world wants predictions, and predictive statistics is where we get into data mining and machine learning. Please understand that data mining and machine learning are complex subjects and you need to get a good grounding in the algorithms that are used for Classification, Clustering, Association Rules, Collaborative Filtering, Text Analytics and other complex tasks. A quick overview of all these techniques is available in this Overview of Data Mining Techniques.

It is difficult to take a short cut through this subject but if you have a basic idea of what all this means then you can try out the examples and exercises given in this book on R-DataMining. Rattle is an excellent add-on to R that gives you a GUI for Data Mining. You can download Rattle and then try it out with this short but descriptive tutorial and this data.
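To give a flavour of what a Classification algorithm actually does, here is a minimal sketch of one of the simplest, the k-nearest-neighbour classifier, using only the Python standard library. The training points, the labels and the function name are hypothetical, and a real project would use a library such as Scikit instead of this toy version.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    train is a list of (features, label) pairs, where features is a tuple
    of numbers with the same length as query.
    """
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: two well separated groups of customers
train = [
    ((1.0, 1.1), "low"), ((1.2, 0.9), "low"), ((0.8, 1.0), "low"),
    ((5.0, 5.2), "high"), ((5.1, 4.9), "high"), ((4.8, 5.0), "high"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (4.9, 5.1)))
```

The classifier learns nothing in advance. It simply looks up the most similar known cases and lets them vote, which is why it makes a good first algorithm to study.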

On the other hand, if you are more of a programmer and less of a statistician, you may prefer to use Python for your task. A good way to get started is to download and read "A Programmer's Guide to Data Mining" that not only introduces the subject but also gives loads of ready made Python code for you to try out.

If you are allergic to programming, whether in R or Python, you can check out H2O, a free GUI / browser based tool that you can download and install on your machine to carry out the full range of machine learning activities.

 A good tutorial on Machine Learning with R is available here.

Neural Networks
Many of the traditional data mining and machine learning tasks are now being performed using Artificial Neural Networks (ANN) and there is a school of thought that believes that ANNs will eventually replace virtually all other techniques. Whether this happens or not, this post gives a quick overview of the different kinds of ANN that are currently being used. A more detailed explanation of this technology is available through this set of 4 videos on YouTube from the remarkable 3Blue1Brown channel. One of the reasons for the popularity of artificial neural networks is that Google has developed and open-sourced TensorFlow. To understand how ANNs and TensorFlow work, you may "play" with it in your browser or download and install it on your machine.
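Before installing anything, it helps to see how small the basic building block is. The sketch below trains a single artificial neuron, a perceptron, to behave like a logical AND gate, in plain Python. This is not how TensorFlow is used in practice. It is only an illustration, and the function names, the learning rate and the number of epochs are assumptions chosen for this example.

```python
def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Train one neuron with two inputs using the perceptron learning rule."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            output = predict(weights, bias, x1, x2)
            error = target - output  # -1, 0 or +1
            # Nudge each weight in the direction that reduces the error
            weights[0] += learning_rate * error * x1
            weights[1] += learning_rate * error * x2
            bias += learning_rate * error
    return weights, bias


def predict(weights, bias, x1, x2):
    """Step activation: fire (1) only if the weighted sum is positive."""
    return 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0


# Truth table of the AND gate: ((input1, input2), expected output)
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(and_gate)
for (x1, x2), target in and_gate:
    print((x1, x2), "->", predict(weights, bias, x1, x2), "expected", target)
```

A real network stacks many such neurons in layers and uses smoother activation functions, but the idea of adjusting weights a little after every mistake is the same.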

Free Downloadable Books
Here is a list of 12 books that you can legally download and use in your quest for mastery over data science.

Putting it all together : The Big Picture
There is more to data science than being able to run statistical analysis with R. One needs to be able to "play" with data and seek out hidden patterns that apparently do not seem to exist. For example, look at this "Island of Games" data puzzle to see what we mean. Data Science is much more than just a bag of tools and techniques -- it is a way of doing things. But what do you actually do? Look through this beautiful handbook from the School of Data and try out the exercises given here. So you learn by doing, and this blog post lists a set of activities that you can do, and if you can do them well, you can consider yourself a data scientist. But in case you cannot do all that, just do a Kaggle project with this tutorial and you are on your way.

If you go through all this then you are almost there on your way to becoming a data scientist. But to handle Big Data you need to understand Map-Reduce and Hadoop.

Hadoop / Spark / Big Data

A good place to start is this post on how to Demystify Map Reduce and Hadoop with this DIY tutorial. However, an even quicker way to get started with Hadoop, plus Pig, Hive and the whole works, is to download the Hortonworks HDP sandbox and run it as a Virtual Machine in the Oracle VirtualBox environment. A comprehensive explanation of how the Hortonworks HDP sandbox can be used along with R and H2O to perform actual machine learning tasks is given in this post on Big Data for the Non-Geek. An even more interesting exercise is to see how the statistical power of R can be used with the distributed capability of Hadoop, as explained in this post. However, there are many other websites that you can learn from.

But Hadoop is on its way out and is being replaced by its close cousin, Spark. To understand why and to get started on Spark, check out my tutorial Spark, Python & Data Science. Two other tutorials will explain how to run a Spark + Python program on a single Amazon AWS EC2 instance and on an Amazon AWS EMR Cluster.
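The Map-Reduce idea itself is simple enough to try on a laptop without Hadoop. Here is a minimal sketch of the classic word count, split into map, shuffle and reduce phases, in plain Python. It runs in a single process and only imitates the flow of data. The function names and the sample documents are made up for this illustration and are not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into a total."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each string stands for a file on a different machine
documents = ["big data is big", "data science uses data"]
counts = word_count(documents)
print(counts)
```

On a real cluster the map calls run in parallel on the machines that hold the data, and the shuffle moves the pairs across the network so that each reducer sees all the values for its keys.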

Evergreen SQL

Last but not the least, relational databases and SQL are an absolute must for anyone who is interested in data science, but since SQL is so well known and so widely used in the technology community, I refrain from singing its praises or giving any pointers to how it can be learnt. However, if you think that you know all about SQL then please take a look at two books by Celko that talk about some pretty advanced SQL in general and how SQL can be used to handle trees and hierarchies in particular.
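As a small taste of what handling a hierarchy in SQL looks like, here is a sketch that stores a reporting structure as a simple parent-pointer table and walks it with a recursive common table expression, using the sqlite3 module that ships with Python. This is just one common approach and I am not claiming it is the one the books recommend. The table, the column names and the employee names are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE employee (name TEXT PRIMARY KEY, manager TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Asha", None), ("Bimal", "Asha"), ("Chitra", "Asha"), ("Deb", "Bimal")],
)


def reports_under(conn, boss):
    """Return (name, depth) for everyone below boss, direct or indirect."""
    query = """
        WITH RECURSIVE chain(name, depth) AS (
            SELECT name, 1 FROM employee WHERE manager = ?
            UNION ALL
            SELECT e.name, chain.depth + 1
            FROM employee AS e
            JOIN chain ON e.manager = chain.name
        )
        SELECT name, depth FROM chain ORDER BY depth, name
    """
    return conn.execute(query, (boss,)).fetchall()


for name, depth in reports_under(conn, "Asha"):
    print("  " * depth + name)
```

The first SELECT picks up the direct reports and the second keeps joining the table back to the growing result until no new rows appear.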

Code Repository

While it is always a pleasure to write code, it is always better to locate existing code, copy-paste and then improve on it. Why re-invent the wheel? Here are two fantastic code repositories:
  1. Practical AI by Goku Mohandas
  2. Papers with Code and Data

In the meantime, subscribe to the Twitter feed of KDnuggets to keep yourself abreast of what is happening in the world of Data Scientists -- whose job the wise men at Harvard have declared to be the Sexiest Job of the Twenty First Century.

This survey has addressed the tools and techniques that a data scientist uses in his daily job. What is missing here is an understanding of the business domain -- like Retail, Finance, Telecom -- where these tools need to be used to deliver business value. To learn all this you may consider joining the One Year program on Business Analytics at the Praxis Business School, Kolkata  where we teach all this and more. Sorry for that little advertisement but I need to earn a livelihood as well :-)

This post can be accessed through this short link http://bit.ly/dsdiy

I will be updating this post based on changes in my knowledge and perception of this field, so you might see some changes every now and then. Last update 28 Mar 2014, 04 Apr 2014, 31 May 2014, 19 Jun 2014, 19 Jul 2014, 28 Aug 2014, 12 Sep 2014, 9 Dec 2014, 30 Jan 2015, 30 Mar 2015, 8 Apr 2015, 18 Apr 2015, 9 May 2015, 22 May 2015, 22 Aug 2015, 9 Jun 2016, 26 Jun 2016, 13 Apr 2017, 16 Apr 2017, 16 Oct 2017, 29 Jan 2018, 24 Apr 2019, 14 Jul 2019