June 30, 2016

Build your website at the lowest cost

This blog post will show you how to create a fairly decent website at the lowest possible cost, without writing any code. Take a look. This post was originally written for iot-hub and the approach is currently being used at Yantrajaal as well. However, when Yantrajaal was created in 1999, none of these technologies existed and I had to take a more expensive route that you do not need today.

The first step to creating your, or your company's, digital identity is to build a website. Most people begin by purchasing web hosting services, either from a web hosting company or from a value-added reseller, and have them build the website. While this may be fine, a do-it-yourself approach will get you going at the minimum possible cost. This post will tell you how.

1. Purchase a domain name from a domain registrar like TierraNet or any other similar company. This will cost you around US$14/year. You can get an absolutely free domain name from Freenom, but these domains will end with .tk, .ml, .ga, .cf or .gq and not with the usual .com, .net, .org etc. Irrespective of where you purchase your domain, make sure that you have complete access to configure or modify the DNS records corresponding to your domain, preferably through a GUI interface.

2. For hosting your website you have two options. (a) Get a traditional web server from a hosting company like x10hosting, which could be free or carry a monthly charge. Make sure that you have access to the cPanel application to manage your website. (b) The other option is to use Google's Blogger platform. Unless you want to build a transactional website with PHP-MySQL (or equivalent) support, the Blogger platform is far easier to work with and is an excellent starting point. The Blogger option is strongly recommended.

This post assumes that you have chosen the Blogger option.

3. Create a blog by following the instructions given in this tutorial. For the name of the blog, choose the same character string as your domain. If your domain is xyz.com then your blog should be xyz.blogspot.com. This is not essential, but it is a nice-to-have feature.

4. A blog looks different from a traditional website because, unlike the latter, it has neither a fixed home page nor a set of navigational tabs across the top. To work around this, follow the instructions given in this post.

5. Now you need to link your domain xyz.com to your blog xyz.blogspot.com. For this you need to log in to your domain registrar account (created in step 1) and then navigate to the screen that allows you to manage the DNS. Some DNS records will need to be added or modified. To do so, go to this Google Support Site and follow the instructions there. Remember, you are modifying the top-level domain, that is www.xyz.com, and not a subdomain like foo.xyz.com, so choose the appropriate instructions. [New] The Google Support Site is good for a top-level domain like www.xyz.com. However, if you already have a site like www.xyz.com and you wish to create a subdomain like icecream.xyz.com, then you should follow the simpler instructions at smallbusiness.chron.com.

The DNS records that you add may conflict with existing DNS records that the registrar would have provided by default (these point visitors to a default, under-construction website). If in doubt, keep a screenshot of the earlier records and delete all of them. The new records should do the job. Anything to do with DNS servers takes some time to take effect. So after completing this step, go away and do something else for three or four hours (though Google claims that it can take 24 hours) and then see if you can load your website at www.xyz.com. If everything is OK, you should see your blog.

June 25, 2016

Spark, Python & Data Science -- Tutorial

Hadoop is history and Spark is the new kid on the block, the darling of the Big Data community. Hadoop was unique. It was a pioneer that showed how "easy" it was to replace large, expensive server hardware with a collection, or cluster, of cheap, low-end machines and crunch through gigabytes of data using a new programming style called Map-Reduce that I have explained elsewhere. But "easy" is a relative term. Installing Hadoop or writing the Java code for even simple Map-Reduce tasks was not for the faint-hearted. So we had Hive and Pig to simplify matters. Then came tools like H2O and distributions like Hortonworks to make life even simpler for non-geeks who wanted to focus purely on the data science piece without having to bother about technology. But as I said, with the arrival of Spark, all that is now history!

Spark was developed at the University of California at Berkeley and appeared on the horizon for data scientists in 2013 at an O'Reilly conference. You can see the presentations made there, but the following one will give you a quick overview of what this technology is all about.

But the three real reasons why Spark has become my current heart-throb are these:
  1. It is ridiculously simple to install. Compared to the weeks that it took me to understand, figure out and install Hadoop, this was over in a few minutes. Download a zip, unzip it, define some paths and you are up and running.
  2. Spark is super smart with memory management and so unlike Hadoop, starting Spark on your humble laptop will not kill it. You can keep working with other applications even when Spark is running -- unless of course you are actually crunching through 5 million rows of data. Which is what I actually did, on my laptop.
  3. And this is the killer. Coding is so simple. The 50 lines of Java code -- all that public static void main() crap -- needed in Hadoop reduce to two or three lines of Scala or Python code. Seriously, not joking.
And unlike the Mahout machine learning library of Hadoop, which everyone talked about but no one could really use, the Spark machine learning library, though based on Mahout code, is something that you can have running by the end of this tutorial itself. So enough of chit-chat, let us get down to action and see how easy it is to get going with Spark and Python.

In this post, we will show how to install and configure Spark, run the famous WordCount program so beloved of the Hadoop community, run a few machine learning programs and finally work our way through a complete data science exercise involving descriptive statistics, logistic regression, decision trees and even SQL -- the whole works.

Though, in principle, Spark should work on Windows, the reality is that it is not worth the trouble. Don't even try it. Spark is based on Hadoop and Hadoop has never been very comfortable with Windows. If you have access to a Linux machine, either as a full machine to yourself or one that dual-boots Windows and Linux, then you may skip section [A] on creating virtual machines and go directly to section [B] on installing Spark.

Also please understand that you need a basic familiarity with the Linux platform. If you have no clue at all about what "sudo apt-get ..." is, or have never used the "vi" or an equivalent text editor, then it may be a good idea to have someone with you who knows these things during the install phase. Please do understand that this is not like downloading an .exe file in Windows and double-clicking on it to install software. But even if you have only a rudimentary understanding of Linux and can follow instructions, you should be up and running.

A] Creating a Virtual Machine running Ubuntu on Windows

If your machine has only Windows -- as is the case with most Windows 8 and even Windows 10 users -- then you will have to create a Linux virtual machine and carry out the rest of the exercise on the VM. This exercise was comfortably carried out on an 8GB RAM laptop, but even 6GB should suffice.

  1. Download Oracle VirtualBox [ including Extension pack ] software for Windows and install it on your Windows machine.
  2. Download an Ubuntu image for VirtualBox. Make sure that you get the image for VirtualBox and not the VMware version! This is a big download, nearly 1GB, and may take some time. What you get is a zip file that you can unzip to obtain a .vdi file, a virtual disk image. Note the userid and password of the admin user that will be present in the VM [ usually the userid is osboxes and the password is osboxes.org, but this may be different ].
  3. Start the VirtualBox software and create a new virtual machine using the .vdi file that you have just downloaded and unzipped. You can give the machine any name but it must be defined as Linux, Ubuntu.
    1. If you are not sure how to create a virtual machine, follow these instructions. Remember to allocate at least 6GB RAM to the virtual machine.
    2. If your machine is 64-bit but VirtualBox is only showing 32-bit options, then it means that virtualization has been disabled on your machine. Do not panic; simply follow the instructions given here. If you don't know how to boot your machine into the BIOS then see boot-keys.org.
    3. Once your Ubuntu virtual machine starts, you will find that it runs in a small window and is quite inconvenient to use. To make the VM occupy the full screen you need to install Guest Additions for VirtualBox by following the instructions given here [ sudo apt install virtualbox-guest-additions-iso ] followed by loading the CD image as explained here.
    4. In the setup options of the VM you can define shared folders between the Windows host OS and the Ubuntu guest OS. However, the shared folder will be visible but not accessible to the Ubuntu userid until you do this.
    5. Steps 3 and 4 are not strictly necessary for Spark, but if you skip them you may find it difficult or uncomfortable to work inside a very cramped window.
  4. Strangely enough, the VM image does not come with Java, which is essential for Spark. So please install Java by following these instructions.
Ubuntu is so cool! Who wants Windows?

B] Install Spark

Once we have an Ubuntu machine, whether real or virtual, we can now focus on getting Python and Spark.
  1. Python - The Ubuntu 16.04 virtual machine comes with Python 2.7 already installed, which is adequate if you want to use Spark at the command line. However, if you want to use iPython notebooks [ and our subsequent tutorial needs notebooks ] it is better to install them.
    1. There are many ways to install iPython notebooks, but the easiest way is to download and install Anaconda.
      1. Note that this needs to be downloaded inside the Ubuntu guest OS and not the Windows host OS if you are using a VM.
      2. When the install script asks if Anaconda should be placed in the system path, please say YES.
    2. Start python and ipython from the Ubuntu prompt and you should see that Anaconda's version of python is being loaded.
  2. Spark - The instructions given here have been derived from this page, but there are some significant deviations to accommodate the current version of the iPython notebook.
    1. Download the latest version of Spark from here.
      1. In the package type, DO NOT CHOOSE source code, as otherwise you will have to compile it. Choose instead the package with the latest pre-built Hadoop.
      2. Choose direct download, not a mirror.
    2. Unzip the tgz file, move the resultant directory to a convenient location and give it a simple name. In our case it was /home/osboxes/spark16
    3. Add the following lines to the end of your .profile file:
      1. export SPARK_HOME=/home/osboxes/spark16
      2. export PATH=$SPARK_HOME/bin:$PATH
      3. export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
      4. export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
        1. to get the correct version of the py4j-n.n-src.zip file, look in $SPARK_HOME/python/lib for the actual file name
        2. the last two paths are required because in many cases the py4j library is not found
    4. For the changes to take effect, restart the VM, or source your .profile file using the following command: source ~/.profile
    5. To start Spark in command-line mode, enter the "pyspark" command and you should see the familiar Spark screen. To quit, enter exit().
    6. To start Spark in the iPython notebook format, enter the command IPYTHON_OPTS="notebook" pyspark. Please note that the older strategy of using profiles for starting the iPython notebook may not work, as the current version of Jupyter does not support profiles anymore, hence this approach. [ Skip this part and go to the next step ] This will start the server and make it available at port 8888 on localhost. To quit, press ctrl-c twice in quick succession.

    7. An alternative way of starting the notebook, not involving the IPYTHON_OPTS variable, is shown here. This is easier:
      1. Start notebook with $ipython notebook ( or alternatively, $jupyter notebook)
      2. Execute these two lines from the first cell of the notebook
        1. from pyspark import  SparkContext
        2. sc = SparkContext( 'local', 'pyspark')
  3. Now that we have Spark running on our Ubuntu machine, check out its status at http://localhost:4040

C] Running Simple programs

If you are not familiar with Python, do go through some of the first few exercises of Learning Python the Hard Way, and if the concept of a notebook is alien to you then go through this tutorial.

Go to this page and scroll down to the section "Interacting with Spark" and follow the instructions there to run the WordCount application. This needs a text file as input and any text file will do. If you cannot find one, create a file with vi or gedit, write a few sentences in it and use that. Enter each of these lines as a command at the pyspark prompt:

text = sc.textFile("datafile.txt")
print text

from operator import add

def tokenize(text):
    return text.split()

words = text.flatMap(tokenize)       # one record per word
print words

wc = words.map(lambda x: (x, 1))     # pair each word with a count of 1
print wc.toDebugString()

counts = wc.reduceByKey(add)         # sum the counts for each word
counts.saveAsTextFile("output-dir")  # write the results, Hadoop style

The final output, in Hadoop style, will be stored in a directory called "output-dir". Remember that Hadoop, and hence Spark, does not allow the same output directory to be reused.

The same commands can also be entered one by one in the iPython notebook and you will get the same result.
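If you want to see what each of these Spark steps actually produces without a cluster, the same flatMap - map - reduceByKey pipeline can be mimicked in plain Python; the sample sentences below are made up purely for illustration:

```python
from collections import defaultdict

# two "lines" of input, standing in for the RDD created by sc.textFile()
text_lines = ["the quick brown fox", "the lazy dog"]

# flatMap(tokenize): split every line into words and flatten the result
words = [w for line in text_lines for w in line.split()]

# map(lambda x: (x, 1)): pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey(add): sum the counts for each distinct word
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

The only thing Spark adds to this picture is that each step runs in parallel across the cluster, with the shuffle happening at the reduceByKey stage.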

This establishes that you have Spark and Python working smoothly on your machine. Now for some real data science.

Update 29 June 2017

While it is not too difficult to set up your machine for Spark, there is a far easier way to do so, pointed out to me by my student Sanket Maheshwari. Databricks, an online platform built by the folks who actually created Spark, allows you to run Spark + Python (plus R and Scala as well) programs through a Jupyter-like notebook environment on an actual AWS cluster without any fuss. To see how it works, create a free account in the community edition and you are ready to go with a single-machine cluster.

D] Data Science with Spark

[New 24Jul16] Unlike Hadoop / Mahout, the machine learning library of Spark is quite easy to use. There are tons of samples, including machine learning samples, available. These samples, along with the sample data, are also available in the Spark home directory that gets created during the installation of Spark as described above. You can run these programs using the spark-submit command as explained on this page, after making small changes to bring them into the format described there. The basic template for converting these samples to run with spark-submit, and two sample programs for clustering and logistic regression, are available for download here.

To understand the nuances of the MLLIB library read the documentation, then, for example, follow the one on k-means. For more details of the API and the k-means models follow the links.
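For readers who want to see what is going on under the hood, here is a minimal pure-Python sketch of Lloyd's k-means iteration, the idea that MLlib's KMeans implements at scale. The data points below are made up for illustration:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain-Python sketch of Lloyd's algorithm: assign, then re-centre, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
    return centroids

# two well-separated blobs; expect one centroid near each
points = [(0.0, 0.1), (0.2, 0.0), (9.8, 10.0), (10.1, 9.9)]
centroids = sorted(kmeans(points, 2))
print(centroids)
```

MLlib does exactly this assign-and-update loop, except that the assignment step is a distributed map over the RDD of points and the update step is a reduce.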

Jose A Dianes, a mentor at codementor, has created a very comprehensive tutorial on data science and his ipython notebooks are available for download at github. This uses actual data from a KDD cup competition and will lead the user through

  • Basics of RDD datasets
  • Exploratory Data Analysis with Descriptive Statistics
  • Logistic Regression
  • Classification with Decision Trees
  • Usage of SQL
After going through this tutorial, one will have a good idea of how Spark and Python can be used to address a full-cycle data science problem, right from data gathering to building models.

Update 13 April 2017 : Version Issues

If you were to download, install and configure Spark with Python today following the steps given in this tutorial, you would face problems because Spark 2.1 [ the current latest version ] is totally incompatible with Python 3.6+ [ the current latest version ]. However, do not panic, all is not lost! Stack Overflow is always there to solve problems!

First, the instructions that we have given are all valid; please follow them. However, after installing Python 3.6 you need to create a Python 3.5 environment with Conda as explained here. More details about Conda environments are given here.

Before you start the Jupyter Notebook you need to issue the following commands:
conda create -n py35 python=3.5 anaconda
source activate py35

This will create an environment where everything will work like a charm. Once you are through, remember to issue the following command to go back to the normal environment:

source deactivate

The other challenge is that Python 3 is a "little" different from Python 2 and some changes are necessary to the programs. One easy change is that all print statements have to be written as functions: print(.....), but there are other non-trivial changes as well. In the examples given here, the WordCount and K-Means programs will run without major changes, but the Logistic Regression program will need to be changed as explained in this Stack Overflow post.
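A few of the Python 2 to Python 3 changes that matter for the programs in this post can be seen in one short Python 3 snippet:

```python
# print is now a function, not a statement
print("hello, spark")

# "/" is true division in Python 3; "//" gives the old floor behaviour
assert 5 / 2 == 2.5
assert 5 // 2 == 2

# map() and filter() now return lazy iterators, not lists --
# wrap them in list() if you need the whole result at once
squares = list(map(lambda x: x * x, range(4)))
print(squares)  # [0, 1, 4, 9]
```

The map/filter change is the one most likely to trip up old MLlib example code, since Python 2 programs often pass the result of map() straight into something that expects a list.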

Update : 13 Apr 2017 : This post explains how Spark and Python work on a local machine. My subsequent blog posts explain how Spark+Python+Jupyter works on (a) a single EC2 instance on Amazon AWS and (b) an EMR cluster on Amazon AWS.

Spark is a part of the curriculum in the Business Analytics program at Praxis Business School, Calcutta. At the request of our students I have created an Oracle Virtual Appliance that you can download [ 4GB though ], import into your VirtualBox and go directly to section [D]! No need for any installation and configuration of Ubuntu, Java, Anaconda or Spark, or even creating the demo MLLIB code. This VM has been configured with 4GB RAM, which just about suffices. Increase this to 6GB if feasible. -- Updates : [28Aug16] - New Virtual Box (password = "osboxes.org")

Update : 22 Apr 2018 : With the arrival of Google Colaboratory, the need to install software has disappeared completely. Now you can have a virtual machine with a GPU and start working on data science right away. To see how you can use Spark for the WordCount program, see these notebooks. We show how to upload data files from a local machine OR read them directly from Google Drive.

I was invited to Cypher2016, where I delivered a lecture and demonstration on Python, Spark and Machine Learning, and on running all this on AWS.

June 23, 2016

Tales from IIT Kharagpur

The IIT KGP Pre 1993 group on Facebook was created as a reaction and rebuff to the official IIT KGP Alumnus group that had become a hotbed of political and religious bigotry. The Pre 1993 crowd is believed to be more of the "kool kgp type" and even if some of them do have strong views on political and religious matters, such things are kept aside in this group. Posts in this group are more in the form of happy reminiscences that the old men like to tell, listen to and enjoy.

Since it is difficult to search through Facebook posts, I have created these pointers to the stories that I contributed to this column. But to respect the privacy of the closed group where these stories were posted, I have made sure that only members can read the whole story -- and the incredible comments that other members have made.

6th July 2015
Sometime in the recent past, Surath Chatterjee, like many of us, reached his 50th year and some of his friends had thrown a party for him at Bali. Unfortunately I could not join him at Bali but my thoughts went back to another birthday party that we in Azad C-Top-West had thrown for him. This might not have been at Bali, but I suppose it was no less spectacular! [ continued here! ]

21 June 2016
This was in our third or fourth year, 1982 or 1983, and even the names of the co-conspirators are fading. It was the night of the Vishwakarma Puja, which heralds the start of the festive season in Bengal. Some of us from Azad, C-Top West had visited the Salua / Prembazar end of the campus for a shot of Mahua, and on the way back we decided to stop by the various Vishwakarma puja pandals that had sprouted around the campus. The farthest was the one at the Workshop and we decided to go there .. and on the way, of course, was the dark and ghostly structure of the Old Building. We had heard of the ghosts of the freedom fighters who were martyred there, but we did not really care about that. What we were really worried about was the security staff, because we had decided -- on the spur of the moment and under the influence of Madam M -- to climb to the top of the Old Tower! [ continued here!]

June 15, 2016

Society and the Second Law of Thermodynamics

When I think back about it, the New Delhi that I had visited as a child in the late 1960s was so much cleaner, nicer and better than the city I occasionally go to today. As a tourist in Europe in the late 1980s I did not have to think about the terror, and counter-terror, that is now being unleashed in Brussels and Paris. The Kolkata that I live in today is an urban disaster compared to the Calcutta that I went to school in. It is difficult to deny that, net-net, there has been a decay and degradation in the quality of urban life over the past 50 years.

Is it only in urban life? And is it only for the past 50 years? The dry statistics captured by economists in the Human Development Index -- a vector consisting of life expectancy, education and per-capita income -- and talked about by governments in power would claim that the world is becoming a better place. But the common man in the bazaars of India, and perhaps the world as well, would beg to differ. With his native intelligence, or rather his intuition, he would tend to believe that the past was better, cleaner and more peaceful than the murderous mayhem that he finds himself in at the moment. Is this intuition correct? Or should we believe the economic indicators that claim that we are better off?

Globally, and socially, are things changing for the better or for the worse?

To seek an apolitical answer to this question, let us travel back in time to the early years of the nineteenth century, when steam power was driving the industrial revolution and propelling Europe towards its pinnacle of economic, political and social glory. Everyone was trying to understand the mechanics of the steam engine and improve its efficiency, and it was the Frenchman Sadi Carnot (1796 - 1832) who came up with the body of knowledge that is known as thermodynamics today. His ideas were far from perfect, but thanks to subsequent interpretations by Lord Kelvin, Rudolf Clausius and Ludwig Boltzmann, we have today the Second Law of Thermodynamics, which states that in any physical process the entropy of the universe always increases. This technical statement is generally explained as meaning that the physical world can only move from a state of order to disorder, or from a state of lesser chaos to higher chaos! Though there is a counter-perspective that refuses to view entropy as disorder but only as a measure of energy dispersal, the dominant, popular narrative continues with the analogy of chaos and disorder.

The Second Law of Thermodynamics is one of the key pillars of modern science, on par with Newton’s Laws of Mechanics, Maxwell’s Laws of Electromagnetism, Einstein’s Theory of Relativity and the Quantum Mechanics of Schrodinger and Heisenberg. Its veracity is beyond challenge. However, one must note that the Second Law refers to the entropy of the universe as a whole. The universe consists of a system and its environment. The system could be something as simple as an empty box or as complex as a spaceship, the planet Earth or even a galaxy. The environment is everything outside the box, or the spaceship, or Earth, or the galaxy. Hence it is quite possible that the entropy, or disorder, in a system can decrease, but the corresponding increase in the entropy of the environment would be such that the sum of the decrease and the increase results in an overall increase of the entropy of the universe. There is no known deviation from, or violation of, this Second Law in the domain of the physical sciences. The arrow of time itself is defined by the change towards greater disorder.

Is it possible for a law that is valid in the physical sciences to be applicable in the social sciences?

Can we say that a society at rest continues to be at rest until acted upon by an external idea? Can we say that the rate of change in a society is proportional to the strength of an external idea and is inversely proportional to the size of the society? Can we say that most, if not all, social actions are accompanied by equal and opposite reactions? Even if we answer these questions with a qualified yes, then we are transferring Newton’s Laws from the physical world to the world of social sciences. Extending this logic to the Second Law means that society will inevitably go from order to disorder, from lesser chaos to greater chaos!

That the world is getting progressively “worse” was a concept that sent a shock through European society in the nineteenth and early twentieth centuries, when the consequences of the Second Law seeped into public consciousness. People argued that the flowering of European civilisation, with its beautiful cities and stable social structures, did not quite agree with the notion of increasing chaos. But then people did not realise that this decrease in “chaos” in Europe was accompanied by a far greater increase in “chaos” being unleashed in colonies across Asia, Africa and South America! The net chaos of the universe was indeed increasing. Similarly, in post-independence India, we may claim that the order enforced by the Constitution has reduced the anarchy prevalent in the era of the Mughals and the Marathas, but this claim is undermined when we see the growth of Maoist, Islamist and separatist insurgency and the continuing politics of caste that has been unleashed.

The collective wisdom of the Hindu psyche represents this inevitable decay in the context of a Mahayuga that lasts for 4.32 million years and consists of four Yugas. Chaos, disorder or “evil” increases monotonically as society traverses through Satya, Treta and Dwapar, and eventually reaches Kali Yuga, where it ends in a blaze of violence. Though “modern” humans appeared 200,000 years ago and “civilisation” 6000 years ago, the figure of 4.32 million is not incompatible with the first evidence of humanoid behaviour that emerged 6 million years ago, when our ancestors started walking erect, leaving their hands free to wield the tools and weapons that changed the world.

A modern perspective would be to look at social entropy, defined as a measure of dissatisfaction within a social, political and economic system -- and we can see at once that there is no doubt that this is increasing all around us at an alarming rate. Whether it is the killing fields of the Islamic State in Iraq and Syria, the Arab Spring in North Africa, the rising intolerance in North America, the siege mentality in Europe or, closer home, the “syndicate”-raj in Rajarhat or the anarchy in JNU, discontent is huge and growing rapidly. Neither any amount of art, literature or other finer elements of culture, nor any spectacular technology, like SpaceX or solar cells, seems to have any calming effect. This is where we are tempted to take one of the grandest ideas of science and apply it to complex social phenomena! Whether we should do so could be the subject of research, but if we could, what are the consequences?

Accepting the inevitability of increasing disorder would mean that the world will become an increasingly uncomfortable place. Elections may come and go but the average misery of the common man can never reduce. Occasionally we may encounter some superior technology that, for example, may wipe out a particular disease and reduce misery in a limited context, but in the long term it will lead to situations where things are even worse. Occasionally there can emerge a charismatic leader, a messiah, who will lead his followers towards something positive and triumphant, but then again, with the passage of time, the world will regress to something worse than it was before. The end of colonial expansionism led to the rise of fascism and the World Wars. The end of fascism led to the rise of communism. The end of communism, and the purported End of History, led to the rise of Islamic terror. What is the next abomination?

The Second Law could be a powerful discouragement from trying to do anything constructive. Since you really cannot change the world for the better, why bother? Nevertheless, while global entropy, or disorder, will always increase, it is possible to reduce it in a small, local environment -- Singapore is an example and Elon Musk’s proposed colony on Mars could be another. Nearer home, acting locally to impose order and governance could be the only viable, but selfish, strategy. So while America, Europe, Iraq, Kashmir, GST and secularism are undoubtedly important, what really matters is whether the garbage is being cleared and the neighbourhood lumpens are behind bars. Panchayat is more important than Parliament!

For the rest of the world? And in the long term? We may have to live with the dismal consequences of the Second Law -- all the king’s horses and all the king’s men [cannot] put Humpty Dumpty together again.

This article originally appeared in Swarajya, the magazine that reads India right.