June 29, 2015

From Hadoop Streaming to RHadoop

The challenge of combining the statistical power of R and the "Big Data" capabilities of Hadoop is something that has always fascinated me. Over a year ago, I had finally broken free from the stupidity of the WordCount ( and various other counting ) programs and tried to solve a real-life retail problem with linear regression using R and Hadoop. This is documented in my blog post Forecasting Retail Sales -- Linear Regression with R and Hadoop. In this case, however, I had used the Hadoop streaming API to call two separate R programs.

Subsequently I had come across the Hortonworks HDP platform that dramatically simplified the process of installing and running Hadoop. This is explained in my blog post Big Data for the Non Geek, where in addition to installing Hadoop, I have also explained how to overcome the challenges of installing the RHadoop packages on top of Hadoop on the Hortonworks platform.

Hortonworks has a nice example of how to run an rHadoop program on the HDP platform but I was very keen to see how to port my Retail Sales program from the traditional streaming mode to the rHadoop mode. This means replacing the two R programs, LinReg-map.R and LinReg-red.R, with one R program, LinReg-MR.R, and running it not from the Linux command prompt but from RStudio itself.

This process is explained in this post.

First I had to check whether my original LinReg-map.R and LinReg-red.R would work on the Hortonworks HDP platform. Fortunately, they did, but a small change was required in the command line -- note the two -file options attached right at the very end

# -- used in the Hortonworks HDP .. two -file commands required
hdfs dfs -rm -r /user/ru1/retail/out900
# ---
hadoop jar /usr/hdp/ -D mapred.job.name='RetailR' -mapper /home/ru1/retail/LinReg-map.R -reducer /home/ru1/retail/LinReg-red.R -input /user/ru1/retail/in-txt -output /user/ru1/retail/out900 -file LinReg-map.R -file LinReg-red.R 

Next thing that we had to do was to convert the five tab separated TXT files used as input into five corresponding comma separated CSV files. I am sure tab separated TXT files can also be used but for the time being it was easier to convert the data files to CSV than to explore the TXT option.
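The conversion itself can be done in R. The following is only a minimal sketch: the file names are hypothetical, a temporary directory stands in for the real data folder, and the three sample rows are made up.

```r
# Hypothetical file names; a temporary directory stands in for the real data folder
in_file  <- file.path(tempdir(), "sales-sample.txt")
out_file <- file.path(tempdir(), "sales-sample.csv")

# Create a tiny tab separated sample with the columns date, sku, sale
writeLines(c("1\tA\t10", "2\tA\t12", "1\tB\t5"), in_file)

# Read the tab separated file and write it back as comma separated
sales <- read.table(in_file, sep = "\t",
                    col.names = c("date", "sku", "sale"),
                    stringsAsFactors = FALSE)
write.table(sales, out_file, sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

csv_lines <- readLines(out_file)
print(csv_lines)
```

The header row and quotes are suppressed so that each output line carries only the three data fields.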

Finally, the two R programs were merged into one R program and this is available in the rHadoop directory of the original Retail github repository.

Significant changes are as follows --
In the MAP part

-- the original Map program

DailySales <- read.table("stdin",col.names=c("date","sku","sale"))
for(i in 1:nrow(DailySales)){
  # key is the SKU, value is "date$sale" with all blanks removed
  Key <- as.character(DailySales[i,]$sku)
  Val <- paste(as.character(DailySales[i,]$date),"$",as.character(DailySales[i,]$sale))
  cat(Key,gsub(" ","",Val),"\n")
}

-- the modified Map subroutine

mapper1 = function(null,line) {
    ckey = line[[2]]
    cval = paste(line[[1]],line[[3]],sep = "$")
    # emit the key-value pair with keyval from rmr2 instead of cat
    keyval(ckey,cval)
}

The Reduce part is greatly simplified

-- the original Reduce program

mapOut <- read.table("stdin",col.names=c("mapkey","mapval"))
CurrSKU <- as.character(mapOut[1,]$mapkey)
CurrVal <- ""
days <- ""
sale <- ""
# excerpt only: FIRSTROW is not defined in the lines shown here
for(i in 1:nrow(mapOut)){
  SKU <- as.character(mapOut[i,]$mapkey)
  Val <- as.character(mapOut[i,]$mapval)
  DataVal <- unlist(strsplit(Val,"\\$"))
  if (identical(SKU,CurrSKU)){
    # same SKU as the previous row: keep accumulating its days and sales
    CurrVal = paste(CurrVal, Val)
    if (FIRSTROW)  {
      days <- DataVal[1]
      sale <- DataVal[2]
    } else {
      days = paste(days,DataVal[1])
      sale = paste(sale,DataVal[2])
    }
  } else {
    # a new SKU has started: reset the accumulators
    CurrSKU <- SKU
    CurrVal <- Val
    days <- DataVal[1]
    sale <- DataVal[2]
  }
}

-- the modified Reduce subroutine

reducer1 = function(key,val.list) {
  days <- ""
  sale <- ""
  # val.list already holds every value for this one key
  for(line in val.list) {
    DataVal <- unlist(strsplit(line, split="\\$"))
    days <- paste(days,DataVal[[1]])
    sale <- paste(sale,DataVal[[2]])
  }

  retVal <- EstValue(key,days,sale,9)
}

The "key" difference is that instead of using the cat command to emit the key-value pair, we use the keyval function of the rmr2 package to move data from the mapper to the reducer. Also, the reducer gets all the values for one key, so no sequential processing is required to isolate the values associated with one key.
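The idea that the reducer is handed all the values for one key can be tried out in plain R, without Hadoop or rmr2. This is only an illustrative sketch: the sales figures are made up, split() stands in for the grouping that the framework performs, and toy_reducer is a hypothetical stand-in for the real reducer and EstValue.

```r
# Toy daily-sales records: date, sku, sale (made-up values)
DailySales <- data.frame(
  date = c(1, 2, 3, 1, 2, 3),
  sku  = c("A", "A", "A", "B", "B", "B"),
  sale = c(10, 12, 14, 5, 7, 9),
  stringsAsFactors = FALSE
)

# Map step: key = sku, value = "date$sale"
map_keys <- DailySales$sku
map_vals <- paste(DailySales$date, DailySales$sale, sep = "$")

# Shuffle step: group all values by key (the framework does this for us)
grouped <- split(map_vals, map_keys)

# Reduce step: called once per key with that key's complete list of values
toy_reducer <- function(key, val.list) {
  parts <- strsplit(val.list, split = "\\$")
  days <- as.numeric(sapply(parts, `[`, 1))
  sale <- as.numeric(sapply(parts, `[`, 2))
  fit <- lm(sale ~ days)
  unname(coef(fit)["days"])   # slope of sale against day
}

slopes <- sapply(names(grouped), function(k) toy_reducer(k, grouped[[k]]))
print(slopes)
```

Because each call sees only one key, there is no need to watch for the point where the key changes, which is what the streaming version of the reducer had to do.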

The actual Linear Regression is done by the EstValue function and ideally it should have had no changes at all. However there was ONE change that was required and that is shown here.

#PastSale = Reduce("+",sale)
PastSale = 0
for (j in 2:length(sale))PastSale = PastSale + sale[j]
#PastSale= sale[2]

The total of Past Sales, though not required for the Regression, was being calculated with R's Reduce function, but somehow this would never work in rHadoop. It had to be replaced with a manual loop, and that too starting from 2! However the answer in both cases -- streaming and rHadoop -- is the same.
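One possible explanation, offered only as a guess since EstValue is not reproduced here, is that the modified reducer starts from an empty string and pastes every value on to it, which leaves a leading space. If the string is later split on spaces (an assumption), the first element is empty and turns into NA, which would spoil a sum over all elements but not a loop that starts from 2. A small sketch with made-up values:

```r
# Values accumulated the way reducer1 does it: start from "" and paste
sale <- ""
for (v in c("10", "20", "30")) sale <- paste(sale, v)
# sale is now " 10 20 30" (note the leading space)

# Hypothetical parsing step: EstValue is not shown, so this split is a guess
sale_num <- as.numeric(unlist(strsplit(sale, " ")))
# the leading space yields an empty first element, which becomes NA

total_reduce <- Reduce("+", sale_num)   # the NA in position 1 propagates
total_loop <- 0
for (j in 2:length(sale_num)) total_loop <- total_loop + sale_num[j]

print(c(total_reduce, total_loop))
```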

To be honest, rHadoop is actually no different from the Hadoop streaming process. In fact it is one and the same, but there are some small changes that one needs to make, and the benefit is that one can work from the familiar confines of RStudio, as you can see from this screenshot.

Update :  A very comprehensive tutorial on RHadoop is available here. Here is a guide on how to run RHadoop on Amazon AWS EMR

May 19, 2015

Technology, Management & Systems : The Holy Trinity ?

India has a strange fascination for engineers and MBAs. Everybody wants to become one, or preferably both. So is the case with systems, or as they say in India, the IT sector. But this fascination is not because of any natural aptitude for these disciplines but simply because they help one to get a job in an otherwise dismal economy. This is unfortunate, because if we step back for a moment and think through  issues that haunt this country, it would seem that our salvation lies in leveraging this holy trinity to dig us out of the hole that we find ourselves in!

Let us consider a few exemplary scenarios.

Till the 1990s, telephones in India were a disgrace. While landline technology was readily available and widely used all across the developed world, we were still at the mercy of the corrupt and inefficient P&T department that ensured that very few of us had access to one. This changed dramatically with the arrival of cell-phones that  bypassed the constraints of the local loop and managed to put a phone in every hand. But just as we were about to take off, the evil empire struck back with the 2G scam that again put us back by several decades in 3G and 4G services.
If we go back further, to the 1960s, we would see that India was facing a massive shortage of food and waiting for PL480 handouts from the United States to fight famines. Then again it was a burst of technology -- food technology, in the form of better fertilizers and high yielding crops -- that saved the day. We managed to stave off starvation but once again the theft in the public distribution system and the mismanagement of the supply chain -- as visible in images of grains rotting in godowns --   brought back the spectre of Kalahandi and haunts us even today.

In fact, India lost the plot much earlier, in the middle ages when we failed to board the Renaissance bus. While the Mughals were celebrating the high noon of their culture with the Taj Mahal and Urdu shayaries, Europe was passing us by with the structured and rational thinking of Galileo, Rousseau and Newton. Administrative systems like the modern nation state, with its civil and military services, economic systems like joint stock companies, banks, insurance agencies and educational systems built around universities that granted formal degrees never really got off the ground in India in the same way that they did in Europe. Not until the British brought them here.

Systems, in this context, are a metaphor for the rational approach to address social and commercial problems -- free of divine directives or bureaucratic whimsy. Such systems result in the development of superior technology and in the efficient usage of natural and social resources. This in turn reduces the degree of social conflict, increases the physical living standards and empowers societies with the luxury of confidence. Such systematic societies can then confront, conquer and convert societies that are irrational, unstructured or unsystematic -- as we have seen in the case of Europeans conquering large parts of America, Asia and all of Australia.

Computer “systems”, that appeared much later, draw their inspiration, and of course the moniker systems, from the same structured, rational,  or systematic approach that they bring to bear on solving problems and are a specific example of a more general approach. Can such systems free the future of India from the confines of its past and its present?

A little introspection will show us that we are inherently irrational in our thoughts and chaotic, if not anarchic, in our deeds. In our irrationality, we could have still lived with our little obsessions with gods, “godmen” and superstitions but the real problem is when the irrationality seeps into our secular and political structures. We scream and protest against corruption but are the first when it comes to boasting about how we could bribe our way through the system and -- this is far worse -- a majority of us would not hesitate to seek bribes and favours whenever we are in a position to do so. This is particularly true for anyone with any kind of discretionary power within the government -- whether it is a peon, who will not let you meet the officer, or the officer himself who can sign a paper and make your life a little easier.

Corruption is a characteristic that is woven into the warp and woof of the Indian administration and yet we have the unedifying spectacle of hundreds and thousands of irrational souls sitting on dharna at Jantar Mantar and asking for a Lokpal -- as if one man can do what a whole zoo full of institutions like the CAG, CVC, CBI could not achieve! We refuse to believe, to quote a Bengali maxim, that the evil spirit is in the very mustard that is being used for the exorcism!

Our irrationality extends to our naive belief in the democratic process where we cannot see through the chicanery of the promises made by the candidates. We believe that it is correct to vote for the person who promises illegal and untenable benefits for me and my caste. This natural irrationality is extended and reinforced by our love for anarchy. We believe that burning buses and other public property, calling bandhs and blockading roads is a natural and justified way in which our “leaders” or elected representatives can help us towards a better life. This irrationality is not the exclusive preserve of the uneducated and we see even people with college education passionately espousing economic theories that have been discarded in the dustbins of history. Robust, rational pragmatism is hard to come by in this country.

We could go on with similar anecdotes but the point is that as a nation we are incapable of governing ourselves. In state after state and in every Lok Sabha election we have voted for, elected and given to ourselves the terrible governments that we, as an irrational and anarchic people, justly deserve. But does this mean that it is time for the British Parliament to repeal the Indian Independence Act of 1947? Fortunately, there could be a better solution  based on the trinity of technology, management and systems.

Expecting government officials to shun corruption or the electorate to vote rationally is like expecting Kolkata to have weather like Darjeeling's. There may be a few exceptions here and there but by and large, very, very unlikely. We need to work with these two chips on our shoulders. Bending, breaking and abusing the system is a leitmotif of India. Individual Indians may be sane and rational but in a mass and as a collective, they will never be. This is the foundational principle on which the governance of India needs to be based.

Since people are at the heart of the problem we need to minimise their role in discretionary decisions and, to the extent possible, from the delivery process. Cell phones succeeded where land line phones failed because they did not need an army of corrupt, anarchic people to maintain the thousands of lines running across the country. The towers are unmanned and the central switches need a few competent people. This is a perfect example of a technology trumping the accumulated hubris of centuries and is the model that we must try to emulate in other areas as well.

This though, is easier said than done!

The technology is never the issue but to implement it against the wishes of people, who see this as an infringement of their fundamental right to be corrupt or anarchic, is the real challenge. This is where smart management techniques come in very handy. The key is to use the carrot and the stick to cajole, convince, convert, confuse or coerce everyone so that they have no option but to be yoked to structured, technology-enabled systems. Individual brilliance and creativity is great and diversity is something wonderful to celebrate but, if cars do not stop at traffic lights but only when people block the road, then society collapses into the kind of anarchy that India is familiar with. Net-net we need to design systems that will bring technology and management techniques into the governance process in a manner that minimizes the need for people in the governance process.

Is this possible at all ? To a large extent, yes.

Since data should be the basis of any rational decision, our systems must forcibly collect data and place it in the public domain. Next a clear set of algorithms, or rules, must be put in place so that the data itself drives decisions -- say, for example, approvals for or limits on expenditure, the quantum of taxes due -- in a way where humans have only a supervisory role. Finally the data, the process of arriving at a decision and the decision itself must be automatically visible to the public. This is a generic template for transparent governance. A simple example would be a Wikimapia style map showing physical locations of NREGA projects along with time-stamped, GPS encoded pictures shot before and after the project is executed -- without which no further funds will be released to the panchayat in question. Three previous articles in these columns have shown how similar systems can indeed be designed to help expedite justice in courts, facilitate elections and track corruption at the operational level.

The design and implementation of such systems would of course eliminate a lot of redundant but hugely lucrative positions in the administration and so would be stoutly resisted by an army of the most corrupt. This can be overcome only if the elected leadership has the political will and the administrative wherewithal to place a few honest and technophile administrators at key decision making posts in government. This is the only, and minimal, ask if we want to see technology enabled rationality in the governance of this country.

The tyranny of a Singapore style benevolent dictatorship may pose too big a risk for a big, multicultural country like India but the tyranny of systems developed and deployed by a few smart and well meaning people employed by an elected government is the answer to India’s perennial problems.

This article originally appeared in Swarajyamag -- the Magazine that reads India right

May 03, 2015

Maps of India : DIY with R and GADM data

Displaying spatial data on maps is always interesting but most Visualisation tools do not offer facilities to create maps of India, especially at the state and lower levels. In this post, we will show how such maps can be made.

The base data for such maps, the "polygons" that define the country, the states, the districts and even the talukas ( or sub-divisions), is available from an organisation called Global Administrative Areas or gadm.org. Country level files for almost all countries are available in a variety of formats including R and these are at three different levels. For India, these files can be downloaded as IND_admN.RData where N = 1, 2, 3. These will form the raw data from which we will create our maps.

Unfortunately, the GADM files represent a truncated Kashmir. How I wish that the Government of India and the National Atlas and Thematic Mapping Organisation would publish similar files for us. Anyway, we work with what we readily have ...

Working with R, we will need two R packages :

# Load required libraries
library(sp)             # provides spplot
library(RColorBrewer)   # provides brewer.pal

Assuming that the downloaded RData file is located in the R working directory, the following code will generate a basic map of India showing the states

# simple map of India with states drawn
# unfortunately, Kashmir will get truncated
load("IND_adm1.RData")   # level 1 file; creates an object named gadm
ind1 = gadm
spplot(ind1, "NAME_1", scales=list(draw=T), colorkey=F, main="India")
Now suppose there is some data ( economic, demographic or whatever ...) and we wish to colour each state with a colour that represents this data. We simulate this scenario by assigning a random number ( between 0 and 1) to each state and then defining the RGB colour of this region with a very simple function that converts the data into a colour value. [ This idea borrowed from gis.stackexchange ]

# map of India with states coloured with an arbitrary fake data
ind1$NAME_1 = as.factor(ind1$NAME_1)
ind1$fake.data = runif(length(ind1$NAME_1))
spplot(ind1,"NAME_1",  col.regions=rgb(0,ind1$fake.data,0), colorkey=T, main="Indian States")
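To see what the rgb() call is doing, here is a small standalone sketch with three made-up data values: a value of 0 gives black, a value of 1 gives the brightest green, and anything in between gives a proportionally darker green.

```r
# Made-up data values between 0 and 1, one per region
fake.data <- c(0, 0.5, 1)

# Red and blue are fixed at 0; the data drives the green channel only
cols <- rgb(0, fake.data, 0)
print(cols)
```

Each element of cols is a hexadecimal colour string, which is the form in which the colours are handed to col.regions.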

Now let us draw the map of any one state. First check the spelling of each state by listing the states:

ind1$NAME_1

and then executing these commands :
# map of West Bengal ( or any other state )
wb1 = (ind1[ind1$NAME_1=="West Bengal",])
spplot(wb1,"NAME_1", col.regions=rgb(0,0,1), main = "West Bengal, India",scales=list(draw=T), colorkey =F)

# map of Karnataka ( or any other state )
kt1 = (ind1[ind1$NAME_1=="Karnataka",])
spplot(kt1,"NAME_1", col.regions=rgb(0,1,0), main = "Karnataka, India",scales=list(draw=T), colorkey =F)

If we want to get and map district level data then we need to use the level 2 data as follows :

# load level 2 india data downloaded from http://gadm.org/country
load("IND_adm2.RData")
ind2 = gadm

and then plot the various districts as

# plotting districts of a State, in this case West Bengal
wb2 = (ind2[ind2$NAME_1=="West Bengal",])
spplot(wb2,"NAME_1", main = "West Bengal Districts", colorkey =F)

To identify each district with a beautiful colour we can use the following commands :
# colouring the districts with rainbow of colours
wb2$NAME_2 = as.factor(wb2$NAME_2)
col = rainbow(length(levels(wb2$NAME_2)))
spplot(wb2,"NAME_2",  col.regions=col, colorkey=T)

As in the case of the states, if we assume that each district has some (economic or demographic) data and wish to colour the districts according to the intensity of this data, we can use the following code :

# colouring the districts with some simulated, fake data
wb2$NAME_2 = as.factor(wb2$NAME_2)
wb2$fake.data = runif(length(wb2$NAME_1)) 
spplot(wb2,"NAME_2",  col.regions=rgb(0,wb2$fake.data, 0), colorkey=T)

But we can be even more clever by allocating certain shades of colour to certain ranges of data as with this code, adapted from this website

# colouring the districts with range of colours
# cut() assigns the labels directly, so a range with no districts in it does not shift the labels
col_no = cut(wb2$fake.data, c(0,0.2,0.4,0.6,0.8,1), labels=c("<20%", "20-40%", "40-60%","60-80%", ">80%"))
wb2$col_no = col_no
myPalette = brewer.pal(5,"Greens")
spplot(wb2, "col_no", col=grey(.9), col.regions=myPalette, main="District Wise Data")
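The binning step can be tried on its own, without any map data. This is a minimal sketch, assuming 20 made-up values in place of the district data and an arbitrary seed. Passing the labels to cut() keeps all five ranges as factor levels even if one range happens to be empty.

```r
set.seed(42)             # arbitrary seed so that the fake data is repeatable
fake.data <- runif(20)   # stand-in for the data of 20 districts

# Put each value into one of five equal-width ranges and label the ranges
bins <- cut(fake.data,
            breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
            labels = c("<20%", "20-40%", "40-60%", "60-80%", ">80%"),
            include.lowest = TRUE)

# Count how many values fall into each range
print(table(bins))
```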

To move below the district, to the sub-division ( or taluk) level, we need to use the level three data file :

# load level 3 india data downloaded from http://gadm.org/country
load("IND_adm3.RData")
ind3 = gadm

# extracting data for West Bengal
wb3 = (ind3[ind3$NAME_1=="West Bengal",])

and then plot the subdivision or taluk level map as follows :

#plotting districts and sub-divisions / taluk
wb3$NAME_3 = as.factor(wb3$NAME_3)
col = rainbow(length(levels(wb3$NAME_3)))
spplot(wb3,"NAME_3", main = "Taluk, District - West Bengal", colorkey=T,col.regions=col,scales=list(draw=T))

Now let us get a map of the district - North 24 Parganas. Make sure that the name is spelt correctly.

# get map for "North 24 Parganas District"
wb3 = (ind3[ind3$NAME_1=="West Bengal",])
n24pgns3 = (wb3[wb3$NAME_2=="North 24 Parganas",])
spplot(n24pgns3,"NAME_3", colorkey =F, scales=list(draw=T), main = "24 Pgns (N) West Bengal")

and within North 24 Parganas district, we can go down to the Basirhat Subdivision ( Taluk) and draw the map as follows: 

# now draw the map of Basirhat subdivision
# recreate North 24 Parganas data
n24pgns3 = (wb3[wb3$NAME_2=="North 24 Parganas",])
basirhat3 = (n24pgns3[n24pgns3$NAME_3=="Basirhat",])
spplot(basirhat3,"NAME_3", colorkey =F, scales=list(draw=T), main = "Basirhat,24 Pgns (N) West Bengal")

This is the highest resolution ( or lowest administrative division ) that we can reach with data from gadm. However even within a map, one can "zoom" into and enlarge an area by specifying the latitudes and longitudes of a zoom box as shown here.

# zoomed in data
wb2 = (ind2[ind2$NAME_1=="West Bengal",])
wb2$NAME_2 = as.factor(wb2$NAME_2)
col = rainbow(length(levels(wb2$NAME_2)))
spplot(wb2,"NAME_2",  col.regions=col,scales=list(draw=T),ylim=c(23.5,25),xlim=c(87,89), colorkey=T)

With this it should be possible to draw any map of India. For more comprehensive examples of such maps, please see this page.

PostScript : The full code for creating these maps as well as additional information on how to place text and markers on these maps is available on my specialist visualisation blog.

April 22, 2015

Mapping Money Movements to Trap Corruption

In the chequered history of parliamentary legislation in India, the RTI Act stands out as a significant milestone that puts activities of the government under public scrutiny. But even though the Act gives a legitimate platform for citizens to ask questions on, the process is cumbersome and answers are often given in a manner that is not easy to understand or make use of.

But why must a citizen have to ask for something that is his by birthright? Why can the information not be released automatically? But then who decides what information is to be released? At what level of detail? At what frequency?

The biggest challenge facing India is corruption. It is the mother of all problems because it leads to and exacerbates all other problems. If controlled, the money saved can be used to address most deficiencies in health, education and other social sectors.

Misguided people wrongly believe that having a strong Lokpal will solve the problem but when bodies as powerful as the CBI, the CVC have been subverted and compromised by political interests, it is foolish to expect one more publicly funded body, like the Lokpal to be any better. Instead, let us explore how a crowd sourced and data driven approach can help both track and crack the problem.

The central and state governments in India, between them, spend about Rs 25 lakh crores every year. Even if, very optimistically, only 10% of this is lost in corruption, then the presumptive loss to the public exchequer is Rs 2.5 lakh crores every year, compared to the one time loss of Rs 1.8 lakh crores in the  CoalGate scam.

Can we follow the clear stream of public money as it slowly gets lost in the dreary desert sands of government corruption? Most public money starts flowing from the commanding heights of the central government and passes through a complex hierarchy of state and central government departments, municipalities, zilla parishads, panchayats until it reaches the intended beneficiary, who could be a citizen, an employee or a contractor. Flows also begin from money collected as local taxes and routed through similar channels. The accompanying diagram gives a rough idea of this process.

In the language of mathematics, this is a graph consisting of a collection of nodes connected to each other through directed edges. Each node represents a government agency or public body and each directed edge represents a flow of money. Inward pointing edges mean that the node or agency receives money while outward pointing edges represent payments made. Green nodes are sources where money enters -- this could be tax, government borrowings or even the RBI printing notes, while yellow nodes are end-use destinations of public money -- salaries, contractor payments, interest payments and direct benefits transferred to citizens.

Ideally, the flow of money through this network should be such that over a period of time the total amount that flows into the network at all green nodes should equal the total that flows out into all yellow nodes. In reality the sum will never add up because there is significant leakage, or theft, in the network. The money lost in transit within the graph between green and yellow nodes is one quantifiable measure of corruption. Another, less obvious case is the inexplicable or unusually high flows of money  -- as in the case of the Chaibasa Treasury at the height of the Fodder Scam or for a sudden spurt of expenditure in widening all roads inside a particular IIT Campus. Any deviation from norms, either historical or from similar expenditure elsewhere, needs an investigation and explanation.
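The balancing idea can be sketched in a few lines of R. Everything here is invented for illustration: the node names, the amounts (in Rs crore) and the single leak are made up, and a real system would read the flows from published data.

```r
# Hypothetical cash flows (Rs crore); each row is one directed edge
flows <- data.frame(
  from   = c("Tax", "Centre", "Centre", "State", "Panchayat"),
  to     = c("Centre", "State", "Salaries", "Panchayat", "Contractor"),
  amount = c(100, 60, 40, 60, 45),
  stringsAsFactors = FALSE
)

nodes   <- union(flows$from, flows$to)
inflow  <- sapply(nodes, function(n) sum(flows$amount[flows$to == n]))
outflow <- sapply(nodes, function(n) sum(flows$amount[flows$from == n]))

sources <- nodes[inflow == 0]    # green nodes: money enters here
sinks   <- nodes[outflow == 0]   # yellow nodes: money is finally spent here
middle  <- setdiff(nodes, c(sources, sinks))

# For an intermediate agency, inflow minus outflow is money unaccounted for
gap <- (inflow - outflow)[middle]
suspects <- names(gap)[gap != 0]

# Network-wide check: what entered at the sources versus what reached the sinks
leakage <- sum(outflow[sources]) - sum(inflow[sinks])

print(gap)
print(leakage)
```

In this toy network the only intermediate node whose books do not balance is the one that received 60 and paid out 45, and the same shortfall shows up in the network-wide total.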

Can such anomalies and deviations be detected? And how does the RTI Act fit into this picture?

This graph is large, the number of nodes and edges is very high and the problem seems insurmountable. But we can divide and conquer the problem because many tasks can be done in parallel by independent groups. Instead of focussing on the entire graph at one go, we can zoom-in on certain segments of the graph and examine specific nodes at higher magnification or smaller granularity. In principle, if the flow through every node is accounted for, then the flow through the whole graph gets accounted for automatically.

How do we conquer each node?

Since each node is a government organisation, it falls under the RTI Act and we can demand details of all its cash flows. Once this is in the public domain it can be examined by private volunteer investigators either manually or by automated software specifically designed for forensic audit. In fact if Google search bots, or software robots, can crawl through the entire web to track down, rate and index trillions of unstructured web pages, it would not be difficult to build software that can track down and reconcile each and every cash flow transaction in India provided the data is publicly available in a digital format.

The results of such investigations and the unbalanced flows that are revealed should also be in the public domain and would be the starting point of either a formal CAG directed audit or citizen activism directed at the agency concerned. If the cash flows of a particular panchayat or a government agency do not add up or seem suspicious, affected parties should take this up locally either through their elected representatives or through more specific and focussed RTI requests.

This may seem complicated but is not really so. All that we are demanding is that bodies that deal with public money should publish their financial accounts into the public domain in a standardised format. Obviously the accounting format specified for listed companies may not be appropriate since assets and liabilities are accounted for differently and there is no question of profit or loss for public bodies. Instead, the focus is on the cash flow statement. Specifically, how much money is coming in? And where is it going to?

How will this actually work in practice?

First, the CAG and the Institute of Chartered Accountants of India will create a format to report all cash flows in public bodies in terms of a nationally consistent set of cost codes and charge accounts. Next the CIC will mandate that all public bodies must upload this information every quarter into a public website maintained by the CIC. As this information accumulates online, volunteer auditors, public activists, anti-corruption campaigners and even the CAG itself, if it wants to, can collaboratively build a website, like wikipedia or wikimapia, that pulls data from the underlying CIC website and displays the cash flow graph. Even as the graph gets built, people can start looking for missing or unusual flows.

So there is no persistent workload on either the CAG or the CIC. However each public body must prepare its cash flow statement and put it on the CIC website. In any case, a public body is expected to keep a  record of  all cash flows --  through cash, cheque or EFT. This needs to be put into CAG designated format and uploaded periodically to the CIC website.  

In fact the CIC website should already be ready because in October 2014,  the DoPT  announced that henceforth all replies under the RTI Act will be uploaded to the web in any case!

In the initial stages, the graph will have major discrepancies in cash flow and this could be because all agencies, or nodes, have not been identified or are not reporting information. At this point, activists can put in specific RTI requests to the concerned agencies to report their information in the CAG format and the CIC must ensure immediate compliance with the same.  Over a period of time and with a number of iterations the cash flows through all parts of the network should first become visible and then balance out. Compliance will be achieved by continuous RTI pressure at the grassroots, applied in a parallel and distributed manner, on defaulting government agencies.

When known outflows do not match inflows or if there are deviations from expected norms then activists can draw the attention of the media and opposition politicians who can raise a hue and cry. This should lead to the usual process of investigation where the CVC, the CBI or even the Supreme Court can get involved. After all,  Al Capone, the notorious US gangster was finally convicted on the basis of a forensic audit, not a shoot-out police operation!

Unlike the top-down Lokpal driven approach, this bottom-up strategy calls for no new law, no new agency, no new technology, no new infrastructure. All it needs is a format to be defined by the CAG and for the CIC to ensure that any RTI demand in this format be addressed immediately. With this, the flow of public money will become visible and if this transparency leads to a reduction of only 10% of the presumptive loss of Rs 2.5 lakh crores, it would still mean an additional Rs 25,000 crore for the Indian public every year.

19th century British India had the vision, the audacity and the tenacity to carry out the Great Trigonometrical Survey that created the first comprehensive map ( see diagram)  necessary to govern this vast country. Armed with the digital technology of the 21st century, a similar mapping of all public cash flows will lead to greater transparency in the governance of modern India.

A high resolution version of this image is available from wikimedia.

This article first appeared in Swarajya -- the magazine that helps India think Right

April 17, 2015

Big Data for the non-Geek : Hadoop, Hortonworks & H2O

Hadoop is a conceptual delight and an architectural marvel. Anybody who understands the immense challenge of crunching through a humungous amount of data will appreciate the way it transparently distributes the workload across multiple computers and marvel at the elegance with which it does so.

image from nextgendistribution.com
Thirty years after my first tryst with data -- as relational database management systems that I had come across at the University of Texas at Dallas -- my introduction to Hadoop was an eye opener into a whole new world of data processing. Last summer, I managed to Demystify Map Reduce and Hadoop by installing it on Ubuntu and running a few Java programs but frankly I was more comfortable with Pig and Hive that allowed a non-Java person -- or pre-Java dinosaur -- like me to perform meaningful tasks with Map-Reduce. RHadoop should have helped bridge the gap between R based Data Science ( or Business Analytics) and the world of Hadoop but integrating the two was a tough challenge and I had to settle for the Hadoop streaming API as a means of executing predictive statistics tasks like linear regression with R in a Hadoop environment.

The challenge with Hadoop is that it is meant for men, not boys. It does not run on Windows -- the OS that boys like to play with -- nor does it have a "cute" GUI interface. You need to key in mile long unix commands to get the job done and while this gives you good bragging points in the geek community, all that you can expect from the MBAwallahs and the middle managers at India's famous "IT companies" is nothing more than a shrug and a cold shoulder.

But the trouble is Hadoop ( and Map Reduce) is important. It is an order of magnitude more important than all the Datawarehousing and Business Intelligence that IT managers in India have been talking about for the past fifteen years. So how do we get these little boys ( who think they are hairy chested adult men -- few women have any interest in such matters) to try out Hadoop?

Enter Hortonworks, Hue and H2O ( seriously !) as a panacea to all problems.

First, you stay with your beloved Windows (7 or 8) and install a virtual machine software. In my case, I chose and downloaded Oracle VM VirtualBox that I can use for free, and installed it on my Windows 7 partition. ( I have an Ubuntu 14.04 partition where I do some real work). A quick double-click on the downloaded software is enough to get it up and running.

Next I downloaded the Hortonworks Data Platform (HDP) sandbox as an Oracle Virtual Appliance and imported it into the Oracle VM VirtualBox, and what did I get? A completely configured single node Hadoop cluster along with Pig, Hive and a host of applications that constitute the Hadoop ecosystem! And all this in just about 15 minutes -- unbelievable!

Is this for real? Yes it is!

For example the same Java WordCount program that I had compiled and executed last year, worked perfectly with the same shell script that I modified with the corresponding Hadoop libraries present in the sandbox.

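# for compiling
# start from a clean directory for the compiled classes
rm -rf WCclasses
mkdir WCclasses
# compile the mapper, the reducer and the driver against the Hadoop libraries in the sandbox
javac -classpath /usr/hdp/ -d WCclasses WordMapper.java
javac -classpath /usr/hdp/ -d WCclasses SumReducer.java
javac -classpath /usr/hdp/ -d WCclasses WordCount.java
# package the compiled classes into a jar
jar -cvf WordCount.jar -C WCclasses/ .

# for executing/running
# remove the output of any previous run, since Hadoop will not overwrite an existing output directory
hdfs dfs -rm -r /user/ru1/wc-out2
hdfs dfs -ls /user/ru1
hdfs dfs -ls /user/ru1/wc-inp
hadoop jar WordCount.jar WordCount /user/ru1/wc-inp /user/ru1/wc-out2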

But what really takes the cake is that all data movement from the Unix file system to the HDFS file system is through upload/download through the browser. Not the dfs command, -copyFromLocal! This is due to the magic of Hue, a free and open source product that gives a GUI to the Hadoop ecosystem. Hue can be installed, like Pig or Hive, on any Hadoop system but in the Hortonworks sandbox, it comes pre-installed.
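For comparison, this is roughly what the same upload and download would look like from the command line. It is only a sketch: the local file name sample.txt is made up for illustration, while the HDFS directory is the input directory used in the WordCount run above.

```shell
#!/bin/sh
# Command-line equivalent of a Hue browser upload/download.
# LOCAL_FILE is a hypothetical file name; HDFS_DIR is the WordCount input directory.
LOCAL_FILE="sample.txt"
HDFS_DIR="/user/ru1/wc-inp"

if command -v hdfs >/dev/null 2>&1; then
    # Upload: local Unix file system -> HDFS
    hdfs dfs -copyFromLocal "$LOCAL_FILE" "$HDFS_DIR/"
    # Download: HDFS -> local Unix file system
    hdfs dfs -copyToLocal "$HDFS_DIR/$LOCAL_FILE" "./copy-of-$LOCAL_FILE"
else
    echo "hdfs command not found; run this inside the sandbox"
fi
```

With Hue, both steps collapse into a couple of clicks in the file browser.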

In fact, since Pig and Hive come bundled in the sandbox, it is very simple to run Pig and Hive programs by following these tutorials.

But as a data scientist, one must be able to run R programs with Hadoop. To do so, follow instructions given here. But there are some deviations:
  1. You log into the sandbox ( or rather ssh -root@ -p 2222) and install RStudio server, but it will NOT be visible at port 8787 as promised until you follow instructions given in this post. Port 8787 in the guest operating system must be made visible to the host operating system.
  2. You can start with installing the packages rmr2 and rhdfs. The other two, rhbase and plyrmr are not really necessary to get started with. Also devtools is not really required. Just use wget to pull the latest zip files and install the same.
  3. However RHadoop will NOT WORK unless the packages are installed in the common library and NOT in the personal library of the userid used to install the packages. See the solution to problem given in this post. This means that the entire installation must be made with the root login. Even this is a challenge because when you use the sudo command, the environment variables HADOOP_CMD and HADOOP_STREAMING are not available and without them the rhdfs package cannot be installed. This Catch22 can be overcome by installing without the sudo command BUT giving write privilege to all on the system library, which would be something like /usr/lib64/R/library.
  4. RStudio server would need a non-system, non-admin userid to access and use it.
Once you get past all this, you should be able to run not only the simple R+Hadoop program given in this post, but also the Hortonworks R Hadoop tutorial that uses linear regression to predict website visitors.

R and Hadoop is fine but converting a standard machine learning algorithm to make it work in the Map Reduce format is not easy, and this is where we use H2O, a remarkable open source product that allows standard machine learning tasks like Linear Regression, Decision Trees and K-Means to be performed through a GUI. H2O runs on Windows or Unix as a server that is accessed through a browser at localhost:54321. To install it on the Hortonworks HDP sandbox in the Oracle VM VirtualBox, follow instructions given in this Hortonworks+H2O tutorial.

In this case, you will (or might) face these problems:
  1. The server may not be available at ( the principal IP address ) but at the ( or local host) 
  2. The server may not become visible until you configure the VM to forward ports as explained in the solution given in this post.
Once you have H2O configured with Hadoop on the sandbox, then all normal machine learning tasks should be automatically ported to the map-reduce format and can benefit from the scalability of Hadoop.
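Both RStudio server (port 8787) and H2O (port 54321) stay invisible to the host until the guest port is forwarded. The posts linked above do this through the VirtualBox GUI; a rough command-line sketch using VBoxManage follows. It assumes the sandbox uses a NAT network adapter, and the VM name and rule names are hypothetical, so substitute the name shown in your VirtualBox Manager.

```shell
#!/bin/sh
# VM_NAME is a hypothetical name; use the one shown in the VirtualBox Manager.
VM_NAME="Hortonworks Sandbox"

# Build a NAT rule of the form "name,protocol,host-ip,host-port,guest-ip,guest-port".
# Empty IP fields mean "any address"; host and guest ports are kept identical.
nat_rule() {
    printf '%s,tcp,,%s,,%s' "$1" "$2" "$2"
}

RSTUDIO_RULE=$(nat_rule rstudio 8787)   # RStudio server
H2O_RULE=$(nat_rule h2o 54321)          # H2O

if command -v VBoxManage >/dev/null 2>&1; then
    # modifyvm can only be used while the VM is powered off.
    VBoxManage modifyvm "$VM_NAME" --natpf1 "$RSTUDIO_RULE"
    VBoxManage modifyvm "$VM_NAME" --natpf1 "$H2O_RULE"
else
    echo "VBoxManage not found; rules would be: $RSTUDIO_RULE and $H2O_RULE"
fi
```

After restarting the sandbox, the two servers should be reachable from the Windows browser at localhost:8787 and localhost:54321.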

So what have we achieved ? We have ...
  1. Hadoop, the quintessential product to address big data solutions
  2. Hortonworks that eliminates the problems associated with installing a complex product like Hadoop, plus its associated products like Pig, Hive, Mahout etc.
  3. The Oracle VM VirtualBox that allows us to install Hortonworks in a Windows environment
  4. Hue that gives a GUI to interact with the Hadoop ecosystem, without having to remember long and complex Unix commands.
  5. RHadoop that allows us to use RStudio to run Map Reduce programs. For more see this.
  6. H2O that allows us to run complete machine learning jobs, either in stand alone mode or as Map Reduce jobs in Hadoop through an intuitive GUI.
If you think this is too geeky, then think again if you really want to get into data science! In reality, once we get past the initial installation, the rest is very much GUI driven and the data scientist may just feel that he (or she) is back in the never-never land of menu driven software where you enter values and press shiny buttons. But under the hood you would be running heavy duty Hadoop programs that, in principle, can crunch through terabytes of data.

March 21, 2015

Why Not Vote Through ATMs?

The Chief Election Commissioner has recently stated that the EC is planning to use web and mobile based technology to allow citizens to cast their vote in local and national elections. Is this feasible? Will it be fair? Or secure? The challenge is indeed daunting but let us see how we can leverage an existing technology infrastructure to reach this goal easily and at a very low cost.

Online voting is not a new idea. The Computer Society of India and many public limited companies are already using the internet to allow members and shareholders to vote by logging into websites. The challenge is to make sure that only those who are authorised to vote are allowed to login to the site and this is ensured by sending unique userids and passwords by email. Obviously this assumes that every member of the electorate has a valid and validated email ID. But this would not be true when we consider the kind of people who are voting in panchayat, municipal, state legislature and Lok Sabha elections.

image "borrowed" from http://nohandcuffs.com

This difficulty can be overcome with small, inexpensive hardware devices that are used for secure logins in some banks and multinational companies but the cost and difficulty of distributing such devices is very high. Finally we need to secure such web based systems from sophisticated hackers, cyber criminals and from cyber attacks from hostile countries. Since this is a big challenge, we need to consider alternatives.

Consider the ATM network that already spans the entire country.

Having evolved over the years, ATMs are viewed as stable and mature platforms for critical services. Moreover each usage of an ATM card is clearly and unequivocally tagged to a bank account that in turn is connected to a person identified by a rigorous Know-Your-Customer (KYC) process. So while I can always give my ATM card to my wife and tell her the PIN code, I cannot deny or repudiate the actions that are performed with the card. The bank holds me responsible for any money withdrawn or transferred except when -- and this is rare -- the card has been stolen, along with the PIN, and used. The first ATM in India was set up by HSBC in 1987 and after a quarter of a century of usage, the technology has instilled a sense of comfort, both in banks and in people, that it is indeed reliable. Moreover all ATMs can be used to access funds lying in any bank in India.

What if the software in all ATM machines were to be upgraded to include an additional feature to allow voting? Just as the LPG Consumer number, given by the oil companies, is being connected to a bank account, with or without the existence of the Aadhaar card, so can a Voter Card number be connected to a bank account and become verifiable through an associated ATM card. This means that if one were to slide an ATM card into an ATM machine it will uniquely identify the individual on the Electoral Roll. Now it is a simple matter for the system to determine if one is eligible to vote in any particular election that is being conducted by the Election Commission at any point of time. For example, if on a particular date, a by-poll is being conducted in a particular constituency, only those who are registered voters in that constituency would be allowed to press a button and cast their votes -- and obviously, only once. For everyone else, the button to vote will be de-activated. If multiple elections are being conducted on the same date, a menu can be shown to the user so that he can choose the one specific election where he can cast his vote.

And as a by-product, the voter need not be physically present in the geographic area where the election is being held. Votes can be cast by people living or working far away from their home constituencies as long as they can reach an ATM machine.

Now let us do some back-of-the-envelope calculations to check the feasibility of the numbers involved. The total number of voters in India is approximately 82 crores while the number of bank accounts is around 65 crores. That looks nice until we realise that many people have multiple accounts and so the number of distinct account holders could be about 10 crores, leaving a gap of 70 crores. However under the PM’s Jan Dhan Yojana for financial inclusion, almost 11 crore new bank accounts have been opened in the last 6 months. If we continue at this pace then it is a matter of 4 or 5 years before we can provide every voter with a bank account. In fact, in the more densely populated and urban areas, full coverage of almost all voters with bank accounts can occur much earlier.
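The same back-of-the-envelope calculation can be written out as a few lines of shell arithmetic. This is only a sketch using the figures quoted above (all in crores) and it assumes that the pace of account opening stays constant, which is a simplification.

```shell
#!/bin/sh
# All figures are in crores and are taken from the paragraph above.
VOTERS=82
DISTINCT_HOLDERS=10
NEW_ACCOUNTS=11      # accounts opened under the Jan Dhan Yojana ...
PERIOD_MONTHS=6      # ... in this many months

# Voters who do not yet have a bank account (rounded to 70 in the text).
GAP=$(( VOTERS - DISTINCT_HOLDERS ))
# Months needed to close the gap at a constant pace (integer division).
MONTHS=$(( GAP * PERIOD_MONTHS / NEW_ACCOUNTS ))

echo "gap: $GAP crore voters without accounts"
echo "time to close the gap at the current pace: about $MONTHS months"
```

At a constant pace the gap closes in a little over three years, so the 4 or 5 years mentioned above leaves room for the pace to slow down as the easier cases get covered first.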

So the proposal to use bank account linked ATM cards to validate voters and allow them to cast votes at ATM machines could actually be an extension of the Government of India’s publicly stated goal of total financial inclusion for the entire population. Universal franchise and universal banking could in fact be made two sides of the same coin of technological development!

But wait, there is more to come! Last year, the President of India inaugurated RuPay, an Indian domestic card scheme launched by the National Payments Corporation of India (NPCI). It was created to fulfill the Reserve Bank of India’s desire to have a domestic electronic payments system in India. RuPay facilitates electronic payment at all Indian banks and financial institutions, and is comparable with MasterCard and Visa in India. Banks in India are authorized to issue RuPay debit cards to their customers for use at ATMs, PoS terminals, and e-commerce websites. Many banks, including all major public sector banks, currently issue RuPay cards to their customers and RuPay cards are also issued at about 200 cooperative and rural banks to promote financial inclusion.

According to data published by the NPCI, there are almost 1.45 lakh ATMs in India that accept the RuPay card. This is a sizeable fraction of the 8.35 lakh polling booths that were used during the 2014 Lok Sabha polls, and so these ATMs can be used to reduce the load on the traditional EVM based booths significantly. This will also lead to a surge in the issuance and usage of RuPay cards and help it break into the market dominated by the global giants.

Let us see how the process works.

First, the voter card numbers of all eligible voters need to be replaced with a uniform 16 digit number that reflects the state and the constituency that the voter is eligible to vote in. Next, every voter would need to have a KYC-compliant bank account that will be linked to this voter card number. Any attempt to link the same voter card number to multiple bank accounts will be caught automatically. So the RuPay debit card, or any other ATM card, that is linked to the account becomes the de-facto voting card for those choosing to use the electronic voting option. However all those who still want to cast votes at EVM booths can continue to use the traditional voter card. As a one-time effort, people must decide which option they will exercise.

Fortunately most of the physical components of the ATM network would remain the same and only the software would need to be changed. Today, when someone uses an ATM card, the validity of the card, the PIN and the balance available in the bank account are checked against the user’s own bank computers. In this case, a similar validation will be done against the Election Commission’s computers to determine the voter number of the person and the constituency that he is eligible to vote in.

Finally, when an election is being held in any constituency, the software will allow only those who are eligible to vote in that constituency to cast their votes. So instead of having to go over to that one, single designated booth where he is registered, the voter can simply walk into the nearest ATM machine and vote in a safe, secure and convenient manner. Not only will this be convenient for the voter, it will also reduce the pressure on the election machinery as the number of polling booths required will be significantly reduced. Moreover, with votes being cast in a distributed manner the threat of booth capture, physical intimidation of voters and the casting of false votes can be significantly reduced.
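The eligibility check described above can be sketched in a few lines of shell. This is purely illustrative: the layout of the 16 digit number (two digits for the state, three for the constituency, eleven for the serial number) is my assumption, and a real system would look up the voter on the Election Commission's computers instead of keeping a list in a shell variable.

```shell
#!/bin/sh
# Hypothetical layout of the 16 digit voter number (an assumption for illustration):
#   digits 1-2   state code
#   digits 3-5   constituency code within the state
#   digits 6-16  serial number of the voter

# Space-separated list of voter numbers that have already voted in this election.
VOTED=""

# can_vote VOTER_NUMBER POLL_PREFIX
# POLL_PREFIX is the 5 digit state+constituency code of the election being held.
# Prints "allow" and records the vote, or prints the reason for refusal.
can_vote() {
    voter="$1"
    poll="$2"

    # Reject anything that is not exactly 16 digits.
    case "$voter" in
        *[!0-9]*) echo "invalid"; return 1 ;;
    esac
    if [ "${#voter}" -ne 16 ]; then echo "invalid"; return 1; fi

    # Only voters registered in the polling constituency may vote.
    prefix=$(printf '%s' "$voter" | cut -c1-5)
    if [ "$prefix" != "$poll" ]; then echo "not-eligible"; return 1; fi

    # Each voter number may be used only once.
    case " $VOTED " in
        *" $voter "*) echo "already-voted"; return 1 ;;
    esac

    VOTED="$VOTED $voter"
    echo "allow"
}
```

Calling can_vote 1904200000000123 19042 prints allow the first time and already-voted the second time, while the same number with a different constituency code prints not-eligible. This corresponds to the vote button being de-activated for everyone who is not a registered voter of the constituency going to the polls.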

The scheme has many advantages. There is no major investment in physical infrastructure. Using bank accounts and RuPay cards will accelerate financial inclusiveness and popularize the card while the Election Commission will need less money and manpower. Finally, voters will be able to cast their votes in a secure and convenient manner from anywhere in the country.

Who could ask for anything more?

This article appeared in the March issue of Swarajya
