April 22, 2015

Mapping Money Movements to Trap Corruption

In the chequered history of parliamentary legislation in India, the RTI Act stands out as a significant milestone that puts the activities of the government under public scrutiny. But even though the Act gives citizens a legitimate platform on which to ask questions, the process is cumbersome and answers are often given in a manner that is not easy to understand or make use of.

But why must a citizen ask for something that is his by birthright? Why can the information not be released automatically? But then who decides what information is to be released? At what level of detail? At what frequency?

The biggest challenge facing India is corruption. It is the mother of all problems because it leads to and exacerbates all other problems. If controlled, the money saved can be used to address most deficiencies in health, education and other social sectors.

Misguided people believe that having a strong Lokpal will solve the problem, but when bodies as powerful as the CBI and the CVC have been subverted and compromised by political interests, it is foolish to expect one more publicly funded body, like the Lokpal, to be any better. Instead, let us explore how a crowdsourced and data-driven approach can help both track and crack the problem.

The central and state governments in India, between them, spend about Rs 25 lakh crores every year. Even if, very optimistically, only 10% of this is lost to corruption, the presumptive loss to the public exchequer is Rs 2.5 lakh crores every year, compared to the one-time loss of Rs 1.8 lakh crores in the CoalGate scam.

Can we follow the clear stream of public money as it slowly gets lost in the dreary desert sands of government corruption? Most public money starts flowing from the commanding heights of the central government and passes through a complex hierarchy of state and central government departments, municipalities, zilla parishads and panchayats until it reaches the intended beneficiary, who could be a citizen, an employee or a contractor. Flows also begin from money collected as local taxes and routed through similar channels. The accompanying diagram gives a rough idea of this process.

In the language of mathematics, this is a graph: a collection of nodes connected to each other through directed edges. Each node represents a government agency or public body and each directed edge represents a flow of money. Inward-pointing edges mean that the node, or agency, receives money, while outward-pointing edges represent payments made. Green nodes are sources where money enters -- this could be tax, government borrowings or even the RBI printing notes -- while yellow nodes are the end-use destinations of public money -- salaries, contractor payments, interest payments and direct benefits transferred to citizens.

Ideally, the flow of money through this network should be such that, over a period of time, the total amount that flows into the network at all green nodes equals the total that flows out at all yellow nodes. In reality the sums will never add up because there is significant leakage, or theft, in the network. The money lost in transit between green and yellow nodes is one quantifiable measure of corruption. Another, less obvious, case is an inexplicable or unusually high flow of money -- as in the case of the Chaibasa Treasury at the height of the Fodder Scam, or a sudden spurt of expenditure on widening all roads inside a particular IIT campus. Any deviation from norms, either historical or from similar expenditure elsewhere, needs an investigation and explanation.
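
The balance check described here is easy to state in code. Below is a minimal Python sketch on a toy graph -- all agency names and amounts are invented for illustration:

```python
# Hypothetical money-flow graph: each edge is (payer, payee, amount in Rs crore).
# All names and figures are invented for this example.
edges = [
    ("Centre", "State", 100.0),
    ("State", "ZillaParishad", 60.0),
    ("State", "Salaries", 35.0),
    ("ZillaParishad", "Panchayat", 40.0),
    ("Panchayat", "Beneficiaries", 30.0),
]

def node_imbalance(edges, node):
    """Inflow minus outflow at an intermediate node; non-zero means leakage."""
    inflow = sum(amt for _, dst, amt in edges if dst == node)
    outflow = sum(amt for src, _, amt in edges if src == node)
    return inflow - outflow

for node in ("State", "ZillaParishad", "Panchayat"):
    gap = node_imbalance(edges, node)
    if gap != 0:
        print(f"{node}: Rs {gap} crore unaccounted for")
```

The same check, run against real published accounts instead of toy numbers, is what would flag a Chaibasa-style anomaly for investigation.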

Can such anomalies and deviations be detected? And how does the RTI Act fit into this picture?

This graph is large -- the number of nodes and edges is very high -- and the problem seems insurmountable. But we can divide and conquer, because many tasks can be done in parallel by independent groups. Instead of focussing on the entire graph at one go, we can zoom in on certain segments of the graph and examine specific nodes at higher magnification, or smaller granularity. In principle, if the flow through every node is accounted for, then the flow through the whole graph gets accounted for automatically.

How do we conquer each node?

Since each node is a government organisation, it falls under the RTI Act and we can demand details of all its cash flows. Once this is in the public domain it can be examined by private volunteer investigators, either manually or by automated software specifically designed for forensic audit. In fact, if Google's search bots -- software robots -- can crawl through the entire web to track down, rate and index trillions of unstructured web pages, it would not be difficult to build software that can track down and reconcile each and every cash flow transaction in India, provided the data is publicly available in a digital format.
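
As a toy illustration of what such reconciliation software would do, the Python sketch below matches a payer's reported payments against a payee's reported receipts; the agency names and amounts are invented:

```python
from collections import Counter

# Each transaction is (payer, payee, amount in Rs). Invented data for illustration.
payments_reported = Counter([("AgencyA", "AgencyB", 500),
                             ("AgencyA", "AgencyC", 200)])
receipts_reported = Counter([("AgencyA", "AgencyB", 500)])

# Transactions claimed as paid but never acknowledged as received --
# exactly the kind of flow that calls for a follow-up RTI request.
unmatched = payments_reported - receipts_reported
for payer, payee, amount in unmatched:
    print(f"Unmatched: {payer} -> {payee}, Rs {amount}")
```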

The results of such investigations and the unbalanced flows that are revealed should also be in the public domain and would be the starting point of either a formal CAG directed audit or citizen activism directed at the agency concerned. If the cash flows of a particular panchayat or a government agency do not add up or seem suspicious, affected parties should take this up locally either through their elected representatives or through more specific and focussed RTI requests.

This may seem complicated but it is not really so. All that we are demanding is that bodies that deal with public money publish their financial accounts in the public domain in a standardised format. Obviously the accounting format specified for listed companies may not be appropriate, since assets and liabilities are accounted for differently and there is no question of profit or loss for public bodies. Instead, the focus is on the cash flow statement. Specifically: how much money is coming in, and where is it going?

How will this actually work in practice?

First, the CAG and the Institute of Chartered Accountants of India will create a format to report all cash flows in public bodies in terms of a nationally consistent set of cost codes and charge accounts. Next, the CIC will mandate that all public bodies upload this information every quarter to a public website maintained by the CIC. As this information accumulates online, volunteer auditors, public activists, anti-corruption campaigners and even the CAG itself, if it wants to, can collaboratively build a website, like Wikipedia or Wikimapia, that pulls data from the underlying CIC website and displays the cash flow graph. Even as the graph gets built, people can start looking for missing or unusual flows.
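
To make the idea concrete, a quarterly cash-flow record in such a standardised format might look like the CSV sketched below. The field names are purely illustrative -- the actual layout would be for the CAG and the ICAI to define:

```python
import csv
import io

# Hypothetical record layout for a public body's quarterly cash flows.
# Every field name and value here is invented for illustration.
fieldnames = ["body_id", "quarter", "direction", "cost_code", "counterparty", "amount_rs"]
rows = [
    {"body_id": "PANCHAYAT-001", "quarter": "2015Q1", "direction": "in",
     "cost_code": "GRANT-ROADS", "counterparty": "ZP-007", "amount_rs": 4000000},
    {"body_id": "PANCHAYAT-001", "quarter": "2015Q1", "direction": "out",
     "cost_code": "CONTRACTOR-PAYMENT", "counterparty": "VENDOR-042", "amount_rs": 3000000},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

With every node publishing records in one shared layout like this, the per-node balance checks and cross-node reconciliation become mechanical tasks that software can perform.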

So there is no persistent workload on either the CAG or the CIC. However, each public body must prepare its cash flow statement and put it on the CIC website. In any case, a public body is expected to keep a record of all cash flows -- through cash, cheque or EFT. This needs to be put into the CAG-designated format and uploaded periodically to the CIC website.

In fact the CIC website should already be ready because in October 2014, the DoPT announced that henceforth all replies under the RTI Act will be uploaded to the web in any case!

In the initial stages, the graph will have major discrepancies in cash flow and this could be because all agencies, or nodes, have not been identified or are not reporting information. At this point, activists can put in specific RTI requests to the concerned agencies to report their information in the CAG format and the CIC must ensure immediate compliance with the same.  Over a period of time and with a number of iterations the cash flows through all parts of the network should first become visible and then balance out. Compliance will be achieved by continuous RTI pressure at the grassroots, applied in a parallel and distributed manner, on defaulting government agencies.

When known outflows do not match inflows, or if there are deviations from expected norms, activists can draw the attention of the media and opposition politicians, who can raise a hue and cry. This should lead to the usual process of investigation, where the CVC, the CBI or even the Supreme Court can get involved. After all, Al Capone, the notorious US gangster, was finally convicted on the basis of a forensic audit, not a police shoot-out!

Unlike the top-down Lokpal driven approach, this bottom-up strategy calls for no new law, no new agency, no new technology, no new infrastructure. All it needs is a format to be defined by the CAG and for the CIC to ensure that any RTI demand in this format be addressed immediately. With this, the flow of public money will become visible and if this transparency leads to a reduction of only 10% of the presumptive loss of Rs 2.5 lakh crores, it would still mean an additional Rs 25,000 crore for the Indian public every year.

19th century British India had the vision, the audacity and the tenacity to carry out the Great Trigonometrical Survey that created the first comprehensive map (see diagram) necessary to govern this vast country. Armed with the digital technology of the 21st century, a similar mapping of all public cash flows will lead to greater transparency in the governance of modern India.

A high-resolution version of this image is available from Wikimedia.

This article first appeared in Swarajya -- the magazine that helps India think Right

April 17, 2015

Big Data for the non-Geek : Hadoop, Hortonworks & H2O

[Note] -- Hadoop, IMHO, is history. Rather than waste time with all this, I suggest you check out my blog post on Spark with Python.

Hadoop is a conceptual delight and an architectural marvel. Anybody who understands the immense challenge of crunching through a humungous amount of data will appreciate the way it transparently distributes the workload across multiple computers and marvel at the elegance with which it does so.

Image from nextgendistribution.com
Thirty years after my first tryst with data -- the relational database management systems that I had come across at the University of Texas at Dallas -- my introduction to Hadoop was an eye-opener into a whole new world of data processing. Last summer, I managed to Demystify Map Reduce and Hadoop by installing it on Ubuntu and running a few Java programs, but frankly I was more comfortable with Pig and Hive, which allowed a non-Java person -- or pre-Java dinosaur -- like me to perform meaningful tasks with Map-Reduce. RHadoop should have helped bridge the gap between R-based Data Science (or Business Analytics) and the world of Hadoop, but integrating the two was a tough challenge and I had to settle for the Hadoop streaming API as a means of executing predictive statistics tasks, like linear regression with R, in a Hadoop environment.
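
The streaming API works with any pair of programs that read standard input and write standard output, which is what lets R (or Python) slot into Map-Reduce. A minimal word-count mapper and reducer in Python, as a sketch of that contract:

```python
#!/usr/bin/env python
# streaming_wordcount.py -- a minimal mapper/reducer pair for Hadoop streaming.
# Pass "map" or "reduce" as the first argument when wiring it into the
# streaming jar's -mapper and -reducer options.
import sys
from itertools import groupby

def mapper(lines):
    # Emit one "word<TAB>1" pair per word seen.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Hadoop streaming sorts mapper output by key before the reducer runs,
    # so equal words arrive as consecutive lines and groupby can sum them.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__" and len(sys.argv) > 1:
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

An R script plugs into the same slot, as long as it follows the same stdin/stdout convention.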

The challenge with Hadoop is that it is meant for men, not boys. It does not run on Windows -- the OS that boys like to play with -- nor does it have a "cute" GUI. You need to key in mile-long Unix commands to get the job done, and while this gives you good bragging points in the geek community, all that you can expect from the MBAwallahs and the middle managers at India's famous "IT companies" is nothing more than a shrug and a cold shoulder.

But the trouble is that Hadoop (and Map Reduce) is important. It is an order of magnitude more important than all the Datawarehousing and Business Intelligence that IT managers in India have been talking about for the past fifteen years. So how do we get these little boys (who think they are hairy-chested adult men -- few women have any interest in such matters) to try out Hadoop?

Enter Hortonworks, Hue and H2O (seriously!) as a panacea to all problems.

First, you stay with your beloved Windows (7 or 8) and install virtual machine software. In my case, I chose and downloaded Oracle VM VirtualBox, which I can use for free, and installed it on my Windows 7 partition. (I have an Ubuntu 14.04 partition where I do some real work.) A quick double-click on the downloaded software is enough to get it up and running.

Next I downloaded the Hortonworks Hadoop Data Platform (HDP) sandbox as an Oracle Virtual Appliance and imported it into the Oracle VM VirtualBox, and what did I get? A completely configured single-node Hadoop cluster along with Pig, Hive and a host of applications that constitute the Hadoop ecosystem! And all this in just about 15 minutes -- unbelievable!

Is this for real? Yes it is!

For example, the same Java WordCount program that I had compiled and executed last year worked perfectly with the same shell script, once I modified it to point to the corresponding Hadoop libraries present in the sandbox.

# for compiling
rm -r WCclasses
mkdir WCclasses
javac -classpath /usr/hdp/ -d WCclasses WordMapper.java
javac -classpath /usr/hdp/ -d WCclasses SumReducer.java
javac -classpath /usr/hdp/ -d WCclasses WordCount.java
jar -cvf WordCount.jar -C WCclasses/ .
# for executing/running

hdfs dfs -rm -r /user/ru1/wc-out2
hdfs dfs -ls /user/ru1
hdfs dfs -ls /user/ru1/wc-inp
hadoop jar WordCount.jar WordCount /user/ru1/wc-inp /user/ru1/wc-out2

But what really takes the cake is that all data movement between the Unix file system and the HDFS file system is through upload/download in the browser -- not through the hdfs dfs -copyFromLocal command! This is due to the magic of Hue, a free and open source product that gives a GUI to the Hadoop ecosystem. Hue can be installed, like Pig or Hive, on any Hadoop system, but in the Hortonworks sandbox it comes pre-installed.

In fact, since Pig and Hive come bundled in the sandbox, it is very simple to run Pig and Hive programs by following these tutorials.

But as a data scientist, one must be able to run R programs with Hadoop. To do so, follow the instructions given here. But there are some deviations:
  1. You log into the sandbox (or rather, ssh into it as root on port 2222) and install RStudio Server, but it will NOT be visible at port 8787 as promised until you follow the instructions given in this post. Port 8787 in the guest operating system must be made visible to the host operating system.
  2. You can start by installing the packages rmr2 and rhdfs. The other two, rhbase and plyrmr, are not really necessary to get started with. Also, devtools is not really required. Just use wget to pull the latest zip files and install them.
  3. However, RHadoop will NOT WORK unless the packages are installed in the common library and NOT in the personal library of the userid used to install them. See the solution to this problem given in this post. This means that the entire installation must be made with the root login. Even this is a challenge because when you use the sudo command, the environment variables HADOOP_CMD and HADOOP_STREAMING are not available, and without them the package rhdfs cannot be installed. This Catch-22 can be overcome by installing without the sudo command BUT giving write privilege to all on the system library, which would be something like /usr/lib64/R/library.
  4. RStudio Server needs a non-system, non-admin userid to access and use it.
Once you get past all this, you should be able to run not only the simple R+Hadoop program given in this post, but also the Hortonworks R Hadoop tutorial that uses linear regression to predict website visitors.

R and Hadoop is fine, but converting a standard machine learning algorithm to work in the Map Reduce format is not easy, and this is where we use H2O, a remarkable open source product that allows standard machine learning tasks like Linear Regression, Decision Trees and K-Means to be performed through a GUI. H2O runs on Windows or Unix as a server that is accessed through a browser at localhost:54321. To install it on the Hortonworks HDP sandbox in the Oracle VM VirtualBox, follow the instructions given in this Hortonworks+H2O tutorial.

In this case, you will (or might) face these problems:
  1. The server may not be available at the principal IP address of the sandbox but at localhost instead.
  2. The server may not become visible until you configure the VM to forward ports, as explained in the solution given in this post.
Once you have H2O configured with Hadoop on the sandbox, all normal machine learning tasks should be automatically ported to the map-reduce format and can benefit from the scalability of Hadoop.

So what have we achieved? We have ...
  1. Hadoop, the quintessential product to address big data solutions
  2. Hortonworks, which eliminates the problems associated with installing a complex product like Hadoop, plus its associated products like Pig, Hive, Mahout etc.
  3. The Oracle VM VirtualBox that allows us to install Hortonworks in a Windows environment
  4. Hue that gives a GUI to interact with the Hadoop ecosystem, without having to remember long and complex Unix commands.
  5. RHadoop that allows us to use RStudio to run Map Reduce programs. For more see this.
  6. H2O that allows us to run complete machine learning jobs, either in stand alone mode or as Map Reduce jobs in Hadoop through an intuitive GUI.
If you think this is too geeky, then think again if you really want to get into data science! In reality, once we get past the initial installation, the rest is very much GUI-driven, and the data scientist may just feel that he (or she) is back in the never-never land of menu-driven software where you enter values and press shiny buttons. But in reality you would be running heavy-duty Hadoop programs that, in principle, can crunch through terabytes of data.