December 12, 2014

Tragic, Magic Sunderbans

We have been living in Calcutta for so long and yet rarely do we find the time to visit the Sunderbans, a UNESCO World Heritage Site, that is just next door. The Sunderbans is famous for two things: the Royal Bengal Tiger and the unique estuarine mangrove ecosystem that has protected Calcutta and Bengal from the fury of many Bay of Bengal cyclones and the resultant storm surges. In the context of global warming and the rise in the level of the sea, it is very likely that a hundred years from now the Sunderbans will get submerged and then live on only in history and in people's memory. Hopefully, these pictures will help us remember this tragic, magic land!

Our destination was the Royal Bengal Resort, near the Sajnekhali Reserve Forest

But we started off from Godkhali, 100km from Calcutta, just beyond Canning and right across the Bidyadhari River from Gosaba, one of the largest and busiest bazaar villages in the Sunderbans.

There are no roads in the Sunderbans and the only transportation is a launch or a motorized river boat. (There is hardly any fresh, sweet drinking water either, but that is another story )

and what you see is the endless vista of the dense mangrove, sundari, khalsi, hetal forests through which the launch keeps moving

the picture above shows a Hetal forest, where tigers typically hide

and when the tide goes down -- and it goes down nearly 20 to 30 feet, twice a day, you may get stuck on mud flats like this

and then you can see the birds

that come looking for crabs. Note that the red crabs have one large claw!

and as you head deeper and deeper into the dense, swampy jungle you will get to see the estuarine crocodile

crawling through the pneumatophores or aerial roots that help these tidal forests to breathe! Moving even deeper, we got a glimpse of deer and wild boar

you would notice that the deer in the picture above is "alert", with its tail up and its leg crooked. Normally this would mean that it is sensing danger and the biggest danger in the whole of Sunderbans, after the terrible cyclones, is the Royal Bengal Tiger!

Of course you need to be very lucky to spot a tiger ( or incredibly unlucky, as ten to twenty people are killed by tigers every year !) but the closest that we came to one was when we spotted these pugmarks !

at  Sudhanyakhali where the forest department has set up a sweet water pool to lure tigers and other wildlife like this monitor-lizard

and created a set of clearings where hopefully the tiger would appear.

But even though we could not see the tiger, it is an integral and dangerous part of the life of people out here. Unlike tigers in other parts of the country, the Royal Bengal Tiger of the Sunderbans is by nature not averse to attacking humans since it views them as its legitimate prey. As a result, there is hardly any village in the Sunderbans that has not lost a couple of people to the tiger in the last five years. Fortunately, the tiger will never cross a fence even though, given its immense strength, it could very easily do so, and so there has been some effort to fence in the dangerous jungle -- look at the next picture carefully

but the land is so vast that this is a very difficult task.

In fact, for the people who live here, there is the tiger on the land, the crocodile in the water, the soil is saline and unfit for agriculture and the absence of either roads or electricity means that there is no industry. The only option is the extremely dangerous task of gathering honey from the tiger infested forest and fishing in shark and crocodile infested waters.

and trying to eke out a living behind the levees that barely manage to keep the sea water out of their  homes.

Other than tourists like us, who go and spend a few rupees in this desolate land, the only source of succour is Bon-Bibi, the Queen of the Forest, a variation of the Divine Feminine that is worshipped all over India and in every nook and corner of the Sunderbans.

After two days of wandering in the watery wilderness of the swamps and forests, we finally headed back to Gosaba where we saw the bungalow built by the intrepid Mr Hamilton ( after whom an island has been named ) and where Rabindranath Tagore had spent some time

Thus ended our two nights and three days very pleasantly spent at the Sunderbans. In case anyone is interested in visiting the Sunderbans, you may contact Mr Chatterjee at 9874971845 and he will put you in touch with some very nice people who can help you travel through this difficult land.

October 14, 2014

Three Myths and a Lie : Charbak Unplugged

The world today is awash in lies and myths, because nothing is what it seems to be. Paid media, biased press -- no one seems to be telling the truth anymore. Is this how it is supposed to be ? As an aside, any analysis of philosophy in India invariably begins with a commentary, or denunciation, of Charbak’s philosophy of extreme materialism -- reenam kritwa, ghreetam pibet! Borrow money and have fun. Now that we have invoked Charbak, let us get along with our thoughts.

There are so many lies floating around that it is not too difficult to pick any three of them. Consider the fallacy of the demographic dividend that has been tom-tommed by everyone from an opportunist Nobel Laureate to the semi-literate leftist. The main thesis is that if a country has lots and lots of young people it will automatically become an economic powerhouse. Why? Because young people can work. What it ignores completely is that more kids mean less spent on each for education, so what you get is a bunch of useless monkeys who are of no use. The labour they could have provided is more easily available through automation and all that this demographic surplus does is to suck more and more resources out of a strained ecosystem, thus degrading it further. No country has ever pulled itself out of poverty by producing more and more babies but politicians love the concept of a demographic dividend because a vast mass of poor, uneducated people can be ruled over very easily. The next is the elusive concept of a win-win situation. A long time ago in a country called Paradise far, far away there lived a farmer and a fisherman who would exchange rice and fish and live happily ever after on a daily diet of fish curry and rice, but everywhere else, and ever since then, let us understand that the global economy is a zero-sum game. If you win, someone else has to lose and that is why the economic topology of the world will always have gradients, with peaks and valleys, irrespective of how loudly we scream that the world is flat! In the past, Europe became rich at the expense of Africa and Asia, then it was the turn of the oil countries, currently it is China and perhaps someday it will be India. What is true at the level of nations is also true at the level of individuals.
Each of us, as an individual or a corporate, is interested in winning and whether the trade is commercial or even emotional, you would always be fighting to get more than what you have paid for, in the currency of your choice. Which leads us, most naturally, to the next fallacious concept that by default, man is inherently good and it is circumstances that lead him astray. Very similar is the assumption -- made these days in the context of Islamist jihadis -- that the silent majority is inherently good, nice and peaceful and it is a small fringe that displays barbarism. Not so. Man is inherently selfish and will turn violent if he sees his interests threatened and if, a very big if, he has the courage and ability to carry through on his violence. But very often, he does not have the courage to be violent and takes recourse to the philosophy of non-violence. Yes, Vivekananda has talked about “the potential divinity of man” but the operative word is potential and not divinity! Once the potential is achieved man can become good but until then he is just another selfish animal and the selfishness is coded right into his genes. Richard Dawkins has very clearly established that selfishness, of the gene at the expense of its host organism, is the driver behind the evolution of living organisms and so, the human race standing at the top of the pyramid, cannot but be the most selfish organism on the planet. The demographic dividend, the win-win scenario and the inherent goodness of man are just three of the many myths that we live with but there is a bigger lie that forever lies beneath the surface of our daily dialogues -- that the ascent of man continues and we are in the process of doing something great and good. Whether it is Malala or Kailash trying to drag children out of their misery or Akasaki, Amano and Nakamura “discovering” the blue LED that will keep the planet green, the world is obsessed with doing something that will improve the current lot of man.
But step back for a while and think: “Does it really matter?” Our planet has been around for about 4.5 billion years and the known universe for nearly 14 billion. Then of course there could be other universes. In all this how does it matter whether human beings -- who have been around for a mere 100,000 years -- live or die? and whether they can use blue LEDs to save energy? Only an arrogance of monstrous proportions can make us feel that we do matter and hence we should go around doing what we believe is right. We have just had Durga Puja, where, as in all other years, truly beautiful idols of the goddess were made and installed in fantastically decorated puja pandals. But then, on Bijoya Dashami day it is all thrown away to remind us that everything is temporary, transient and transitional. The devi and the pandal are not permanent, and the same is true of the world. You may spend hours and hours decorating and beautifying one corner of your Puja pandal but then on the last day, it would be swept up with everything else and thrown into the river. So is the case with the world in general. We may think that we are doing a great job in making the world a better place to live in but in the long run it simply does not matter. It is all transient and temporary and will be washed away by the passage of time. So what should one do? Give up and commit suicide? For the vast majority of us, it is Charbak and his philosophy of hedonistic materialism. Eat, drink and be happy. But is it really as “bad” as that ? Can I not do anything meaningful like helping my neighbour? or discovering blue LEDs ? Go ahead, indulge yourself and feel happy about it but the point is that in the long run it just does not matter. You may just as well make yourself happy by drinking beer instead. Beer and blue LEDs are just two sides of the same coin of human cognition as Ralph Waldo Emerson has said so elegantly in his poem, Brahma.
Far or forgot to me is near, Shadow and sunlight are the same, 
The vanished gods to me appear, And one to me are shame and fame.
That was Charbak for the vast majority. But for a tiny minority, there could be something else, something that lies beyond the pale of reason and of logic, that has created this illusion of a physical world. The world that is hinted at in movies like the Matrix, the simulation argument of Nick Bostrom and of course by the original Advaitin, Sankara, and his thesis of “Brahma Satya, Jagat Mithya” -- that the world is an illusion. Could it be that we are not real at all ? Should we doubt our own existence? But then again
They reckon ill who leave me out; When me they fly, I am the wings; 
I am the doubter and the doubt, And I the hymn the Brahmin sings.


P.S. As an aside, and as a shameless plug, the author would like to state that he has written a book -- The Road to pSingularity -- that explores this concept a little more.

P.P.S. This article originally appeared in India America Today

September 03, 2014

Liberty Schools : Putting Education on Steroids

The Green Revolution freed India from the spectre of famine by using technology aggressively and effectively to boost agricultural productivity. Operation Flood made India not only a milk sufficient nation but also the world’s largest producer of milk. Education, on the other hand, is still one of India’s greatest weaknesses because the Government has been a spectacular failure in delivering even primary education to vast swathes of our population. The depth of this tragedy is evident when the Prime Minister has to admit that even after 68 years of independence, most of our schools do not have toilets -- let alone labs, libraries and broadband internet !

What we need is the equivalent of a Green Revolution, an Operation Flood that will give a huge, non-linear, exponential boost to our delivery model so that educational services can reach each and every corner of the country. We need both quantity and quality and we need it in the next 5 years. Is this possible ?

Why not ?

During the Second World War, German submarines were sinking so many British and American ships in the Atlantic that the entire Allied war effort was in danger of coming to a grinding halt. Building a new ship would take almost a year while sinking it was a matter of minutes. In this grim scenario, US and British shipbuilders worked together to create an incredible strategy that allowed a ship to be launched in 24 days and delivered, on an average, in 42 days. At its peak, in 1943, the six US shipyards between them were delivering three ships every day !

This was made possible by using pre-fabricated components that were freely interchangeable across different shipyards and mathematical models, like linear programming, that allowed optimisation of men, materials and time to achieve these ultra-high rates of productivity.

Let us do the same with schools !

There are 542 MPs in India and each gets Rs 5 crores under the MPLAD scheme each year for five years. That is Rs 25 crores. Let us keep apart Rs 20 crores from each MP's MPLAD fund and use it to build two new schools in each constituency -- this means 1000+ new schools in 5 years ! As an incentive for the MP, the school will be named after him or her. Now can that transform India radically ?

Not quite. If funds could transform a country, then India would have become a paradise by now. The Union Budget presented in 2013 allocated Rs 20,000+ crores to primary education but no one really knows what happened to that money. What we need instead is a powerful management structure to turn this expenditure into a tangible physical asset. What are the components of such a structure ?

  1. A mechanism to identify vacant or government owned land in each constituency and acquire the same.
  2. A standard design of a school -- built with pre-fabricated components --  that will initially accommodate 100 students in year 1 and scale up to 1000 students in 5 years and 5000 students in 10 years.
  3. A partnership with at least three to five civil engineering / construction companies who will have pre-fabrication facilities in different parts of the country so that, after the initial gestation period, two “Liberty School”s can be inaugurated, somewhere in the country, by a Lok Sabha MP every three days !
  4. A central teacher management facility that will select 10,000 post-graduates across various disciplines every year, impart a basic teacher training curriculum and send them to teach in the new schools.
  5. A central content creation facility that will use the CBSE syllabus to create textbooks and online content in various local languages.
  6. A public sector company, let us call it Bharat Vidya Nigam, similar to IOCL, NTPC or SBI but reporting directly to the HRD ministry, that will drive this process through a team of professional managers hired at competitive salaries from the market.

Spending MPLAD money ( that is tax payer’s money) to set up schools is one thing, but running them efficiently is another matter. To guard against the unfortunate reality that most government institutions degenerate over time and to prevent too much of centralisation, these “Liberty Schools” must be placed in a BOOT framework -- build, own, operate and transfer. But unlike a normal BOOT where private players build infrastructure and then transfer the same to public bodies, the process in this case will be reversed.

Let us now look at the operational aspect of running these schools.

After ten years of operation, the “Liberty schools” will be auctioned off, in a fair, honest and transparent manner, to any entrepreneurial organisation that will run the school as a private limited company with the stipulation that between 26% and 49% of the equity in the company will be held by a trust whose trustees will be elected by the parents of the students who are currently studying in the school. This will ensure that the interests of the primary stakeholder, that is the student, are served adequately.

Education will not be free and tuition fees will be determined on the basis of actual operating expenses. However up to 90% of the fees of specific students -- chosen on the basis of economic and social parameters -- will be subsidised by a consortium of central, state, municipal and even non-governmental funding agencies. For example, if there are 1000 students in a school, the Central Government may support 200, the State Government another 200 and others another 50. So in this case, 45% of the students will have 90% of their fees reimbursed through a direct transfer of benefits to a bank account attached to their Aadhaar card, but depending on local circumstances, these ratios would be different in each school or state. The criteria for selecting students for subsidy and the distribution of the subsidy will be managed centrally with the concurrence of local stakeholders and would be done in a manner that is transparent to the public.

For far too long, the educational establishment in India has been comfortably chewing the cud of public money and generously generating tonnes of bovine excreta, which is no doubt very useful in certain circumstances. But now the time has come to think big, think different, take a little risk, pump in some high power steroids and make it produce some real milk that will nourish our youth with real knowledge. Let us kickstart a totally new era of public education in India.

July 30, 2014


An internet of minds

Aldous Huxley in his book, The Doors of Perception, a travelogue of his mescaline-fuelled trip through the mental landscape, compares the mind to a reducing valve. Every sentient individual is in touch with and is aware of all that is happening anywhere else but to preserve sanity in the face of an intolerable information overload, the mind chokes off most of the information and restricts itself to the safety and comfort of its local environment. The concept of a universal mind or consciousness, the brahman, of which an individual or atman is but an image is also the foundation of Sankara’s Advaita Vedanta. Both Huxley and Sankara in their own way suggest that all “minds” in the universe are somehow connected together and are in a position to share information. Doesn’t this ring a bell somewhere ?
image Peace of Mind by Dokon taken from

A telephone bell perhaps ? Since every telephone on this planet is connected, or certainly connectable, to every other telephone that exists. Then you have the internet where again every computer is a part of a gigantic computer network and by extension we are talking about the internet-of-things where domestic and industrial gadgets would be able to communicate among themselves and make a difference to our lives.

But can minds communicate with each other ? without the obvious intermediation of, say,  normal speech or the printed word ? Can we bypass the traditional sensory organs and establish a direct contact between two minds ? Strange as it may seem, this is no more in the domain of myth or even science fiction. It is something that is just around the corner in hospital and university labs.

It all began in 1998 with Kevin Warwick at the University of Reading, UK where he managed to open doors by simply thinking about it. It was no magic but pure science where signals from his nervous system were picked up and transmitted through an RFID-enabled implant on his hand to an electro-mechanical device that opens doors. This technology was subsequently picked up by medical engineers who have now enabled paraplegics to control the movement of wheelchairs by thought -- similar in concept to what was proposed by Craig Thomas in his book Firefox, the thought controlled Russian fighter plane. At the retail and commercial level, Emotiv has developed an easy to use, non-intrusive cap that allows players in a computer game to make moves by simply thinking about them and many products -- devices and applications -- that allow thought control are now available at the retail level.

Outbound signals, from the brain to the world, are interesting but what is even more useful are inbound signals of the kind that are processed in bionic eyes and allow blind people to “see” the world around them -- in a rather fuzzy manner, at the moment -- by delivering the feed from cameras directly into the part of the brain that normally processes information received from nerves in the eye. Recently, both outbound and inbound signals have been hooked together at Harvard where it has been demonstrated that a rat can be made to wag its tail by a human being who simply thinks about, or “wills”, the action. The last lap from man to man has also been demonstrated at the University of Washington.

The stage where we are today, when a man can make a rat move its tail by thinking about it, is roughly similar to the state of computer networking in the 1940s when George Stibitz sent a piece of data from a teletypewriter to a complex number calculator through a wire. That was how the internet, that we know today, started. Then of course there was TCP/IP, HTTP, the Mosaic browser, WiFi, bluetooth and the rest is history. Can this happen with human minds ?

Why not ?

Technically, there is nothing that stops us from picking up the electro-chemical signals in our brain, routing them out to a networking device that is attached to the body, then out through normal computer networking channels to another device attached to the body of another person and then delivering them into the appropriate point in the recipient's brain. All the physical components are in place but what is missing, at the moment, is the ability to process a blizzard of signals, extract the wheat from the chaff, the needle from the haystack and then convert it into a signal that will mimic a real world situation and fool the recipient's brain. This means sophisticated signal processing and pattern recognition in real-time -- something that is very much possible and conceivable in the current environment.

Assuming that this happens in the next ten or fifteen years how would the world change ? Privacy concerns would mean that even if we could, it is unlikely that anyone would be broadcasting his thoughts into the internet -- just as we do not make the contents of our hard disk public and accessible on the web. However thoughts and ideas from various designated sources could be pooled in a “thought server” -- similar to a web server -- and made available to anyone who has the “thought browser”. The Mosaic of the Mind ? Or it could be that two consenting individuals could allow themselves to be “paired” as in the Bluetooth devices that allow file transfer from one device to another.

The internet is no more about TCP/IP and the ability to send signals from one computer to another. The world-wide-web, an application that runs on the underlying infrastructure of the internet, has spawned a whole new global culture of eCommerce, collaboration, social media that has led to amazing uses and applications that the network engineers in the 1940’s could never even envision in their wildest imaginations. Unlike our ancestors who lived in caves, modern man has a great intimacy and dependence on the industrial world. We live in climate controlled homes, move around in motorized vehicles and eat food that is grown, or perhaps manufactured, in man-made and machine controlled environments. As a result our bodies no longer need the ability to run miles or survive extreme climates. Instead we have learnt to use our brains to do many things that were not possible in the past -- like write books, create music and do mathematics.

Going forward, as human brains get tied to digital devices and through them to other brains, new abilities and social constructs will emerge and define the contours of tomorrow’s world. Would that be the next step in the ascent of man ?
this article was first published in

June 15, 2014

Introduction to Hadoop

My effort to learn Hadoop and Map Reduce resulted in this presentation. If you like it and find it useful, do leave a comment.

June 11, 2014

Forecasting Retail Sales - Linear Regression with R and Hadoop

A retail store tracks the volume of sale for each stock-keeping-unit (SKU) that the store deals with. Given the sales for days 1 through 5, is it possible to predict the sales on days 6 through 10 ? Common sense dictates that sales will remain constant and the average sales per day for the first 5 days will be the same as the average sales per day for the next 5 days. However this may not always be the case, if there is a rising or falling trend. If there is a strongly rising trend, caused by some strong promotional activity, then the assumption of constant sale will lead to a stock-out and loss of potential business. Similarly, if there is a strongly falling trend, then a similar assumption will lead to accumulation of dead stock and hence a loss related to excessive inventory. Instead of days, the same analysis can be done on the basis of weeks, fortnights or even months. Net-net given the sales over 5 periods of time, it is useful to be able to predict the same for the next 5 periods. How can we do this ? Without resorting to the simplistic "average daily sale" strategy ?

A simple solution is to use linear regression, a well known and widely used statistical tool. If you have the sales data for a particular SKU for the past 5 days, you can "fit" a regression line, determine the slope and the intercept of this line and use the resulting linear regression "model" to predict the expected sales for the next 5 days. Based on these predictions,  you can place orders for these SKUs so that the gap between expected and the actual is minimum. A software tool like R can be used to solve this problem very easily.
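To make the idea concrete, here is a minimal sketch of the fit-and-forecast step for a single SKU. It is written in Python rather than R, and the helper name fit_line is only an illustration, not a library function. The five sales figures are the salt numbers that appear later in this post, with the days re-labelled 1 to 5.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


days = [1, 2, 3, 4, 5]
sales = [90, 80, 120, 50, 60]  # salt sales on five consecutive days

slope, intercept = fit_line(days, sales)

# Forecast for periods 6 to 10 from the fitted line.
forecast = [intercept + slope * day for day in range(6, 11)]

# The simplistic alternative: assume the average daily sale continues.
naive_total = 5 * sum(sales) / len(sales)

print(f"sales = {intercept:g} + ({slope:g}) * day")
print("regression forecast total:", sum(forecast))
print("average-based forecast total:", naive_total)
```

For this falling series the regression forecast for the next five periods totals 175, against 400 from the averaging rule, which is exactly the kind of gap that leads to dead stock.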

All this is well known. But when the number of SKUs is of the order of 50,000 - 70,000 then the time required  -- to build so many regression models, even with R, and then using each to estimate the sales quantity for different SKUs -- becomes enormous ! In fact, if one has to do this on a rolling-basis every day to predict the sales over the next 5 days, then it becomes impossible. Even before we have a solution to today's prediction, the next set of data is waiting and getting stale !

This is where Hadoop steps in. By splitting the regression problem for 50,000 - 70,000 SKUs across multiple computers, it is quite possible to solve the entire problem in a reasonable amount of time. This means that the person responsible for placing orders for the replenishment of inventory would know which of the SKUs would need to be ordered in a higher quantity and which to be ordered in a lower quantity. This is the Linear Regression problem that we will solve with R and Hadoop.

R is not necessary for regression. Any programming language like Java can be used but using R ( or for that matter, a similar tool like Python ) allows the ready-made function -- lm()  for linear regression -- to be used without re-inventing the wheel. In fact, R is a free and open-source statistical tool that is very widely used across the data analytics community. There are two ways to use R with Hadoop. First, we can use the streaming feature of Hadoop with R scripts or we can use the RHadoop set of packages from Revolution Analytics ( which include rhdfs, rmr2 and rhbase). The RHadoop path initially looks easier because it allows one to operate from within the familiar  R environment, but configuring RHadoop is difficult ( or at least, the author was unsuccessful despite a lot of effort). Moreover RHadoop is in reality using the same streaming feature of Hadoop to get the job done. So there is no loss if one ignores RHadoop and uses the native streaming feature of Hadoop directly.

So now we will see how to solve the Sales Forecasting Problem.

We have miniaturized the problem by assuming the Retail Store stocks and sells only 3 products, namely salt, soap and soda. On a particular day, arbitrarily designated as Day 08, the sales of these three SKUs were as follows and this was stored in a file called DailySales08.txt
8 soap 90
8 salt 90
8 soda 120
where the first column represents the day, the second the SKU name (or code) and the third column is the sales on Day 08. There are 4 other files, namely DailySales09.txt, DailySales10.txt, DailySales11.txt, DailySales12.txt. In reality, each of these files will have a very large number of records, with one record for each SKU.

Based on the data for 5 days, from day 08 to day 12, we need to estimate the data for 6th to 10th day, or for day 13 to day 17. Once we run the R program in Hadoop, the following output is generated
salt dates [ 8 - 12 ] : 400  next [ 6 - 10 ] 175 : 17  -- 575  
soap dates [ 8 - 12 ] : 600  next [ 6 - 10 ] 1025 : 239  -- 1625  
soda dates [ 8 - 12 ] : 620  next [ 6 - 10 ] 845 : 187  -- 1465 

where each row represents the picture for each SKU, where we can see in row 1
  • SKU is "salt"
  • cumulative actual sales on days 8 - 12 ( the first 5 days of the analysis ) is 400
  • cumulative expected sales from 6th to 10th day is 175
  • the estimated sale on the last, 10th day, that is day 17 is 17 ( just a coincidence !)
  • the total estimated sales over the 10 day period is 575 ( 400 actual, 175 estimated)
Why is the actual sales in the first 5 days 400 but the predicted sales in the next 5 days only 175 ? See what the regression data reveals :

The black dots represent the actual sales on the first 5 days [ day 8 - day 12 ]. Based on this the model has created the regression line : sales = 170 - 9*day and with this the estimated values for all the days can be calculated and shown as red dots on the graph. Because of the falling trend, the expected sale in the next 5 days is significantly lower than in the first 5 days. Or so says the regression data !

To run this program, a development environment was created on an Ubuntu 14.04 laptop running R 3.0.2 and Hadoop 2.2.0 installed in single-node cluster mode as described in my earlier post Demystifying Hadoop and MR with this DIY tutorial.

Section 4 of that tutorial showed how the Hadoop streaming utility was used to run a WordCount program in Python. The same strategy is used in this case, where we have replaced the python programs with two R scripts, LinReg-map.R and LinReg-red.R and a shell script was used to execute the map-reduce job. The source code of all three scripts along with the 5 datafiles are available at the Git Repository prithwis/Retail.

Once the Mapper (LinReg-map.R) runs, the output looks like this, though in reality, this output will not be stored but instead "streamed" to the Reducer

salt 10$120 
salt 11$50 
salt 12$60 
salt 8$90 
salt 9$80 
soap 10$100 
soap 11$140 
soap 12$160 
soap 8$90 
soap 9$110 
soda 10$150 
soda 11$130 
soda 12$140 
soda 8$120 
soda 9$80 

Here the Key is the SKU name, and the Value is a string formed by the concatenation of the date and the quantity sold, separated by the $ char.
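The mapping step can be sketched as follows. This is a Python stand-in for the idea, not the author's LinReg-map.R script, and the function names are made up for illustration. It assumes the key and value are separated by a tab, which is what Hadoop streaming treats as the key boundary by default.

```python
def map_line(line):
    """Convert 'day sku quantity' into 'sku<TAB>day$quantity'.

    Returns None for lines that do not have exactly three fields.
    """
    fields = line.split()
    if len(fields) != 3:
        return None
    day, sku, quantity = fields
    return f"{sku}\t{day}${quantity}"


def run_mapper(lines):
    """Yield one key-value record per valid input line."""
    for line in lines:
        record = map_line(line)
        if record is not None:
            yield record


# In a real streaming job the lines would come from sys.stdin and the
# records would be printed to stdout; a small list stands in for that here.
sample = ["8 soap 90", "8 salt 90", "8 soda 120", ""]
for record in run_mapper(sample):
    print(record)
```

The mapper does no arithmetic at all. Its only job is to re-key each record by SKU so that the sort step can bring all the days for one SKU together.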

In this case there were only 15 records ( 3 SKUs x 5 days ) but even if the number of SKUs is very high, the task of creating this sorted list of <key, value> can be distributed across multiple servers in the Hadoop cluster. This sorted list of records can now be distributed again to multiple servers for the second, reducer, program LinReg-red.R to execute. Hadoop ensures that all records pertaining to any one key ( or SKU) are sent to only one machine where the Linear Regression function is executed.
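The guarantee that one key always lands on one machine comes from partitioning: Hadoop's default partitioner hashes the key and takes the remainder modulo the number of reducers. The sketch below imitates that idea in Python. CRC32 is used here only because it gives repeatable values; it is an assumption for illustration and not the hash that Hadoop itself uses.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index for a key: same key, same reducer, every time."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


skus = ["salt", "soap", "soda"]
for sku in skus:
    print(sku, "-> reducer", partition(sku, 2))
```

Because the reducer index depends only on the key, the five daily records for any SKU end up together no matter which mapper produced them.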

The reducer program reads through all the <key, value> pairs for each key (or SKU), splits the value at the $ character to isolate the day and the sale value for that day, and creates two lists: one of days and the other of the corresponding sale values. These two lists are passed, along with the key (SKU), to the user-defined EstValue() function. The fourth parameter N, in our case 9, represents the number of days between the first day for which data is available and the last day of the period. In this case the first day is 8 and N is 9, so the last day is day 17, which is the 10th day of the period.
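For readers more comfortable with Python than R, the parsing step described above might look roughly like this. It is a sketch, not the code of LinReg-red.R: the function names are hypothetical, and it assumes that the key and the value are separated by whitespace, as they appear in the listing.

```python
from itertools import groupby

def parse_line(line):
    """Split a streamed record such as 'salt 10$120' into ('salt', 10, 120.0)."""
    key, value = line.split()      # assumption: whitespace between key and value
    day, qty = value.split("$")    # the $ character separates day and quantity
    return key, int(day), float(qty)

def group_by_sku(lines):
    """Yield (sku, days, sales) for each run of consecutive lines sharing a key."""
    parsed = (parse_line(line) for line in lines if line.strip())
    for sku, rows in groupby(parsed, key=lambda r: r[0]):
        rows = list(rows)
        yield sku, [r[1] for r in rows], [r[2] for r in rows]

# A few sorted records, as they would arrive on the reducer's stdin.
sorted_lines = ["salt 10$120", "salt 11$50", "salt 12$60", "salt 8$90", "salt 9$80",
                "soap 10$100", "soap 11$140"]
groups = list(group_by_sku(sorted_lines))

for sku, days, sales in groups:
    print(sku, days, sales)
```

Note that groupby only works here because the records arrive sorted by key, which is exactly what the sort step between mapper and reducer guarantees.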

The EstValue() function is where the Linear Regression module lm() is finally called with the two lists, days and sales, as input. For a quick recap of how Linear Regression is done in R, read this tutorial. A little bit of data manipulation is done, in which the days (8,9,10,11,12) are replaced by the more generic (1,2,3,4,5), so the estimates are done for days (6,7,8,9,10) instead of (13,14,15,16,17). This shift has no effect on the result: it changes the intercept of the fitted line but not the slope, so the estimated values are the same.
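To see why the shift is harmless, here is a small Python sketch that fits a straight line by ordinary least squares, once on the raw days and once on the shifted days. It is not the EstValue() code from the repository: fit_line and estimate are names made up for this illustration, plain arithmetic stands in for R's lm(), and the days are assumed to be consecutive. The data is the salt SKU from the mapper output above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def estimate(days, sales, n):
    """Fit on shifted days (1, 2, 3, ...) and estimate up to day first + n."""
    first, last = min(days), max(days)
    shifted = [d - first + 1 for d in days]       # 8..12 becomes 1..5
    intercept, slope = fit_line(shifted, sales)
    future = range(last + 1, first + n + 1)       # 13..17 when first = 8, n = 9
    return {d: intercept + slope * (d - first + 1) for d in future}

# salt records in the order the reducer receives them
days = [10, 11, 12, 8, 9]
sales = [120, 50, 60, 90, 80]

raw_intercept, raw_slope = fit_line(days, sales)  # fit on the original day numbers
forecast = estimate(days, sales, n=9)             # fit on the shifted day numbers

print(raw_intercept, raw_slope)   # 170.0 -9.0
print(forecast)                   # day 13 -> 53.0, ... day 17 -> 17.0
```

The raw fit gives sales = 170 - 9*days, while the shifted fit gives a different intercept (107) with the same slope, and both produce the same estimates for days 13 to 17.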

There are 3 ways of testing / running this set of programs of which the first two can be done on a laptop

  • To test the R scripts without calling Hadoop, one can simply pipe the commands as follows : cat DailySales*.txt | ./LinReg-map.R | sort | ./LinReg-red.R > output.txt . This simulates the entire streaming process by sending the data from the 5 data files into the "stdin" of the mapper script, which streams the data to the Unix sort utility, which streams the sorted key-value pairs to the reducer script, which finally sends its "stdout" output into a file called output.txt. This is the output that you can see in the post above.
  • To run the same scripts on the Hadoop Single Machine Cluster installed on a laptop, we use the following shell script

#hdfs dfs -ls 
#hdfs dfs -mkdir /user/hduser/Retail-in
#hdfs dfs -copyFromLocal /home/hduser/RetailSales/DailySales*.txt /user/hduser/Retail-in
hdfs dfs -ls /user/hduser/Retail-in
hdfs dfs -rm -r /user/hduser/Retail-out
hadoop jar /usr/local/hadoop220/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar -D mapreduce.job.name='RetailR' -mapper /home/hduser/RetailSales/LinReg-map.R -reducer /home/hduser/RetailSales/LinReg-red.R -input /user/hduser/Retail-in/* -output /user/hduser/Retail-out
hdfs dfs -ls /user/hduser/Retail-out

This script creates the Retail-in directory in HDFS and loads the DailySales files from the local directory into the HDFS filesystem (these lines are commented out here because they need to run only once). It then deletes the output directory, if it exists, and calls the Hadoop streaming program with the 4 mandatory parameters : mapper script, reducer script, input directory and output directory (all with fully qualified names, to avoid any ambiguity). The only additional parameter is the job name (RetailR), which helps track the job on http://localhost:8088 and http://localhost:50070

In both these cases, the output is the same. [Update] - To see how to do this directly in RHadoop, see this post.

Now that we know that the program works fine, how do we scale up? When we have thousands of SKUs and we want to use data from, say, 15 or 20 days to build the regression model, the number of records will go up dramatically. One can of course procure multiple servers and configure all of them with Ubuntu, R and Hadoop, but this is a very big, complicated and error-prone task. The simple solution is the third way of running the programs:
  • Use the Amazon Web Services Elastic Map Reduce (AWS/EMR) service, where the Mapper and Reducer programs can be run without any change on the same (or, if necessary, on much, much larger) data to get results identical to those obtained with the first two methods.
To try out AWS/EMR, you need to visit the AWS website with a credit card and sign up for a loginid. Then follow the steps given in this tutorial by Raffael Vogler to run the LinReg map and reduce scripts. Follow the steps but, instead of Vogler's programs, use the ones described in this post. You should also ignore the Bootstrapping step, as well as the two -jobconf lines (the second being -jobconf map.output.key.field.separator=\t) that were meant to be placed in the Arguments box, since these are not required for the Linear Regression programs. Running the programs with the test data given in this post will take around 10 mins: 7 to provision and configure the machine and 3 to run the job. This will result in a charge of around US$ 2 or US$ 3 that will be billed to the credit card used to create the loginid. Should you use AWS/EMR, do remember to terminate the cluster at the end of the exercise, as otherwise the billing will continue.

AWS/EMR really removes the hassles of configuring Hadoop and makes running Map Reduce jobs as easy as, well almost, sending a Gmail message! Everything is GUI oriented. You choose the number and type of machines and input the location of the data files and the map and reduce scripts. So after building and testing your R scripts on a laptop, you can scale up to hundreds of servers in a few minutes, and that too for only a few minutes! Who could ask for anything more?

In this post, we have defined a simple sales prediction problem that could be faced in any retail store and we have shown how it can be solved with Hadoop and R. The approach taken has been adapted from a YouTube video created and uploaded by Fady El-Rukby. Even though he solves a completely different problem and uses native Java, not streaming R, we have used the same data and compared results to make sure that the Linear Regression function of R is working correctly. To learn more about R and Data Science in general, please read this post on Data Science - A DIY approach, and to get a business perspective join the Business Analytics Program at Praxis Business School, Calcutta.

June 03, 2014

HIVE and PIG to simplify Hadoop

[Note] -- Hadoop, IMHO, is history. Rather than waste time with all this, suggest you check up my blog post on Spark with Python.

When I was doing engineering at IIT, Kharagpur, the computers that we had were not even as powerful as a low-cost non-smart phone today and, other than the basic concept of programming, nothing that we learnt is of any relevance today. So when we start teaching a course on Business Analytics, which lies at the bleeding edge of current technology and business practices, there is simply no option but to take the Do-It-Yourself approach of first learning a subject and then teaching it to students. Fortunately, there are many kind and knowledgeable souls on this planet who have taken the pains to explain new and difficult concepts to ancients like us and, thanks to Google, it is not too difficult to locate them.

Using this route, I first learnt what Data Science is and then created this compilation of tutorials and training materials that anyone can use to learn about this new subject in greater depth. The next big challenge was to Demystify Hadoop and Map Reduce, as these two key concepts play a very significant role in this area of interest. Writing Map Reduce programs in Java, as is the standard practice, is a non-trivial task and many people have sought to simplify matters by adopting other approaches. One is to use the Hadoop streaming API with a program written in any language that can run as an executable script, like Python or R. HIVE and PIG are two other products that have evolved to ease and facilitate the use of MR techniques with Hadoop systems.

HIVE simulates an SQL based query engine sitting on top of the data stored in the HDFS file system on Hadoop. Anyone familiar with SQL will immediately feel at home with the DDL, DML (load, insert) and Select commands.

PIG (and its humorously named command prompt, GRUNT > ) is a scripting language that allows one to run queries on data stored on HDFS without writing complex MR programs in Java.

In this post we will

  1. Install HIVE and use SQL commands to load and retrieve data from an HDFS file system.
  2. Install PIG and use it to retrieve the same data 
  3. Do the same task with the usual Java program ( already shown in an earlier blog post.)
We assume that you have followed the instructions in the earlier blog post and have a single machine cluster of Hadoop installed on an Ubuntu (preferably 14.04) machine.

Varad Meru of Orzota has created a set of four excellent tutorials that we will use to get a grip on PIG and HIVE.

The first one talks about installing Hadoop 1.0.3, but we will ignore that because we have already learnt to install Hadoop 2.2.0.

The data that is used in the three other tutorials is called the Book Crossing Dataset that you can download as a zip file and then extract ONLY the file called BX-Books.csv for the purpose of the next three tutorials.

From this file we will answer the question of how many books are published in each calendar year. Not really rocket-science, but enough to demonstrate how HIVE and PIG work.

The second tutorial, Hive for Beginners, gives clear, step by step instructions to carry out the task. Almost every instruction works perfectly. The following listing shows the shell script used for all three tutorials (HIVE, PIG, Java).


# --- hive and common data cleaning and loading

#hdfs dfs -mkdir /user/hive
#hdfs dfs -mkdir /user/hive/warehouse
#hdfs dfs -chmod g+w /tmp
#hdfs dfs -chmod g+w /user/hive/warehouse
#hdfs dfs -mkdir /user/hduser/BXData-in
#sed 's/&amp;/\&/g' BX-Books.csv | sed -e '1d' |sed 's/;/$$$/g' | sed 's/"$$$"/";"/g' > BX-BooksCorrected.txt
#hdfs dfs -copyFromLocal /home/hduser/BXData/BX-BooksCorrected.txt /user/hduser/BXData-in

#hive -f goBX2.sql > goBX2.output

# ---- pig
#pig goBX3.pig

# --- java

#rm -rf LocalClasses
#mkdir LocalClasses
# ....
#javac -classpath $HADOOP_INSTALL/share/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/common/lib/commons-cli-1.2.jar -d LocalClasses
# .....
#javac -classpath $HADOOP_INSTALL/share/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/common/lib/commons-cli-1.2.jar -d LocalClasses
# .....
#javac -classpath $HADOOP_INSTALL/share/hadoop/common/hadoop-common-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar:$HADOOP_INSTALL/share/hadoop/common/lib/commons-cli-1.2.jar:LocalClasses -d LocalClasses && jar -cvf BookXDriver.jar -C LocalClasses/ .

#hadoop jar BookXDriver.jar BookXDriver /user/hduser/BXData-in /user/hduser/BXData-MR-out
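The sed pipeline in the cleaning step of the listing above is a little cryptic, so here is a Python sketch of what it is meant to do: turn &amp; back into &, drop the header line, and make sure that ";" survives only as the separator between quoted fields, with any ";" inside a field becoming "$$$". The function name and the two sample lines are made up for illustration; they are not taken from BX-Books.csv.

```python
def clean_lines(lines):
    """Mirror the intent of the sed pipeline, one line at a time."""
    cleaned = []
    for i, line in enumerate(lines):
        if i == 0:
            continue                             # sed -e '1d': drop the header row
        line = line.replace("&amp;", "&")        # undo the HTML escaping of '&'
        line = line.replace(";", "$$$")          # hide every ';' ...
        line = line.replace('"$$$"', '";"')      # ... then restore the field separators
        cleaned.append(line)
    return cleaned

# Made-up sample: a header plus one record with '&amp;' and a ';' inside the title.
raw = ['"ISBN";"Title";"Author"',
       '"111";"Salt &amp; Pepper; A History";"A. Writer"']
cleaned = clean_lines(raw)
print(cleaned[0])
```

After this step every remaining ";" is a genuine field separator, which is what allows HIVE and PIG to split each record on ";" safely.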


Instead of typing long HIVE commands by hand, we have created a file called goBX2.sql to store the various HIVE commands, and by selectively un-commenting lines we execute the different commands.


--use default;
--show databases;
--show tables;
--LOAD DATA INPATH '/user/hduser/BXData-in/BX-BooksCorrected.txt' OVERWRITE INTO TABLE BXDataSet;
select yearofpublication, count(booktitle) from bxdataset group by yearofpublication;
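The select statement above is an ordinary group-by count. For comparison, here is roughly the same calculation in plain Python on a tiny in-memory sample. The rows are invented for illustration, the function name is hypothetical, and the year is assumed to be the fourth field, following the column order ISBN, BookTitle, BookAuthor, YearOfPublication. The sketch counts rows per year, so it matches count(booktitle) only when no title is missing.

```python
import csv
import io
from collections import Counter

def count_by_year(text, year_column=3):
    """Count records per publication year in ';'-separated, quoted text."""
    reader = csv.reader(io.StringIO(text), delimiter=";", quotechar='"')
    return Counter(row[year_column] for row in reader if len(row) > year_column)

# Made-up rows in the cleaned format: "ISBN";"Title";"Author";"Year";"Publisher"
sample = (
    '"111";"Book A";"Author A";"1999";"Some Press"\n'
    '"222";"Book B";"Author B";"2001";"Some Press"\n'
    '"333";"Book C";"Author C";"1999";"Other Press"\n'
)
counts = count_by_year(sample)

for year, n in sorted(counts.items()):
    print(year, n)
```

On a laptop-sized file this is all one needs; the point of HIVE is that the same one-line query keeps working when the data no longer fits on one machine.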


The only deviation from the instructions was one error in the CREATE TABLE command. Since ";" terminates a statement in HIVE files, the first CREATE TABLE statement failed because it contained a ";" symbol. This problem was solved by changing it to "\;" before the execution could proceed.
Also note that the output is stored in the file goBX2.output.

After using HIVE, the same task is performed using PIG by following instructions given in the tutorial PIG for Beginners.

There were three deviations from the instructions
  1. PIG was throwing a fearful error ERROR org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl - Error whiletrying to run Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected. that was causing a major abort. This was tracked down to this StackOverflow thread, and the following command, issued from the $PIG_HOME directory, solved the problem : ant clean jar-all -Dhadoopversion=23 . However, please note that the command takes nearly 25 minutes to execute, as it virtually rebuilds many Hadoop, PIG and related jars.
  2. PIG_CLASSPATH has to be set to the Hadoop conf directory, which in the case of Hadoop 2.2.0 is $HADOOP_INSTALL/etc/hadoop
  3. Also do note that after HIVE has loaded data into a table, it removes the data from its original location in the HDFS filesystem. So before PIG can start, the data has to be reloaded from the local file system to HDFS once again! Simply uncomment the relevant line in the shell script and run it once again.
The PIG commands were stored in a file goBX3.pig and executed from the shell script

BookXRecords = LOAD '/user/hduser/BXData-in/BX-BooksCorrected.txt' USING PigStorage(';') AS (ISBN:chararray, BookTitle:chararray, BookAuthor:chararray, YearOfPublication:chararray, Publisher:chararray, ImageURLS:chararray, ImageURLM:chararray, ImageURLL:chararray);
GroupByYear = GROUP BookXRecords BY YearOfPublication;
CountByYear = FOREACH GroupByYear GENERATE CONCAT((chararray)$0,CONCAT(':',(chararray)COUNT($1)));
STORE CountByYear INTO '/user/hduser/BXData-out-pig/BXDataQueryResult' USING PigStorage('\t');


In this case, the output is stored in the HDFS file system, which can be accessed through the browser at localhost:50075, and can be downloaded from there.

Finally, after using HIVE and then PIG to generate the data, one can use the standard Java route as explained in this fourth and final tutorial. There is really no need to configure Eclipse with the Hadoop Plug-in (the version for Hadoop 2.2.0 is not yet ready or stable, as of now). You can simply download the three Java files, BookXDriver, BookXMapper and BookXReducer, and then use the javac command from the Ubuntu prompt as given in the shell script above. Once again the output will be stored in the HDFS directory /user/hduser/BXData-MR-out (as shown in the diagram above) and can be downloaded for comparison with the two other results.

Ok, here is the final screenshot of the applications console available in the browser at localhost:8088 that shows all the three jobs to have executed successfully.

If you find any errors in this post, please leave a comment. If you find it useful, do share it with your friends ... and also check out the Business Analytics program at Praxis Business School.