July 27, 2017

OLAP Data Cube with SQL

As an erstwhile DBA, a long time user and a great admirer of the SQL language -- that has stood the test of time for the last 30 years -- I have always sought to use SQL in many useful ways. In an earlier post, I had shown how SQL can be used to solve a classic data science problem, namely Clustering, using the K-Means algorithm and today, I demonstrate how SQL can be used to process OLAP data cubes and generate the popular cross-tabs table.

Data cubes, or OLAP cubes, are a way to store historic data using the dimensional model, as opposed to the relational or 3rd normal form model. These data cubes can be "sliced" and "diced" to reveal data relevant to particular dimensions. Because of the immense popularity and ubiquity of relational databases, like Oracle and MySQL, data in the dimensional model is routinely stored in relational tables and retrieved -- by slicing and dicing the cube -- using standard SQL constructs like the WHERE clause. This is called Relational OLAP or ROLAP.

Data cubes are very popular because they allow multidimensional data to be collapsed to any two dimensions and shown as a "CrossTab" -- and human beings can comfortably visualise only two dimensions on a page or a screen. Unfortunately, creating CrossTabs is not very easy with normal SQL and that is why there exist a genre of specialist products -- Multidimensional OLAP or MOLAP -- that allow users to create CrossTabs by "rotating" the data cube as necessary.

Microsoft SQL-Server, a RDBMS product, has a proprietary construct called CUBE that allows this feature but this is not available in most RDBMS products and certainly not in MySQL, the free and open-source product that is the most widely used RDBMS on the planet.

The following slide deck shows how MySQL can be used to "rotate" an OLAP data cube and generate CrossTabs for any cube of dimension 3 or higher


(please view the slide deck in full screen mode)
We also show how a "pivot" table, so beloved of Excel users can also be generated using MySQL and hence by extension in any RDBMS.

But why would anyone wish to use SQL or MySQL to build and work with data cubes when MOLAP tools are available?
  1. First, SQL is easily understood and widely used by a vast majority of IT professionals
  2. Second, MySQL is a free and open-source product that is used in almost every web application
  3. Third, SQL is supported in a multi-machine, clustered environments like Hadoop/Hive and Spark and so this technique can be used -- at least in principle -- to support data cubes built with ultra large data sets.
Unless one wants the bells and whistles that come along with most MOLAP products, MySQL is good enough for almost any OLAP activity and can be scaled up with Hive / Spark for very large data.

Acknowledgement : The technique demonstrated in this post has been adopted from information provided at http://www.artfulsoftware.com/infotree/qrytip.php?id=78