September 26, 2015

Magic of SQL in scoring Data Mining Models

As a former DBA, I love the fact that SQL has a life of its own, and that it is still in use decades after it was first conceived in the 1970s as the natural way to query relational database management systems. A vast amount of the world's data is still held in SQL-compliant RDBMS tables, and today, when business analytics and data science seem to be overshadowing SQL, I was delighted to find that SQL can still play a very important role in the implementation of complex data mining applications. This post explains how this evergreen tool remains very, very relevant in data mining.

Many data mining tools, like Rattle and RapidMiner, are used to create "models" for classification / decision trees and hierarchical clustering, but the models then have to be put into production by using them to score large datasets. This is where SQL can play a very powerful role. The models created in the data mining tool are exported as PMML documents and then converted to SQL using any of the PMML-to-SQL tools available on the web. One such tool, PMML2SQL, is used in this example. The generated SQL can be run against data stored in an RDBMS table, and this new data will be classified or clustered as per the model. This is known as scoring and is widely used when a data mining model is put into "production".
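To make the idea concrete, here is a minimal sketch (in Python, against an in-memory SQLite database) of what a decision tree looks like once it is turned into SQL: a nested CASE expression. The split values below are the classic petal-based splits for the iris data, used purely for illustration; the actual pmml2sql output differs in detail.

```python
import sqlite3

# A decision tree converted to SQL is just a nested CASE expression.
# The splits (2.45, 1.75) are the well-known CART splits for iris,
# shown here as an illustration of the generated-SQL pattern.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE irisdata (
    SepalLength REAL, SepalWidth REAL,
    PetalLength REAL, PetalWidth REAL)""")
conn.executemany("INSERT INTO irisdata VALUES (?,?,?,?)",
                 [(5.1, 3.5, 1.4, 0.2),    # typical setosa
                  (6.4, 3.2, 4.5, 1.5),    # typical versicolor
                  (6.3, 3.3, 6.0, 2.5)])   # typical virginica
rows = conn.execute("""
    SELECT PetalLength, PetalWidth,
           CASE
             WHEN PetalLength < 2.45 THEN 'setosa'
             WHEN PetalWidth  < 1.75 THEN 'versicolor'
             ELSE 'virginica'
           END AS predicted
    FROM irisdata
""").fetchall()
for r in rows:
    print(r)
```

Because the whole model is one SELECT, scoring a million rows is just an ordinary query that the database engine optimises like any other.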

In this example we begin with the well-known iris dataset (irisdata.csv) and import it into Rattle. Before doing so it is a good idea to remove the '.' from the column names (Sepal.Length, Petal.Width and so on), as the dot tends to interfere with the generated SQL later on.
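Cleaning the header is a one-liner; a sketch, using the standard iris column names:

```python
# Strip the '.' from the iris column names (Sepal.Length -> SepalLength)
# before importing, since the dot clashes with SQL's table.column notation.
header = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"]
clean = [name.replace(".", "") for name in header]
print(clean)  # ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
```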

Then we create a Decision Tree model with this data

This model is then exported as a PMML file, irisdata_rpart.xml. The XML file is then uploaded to pmml2sql, which generates the corresponding SQL code

that we store in a txt file irisdata-rpart-sql.txt.

As a simple way to get started we use the SQLite3 database, available through the SQLite Manager add-on for the Mozilla Firefox browser, and use its CSV import facility to create a table irisdata
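For those who prefer scripting the import, the same table can be created and populated with Python's built-in sqlite3 module; a tiny inline sample stands in for the full irisdata.csv here:

```python
import csv, io, sqlite3

# Load CSV rows into an SQLite table with the standard library, as an
# alternative to the browser add-on. The two rows below are a stand-in
# for the full iris data.
sample = """SepalLength,SepalWidth,PetalLength,PetalWidth,Species
5.1,3.5,1.4,0.2,setosa
6.3,3.3,6.0,2.5,virginica
"""
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE irisdata (
    SepalLength REAL, SepalWidth REAL,
    PetalLength REAL, PetalWidth REAL, Species TEXT)""")
reader = csv.reader(io.StringIO(sample))
next(reader)                      # skip the header row
conn.executemany("INSERT INTO irisdata VALUES (?,?,?,?,?)", reader)
count = conn.execute("SELECT COUNT(*) FROM irisdata").fetchone()[0]
print(count)  # 2
```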

and then define the datatype in the different columns of the table irisdata

Once the table is ready, we copy the SQL command portion from irisdata-rpart-sql.txt file and execute the same

and see how the model has classified the data. In this case the training and production datasets are the same, so (a) the accuracy is very high and (b) we can see that the classification has happened correctly. The three kinds of iris flower have been placed in categories 1, 2 and 3. This "scored" data can now be exported to irisdata-scored-rpart.csv for further use.

We now move to the hierarchical clustering model and again use Rattle to build it

that we export as irisdata-hclust.xml which is then uploaded into pmml2sql and the corresponding SQL code is generated.

This is stored in irisdata-hclust-sql.txt for use against the actual data. However, we note that SQLite does not -- unfortunately -- support mathematical functions like SQRT and POWER, so we have to move to a fully SQL-compliant RDBMS like MySQL.
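As an aside, when SQLite is driven from a host language such as Python, the missing functions can be supplied as user-defined functions, which would let the generated clustering SQL run without switching databases; a sketch:

```python
import math, sqlite3

# SQLite lacks SQRT and POWER, but Python's sqlite3 module can register
# them as user-defined SQL functions, so PMML-generated clustering SQL
# can run unchanged.
conn = sqlite3.connect(":memory:")
conn.create_function("SQRT", 1, math.sqrt)
conn.create_function("POWER", 2, lambda x, y: x ** y)
# A 3-4-5 triangle as a quick check of the registered functions:
hyp = conn.execute("SELECT SQRT(POWER(3, 2) + POWER(4, 2))").fetchone()[0]
print(hyp)  # 5.0
```

Moving to MySQL, as we do below, is equally valid and keeps everything inside the database.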

The iris data already stored in SQLite is exported as an SQL file, irisdata-createtable1.sql, that contains both the table creation script and the data insertion script. A few cosmetic changes have to be made to make this script compatible with MySQL: typically, the quotes around the table name have to be removed and the datatypes of the columns have to be specified more precisely. The changes are reflected in irisdata-createtable2.sql. This script is then executed in a GUI client of MySQL (in this case SQLyog) running on Windows 7.

and a new table irisdata is created

Now we pick up the SQL code available in irisdata-hclust-sql.txt and execute the same against the data available in the irisdata table.

Once again we see that the cluster scores 1, 2, 3 reflect the original three kinds of iris flower in the iris dataset. This data can now be exported to a CSV file, irisdata-scored-hclust.csv, for further analysis. All the files referred to in this post are available for download in this zip file.
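To see what the generated clustering SQL is doing under the hood, here is a hand-written sketch of nearest-centroid scoring (run in SQLite via Python, with SQRT and POWER registered as user-defined functions). The cluster centres are approximate petal measurements for the three iris species, used for illustration only; the actual hclust output has different values and uses all four variables.

```python
import math, sqlite3

# PMML-generated clustering SQL computes the Euclidean distance from each
# row to every cluster centre and assigns the nearest. Centres here are
# approximate per-species petal means, for illustration only.
conn = sqlite3.connect(":memory:")
conn.create_function("SQRT", 1, math.sqrt)
conn.create_function("POWER", 2, lambda x, y: x ** y)
conn.execute("CREATE TABLE irisdata (PetalLength REAL, PetalWidth REAL)")
conn.execute("INSERT INTO irisdata VALUES (1.4, 0.2)")  # a setosa-like row
cluster = conn.execute("""
    SELECT CASE
      WHEN SQRT(POWER(PetalLength-1.46,2)+POWER(PetalWidth-0.25,2)) <=
           SQRT(POWER(PetalLength-4.26,2)+POWER(PetalWidth-1.33,2))
       AND SQRT(POWER(PetalLength-1.46,2)+POWER(PetalWidth-0.25,2)) <=
           SQRT(POWER(PetalLength-5.55,2)+POWER(PetalWidth-2.03,2))
      THEN 1
      WHEN SQRT(POWER(PetalLength-4.26,2)+POWER(PetalWidth-1.33,2)) <=
           SQRT(POWER(PetalLength-5.55,2)+POWER(PetalWidth-2.03,2))
      THEN 2
      ELSE 3
    END AS cluster
    FROM irisdata
""").fetchone()[0]
print(cluster)  # 1
```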

The beauty and value of this technique lie in the fact that your data mining tool need not be able to handle very large datasets. Any model built with a statistically reasonable amount of data in these tools can be "moved into production" and used against very large datasets stored in rugged, reliable RDBMS systems. The only requirement is that the column names in the RDBMS must be the same as the variable names used in building the data model. If this is not the case, the generated SQL must be changed to reflect the different column names in the RDBMS table.
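If renaming is needed, a view is often a lighter-weight fix than editing the generated SQL: alias the existing columns to the model's variable names and point the scoring query at the view. A sketch, with made-up source column names:

```python
import sqlite3

# When the table's column names differ from the model's variable names,
# a view can alias them so the generated SQL runs untouched. The source
# names (pet_len, pet_wid) are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (pet_len REAL, pet_wid REAL)")
conn.execute("INSERT INTO measurements VALUES (1.4, 0.2)")
conn.execute("""CREATE VIEW irisdata AS
    SELECT pet_len AS PetalLength, pet_wid AS PetalWidth
    FROM measurements""")
val = conn.execute("SELECT PetalLength FROM irisdata").fetchone()[0]
print(val)  # 1.4
```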

In fact, if the quantum of data is even higher than what a standard RDBMS can handle, the data can be loaded into Hadoop / Hive, and then HQL (a very close cousin of SQL) can be used, along with the ultra-high scalability of a Hadoop cluster, to achieve the same result.