November 19, 2016

K-means Clustering with SQL

Ever since I started my career in software and IT as a database administrator, I have had a great fascination, if not love, for the SQL language. I also know that there are many in the software fraternity (and sorority) who share my comfort with using SQL. Even after moving into the current enthusiasm, or should I call it a fad, with data analytics, I see that there exists many who prefer the ease and convenience of SQL even in big data.

For example, Hadoop had to hide the complexity of its Map-Reduce magic behind HIVE. Spark offers SparkSQL. In an earlier post, I had shown how SQL can be used for scoring Data Mining models developed with R through the PMML route.

In this post, we show how K-means clustering, a very common and widely used data mining activity can be done with SQL, using a technique adopted from Joni Salonen's blog post. The following codes were executed in MySQL.

First we create the tables :

drop table if exists km_data;      -- contains initial data, only 2 dimensions
drop table if exists km_clusters;  -- contains the final clusters centres
drop table if exists km_steps;     -- shows intermediate steps
create table km_data (id int primary key, cluster_id int, x1 float, x2 float);
create table km_clusters (id int auto_increment primary key, x1 float, x2 float);
create table km_steps (id int, x1 float, x2 float);


Then we create a stored procedure to implement the K-means algorithm

drop procedure if exists kmeans;
DELIMITER //
CREATE PROCEDURE kmeans(v_K int)
BEGIN
TRUNCATE km_clusters;
truncate km_steps;

-- initialize cluster centers
INSERT INTO km_clusters (x1, x2) SELECT x1, x2 FROM km_data LIMIT v_K;
REPEAT
    insert into km_steps select * from km_clusters;
    -- assign clusters to data points
    UPDATE km_data d SET cluster_id = (SELECT id FROM km_clusters c 
        ORDER BY POW(d.x1-c.x1,2)+POW(d.x2-c.x2,2) ASC LIMIT 1);
    -- calculate new cluster center
    UPDATE km_clusters C, (SELECT cluster_id, 
        AVG(x1) AS x1, AVG(x2) AS x2 
        FROM km_data GROUP BY cluster_id) D 
  SET C.x1=D.x1, C.x2=D.x2 WHERE C.id=D.cluster_id;
UNTIL ROW_COUNT() = 0 END REPEAT;
END//

This procedure updates tables without a WHERE condition. This is not allowed in the default "Safe Update" mode of MySQL. Go to Preferences -> SQL Queries and uncheck the "Safe Updates" option. Otherwise this procedure will give an error.

Next we load some sample data :

truncate km_data;
insert into km_data (id, cluster_id,x1,x2) values (1,0,185,72);
insert into km_data (id, cluster_id,x1,x2) values (2,0,170,56);
insert into km_data (id, cluster_id,x1,x2) values (3,0,168,60);
insert into km_data (id, cluster_id,x1,x2) values (4,0,179,68);
insert into km_data (id, cluster_id,x1,x2) values (5,0,182,72);
insert into km_data (id, cluster_id,x1,x2) values (6,0,188,77);

insert into km_data (id, cluster_id,x1,x2) values (7,0,180,71);
insert into km_data (id, cluster_id,x1,x2) values (8,0,180,70);
insert into km_data (id, cluster_id,x1,x2) values (9,0,183,84);
insert into km_data (id, cluster_id,x1,x2) values (10,0,180,88);
insert into km_data (id, cluster_id,x1,x2) values (11,0,180,67);
insert into km_data (id, cluster_id,x1,x2) values (12,0,177,76);


Finally we call procedure  with the parameter = 2, the expected number of clusters

call kmeans(2);


That is it! As simple as that ..

Of course, one may always wonder why anyone would use SQL for this when R and Python offers possible better functions, but then one sometimes does things just for the heck of it and that is what we have done here.

No comments: