Ever since I started my career in software and IT as a database administrator, I have had a great fascination, if not love, for the SQL language. I also know that there are many in the software fraternity (and sorority) who share my comfort with using SQL. Even after moving into the current enthusiasm, or should I call it a fad, with data analytics, I see that there exists many who prefer the ease and convenience of SQL even in big data.
For example, Hadoop had to hide the complexity of its Map-Reduce magic behind HIVE. Spark offers SparkSQL. In an earlier post, I had shown how SQL can be used for scoring Data Mining models developed with R through the PMML route.
In this post, we show how K-means clustering, a very common and widely used data mining activity can be done with SQL, using a technique adopted from Joni Salonen's blog post. The following codes were executed in MySQL.
First we create the tables :
Then we create a stored procedure to implement the K-means algorithm
Next we load some sample data :
Finally we call procedure with the parameter = 2, the expected number of clusters
That is it! As simple as that ..
Of course, one may always wonder why anyone would use SQL for this when R and Python offers possible better functions, but then one sometimes does things just for the heck of it and that is what we have done here.
For example, Hadoop had to hide the complexity of its Map-Reduce magic behind HIVE. Spark offers SparkSQL. In an earlier post, I had shown how SQL can be used for scoring Data Mining models developed with R through the PMML route.
In this post, we show how K-means clustering, a very common and widely used data mining activity can be done with SQL, using a technique adopted from Joni Salonen's blog post. The following codes were executed in MySQL.
First we create the tables :
drop table if exists km_data; -- contains initial data, only 2 dimensions
drop table if exists km_clusters; -- contains the final clusters centres
drop table if exists km_steps; -- shows intermediate steps
create table km_data (id int primary key, cluster_id int, x1 float, x2 float);
create table km_clusters (id int auto_increment primary key, x1 float, x2 float);
create table km_steps (id int, x1 float, x2 float);
Then we create a stored procedure to implement the K-means algorithm
drop procedure if exists kmeans;
DELIMITER //
CREATE PROCEDURE kmeans(v_K int)
BEGIN
TRUNCATE km_clusters;
truncate km_steps;
-- initialize cluster centers
INSERT INTO km_clusters (x1, x2) SELECT x1, x2 FROM km_data LIMIT v_K;
REPEAT
insert into km_steps select * from km_clusters;
-- assign clusters to data points
UPDATE km_data d SET cluster_id = (SELECT id FROM km_clusters c
ORDER BY POW(d.x1-c.x1,2)+POW(d.x2-c.x2,2) ASC LIMIT 1);
-- calculate new cluster center
UPDATE km_clusters C, (SELECT cluster_id,
AVG(x1) AS x1, AVG(x2) AS x2
FROM km_data GROUP BY cluster_id) D
SET C.x1=D.x1, C.x2=D.x2 WHERE C.id=D.cluster_id;
UNTIL ROW_COUNT() = 0 END REPEAT;
END//
This procedure updates tables without a WHERE condition. This is not allowed in the default "Safe Update" mode of MySQL. Go to Preferences -> SQL Queries and uncheck the "Safe Updates" option. Otherwise this procedure will give an error.
Next we load some sample data :
truncate km_data;
insert into km_data (id, cluster_id,x1,x2) values (1,0,185,72);
insert into km_data (id, cluster_id,x1,x2) values (2,0,170,56);
insert into km_data (id, cluster_id,x1,x2) values (3,0,168,60);
insert into km_data (id, cluster_id,x1,x2) values (4,0,179,68);
insert into km_data (id, cluster_id,x1,x2) values (5,0,182,72);
insert into km_data (id, cluster_id,x1,x2) values (6,0,188,77);
insert into km_data (id, cluster_id,x1,x2) values (7,0,180,71);
insert into km_data (id, cluster_id,x1,x2) values (8,0,180,70);
insert into km_data (id, cluster_id,x1,x2) values (9,0,183,84);
insert into km_data (id, cluster_id,x1,x2) values (10,0,180,88);
insert into km_data (id, cluster_id,x1,x2) values (11,0,180,67);
insert into km_data (id, cluster_id,x1,x2) values (12,0,177,76);
Finally we call procedure with the parameter = 2, the expected number of clusters
call kmeans(2);
That is it! As simple as that ..
Of course, one may always wonder why anyone would use SQL for this when R and Python offers possible better functions, but then one sometimes does things just for the heck of it and that is what we have done here.
No comments:
Post a Comment