A better way to cluster data

A year or so ago I wrote a blog on how to cluster points of geographical data from within an SQL database. It was a working solution but inelegant and a bit clunky so I’ve finally got around to tidying it up and turning it into a proper  function.

PostgreSQL and PostGIS don’t have built-in clustering functions so I looked to the R statistical language and to using the PL/R extension in Postgres. Sorry, this post assumes R and PL/R have been installed – if you haven’t Boston GIS have a very good guide.

The function is a lot simpler than the previous one, and doesn’t require any special data types – its works just like any other function.

CREATE OR REPLACE FUNCTION r_cluster_kmeans(x float8, y float8, clust_no float8)
AS $$
   if (pg.state.firstpass == TRUE) {
       pg.state.firstpass <<- FALSE
       c <- kmeans(cbind(farg1, farg2), clust_no)
       assign("cluster", c$cluster, env = .GlobalEnv)


And using it it equally easy. The function just needs the X and Y co-ordinates or the point, and the number of clusters you want to create, and it returns the specific cluster the point belongs to:

       r_cluster_kmeans(ST_X(the_geom), ST_Y(the_geom), 5) OVER () AS cluster,
       the_geom AS geom
FROM mytable;

In this example, it assigns each point in “mytable” to one of 5 clusters.

The post A better way to cluster data appeared first on Wooster Consulting.