A Sojourn* Into Clustering

I was looking at the towpath amenities project in the week before we went on vacation, mainly to play with database reporting software, and I noticed that my amenities all were pretty closely grouped together. This stands to reason, since the data is a ready-made cluster — it’s composed of amenities within a kilometer of Sand Island, so the clustering may just be an artifact of that search criterium — but also because the data set encompasses the compact¬† Main Street restaurant district. Continuing on with my reporting experiments, I looked at all amenities within a mile of Sand Island, and now found myself looking at two distinct groups of amenities, the one around Main Street, and another on the south side of the Lehigh. This also stands to reason — Chamber-of-Commerce types like to joke that we’re the city with two downtowns — but again I wondered if it was some artifact of the analysis, or even if I was seeing patterns that didn’t really exist, and that got me thinking of what I actually thought I meant by “cluster.”

Turns out, it’s a fairly big subject, with different ways of describing what “cluster” might mean — usually (and intuitively), it’s a subset of similar items within a larger data set, but then what does “similar” mean, and how similar do the members of a cluster have to be, especially compared to the rest of the set? For each way of understanding what a cluster is, there are various ways of finding the clusters within a data set. This whole subject is apparently a big deal, a subject of ongoing research, and an important tool in the fields of machine learning and big data.

My problem was spatial, so for me “similar” meant “close together in terms of location.” Some Googling found that there were plenty of GIS solutions to clustering problems, and in fact PostGIS contains several functions implementing the more common and important clustering algorithms, including DBSCAN, the algorithm that comes closest to what I think “clustering” should mean for my situation.

And here is where things became complicated…

The clustering functions are not available in the version of PostGIS that I had installed. So I decided to upgrade PostGIS, did a bit of research and found many articles with titles like “How to Brick Your Database By Updating PostGIS.” The process itself is not difficult, it uses old-school “make” rather than a package manager, and the pitfalls are easily avoided, but now I was scared and I thought I’d better back up my whole database system before continuing. What this meant though, was that first I had to make room on my hard drive, which has one (small, overcrowded) main partition and a (large, empty) secondary area. First thing would be to back up the secondary partition to the NAS drive — something I’ve been remiss on ever since I installed Mint — then I’d move both my music (35 GB) and my photos (12 GB) over to the secondary drive, and then update the music and photo software so it knew where all the files went — it was starting to sound like that song about the hole in the bucket…

I got through the first part, backing up the drive (which took hours), before we went on vacation. There was no Internet at our cabin, and I didn’t bring my computer anyway, so the rest had to await my return. The remainder of the hard drive cleanup (music and photos) also took some time but went smoothly enough, and I did a full backup of my databases.

From here the process was a bit anticlimactic: I downloaded the new version, ran make and typed a few things into the database, and I was done without bricking a damn thing. I needed to lay on my fainting couch and rest for a day after that, but when I finally got around to using the new functions they were a breeze.

I found some clusters and drew polygons around them — the subjects of another post —¬† but I have more to do to figure out what these things are actually telling me.

*Hat tip to Achewood, still my favorite Internet thing ever.