cloister
Re: split array of points into local neighbourhood
Reply #3 - Feb 9th, 2009, 10:18pm
There are some algorithms out there for doing cluster analysis. I have read about them, but have never implemented them myself, so I'm not super confident that this explanation is going to be correct, but I'll give it a shot.

The basic idea is that you posit the existence of some number of clusters, N, within the data. Each cluster is described by a position P and a radius R. Things that fall inside the circle {P, R} belong to the cluster; things that don't, don't. The problem then is to establish actual values for P[i] and R[i] for values of i in [0..N-1].

The basic algorithm is that you seed them with arbitrary values (usually random points for the P values and some selected value for R), and then you iteratively adjust the values of P and R until enough of the data is covered by the clusters that you're happy. In each iteration, you do more or less this:

* loop over all the data points
* for each point, calculate the probability that it belongs to each cluster (more on this later). Assign it to the cluster for which it has the highest probability of belonging.
* for each cluster, recalculate the cluster's P value as some appropriately weighted average of the positions of the data points that belong to it, and the R value as some appropriately weighted average of the distances from the cluster's data points to the cluster's new center.
* stop when some threshold of goodness has been reached. This is often decided as "when an iteration passes and no data points change their cluster assignment" or on the basis of covering 90% of the data, or whatever.

The bit about calculating the probability that a given data point belongs to a given cluster involves the fundamental assumption that the closer a point is to a cluster's center, the more likely it is to belong to that cluster. In practice, you use a Gaussian distribution that treats the cluster's R value as its one-standard-deviation radius as the actual formula for the probability, although I believe that any monotonically declining function of distance (e.g. a linear decay that reaches zero at radius R, 1/distance, or anything else like that) should work too. Gaussians are usually used for cluster analysis because real-world data often fits a Gaussian distribution around the "true" center.

The other wrinkle, which you've probably figured out already, is: how do you know, a priori, how many clusters to look for? There are a couple of ways. If you don't mind user input, you can just have the user click on perceived cluster centers, then click a "start" button when they've identified them all. If you want an automatic solution, then usually the above algorithm is tweaked so that R is constrained not to get bigger than some maximum value. When the algorithm terminates (because no points changed ownership from one iteration to the next), you check whether enough of the data was covered to call it a good solution. If not, you increase the number of clusters by one and start over. So more or less you ask, "can you cover the data with 1 cluster that's no bigger than R = (whatever)?" The algorithm comes back and says, "Well, here's the best single cluster I can find, but it only covers 75% of the data," and you say, "No, not good enough. How about 2 clusters?" And you go back and forth until you hit your coverage goal.
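
In case it helps, here's a rough sketch in plain Java of the loop I described above. All the names and numbers (Cluster, ClusterSketch, MAX_R = 50, a 90% coverage goal, a 100-pass safety cap) are mine rather than from any library, and the "weighted average" is just a plain mean, so treat it as an illustration of the idea, not something to copy blindly.

Code:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

class Cluster {
  double px, py;                                      // cluster centre P
  double r;                                           // cluster radius R
  List<double[]> members = new ArrayList<double[]>(); // points currently assigned to this cluster
  Cluster(double px, double py, double r) { this.px = px; this.py = py; this.r = r; }
}

public class ClusterSketch {
  static final double MAX_R = 50.0;        // assumed cap on a cluster's radius
  static final double COVERAGE_GOAL = 0.9; // assumed "good enough" coverage fraction

  // Ask for 1 cluster, then 2, then 3, ... until enough of the data is covered.
  static List<Cluster> cluster(List<double[]> points) {
    List<Cluster> clusters = null;
    for (int n = 1; n <= points.size(); n++) {
      clusters = run(points, n);
      if (coverage(points, clusters) >= COVERAGE_GOAL) break;
    }
    return clusters;
  }

  // One run with a fixed number of clusters: assign points, recompute P and R,
  // repeat until no point changes ownership (or we give up after 100 passes).
  static List<Cluster> run(List<double[]> points, int n) {
    Random rng = new Random();
    List<Cluster> clusters = new ArrayList<Cluster>();
    for (int i = 0; i < n; i++) {
      double[] seed = points.get(rng.nextInt(points.size())); // seed P at a random data point
      clusters.add(new Cluster(seed[0], seed[1], MAX_R));
    }
    int[] owner = new int[points.size()];
    Arrays.fill(owner, -1);
    boolean changed = true;
    for (int pass = 0; pass < 100 && changed; pass++) {
      changed = false;
      for (Cluster c : clusters) c.members.clear();
      // Assign each point to the cluster it has the highest probability of belonging to.
      for (int i = 0; i < points.size(); i++) {
        int best = 0;
        double bestP = -1;
        for (int j = 0; j < clusters.size(); j++) {
          double p = membership(points.get(i), clusters.get(j));
          if (p > bestP) { bestP = p; best = j; }
        }
        if (owner[i] != best) { owner[i] = best; changed = true; }
        clusters.get(best).members.add(points.get(i));
      }
      // Recompute each cluster's centre as the mean of its members, and its radius
      // as the mean distance of its members to that new centre (capped at MAX_R).
      for (Cluster c : clusters) {
        if (c.members.isEmpty()) continue;
        double sx = 0, sy = 0;
        for (double[] p : c.members) { sx += p[0]; sy += p[1]; }
        c.px = sx / c.members.size();
        c.py = sy / c.members.size();
        double sd = 0;
        for (double[] p : c.members) sd += dist(p, c);
        c.r = Math.max(1e-6, Math.min(MAX_R, sd / c.members.size()));
      }
    }
    return clusters;
  }

  // Gaussian membership: treat R as the one-standard-deviation radius.
  static double membership(double[] p, Cluster c) {
    double d = dist(p, c);
    return Math.exp(-(d * d) / (2 * c.r * c.r));
  }

  static double dist(double[] p, Cluster c) {
    double dx = p[0] - c.px, dy = p[1] - c.py;
    return Math.sqrt(dx * dx + dy * dy);
  }

  // Fraction of points that fall inside some cluster's circle {P, R}.
  static double coverage(List<double[]> points, List<Cluster> clusters) {
    int covered = 0;
    for (double[] p : points) {
      for (Cluster c : clusters) {
        if (dist(p, c) <= c.r) { covered++; break; }
      }
    }
    return covered / (double) points.size();
  }
}

The radius cap is what drives the "add another cluster and start over" behaviour: a single cluster that wants to be huge gets clipped at MAX_R, coverage comes back low, and the outer loop asks for one more cluster.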