> uwt7 \bjbjUU "z7|7|WlLLLLLLL`TTT8\`S'2444444&&&&&&&$( *z&L44444&LL44
'4^L4L4&4&#LL#4"`TF##4#'0S'#++#``LLLLk-Means Clustering & Finding k
What is k-Means Clustering?
When looking at data for the purpose of classification, there are several ways to approach classifying the examples in a given set. For example, we have parametric approaches, semi-parametric approaches, and nonparametric approaches.
As Ethem Alpaydin explains in Introduction to Machine Learning, "in the parametric approach, we assume that the sample comes from a known distribution." However, often that is simply not the case (Alpaydin, 133). In addition to the parametric methods, we also have nonparametric methods - which take the position that nothing can be assumed about the input density - or as Alpaydin puts it - "the data speaks for itself" (Alpaydin, 153).
Somewhere in between these two opposite approaches lies the class of semi-parametric methods - those where we assume "a mixture of distributions," or "a parametric model for each group in the sample" (Alpaydin, 133-134). It is under this umbrella that k-means clustering falls.
Simply put, k-Means Clustering is an algorithm among several that attempt to find groups in the data (Alpaydin 139). In pseudo code, it is shown by Alpaydin (139) to follow this procedure:
Initialize mi, i = 1,,k, for example, to k random xt
Repeat
For all xt in X
bit ( 1 if || xt - mi || = minj || xt - mj ||
bit ( 0 otherwise
For all mi, i = 1,,k
mi ( sum over t (bit xt) / sum over t (bit )
Until mi converge
The vector m contains a reference to the sample mean of each cluster. x refers to each of our examples, and b contains our "estimated [class] labels" (Alpaydin, 137).
Explained perhaps more simply in words, the algorithm roughly follows this approach:
1) Choose some manner in which to initialize the mi to be the mean of each group (or cluster), and do it. 2) For each example in your set, assign it to the closest group (represented by mi). 3) For each mi, recalculate it based on the examples that are currently assigned to it. 4) Repeat steps 2-3 until mi converge.
Now that we have some rudimentary understanding of what k-means is, what are some practical applications of it?
Applications of the k-Means Clustering Algorithm
Briefly, Alpaydin mentions optical character recognition, speech recognition, and encoding/decoding as example applications of k-means (Alpaydin, 134). However, a survey of the current literature on the subject offers a more in depth treatment of some other practical applications, such as "data detection for burst-mode optical receiver[s]" (Zhao, et al.), and recognition of musical genres (Turnbull and Elkan), which are specialized examples of what Alpaydin mentions.
As Zhao, Nehorai, and Porat describe "burst-mode data-transmission systems," a "significant feature of burst-mode data transmissions is that due to unequal distances between" sender and receivers, "signal attenuation is not the same" for all receivers. Because of this, "conventional receivers are not suitable for burst-mode data transmissions." The importance, they note, is that many "high-speed optical multi-access network applications, [such as] optical bus networks [and] WDMA optical star networks" can use burst-mode receivers (Zhao, 1492).
In their paper, they provide a "new, efficient burst-mode signal detection scheme" that utilizes "a two-step data clustering method based on a K-means algorithm." They go on to explain that "the burst-mode signal detection problem" can be expressed as a "binary hypothesis," determining if a bit is 0 or 1. Further, although they could use maximum likelihood sequence estimation (MLSE) to determine the class, it "is very computationally complex, and not suitable for high-speed burst-mode data transmission." Thus, they use an approach based on k-means to solve the practical problem where simple MLSE is not enough (Zhao, 1493).
The paper from Douglas Turnbull and Charles Elkan is quite a bit more concise, and not as full of jargon as the Zhao article, so its example may prove more meaningful. In it, they show that although "the classification of music by genre is difficult to automate," they are able to "achieve human-level accuracy with fast training and classification" by using "radial basis function (RBF) networks using a combination of unsupervised and supervised initialization methods." Furthermore, they note that the initialization methods they use are "hundreds of times" faster than using "RBF networks trained with gradient descent." They are able to achieve this goal in part with help from "an improved method for initializing the k-means clustering algorithm which is useful for both unsupervised and supervised initialization methods" (Turnbull, 580).
The new approach Turnbull and Elkan use to initialize k-means is what they call Subset Furthest First (SFF). They note that a problem with (normal) Furthest First is "that it tends to find the outliers in the data set." By using only a subset, they found that there are less outliers that can be found, and "thus, the proportion of nonoutlier points obtained as centers is increased" (Turnbull, 581).
Finally, Turnbull and Elkan also show that using the multiple initialization methods of Gaussian maximum likelihood, k-means, and in-class k-means, they get about the same classification accuracy as one would obtain by using the method of gradient descent. However, they note, "in each trial, creating a network without gradient descent takes seconds, whereas applying gradient descent takes hours." In the end, they cite a study that "found that humans achieved 70 percent music classification accuracy," and compared that to their own result - which was 71.5 percent (Turnbull, 583).
Reading Turnbull and Elkan, as well some other current literature, one can see some potential difficulties in using k-means. Some of those below are discussed below.
Difficulties with k-means
There are some difficulties in using k-means for clustering data. We can see several mentions of this in current and past research, and an oft-recurring problem has to do with the initialization of the algorithm.
For example, in protein sequence motifs, Wei Zhong et al. mention their improved initialization procedure "increases the percentage of sequence segments belonging to clusters with high structural similarity." The wording "improved" suggests that many initialization methods are far from ideal, especially given the amount of papers mentioning this. Further, it "may discover some relatively weak and subtle sequence motifs" which would otherwise remain undetected by "the traditional K-means algorithms" (Zhong, 255).
They continue, explaining that out of many initialization strategies, four in particular - Random, Forgy, MacQueen, and Kaufman - are all subject to the rule that "the choice of initial data points defines deterministic mapping [where] inappropriate choices of initial points may result in distorted or incorrect partitions" (Zhong, 256). And although they choose the Forgy method when implementing a "traditional" k-means algorithm (the other three are not suitable for their purpose, they state), they also propose a "new greedy initialization method to overcome potential problems of random initialization" (Zhong, 256).
They explain their algorithm as follows:
In the new initialization method, the clustering algorithm will only be performed for several iterations during each run. After each run, initial points, which can be used to form the cluster with good structural similarity, are chosen and their distance is checked against that of all points already selected in the initialization array. If the minimum distance of new points is greater than the specified distance, these points will be added to the initialization array (Zhong, 256).
Similarly, Michael Lazlo and Sumitra Mukherjee refer to this concept outright, pointing out "the algorithm is very sensitive to the initial selection of centers and is likely to converge to partitions that are significantly inferior to the global optimum." In their article, they "present a genetic algorithm (GA) for evolving centers in the k-means algorithm that simultaneously identifies good partitions for a range of values around a specified k" (Lazlo, 533). They also note that the k-means dominates the running time of their algorithm, so it should be useful for large datasets (Lazlo, 541), and although they use their "GA to evolve good centers for a [specific] version of k-means [their] approach is flexible enough to accommodate other variants of k-means" (Lazlo, 542).
Finally, Alpaydin mentions that because the initial means greatly influence the final means, we should limit the choice of our selection criteria to include centers where we believe there will be data (137). The same problem can be inferred through the Turnbull and Elkan paper, as they mention the necessity of their subset furthest first implementation for initializing the mean vector for k-means is that the standard version is subject to poor results due to outliers. Additionally, they find that using a single method for initialization and the k-means algorithm, the results obtained in trying to classify music by genre are not as good as humans. However, by combining different methods, they show how they were able to match the ability of humans.
Determining the number of clusters in a set of data
The problem of initializing the centers for k-means, as described above, dovetails with the problem of finding the initial k - or how many clusters are to be found in the dataset - since to do so, one must also know how many centers exist. A paper by Greg Hamerly and Charles Elkan from University of California, San Diego, addresses this problem. They introduce an algorithm to find k, which they call G-means. As they describe the algorithm, it follows this path:
The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k=1, or we can choose some larger value of k if we have some prior knowledge about the range of k. (Hamerly and Elkan, 2)
It is important to note that they assume each cluster to be Gaussian distributed, and the only other "intuitive parameter" is "the standard statistical significance level" (1).
The rest of this paper focuses on my own experiment to find what number k represents, given some assumptions about the data.
Setting up the experiments
Like Hamerly and Elkan, this experiment assumes the data come in the form of clusters that are each Gaussian distributed. To keep it simple, the experiment only covers the case for one-dimensional data. Obviously, this is quite useless, since a human could plot the data and tell how many clusters are in it for that case. However, it could be quite simple for someone with the desire and knowledge to extend it to cover multi-dimensional data, some problems of which I discuss below. In addition, references to "random" below should be regarded as "pseudorandom," since this was carried out using generated data, not organic. Finally, it should be noted that despite the fact I believe the approaches used could work, the experiment as it stands, failed.
For the experiment, a GaussianGenerator was created first, which generates a Sample given a mean and standard deviation. It uses the Box-Muller transform to generate a number within the given parameters.
In order to be reusable and vary the parameters, the experiment starts by specifying a minimum and maximum number of clusters, and a minimum and maximum size for a Sample. Further, it needs a constant to transform a random mean, and another for random standard deviation. These just multiply a pseudorandom number between 0...1 to increase the range of parameters available. It then generates a random number of Samples with random mean and variance between the supplied parameters discussed above. Finally, it combines all the samples into one, shuffles the examples, and passes control to the two algorithms to find k. The algorithms attempted to use nothing but the combined Sample as input.
Description of the algorithms
Two ideas struck me as plausible ways to find k, given nothing but a sample, although both of them relied on using standard deviation.
The first algorithm sorts the examples in the combined sample and simply iterates over each of them, adding examples to a new working sample if they are within 3 standard deviations of the current mean, since about 99 percent of the points in a given normal distribution should be within that range. When an example falls outside that range, another working sample is created, and the process continues. When all the examples have been iterated over, the function returns the number of working samples it used. Further, it assumes a minimum of 50 examples in each sample (to calculate the initial mean and standard deviation), and this parameter can be varied depending on knowledge of the data. The Java source code for this function (or the entire experiment) is available upon request.
The second algorithm takes the standard deviation of a sample, and recursively calls itself using smaller ones as the standard deviation decreases. Since it is easier to understand a recursive function by viewing it as opposed to reading a description, the source code for it is provided in Figure 1 below.
Figure 1: Recursive findK() function.
At first glance it looks like the function expects an initial standard deviation, but in fact, it is used such that on the first run, that parameter is initialized to -1. Therefore, it will calculate the initial standard deviation based on the entire sample. Within it there is a variable, tolerance, which is used to indicate how close the smaller samples' standard deviation should be to the last standard deviation to continue splitting the sample. This value is chosen arbitrarily, and it is likely the experiment results could have been better had I attempted to let the program "learn" a good value to use.
Experimental Results
Presented here in tabular form are the results of 20 experimental runs. More were done, of course, but due to space requirements, only 20 are presented. In both sets, the number of samples was uniformly random between 5 and 12, the size of each sample varied from 50 to 100, and the mean and standard deviation of each sample spanned 0-100 and 0-5, respectively. In the first set of ten runs, the standard deviations of the initial samples were allowed to vary. The second set used a fixed standard deviation of two.
Set 1: Variable Standard Deviation Run: 1 2 3 4 5 6 7 8 9 10 k: 6 7 11 6 11 6 10 7 7 8 Iterative: 3 11 2 7 8 8 6 3 22 6 Recursive: 8 8 9 8 9 8 8 8 8 8 Set 2: Fixed Standard Deviation = 2 Run: 1 2 3 4 5 6 7 8 9 10 k: 9 7 5 11 11 9 7 7 6 10 Iterative: 8 8 14 21 10 11 7 4 2 5 Recursive: 8 8 8 9 8 8 8 8 8 8
As the data show, neither algorithm was particularly successful at correctly finding k. The recursive algorithm is a lot less erratic, and seems to be closer to finding k than its iterative counterpart. It should be noted, however, that in several of the cases where the recursive algorithm found too few clusters, particularly when it was one or two off, looking at the individual generated clusters reveals that actually, it was very good - just that two or three of the means were incredibly close, and the algorithm could not differentiate between all the sets. For example, in the first run of the fixed standard deviation, one of the means was 87.68 with another at 89.78. In this case, it is unlikely even a human could have differentiated between the two. Similarly, in the sixth run of the same set, two of the generated means were less than a standard deviation apart. However, a glaring issue with the recursive trials is that it appears to always hover around eight or nine - indicating that perhaps the tolerance was so low, it encouraged splitting the sample all the time. However, even varying the side of the equation in which the tolerance was applied (or subtracting it instead of adding it) did not change the variety of results obtained.
Analysis of multiple runs of this experiment shows unexciting results as well. Using 100 trials, it was found that about 10 percent of the time, the iterative algorithm guesses correctly. With only 8 possible values to choose from, this is worse than guessing if the algorithm had known how many it had to choose from. Among the times that it counted the wrong number of clusters, it was off by an astounding average of 45.3 percent.
On the other hand, the recursive algorithm fared slightly better. Using the same 100 trials, it correctly identified k 16 percent of the time, and averaged only 19.3 percent difference among the times it was wrong. Clearly however, this is nothing to write home about.
At this point, it was clear that at least some of the time, the recursive algorithm appeared to be finding k correctly, except that some of the means were too close to each other to distinguish between them. Therefore, another 100 trials were run, this time randomly generating the mean of the first sample, and adding a fixed number to each successive one to ensure a minimum distance between each sample. Discouragingly, the results weren't much better. The iterative algorithm again only correctly found k nine percent of the time, and averaged a whopping 71 percent difference when it was wrong. The performance of the recursive algorithm did improve slightly - this time getting k correct 26 percent of the time, and only being "off" by an average of 16.5 percent the times it was incorrect.
Conclusions
I would have considered something in the range of 90% + correct with a much smaller average percentage "off" to be somewhat of a success. By those criteria, the experiments failed. In fact, by any standard of measurement, both the recursive and iterative algorithm ideas failed.
Even if one of the algorithms did work, among many potential flaws in making the ideas useful for higher dimensions, a particular one that doesn't map well is sorting the examples, as both algorithms did upon receiving the combined sample set. It is easy to sort a one-dimensional set, but what about the multi-dimensional case? My thought here is something along the lines of sorting by angle, then distance, but I haven't explored the idea in depth.
Still being new to the field of machine learning, I haven't the experience to know if someone has tried these ideas and failed, or figured out a way to make them work. Intuitively, the idea seems solid: Look for examples that are far away from the others but still form a group. Count the instances, and you've found k. In practice, this didn't work well. However, at this point I am not convinced the idea would not work, but rather that my own shortcomings didn't allow me to see something I was overlooking or missing altogether.
In particular, I see a similar problem in both algorithms that may point to why the experiment failed. In the iterative algorithm, we have a calculation of standard deviation that occurs at the "left" side of an ordered sample. At first, the algorithm assumed only two, three, and ten examples (rather than the 50 we saw above). Using two gave horrible predictions, as did three, but 50 did not offer much improvement over ten, where the improvement was significant. Further, in calculating the initial standard deviation, the data are skewed and as we get to the bulk of the examples in the set, the value may rise before it gets lower, splitting too soon. Even if that is not true, as we get further to the right of the sample, if we have just one example further than three standard deviations from the mean, it will split, and likely do so again on the next example. Here it seems possible to have a cascading effect, and we see cases where the algorithm wildly predicts the wrong k, sometimes almost doubling it.
Finally, in the recursive algorithm, the idea was to keep splitting the examples into different samples until k stopped decreasing, given some tolerance. However, depending on where the split happens (as the algorithm stands, it just split in half), the standard deviation may increase, counting only one when it should have been two. In this case, it is possible to see too few k counted. On the other hand, it would also be possible to split a mostly homogenous set into three sets, counting too many k. In the end, this algorithm could have been improved by choosing a smarter place to split (perhaps even integrating the two algorithms). Similarly, learning the tolerance level instead of arbitrarily choosing it would likely have improved the performance.
References
Alpaydin, Ethem. Introduction To Machine Learning. Cambridge, Massachusetts: MIT Press. 2004.
Lazlo, Michael, and Mukherjee, Sumitra. "A Genetic Algorithm Using Hyper-Quadtrees for Low-Dimensional K-means Clustering." IEEE Transactions on Pattern Analysis and Machine Intelligence.
Vol. 28. No. 4. April 2006. 533-543.
Hamerly, Greg, and Elkan, Charles. "Learning the k in k-means." Retrieved from the World Wide Web at HYPERLINK "http://books.nips.cc/papers/files/nips16/NIPS2003_AA36.pdf" http://books.nips.cc/papers/files/nips16/NIPS2003_AA36.pdf through Google Scholar on November 25, 2006. Found.
Turnbull, Douglas and Elkan, Charles. "Fast Recognition of Musical Genres Using RBF Networks." IEEE Transactions on Knowledge and Data Engineering.
Vol. 17. No. 4. April 2005. 580-584.
Zhao, Tong, Nehorai, Arye, and Porat, Boaz. "K-Means Clustering-Based Data Detection and Symbol-Timing Recovery for Burst-Mode Optical Receiver." IEEE Transactions on Communications.
Vol. 54. No 8. August 2006. 1492-1501.
Zhong Wei, et al. "Improved K-Means Clustering Algorithm for Exploring Local Protein Sequence Motifs Representing Common Structural Property." IEEE Transactions on Nanobioscience.
Vol. 4. No. 3. September 2005. 255-265.
Sammy Larbi December 5, 2006
Machine Learning codeodor.com
PAGE 11
Hh " # $ - / 2 3 4 7 8 9 E F G H I _ ` a p q r s t
no ^9_9`99>2>4>5>Z>[>>CJaJ0JCJ jUCJOJQJaJ jH*H*5\6]
5CJ$\R <)* B V n Y
Z
^`^dhdh$dha$[\detunoQRq r dhdh ""%%(((*,,B-C----0011e4f44
55(8^dhdhd7$8$H$(8)8]9^9`999;;<>>4>5>;>a$$If 0634a-$Ifdhdh^`dh;>>>A>D>G>J>M>P>S>V>Z>[>_>b>e>i>l>p>s>w>z>}>>>>>>Ff$IfFf $$Ifa$>>>>>>>>>$?%?J?K?y?z???VV|WWMXNXXXXXXkYYbZZZZF[i[u\v\|\}\\\\0JmHnHu0J
j0JUaJ0JjU jU6]0JCJaJ*>>>>>>>>>>>>>>>>>>>>>>>$&`#$/IfFf~$IfFf@ $$Ifa$>>????????? ?$?ssssssssss$$&`#$/Ifa$$&`#$/Ifk$$If
6`0634a$?%?)?,?/?2?6?:?=?@?C?F?J?K?W?Z?]?a?e?i?m?p?s?v?y?ԼFf
$$&`#$/Ifa$$&`#$/IfFfy?z??????????????DDMFNF_G`GJJ
!dhFf$$&`#$/Ifa$$&`#$/IfFf^JJKKnMoMOOSSVVVVVWWW Y
YYYYZZZm[[[dhdh[[t\u\\\\\dh$a$ 1h/ =!"#$%Ddl
CHA0..\recursive_find_k.gifb\acT+
D*n
\acT+PNG
IHDRs5HYPLTEU?_mWlbKGDHcmPPJCmp0712Om
QIDATxj<ByUcjPPU6eZ"98iܙeI>]GGNlf@J':vII[
˜x|2YYsJ {VcU*iMY,r\/]x-g`C*nTaՏlZA> )X[BŃ
x%0;XdWlU@PZLV gr;Ck5 5\gg(u'drotXA͎x14V` U`z<k4d~L
0y 8L mȷ;{ϼL6~=NHKӖΓn$6&`nEfot6?twe&g"|.@B`n|,Y9s(R[o=J=ڣ4`XJn&D
!B
Q¹D S,-0~(ʹH:2aXH
P
(~۴(##(q,0*t0*p^L}s[Fټ@(xcyި5ʐB7QIܟio8hڃ.LehBiU}O%%]mюGrVv"'Z,)e;5*ەݘ]˝?
J9['5'1-u%c%y^bqZUr恫.wfXDrh_UBY
gw /wJ]FU%`r8(v˝1k
Ս-76/5=!3*@هWΫIC3_5&&X8UnpO/"twsI SIBNp8C1O)TvH!a)WdCIUԍ!QJU !нY͙*.EΪǻmVig\6"*~ 4*;=RYETZEm.86RiuE
\H.-kB<`_JrO*,vǤE:x%lIѽbklIѹbkN>o(@5+n>阬[pǤ.oVIѹbkN>o(@9+
Puj_vNZW }dxw.-sft_rviFqo`]~um9ꗜg6kJ*YnRZЧ_viFno%'eVpw=vi*B*ph>V@A>FollowedHyperlink>*B*ph"W@Q"Strong5\Xz <)*BVnYZdet
u
noQRqr!!$$$&((B)C)))),,--e0f00
11(4)4]5^5`555778::4:5:;:>:A:D:G:J:M:P:S:V:Z:[:_:b:e:i:l:p:s:w:z:}:::::::::::::::::::::::::::::;;;;;;;;; ;$;%;);,;/;2;6;:;=;@;C;F;J;K;W;Z;];a;e;i;m;p;s;v;y;z;;;;;;;;;;;;;;@@MBNB_C`CFFFGGnIoIKKOORRRRRSSS U
UUUUVVVmWWWWtXuXXXX0000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 000e0e0e0e0e0e0e0e0e0e0e0e0e0e000000000000000000$0$0$0$0$0$0$00)0)0)0)0)0)00f00f00f00f00f00f00f00f00f00f00f00070707070707070707070707070707070707070707070707070707070707070707070707070707070707070707070707070707070707@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@07@0707070707070707070700F0F0F0F0F0F0F0F0F000R0R0R0R0R0R0R0R0R0R0R0R0R0R0R0@0@0@0@0@0
0>\/6 (8;>>>$?y?J[\02345789:;<\1MTTTXX!MTTTTTTWWWXsXuXXXX%TUWWWXsXuXXXX:>FF8@./ Z#a#%%)**
***l****2,S,,,,,--/ /111111g2n24488<<FFGGwGwGGGGGHHHHJJKKLLM
M7MaMdMnMMMMMMMMMNN8N:A:D:G:J:M:P:S:V:Z:[:_:b:e:i:l:p:s:w:z:}:::::::::::::::::::::::::::::;;;;;;;;; ;$;%;);,;/;2;6;:;=;@;C;F;J;K;W;Z;];a;e;i;m;p;s;v;y;z;;;;;;;;;;;;;X@DCXP@UnknownGz Times New Roman5Symbol3&z Arial;Wingdings?Times-Roman"1htAf"&o^=H$G$!90dX[TW2Qk-means ClusteringSammy LarbiSammy LarbiOh+'0
<H
T`hpxk-means Clustering-meSammy LarbiammammNormal.dotSammy Larbi111Microsoft Word 9.0@W@0|L,@=H՜.+,D՜.+,Dhp
xorbex$X
k-means ClusteringTitle\ 8@_PID_HLINKSAzV;http://books.nips.cc/papers/files/nips16/NIPS2003_AA36.pdf9Q^9..\recursive_find_k.gif
!"#$%&'()*+,-./0123456789:;<=?@ABCDEFGHIJKLMOPQRSTUVWXYZ[\]^_`abcefghijkmnopqrsvRoot Entry F"xData
>y1TableN+WordDocument"zSummaryInformation(dDocumentSummaryInformation8lCompObjjObjectPool""
FMicrosoft Word Document
MSWordDocWord.Document.89q