This k-means implementation -- like many k-nearest-neighbor and related codes -- suffers from a shortcoming that yields poor results when the measure represents distance. In particular, "nearest" does not account for multiplicity, i.e. the case where multiple neighbors lie at the same nearest distance. The result is that the algorithms under-reach in their collection of neighbors, causing a domino effect downstream, including clusters that are distorted with respect to reality. I previously brought this to the group's attention here:
https://community.wolfram.com/groups/-/m/t/2079392
I have since implemented my own algorithms to work around the issue.
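To make the multiplicity issue concrete, here is a minimal sketch (not the poster's actual implementation) of a tie-aware nearest-neighbor query in Python/NumPy. The function name, tolerance parameter, and example points are all illustrative assumptions; the point is that `argmin`-style selection returns one arbitrary winner, while collecting every index at the minimum distance preserves all equidistant neighbors.

```python
import numpy as np

def nearest_neighbors_with_ties(point, candidates, tol=1e-12):
    """Return the indices of ALL candidates at the minimum distance.

    A tie-aware sketch: when several candidates are equidistant from
    `point`, every one of them is reported, rather than the single
    arbitrary index that np.argmin would give. (Name and tolerance
    are illustrative, not from any particular library.)
    """
    dists = np.linalg.norm(candidates - point, axis=1)
    dmin = dists.min()
    # Collect every index whose distance matches the minimum (within tol),
    # instead of argmin's single winner.
    return np.flatnonzero(dists <= dmin + tol)

# Example: the origin is equidistant from (1,0) and (0,1),
# so both indices 0 and 1 are returned.
pts = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
print(nearest_neighbors_with_ties(np.zeros(2), pts))
```

In a k-means assignment step, a tie like this means a point sits on a cluster boundary; an implementation that silently picks one winner per query under-collects neighbors exactly as described above.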
Interesting! Yes, there are definitely some major drawbacks to k-means, but it works fairly well given how simple it is.
It is not that there are drawbacks to k-means per se, but rather that there are drawbacks to implementations that ignore multiplicity. Part of this is due to the influence of older k-neighbor algorithms, designed for compiler construction in the single-threaded, single-processor world of the time.
If clustering and associated data structures interest you, see this 1993 paper by Warren and Salmon, "A parallel hashed oct-tree N-body algorithm": https://dl.acm.org/doi/pdf/10.1145/169627.169640 and the 2014 update by Warren: https://content.iospress.com/articles/scientific-programming/spr385