What distance function does FindClusters use?

My list contains numbers between 0 and 40k. The figure shows the distribution of the data:

[Figure: distribution of the data]

I tried FindClusters[list]

The output is two clusters as seen here:

{{4169, 7114, 5025, 7316, 4977, 10411, 9352, 16438, 8719, 14330, 
      10277, 7144, 11950, 18572, 10471, 4915, 4958, 7556, 5145, 13862, 
      8466, 14138, 10861, 11815, 5638, 15242, 16666, 23564, 4256, 13014, 
      9865, 3729, 5980, 7740, 14290, 14067, 12038, 14125, 6436, 14240, 
      19054, 9622, 13876, 8362, 5983, 7163, 4908, 12856, 15923, 14368, 
      14467, 9393, 9555, 8537, 9149, 10272, 8228, 6525, 6596, 10401, 6244,
       16576, 15262, 12593, 16128, 13189, 13508, 14206, 15115, 24985, 
      19442, 18195, 14522, 9103, 8781, 9394, 4716, 6760, 9281, 6958, 
      10581, 10862, 11518, 11508, 5691, 8567, 9797, 10897, 9535, 8723, 
      7645, 7035, 7186, 7392, 6913, 7549, 18990, 12778, 15982, 5145, 
      14650, 14468, 13480, 20918, 14713, 17319, 22983, 20166, 9464, 23675,
       8466, 9598, 9698, 7082, 18233, 15193, 11804, 10285, 25290, 17428, 
      11320, 6441, 11868, 14666, 18505, 11778, 12131, 9275, 6347, 13024, 
      19351, 14984, 14150, 18093, 7455, 20572, 14041, 23137, 12763, 14986,
       11280, 13584, 17583, 14394, 17540, 18123, 16960, 9344, 20265, 
      21251, 19206, 25316, 17411, 17123, 17137, 11778, 19055, 15926, 
      18753, 19731, 14524, 21106, 12309, 12357, 17689, 23076, 20067, 
      10224, 16353, 7571, 8493, 8927, 15024, 18869, 14585, 16099, 18462, 
      14361, 15621, 15584, 20522, 18542, 13220, 19124, 16885, 10800, 
      20395, 18752, 17369, 21940, 14893, 14939, 25153, 19275, 15273, 
      18337, 18835, 17250, 26872, 15279, 14366, 15319, 20846, 15711, 
      18547, 20289, 22089, 17250, 18777, 21723, 17813, 21230, 24460, 8375,
       14843, 18409, 4854, 10552, 13598, 14440, 14707, 17834, 18916, 
      22908, 7045, 20264, 20317, 6742, 8589, 15747, 17136, 12764, 18185, 
      6882, 8867, 7009, 13119, 10461, 11362, 14844, 14337, 9780, 7170, 
      8486, 8538, 8758, 8383, 5024, 7285, 10365, 5239, 7644, 8675, 7909, 
      8781, 7353, 6439, 9123, 8136, 11655, 18012, 8834, 11400, 8248, 8207,
       9232, 11126, 24912, 12578, 8352, 13299, 6344, 8347, 6876, 14591, 
      11316, 18416, 11233, 8438, 20095, 10800, 7596, 5791, 7083, 7931, 
      6021, 6088, 13472, 9212, 6992, 8428, 9336, 11558, 10948, 8795, 6353,
       11253, 9172, 15023, 6512, 7775, 11892, 7908, 7545, 8135, 10378, 
      8896, 7302, 12794, 10991, 10490, 7240, 9780, 4285, 4694, 6847, 9383,
       6969, 7879, 12737, 5840, 5550, 12252, 9034, 8661, 10347, 11444, 
      8241, 11445, 11539, 14462, 17701, 13711, 8229, 7458, 12440, 13455, 
      12092, 13517, 12047, 10099, 18228, 14068, 17192, 18021, 12252, 
      11070, 11711, 12952, 12144, 9109, 6563, 4531, 7438, 8839, 15560, 
      11478, 18469, 14584}, {35494, 32082, 27490, 29077, 31458, 31198}}

My second try was to specify the number of clusters using FindClusters[list,4]. The output was:

{{4169, 7114, 5025, 7316, 4977, 10411, 9352, 16438, 8719, 14330, 
  10277, 7144, 11950, 18572, 10471, 4915, 4958, 7556, 5145, 13862, 
  8466, 14138, 10861, 11815, 5638, 15242, 16666, 23564, 4256, 13014, 
  9865, 3729, 5980, 7740, 14290, 14067, 12038, 14125, 6436, 14240, 
  19054, 9622, 13876, 8362, 5983, 7163, 4908, 12856, 15923, 14368, 
  14467, 9393, 9555, 8537, 9149, 10272, 8228, 6525, 6596, 10401, 6244,
   16576, 15262, 12593, 16128, 13189, 13508, 14206, 15115, 19442, 
  18195, 14522, 9103, 8781, 9394, 4716, 6760, 9281, 6958, 10581, 
  10862, 11518, 11508, 5691, 8567, 9797, 10897, 9535, 8723, 7645, 
  7035, 7186, 7392, 6913, 7549, 18990, 12778, 15982, 5145, 14650, 
  14468, 13480, 20918, 14713, 17319, 22983, 20166, 9464, 23675, 8466, 
  9598, 9698, 7082, 18233, 15193, 11804, 10285, 17428, 11320, 6441, 
  11868, 14666, 18505, 11778, 12131, 9275, 6347, 13024, 19351, 14984, 
  14150, 18093, 7455, 20572, 14041, 23137, 12763, 14986, 11280, 13584,
   17583, 14394, 17540, 18123, 16960, 9344, 20265, 21251, 19206, 
  17411, 17123, 17137, 11778, 19055, 15926, 18753, 19731, 14524, 
  21106, 12309, 12357, 17689, 23076, 20067, 10224, 16353, 7571, 8493, 
  8927, 15024, 18869, 14585, 16099, 18462, 14361, 15621, 15584, 20522,
   18542, 13220, 19124, 16885, 10800, 20395, 18752, 17369, 21940, 
  14893, 14939, 19275, 15273, 18337, 18835, 17250, 15279, 14366, 
  15319, 20846, 15711, 18547, 20289, 22089, 17250, 18777, 21723, 
  17813, 21230, 24460, 8375, 14843, 18409, 4854, 10552, 13598, 14440, 
  14707, 17834, 18916, 22908, 7045, 20264, 20317, 6742, 8589, 15747, 
  17136, 12764, 18185, 6882, 8867, 7009, 13119, 10461, 11362, 14844, 
  14337, 9780, 7170, 8486, 8538, 8758, 8383, 5024, 7285, 10365, 5239, 
  7644, 8675, 7909, 8781, 7353, 6439, 9123, 8136, 11655, 18012, 8834, 
  11400, 8248, 8207, 9232, 11126, 12578, 8352, 13299, 6344, 8347, 
  6876, 14591, 11316, 18416, 11233, 8438, 20095, 10800, 7596, 5791, 
  7083, 7931, 6021, 6088, 13472, 9212, 6992, 8428, 9336, 11558, 10948,
   8795, 6353, 11253, 9172, 15023, 6512, 7775, 11892, 7908, 7545, 
  8135, 10378, 8896, 7302, 12794, 10991, 10490, 7240, 9780, 4285, 
  4694, 6847, 9383, 6969, 7879, 12737, 5840, 5550, 12252, 9034, 8661, 
  10347, 11444, 8241, 11445, 11539, 14462, 17701, 13711, 8229, 7458, 
  12440, 13455, 12092, 13517, 12047, 10099, 18228, 14068, 17192, 
  18021, 12252, 11070, 11711, 12952, 12144, 9109, 6563, 4531, 7438, 
  8839, 15560, 11478, 18469, 14584}, {35494}, {24985, 25290, 25316, 
  27490, 25153, 29077, 26872, 24912}, {32082, 31458, 31198}}

Could you explain to me how this function works? I don't want one huge cluster containing most of the values. Instead, I expect the function to recognise clusters for values near 10k, 15k, 20k and 30k. What distance function does FindClusters use?

POSTED BY: Veronica Estrada
Answer
4 months ago

Hi Veronica,

have you tried using the Method option?

FindClusters[data, 4, Method -> "KMeans"]

The boundaries of the ranges are:

MinMax /@ FindClusters[data, 4, Method -> "KMeans"]
{{3729, 10099}, {10224, 15621}, {15711, 22089}, {22908, 35494}}
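
To give an idea of why "KMeans" produces contiguous ranges like these, here is a minimal sketch of the textbook k-means (Lloyd) iteration on one-dimensional data. It is an illustration only, not the internal implementation of FindClusters. The list toy, the helper lloydStep and the starting centers are all made up for this example, and the absolute difference is assumed as the distance.

```wolfram
(* Assumption: a small toy list standing in for the real data *)
toy = {4100, 5200, 9800, 10400, 15100, 15900, 20300, 21000, 30500, 31200};

(* Hypothetical helper: one Lloyd iteration in 1-D.
   Each point is assigned to the index of its nearest center
   (absolute difference), then every center is replaced by the
   mean of the points assigned to it. *)
lloydStep[points_, centers_] := Module[{groups},
  groups = GatherBy[points, First[Ordering[Abs[# - centers], 1]] &];
  Sort[N[Mean /@ groups]]
]

(* Iterate from arbitrary starting centers until nothing changes
   (at most 100 steps) *)
centers = FixedPoint[lloydStep[toy, #] &, {5000., 12000., 20000., 30000.}, 100]

(* For comparison: the built-in method on the same toy list *)
MinMax /@ FindClusters[toy, 4, Method -> "KMeans"]
```

Because every point goes to its nearest center, each cluster in one dimension is an interval, which is why the result can be summarised by MinMax. The built-in method may choose its starting centers differently, so its ranges need not coincide with this sketch.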

The method "KMedoids" gives the same result. The following code produces error messages if you remove the Quiet, but it gives you an idea of what the different methods do:

Quiet[
 (MinMax /@ 
     Check[FindClusters[data, Method -> #], 
      FindClusters[data, 4, Method -> #]]) & /@ 
  StringDelete[
   (Entity["WolframLanguageSymbol", "Method"] /. 
     WolframLanguageData["FindClusters", "CommonOptionValues"]), 
   "\""]]

Cheers,

Marco

POSTED BY: Marco Thiel
Answer
4 months ago

By specifying Method -> "Means" I get an error (copy&paste):

FindClusters::wrgdist: The distance function EuclideanDistance cannot be comuputed on the data.

Using the option DistanceFunction triggers a similar error.

POSTED BY: Veronica Estrada
Answer
4 months ago

It is not surprising that you get an error message when you use

Method->"Means"

because that is not a valid method. The correct version is:

Method->"KMeans"

You would also get an error if you do not specify the "4" for KMeans. The choice of distance function becomes more interesting when the data has more than one dimension. With an appropriate distance function it does work, however:

FindClusters[data, 4, Method -> "KMeans", DistanceFunction -> CanberraDistance]
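
As a further sketch, the DistanceFunction option can also be given as a pure function of two arguments, which makes the distance being used explicit. This assumes data is the list from the original question and that your version accepts a custom function together with the default method; for one-dimensional numbers the absolute difference is the natural choice.

```wolfram
(* Assumption: data is the list of numbers from the question *)
(* Explicit 1-D distance: absolute difference between two values *)
clusters = FindClusters[data, 4, DistanceFunction -> (Abs[#1 - #2] &)];

(* Summarise each cluster by its smallest and largest member *)
MinMax /@ clusters
```

Comparing these ranges with those from the default call shows how much the result depends on the distance and how much on the method.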

Cheers,

Marco

POSTED BY: Marco Thiel
Answer
4 months ago

Hi Marco, sorry, the autocorrect changed what I wrote and I didn't notice before submitting my reply. In fact, I used "KMeans", but I get an error with either

Method -> "KMeans", DistanceFunction -> CanberraDistance

or only

Method -> "KMeans"

POSTED BY: Veronica Estrada
Answer
4 months ago

Hi Veronica,

I do not get an error message. Could you post your notebook and the version of MMA that you are using?

Does your code have the "4"?

Cheers,

Marco

POSTED BY: Marco Thiel
Answer
4 months ago

I think I found it. It does appear to be a problem in Version 11.2; it does work in 11.1. I think this is a known issue and will hopefully be addressed in the future. Sorry I overlooked that.

Cheers,

Marco

PS: Could you please try to execute the examples (applications section) in the documentation of FindClusters? I predict that they, too, will produce error messages.

POSTED BY: Marco Thiel
Answer
4 months ago

Hi Marco, I have Mathematica 11.1.1.0 for Mac OS. My notebook is here: link to github. Thanks for your help! Actually, the error disappeared after I restarted the computer. I don't know what the cause was; clearing everything or opening a new notebook didn't help.

POSTED BY: Veronica Estrada
Answer
4 months ago

Cool. The funny thing is that under 11.2 I could reproduce it. I am on a different machine now and it works....

M.

POSTED BY: Marco Thiel
Answer
4 months ago
