What distance function does FindClusters use?

Posted 7 months ago

My list contains numbers from 0 to 40k. The figure shows the data distribution:

[Figure: data distribution (image not available)]

I tried FindClusters[list]

The output is two clusters as seen here:

{{4169, 7114, 5025, 7316, 4977, 10411, 9352, 16438, 8719, 14330, 
      10277, 7144, 11950, 18572, 10471, 4915, 4958, 7556, 5145, 13862, 
      8466, 14138, 10861, 11815, 5638, 15242, 16666, 23564, 4256, 13014, 
      9865, 3729, 5980, 7740, 14290, 14067, 12038, 14125, 6436, 14240, 
      19054, 9622, 13876, 8362, 5983, 7163, 4908, 12856, 15923, 14368, 
      14467, 9393, 9555, 8537, 9149, 10272, 8228, 6525, 6596, 10401, 6244,
       16576, 15262, 12593, 16128, 13189, 13508, 14206, 15115, 24985, 
      19442, 18195, 14522, 9103, 8781, 9394, 4716, 6760, 9281, 6958, 
      10581, 10862, 11518, 11508, 5691, 8567, 9797, 10897, 9535, 8723, 
      7645, 7035, 7186, 7392, 6913, 7549, 18990, 12778, 15982, 5145, 
      14650, 14468, 13480, 20918, 14713, 17319, 22983, 20166, 9464, 23675,
       8466, 9598, 9698, 7082, 18233, 15193, 11804, 10285, 25290, 17428, 
      11320, 6441, 11868, 14666, 18505, 11778, 12131, 9275, 6347, 13024, 
      19351, 14984, 14150, 18093, 7455, 20572, 14041, 23137, 12763, 14986,
       11280, 13584, 17583, 14394, 17540, 18123, 16960, 9344, 20265, 
      21251, 19206, 25316, 17411, 17123, 17137, 11778, 19055, 15926, 
      18753, 19731, 14524, 21106, 12309, 12357, 17689, 23076, 20067, 
      10224, 16353, 7571, 8493, 8927, 15024, 18869, 14585, 16099, 18462, 
      14361, 15621, 15584, 20522, 18542, 13220, 19124, 16885, 10800, 
      20395, 18752, 17369, 21940, 14893, 14939, 25153, 19275, 15273, 
      18337, 18835, 17250, 26872, 15279, 14366, 15319, 20846, 15711, 
      18547, 20289, 22089, 17250, 18777, 21723, 17813, 21230, 24460, 8375,
       14843, 18409, 4854, 10552, 13598, 14440, 14707, 17834, 18916, 
      22908, 7045, 20264, 20317, 6742, 8589, 15747, 17136, 12764, 18185, 
      6882, 8867, 7009, 13119, 10461, 11362, 14844, 14337, 9780, 7170, 
      8486, 8538, 8758, 8383, 5024, 7285, 10365, 5239, 7644, 8675, 7909, 
      8781, 7353, 6439, 9123, 8136, 11655, 18012, 8834, 11400, 8248, 8207,
       9232, 11126, 24912, 12578, 8352, 13299, 6344, 8347, 6876, 14591, 
      11316, 18416, 11233, 8438, 20095, 10800, 7596, 5791, 7083, 7931, 
      6021, 6088, 13472, 9212, 6992, 8428, 9336, 11558, 10948, 8795, 6353,
       11253, 9172, 15023, 6512, 7775, 11892, 7908, 7545, 8135, 10378, 
      8896, 7302, 12794, 10991, 10490, 7240, 9780, 4285, 4694, 6847, 9383,
       6969, 7879, 12737, 5840, 5550, 12252, 9034, 8661, 10347, 11444, 
      8241, 11445, 11539, 14462, 17701, 13711, 8229, 7458, 12440, 13455, 
      12092, 13517, 12047, 10099, 18228, 14068, 17192, 18021, 12252, 
      11070, 11711, 12952, 12144, 9109, 6563, 4531, 7438, 8839, 15560, 
      11478, 18469, 14584}, {35494, 32082, 27490, 29077, 31458, 31198}}

My second try was to specify the number of clusters using FindClusters[list,4]. The output was:

{{4169, 7114, 5025, 7316, 4977, 10411, 9352, 16438, 8719, 14330, 
  10277, 7144, 11950, 18572, 10471, 4915, 4958, 7556, 5145, 13862, 
  8466, 14138, 10861, 11815, 5638, 15242, 16666, 23564, 4256, 13014, 
  9865, 3729, 5980, 7740, 14290, 14067, 12038, 14125, 6436, 14240, 
  19054, 9622, 13876, 8362, 5983, 7163, 4908, 12856, 15923, 14368, 
  14467, 9393, 9555, 8537, 9149, 10272, 8228, 6525, 6596, 10401, 6244,
   16576, 15262, 12593, 16128, 13189, 13508, 14206, 15115, 19442, 
  18195, 14522, 9103, 8781, 9394, 4716, 6760, 9281, 6958, 10581, 
  10862, 11518, 11508, 5691, 8567, 9797, 10897, 9535, 8723, 7645, 
  7035, 7186, 7392, 6913, 7549, 18990, 12778, 15982, 5145, 14650, 
  14468, 13480, 20918, 14713, 17319, 22983, 20166, 9464, 23675, 8466, 
  9598, 9698, 7082, 18233, 15193, 11804, 10285, 17428, 11320, 6441, 
  11868, 14666, 18505, 11778, 12131, 9275, 6347, 13024, 19351, 14984, 
  14150, 18093, 7455, 20572, 14041, 23137, 12763, 14986, 11280, 13584,
   17583, 14394, 17540, 18123, 16960, 9344, 20265, 21251, 19206, 
  17411, 17123, 17137, 11778, 19055, 15926, 18753, 19731, 14524, 
  21106, 12309, 12357, 17689, 23076, 20067, 10224, 16353, 7571, 8493, 
  8927, 15024, 18869, 14585, 16099, 18462, 14361, 15621, 15584, 20522,
   18542, 13220, 19124, 16885, 10800, 20395, 18752, 17369, 21940, 
  14893, 14939, 19275, 15273, 18337, 18835, 17250, 15279, 14366, 
  15319, 20846, 15711, 18547, 20289, 22089, 17250, 18777, 21723, 
  17813, 21230, 24460, 8375, 14843, 18409, 4854, 10552, 13598, 14440, 
  14707, 17834, 18916, 22908, 7045, 20264, 20317, 6742, 8589, 15747, 
  17136, 12764, 18185, 6882, 8867, 7009, 13119, 10461, 11362, 14844, 
  14337, 9780, 7170, 8486, 8538, 8758, 8383, 5024, 7285, 10365, 5239, 
  7644, 8675, 7909, 8781, 7353, 6439, 9123, 8136, 11655, 18012, 8834, 
  11400, 8248, 8207, 9232, 11126, 12578, 8352, 13299, 6344, 8347, 
  6876, 14591, 11316, 18416, 11233, 8438, 20095, 10800, 7596, 5791, 
  7083, 7931, 6021, 6088, 13472, 9212, 6992, 8428, 9336, 11558, 10948,
   8795, 6353, 11253, 9172, 15023, 6512, 7775, 11892, 7908, 7545, 
  8135, 10378, 8896, 7302, 12794, 10991, 10490, 7240, 9780, 4285, 
  4694, 6847, 9383, 6969, 7879, 12737, 5840, 5550, 12252, 9034, 8661, 
  10347, 11444, 8241, 11445, 11539, 14462, 17701, 13711, 8229, 7458, 
  12440, 13455, 12092, 13517, 12047, 10099, 18228, 14068, 17192, 
  18021, 12252, 11070, 11711, 12952, 12144, 9109, 6563, 4531, 7438, 
  8839, 15560, 11478, 18469, 14584}, {35494}, {24985, 25290, 25316, 
  27490, 25153, 29077, 26872, 24912}, {32082, 31458, 31198}}

Could you explain to me how this function works? I don't want one huge cluster containing most of the values. Instead, I expect the function to recognise clusters for values near 10k, 15k, 20k and 30k. What distance function does FindClusters use?

8 Replies

Hi Veronica,

Have you tried using the Method option?

FindClusters[data, 4, Method -> "KMeans"]

The boundaries of the ranges are:

MinMax /@ FindClusters[data, 4, Method -> "KMeans"]
{{3729, 10099}, {10224, 15621}, {15711, 22089}, {22908, 35494}}
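To judge whether a partition like this is reasonable, it can help to look at the size and mean of each cluster as well as its range. A minimal sketch, assuming a small made-up list `sample` (invented for illustration, not the data from the question); the names `clusters` and `summary` are likewise only illustrative:

```wolfram
(* Hypothetical 1D sample, invented for illustration; not the poster's data *)
sample = {4100, 4900, 5200, 9800, 10300, 10900, 14800, 15200, 15900,
   19800, 20400, 21000, 30900, 31500, 32000};

(* Ask for 4 clusters with k-means, as in the call above *)
clusters = FindClusters[sample, 4, Method -> "KMeans"];

(* One row per cluster: number of elements, {min, max}, and mean *)
summary = {Length[#], MinMax[#], N[Mean[#]]} & /@ clusters
```

Replacing `sample` with the real list should show at a glance whether one cluster still swallows most of the values.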

The method "KMedoids" gives the same result. The following code produces error messages if you remove the Quiet, but it gives you an idea of what the different methods do:

Quiet[
 (MinMax /@ 
     Check[FindClusters[data, Method -> #], 
      FindClusters[data, 4, Method -> #]]) & /@ 
  StringDelete[
   (Entity["WolframLanguageSymbol", "Method"] /. 
     WolframLanguageData["FindClusters", "CommonOptionValues"]), "\""]]
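If the entity lookup is unavailable, a similar comparison can be sketched with a hand-written list of method names. This is only an illustration under stated assumptions: `sample` is a made-up list, `methods` is a short selection of method names (not the full set), and `ranges` is a hypothetical variable name.

```wolfram
(* Hypothetical 1D sample, invented for illustration; not the poster's data *)
sample = {4100, 4900, 5200, 9800, 10300, 10900, 14800, 15200, 15900,
   19800, 20400, 21000, 30900, 31500, 32000};

(* A hand-picked subset of method names; the full list is in the FindClusters documentation *)
methods = {"KMeans", "KMedoids", "Agglomerate"};

(* For each method, request 4 clusters and record the {min, max} of every cluster *)
ranges = AssociationMap[MinMax /@ FindClusters[sample, 4, Method -> #] &, methods]
```

The result is an association keyed by method name, which makes it easy to compare where each method places the cluster boundaries.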

Cheers,

Marco

By specifying Method -> "Means" I get an error (copy&paste):

FindClusters::wrgdist: The distance function EuclideanDistance cannot be comuputed on the data.

Using the option DistanceFunction triggers a similar error.

It is not surprising that you get an error message when you use

Method->"Means"

because that is not a valid method. The correct version is:

Method->"KMeans"

You would also get an error with "KMeans" if you do not specify the "4". The distance function might become more relevant if your data has more than one dimension. With an appropriate distance function, however, it does work:

FindClusters[data, 4, Method -> "KMeans", DistanceFunction -> CanberraDistance]
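On the original question: the error message quoted above names EuclideanDistance, which suggests that this is the default for numeric data. For single numbers it reduces to the absolute difference. A custom distance can also be passed as a pure function. The sketch below is an illustration only: `sample` is made up, `absDist` and `relDist` are hypothetical names, and "KMedoids" is used here on the assumption that a medoid-based method is the safer choice for arbitrary distance functions.

```wolfram
(* Hypothetical 1D sample, invented for illustration; not the poster's data *)
sample = {4100, 4900, 5200, 9800, 10300, 10900, 14800, 15200, 15900,
   19800, 20400, 21000, 30900, 31500, 32000};

(* Absolute difference: what the Euclidean distance reduces to for single numbers *)
absDist = N[Abs[#1 - #2]] &;

(* Canberra-style relative distance for single nonzero numbers, written out by hand *)
relDist = N[Abs[#1 - #2]/(Abs[#1] + Abs[#2])] &;

byAbs = FindClusters[sample, 4, Method -> "KMedoids", DistanceFunction -> absDist];
byRel = FindClusters[sample, 4, Method -> "KMedoids", DistanceFunction -> relDist];

(* Compare the cluster ranges produced by the two distances *)
{MinMax /@ byAbs, MinMax /@ byRel}
```

A relative distance treats the gap between 4000 and 5000 as larger than the gap between 30000 and 31000, so it tends to split small values more finely than the absolute difference does.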

Cheers,

Marco

Hi Marco, sorry, the autocorrect changed what I wrote and I didn't notice before submitting my reply. In fact, I used "KMeans", but I get an error with either

Method -> "KMeans", DistanceFunction -> CanberraDistance

or only

Method -> "KMeans"

Hi Veronica,

I do not get an error message. Could you post your notebook and the version of MMA that you are using?

Does your code have the "4"?
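For reference, the version details can be read directly from the kernel. A small sketch using built-in system symbols; `info` is just an illustrative name:

```wolfram
(* Full version string, numeric version, and platform identifier of the running kernel *)
info = {$Version, $VersionNumber, $SystemID}
```

Posting this output together with the notebook makes version-specific behaviour easier to reproduce.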

Cheers,

Marco

I think I found it. It does appear to be a problem in Version 11.2; it does work in 11.1. I think this is a known issue and will hopefully be addressed in the future. Sorry I overlooked that.

Cheers,

Marco

PS: Could you please try to execute the examples (applications section) in the documentation of FindClusters? I predict that they, too, will produce error messages.

Hi Marco, I have Mathematica 11.1.1.0 for Mac OS. My notebook is here: link to github. Thanks for your help! Actually, the error disappeared after a computer restart. I don't know what the cause was; clearing everything or opening a new notebook didn't help.

Cool. The funny thing is that under 11.2 I could reproduce it. I am on a different machine now and it works....

M.
