Message Boards

Make vector/multivariate clustering/machine learning of a dataset?


Hi, Does anyone know how to perform vector-level clustering of a data set? The problem to solve: I have a data set of 1450 samples. Each sample is a vector of 10 scalar values (numbers). The data is structured as a matrix, i.e. a list of lists of numbers: {{1,2,1,...},{3,1,7,...}...} When I use the function FindClusters, it returns a classification of the scalars themselves, i.e. of each number, but not of the vectors. I want to be able to classify the vectors {1,2,1,...} as single objects, as opposed to each scalar component, which is what Mathematica does when I call FindClusters on the matrix itself. Does anyone know how to do this? Thanks a lot for the answer. Best, Emmanuel.

POSTED BY: Emmanuel Daugeras
3 months ago

FindClusters handles vector input. With no concrete example posted, it is difficult to diagnose what may have gone wrong. Below is a simple (rigged) example that shows clustering of vectors. We use three sets, each created as a separate cluster.

SeedRandom[134];
n = 20;
d = 4;
data = Join[RandomReal[{-1, 1}, {n, d}], RandomReal[{-3, -1}, {n, d}],
    RandomReal[{1, 3}, {n, d}]];

FindClusters[data -> Range[Length[data]]]

(* Out[617]= {{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
   18, 19, 20}, {21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 
  34, 35, 36, 37, 38, 39, 40}, {41, 42, 43, 44, 45, 46, 47, 48, 49, 
  50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60}} *)
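Since FindClusters treats each row of a numeric matrix as one vector, a quick sanity check on the shape of the input can help when the result looks wrong. The following is only a sketch with made-up data (the name data and its values are illustrative, not the original 1450-sample set); it checks that the input is a rectangular list of lists of numbers and reports its dimensions.

```mathematica
(* Illustrative stand-in data: three samples with three numbers each *)
data = {{1, 2, 1}, {3, 1, 7}, {2, 2, 2}};

(* True only for a rectangular list of lists whose entries are all numeric *)
MatrixQ[data, NumericQ]

(* {number of samples, numbers per sample} *)
Dimensions[data]
```

If MatrixQ gives False, for example because of ragged rows, text cells, or an extra level of nesting left over from an import, the clustering functions may not see the data as a list of vectors.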
POSTED BY: Daniel Lichtblau
3 months ago

Hi Daniel, Thank you for the reply. It seemed to work: the FindClusters function did cluster the vectors. However, I have not managed to associate the cluster numbers with the vectors by using the ClusteringComponents function. What I am trying to do is to associate a cluster number with each element of the list. Here is a sample of the data set (called sublistDealers): {{58., 60., 58.}, {61., 65., 61.}, {55., 55., 61.}, {58., 54., 53.}, {63., 65., 67.}, {58., 58., 60.}, {58., 55., 57.}, {54., 64., 63.}, {43., 44., 43.}, {64., 65., 59.}, {51., 54., 48.}, {3., 3., 5.}, {62., 63., 61.}, {54., 52., 53.}, {56., 57., 59.}, {62., 60., 61.}, {46., 46., 47.}, {50., 54., 52.}, {52., 55., 54.}, {60., 57., 59.}, {55., 52., 55.}, {53., 54., 53.}, {51., 53., 56.}, {50., 48., 53.}, {54., 56., 57.}, {50., 52., 51.}, {57., 53., 56.}, {59., 56., 62.}, {45., 49., 47.}, {43., 49., 46.}, {51., 57., 56.}, {46., 44., 51.}, {53., 56., 51.}, {49., 52., 55.}, {46., 48., 51.}, {50., 48., 49.}, {51., 56., 54.}, {37., 45., 44.}, {49., 51., 48.}, {49., 45., 49.}, {42., 47., 42.}, {54., 52., 43.}, {49., 45., 48.}, {53., 52., 51.}, {44., 43., 41.}, {49., 46., 44.}, {47., 46., 50.}, {33., 38., 43.}, {47., 52., 50.}, {36., 31., 36.}, {30., 26., 36.}, {49., 49., 47.}, {44., 45., 46.}, {33., 42., 46.}, {33., 41., 44.}, {45., 47., 48.}, {36., 43., 45.}, {35., 38., 39.}, {50., 55., 48.}, {39., 48., 43.}, {54., 48., 49.}, {39., 38., 37.}, {50., 44., 47.}, {42., 38., 35.}, {41., 43., 50.}, {41., 44., 45.}, {34., 30., 34.}, {43., 47., 45.}, {53., 49., 49.}, {53., 58., 51.}, {8., 3., 3.}, {50., 49., 46.}, {53., 56., 47.}, {50., 47., 49.}, {23., 25., 45.}, {33., 39., 42.}, {43., 49., 45.}, {40., 42., 45.}, {45., 45., 43.}, {46., 41., 46.}, {51., 50., 47.}, {43., 41., 46.}, {42., 48., 40.}, {38., 38., 38.}, {28., 31., 29.}, {38., 42., 43.}, {51., 45., 46.},...}

clusteredDealers = FindClusters[sublistDealers, 2] provides the following list (which is correct): {{{58., 60., 58.}, {61., 65., 61.}, {55., 55., 61.}, {58., 54., 53.}, {63., 65., 67.}, {58., 58., 60.}, {58., 55., 57.}, {54., 64., 63.}, {43., 44., 43.}, {64., 65., 59.}, {51., 54., 48.}, {62., 63., 61.}, {54., 52., 53.}, {56., 57., 59.}, {62., 60., 61.}, {46., 46., 47.}, {50., 54., 52.}, {52., 55., 54.}, {60., 57., 59.}, {55., 52., 55.}, {53., 54., 53.}, {51., 53., 56.}, {50., 48., 53.}, {54., 56., 57.}, {50., 52., 51.}, {57., 53., 56.}, {59., 56., 62.}, {45., 49., 47.}, {43., 49., 46.}, {51., 57., 56.}, {46., 44., 51.}, {53., 56., 51.}, {49., 52., 55.}, {46., 48., 51.}, {50., 48., 49.}, {51., 56., 54.}, {37., 45., 44.}, {49., 51., 48.}, {49., 45., 49.}, {42., 47., 42.}, {54., 52., 43.}, {49., 45., 48.}, {53., 52., 51.}, {44., 43., 41.}, {49., 46., 44.}, {47., 46., 50.}, {33., 38., 43.}, {47., 52., 50.}, {36., 31., 36.}, {30., 26., 36.}, {49., 49., 47.}, {44., 45., 46.}, {33., 42., 46.}, {33., 41., 44.}, {45., 47., 48.}, {36., 43., 45.}, {35., 38., 39.}, {50., 55., 48.}, {39., 48., 43.}, {54., 48., 49.}, {39., 38., 37.}, {50., 44., 47.}, {42., 38., 35.}, {41., 43., 50.}, {41., 44., 45.}, {34., 30., 34.}, {43., 47., 45.}, {53., 49., 49.}, {53., 58., 51.}, {50., 49., 46.}, {53., 56., 47.}, {50., 47., 49.}, {23., 25., 45.}, {33., 39., 42.}, {43., 49., 45.}, {40., 42., 45.}, {45., 45., 43.}, {46., 41., 46.}, {51., 50., 47.}, {43., 41., 46.}, {42., 48., 40.}, {38., 38., 38.}, {28., 31., 29.}, {38., 42., 43.}, {51., 45., 46.}, {37., 39., 41.}, {31., 40., 41.}, {51., 48., 44.}, {39., 42., 41.}, {41., 37., 42.}, {47., 45., 47.}, {46., 41., 40.}, {38., 44., 41.}, {26., 33., 37.}, {39., 48., 48.}, {47., 47., 47.}, {47., 48., 44.}, {43., 44., 40.}, {46., 48., 41.}, {43., 46., 47.}, {46., 57., 43.}, {37., 35., 41.}, {34., 37., 43.}, {45., 42., 42.}, {45., 46., 43.}, {36., 42., 36.},...} However, using the ClusteringComponents[sublistDealers, 2] call, it seems to deliver a 
clustering of the scalars themselves. Here are some sample elements that I get from the list: {{1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {2, 2, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {2, 1, 1}, {1, 1, 2}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 1, 1}, {1, 2, 1}, {2, 2, 1},...} Thank you for your answer. Best, Emmanuel

POSTED BY: Emmanuel Daugeras
3 months ago

For that one uses the optional level argument to ClusteringComponents. I guess it could have been better documented; I had to go a ways down into the examples to find out that it does what is wanted here.

sublistDealers = {{58., 60., 58.}, {61., 65., 61.}, {55., 55., 
    61.}, {58., 54., 53.}, {63., 65., 67.}, {58., 58., 60.}, {58., 
    55., 57.}, {54., 64., 63.}, {43., 44., 43.}, {64., 65., 
    59.}, {51., 54., 48.}, {3., 3., 5.}, {62., 63., 61.}, {54., 52., 
    53.}, {56., 57., 59.}, {62., 60., 61.}, {46., 46., 47.}, {50., 
    54., 52.}, {52., 55., 54.}, {60., 57., 59.}, {55., 52., 
    55.}, {53., 54., 53.}, {51., 53., 56.}, {50., 48., 53.}, {54., 
    56., 57.}, {50., 52., 51.}, {57., 53., 56.}, {59., 56., 
    62.}, {45., 49., 47.}, {43., 49., 46.}, {51., 57., 56.}, {46., 
    44., 51.}, {53., 56., 51.}, {49., 52., 55.}, {46., 48., 
    51.}, {50., 48., 49.}, {51., 56., 54.}, {37., 45., 44.}, {49., 
    51., 48.}, {49., 45., 49.}, {42., 47., 42.}, {54., 52., 
    43.}, {49., 45., 48.}, {53., 52., 51.}, {44., 43., 41.}, {49., 
    46., 44.}, {47., 46., 50.}, {33., 38., 43.}, {47., 52., 
    50.}, {36., 31., 36.}, {30., 26., 36.}, {49., 49., 47.}, {44., 
    45., 46.}, {33., 42., 46.}, {33., 41., 44.}, {45., 47., 
    48.}, {36., 43., 45.}, {35., 38., 39.}, {50., 55., 48.}, {39., 
    48., 43.}, {54., 48., 49.}, {39., 38., 37.}, {50., 44., 
    47.}, {42., 38., 35.}, {41., 43., 50.}, {41., 44., 45.}, {34., 
    30., 34.}, {43., 47., 45.}, {53., 49., 49.}, {53., 58., 51.}, {8.,
     3., 3.}, {50., 49., 46.}, {53., 56., 47.}, {50., 47., 49.}, {23.,
     25., 45.}, {33., 39., 42.}, {43., 49., 45.}, {40., 42., 
    45.}, {45., 45., 43.}, {46., 41., 46.}, {51., 50., 47.}, {43., 
    41., 46.}, {42., 48., 40.}, {38., 38., 38.}, {28., 31., 
    29.}, {38., 42., 43.}, {51., 45., 46.}};

cc = ClusteringComponents[sublistDealers, 2, 1]

(* Out[700]= {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, \
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, \
1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1} *)

Quick check:

Extract[sublistDealers, Position[cc, 2]]

(* Out[703]= {{3., 3., 5.}, {8., 3., 3.}} *)

This is in fact the second cluster provided by FindClusters[sublistDealers, 2].
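The same check can be extended to recover every cluster from the label list. This is a minimal sketch on a small stand-in data set (sample holds a few rows copied from sublistDealers; the names sample, labels and groups are illustrative). It uses Pick to collect the rows that carry each label.

```mathematica
(* Small stand-in for sublistDealers: six rows taken from the data above *)
sample = {{58., 60., 58.}, {61., 65., 61.}, {3., 3., 5.},
   {55., 55., 61.}, {8., 3., 3.}, {58., 54., 53.}};

(* The third argument 1 clusters at level 1, i.e. whole rows, giving one label per row *)
labels = ClusteringComponents[sample, 2, 1];

(* For each label that occurs, pick the rows carrying that label *)
groups = Table[Pick[sample, labels, k], {k, Union[labels]}]
```

The grouping should describe the same partition as FindClusters[sample, 2], though the order of the clusters and the label numbers need not match.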

POSTED BY: Daniel Lichtblau
3 months ago

Daniel, Thanks a lot, this is awesome. It does work. May I ask one last question? What I am trying to do is to cluster data from Excel automatically and add a column at the end of the data with the cluster number. Is it possible to take the source data and append the cluster number to the end of each item? Furthermore, is it possible to determine which clustering method was used and what its parameters are? If it is a linear/hyperplane clustering method, what are the parameters? And if it is another method, which one is it? Thanks in advance. Best, Emmanuel

POSTED BY: Emmanuel Daugeras
3 months ago

The first is quite simple, if I follow correctly what you want. Starting with the computation I already showed, the appending is done as below.

Transpose[{sublistDealers, cc}]

(* {{{58., 60., 58.}, 1}, {{61., 65., 61.}, 1}, {{55., 55., 61.}, 
  1}, {{58., 54., 53.}, 1}, {{63., 65., 67.}, 1}, {{58., 58., 60.}, 
  1}, {{58., 55., 57.}, 1}, {{54., 64., 63.}, 1}, {{43., 44., 43.}, 
  1}, {{64., 65., 59.}, 1}, {{51., 54., 48.}, 1}, {{3., 3., 5.}, 
  2}, {{62., 63., 61.}, 1}, {{54., 52., 53.}, 1}, {{56., 57., 59.}, 
  1}, {{62., 60., 61.}, 1}, {{46., 46., 47.}, 1}, {{50., 54., 52.}, 
  1}, {{52., 55., 54.}, 1}, {{60., 57., 59.}, 1}, {{55., 52., 55.}, 
  1}, {{53., 54., 53.}, 1}, {{51., 53., 56.}, 1}, {{50., 48., 53.}, 
  1}, {{54., 56., 57.}, 1}, {{50., 52., 51.}, 1}, {{57., 53., 56.}, 
  1}, {{59., 56., 62.}, 1}, {{45., 49., 47.}, 1}, {{43., 49., 46.}, 
  1}, {{51., 57., 56.}, 1}, {{46., 44., 51.}, 1}, {{53., 56., 51.}, 
  1}, {{49., 52., 55.}, 1}, {{46., 48., 51.}, 1}, {{50., 48., 49.}, 
  1}, {{51., 56., 54.}, 1}, {{37., 45., 44.}, 1}, {{49., 51., 48.}, 
  1}, {{49., 45., 49.}, 1}, {{42., 47., 42.}, 1}, {{54., 52., 43.}, 
  1}, {{49., 45., 48.}, 1}, {{53., 52., 51.}, 1}, {{44., 43., 41.}, 
  1}, {{49., 46., 44.}, 1}, {{47., 46., 50.}, 1}, {{33., 38., 43.}, 
  1}, {{47., 52., 50.}, 1}, {{36., 31., 36.}, 1}, {{30., 26., 36.}, 
  1}, {{49., 49., 47.}, 1}, {{44., 45., 46.}, 1}, {{33., 42., 46.}, 
  1}, {{33., 41., 44.}, 1}, {{45., 47., 48.}, 1}, {{36., 43., 45.}, 
  1}, {{35., 38., 39.}, 1}, {{50., 55., 48.}, 1}, {{39., 48., 43.}, 
  1}, {{54., 48., 49.}, 1}, {{39., 38., 37.}, 1}, {{50., 44., 47.}, 
  1}, {{42., 38., 35.}, 1}, {{41., 43., 50.}, 1}, {{41., 44., 45.}, 
  1}, {{34., 30., 34.}, 1}, {{43., 47., 45.}, 1}, {{53., 49., 49.}, 
  1}, {{53., 58., 51.}, 1}, {{8., 3., 3.}, 2}, {{50., 49., 46.}, 
  1}, {{53., 56., 47.}, 1}, {{50., 47., 49.}, 1}, {{23., 25., 45.}, 
  1}, {{33., 39., 42.}, 1}, {{43., 49., 45.}, 1}, {{40., 42., 45.}, 
  1}, {{45., 45., 43.}, 1}, {{46., 41., 46.}, 1}, {{51., 50., 47.}, 
  1}, {{43., 41., 46.}, 1}, {{42., 48., 40.}, 1}, {{38., 38., 38.}, 
  1}, {{28., 31., 29.}, 1}, {{38., 42., 43.}, 1}, {{51., 45., 46.}, 
  1}} *)
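If the cluster number should instead be a final column of each row, as in a spreadsheet, one possible variant is to append each label to its row and export the result. This is a sketch under assumptions: sample is a small stand-in for sublistDealers, and the file names dealers.xlsx and dealers_clustered.xlsx are hypothetical placeholders.

```mathematica
(* Small stand-in for sublistDealers: six rows taken from the data above *)
sample = {{58., 60., 58.}, {61., 65., 61.}, {3., 3., 5.},
   {55., 55., 61.}, {8., 3., 3.}, {58., 54., 53.}};

(* With a real workbook, the first sheet could be read like this (hypothetical file name): *)
(* sample = Import["dealers.xlsx", {"Data", 1}]; *)

(* One cluster label per row *)
labels = ClusteringComponents[sample, 2, 1];

(* Append each label to the end of its row: {x1, x2, x3, cluster} *)
withCluster = MapThread[Append, {sample, labels}];

(* Write the augmented table to a spreadsheet (hypothetical file name) *)
Export["dealers_clustered.xlsx", withCluster]
```

An imported sheet may contain a header row or non-numeric cells, which would have to be removed before clustering.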

I do not know of a way to determine the method used, assuming one goes with the Automatic default. One could force a method using the Method option of ClusteringComponents, though. The ClusteringComponents ref guide page gives a set of possibilities.
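As an illustration of forcing a method, the sketch below requests k-means explicitly and then computes the mean vector of each resulting cluster. The choice of "KMeans" is only an example of one documented Method setting, not a statement about what the Automatic default uses, and sample is again a small stand-in for sublistDealers.

```mathematica
(* Small stand-in for sublistDealers: six rows taken from the data above *)
sample = {{58., 60., 58.}, {61., 65., 61.}, {3., 3., 5.},
   {55., 55., 61.}, {8., 3., 3.}, {58., 54., 53.}};

(* Request a specific clustering method instead of the Automatic default *)
labels = ClusteringComponents[sample, 2, 1, Method -> "KMeans"];

(* Rows grouped by label *)
groups = Table[Pick[sample, labels, k], {k, Union[labels]}];

(* Component-wise mean of each cluster, computed from the labels *)
centroids = Mean /@ groups
```

With the method fixed this way, the cluster means give one concrete description of the result, which may be a workable substitute for the parameters asked about.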

POSTED BY: Daniel Lichtblau
3 months ago
