Message Boards Message Boards

GROUPS:

ADASYN (Adaptive Synthetic Sampling) for imbalance datasets

Posted 1 month ago
153 Views
|
0 Replies
|
1 Total Likes
|

Imbalance data occurs when some types of data distribution dominate the instance space compared to other data distributions (He et. al., 2008). It happens frequently in financial fraud data. Using an imbalance dataset in training may result in misleading conclusions for anomaly detection as ML algorithms tend to show bias for the majority class.

One common way to deal with imbalance datasets is using oversampling methods such as SMOTE. He et. al. (2008), instead, introduce an adaptive method that outperforms SMOTE in many cases, at the same time, does not require hypothesis evaluation for generating synthetic data and thus more efficient.

The detailed algorithm and introduction can be found in their paper: https://sci2s.ugr.es/keel/pdf/algorithm/congreso/2008-He-ieee.pdf

I build-up a function for data pre-processing using this ADASYN method:

adasyn[input_, minorClass_, majorClass_, \[Beta]_, K_] :=
 Module[
  {g, m, ms, ml , sinput, linput, r, s, rn, xx, delta},
  ms = Count[Values[input], minorClass];
  ml = Count[Values[input], majorClass];
  m = ms + ml;
  sinput = input[[Flatten[Position[Values[input], minorClass]]]];
  linput = input[[Flatten[Position[Values[input], majorClass]]]];
  xx = ConstantArray[Null, {ms, K}];
  delta = ConstantArray[Null, ms];
  g = ConstantArray[Null, ms];
  r = ConstantArray[Null, ms];
  s = ConstantArray[Null, ms];
  rn = ConstantArray[Null, ms];

  For [i = 1, i <= ms, i++,
   xx[[i]] = 
    Nearest[Flatten[{Keys[input], Delete[Keys[sinput], i]}], 
      Keys[sinput][[i]] , K] // Flatten; 
   delta[[i]] = 
    Length[Position[Keys[linput], Alternatives @@ xx[[i]]]]; 
   r[[i]] = delta[[i]]/K];

  For[i = 1, i <= ms, i++,
   rn[[i]] = r[[i]]/Total[r];
   g[[i]] = rn[[i]]* (ml - ms)*\[Beta] // Ceiling;
   s = ReplacePart[s, i -> ConstantArray[Null, g[[i]]]];
   For [z = 1, z <= g[[i]], z++,
    s[[i, z]] = 
     Keys[sinput][[
        i]] + (RandomSample[xx[[i]], 1] - Keys[sinput][[i]])*
        RandomReal[] /. {x_} -> x
    ]
   ];

  s
  ]

I'm not proficient in mathematica coding and those for loops might be able to optimized in many ways.

This is a very preliminary version that allows for only one key, one minority class. More functionalities can be added further, such as for multiple datas (keys), multiple classes (values).

Package imbalanced-learn in Python has adopted ADASYN and SMOTE methods. Hope mathematica can add this functionality in the near future.

Update: I test the function using Wine Quality example data from Wolfram. Volatile Acidity is the only data selected, and Class 8. are selected as minority in comparison to the majority Class 6.

data = ExampleData[{"MachineLearning", "WineQuality"}, "TrainingData"];
Histogram[Values[data], PlotLabel -> "Class Distribution"]

enter image description here

Histogram[
 Keys[data][[Flatten[Position[Values[data], 8.]]]][[All, 2]], 
 PlotLabel -> "Volatile Acidity for Class 8."]

enter image description here

selectedkey = Keys[data][[All, 2]];
selectedata = Thread[selectedkey -> Values[data]];
ada = Thread[Flatten[adasyn[selectedata, 8., 6., 1, 5]] -> 8.];
ListPlot[{Transpose[{Values[#1], Keys[#1]}], 
    Transpose[{Values[#2], Keys[#2]}]}, 
   PlotStyle -> {Automatic, {Red, Opacity[0.05]}}, 
   ImageSize -> Large] &[selectedata, ada]

enter image description here

Histogram[{Keys[data][[Flatten[Position[Values[data], 8.]]]][[All, 
    2]], Keys[ada]}, 10, "Probability", 
 PlotLabel -> "Comparison: Volatile Acidity for Class 8."]

enter image description here

Histogram[{Values[selectedata], Values[ada]}, 
 PlotLabel -> "Class Disbtibution before and after ADASYN"]

enter image description here

The data of the minority class, which is Class 8. in my case, after ADASYN adjustment seems to be more concentrate though.

Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract