
Importing ARFF files (Weka's Machine Learning format)

I searched for packages that import ARFF files (Weka's machine learning format) into Mathematica but found none. Below is my own attempt. It can surely be improved a lot, both in features and in the code itself. An example of how to use it follows the code.

As noted in the last comment in the code, I tried returning the data as a Dataset, but that seemed to make Classify and Predict much slower. It would be great if a future Mathematica version supported ARFF as a standard import format, and perhaps as an export format as well. The spec for ARFF is here: Attribute-Relation File Format (ARFF)
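For readers who have not seen the format, here is a minimal sketch of what the parser has to deal with. The relation, attribute names and values are made up for illustration, and the sketch only covers the simple case of unquoted, space-separated declarations.

```mathematica
(* A tiny hypothetical ARFF file, built line by line for illustration. *)
sample = StringRiffle[{
    "% a comment line",
    "@relation toy",
    "@attribute width numeric",
    "@attribute height real",
    "@attribute label {small,large}",
    "@data",
    "1.5,2.0,small",
    "?,7.25,large"}, "\n"];

lines = StringSplit[sample, "\n"];

(* Attribute declarations: the second token is the name, the third the type. *)
decls = Select[lines, StringStartsQ[#, "@attribute", IgnoreCase -> True] &];
names = StringSplit[#][[2]] & /@ decls;
types = StringSplit[#][[3]] & /@ decls;

(* Rows follow the @data line; "?" marks a missing value. *)
dataStart = First@FirstPosition[ToLowerCase[lines], "@data"];
rows = StringSplit[lines[[dataStart + 1 ;;]], ","];

{names, types, rows}
```

Attribute names that are quoted or contain spaces would need more careful tokenizing than StringSplit on whitespace.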

ClearAll[fromArff];
fromArff::usage = 
  "fromArff[file(,classIndex)]\nclassIndex is the last attribute (-1) \
by default.";
fromArff[file_, classIx_ : -1] := 
    Module[{content, arff, datastartPos, attributes, attributeNames, 
   realAttributes, attributeNamesF, attributeNamesClass, arffdata, 
   dataset},

  (* Import the file as text; return {} if the import fails *)

  content = Import[file, "Text", "IgnoreEmptyLines" -> True];
  If[content === $Failed, Return[{}]];

  arff = StringSplit[content, "\n"];
  (* Remove comment lines. *)

  arff = arff[[PositionIndex[
       StringMatchQ[arff, RegularExpression["^[^%].*"]]][True]]];

  (* identify the attributes, and delete the empty cases. *)

  attributes = 
   Flatten[DeleteCases[
     StringCases[arff, RegularExpression["@attribute.+"], 
      IgnoreCase -> True], {}]];

  (* identify attribute names (not yet used) and the numeric \
attributes *)

  attributeNames = 
   Flatten[Take[StringSplit[#], {2}] & /@ attributes];
  Print["attributeNames: ", attributeNames];
  realAttributes = 
   Pick[Range[Length[attributeNames]], 
    Flatten[StringContainsQ[Take[StringSplit[#], {3}], 
        RegularExpression["real|numeric"], IgnoreCase -> True] & /@ 
      attributes]];

  (* the data section *)

  datastartPos = 
   First@PositionIndex[
      StringContainsQ[arff, "@data", IgnoreCase -> True]][True];
  arffdata = StringSplit[arff[[datastartPos + 1 ;;]], ","];

  (* Strip single quotes: some files write values as '...', 
  including floats. *)

  arffdata = StringDelete[#, "'"] & /@ arffdata;

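  (* convert real | numeric datafields to numbers, leaving the "?" 
  missing-value marker as a string so it is handled below *)

  arffdata[[All, realAttributes]] = 
   Map[If[# === "?", #, ToExpression[#]] &, 
    arffdata[[All, realAttributes]], {2}];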

  (* Missing values are stated as "?" in ARFF format. 
  Replace with Missing[] *)

  arffdata = Replace[arffdata, ("?" | Null) -> Missing[], 2];

  (* Return the data as "plain" data, i.e. not as a Dataset *)

  dataset = Drop[#, {classIx}] -> #[[classIx]] & /@ arffdata 

  (* trying to add attribute names: 
  This makes the classification much slower and requires more RAM/
  CPU.  *)
  (* 
  attributeNamesF = Drop[attributeNames,{classIx}];
  attributeNamesClass = attributeNames[[classIx]]; 
  dataset = Thread[attributeNamesF \[Rule] 
  Drop[#,{classIx}]] \[Rule]  #[[classIx]]  &/@ arffdata
  *)
  ]
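On the Dataset remark in the last code comment: one possible arrangement, sketched here with made-up names and rows rather than real fromArff output, is to keep the plain rules for Classify or Predict and build a Dataset separately, only for browsing. Whether this avoids the slowdown I saw is an assumption I have not measured.

```mathematica
(* Made-up names and rows standing in for attributeNames and arffdata. *)
names = {"width", "height", "label"};
rows = {{1.5, 2.0, "small"}, {Missing[], 7.25, "large"}};

(* Plain rules for Classify/Predict, in the same shape fromArff returns. *)
plain = Most[#] -> Last[#] & /@ rows;

(* A Dataset with named columns, for browsing and querying only. *)
ds = Dataset[AssociationThread[names, #] & /@ rows];
Normal[ds[All, "label"]]
```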

Here is a simple example of how to use fromArff:

dataset = fromArff["http://hakank.org/weka/iris.arff"];
cl = Classify[dataset]
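To get a rough idea of how well the classifier does, one could hold out part of the data. This is only a sketch: it assumes fromArff is defined as above and that the URL is reachable, and the 70/30 split and the random seed are arbitrary choices of mine.

```mathematica
(* Assumes fromArff is defined as above and the file can be downloaded. *)
dataset = fromArff["http://hakank.org/weka/iris.arff"];

(* Shuffle, then keep 70% for training and 30% for testing. *)
SeedRandom[1];
{train, test} = 
  TakeDrop[RandomSample[dataset], Round[0.7 Length[dataset]]];

cl = Classify[train];
ClassifierMeasurements[cl, test, "Accuracy"]
```

A class attribute that is not in the last position can be selected with the second argument, e.g. fromArff[file, 1] for the first attribute.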

Note: my Weka page has many ARFF files: http://hakank.org/weka/

Any updates? I am interested in using ARFF files in Mathematica to segment images.
