Message Boards

Proper ValidationSet use in Classify?

Posted 8 years ago

I am using Classify[] to build a classifier for images, and I am not sure about the proper use of the ValidationSet option. My understanding is that without the ValidationSet option, cross-validation will be used. I do not think this is stated explicitly in the documentation, but I have read it online someplace.

First, I divide my data equally into training and test sets.

SeedRandom[500];
forTestSet = EvenQ@Range[Length@data];
forTrainingSet = Map[Not[#] &, forTestSet];
testSetClasses = Pick[Normal@data[All, "Type"], forTestSet];
trainingSetClasses =  Pick[Normal@data[All, "Type"], forTrainingSet];
testSetImagesRGB = Pick[Normal@data[All, "Image"], forTestSet];
trainingSetImagesRGB =  Pick[Normal@data[All, "Image"], forTrainingSet];
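(Note that the even/odd split above is deterministic, so SeedRandom has no effect on it. If a randomized split is intended, something like the following sketch would work, assuming data is a Dataset with "Type" and "Image" columns:)

SeedRandom[500];
n = Length[data];
testPos = RandomSample[Range[n], Floor[n/2]];  (* random half for testing *)
trainPos = Complement[Range[n], testPos];       (* the rest for training *)
testSetClasses = Normal@data[[testPos]][All, "Type"];
trainingSetClasses = Normal@data[[trainPos]][All, "Type"];
testSetImagesRGB = Normal@data[[testPos]][All, "Image"];
trainingSetImagesRGB = Normal@data[[trainPos]][All, "Image"];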

I use one of these two statements to set up the classifier:

imageClassifyerRGB =
 Classify[trainingSetImagesRGB -> trainingSetClasses,
  PerformanceGoal -> "Quality",
  Method -> "NeuralNetwork",
  ValidationSet ->
   MapThread[Rule[#1, #2] &, {testSetImagesRGB, testSetClasses}]]

imageClassifyerRGB =
 Classify[trainingSetImagesRGB -> trainingSetClasses,
  PerformanceGoal -> "Quality",
  Method -> "NeuralNetwork"]

I am judging the accuracy with:

ClassifierMeasurements[ imageClassifyerRGB, 
 testSetImagesRGB -> testSetClasses, "Accuracy"]

Should I be using the ValidationSet option this way, or will this result in overfitting my data?

POSTED BY: Jeff Burns
2 Replies

As far as I know, Classify (and machine learning methods in general) is well suited for supervised classification: the nature of several regions is known, and these are used as the training set. So using randomly drawn samples for that purpose seems odd. For unsupervised classification, ClusteringComponents is better suited, in my opinion.

POSTED BY: Claude Mante

I don't know the internals, and I would be interested to see what people more knowledgeable have to say. But the first setup you show would make me very nervous about overfitting, since the validation set is used in the creation of the classifier and is also used as the test set. I would expect that, in order to get trustworthy results, the three sets (training, validation, testing) would need to be mutually non-overlapping.
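(Following this advice, a three-way disjoint split might look like the sketch below; the 60/20/20 proportions and variable names are illustrative, not prescriptive. TakeList splits the shuffled index list into consecutive disjoint blocks.)

SeedRandom[500];
n = Length[data];
perm = RandomSample[Range[n]];  (* random permutation of positions *)
{trainPos, validPos, testPos} =
  TakeList[perm, {Floor[0.6 n], Floor[0.2 n], All}];

classifier = Classify[
   Thread[trainingImages -> trainingClasses],
   ValidationSet -> Thread[validationImages -> validationClasses],
   PerformanceGoal -> "Quality", Method -> "NeuralNetwork"];

ClassifierMeasurements[classifier,
  Thread[testImages -> testClasses], "Accuracy"]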

POSTED BY: Daniel Lichtblau
