Give Classify a list of files on disk instead of importing all at once?

Posted 3 years ago

I am trying to use Classify to train an image classifier on a dataset of 30k+ images. Following this tutorial, I created a list of the form {File[…] -> class, …}. I can use NetTrain with this list, but Classify does not understand that it should import the image files: it treats the File references themselves as the inputs, as evidenced by the input type reading "Nominal". Unsurprisingly, it achieves a terrible result.
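For concreteness, the list looks roughly like this (the paths and class names here are illustrative, not my actual data):

data = {File["train/cat/001.png"] -> "cat", File["train/cat/002.png"] -> "cat",
   File["train/dog/001.png"] -> "dog", File["train/dog/002.png"] -> "dog"};
c = Classify[data]
(* the classifier reports its input type as "Nominal": the File
   expressions themselves become the inputs, not the images *)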

How can I pass a list of File references to Classify? Or, alternatively, how can I write some sort of data loader that imports only the images currently needed, and pass that to Classify?

Thanks in advance.

POSTED BY: Sepehr Elahi
5 Replies

I did a small project to classify certain tropical fish. Importing the images worked well, but, like you, I faced a similar issue in handling large image archives.

(* keep the three different classes of images in separate folders *)

SetDirectory[NotebookDirectory[]];
C01 = FileNames[All, "Fishes\\Guppy"];
C02 = FileNames[All, "Fishes\\Sword"];
C03 = FileNames[All, "Fishes\\Zebra"];

(* define a rule from each file path to its class *)

Data01 = Thread[C01 -> "Guppy"];
Data02 = Thread[C02 -> "Sword"];
Data03 = Thread[C03 -> "Zebra"];

(* combine all of the rules; Union also sorts them and drops duplicates *)

Data = Union[Data01, Data02, Data03];

(* import the images, keeping the class labels *)

myData = Import[First[#]] -> Last[#] & /@ Data;

(* train the classifier *)

myClassify = Classify[myData, TargetDevice -> "CPU"]
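Once trained, the classifier can be applied to a freshly imported image, e.g. (the test path is illustrative):

myClassify[Import["Fishes\\Test\\fish01.jpg"]]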
POSTED BY: Teck Boon Lim
Posted 3 years ago

Hi Sepehr,

Using Classify with 30K images should not require out-of-core support.

What are the dimensions of the images? There is usually no need to train on large images.

How many different classes are in the dataset? If the distribution of images among classes is fairly flat, you may not need all 30K images to train. If there is significant class imbalance, you will probably get better results by randomly sampling a set of images with minimal class imbalance, e.g. as in the sketch below.
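A minimal sketch of balanced sampling, assuming data is your list of file -> class rules:

(* group the rules by class, then take the same number of random
   examples from every class *)
byClass = GroupBy[data, Last];
n = Min[Length /@ byClass];
balanced = Join @@ Values[RandomSample[#, n] & /@ byClass];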

POSTED BY: Rohit Namjoshi
Posted 3 years ago

Thanks for your reply!

Even if the resolution of each image is 200x200 (which is a conservative estimate), storing them as 8-byte machine reals means the imported grayscale images alone will take 30000x200x200x8 bytes ≈ 9.6 GB! I tried training on a smaller subset of the dataset, but the results were not that good, so I want to train on the entire dataset.
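For reference, the arithmetic:

30000*200*200*8/10.^9
(* 9.6, i.e. about 9.6 GB of grayscale machine-real image data *)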

POSTED BY: Sepehr Elahi
Posted 3 years ago

For 8-bit color depth images, isn't it 3.6 GB?

200 * 200 * 3 * 30000
(* 3600000000 *)

You did not answer the question about the number of classes and whether they are approximately equally represented in the dataset. If there is a large class imbalance, then randomly sample from each class a number of images equal to the size of the smallest class. Have you tried that?

POSTED BY: Rohit Namjoshi
Posted 3 years ago

Each floating-point number is 8 bytes, right? So you would need to multiply by 8 as well, and the computation I did was just for grayscale images. The classes are approximately balanced; there are around 1000-1300 images for each of the 30 classes. Even if the sampling approach worked, I still want to know whether there is a way to feed a list of files to Classify, or alternatively how to write a sampler/data loader. This seems like a generally needed capability, whether because the files are too large to fit into memory or because the dataset is online (i.e., each image is a URL reference and not saved on disk). You can do both of these with NetTrain, e.g. with a generator function as sketched below.
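A minimal sketch of such an out-of-core generator for NetTrain, assuming files is the list of File[…] -> class rules and net is the network being trained (both names are placeholders):

(* NetTrain calls the generator once per batch; the association it
   passes in includes the requested "BatchSize" *)
generator[batch_Association] :=
  Module[{sample = RandomSample[files, batch["BatchSize"]]},
   Import[First[#]] -> Last[#] & /@ sample]

trained = NetTrain[net, generator, BatchSize -> 32]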

POSTED BY: Sepehr Elahi