
Is there a more efficient way to import a large number of files?

I'm working on an image classification machine learning project and have a somewhat large (~800 MiB) training data set. The data set is composed of individual PNG image files organized in directories. I import them all with:

loadedData = ParallelMap[Import, dataFiles]

It takes an extremely long time (10+ minutes), and ends up using 21 GiB of memory. Obviously, something seems to be wrong. Is there a more efficient or more correct way to be loading all these images?

POSTED BY: Alex O'Brien
Answer
3 months ago

Try:

loadedData = ParallelMap[Import[#, IncludeMetaInformation -> None] &, dataFiles]
POSTED BY: Piotr Wendykier
Answer
3 months ago

This appears not to have made any (or much of a) difference. I'm not on a machine capable of loading 21 GiB of data at the moment, but it's still using at least 16 GiB before failing. I'm bewildered as to where all of this apparent data is coming from.

POSTED BY: Alex O'Brien
Answer
3 months ago

Also, if you want to train a neural net with those images you can perform an out-of-core training. See this example: http://www.wolfram.com/language/11/neural-networks/out-of-core-image-classification.html?product=language
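
A minimal sketch of what that can look like, assuming the class label of each image is the name of its parent directory. The directory `dataDir`, the 64x64 input size and the tiny network are placeholders for illustration; they are not taken from the linked example.

```wolfram
(* dataDir is a hypothetical root directory with one subdirectory per class *)
files = FileNames["*.png", dataDir, Infinity];
labels = FileNameTake[#, {-2, -2}] & /@ files;   (* parent directory name *)
classes = Union[labels];

(* File[...] objects are only read and decoded when a batch needs them *)
trainingData = Thread[(File /@ files) -> labels];

(* placeholder network; the image encoder resizes every file to 64x64 *)
net = NetChain[
   {ConvolutionLayer[16, 3], Ramp, PoolingLayer[2, 2],
    FlattenLayer[], LinearLayer[Length[classes]], SoftmaxLayer[]},
   "Input" -> NetEncoder[{"Image", {64, 64}}],
   "Output" -> NetDecoder[{"Class", classes}]];

trained = NetTrain[net, trainingData, BatchSize -> 32];
```

Because the training data holds file references rather than imported images, only the current batch has to be decoded in memory at any one time.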

POSTED BY: Shadi Ashnai
Answer
3 months ago

I'm aware of out-of-core training, but I had expected that it wouldn't be necessary for the size of my data.

POSTED BY: Alex O'Brien
Answer
3 months ago

PNG is usually (always?) compressed. When it's imported into the WL, the size increases (sometimes quite dramatically), as it's stored as an uncompressed array in memory. This is one major advantage of out-of-core training: you can store all your images in a compressed format on disk and only have small batches of uncompressed images in memory at any one time.

One last thing: JPG is a better format for out-of-core learning than PNG, as the Image NetEncoder is faster for this format.
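
To get a rough feel for the expansion, one can compare a file's size on disk with the size of the imported image in memory. This is only a sketch: `dataFiles` is assumed to be the list of file paths from the original question, and the first file is assumed to be reasonably representative.

```wolfram
(* dataFiles is assumed to be the list of PNG paths from the question *)
sample = First[dataFiles];
img = Import[sample];

onDisk = FileByteCount[sample];   (* compressed size on disk, in bytes *)
inMemory = ByteCount[img];        (* size of the imported expression, in bytes *)

(* raw pixel data is roughly width * height * channels * bytes per channel *)
{ImageDimensions[img], ImageChannels[img], ImageType[img]}

N[inMemory/onDisk]                (* expansion factor for this one file *)
```

Multiplying a typical expansion factor by the total size on disk gives a ballpark figure for the memory needed to hold every image at once.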

Answer
3 months ago

As for deciding what not to load, or pre-loading: the operating system and the hard disk electronics already do some of that.

POSTED BY: John Hendrickson
Answer
3 months ago
  • Make sure the images are located on a local disk, not in the cloud.

  • You're missing a semicolon at the end of your posted expression (it suppresses the output of the expression). If you did NOT use a semicolon, then your initial statement may be untrue: it may be the Front End, not the Mathematica Kernel, that is causing the delay. (There are ways to speed that up, but I consider that a different topic.)

  • As said above: some options may affect loading time.

  • As said above: you may change the file format (on disk, then import the files afterward), but you may lose image data doing so. If you aren't very familiar with what is in the data and it is valuable, you should not.

  • State what OS you're using. Sierra? Win10? Linux?

  • What are the file sizes in bytes in total? 100MB? How many files in total? What size is each? PNG files can have differing structure and require rewriting into Mathematica's format, but on a modern machine it should not take 10 minutes.

  • Is it still slow and using lots of memory WITHOUT ParallelMap?
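
Several of these checks can be run directly in a notebook. A rough sketch, assuming `dataFiles` is the list of paths from the original post; the sample size of 100 is arbitrary.

```wolfram
(* number of files and total size on disk, in bytes *)
{Length[dataFiles], Total[FileByteCount /@ dataFiles]}

(* time a plain, non-parallel import of a small sample; the trailing
   semicolon keeps the Front End from rendering every image *)
sample = Take[dataFiles, UpTo[100]];
before = MemoryInUse[];
{seconds, images} = AbsoluteTiming[Import /@ sample];

seconds                      (* wall-clock time for the sample *)
MemoryInUse[] - before       (* approximate kernel memory now held by the sample, in bytes *)
```

If the sample imports quickly and the memory growth is modest, scaling both numbers up to the full file count gives an estimate to compare against the ParallelMap run.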

POSTED BY: John Hendrickson
Answer
3 months ago
