
This post describes using the Classify function in Mathematica for image processing, specifically on images of natural clouds. A CNN (Convolutional Neural Network), or at least one of Mathematica's more image-oriented functions, would almost certainly be more robust. My purpose in using Classify, however, was to show that a reasonable machine learning tool can be built with very simple code. Developing a CNN requires thinking about how to structure layers and, at the very least, a decent understanding of data analytics; the same goes for the image-processing functions. In contrast, Classify can create a machine learning tool in literally one line, which is perfect for beginners like me. Below is my exploration of using Classify for a simple task.
 
This was done as part of the Wolfram Mentorship Program.
 
Developing a Classifier for Cloud Images
The goal, as the title implies, is to develop a classifier function that can act on cloud images. The main focus is identifying four major cloud types: Cumulus, Cirrus, Contrail, and Stratus. Development proceeded as follows:
 
 
- Initial attempt at classification, using images gathered from the NASA cloud image database, to form a general idea of how the classification would work.
- Attempts at improving accuracy over the initial try, via various methods of pre-processing images.
- Moving away from pre-processing images and onto using API services to generate larger datasets.
- Final classification function, able to handle many image requests, based on an offline image file database.
In the initial attempt (and the several that followed), my idea was to use a small training dataset to evaluate the effectiveness of the classifier and avoid spending undue time iterating over a large one. Once a valid classifier function was found, I could scale up from the small dataset to a larger one. For the initial attempt, I took raw images directly from the NASA Cloud Classroom Database and ran the Classify function directly on them.
(As the code had images embedded directly in it, it can't be shown here; however, it is present in the attached file.)
The classifier did run. However, there were severe problems with accuracy and validity: it had an accuracy of about 20%, which is frankly unusable for any practical purpose. To improve this, I tried some basic pre-processing techniques.
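For readers without the attached file, the general shape of such a call can be sketched as follows. This is only an illustration, not the code from the attachment: makeTile, the class names "bright" and "dark", and the synthetic grayscale tiles are hypothetical stand-ins for the actual cloud photographs.

```mathematica
(* Hypothetical stand-in for real photos: a 32x32 grayscale tile of noisy brightness *)
makeTile[level_] := Image[RandomReal[{level - 0.1, level + 0.1}, {32, 32}]];

(* Labeled examples as an association: class name -> list of images *)
trainSet = <|"bright" -> Table[makeTile[0.8], 5],
   "dark" -> Table[makeTile[0.2], 5]|>;
testSet = <|"bright" -> Table[makeTile[0.8], 3],
   "dark" -> Table[makeTile[0.2], 3]|>;

(* One line trains the classifier; a second measures it on separate test images *)
classifier = Classify[trainSet];
measurements = ClassifierMeasurements[classifier, testSet];
measurements["Accuracy"]
```

With real data, the associations would map each cloud type to its list of photographs, and the test images should be kept separate from the training images.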
 
Classification with pre-processing of images
I thought of two possible methods of pre-processing the images: using Binarize/EdgeDetect to remove potentially distracting minor features that did not belong to clouds (trees, etc.), or using background cropping or padding to remove those features directly.
As I write in the attached file, the results of binarizing (and EdgeDetect) were as follows: Trying to detect edges and binarize images has many undesired consequences. While some details become prominent (notably the edges), many subtle details are removed entirely through the loss of color. As color is a major factor in classifying species of clouds (beyond their general names), the results of this method are not only undesirable but also restrict future scalability. EdgeDetect, while bringing edge detail to the forefront, also adds new outlines to the image, further complicating and detracting from its original features. These issues imply that this is not an appropriate pre-processing method.
Pre-processing with background cropping brought marginally better results. The code below shows the three main variations of the general cropping idea I tried:
 
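(* Reference sky blue used to tell sky pixels from everything else *)
skyColor = RGBColor[.4, .7, 1];
(* Variant 1: build a mask that is white where pixels are close to color,
   then crop all four sides by the mask's border widths (echoed for inspection) *)
cropNonSky[img_?ImageQ, color_] := 
 Module[{skyMask}, 
  skyMask = 
   Erosion[ColorNegate[
     Binarize[ImageAdjust[ColorDistance[img, color]], .3]], 0];
  ImagePad[img, -Echo[BorderDimensions[skyMask], "BorderDimensions"]]]
(* Variant 2: same mask, but crop only the bottom edge *)
cropNonSky2[img_?ImageQ, color_] := 
 Module[{skyMask}, 
  skyMask = 
   Erosion[ColorNegate[
     Binarize[ImageAdjust[ColorDistance[img, color]], .3]], 0];
  ImagePad[img, {{0, 0}, {-BorderDimensions[skyMask][[2]][[1]], 0}}]]
(* Variant 3: like variant 1, but take the first color returned by
   DominantColors as the sky color instead of a fixed one *)
cropNonSky3[img_?ImageQ] := 
 Module[{skyMask, color = DominantColors[img][[1]]}, 
  skyMask = 
   Erosion[ColorNegate[
     Binarize[ImageAdjust[ColorDistance[img, color]], .3]], 0];
  ImagePad[img, -BorderDimensions[skyMask]]]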
Apart from this, I also used a basic crop for comparison:
 
ImageTake[img, Round[3/4 ImageDimensions[img][[2]]]]
(img represents the actual image when used in the code)
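A fixed-fraction crop of this kind can also be wrapped in a small reusable function. The following is a minimal sketch under my own assumptions: cropTop and its optional fraction argument are hypothetical names, and the blank ConstantImage is only a placeholder for a real photograph.

```mathematica
(* Hypothetical helper: keep the top `fraction` of the image rows *)
cropTop[img_?ImageQ, fraction_ : 3/4] := 
  ImageTake[img, Round[fraction Last[ImageDimensions[img]]]];

(* Placeholder image: 80 pixels wide, 60 pixels high *)
sample = ConstantImage[LightBlue, {80, 60}];

ImageDimensions[cropTop[sample]]       (* {80, 45} *)
ImageDimensions[cropTop[sample, 1/2]]  (* {80, 30} *)
```

ImageTake with a single integer keeps that many rows counted from the top, so the width is unchanged and only the lower part of the picture (where trees and buildings tend to be) is discarded.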
Running these on the images before passing them to the classifier improved the accuracy overall (the general cropping functions improved accuracy by 10%, but interestingly enough, the basic crop improved accuracy by 40%). My notes on this are as follows:
There are inherent dangers in removing backgrounds and in cropping and padding, similar to those of binarizing and edge detecting: namely, there is a large probability of removing necessary detail along with the extraneous detail. Nevertheless, I noticed a small improvement from the unprocessed classifier model to the processed one (interestingly, the constant-fraction cut worked better than the more localized padding schemes). While this demonstrates that the processing concept works, the gains are neither sufficient to make the classifiers functional nor truly scalable, since cropping blindly without regard to detail is inapplicable to many possible cloud images. Hence, the pre-processing experiments concluded with the thought that, while the devised methods can improve classifier accuracy, they still have several problems with scalability (due to the overall loss of detail in several of them). From here, my next thought was to work directly with a larger dataset, to avoid further pitfalls caused by lack of scalability.
 
To speed up gathering a large dataset, I used WebImageSearch, backed by the Bing API service, to download many images with a single call.
For example, here is the code used to download images of stratus clouds:
 
Cases[WebImageSearch[SearchQueryString["Stratus Clouds"], "Images", 
   "MaxItems" -> 30], _Image];
By using the Cases function, failed API results are automatically removed from the dataset.
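To make the filtering step concrete, here is a small sketch on a made-up result list. The list itself is hypothetical: the $Failed and Missing entries merely stand in for whatever a failed download might return.

```mathematica
(* Hypothetical mixed results: two images plus two failure placeholders *)
results = {ConstantImage[White, {4, 4}], $Failed, 
   Missing["NotAvailable"], ConstantImage[Gray, {4, 4}]};

(* Keep only the elements whose head is Image *)
kept = Cases[results, _Image];
Length[kept]  (* 2 *)
```

One consequence is that a filtered set may end up with fewer images than the requested maximum.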
The following code is the attempt that used a large dataset of images directly for classification. No pre-processing was applied to these images, yet the measured accuracy was 100%, so I treated this as the final attempt.
 
largeStratusSet = 
  Cases[WebImageSearch[SearchQueryString["Stratus Clouds"], "Images", 
    "MaxItems" -> 30], _Image];
largeCumulusSet = 
  Cases[WebImageSearch[SearchQueryString["Cumulus Clouds"], "Images", 
    "MaxItems" -> 30], _Image];
largeCirrusSet = 
  Cases[WebImageSearch[SearchQueryString["Cirrus Clouds"], "Images", 
    "MaxItems" -> 30], _Image];
largeContrailSet = 
  Cases[WebImageSearch[SearchQueryString["Contrail Clouds"], "Images",
     "MaxItems" -> 30], _Image];
trainingNumbers = List /@ RandomSample[Range[30], 15]
trainLargeStratusSet = Extract[largeStratusSet, trainingNumbers];
trainLargeCumulusSet = Extract[largeCumulusSet, trainingNumbers];
trainLargeCirrusSet = Extract[largeCirrusSet, trainingNumbers];
trainLargeContrailSet = Extract[largeContrailSet, trainingNumbers];
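(* Held-out test images: everything in the full set that was not sampled for training *)
testLargeStratusSet = Complement[largeStratusSet, trainLargeStratusSet];
testLargeCumulusSet = Complement[largeCumulusSet, trainLargeCumulusSet];
testLargeCirrusSet = Complement[largeCirrusSet, trainLargeCirrusSet];
testLargeContrailSet = Complement[largeContrailSet, trainLargeContrailSet];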
cloudTrainRule = <|"Stratus cloud" -> trainLargeStratusSet, 
   "Cumulus cloud" -> trainLargeCumulusSet, 
   "Cirrus cloud" -> trainLargeCirrusSet, 
   "Contrail cloud" -> trainLargeContrailSet|>;
cloudTestRule = <|"Stratus cloud" -> testLargeStratusSet, 
   "Cumulus cloud" -> testLargeCumulusSet, 
   "Cirrus cloud" -> testLargeCirrusSet, 
   "Contrail cloud" -> testLargeContrailSet|>;
cgt = Classify[cloudTrainRule];
cm = ClassifierMeasurements[cgt, cloudTestRule];
cm["Accuracy"]
Output=1.0
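The listing above samples fixed indices from Range[30], which assumes that every search returned exactly 30 usable images. Since Cases may drop failed downloads, a split that adapts to the actual size of each set may be safer. A minimal sketch, where splitSet is a hypothetical helper that is not part of the original code, and integers stand in for images:

```mathematica
(* Hypothetical helper: shuffle the items, then cut them into {train, test} *)
splitSet[items_List, fraction_ : 1/2] := 
  TakeDrop[RandomSample[items], Floor[fraction Length[items]]];

(* Integers stand in for images here *)
{train, test} = splitSet[Range[30]];

Length /@ {train, test}    (* {15, 15} *)
Intersection[train, test]  (* {} : no item appears in both parts *)
```

Applied to each image set in turn, for example {trainLargeStratusSet, testLargeStratusSet} = splitSet[largeStratusSet], this keeps the training and test images disjoint regardless of how many downloads succeeded.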
The concluding notes on all this are as follows: The main issue with the classification of clouds arose not from the pre-processing of images (although that did have a small positive influence, and would be a route to future improvement), but from the lack of an adequately sized training set. Simply increasing the set size per category from 5 to 15 images allowed a 100% accuracy rate. The remaining limitations come from lacking a truly comprehensive training set and a large enough dataset; further limitations come from the computational cost of handling so many images. It is also worth noting that for images containing multiple cloud types, or even multiple layers of such clouds, the accuracy of the classifier does drop. This is understandable, as the classifier was aimed at images populated by a single cloud type. Most importantly, the major limitation arises from the method of gathering the data: it was downloaded arbitrarily via a web search, which made for easy access to images in Mathematica but also let in non-vetted images that are not useful for training due to their irrelevance. Finally, by using only the Classify function, I was restricted in how much I could fine-tune the model. The next step would be to actually develop a basic convolutional neural network to handle the analysis.
 
Note of thanks
I would like to thank the Wolfram Mentorship Program and my advisor, Ms. Andrea Griffin, for providing me this opportunity.
Shashank Swaminathan
 
I have also made a cloud function of the above work - below is a screenshot of it running:

 
Here are three sample cloud images to get started - enjoy!
 
  
 