
[WSC17] Finding Kiwi Calls In Audio Recordings


Recognizing Kiwi Bird Calls in Audio Recordings


My project for WSC 2017 was identifying kiwi calls in audio recordings. The project can be broken down into two main steps:

  1. Finding clips that contain noise that could be a kiwi
  2. Identifying the clips that actually contain a kiwi call

Finding Clips


The data is an audio recording taken overnight in Northland, New Zealand.


I started off by band-pass filtering the audio to keep only the range between 1200 Hz and 3600 Hz, which removes the majority of the noise. I then normalized the audio to make the volume consistent.

audioProcessed = AudioNormalize[HighpassFilter[LowpassFilter[audio, Quantity[3600, "Hertz"]], Quantity[1200, "Hertz"]]]

At first I just took any intervals of audio whose RMS amplitude was above a threshold of 0.02.

AudioIntervals[audioProcessed, #RMSAmplitude > 0.02 &]

This yielded many very short clips, so I then extended each clip by 1 second in each direction and merged clips that were within 2 seconds of each other.

int = ({#[[1]] - 1, #[[2]] + 1} & /@ AudioIntervals[audioProcessed, #RMSAmplitude > 0.02 &]) //. 
   {pre___, {c_, d_}, {x_, y_}, post___} /; d > x - 2 :> {pre, {c, y}, post}

This still left many clips just over 2 seconds long, so I removed any clip shorter than 10 seconds, since kiwi calls are longer than that.

Cases[int, {x_, y_} /; y - x > 10]
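
To see what the extend, merge and length-filter steps do, here is a small sketch that runs the same rules on made-up intervals (in seconds) instead of real AudioIntervals output. The names raw, extended, merged and long are only for illustration, and the merge rule is written with :> so that its right-hand side is evaluated only after a match.

```wolfram
(* Made-up intervals in seconds, standing in for AudioIntervals output *)
raw = {{3, 4}, {5, 6}, {12, 13}, {20, 32}};

(* Extend every interval by 1 second in each direction *)
extended = {#[[1]] - 1, #[[2]] + 1} & /@ raw;
(* {{2, 5}, {4, 7}, {11, 14}, {19, 33}} *)

(* Merge neighbours while the next interval starts less than
   2 seconds after the previous one ends *)
merged = extended //. 
   {pre___, {c_, d_}, {x_, y_}, post___} /; d > x - 2 :> {pre, {c, y}, post};
(* {{2, 7}, {11, 14}, {19, 33}} *)

(* Keep only intervals longer than 10 seconds *)
long = Cases[merged, {x_, y_} /; y - x > 10]
(* {{19, 33}} *)
```

Only the last interval survives: the first two raw intervals merge into a single 5-second span, which is still too short to be a kiwi call.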

I then trimmed each of the clips to exactly 10 seconds so that every input to the machine learning step has the same length.


Identifying calls

Unsupervised Learning

Since my data was unclassified, I initially tried using unsupervised learning to cluster the audio clips, although this didn't yield any particularly meaningful results.

FeatureSpacePlot[clips, LabelingFunction -> (#2[[2]] &), PerformanceGoal -> "Quality"]


In the resulting plot, the middle section is about 50% kiwi calls and 50% not kiwi calls, and so is the ring around the outside, so I have no idea what the feature extractor is looking at.

Classifying Data

Since unsupervised learning didn't work particularly well I decided to manually classify every single clip.

Doesn't that sound fun...

clipId = 1;
clipClasses = Range@Length[clips2];
(* Shows the current clip with Yes/No buttons that record a label and move on *)
Dynamic@Row[{clipId, "/", Length[clips2], 
   If[clipId <= Length[clips2], clips2[[clipId]], "DONE!"], 
   Button["Yes", clipClasses[[clipId]] = "Kiwi"; clipId = clipId + 1],
   Button["No", clipClasses[[clipId]] = "NotKiwi"; clipId = clipId + 1]}]
(* Shows the label just given to the previous clip *)
Dynamic[Row[{clipId - 1, clipClasses[[clipId - 1]]}]]


Neural Network - Take One

I used 200 of the clips as training data for the neural network.


I chose a neural net because it was the best at classifying the audio clips.


The accuracy of the neural network was extremely poor, though.
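
The training code isn't shown here, so the following is only a sketch of one way to train on part of a labelled set and measure accuracy on the rest. It assumes Classify with Method -> "NeuralNetwork" (the post does not say which function was used), and it uses random toy vectors in place of the real clips; toy, train, test and toyClassifier are hypothetical names.

```wolfram
SeedRandom[1];

(* Toy stand-in for labelled clips: 4-number feature vectors with a class label *)
toy = Join[
   Table[RandomReal[{0, 0.4}, 4] -> "NotKiwi", 60],
   Table[RandomReal[{0.6, 1}, 4] -> "Kiwi", 60]];

(* Shuffle, train on the first 80 examples and hold out the rest for testing *)
{train, test} = TakeDrop[RandomSample[toy], 80];

(* Assumption: Classify with a neural network method *)
toyClassifier = Classify[train, Method -> "NeuralNetwork"];

(* Fraction of held-out examples that are labelled correctly *)
ClassifierMeasurements[toyClassifier, test, "Accuracy"]
```

Scoring on examples the network never saw during training is what makes the accuracy figure meaningful.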


Neural Network - Take Two

I then tried downsampling the audio from 44.1 kHz to 10 kHz to reduce the amount of extraneous data the neural network had to work with.

clipsSmall = AudioResample[#, Quantity[10, "Kilohertz"]] & /@ clips;

This reduced the amount of data without significantly changing the audio.
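
A quick back-of-the-envelope check of that claim, assuming the 10-second clips and the 3600 Hz upper filter cutoff described earlier: resampling leaves each clip with under a quarter of its samples, and the new Nyquist limit of 5 kHz is still above the highest frequency the filter keeps.

```wolfram
clipSeconds = 10;

samplesBefore = clipSeconds*44100  (* 441000 samples per clip at 44.1 kHz *)
samplesAfter = clipSeconds*10000   (* 100000 samples per clip at 10 kHz *)

N[samplesBefore/samplesAfter]      (* 4.41 times fewer samples *)

(* Highest frequency that a 10 kHz sample rate can represent *)
nyquist = 10000/2                  (* 5000 Hz *)
nyquist > 3600                     (* True: the filtered band still fits *)
```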


Well, back to the drawing board, I guess.

Neural Network - Take Three

This time I tried a different approach. Since the neural network seemed to handle raw audio extremely poorly, I instead fed the spectrogram of each clip into the neural net as an image.

data = Thread[Image[Abs[SpectrogramArray[#]]] & /@ clipsSmall -> clipClasses[[All, 1]]]

Image processing has had a lot more work put into it, so this should work much better.


At this point I ran out of ways to improve the score and had reached the limit of how accurately I could classify the clips myself, so I'm going to call that a success.

Finding calls

Finally, finding the calls, which is simple now that we can find and classify potential calls.

FindCalls[clip_] := (Module[{int = {}, audioProcessed, clips, classes},
   audioProcessed = ProcessAudio[clip];
   int = ProcessAudioIntervals[audioProcessed];
   int = Cases[int, {x_?NumberQ, y_?NumberQ} /; y - x >= 10];
   clips = Which[Length[int] == 0, {},
     Length[int] == 1, 
     AudioTrim[#, 10] & /@ {AudioTrim[audioProcessed, int]},
     Length[int] > 1, 
     AudioTrim[#, 10] & /@ AudioTrim[audioProcessed, int]];
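   (* The start of this expression is missing from the post. kiwiClassifier is a
      placeholder name for the trained network from Take Three; the
      Image[Abs[SpectrogramArray[...]]] wrapper follows the training data above. *)
   classes = 
    kiwiClassifier[Image[Abs[SpectrogramArray[
          AudioResample[#, Quantity[10, "Kilohertz"]]]]] & /@ clips];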
   Thread[{Extract[clips, Position[classes, "Kiwi"]], 
     Quantity[#, "Seconds"] & /@ 
      Extract[int, Position[classes, "Kiwi"]][[All, 1]]}]])

FindCalls[Import@"C:\\Users\\Isaac\\Desktop\\Programming\\data\\Kiwi Audio\\Processing\\20170604k-53.mp3"] // Grid
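
The last step of FindCalls pairs each clip classified as "Kiwi" with the start time of its interval. Here is a minimal sketch of that bookkeeping on made-up values; strings stand in for the audio clips, the Quantity wrapper is left out for brevity, and kiwiPos and starts are names used only in this example.

```wolfram
(* Made-up stand-ins: three clips, their intervals in seconds, and predicted classes *)
clips = {"clipA", "clipB", "clipC"};
int = {{19, 33}, {50, 61}, {120, 140}};
classes = {"NotKiwi", "Kiwi", "Kiwi"};

(* Positions of the clips classified as kiwi calls *)
kiwiPos = Position[classes, "Kiwi"]
(* {{2}, {3}} *)

(* Start time of each matching interval *)
starts = Extract[int, kiwiPos][[All, 1]]
(* {50, 120} *)

(* Pair every kiwi clip with its start time *)
Thread[{Extract[clips, kiwiPos], starts}]
(* {{"clipB", 50}, {"clipC", 120}} *)
```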




Actually finding clips with noise was quite a simple task; I just messed around with the filter frequencies and the threshold for a while.


In hindsight I realize that this wasn't a problem that was particularly suited to unsupervised learning, as there are other more general features than "kiwi" or "not kiwi" for a feature extractor to identify (although I still have no idea what it was doing).


Another thing that's important for undertaking a machine learning project like this is having a lot of data. If I do another project like this, I'm definitely going to use more data that has already been classified, instead of spending 2 hours listening to birds screaming and loud background noise.

This is the training data I used for the neural net

POSTED BY: Isaac Chandler
9 months ago
