
[WSC18] Music Sentiment Analysis through Machine Learning

Posted 6 years ago

A representation of the emotion categorization system


Abstract

This project aims to develop a machine learning application that identifies the sentiments in a music clip. The data set I used consists of one hundred 45-second clips from the Database for Emotional Analysis of Music and an additional 103 clips that I gathered myself. I manually labeled all 203 clips and used them as training data for my program. The program works best with classical-style music, which is the main component of my data set, but it also works with other genres to a reasonable extent.

Introduction

One of the most important functions of music is to affect emotion, but the experience of emotion is ambiguous and subjective to the individual. The same music may induce a diverse range of feelings in different people as a result of differences in context, personality, or culture. Many musical features, however, usually have the same effect on the human brain. For example, louder music correlates more with excitement or anger, while softer music corresponds to tenderness. This consistency makes it possible to train a supervised machine learning program on musical features.

Background

This project is based on James Russell's circumplex model, in which a two-dimensional emotion space is constructed from an x-axis of valence and a y-axis of arousal, as shown in the figure above. Specifically, valence is a measure of an emotion's pleasantness, whereas arousal is a measure of an emotion's intensity. Russell's model provides a metric on which different sentiments can be compared and contrasted, creating four main categories of emotion: Happy (high valence, high arousal), Stressed (low valence, high arousal), Sad (low valence, low arousal), and Calm (high valence, low arousal). Within these main categories there are various sub-categories, labeled on the graph above. Notably, "passionate" is a sub-category that does not belong to any main category because of its ambiguous valence value.
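To make the quadrant rule concrete, here is a minimal sketch (my own illustration, not code from the application) that maps a hypothetical (valence, arousal) pair, each scaled to [-1, 1], onto the four main categories:

(* Map a (valence, arousal) pair onto Russell's four main categories *)
mainCategory[valence_, arousal_] := Which[
  valence >= 0 && arousal >= 0, "Happy",
  valence < 0 && arousal >= 0, "Stressed",
  valence < 0 && arousal < 0, "Sad",
  True, "Calm"]

mainCategory[0.7, -0.4]  (* -> "Calm": high valence, low arousal *)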


Program Structure

The program has a three-layer structure. The first layer is responsible for extracting musical features, the second for generating a list of numerical predictions based on the different features, and the third for predicting and displaying the most probable emotion descriptors based on the second layer's output.
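As a rough sketch of how the three layers chain together (the function names here are placeholders for the layers described below, not the actual names in the application):

(* Hypothetical top-level flow; extractFeatures, predictValenceArousal, and
   classifyEmotion stand in for the three layers described in this post *)
analyzeClip[audio_Audio] := Module[{features, predictions},
  features = extractFeatures[audio];              (* first layer *)
  predictions = predictValenceArousal[features];  (* second layer *)
  classifyEmotion[predictions]]                   (* third layer: emotion descriptors *)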

First Layer

The first layer consists of 23 feature extractors that generate numerical sequences based on different audio features:

(*A list of feature extractors*)
feMin[audio_] := Normal[AudioLocalMeasurements[audio, "Min", List]]
feMax[audio_] := Normal[AudioLocalMeasurements[audio, "Max", List]]
feMean[audio_] := Normal[AudioLocalMeasurements[audio, "Mean", List]]
feMedian[audio_] := Normal[AudioLocalMeasurements[audio, "Median", List]]
fePower[audio_] := Normal[AudioLocalMeasurements[audio, "Power", List]]
feRMSA[audio_] := Normal[AudioLocalMeasurements[audio, "RMSAmplitude", List]]
feLoud[audio_] := Normal[AudioLocalMeasurements[audio, "Loudness", List]]
feCrest[audio_] := Normal[AudioLocalMeasurements[audio, "CrestFactor", List]]
feEntropy[audio_] := Normal[AudioLocalMeasurements[audio, "Entropy", List]]
fePeak[audio_] := Normal[AudioLocalMeasurements[audio, "PeakToAveragePowerRatio", List]]
feTCent[audio_] := Normal[AudioLocalMeasurements[audio, "TemporalCentroid", List]]
feZeroR[audio_] := Normal[AudioLocalMeasurements[audio, "ZeroCrossingRate", List]]
feForm[audio_] := Normal[AudioLocalMeasurements[audio, "Formants", List]]
feHighFC[audio_] := Normal[AudioLocalMeasurements[audio, "HighFrequencyContent", List]]
feMFCC[audio_] := Normal[AudioLocalMeasurements[audio, "MFCC", List]]
feSCent[audio_] := Normal[AudioLocalMeasurements[audio, "SpectralCentroid", List]]
feSCrest[audio_] := Normal[AudioLocalMeasurements[audio, "SpectralCrest", List]]
feSFlat[audio_] := Normal[AudioLocalMeasurements[audio, "SpectralFlatness", List]]
feSKurt[audio_] := Normal[AudioLocalMeasurements[audio, "SpectralKurtosis", List]]
feSRoll[audio_] := Normal[AudioLocalMeasurements[audio, "SpectralRollOff", List]]
feSSkew[audio_] := Normal[AudioLocalMeasurements[audio, "SpectralSkewness", List]]
feSSlope[audio_] :=  Normal[AudioLocalMeasurements[audio, "SpectralSlope", List]]
feSSpread[audio_] := Normal[AudioLocalMeasurements[audio, "SpectralSpread", List]]
feNovelty[audio_] := Normal[AudioLocalMeasurements[audio, "Novelty", List]]
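For a single clip, the outputs of these extractors can be gathered in one place; the grouping into an association below is my own illustration (only a few extractors are listed for brevity):

(* Collect first-layer measurements for one clip, keyed by feature name *)
featureExtractors = <|"RMSAmplitude" -> feRMSA, "Loudness" -> feLoud,
   "SpectralCentroid" -> feSCent, "Novelty" -> feNovelty|>;
extractAllFeatures[audio_Audio] := Map[#[audio] &, featureExtractors]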


Second Layer

Using the data generated by the first layer, the valence and arousal predictors of the second layer provide 46 predictions for the audio input, each based on a different feature.

(*RMSAmplitude*)
(*Feature extractor*) feRMSA[audio_] := Normal[AudioLocalMeasurements[audio, "RMSAmplitude", List]]
(*Helper: keep only the measurement values, dropping the time stamps*) takeLast[list_List] := Map[Last, list, {2}]
(*Feature data for every training clip*) dataRMSA = Table[First[takeLast[feRMSA[First[Take[musicFiles, {n}]]]]], {n, Length[musicFiles]}];
(*Generating predictor*) pArousalRMSA = Predict[dataRMSA -> arousalValueC]

Sample predictor function
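Once trained, a predictor like pArousalRMSA can be applied to an unseen clip by extracting the same feature first; a minimal sketch, where newClip is a placeholder Audio object:

(* Predict the arousal value of a new clip from its RMS-amplitude feature *)
newFeature = First[takeLast[feRMSA[newClip]]];
pArousalRMSA[newFeature]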


Third Layer

The two parts of the third layer, the main-category classifier and the sub-category classifier, each use the prediction vectors generated by the second layer to make a prediction within their own realm of emotion. The output consists of two parts: a main-category emotion and a sub-category emotion.

(*Main*) emotionClassify1 = Classify[classifyMaterial -> emotionList1, PerformanceGoal -> "Quality"]
(*Sub*) emotionClassify2 = Classify[classifyMaterial -> emotionList2, PerformanceGoal -> "Quality"]
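As a hedged sketch of how the two classifiers can be queried for the most probable label and its probability (the "TopProbabilities" query is a standard ClassifierFunction property; the wrapper function itself is my illustration):

(* Return the most probable main- and sub-category labels with their probabilities *)
reportEmotion[predictions_List] := {
  emotionClassify1[predictions, "TopProbabilities" -> 1],
  emotionClassify2[predictions, "TopProbabilities" -> 1]}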



Output

If the program receives an input that is longer than 45 seconds, it automatically clips the audio file into 45-second segments and returns a result for each. If the last segment is shorter than 45 seconds, the program still works on it, though with reduced accuracy. The display for each clip includes a main-category and a sub-category descriptor, each printed with its associated probability.
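The clipping step can be sketched with AudioPartition; a minimal example, reusing the hypothetical analyzeClip from the Program Structure section and a placeholder longAudio object:

(* Split a long recording into non-overlapping 45-second segments and analyze each *)
segments = AudioPartition[longAudio, 45, 45];
analyzeClip /@ segments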

Sample testing: Debussy's Clair de Lune




Conclusion

The program gives very reasonable results for most music in the classical style. However, it has three shortcomings that I plan to fix in later versions. First, the program may give contradictory results (e.g., happy and depressed) if the sentiment changes dramatically in the middle of a 45-second segment, perhaps reflecting the music's changing emotional composition. The current 45-second clipping window is rather long and thus prone to capturing contradictory emotions. In the next version of this program, the window will probably be shortened to 30 or 20 seconds to reduce prediction uncertainty. Second, the program's processing speed has a lot of room for improvement: it currently takes about one and a half minutes to process a one-minute audio file. In future versions I will remove relatively ineffective feature extractors to speed things up. Lastly, the data used in creating this application was labeled solely by me and is therefore prone to my personal biases. I plan to expand the data set with more people's input and more genres of music.

I have attached the application to this post so that everyone can try out the program.

Acknowledgement

I sincerely thank my mentor, Professor Rob Morris, for providing invaluable guidance to help me carry out the project. I also want to thank Rick Hennigan for giving me crucial support with my code.

Attachments:
9 Replies

Hi, William. Terrific work! Unfortunately, it does not seem to work in Version 13 (the error codes don't help much). Attached is my attempt to use the notebook. Have you done any further work on this project?

Best,

Harlan

Attachments:
POSTED BY: Harlan Brothers

Hi Michael,

Yes of course. Because the file is too big, I have put it up on Google Drive and here is the link:

https://drive.google.com/file/d/1NWb5DQPW9ZiSJiqlfaDYBKxvYi4Dv5pq/view?usp=sharing

Best regards,

William

Thanks, William, for the detailed explanation. I appreciate it. If it is still OK, I would like a copy of your code so that I can play around with it.

Many thanks, Michael

POSTED BY: Michael Kelly

Hi Michael,

Thank you for your interest in my project! I apologize for the belated reply.

For your first question, yes it was defined internally. The function definition is as follows:

takeLast[list_List] := Map[Last, list, {2}]

For your second question, this project did not use a neural network, but it could be roughly described as a feed-forward network of linear regressors, random forests, and other classifier functions (I let Mathematica decide which one to use). The precise structure is described in the network figure: I trained the second layer to map each feature to the correct valence/arousal values, and then I trained the third layer to map the vector of valence/arousal predictions to the correct label.

The code for this project is very long and is simply a repetition of the following steps for each feature (a rough sketch follows the list):

  1. define a feature extractor

  2. extract the feature at every time step, producing a feature time series

  3. train two predictor functions to map the feature time series to the "ground truth" valence and arousal values

  4. repeat the above three steps for each type of feature (power, entropy, etc.), creating a "feature vector"

  5. finally, train the classifier functions to map the valence and arousal feature vectors to the "ground truth" labels.
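In rough Wolfram Language terms, that loop looks something like the sketch below, where measurementTypes stands for the list of AudioLocalMeasurements property names and valenceValueC/arousalValueC stand for my manually labeled values (only arousalValueC appears in the post; the valence counterpart is assumed here):

(* Rough sketch of the per-feature training loop *)
trainFeaturePredictors[type_String] := Module[{fe, data},
  fe = Normal[AudioLocalMeasurements[#, type, List]] &;                          (* step 1 *)
  data = Table[First[takeLast[fe[musicFiles[[n]]]]], {n, Length[musicFiles]}];   (* step 2 *)
  {Predict[data -> valenceValueC], Predict[data -> arousalValueC]}];             (* step 3 *)

predictorPairs = trainFeaturePredictors /@ measurementTypes;  (* step 4; step 5 is the Classify call shown in the third layer *)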

I don't know if the actual code will be helpful to you, but if you still want the code, I am more than willing to post it here.

This approach works, but it is definitely not optimal. For one thing, backpropagation only happens locally within each module instead of globally across the whole pipeline. I have been planning to redo this project with a neural network (which is more efficient and much better at distributed representation) for a long time, and I hope I can find some time in the near future to finish it. I am currently working on a time-series classification project using recurrent neural networks in a neuroscience laboratory, and I could potentially adapt the network architecture from that project to fit the purposes of this one.

Anyway, thank you again for your interest in my project. I hope my reply helps.

Best regards,

William

Thanks, William, for a very interesting post. Just two questions: in the second layer, for the definition of dataRMSA, you use the function takeLast[.]. Is this defined internally within the sentimentAnalyzer? Also, while you have outlined the components of the sentimentAnalyzer, you have not explicitly shown how these are put together, say in either a NetChain or NetGraph that would show how they actually create the required neural net. Is it possible to get that code?

Thanks again Michael

POSTED BY: Michael Kelly
Posted 6 years ago

This seems like a really cool way to analyze music.

POSTED BY: Luke Xie
Posted 6 years ago

Great job! Very interesting!

POSTED BY: Mei Jin

This is interesting. I am taking a course on The Great Courses Plus, Music and the Brain, taught by Aniruddh Patel, PhD, that talks about these things. It's really interesting.

POSTED BY: Leighton Cooper

Congratulations! This post is now a Staff Pick as distinguished by a badge on your profile! Thank you, keep it coming!

POSTED BY: EDITORIAL BOARD
