
[WSC18] Phone Gesture Recognition with Machine Learning

Posted 3 months ago | 671 Views | 3 Replies | 6 Total Likes

Accelerometer-based Gesture Recognition with Machine Learning

A GIF of gesture recognition in progress

Introduction

This is Part 2 of a 2-part community post - Part 1 (Streaming Live Phone Sensor Data to the Wolfram Language) is available here: http://community.wolfram.com/groups/-/m/t/1386358

As technology advances, we are constantly seeking new, more intuitive methods of interfacing with our devices and the digital world; one such method is gesture recognition. Although touchless human interface devices (kinetic user interfaces) exist and are in development, the cost and configuration they require sometimes make them impractical, particularly for mobile applications. A simpler approach is to use devices the user already has on their person - such as a phone or a smartwatch - to detect basic gestures, taking advantage of the wide array of sensors included in such devices. To assess the feasibility of this approach, we investigate methods of asynchronous communication between a mobile device and the Wolfram Language, and implement an accelerometer-based gesture recognition system that uses a machine learning model to classify a few simple gestures from mobile accelerometer data.

Investigating Methods of Asynchronous Communication

To implement an accelerometer-based gesture recognition system, we must devise a suitable means for a mobile device to transmit accelerometer data to a computer running the Wolfram Language (WL). On a high level, the WL has baked-in support for a variety of devices - specifically the Raspberry Pi, Vernier Go!Link compatible sensors, Arduino microcontrollers, webcams and devices using the RS-232 or RS-422 serial protocol (http://reference.wolfram.com/language/guide/UsingConnectedDevices.html); unfortunately, there is no easy way to access sensor data from Android or iOS mobile devices.

On a low level, the WL natively supports TCP and ZMQ socket functionality, as well as receipt and transmission of HTTP requests and Pub-Sub channel communication. We investigate the feasibility of both methods for transmission of accelerometer data in Part 1 of this community post (http://community.wolfram.com/groups/-/m/t/1386358).

Gesture Classification using Neural Networks

Now that we are able to stream accelerometer data to the WL, we may proceed to implement gesture recognition / classification. Due to limited time at camp, we used the UDP socket method to do this - in the future, we hope to move the system over to the (more user-friendly) channel interface.

We first configure the sensor stream, allowing live accelerometer data to be sent to the Wolfram Language:

Configure the Sensor Stream

  1. Install the "Sensorstream IMU+GPS" app (https://play.google.com/store/apps/details?id=de.lorenz_fenster.sensorstreamgps)
  2. Ensure the sensors you want to stream to Wolfram are ticked on the 'Toggle Sensors' page. (If you want to stream other sensors besides 'Accelerometer', 'Gyroscope' and 'Magnetic Field', ensure the 'Include User-Checked Sensor Data in Stream' box is ticked. Beware, though - the more sensors are ticked, the more latency the sensor stream will have.)
  3. On the "Preferences" tab:
     a. Change the target IP address in the app to the IP address of your computer (ensure your computer and phone are connected to the same local network)
     b. Set the target port to 5555
     c. Set the sensor update frequency to 'Fastest'
     d. Select the 'UDP stream' radio box
     e. Tick 'Run in background'
  4. Switch stream ON before executing code. (nb. ensure your phone does not fall asleep during streaming - perhaps use the 'Caffeinate' app (https://play.google.com/store/apps/details?id=xyz.omnicron.caffeinate&hl=en_US) to ensure this.)
  5. Execute the following WL code:

(in part from http://community.wolfram.com/groups/-/m/t/344278)

Needs["JLink`"];
QuitJava[];
InstallJava[];
udpSocket=JavaNew["java.net.DatagramSocket",5555];
readSocket[sock_,size_]:=JavaBlock@Block[{datagramPacket=JavaNew["java.net.DatagramPacket",Table[0,size],size]},
  sock@receive[datagramPacket];
  datagramPacket@getData[]];
listen[]:=record=DeleteCases[readSocket[udpSocket,1200],0]//FromCharacterCode//Sow;
results={};
RunScheduledTask[AppendTo[results,Quiet[Reap[listen[]]]];If[Length[results]>700,results=Drop[results,150]],0.01];
stream:=Refresh[ToExpression[StringSplit[#[[1]],","]]& /@ Select[results[[-500;;]],Head[#]==List&],UpdateInterval->0.01]
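For readers outside the Wolfram ecosystem, the receive-and-parse step can be sketched in plain Python (standard library only). This is an illustrative sketch, not part of the original notebook - `parse_datagram` and `stream_readings` are hypothetical names, while the port and 1200-byte buffer match the WL code above:

```python
import socket

def parse_datagram(data):
    # Each Sensorstream datagram is a comma-separated ASCII string, e.g.
    # "timestamp, 3, ax, ay, az, 4, gx, gy, gz, ..." -- parse it to floats.
    return [float(v) for v in data.decode().split(",")]

def stream_readings(port=5555, buf_size=1200):
    # Bind a UDP socket and yield parsed readings as datagrams arrive
    # (blocks between packets, much like the scheduled task in the WL code).
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", port))
    while True:
        data, _addr = sock.recvfrom(buf_size)
        yield parse_datagram(data)
```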

Detecting Gestures

On a technical level, the problem of gesture classification is as follows: given a continuous stream of accelerometer data (or similar),

  1. distinguish periods during which the user is performing a given gesture from other activities / noise and

  2. identify / classify a particular gesture based on accelerometer data during that period. This essentially boils down to classification of a time series dataset, in which we can observe a series of emissions (accelerometer data) but not the states generating the emissions (gestures).

A relatively straightforward solution to (1) is to approximate the gradients of the moving averages of the x, y and z values of the data and take the Euclidean norm of these gradients: whenever this norm rises above a threshold, a gesture has been made.

movingAvg[start_,end_,index_]:=Total[stream[[start;;end,index]]]/(end-start+1);

^ Takes the average of the x, y or z values of data (specified by index - x-->3, y-->4, z-->5) from the index start to the index end.

normAvg[start_,middle_,end_]:=((movingAvg[middle,end,3]-movingAvg[start,middle,3])/(middle-start))^2+((movingAvg[middle,end,4]-movingAvg[start,middle,4])/(middle-start))^2+((movingAvg[middle,end,5]-movingAvg[start,middle,5])/(middle-start))^2;

^ (Assuming difference from start to middle is equal to difference from middle to end:) Approximates the gradient at index middle using the average from start to middle and from middle to end for x, y and z values, and then takes the sum of the squares of these values. Note that we do not need to take the square root of the final answer (to find the Euclidean norm), as doing so and comparing it to some threshold x would be equivalent to not doing so and comparing it to the threshold x^2 (square root is a computationally expensive operation).

Thus

Dynamic[normAvg[-155,-150,-146]]

will yield the square of the Euclidean norm of the gradients of the x, y and z values of the data (approximated by calculating the averages from the 155th most recent to 150th most recent and 150th most recent to 146th most recent values). As accelerometer data is sent to the Wolfram Language, this value will update.
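To make the arithmetic concrete, here is an illustrative Python re-implementation of movingAvg and normAvg (a sketch, not part of the original notebook; columns 3-5 hold x, y, z as described, and negative indices count back from the newest sample, as with WL's Part):

```python
def moving_avg(stream, start, end, index):
    # Mean of column `index` over rows start..end inclusive
    # (index 3 -> x, 4 -> y, 5 -> z).
    window = stream[start:end + 1]
    return sum(row[index] for row in window) / len(window)

def norm_avg_sq(stream, start, middle, end):
    # Squared Euclidean norm of the finite-difference gradients of the
    # x, y, z moving averages at `middle`. The square root is skipped:
    # comparing the squared value against a squared cutoff is equivalent
    # and cheaper, exactly as argued for normAvg above.
    return sum(
        ((moving_avg(stream, middle, end, i) - moving_avg(stream, start, middle, i))
         / (middle - start)) ** 2
        for i in (3, 4, 5)
    )
```

A spike detector is then just `norm_avg_sq(stream, -155, -150, -146) > threshold`, with `threshold` playing the role of the squared cutoff.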

Data Collection

To train the network, we must collect gesture data. We have a variety of options here: we can either represent the gesture as a sequence of 3-dimensional vectors (x, y, z accelerometer data points) and perform time series classification on these sequences using hidden Markov models or recurrent neural networks, or we can represent the gesture as a rasterised image of a graph much like the one below:

Rasterised image of a gesture

and perform image classification on the image of the graph.

Since the latter has had some degree of success (e.g. in http://community.wolfram.com/groups/-/m/t/1142260 ), we attempt a similar method:

PrepareDataIMU[dat_]:=Rasterize@ListLinePlot[{dat[[All,1]],dat[[All,2]],dat[[All,3]]},PlotRange->All,Axes->None,AxesLabel->None,PlotStyle->{Red, Green, Blue}];

^ Plots the data points in dat with no axes or axis labels, and with x coordinates in red, y coordinates in green, z coordinates in blue (this makes processing easier as Wolfram operates in RGB colours).

threshold = 0.8;
trainlist={};
appendToSample[n_,step_,x_]:=AppendTo[trainlist,PrepareDataIMU[Part[x,n;;step]]];
Dynamic[If[normAvg[-155,-150,-146]>threshold,appendToSample[-210,-70,stream],False],UpdateInterval->0.1]

^ Every 0.1 seconds, checks whether the normed average of the gradient of accelerometer data at the 150th most recent data point (using the normAvg function) is greater than the threshold. If it is, it creates a rasterised image of a graph of accelerometer data from the 210th most recent to the 70th most recent data point and appends it to trainlist, a list of graphs of gestures. Patience is recommended here: there can be up to ~5 seconds' lag before a gesture appears. Ensure gestures are made reasonably vigorously.
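Offline, the same trigger-and-capture logic can be expressed as a single pass over a recorded stream. The Python sketch below is illustrative (grad_sq and capture_windows are hypothetical names, not part of the original notebook); pre=60 and post=80 mirror the -210..-70 window around the -150 detection point:

```python
def grad_sq(stream, start, middle, end):
    # Squared gradient of the x, y, z (columns 3-5) window means,
    # as in the normAvg detector described above.
    def col_mean(a, b, i):
        rows = stream[a:b + 1]
        return sum(r[i] for r in rows) / len(rows)
    return sum(
        ((col_mean(middle, end, i) - col_mean(start, middle, i)) / (middle - start)) ** 2
        for i in (3, 4, 5)
    )

def capture_windows(stream, threshold, pre=60, post=80):
    # Whenever the detector fires at sample i, capture the window
    # [i - pre, i + post] and skip ahead so one gesture yields one window.
    windows, i = [], pre
    while i < len(stream) - post:
        if grad_sq(stream, i - 5, i, i + 4) > threshold:
            windows.append(stream[i - pre:i + post + 1])
            i += post  # debounce past the captured gesture
        else:
            i += 1
    return windows
```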

As a first test, we attempted to generate 30 samples of each of the digits 1 to 5 drawn in the air with a phone - the images of these graphs were stored in trainlist. We then labelled them 1, 2, 3, 4 or 5, converting trainlist into an association with keys <image of graph> and values <number>. Finally, we split the data into training data (25 samples from each category) and test data (all remaining data):

TrainingTest[data_,number_]:=Module[{maxindex,sets,trainingdata,testdata},
maxindex=Max[Values[data]];
sets = Table[Select[data,#[[2]]==x&],{x,1,maxindex}];
sets=Map[RandomSample,sets];
trainingdata =Flatten[Map[#[[1;;number]]&,sets]];
testdata =Flatten[Map[#[[number+1;;-1]]&,sets]];
Return[{trainingdata,testdata}]
]

^ Randomly selects number training elements and Length[data]-number test elements for each value in the list data

gestTrainingTest=TrainingTest[gesture1to5w30,25];
gestTraining=gestTrainingTest[[1]];
gestTest=gestTrainingTest[[2]];
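The split logic itself is language-agnostic. As an illustrative aside (not part of the original notebook; `training_test` is a hypothetical name), an equivalent stratified split in Python:

```python
import random

def training_test(data, n_train, seed=0):
    # data: list of (sample, label) pairs with integer labels 1..k.
    # For each class, shuffle, then take n_train samples for training and
    # the remainder for testing -- mirroring the WL TrainingTest above.
    rng = random.Random(seed)
    max_label = max(label for _, label in data)
    training, test = [], []
    for lbl in range(1, max_label + 1):
        group = [pair for pair in data if pair[1] == lbl]
        rng.shuffle(group)
        training += group[:n_train]
        test += group[n_train:]
    return training, test
```

With 30 samples per class and n_train=25, this reproduces the 25/5 split used above.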

gestTraining and gestTest now contain key-value pairs like those below:

Key-value pairs of accelerometer graphs and labels

Machine Learning

To train a model on these images, we first attempt a basic Classify:

TrainWithClassify = Classify[gestTraining]
ClassifierInformation[TrainWithClassify]

A poor result in ClassifierInformation

Evidently, this gives a very poor training accuracy of 24.4% - given that there are 5 classes, this is only marginally better than random chance.

As the input data consists of images, we try transfer learning on an image identification neural network (specifically the VGG-16 network):

net = NetModel["VGG-16 Trained on ImageNet Competition Data"]

We remove the last few layers from the network (which classify images into the classes the network was trained on), leaving the earlier layers which perform more general image feature extraction:

featureFunction = Take[net,{1,"fc6"}]

We train a classifier using this neural net as a feature extractor:

NetGestClassifier = Classify[gestTraining,FeatureExtractor->featureFunction]

We now test the classifier using the data in gestTest:

NetGestTest = ClassifierMeasurements[NetGestClassifier,gestTest]

We check the training accuracy:

ClassifierInformation[NetGestClassifier]

A better classifier information result

NetGestTest["Accuracy"]
1.
NetGestTest["ConfusionMatrixPlot"]

A good confusion matrix plot

This method appears promising: a training accuracy of 93.1% and a test accuracy of 100% were achieved.

Implementation of Machine Learning Model

To use the model with live data, we use the same method of identifying gestures as before in 'Detecting Gestures' (detecting 'spikes' in the data using moving averages), but when a gesture is identified, instead of being appended to a list it is sent through the classifier:

gestureResults = {""}; (* kept separate from the 'results' buffer used by the stream *)
ClassGestIMU[n_,step_,x_]:=Module[{aa,xa,ya},
aa = Part[x,n;;step];
xa=PrepareDataIMU[aa];
ya=NetGestClassifier[xa];
AppendTo[gestureResults,{Length@aa,xa,ya}]
];
Dynamic[If[normAvg[-155,-150,-146]>threshold,ClassGestIMU[-210,-70,stream],False],UpdateInterval->0.1]

Real-time results (albeit with significant lag) can be seen by running

Dynamic@gestureResults[[-1]]

Conclusions and Further Work

On the whole, this project was successful, with gesture detection and classification using rasterised graph images proving a viable method. However, the system as-is is impractical and unreliable, with a significant lag, training bias (trained to recognise the digits 1 to 5 the way they are drawn by one person only) and small sample size: these are problems that can be solved, given more time.

Further extensions to this project include:

  • Serious code optimisation to reduce / eliminate lag

  • An improved training interface to allow users to create their own gestures

  • Integration of the gesture classification system with the Channel interface (as described earlier) and deployment of this to the cloud.

  • Investigation of the feasibility of using an RNN or an LSTM for gesture classification - using a time series of raw gesture data rather than relying on rasterised images (which, although accurate, can be quite laggy). Alternatively, hidden Markov models could be used in an attempt to recover the states (gestures) that generate the observed data (accelerometer readings).

  • Adding an API to trigger actions based on gestures and deployment of gesture recognition technology as a native app on smartphones / smartwatches.

  • Improvement of gesture detection. At the moment, the code takes a predefined 1-2 second 'window' of data after a spike is detected - an improvement would be to detect when the gesture has ended and 'crop' the data values to include only the gesture made.

  • Exploration of other applications of gesture recognition (e.g. walking style, safety, sign language recognition). Beyond limited UI navigation, a similar concept to what is currently implemented could be used with, say, a phone kept in the pocket, to analyse walking styles / gaits and, for instance, to predict or detect an elderly user falling and notify emergency services. Alternatively, with suitable methods for detecting finger motions, like flex sensors, such a system could be trained to recognise / transcribe sign language.

  • Just for fun - training the model on Harry Potter-esque spell gestures, to use your phone as a wand...

A notebook version of this post is attached, along with a full version of the computational essay.

Acknowledgements

We thank the mentors at the 2018 Wolfram High School Summer Camp - Andrea Griffin, Chip Hurst, Rick Hennigan, Michael Kaminsky, Robert Morris, Katie Orenstein, Christian Pasquel, Dariia Porechna and Douglas Smith - for their help and support during this project.

3 Replies

This is very neat, and analysis using Wolfram can make it scientific (although the iPhone is not a scientific instrument).

However, as far as interpreting gestures goes: the iPhone has that built in, the methods are patented, and Apple allows the public to use its gesture-detection libraries for free (likewise, I think Android released free code for their library).

If your goal was iPhone gestures, I believe it's already done and patented but available to use - real-time recognition.

But downloading raw data and re-recognising it - there's nothing wrong with that.

However, as far as Mathematica's learning being slow and inaccurate: waveform recognition is a major job. Mathematica has a large Sound section that's good (your GPS signal still applies, though your waveform is not in that form). Sound cards have special recognition components that do things a CPU cannot keep up with; the best waveform analysers cost tens of thousands.

If you want quick waveform detection, I'm unsure whether you should be using the "computer learning" feature - perhaps you should convert the data to sound-wave form and check your options that way.

Thanks for taking the time to look at my Wolfram Summer Camp project - I hope you enjoyed it.

A few notes:

  1. The device I'm using here is an Android phone - I did develop a platform-agnostic Channel[]-based method for sensor data collection detailed in Part 1 (http://community.wolfram.com/groups/-/m/t/1386358), though, and there is some documentation for doing a similar thing with iDevices.
  2. When you refer to 'interpreting gestures', are you referring to accelerometer-based / air gestures or touch gestures (Googling 'gesture recognition' yields many results for the latter and few for the former)? If Android / Apple have released a library for accelerometer-based gestures, do post the link here - I'd be interested to see it. (n.b. I'm using accelerometer sensors, not GPS)
  3. Your sound wave based suggestion sounds quite interesting - I might take a look at Mathematica's sound processing capabilities (e.g. this seems interesting for determining when a gesture starts/stops: https://www.wolfram.com/language/11/computational-audio/transient-detection.html).
  4. I'm not completely certain, however, as to the viability of a non-'computer learning' / machine learning approach to waveform classification - even if I converted the data to sound waves, as I understand it I would still have to use some machine learning model to classify the waveforms (time series classification --> e.g. hidden Markov models, or recurrent neural networks, both of which would also work with the raw accelerometer data simply presented as a time series). Again, if there's something the Wolfram Language has (built-in or otherwise) that could help with this, please do let me know.

Kind regards,

Euan

Congratulations! This post is now a Staff Pick as distinguished by a badge on your profile! Thank you, keep it coming!
