In this community the question of speech-to-text, i.e. speech transcription capabilities for Mathematica, has been raised several times, e.g. here by @Jesse Friedman. The feature was also discussed several times in the presentations at the Wolfram Technology Conference, videos of which have recently been made available. In his presentation Stephen Wolfram mentioned that they are working on built-in transcription using their own powerful neural network framework. I am certain that this functionality will be much, much more powerful than what I am going to present here, but a project of mine required speech-to-text and I decided to bend some OSX features to my will...
Here is an example of a sound file (with very exaggerated intonation, etc.), and this is the transcription:
"Why from Mathematica usually termed Mathematica is a modern technical computing system spanning all areas of technical computing-including neural networks, machine learning,Image processing, geometry, data science, visualisations, and others. The system is used in many technical, scientific, engineering, mathematics, and computing feelings."
which is not too far from the original:
"Wolfram Mathematica (usually termed Mathematica) is a modern technical computing system spanning all areas of technical computing - including neural networks, machine learning, image processing, geometry, data science, visualizations, and others. The system is used in many technical, scientific, engineering, mathematical, and computing fields."
You can download the sound file from here.
Introduction
I have been working on several little projects that are language based, for example analysing captions of videos - if anybody is interested in that, I would be happy to post a how-to. I have neither the skills nor the data that Wolfram Research have, so I needed to try something quick and dirty. I will use the OSX feature "Dictation" and pipe the output into Mathematica. I will implement two different approaches. In the first, I will use the microphone and dictate directly into a notebook. In the second, I will take an existing recording, play it with Mathematica and use Dictation to transcribe it. For this second approach I will use a program called Loopback (a descendant of Soundflower); this program turns out to be very useful in combination with Mathematica as it allows me to re-route any sound into Mathematica (or any other program).
Preparation of OSX
The first thing we need to set up is the "Dictation" feature of OSX. You find it in the System Preferences under Keyboard:
Then you click on the Dictation pane and make sure to use the following settings:
So dictation needs to be on and you need to check the box "Use Enhanced Dictation". You can of course change the language setting. That is it!
Configuration of Loopback
After installing Loopback (which requires paying for it!) you need to add a new virtual device:
When that is done you add Mathematica to the applications:
That is it for Loopback.
Speech transcription "in real time"
This real-time transcription does not require Loopback, but we will use a small Swift script, run like a shell script, to activate dictation programmatically. The script looks like this:
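#!/usr/bin/env xcrun swift
// Sends key code 63 (the fn key) twice through System Events;
// pressing fn twice is the default shortcut that starts Dictation.
import Foundation
let task = Process()
task.launchPath = "/usr/bin/osascript"
task.arguments = ["-e","tell app \"System Events\" to key code {63,63}"]
task.launch()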
You will have to save it into a file with the extension .sh and change its permissions so that it becomes executable. I have added the file to the Wolfram Cloud (https://wolfr.am/tGhhcbcJ). In fact, the following Mathematica code should download it and change the permissions appropriately:
Export["~/Desktop/startdictation.sh", CloudGet["https://wolfr.am/tGhhcbcJ"], "Text"];
Run["chmod a+x ~/Desktop/startdictation.sh"];
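Before going on, it may be worth confirming that the script is where the later Run calls expect it. This is only a minimal sketch: the variable name scriptPath is made up for illustration, and the location is an assumption that matches the ~/Desktop/startdictation.sh path used in the Run calls further down.

```mathematica
(* Assumed location: the path that the later Run calls use *)
scriptPath = FileNameJoin[{$HomeDirectory, "Desktop", "startdictation.sh"}];

(* True if the file was written *)
FileExistsQ[scriptPath]

(* Exit status of the shell test: 0 if the file is executable, nonzero otherwise *)
Run["test -x '" <> scriptPath <> "'"]
```

If the first call gives False or the second gives a nonzero value, the download or the chmod step did not do what was intended.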
Note that this script is required both for real-time transcription and for the transcription from an existing file in the next section. The following lines should transcribe whatever is spoken into the microphone for the next 47 seconds:
Run["~/Desktop/startdictation.sh"];
Pause[47];
transcript = StringJoin[StringJoin[# /. RowBox -> Identity] & /@ NotebookRead[NextCell[]][[1, 1, 1]]];
NotebookDelete@NextCell[]
This is very (!) messy programming, but it appears to work ok-ish. As a test case I used the first couple of sentences from the Wikipedia article on Mathematica:
StringTake[WikipediaData["Wolfram Mathematica"], 345]
Wolfram Mathematica (usually termed Mathematica) is a modern technical computing system spanning all areas of technical computing - including neural networks, machine learning, image processing, geometry, data science, visualizations, and others. The system is used in many technical, scientific, engineering, mathematical, and computing fields.
The dictation program has to deal with two challenges: (i) the text contains "non-dictionary-words" like Wolfram and Mathematica; and (ii) I am not a native speaker so the program will have to make sense of my German-English language mixture. Here is a typical transcript of the text:
Wolfram Mathematica usually can't mathematical is a modern technical computing system spanning all areas of technical computing-including neural networks, machine-gunning,Image processing, geometry, data science, visualisations, and others.The system is used in many technical, scientific, engineering, mathematical, and computing fears.
I was quite impressed that it got "Wolfram Mathematica" right, but then it failed for the second "Mathematica". The "can't" is obviously wrong and the transcript suffers from "computing fears". But all in all it is a decent transcript, it is free, and we did not need any data set for training.
Speech transcription from a file/recording
Now we will need Loopback, which allows us to re-route sound from any program to any other program. We will play the sound file in Mathematica and route it into Dictation, which will have to deliver the transcript back to Mathematica. We will therefore tell Dictation, in the Keyboard section of the System Preferences, to use Loopback as input:
This will provide the sound from Mathematica to Dictation. We will use the file linked above; here is the link again. Then the following few lines will do the magic:
audioduration = Duration[Import["/Users/thiel/Desktop/MMAWiki.m4a"]];
Run["~/Desktop/startdictation.sh &"];
Pause[4];
EmitSound[Import["/Users/thiel/Desktop/MMAWiki.m4a"]];
Pause[audioduration + Quantity[4, "Seconds"]];
transcript = StringJoin[StringJoin[# /. RowBox -> Identity] & /@ NotebookRead[NextCell[]][[1, 1, 1]] ];
NotebookDelete@NextCell[]
The idea of this code is the following. The first line determines the length of the recording. This is important in order to make the program wait for the transcription to happen. Then we start Dictation via the script and wait four seconds for it to initialise. We then import and play the recording. The second Pause guarantees that Dictation "listens" to the entire recording. The transcript will be written as input into the next cell. We read and clean the input, and finally delete the cell. Et voilà!
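The read-and-clean step can be tried without Dictation. The sketch below uses a hand-written Cell expression as a stand-in for what NotebookRead[NextCell[]] might return (an assumption; the real box structure depends on what Dictation types), and a hypothetical helper boxesToString that collects every string inside the BoxData instead of indexing with [[1, 1, 1]].

```mathematica
(* Toy stand-in for NotebookRead[NextCell[]]; the real cell will differ *)
cell = Cell[BoxData[RowBox[{"Wolfram", " ", RowBox[{"Mathematica", " ", "is"}]}]], "Input"];

(* Hypothetical helper: gather all strings inside the BoxData, left to right,
   and join them; First[c] drops the "Input" style name *)
boxesToString[c_Cell] := StringJoin[Cases[First[c], _String, Infinity]]

boxesToString[cell]
(* "Wolfram Mathematica is" *)

(* The expression from the post gives the same result on this toy cell *)
StringJoin[StringJoin[# /. RowBox -> Identity] & /@ cell[[1, 1, 1]]]
```

One possible advantage of the Cases form is that it should also cope with a cell that contains a single word, where BoxData holds a bare string rather than a RowBox and the [[1, 1, 1]] index would fail.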
Using the recording from above this should be the transcript:
Why from Mathematica usually termed Mathematica is a modern technical computing system spanning all areas of technical computing-including neural networks, machine learning,Image processing, geometry, data science, visualisations, and others. The system is used in many technical, scientific, engineering, mathematics, and computing feelings.
This time it got Wolfram completely wrong, mathematical became mathematics, and the "computing fears" from before have now become "computing feelings", but again we have something that could pass as a transcript. I wonder how this performs with native speakers. It would be great if you could let me know.
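To put a number on comparisons like this, one option is a word error rate: the word-level edit distance between transcript and original, divided by the number of words in the original. This is a rough sketch rather than a standard implementation; the names words, wordErrorRate, original and dictated are made up here, and punctuation and capitalisation are simply ignored. The two strings are the last sentences of the original and of the transcript above.

```mathematica
(* Lower-case a string and split it into words on any run of non-letters *)
words[s_String] := StringSplit[ToLowerCase[s], Except[LetterCharacter] ..]

(* Word-level Levenshtein distance, normalised by the length of the reference *)
wordErrorRate[reference_String, hypothesis_String] :=
  EditDistance[words[reference], words[hypothesis]] / Length[words[reference]]

original = "The system is used in many technical, scientific, engineering, mathematical, and computing fields.";
dictated = "The system is used in many technical, scientific, engineering, mathematics, and computing feelings.";

wordErrorRate[original, dictated]
(* 2/13: two substituted words out of thirteen *)
```

Running the same function on the full texts would make it easier to compare results between speakers or between the microphone and Loopback routes.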
Conclusion
I find it quite nice that with 5 - 7 lines of code you can manage to hijack the operating system's transcription feature and make it available in the notebook. Of course, I don't actually do much here: I just use a feature of the operating system, and Loopback to re-route the sound, and that is it, but I do like to see that these three programs (the two plus Mathematica) actually work so nicely together. This procedure does not appear to work directly on a Windows machine, but I think that it would be possible to use Dragon NaturallySpeaking and its feature to "auto-transcribe" files in a given folder - a feature that does not work on OSX - to achieve the same effect, in fact in an easier way.
One of the important drawbacks of this method is that the punctuation has to be dictated, too. I suppose that we could improve this dramatically using machine learning or something similar for sentence segmentation. I'd love to hear about ideas to this effect. Do the folks at Wolfram Research have something like sentence segmentation coming up?
I think that these types of functions might be very useful for research in the digital humanities and social sciences when combined with the power of the Wolfram Language. (Any comments @Vitaliy Kaurov ?) I have much better results, though, with the analysis of captions from videos etc.
I can also see applications for language learning. I would hope that the detection rate gets better as the pronunciation gets better. Perhaps someone with more language-teaching experience could comment? ( @Peter Nilsson )
I have used this transcription in a modified form for some projects now, and it worked as a proof of concept. I am looking forward to the fully featured Mathematica function, whenever it comes out.