I have recently started working in a part of the (Swedish) academic world called "digital humanities" (DH). I am interested in finding out to what extent Mathematica can be used for working with text in this area. It seems to me that the combination of Internet and database connectivity, string manipulation, algorithmic power, visualization, interactivity, and quick development could make Mathematica quite useful.
There are two methods considered hot in DH that I would like to try to implement. The first is called "Topic Modelling" and is based on an algorithm called Latent Dirichlet Allocation (Blei, Ng & Jordan, 2003, "Latent Dirichlet Allocation"). I know some mathematics, but this is a little too complicated for me, even though I found the article very well written and readable.
The second is a method for finding similar passages in a large set of texts. An implementation of this can be found here. I hope that the built-in functionality for sequence alignment can be used as a basis for an implementation in Mathematica. If so, I suppose that this could be done relatively easily.
Perhaps someone here knows about work already done in Mathematica related to these methods, or has other suggestions?
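To make the sequence-alignment idea concrete, here is a minimal sketch of local (Smith-Waterman style) alignment applied to word tokens rather than characters. It is written in Python purely for illustration and is not the linked implementation; the function name and the scoring values (match, mismatch, gap) are assumptions chosen for the example. In Mathematica the built-in sequence alignment functionality would take the place of the hand-written dynamic programme.

```python
def local_alignment(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment of two token lists.

    Returns (score, aligned_a, aligned_b) for the best-scoring local
    alignment. Gaps are represented by None. Scoring values are
    illustrative assumptions, not tuned parameters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; a cell never drops below zero, which is
    # what makes the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(None)
            i -= 1
        else:
            out_a.append(None); out_b.append(b[j - 1])
            j -= 1
    return best, out_a[::-1], out_b[::-1]


if __name__ == "__main__":
    t1 = "the committee found that the proposed reform would reduce costs".split()
    t2 = "critics note that the proposed reform would increase costs overall".split()
    score, x, y = local_alignment(t1, t2)
    print(score)
    print(x)
    print(y)
```

Aligning every pair of texts this way is quadratic in both the number of texts and their lengths, so for a large collection one would presumably first narrow down candidate pairs (for example by shared word n-grams) and only then align.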
Kind Regards, Sverker Lundin
I should add that if you have developed implementations of ARTFL's tools in Mathematica (or know of someone who has) I'd love to hear about it!
Thank you Arno and Jeff!
As regards Topic Modelling, I have previously worked with different varieties of matrix dimension reduction methods. As I understand MALLET and the discussion around it, it is based on a somewhat more advanced probabilistic model, which generates a more difficult optimization problem. To solve it you need some smart heuristic algorithm, and it is this implementation step that it would be nice to have in Mathematica. As I work now, I do preprocessing and post-processing in Mathematica but leave the Topic Modelling machinery to MALLET. It is a very un-Mathematica-style solution, using text files and external software for mathematics, of all the things that Mathematica could possibly "not" do. :-) I would not be surprised if it is possible to outperform MALLET with smart Mathematica programming, but I cannot do it.
Thank you Arno for the book suggestion! I will go through it carefully, and it seems to me that it will be perfect for a course in Digital Humanities that will be given by the University of Gothenburg next spring (and then probably on a yearly basis).
I have done some work in this area since my post. My present focus is on a collection of Swedish Public Reports. There are about 7000 books, published between 1922 and 2015 (today). I am trying to use Mathematica as a platform for building tools for "navigating" this material, where navigation should be seen as an extension of "searching". I want to make Topic Modelling part of such navigation, as well as various kinds of "word clouds". It is difficult to demonstrate what I have, or to send you anything, because the tool needs to interact with the material, which is several gigabytes in size and needs to be in RAM to get the necessary speed... I have asked WR how I should manage this kind of work on the web, but I suppose there is presently no good solution.
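For readers wondering what the "smart heuristic algorithm" might look like, below is a minimal sketch of collapsed Gibbs sampling for LDA, the sampling-based approach that, as far as I know, MALLET's topic trainer builds on. Note that this differs from the variational method in the original Blei, Ng & Jordan paper. It is plain Python for illustration only, with none of MALLET's optimizations; the function names and the hyperparameter defaults (alpha, beta, iterations) are assumptions made for the example.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    docs is a list of token lists. Returns (vocab, doc_topic, topic_word),
    where doc_topic[d][k] counts tokens of document d assigned to topic k
    and topic_word[k][v] counts how often word v is assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * V for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z = []
        for w in doc:
            k = rng.randrange(num_topics)
            z.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z)

    topics = range(num_topics)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][n]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic, up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + V * beta)
                    for t in topics
                ]
                k = rng.choices(topics, weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return vocab, doc_topic, topic_word


def top_words(vocab, topic_word, n=5):
    """Return the n most frequently assigned words for each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in topic_word
    ]


if __name__ == "__main__":
    docs = [
        "tax budget revenue tax budget".split(),
        "school teacher pupil school".split(),
        "budget revenue tax school".split(),
    ]
    vocab, doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
    print(top_words(vocab, topic_word, n=3))
```

The whole procedure is just counting and resampling, which is why it ports naturally to any language with fast array operations; the hard part in practice is making it fast enough for thousands of books.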
Hi Sverker, for the second part of your question, take a look at this new open source textbook on digital humanities research methods with Mathematica: http://williamjturkel.net/digital-research-methods-with-mathematica/
This is exactly what I needed. Thank you very much!
You're welcome. This is a very old post. I know a lot more now than I knew then. If you wish to get in touch, I may be able to help more.
For what it is worth, I have now adapted topic modeling to numerical data and applied it to microbiomic studies of Alzheimer's disease. See preprint. The methodology might be useful to you. See the part on LDA.
Jeff Lapides, Adjunct Associate Professor, Drexel University, working from Annapolis, MD, jeffrey.firstname.lastname@example.org
Topic modeling is not nearly as complicated as you might think. Here is a good introduction: http://www.youtube.com/watch?v=4p9MSJy761Y.
I have used it extensively to analyze research portfolios of the US Department of Agriculture. Some of the work was done with Mathematica, some outside it with MALLET. Get in touch if you would like to know more.
For the second, you might consider Latent Semantic Analysis (LSA) approaches. One reference I found that shows something along these lines with Mathematica is a 2011 dissertation by Saurav Karmaker.
There are probably other references out there as well. There may also be good methods that could make use of the built-in Nearest function.
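As a rough illustration of the nearest-neighbour idea, here is a minimal Python sketch that turns each passage into a TF-IDF weighted word vector and ranks the other passages by cosine similarity. This is a simpler stand-in rather than LSA proper: LSA would add a truncated singular value decomposition of the term-document matrix before comparing vectors. The function names and the smoothed IDF formula are assumptions made for the example; in Mathematica the ranking step is the kind of job Nearest could do.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) from tokenized documents."""
    n = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF so that words occurring everywhere keep a small weight.
        vectors.append(
            {w: c * (math.log((1 + n) / (1 + df[w])) + 1) for w, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[w] for w, weight in u.items() if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(index, vectors, k=3):
    """Return up to k (similarity, other_index) pairs, most similar first."""
    scored = [
        (cosine(vectors[index], vec), i)
        for i, vec in enumerate(vectors)
        if i != index
    ]
    return sorted(scored, reverse=True)[:k]


if __name__ == "__main__":
    passages = [
        "topic models for text collections".split(),
        "topic models for public reports".split(),
        "sequence alignment of similar passages".split(),
    ]
    vectors = tfidf_vectors(passages)
    print(nearest(0, vectors, k=2))
```

Comparing every passage with every other one scales quadratically, so for thousands of books a proper nearest-neighbour index would be needed instead of the brute-force loop shown here.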