As regards topic modelling, I have previously worked with various matrix dimension reduction methods. As I understand MALLET and the discussion around it, it is based on a somewhat more advanced probabilistic model, which leads to a harder optimization problem. Solving it requires some smart heuristic algorithm, and it is this implementation step that would be nice to have in Mathematica. As I work now, I do the preprocessing and post-processing in Mathematica but leave the topic-modelling machinery to MALLET. It is a very un-Mathematica-style solution, relying on text files and external software for mathematics, of all the things that Mathematica could possibly "not" do. :-) I would not be surprised if it is possible to outperform MALLET with smart Mathematica programming, but I cannot do it myself.
Thank you, Arno, for the book suggestion! I will go through it carefully, and it seems to me that it will be perfect for a course in Digital Humanities that will be given by the University of Gothenburg next spring (and then probably on a yearly basis).
I have done some work in this area since my post. My present focus is on a collection of Swedish Public Reports: about 7000 books, published between 1922 and 2015 (today). I am trying to use Mathematica as a platform for building tools for "navigating" this material, where navigation should be seen as an extension of "searching". I want to make topic modelling part of such navigation, as well as various kinds of "word clouds". It is difficult to demonstrate what I have, or to send you anything, because the tool needs to interact with the material, which is several gigabytes in size and needs to be held in RAM to get the necessary speed. I have asked WR how I should manage this kind of work on the web, but I suppose there is presently no good solution.
Sverker Lundin