Message Boards Message Boards


Correlating Concepts with Math Formulas

Posted 2 months ago
4 Replies
13 Total Likes


The idea of the project was to find the concepts that have the highest co-occurrence with specified Mathematical keywords. The data was scraped from research preprints as a source for math concepts and keywords using the arXiv API, and the highest occurring words were found using regular expressions, and processing using basic natural language processing.

Importing Data using the arXiv API

url  = "\
import = Import[url, {"HTML", "XMLObject"}];
paperlinks = 
     "link", {"title" -> "pdf", "href" -> link_, ___, ___}, _] :> 
    link, Infinity];
ptpl = Quiet[Import[#, "Plaintext"] & /@ paperlinks];

The data was imported using the API which can be found here. With a change in the max_results in the API, “n” number of papers can be imported and processed. With using the Cases function on the XMLElements of the page, we download all of the links with the tags “pdf”. The links are then imported in plaintext.

Cleaning of Data

deleteCases = DeleteCases[ptpl, $Failed];
whiteSpace = 
  StringReplace[#, WhitespaceCharacter .. -> " "] & /@ deleteCases;
mapS1 = StringDelete[#,
      DigitCharacter ..] & /@ (StringSplit[#, "."] & /@ whiteSpace) //
mapS2 = List[ToLowerCase[#] & /@ mapS1];

The failed cases are deleted which are generated during the import, the whitespace characters are replaced to make the data more presentable, and the digit characters are deleted.

Searching for Mathematical keywords in the paper

The data was converted to lower case to avoid the overhead of defining different switch cases to handle for user query input. The keywords were then searched using RegularExpression. A regular expression is a sequence of characters that define a search pattern. The RegularExpression (regex) we used to find out the words in the papers is as follows:

mapS2 = List[ToLowerCase[#] & /@ mapS1];
regx = StringCases[#, RegularExpression["\\bcircle(s?)\\b"]] & /@ mapS2;

The above regex captures any occurrences of ‘circle’ or ‘circles’, given that it is surrounded by a word boundary in the data, next we have to find out the position of the sentences containing ‘circle’ or ‘circles’

pos = Position[#, Except@{}, 1, Heads -> False] & /@ regx // Flatten;

To find occurrences of the associating word, we can consider the preceding and succeeding three sentences around the position of the word, with conditions as follows:

ifs = If[# > 3,
      List[# - 3, # + 3]
      , If[# <= 3, List[# + 3]]
      , If[# >= Length[mapS2 // Flatten],
       List[# - 3]]] & /@ pos // Flatten;

Extract the sentences using the new positions

mapS3 = Flatten[mapS2][[#]] & /@ ifs // Flatten;

We then delete the keyword (the word captured by the regex pattern) from the papers, remove the words if their length is less than three, and then delete the stopwords using the built-in function.

regxd = StringDelete[#, RegularExpression["\\bcircle(s?)\\b"]] & /@ 
    mapS3 // Flatten;
ss = StringSplit[#, " "] & /@ regxd;
regxd1 = StringCases[#, RegularExpression["[a-zA-Z]+"]] & /@ ss // 
mapS4 = Flatten[
   DeleteCases[If[StringLength[#] >= 3, List[#]] & /@ regxd1, Null]];
deleteStopwords = List[StringRiffle[DeleteStopwords[#] & /@ mapS4]];

We can look at the words co-occurring most often with the keywords in a Dataset

WordCounts[StringRiffle[deleteStopwords]] // Dataset

The associating words with 'circles' in 1000 papers

A problem I faced while testing an initial version of the code was that of efficiency.

Some of the inbuilt functions in the Wolfram Language take up a lot of memory when used for a large dataset. When testing on a small set of papers (4-5) the code finished execution in a short time interval, but when tested on 1000 papers, the machine crashed. We had to profile the code, delete some code, and rewrite some of the variables. We ran each line of code took the absolute timing, tried to replace it with something more efficient, and running it on a Linux machine I was able to do it using 400Kb of memory instead of 4Gb.


This code can mostly be used for any other website, in fact, we referenced most of the code to the WSC19 project we had done, which was to find out the common word list for the Marathi language. An idea to extend this work would be using computer vision to generate the list using the reference figures in the papers. While importing the papers in PDF, some of the characters fail to import and some characters aren’t supported. So, instead of using PDF’s, I would like to import them in LaTeX. I would like to thank Fez for all the help he gave me in completing my project.

If you want to check out the notebook, visit my GitHub repo.

4 Replies

Thank you very much for sharing this interesting approach! I can't wait trying it. And:

... running it on a Linux machine I was able to do it using 400Kb of memory instead of 4Gb.

Yes - this is a good point and important to mention! I always advocate Linux.

Posted 2 months ago

Agreed :)


running it on a Linux machine I was able to do it using 400Kb of memory instead of 4Gb.

I’m hoping that someone from Wolfram Research looks at your code and implementation to see what could cause such a dramatic difference! A factor of 1000 is memory usage is simply unacceptable!

A first step in this direction would be to sprinkle MaxMemoryUsed[] in several places to see where/why it jumps.

Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
or Discard

Group Abstract Group Abstract