Introduction
The idea of the project was to find the concepts that co-occur most often with specified mathematical keywords. Research preprints were scraped as a source of math concepts and keywords using the arXiv API, and the most frequently co-occurring words were extracted using regular expressions and basic natural language processing.
Importing Data using ServiceConnect["ArXiv"]
arXiv = ServiceConnect["ArXiv"];
articles = arXiv["Search",{"Query" -> "Circles","MaxItems" -> 1}];
urls = Normal@articles[All,{"URL"}];
iurls = Flatten[Values[urls]];
furls = StringReplace[iurls,"http://arxiv.org/abs/"->"http://arxiv.org/pdf/"];
urldata = Quiet[Import[#,"Plaintext"]&/@ furls];
Importing Data using the arXiv API
url = "http://export.arxiv.org/api/query?search_query=all:circle&\start=0&max_results=1000";
import = Import[url, {"HTML", "XMLObject"}];
paperlinks =
  Cases[import,
   XMLElement["link", {"title" -> "pdf", "href" -> link_, ___}, _] :> link,
   Infinity];
data = Quiet[Import[#, "Plaintext"] & /@ paperlinks];
The data was imported using the arXiv API (http://export.arxiv.org/api/query). By changing max_results in the query, any number n of papers can be imported and processed. Using the Cases function on the XMLElement expressions of the imported page, we collect all of the links tagged "pdf"; these links are then imported as plain text.
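As a small sketch (buildQuery is a hypothetical helper, not part of the original code), the query URL can also be assembled with URLBuild so that the keyword and the number of results are easy to change:
buildQuery[keyword_String, n_Integer] :=
  URLBuild["http://export.arxiv.org/api/query",
   {"search_query" -> "all:" <> keyword, "start" -> "0",
    "max_results" -> ToString[n]}];
buildQuery["circle", 50]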
Cleaning of Data
deleteCases = DeleteCases[data (*|urldata*), $Failed];
whiteSpace =
StringReplace[#, WhitespaceCharacter .. -> " "] & /@ deleteCases;
mapS1 = StringDelete[#,
DigitCharacter ..] & /@ (StringSplit[#, "."] & /@ whiteSpace) //
Flatten;
mapS2 = List[ToLowerCase[#] & /@ mapS1];
The failed cases generated during the import are deleted, runs of whitespace characters are collapsed into single spaces to make the data more readable, the text is split into sentences at the periods, and digit characters are deleted.
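As a small illustration of these steps (the sample string below is made up, not taken from a paper):
sample = "The  unit circle has\nradius 1. Two circles meet in 2 points.";
step1 = StringReplace[sample, WhitespaceCharacter .. -> " "]; (* collapse whitespace *)
step2 = StringSplit[step1, "."]; (* split into sentences at the periods *)
ToLowerCase /@ (StringDelete[#, DigitCharacter ..] & /@ step2) (* drop digits, lower case *)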
Searching for mathematical keywords in the papers
The data was converted to lower case so that the keyword search is case-insensitive, avoiding the overhead of handling different capitalizations of the user's query. The keyword was then searched for using RegularExpression. A regular expression is a sequence of characters that defines a search pattern. The regular expression (regex) we used to find the keyword in the papers is as follows:
regx = StringCases[#, RegularExpression["\\bcircle(s?)\\b"]] & /@ mapS2;
The above regex captures any occurrence of 'circle' or 'circles', provided it is surrounded by word boundaries in the text.
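As a quick check of the pattern (the sample sentence is made up), the word boundaries stop it from matching inside longer words such as "semicircles" or "circled":
StringCases["a circle and two circles, but not semicircles or circled",
 RegularExpression["\\bcircle(s?)\\b"]]
(* {"circle", "circles"} *)
Next, we find the positions of the sentences that contain the keyword: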
pos = Position[#, Except@{}, 1, Heads -> False] & /@ regx // Flatten;
To find co-occurring words, we look at the sentences three positions before and three positions after each sentence containing the keyword, handling the boundaries of the sentence list as follows:
ifs = Which[
     # <= 3, List[# + 3],
     # + 3 > Length[Flatten[mapS2]], List[# - 3],
     True, List[# - 3, # + 3]
     ] & /@ pos // Flatten; (* keep only positions that stay inside the sentence list *)
Extract the sentences at the new positions:
mapS3 = Flatten[mapS2][[#]] & /@ ifs // Flatten;
We then delete the keyword (the word captured by the regex pattern) from these sentences, remove any words shorter than three characters, and delete the stopwords using the built-in DeleteStopwords function.
regxd = StringDelete[#, RegularExpression["\\bcircle(s?)\\b"]] & /@
mapS3 // Flatten;
ss = StringSplit[#, " "] & /@ regxd;
regxd1 = StringCases[#, RegularExpression["[a-zA-Z]+"]] & /@ ss //
Flatten;
mapS4 = Flatten[
DeleteCases[If[StringLength[#] >= 3, List[#]] & /@ regxd1, Null]];
deleteStopwords = List[StringRiffle[DeleteStopwords[#] & /@ mapS4]];
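DeleteStopwords also works directly on a plain string, so its effect is easy to check in isolation (the sentence below is just an example):
DeleteStopwords["the radius of a circle is the distance from the center to the boundary"]
(* drops common words such as "the", "of" and "a" *)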
We can look at the words co-occurring most often with the keyword in a Dataset:
WordCounts[StringRiffle[deleteStopwords]] // Dataset
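If the full table is too long, one optional variation (not part of the original pipeline) is to keep only the most frequent co-occurring words, here the top 25:
TakeLargest[WordCounts[StringRiffle[deleteStopwords]], 25] // Dataset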
Issues
A problem I faced while testing an initial version of the code was efficiency. Some of the built-in functions in the Wolfram Language, when chained together, use a lot of memory on a large dataset. When testing on a small set of papers (4-5) the code finished in a short time, but when run on 1000 papers the machine crashed. We had to profile the code, delete some of it, and rewrite some of the variables: we ran each line, took its AbsoluteTiming, and tried to replace it with something more efficient. Running the result on a Linux machine, I was able to process the papers using about 400 KB of memory instead of 4 GB.
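For reference, a minimal sketch of how a single step can be profiled; AbsoluteTiming is the timing function referred to above, and MaxMemoryUsed is added here as a natural companion for the memory question (the step shown is only an example):
AbsoluteTiming[
 MaxMemoryUsed[
  StringReplace[#, WhitespaceCharacter .. -> " "] & /@ deleteCases]]
(* returns {seconds, peak bytes used while evaluating the step} *)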
Conclusion
This code can, with small changes, be applied to other websites; in fact, much of it was adapted from the WSC19 project we had done, which built a common word list for the Marathi language. One idea to extend this work would be to use computer vision to generate the list from the figures referenced in the papers. When importing the papers from PDF, some characters fail to import and some are not supported, so instead of PDFs I would like to import the LaTeX sources.
I would like to thank Fez for all the help he gave me in completing my project.
If you want to check out the notebook, visit my GitHub repo.
If you want to check out the function, visit the WFR page.