Reading italic lines from text file

Posted 3 years ago

I have written a Mathematica program to read in titles with EndNote tags using a custom EndNote Style and Word. But I seem to have to convert the Word file to plain text for Mathematica input. I have about 10,000 titles to scan. I want to isolate those that have two consecutive words in italics. Those most likely represent Genus species. Is there any way to do this?

POSTED BY: Richard Gordon
5 Replies
Posted 3 years ago

Oh, sorry, I just now realized that you want the titles, not just the binomials. We'll need to figure out how to split the cell contents into separate titles (maybe split on line returns?) and then select the ones that contain a StyleBox that satisfies our condition. Out of time at the moment, but I'll try to remember to revisit this later.
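
One possible way to do that selection, as a rough sketch rather than a tested solution. It assumes each title ends up in its own Cell of the imported notebook expression, which may not be how a given RTF file imports. The small notebook expression toyNotebook is invented here to stand in for the real import, and the helper name twoWordItalicQ is hypothetical.

```wolfram
(* Toy stand-in for the imported notebook; the real structure may differ *)
toyNotebook = Notebook[{
    Cell[TextData[{"Genome of ",
       StyleBox["Thermus thermophilus", FontSlant -> "Italic"],
       " strain HB8"}], "Text"],
    Cell[TextData[{StyleBox["What is life", FontSlant -> "Italic"],
       " revisited"}], "Text"],
    Cell["A title with no italics", "Text"]}];

(* True for an italic StyleBox whose string content is exactly two words *)
twoWordItalicQ[StyleBox[s_String, ___, FontSlant -> "Italic", ___]] :=
  Length[StringSplit[s]] == 2
twoWordItalicQ[_] := False

(* Assumption: one title per Cell *)
titleCells = Cases[toyNotebook, _Cell, Infinity];

(* Keep only the cells that contain at least one two-word italic box *)
selectedCells = Select[titleCells, ! FreeQ[#, _?twoWordItalicQ] &]
```

With the toy data, only the first cell should survive, since the second has a three-word italic phrase and the third has no italics. If several titles share one cell, the cell contents would need to be split first.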

POSTED BY: Eric Rimbey
Posted 3 years ago

Okay, here's what I've tried so far. Seems to work, but I'm worried about how long it will take for 10000 titles. So, you may need to figure out how to improve performance.

Assuming you have saved your Word doc in RTF format, start by importing it into Mathematica:

rtfNotebook = Import[pathToFile]

This will give you a notebook expression, which is interesting. You could open it as a notebook and start working with it as if it were just another Mathematica notebook. But you can also just manipulate the notebook expression directly (which is what I'm going to do below).

Inspecting this notebook, it does look like anything that was italicized in the original is now represented as a StyleBox with the option FontSlant->"Italic". We can extract all of these with Cases:

Cases[rtfNotebook, StyleBox[___, FontSlant -> "Italic", ___], Infinity]

Here's one example of what you'll see in the result:

StyleBox["Thermus thermophilus", FontFamily -> "TimesNewRomanPS-ItalicMT", FontSize -> 18, FontSlant -> "Italic"]

All we care about is the string content part of the StyleBox, which should always be the first element. We can get Cases to extract that for us as well. While we're at it, let's save the result in a variable:

italicizedPhrases = Cases[rtfNotebook, StyleBox[content_, ___, FontSlant -> "Italic", ___] :> content, Infinity]
(* {"Thermus thermophilus", "What is life", <<7>>, "Thioreductor micantisoli"} *)

You want two-word phrases, so let's use Select to find those:

candidateBinomials = Select[italicizedPhrases, (2 == WordCount[#]) &]
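
If WordCount turns out to be slow on 10,000 titles, a plain string pattern might be a cheaper check. The sketch below assumes a binomial looks like one capitalized word followed by one lowercase word, which is only a heuristic (it would miss hyphenated epithets, for example). The name binomialPattern and the sample list are made up here for illustration.

```wolfram
(* Invented sample, mimicking the kind of phrases extracted above *)
samplePhrases = {"Thermus thermophilus", "What is life",
   "Thioreductor micantisoli", "in vitro"};

(* Heuristic: capitalized genus, whitespace, lowercase species epithet *)
binomialPattern = RegularExpression["[A-Z][a-z]+\\s+[a-z]+"];

(* StringMatchQ requires the whole string to match the pattern *)
Select[samplePhrases, StringMatchQ[#, binomialPattern] &]
(* {"Thermus thermophilus", "Thioreductor micantisoli"} *)
```

This also drops two-word italic phrases such as "in vitro" that are not Genus species names.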

I don't know if further processing will be necessary, but I was curious about whether Mathematica recognized these as species:

species = SemanticInterpretation /@ candidateBinomials

It did indeed! Here's an example of one of the entities it returned:

Entity["Species", "Species:ThermusThermophilus"]
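
Not every candidate will necessarily come back as a species; an interpretation can fail or land on some other entity type. A minimal sketch of keeping only the species entities, using an invented list (sampleResults) in place of the real output so it runs without a network connection:

```wolfram
(* Invented stand-in for the result of SemanticInterpretation /@ candidateBinomials *)
sampleResults = {Entity["Species", "Species:ThermusThermophilus"],
   $Failed, Entity["Country", "Canada"]};

(* Keep only the entities whose type is "Species" *)
confirmedSpecies = Cases[sampleResults, Entity["Species", _]]
(* {Entity["Species", "Species:ThermusThermophilus"]} *)
```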

I wondered what Mathematica might know about species:

EntityProperties["Species"]

Looks like it might know the taxonomic sequence, so I tried that:

EntityValue[species, EntityProperty["Species", "TaxonomicSequence"], "EntityAssociation"]

From this I learned that Thermus thermophilus is a bacterium with this taxonomic sequence: bacteria -> hadobacteria -> Deinococci -> Thermales -> Thermaceae -> Thermus -> Thermus thermophilus

POSTED BY: Eric Rimbey

Y2022m08d12, Alonsa, Manitoba, Canada Dear Eric, I checked, and Mathematica is right about this species. Give me a few days to digest what you've written. By the way, I'm working on:

Gordon, R., Deb, M. and Gordon, N.K. (2023) Origin of Life via Archaea: Shaped Droplets to Archaea First, With a Compendium of Archaea Micrographs [OOLA, Volume in the series Astrobiology Perspectives on Life of the Universe, Eds. Richard Gordon & Joseph Seckbach, in preparation]. Wiley-Scrivener, Beverly, Massachusetts, USA.

Your followup suggests more than programming interest!

Thanks. Yours, -Richard (Dick) Gordon DickGordonCan@protonmail.com RichardGordonCan@xplornet.com Talk: https://meet.jit.si/DickGordonMeeting (arrange time first by e-mail or holler if I'm on) http://orcid.org/0000-0003-4970-9953 Canada: 1-(204) 767-2164 http://tinyurl.com/RichardGordonBooks fertilizer: https://www.youtube.com/watch?v=LMG4kuEN_kM

POSTED BY: Richard Gordon
Posted 3 years ago

Here's what I would try, but without specific data to test with, I can't be sure how successful this will be.

Save the doc as rich text (.rtf). From Mathematica, use the Import function to import the rich text doc. I think that what you'll end up with will be an expression that has the formatting stuff represented as StyleBox expressions. From that point, it's a matter of writing a function that will recognize those StyleBox expressions that are used for the species binomial. With that function, you would then search (with Cases maybe) the whole imported expression for the matching StyleBox expressions.

Alternatively, Mathematica has SemanticImport and SemanticInterpretation and other related functions. This feels like a stretch, but maybe there is some way to use those that just automatically picks out species binomials from a given text.

Maybe if you provided a snippet of the document we could test these ideas.

POSTED BY: Eric Rimbey

Dear Eric, Kind of you to offer! I'm a rank Mathematica amateur. It took me weeks to write a program that created a list of titles, with the words cyclically permuted. I selected some titles with and without italicized Genus species, showing the desired output. I own Mathematica 12.1.

Thanks. Yours, -Richard (Dick) Gordon

POSTED BY: Richard Gordon