Counting lines in a large string imported to Mathematica?

Posted 2 years ago

I am trying to bring a large HyperCard database into Mathematica: nearly 100 megabytes of text and metadata, broken up into smaller elements called cards.

The metadata counts the number of lines (strings ending in the newline character) in each card.

How do I do this in Mathematica once I've loaded some of the data into a Mathematica variable?

POSTED BY: Lewis Robinson
6 Replies
Posted 2 years ago

You could try CreateSearchIndex.
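A minimal sketch of the CreateSearchIndex route, assuming each card has already been exported as a plain-text file into a single folder. The folder name below is hypothetical, and "STING" is only a sample query.

```wolfram
(* Hypothetical folder holding one plain-text file per card *)
cardDir = FileNameJoin[{$HomeDirectory, "Desktop", "cards"}];

(* Build the index over every file in the folder and its subfolders *)
index = CreateSearchIndex[cardDir];

(* Query the index; the result is a SearchResultObject listing the matching files *)
results = TextSearch[index, "STING"]
```

Building the index is the slow step and only has to be done once; later queries go to the index rather than rescanning the files.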

Perhaps a better approach would be to generate an HTML document for each card and, within that document, have links to the cross-referenced documents, glossary, index, etc. Turn the HyperCard DB into a website.

POSTED BY: Rohit Namjoshi
Posted 2 years ago

Rohit: I appreciate your interest, and a much longer reply will be forthcoming tomorrow. Thank you.

POSTED BY: Lewis Robinson
Posted 2 years ago

Rohit:

Here is what I planned to do, along with a snag I think is present. I wrote to Wolfram himself; he gives out his email.

Dr. Wolfram:

I am a retired neurologist who is trying to move a large (for me) database in HyperCard to Mathematica. One HyperCard ‘stack’ contains 16,000+ cards taking up 79 megabytes, another 23K in 16 megabytes, and so on. I was impressed that you can find what you want in about 100K of your notebooks. I don’t think it would be a problem to convert each card to a uniquely named notebook.

Your search system as described in “Adventures of a Computational Explorer” seems to be by topic and ideas. That won’t work for me — I need to search for strings in stacks.

Here is an example showing why.

We understand relatively little about the way the brain works and cellular biochemistry. One of this morning’s HyperCard entries involves Nature volume 596 pp. 570 - 575 ’21 which is about a rare neurological disorder called Niemann Pick disease. They found that one of the mutations causing the disease (in a protein called NPC1) causes changes in the way another protein involved in viral defense called STING is handled by the cell.

This was totally unexpected and a priori unpredictable. It is also exactly what makes reading about cellular biochemistry/physiology so fascinating.

So in HyperCard I just put in a link between one of my cards for Niemann Pick (#10992) and that for STING (#4955). It’s clear that I could do that in Mathematica using a button. But to find #4955 I had to search for STING in the whole 79 megabytes. I think this would likely take a long time in a folder containing 16K+ notebooks. Or does Mathematica have a way around this?

Any help you or your staff could give would be greatly appreciated.

Lewis Robinson M. D.

POSTED BY: Lewis Robinson
Posted 2 years ago

Rohit, thank you very much. After posting, I solved the problem another way with the very unintuitive

ReadList["/Users/lewisrobinson/Desktop/  Cards  9461  to  9462", String]

Here's what it gives me:

{"Total Number of Cards =  2 ", "Begin  card \"X1766697\" ", "card id \
1897041 ", "Nature vol. 316 p 457 - 460 '85 ", "4 ", "\"Choroid \
plexus tumors\" ", "1711804    Xref ", "\"SV40\" ", "1433061    \
glossary ", "\"SV40 -- moreInfo\" ", "2104481    Xref ", \
"\"Transgenic animals\" ", "1512732    \"glossary\" ", "1 ", \
"Oncogenes", "1 ", "       Transgenic mice with SV40 early genes die \
of choroid plexus papillomas at age 3 -5 months. Elevated SV40 T \
antigen and SV40 mRNA are present in affected tissues.  If the SV40 \
enhancer region is deleted, there is peripheral neuropathy rather \
than choroid plexus tumors, and there are liver and pancreatic tumors \
as well.  ", "End  card \"X1766697\" ", "Begin  card \"X1773629\" ", \
"card id 1903962 ", "Nature vol. 316 pp 596 - 605 '85 ", "1 ", \
"\"Repressor\" ", "1279797    glossary ", "1 ", "Gene expression", "1 \
", "      Good evidence exists that by changing one face of one alpha \
helix, you can alter the specificity of DNA binding proteins (lambda \
repressor, cro repressor, P22 activator).   Crystal structures of the \
models show that this helix lies in the major dense groove of the DNA \
binding site.  The other 3.5 alpha helices of the repressors remain \
unchanged.  One face of the helix barrel is enough. ", "End  card \
\"X1773629\" "}

I'm new to Mathematica, but either should solve the basic transfer problem. The metadata is quite stylized as you can see, and I can deal with each card separately, once the data is broken down into strings.

I have 16K cards, 22K indexes, and 8K glossary items. HyperCard allows you to put items of a similar type (cards, indexes, glossary) in stacks, and you can easily move between items in a given stack. Do you think I'll have a problem with nearly 50K notebooks (one per item)? I do love the way Mathematica has programmed links for you, as I estimate I have 80K of these, linking items in different stacks together.

POSTED BY: Lewis Robinson
Posted 2 years ago

ReadList with String returns a List of strings, one per line, rather than a single String. Give your file a .txt extension and Import will import it as a single String.
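The difference can be checked on a small in-memory sample. This is only a sketch: the text is made up, and StringToStream stands in for a real file so that nothing has to exist on disk.

```wolfram
(* Made-up sample standing in for the contents of a card file *)
text = "Begin card \"A\"\nfirst line\nEnd card \"A\"\n";

(* ReadList with String reads one record per line and returns a list *)
stream = StringToStream[text];
lines = ReadList[stream, String];
Close[stream];

Head[lines]     (* List *)
Length[lines]   (* 3 *)

(* With the text held as one string, the line count is the number of newlines *)
StringCount[text, "\n"]   (* 3 *)
```

Either form can answer the line-count question: Length on the list, or StringCount on the single string.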

You are still left with the problem of splitting that string into separate strings, one for each card. That is what my previous answer did. Once you have them, you can process each string (card) to extract IDs, cross-references, names, etc.
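As a rough illustration of that per-card processing, here is a sketch that pulls numeric IDs out of one card string. The card text is a made-up fragment in the shape of the sample in this thread, and the meaning attached to each field (card id, Xref) is an assumption read off that sample.

```wolfram
(* Made-up card body shaped like the sample in this thread *)
card = "\"X1766697\"\ncard id 1897041\nNature vol. 316 p 457 - 460 '85\n1711804    Xref\n1433061    glossary";

(* Card id: the digits that follow "card id" *)
cardID = First[StringCases[card,
   "card id" ~~ Whitespace ~~ id : (DigitCharacter ..) :> id]]
(* "1897041" *)

(* Cross-reference ids: digits followed by whitespace and "Xref" *)
xrefs = StringCases[card,
  id : (DigitCharacter ..) ~~ Whitespace ~~ "Xref" :> id]
(* {"1711804"} *)
```

The IDs come back as strings; ToExpression or FromDigits would convert them to integers if that is more convenient.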

How are you planning to represent the card data in WL and what UI do you plan to construct on top of it to permit easy navigation?

POSTED BY: Rohit Namjoshi
Posted 2 years ago

Hi Lewis,

It's not clear exactly what you are trying to do. Using the sample data you gave in this question, I saved it to a text file (attached). Then

cards = Import["~/Downloads/hypercard.txt"]
cardList = cards // 
  StringCases["Begin card" ~~ c : Shortest[___] ~~ "End card" :> StringTrim@c]

Gives a list of strings where each string is the text between "Begin card" and "End card".
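The ReadList output shown elsewhere in this thread has two spaces in "Begin  card" and "End  card", so a pattern with a single literal space may not match the real export. The sketch below is a variant that tolerates any run of whitespace and then counts the lines in each card, which was the original question. The sample text is made up.

```wolfram
(* Made-up sample with the double spaces seen in the exported data *)
text = "Total Number of Cards =  2 \nBegin  card \"A\" \nfirst line\nsecond line\nEnd  card \"A\" \nBegin  card \"B\" \nonly line\nEnd  card \"B\" \n";

(* Split into cards, allowing any amount of whitespace inside the markers *)
cards = StringCases[text,
   "Begin" ~~ Whitespace ~~ "card" ~~ body : Shortest[___] ~~
     "End" ~~ Whitespace ~~ "card" :> StringTrim[body]];

Length[cards]   (* 2 *)

(* Lines per card; the first line of each is the card name left over from the "Begin" line *)
lineCounts = Length[StringSplit[#, "\n"]] & /@ cards
(* {3, 2} *)

(* Lines in the whole string: one per newline character *)
StringCount[text, "\n"]   (* 8 *)
```

Subtract one from each count if the card-name line should not be included.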

If that example is not representative of the data you are working with, please provide a sample.

POSTED BY: Rohit Namjoshi