How can I remove the formatting from imported RTF (Rich Text Format) files?

Posted 9 years ago

Hi everyone,

How can I strip away the formatting and leave only the text when I import an RTF?

I've imported the file with

Import[myFile, "RTF"]

Gregory

POSTED BY: Gregory Lypny
4 Replies

Gregory:

As stated in the help document, "Import and Export support RTF format Version 1.3." According to the Wikipedia article on RTF (https://en.wikipedia.org/wiki/Rich_Text_Format), version 1.3 is from 1993. A quick Google search for RTF examples turned up this sample file, http://www.jafsoft.com/examples/rtf/testrtf.rtf, which contains a variety of elements, though no images.

Method 1.

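(* Windows only: uses .NET/Link and a WinForms RichTextBox to parse the RTF *)
Needs["NETLink`"]
InstallNET[];
rtfz = NETNew["System.Windows.Forms.RichTextBox"];
(* assign the raw RTF source to the Rtf property *)
rtfz@Rtf = URLFetch["http://www.jafsoft.com/examples/rtf/testrtf.rtf"];
(* read the content back as plain text *)
rtfz@Text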

Method 2.

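Needs["XML`"];
(* convert the imported RTF to symbolic XML, then collect the contents of every "String" element *)
Cases[ToSymbolicXML[
  Import["http://www.jafsoft.com/examples/rtf/testrtf.rtf", "RTF"]], 
 XMLElement["String", _, {mtext_}] -> mtext, Infinity]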

Method 3.

rtfrules = ToExpression[Import["path of saved attached file rtfrules.txt on your system"]];
StringReplace[
 URLFetch["http://www.jafsoft.com/examples/rtf/testrtf.rtf"], 
rtfrules, MetaCharacters -> Automatic]

Here rtfrules is the contents of the attached file. At some point in 2004 I started a set of replacement rules based on RTF 1.6 or 1.7. It is only a starting point: setting all of these RTF control tags to "" is not optimal.
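To give an idea of what such a rule set can look like, here is a minimal sketch. The sample string and the four rules are made up for illustration; they are not the contents of rtfrules.txt and cover only a tiny part of RTF. The rules are tried in order at each position, so the more specific ones come first.

```wolfram
(* Hypothetical sample input; a real file would be read in as a raw string *)
sampleRTF = "{\\rtf1\\ansi{\\fonttbl{\\f0 Arial;}}\\f0\\fs24 Hello, \\b world\\b0 !\\par}";

(* Illustrative rules only, not the attached rtfrules.txt *)
demoRules = {
   (* drop the whole font table group *)
   RegularExpression["\\{\\\\fonttbl.*?\\}\\}"] -> "",
   (* turn paragraph marks into newlines; \b keeps \pard from matching here *)
   RegularExpression["\\\\par\\b ?"] -> "\n",
   (* any other control word, with optional numeric parameter and one trailing space *)
   RegularExpression["\\\\[a-z]+-?[0-9]* ?"] -> "",
   (* remaining group braces *)
   RegularExpression["[{}]"] -> ""
   };

plain = StringReplace[sampleRTF, demoRules]
```

For this sample the result should be "Hello, world!" followed by a newline. Real files need many more cases (escaped braces, \'hh hex characters, embedded objects), which is why the rules approach takes the most work.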

Method 4.

If you are on Windows, install a Generic / Text Only printer driver whose output goes to a file, and set it as the default printer (before starting Mathematica).

nb = CreateDocument[
   Import["http://www.jafsoft.com/examples/rtf/testrtf.rtf", "RTF"]];
NotebookPrint[nb]

The print dialog should pop up so you can save the *.prn file. Set the paper size to "US Std Fanfold" for output 120 characters wide, or "Letter" for 80 characters wide. The resulting .prn file should contain ASCII (ANSI) text; depending on the layout, some of it may be cut off. Open the saved .prn file in a text editor to see whether the output is acceptable.

Method 5.

Do something similar to the .NET method, but with Java; it would need to be a Swing object. I could not test this: it is late and my Java is rusty.

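Needs["JLink`"]
InstallJava[];
(* RTFEditorKit parses RTF into a Swing styled document *)
kit = JavaNew["javax.swing.text.rtf.RTFEditorKit"];
doc = kit@createDefaultDocument[];
rtf = URLFetch["http://www.jafsoft.com/examples/rtf/testrtf.rtf"];
kit@read[JavaNew["java.io.StringReader", rtf], doc, 0];
(* pull the plain text back out of the document *)
doc@getText[0, doc@getLength[]]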

All of these methods are starting points; some would need memory management if applied repeatedly. The replacement rules would require the most work.

RTF is a somewhat dangerous format, because it allows external objects to be embedded.

Attachments:
POSTED BY: Hans Michel
Posted 9 years ago

Hey Hans,

Thanks a load for this. Method 2 works the best and is flexible. The only formatting it leaves behind is the occasional bit of font information wrapping a table here and there. That's easy to get rid of.

Kind regards,

Gregory

POSTED BY: Gregory Lypny

Gregory,

Maybe try something like this:

nb = Last[
  Import["C:\\Users\\YourName\\Desktop\\This is an RTF file.rtf", "Rules"]]

That should open a new notebook that contains the text contents of the RTF file. From there, I think you should be able to programmatically do whatever you want with the text.
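As a rough sketch of the "programmatically" part: once you have a Notebook expression (for an open notebook, NotebookGet[nb] should give you one), you can walk the cells and keep only the strings. The helper name textOf and the sample notebook below are made up for illustration, and the helper only handles plain strings, TextData and StyleBox; anything else is dropped.

```wolfram
(* Hypothetical helper: recursively collect text, ignoring style options *)
textOf[s_String] := s
textOf[list_List] := StringJoin[textOf /@ list]
textOf[TextData[x_]] := textOf[x]
textOf[StyleBox[x_, ___]] := textOf[x]
textOf[Cell[x_, ___]] := textOf[x]
textOf[_] := ""   (* boxes, graphics, etc. contribute nothing *)

(* Made-up notebook expression standing in for an imported RTF file *)
nbExpr = Notebook[{
    Cell["A heading", "Section"],
    Cell[TextData[{"Some ", StyleBox["bold", FontWeight -> "Bold"], 
       " text."}], "Text"]
    }];

(* one plain string per cell *)
cellTexts = textOf /@ First[nbExpr]
```

Going through the structure this way, rather than grabbing every string with Cases, avoids picking up option values such as "Bold" along with the actual text.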

POSTED BY: Tim Mayes
Posted 9 years ago

Thank you, Tim,

I'll look into the stuff on rules. Haven't used that until now.

Gregory

POSTED BY: Gregory Lypny