Message Boards Message Boards

How to import an XML file otherwise intended for Microsoft Excel?

Posted 10 years ago

I am unfamiliar with XML except to the extent that I can read XML files into MS Excel and then output them as XLS files which I can then read into Mathematica using the usual Import command. Done this way I get nice tables of all the tabs of the file in an easy to use list format.

However, I would like to access these files (that are made available several times a day) automatically through Mathematica.

When trying to Import the XML format I get objects and elements that suggest a standard Excel.Sheet template that Mathematica doesn't automatically recognize. The Short version of the file is given below:

XMLObject[ "Document"][{XMLObject["Declaration"]["Version" -> "1.0", 
   "Encoding" -> "UTF-8"], 
  XMLObject["ProcessingInstruction"]["mso-application", 
   "progid=\"Excel.Sheet\""]}, 
 XMLElement[
  "Workbook", {OutputSizeLimit`Skeleton[
   1]}, {XMLElement[
    "Worksheet", {{"urn:schemas-microsoft-com:office:spreadsheet", 
       "Name"} -> "ALL_BIDS"}, {XMLElement[
      "Table", {}, {OutputSizeLimit`Skeleton[1]}]}], 
   OutputSizeLimit`Skeleton[6], 
   XMLElement[
    "Worksheet", {OutputSizeLimit`Skeleton[
     1]}, {XMLElement["Table", {}, {OutputSizeLimit`Skeleton[1]}]}]}],
  {}

Can anyone suggest a simple way to automatically read this file into Mathematica? I suppose I could create some AppleScript (I'm on OS X - Yosemite and use Mathematica 10.) that Mathematica could call and that would in tern call Excel and direct it to do the conversion, but i would prefer to learn how to manipulate XML code in Mathematica.

Many thanks in advance for your suggestions, Steve

POSTED BY: Stephen Wilkus
2 Replies

Hi ,

I did a test with this code :

StringSplit[StringReplace[
  xml, "<" ~~ Except[">"] .. ~~ ">" -> ""], "\n"]

where , xml =

"<?xml version=\"1.0\"?>
<ss:Workbook xmlns:ss=\"urn:schemas-microsoft-com:office:spreadsheet\">
    <ss:Worksheet ss:Name=\"Sheet1\">
        <ss:Table>
            <ss:Column ss:Width=\"80\"/>
            <ss:Column ss:Width=\"80\"/>
            <ss:Column ss:Width=\"80\"/>
            <ss:Row>
                <ss:Cell>
                   <ss:Data ss:Type=\"String\">First Name</ss:Data>
                </ss:Cell>
                <ss:Cell>
                   <ss:Data ss:Type=\"String\">Last Name</ss:Data>
                </ss:Cell>
                <ss:Cell>
                   <ss:Data ss:Type=\"String\">Phone Number</ss:Data>
                </ss:Cell>
            </ss:Row>
            <ss:Row>
                <ss:Cell>
                   <ss:Data ss:Type=\"String\">Nancy</ss:Data>
                </ss:Cell>
                <ss:Cell>
                   <ss:Data ss:Type=\"String\">Davolio</ss:Data>
                </ss:Cell>
                <ss:Cell>
                   <ss:Data ss:Type=\"String\">(206)555 9857</ss:Data>
                </ss:Cell>
            </ss:Row>
        </ss:Table>
    </ss:Worksheet>

</ss:Workbook>"

And the result is:

{"    ", "        ", "            ", "            ", "            ", \
"            ", "                ", "                   First Name", \
"                ", "                ", "                   Last \
Name", "                ", "                ", "                   \
Phone Number", "                ", "            ", "            ", "  \
              ", "                   Nancy", "                ", "    \
            ", "                   Davolio", "                ", "    \
            ", "                   (206)555 9857", "                \
", "            ", "        ", "    "}

After that , you just need to clean the list easily.

POSTED BY: Marcelo De Cicco
Posted 10 years ago

Starting with a sample excel xml string taken from http://technet.microsoft.com/en-us/magazine/2006.01.blogtales.aspx :

xml = ImportString["<?xml version=\"1.0\"?>
<ss:Workbook xmlns:ss=\"urn:schemas-microsoft-com:office:spreadsheet\">
    <ss:Worksheet ss:Name=\"Sheet1\">
        <ss:Table>
            <ss:Column ss:Width=\"80\"/>
            <ss:Column ss:Width=\"80\"/>
            <ss:Column ss:Width=\"80\"/>
            <ss:Row>
                <ss:Cell>
                   <ss:Data ss:Type=\"String\">First Name</ss:Data>
                </ss:Cell>
                <ss:Cell>
                   <ss:Data ss:Type=\"String\">Last Name</ss:Data>
                </ss:Cell>
                <ss:Cell>
                   <ss:Data ss:Type=\"String\">Phone Number</ss:Data>
                </ss:Cell>
            </ss:Row>
            <ss:Row>
                <ss:Cell>
                   <ss:Data ss:Type=\"String\">Nancy</ss:Data>
                </ss:Cell>
                <ss:Cell>
                   <ss:Data ss:Type=\"String\">Davolio</ss:Data>
                </ss:Cell>
                <ss:Cell>
                   <ss:Data ss:Type=\"String\">(206)555 9857</ss:Data>
                </ss:Cell>
            </ss:Row>
        </ss:Table>
    </ss:Worksheet>
</ss:Workbook>", "xml" ]

The result is this visually hard to read expression:

XMLObject[Document][{XMLObject[Declaration][Version->1.0]},XMLElement[{urn:schemas-microsoft-
com:office:spreadsheet,Workbook},{{http://www.w3.org/2000/xmlns/,ss}->urn:schemas-microsoft-
com:office:spreadsheet},{XMLElement[{urn:schemas-microsoft-com:office:spreadsheet,Worksheet},{{urn
:schemas-microsoft-com:office:spreadsheet,Name}->Sheet1},{XMLElement[{urn:schemas-microsoft-
com:office:spreadsheet,Table},{},{XMLElement[{urn:schemas-microsoft-
com:office:spreadsheet,Column},{{urn:schemas-microsoft-
com:office:spreadsheet,Width}->80},{}],XMLElement[{urn:schemas-microsoft-
com:office:spreadsheet,Column},{{urn:schemas-microsoft-
com:office:spreadsheet,Width}->80},{}],XMLElement[{urn:schemas-microsoft-
com:office:spreadsheet,Column},{{urn:schemas-microsoft-
com:office:spreadsheet,Width}->80},{}],XMLElement[{urn:schemas-microsoft-
com:office:spreadsheet,Row},{},{XMLElement[{urn:schemas-microsoft-
com:office:spreadsheet,Cell},{},{XMLElement[{urn:schemas-microsoft-
com:office:spreadsheet,Data},{{urn:schemas-microsoft-com:office:spreadsheet,Type}->String},{First
Name}]}],XMLElement[{urn:schemas-microsoft-com:office:spreadsheet,Cell},{},{XMLElement[{urn:schemas-
microsoft-com:office:spreadsheet,Data},{{urn:schemas-microsoft-
com:office:spreadsheet,Type}->String},{Last Name}]}],XMLElement[{urn:schemas-microsoft-
com:office:spreadsheet,Cell},{},{XMLElement[{urn:schemas-microsoft-
com:office:spreadsheet,Data},{{urn:schemas-microsoft-com:office:spreadsheet,Type}->String},{Phone
Number}]}]}],XMLElement[{urn:schemas-microsoft-com:office:spreadsheet,Row},{},{XMLElement[{urn
:schemas-microsoft-com:office:spreadsheet,Cell},{},{XMLElement[{urn:schemas-microsoft-
com:office:spreadsheet,Data},{{urn:schemas-microsoft-
com:office:spreadsheet,Type}->String},{Nancy}]}],XMLElement[{urn:schemas-microsoft-
com:office:spreadsheet,Cell},{},{XMLElement[{urn:schemas-microsoft-
com:office:spreadsheet,Data},{{urn:schemas-microsoft-
com:office:spreadsheet,Type}->String},{Davolio}]}],XMLElement[{urn:schemas-microsoft-
com:office:spreadsheet,Cell},{},{XMLElement[{urn:schemas-microsoft-
com:office:spreadsheet,Data},{{urn:schemas-microsoft-com:office:spreadsheet,Type}->String},{(206)555
9857}]}]}]}]}]}],{}]

In that expression, click over a some useful part of information. For example, click over "Nancy" and then use the "Extend Selection" menu command (CTRL + period) until you have selected the complete XML element. Then copy&paste this into the second argument of a Case command:

Case[xml, <paste here>, Infinity]

Then you can replace the specific pattern with a more general one and extract the data like this

Cases[xml, 
 XMLElement[{"urn:schemas-microsoft-com:office:spreadsheet", 
    "Data"}, {{"urn:schemas-microsoft-com:office:spreadsheet", 
      "Type"} -> "String"}, {name_String}] :> name, \[Infinity]]

Out[] = {"First Name", "Last Name", "Phone Number", "Nancy", "Davolio", "(206)555 9857"}

By using "Extend Selection" a little more, you can get more structure out of the data:

    Cases[xml,
          {XMLElement[{"urn:schemas-microsoft-com:office:spreadsheet","Cell"},
          {},
          {XMLElement[{"urn:schemas-microsoft-com:office:spreadsheet","Data"},
                      {{"urn:schemas-microsoft-com:office:spreadsheet","Type"} -> "String"},
                      {s1_}]}],
           XMLElement[{"urn:schemas-microsoft-com:office:spreadsheet","Cell"},
                      {},
                      {XMLElement[{"urn:schemas-microsoft-com:office:spreadsheet","Data"},
                                  {{"urn:schemas-microsoft-com:office:spreadsheet","Type"} -> "String"},
                                  {s2_}]}],
           XMLElement[{"urn:schemas-microsoft-com:office:spreadsheet", "Cell"},
                      {},
                      {XMLElement[{"urn:schemas-microsoft-com:office:spreadsheet","Data"},
                      {{"urn:schemas-microsoft-com:office:spreadsheet", "Type"} -> "String"},
                      {s3_}]}]} :> {s1, s2, s3},
          Infinity]

Out[]= {{"First Name", "Last Name", "Phone Number"}, {"Nancy", "Davolio",  "(206)555 9857"}}

As a final step, you can experiment simplifying the pattern by replacing some strings with _ (or even __)

    Cases[xml,
          {XMLElement[{_,"Cell"}, {}, {XMLElement[{_,"Data"}, {{_, "Type"} -> "String"}, {s1_}]}],
           XMLElement[{_,"Cell"}, {}, {XMLElement[{_,"Data"}, {{_, "Type"} -> "String"}, {s2_}]}],
           XMLElement[{_,"Cell"}, {}, {XMLElement[{_,"Data"}, {{_, "Type"} -> "String"}, {s3_}]}]} :> {s1, s2, s3},
         Infinity]


Out[]= {{"First Name", "Last Name", "Phone Number"}, {"Nancy", "Davolio",  "(206)555 9857"}}

This procedure is very useful for parsing HTML too.

POSTED BY: Gustavo Delfino
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract