Scraping HTML Using StringCases

Posted 4 years ago

Intro

I am helping a friend gather the same data points from a large number of web pages that have the same structure, but inconsistent HTML syntax. So, I thought Wolfram's pattern matching capabilities would make this easy. And it has, to a certain extent.

I have a solution now, but I would like to know if there is a better way. I would appreciate any constructive criticism you might have.

Challenge

Pull a handful of field labels from over a thousand HTML pages. For testing purposes, here are a few of the variations that appear in the files:

html = "
<input type=\"radio\" id=\"field_1\" name=\"choices\" value=\"1\">
<label for=\"field_1\">Field 1</label>
<input type=\"radio\" id=\"field_2\" name=\"choices\" value=\"2\">
<label 
for=\"field_2\">Field 2</label>
<input type=\"radio\" id=\"field_3\" name=\"choices\" value=\"3\">
<label for
=\"field_3\">Field 3</label>

<input type=\"radio\" id=\"field_4\" name=\"choices\" value=\"4\">
<label for='field_4'>Field 4</label>
<input type=\"radio\" id=\"field_5\" name=\"choices\" value=\"5\">
<label 
for='field_5'>Field 5</label>
<input type=\"radio\" id=\"field_6\" name=\"choices\" value=\"6\">
<label for
='field_6'>Field 6</label>

<input type=\"radio\" id=\"field_7\" name=\"choices\" value=\"7\">
<label for=field_7>Field 7</label>
<input type=\"radio\" id=\"field_8\" name=\"choices\" value=\"8\">
<label 
for=field_8>Field 8</label>
<input type=\"radio\" id=\"field_8\" name=\"choices\" value=\"9\">
<label for
=field_9>Field 9</label>
";

Solution

Our first challenge is that attribute names (keys) and values might be single quoted, double quoted, or not quoted at all (in the test data above it is the values that vary). To handle this, let's create a pattern for HTML quotes:

htmlQuoteCharacterPattern = "\"" | "'";

Next, we know that attribute names must always be followed by whitespace, an equals sign, or a >. So, we need another pattern. This time we use a lookahead regular expression:

htmlEndOfAttributeKeyPattern = RegularExpression["(?=(>|\\s|=))"];

And we know that attribute values end with either whitespace or a >. Again, we use a lookahead regular expression:

htmlEndOfAttributeValuePattern = RegularExpression["(?=(>|\\s))"];

Now we have almost everything we need to create patterns for attribute keys. But, to future-proof our solution, I want to create a pattern for the attribute key that allows me to specify either a string or a pattern for the key name:

htmlQuotedAttributeKeyPattern[pattern_?matchStringPatternQ] := StringExpression[
    WhitespaceCharacter ...,
    htmlQuoteCharacterPattern,
    pattern,
    htmlQuoteCharacterPattern,
    htmlEndOfAttributeKeyPattern
];  

htmlUnquotedAttributeKeyPattern[pattern_?matchStringPatternQ] := StringExpression[
    WhitespaceCharacter ...,
    pattern,
    htmlEndOfAttributeKeyPattern
];

htmlAttributeKeyPattern[pattern_?matchStringPatternQ] := 
    (htmlQuotedAttributeKeyPattern[pattern] | htmlUnquotedAttributeKeyPattern[pattern]);

As you can see, these functions accept either a String or a StringExpression. So, let's create the test that checks for that:

matchStringPatternQ = MatchQ[#, _String | _StringExpression] &;
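
The tests in the next section call htmlAttributeKeyValuePattern, which is not defined anywhere in this post. Here is a minimal sketch of what it might look like, assuming a value is either wrapped in quote characters or bare, and that key and value are separated by an equals sign with optional whitespace on either side. The names optionalWhitespace, htmlAttributeValuePattern, and htmlAttributeKeyValuePattern are hypothetical; the other definitions are repeated from above so the block runs on its own:

```wolfram
(* Definitions repeated from the post so this block runs on its own *)
matchStringPatternQ = MatchQ[#, _String | _StringExpression] &;
htmlQuoteCharacterPattern = "\"" | "'";
htmlEndOfAttributeKeyPattern = RegularExpression["(?=(>|\\s|=))"];
htmlEndOfAttributeValuePattern = RegularExpression["(?=(>|\\s))"];

(* Hypothetical helper: zero or more whitespace characters *)
optionalWhitespace = (WhitespaceCharacter ...);

(* Same key pattern as in the post, written as one definition *)
htmlAttributeKeyPattern[pattern_?matchStringPatternQ] :=
  (optionalWhitespace ~~ htmlQuoteCharacterPattern ~~ pattern ~~
     htmlQuoteCharacterPattern ~~ htmlEndOfAttributeKeyPattern) |
   (optionalWhitespace ~~ pattern ~~ htmlEndOfAttributeKeyPattern);

(* Hypothetical: a value is either quoted or bare, and must be
   followed by whitespace or ">" *)
htmlAttributeValuePattern[pattern_?matchStringPatternQ] :=
  (htmlQuoteCharacterPattern ~~ pattern ~~ htmlQuoteCharacterPattern ~~
     htmlEndOfAttributeValuePattern) |
   (pattern ~~ htmlEndOfAttributeValuePattern);

(* Hypothetical: key, optional whitespace, "=", optional whitespace, value *)
htmlAttributeKeyValuePattern[key_?matchStringPatternQ,
   value_?matchStringPatternQ] :=
  htmlAttributeKeyPattern[key] ~~ optionalWhitespace ~~ "=" ~~
   optionalWhitespace ~~ htmlAttributeValuePattern[value];

(* Usage on a single label, in the same style as the tests in the post *)
StringCases[
 "<label for='field_5'>Field 5</label>",
 "<label" ~~ htmlAttributeKeyValuePattern["for", "field_" ~~ DigitCharacter] ~~
   ">" ~~ s : Except["<"] .. :> s
 ]
```

Like the key patterns above, this sketch does not require the opening and closing quote characters to be the same, so it would also accept a value that opens with ' and closes with ".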

Testing the Solution

We can now test our solution against the data above:

StringCases[
    html, 
    "<label" ~~ htmlAttributeKeyValuePattern["for", "field_" ~~ DigitCharacter] ~~ ">" ~~ s : Except["<"] .. :> s
] 

returns, as expected:

{"Field 1", "Field 2", "Field 3", "Field 4", "Field 5", "Field 6", "Field 7", "Field 8", "Field 9"}

And,

StringCases[
    html, 
    "<label" ~~ htmlAttributeKeyValuePattern["for", "field_5"] ~~ ">" ~~ s : Except["<"] .. :> s
    ] 

returns, as expected:

{"Field 5"}

Conclusion

I am mostly happy with the above solution, but I know I will need to continue to extend it. While I will not have to handle every type of HTML, I do need a robust way to surgically extract interesting pieces of information.

Before I spend too much additional time extending this solution, I would like to know if I'm on the right track.

I would also like to know if this is something that others have to do often as well. Would it be worth polishing this to the point that it can be published in the Resource Function Repository?

Thanks for reading. Feedback on this approach will be greatly appreciated.

Have a great and safe week.

POSTED BY: Mike Besso
9 Replies
Posted 4 years ago

Hi Mike,

Might be easier to manipulate by parsing the HTML as XML e.g.

xml = ImportString[html, {"HTML", "XMLObject"}];
Cases[xml, XMLElement["label", l___] :> l, Infinity]

(*
{{"for" -> "field_1"}, {"Field 1"}, {"for" -> "field_2"}, {"Field 2"}, {"for" -> "field_3"}, {"Field 3"}, 
 {"for" -> "field_4"}, {"Field 4"}, {"for" -> "field_5"}, {"Field 5"}, {"for" -> "field_6"}, {"Field 6"}, 
 {"for" -> "field_7"}, {"Field 7"}, {"for" -> "field_8"}, {"Field 8"}, {"for" -> "field_9"}, {"Field 9"}}
*)
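
If only the pairing of the for attribute with the label text is needed, the attribute list and the children can be destructured directly in the pattern. A minimal sketch, assuming each label has a single string child as in the output above; sample and sampleXml are hypothetical names, and the sample is a shortened stand-in for the test data:

```wolfram
(* Shortened stand-in for the test data, with one double quoted
   and one single quoted attribute value *)
sample = "<input type=\"radio\" id=\"field_1\"><label for=\"field_1\">Field 1</label>
<input type=\"radio\" id=\"field_2\"><label for='field_2'>Field 2</label>";

sampleXml = ImportString[sample, {"HTML", "XMLObject"}];

(* Match label elements that carry a "for" attribute and exactly one
   string child, and return rules of the form id -> label text *)
Cases[
 sampleXml,
 XMLElement["label", {___, "for" -> id_String, ___}, {text_String}] :>
  (id -> text),
 Infinity
 ]
```

Labels without a for attribute, or with nested markup instead of plain text, simply do not match this pattern and are skipped.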
POSTED BY: Rohit Namjoshi
Posted 4 years ago

Rohit:

Thank you for that suggestion. I did not go that way because the HTML in the source files does not follow the XHTML standard.

For example, my test case HTML is ill-formed in that the input tags are not closed. That said, it seems that Wolfram can handle it.

I will check it out.

THANKS

POSTED BY: Mike Besso
Posted 4 years ago

Just to follow up, Wolfram does handle the ill-formed XML.

Using the test data above,

Cases[xml, XMLElement["input", l___] :> l, Infinity]

Returns:

{{"type" -> "radio", "id" -> "field_1", "name" -> "choices", 
  "value" -> "1"}, {}, {"type" -> "radio", "id" -> "field_2", 
  "name" -> "choices", "value" -> "2"}, {}, {"type" -> "radio", 
  "id" -> "field_3", "name" -> "choices", 
  "value" -> "3"}, {}, {"type" -> "radio", "id" -> "field_4", 
  "name" -> "choices", "value" -> "4"}, {}, {"type" -> "radio", 
  "id" -> "field_5", "name" -> "choices", 
  "value" -> "5"}, {}, {"type" -> "radio", "id" -> "field_6", 
  "name" -> "choices", "value" -> "6"}, {}}

However, the match is still case sensitive, whereas HTML tag and attribute names are not. StringCases allows us to use the IgnoreCase option. Can we tell Cases and XMLElement to ignore case?

Thanks.

POSTED BY: Mike Besso
Posted 4 years ago

Hi Mike,

I am unable to reproduce your result. The HTML in your first post only has lowercase "input". I tried with a mixed case example and it gets converted to lowercase when parsed as XML.

input = "<InPUt type=\"radio\" id=\"field_8\" name=\"choices\" value=\"9\">";
ImportString[input, {"HTML", "XMLObject"}]

XMLObject[
  "Document"][{XMLObject["Declaration"]["Version" -> "1.0", "Standalone" -> "yes"]}, 
 XMLElement[
  "html", {{"http://www.w3.org/2000/xmlns/", "xmlns"} -> 
    "http://www.w3.org/1999/xhtml"}, {XMLElement[
    "body", {}, {XMLElement[
      "form", {"enctype" -> "application/x-www-form-urlencoded", 
       "method" -> "get"}, {XMLElement[
        "input", {"type" -> "radio", "id" -> "field_8", 
         "name" -> "choices", "value" -> "9"}, {}]}]}]}], {}]
POSTED BY: Rohit Namjoshi
Posted 4 years ago

Rohit:

I did my test backwards. That is, I went looking for an uppercase tag and missed that the XML parser converts tags to lowercase.

Using XML definitely simplifies the solution.

Using the following test data:

html = "
  <html>
  <input type=\"radio\" id=\"field_1\" name=\"choices\" value=\"1\">
  <label for=\"field_1\">Field 1</label>
  <input type=\"radio\" id=\"field_2\" name=\"choices\" value=\"2\">
  <label 
  for=\"field_2\">Field 2</label>
  <input type=\"radio\" id=\"field_3\" name=\"choices\" value=\"3\">
  <label for
  =\"field_3\">Field 3</label>

  <input type=\"radio\" id=\"field_4\" name=\"choices\" value=\"4\">
  <label for='field_4'>Field 4</label>
  <input type=\"radio\" id=\"field_5\" name=\"choices\" value=\"5\">
  <label 
  for='field_5'>Field 5</label>
  <input type=\"radio\" id=\"field_6\" name=\"choices\" value=\"6\">
  <lAbel for
  ='field_6'>Field 6</label>

  <input type=\"radio\" id=\"field_7\" name=\"choices\" value=\"7\">
  <label for=field_7>Field 7</label>
  <input type=\"radio\" id=\"field_8\" name=\"choices\" value=\"8\">
  <label 
  for=field_8>Field 8</label>
  <input type=\"radio\" id=\"field_8\" name=\"choices\" value=\"9\">
  <label for
  =field_9>Field 9</label>


  <input type=\"radio\" id=\"field_NoMatch\" name=\"choices\" value=\"0\">
  <label for=\"No_Match_1\">No Match 1</label>


  </html>
  ";

My solution now looks like:

xml = ImportString[html, {"HTML", "XMLObject"}];

matchXMLObjectQ = Head[#] === XMLObject["Document"] &;
matchStringPatternQ = MatchQ[#, _String | _StringExpression] &;

htmlSelectElementByAttributeValue[xml_?matchXMLObjectQ, 
   element_String, attribute_String, 
   valuePattern_?matchStringPatternQ] := Module[
   {},
   Select[
    Map[
     Append[
       Association[#[[1]]],
       "inner" -> #[[2]]
       ] &,
     Cases[xml, XMLElement[element, l___] :> l, Infinity] // 
      Partition[#, 2] &
     ],
    StringMatchQ[#[attribute] , valuePattern] &
    ]

   ];

htmlSelectElementByAttributeValue[xml, "label", "for",  
 "field_" ~~ DigitCharacter]

I like this solution much better, though I will run some timing benchmarks to see whether there is a performance penalty for bringing in the XML parser. Since we only need to parse the HTML once, I think the performance will be good enough.
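
A rough way to compare the two approaches is AbsoluteTiming on a synthetic page. This is only a sketch: the sample string, the variable names, and the simplified StringCases pattern are all assumptions for illustration, and real pages will time differently:

```wolfram
(* Synthetic page: 200 copies of a radio input followed by its label *)
sample = StringRepeat[
   "<input type=\"radio\" id=\"f\"><label for=\"f\">Field</label>\n", 200];

(* Time the XML route: parse once, then collect the label text *)
{parseSeconds, xmlLabels} = AbsoluteTiming[
   Cases[
    ImportString[sample, {"HTML", "XMLObject"}],
    XMLElement["label", _, {t_String}] :> t,
    Infinity
    ]
   ];

(* Time the string route with a simplified label pattern *)
{regexSeconds, regexLabels} = AbsoluteTiming[
   StringCases[
    sample,
    "<label" ~~ Shortest[___] ~~ ">" ~~ t : Except["<"] .. :> t,
    IgnoreCase -> True
    ]
   ];

(* Summary of both timings and whether the two routes agree *)
<|"xml" -> parseSeconds, "stringCases" -> regexSeconds,
 "sameResult" -> (xmlLabels === regexLabels)|>
```

A single AbsoluteTiming run is noisy, so repeating the measurement a few times, or on a batch of real files, gives a fairer picture.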

Thank you Rohit for this insight.

I'm still very interested in hearing how others are tackling challenges like this.

THANKS

POSTED BY: Mike Besso
Posted 4 years ago

Nice!

Personally, I find it hard to understand deeply nested expressions. My brain gets lost in parenthesis hell, even when the code is nicely indented. Where possible, I try to use a postfix style that reads naturally from left to right. It is also easier to debug: if something is not working right, it is easy to add a // Echo[#]& anywhere in the pipeline. Here is a postfix implementation of your code, in which I also simplified the inner Association and Append.

postfixVersion[xml_?matchXMLObjectQ, element_String, 
  attribute_String, valuePattern_?matchStringPatternQ] :=
 Cases[xml, XMLElement[element, l___] :> l, Infinity] //
    Partition[#, 2] & //
    Map[<|First@#, "inner" -> Last@#|> &, #] & //
    Select[#, StringMatchQ[#[attribute], valuePattern] &] &
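
To illustrate the debugging point, here is a small sketch of a pipeline with Echo dropped in between two steps. The rows list and the labels variable are made-up stand-ins for the associations produced above:

```wolfram
(* Made-up rows in the same shape as the associations built above *)
rows = {
   <|"for" -> "field_1", "inner" -> {"Field 1"}|>,
   <|"for" -> "other", "inner" -> {"Other"}|>
   };

labels = rows //
    Select[#, StringMatchQ[#["for"], "field_" ~~ DigitCharacter] &] & //
   Echo //  (* prints the intermediate list and passes it along unchanged *)
  Map[First[#["inner"]] &, #] &;

labels  (* {"Field 1"} *)
```

Because Echo returns its argument, it can be removed again without touching the rest of the pipeline.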
POSTED BY: Rohit Namjoshi
Posted 4 years ago

Rohit:

Thank you again for your constructive feedback, help and kind words. They are very much appreciated. I am refactoring the code on my current project. Using what you have taught me, I will be replacing 50+ lines of code with fewer than 10 simpler lines that will be easier to maintain.

I am still struggling with the Wolfram way of doing things, especially the postfix style. I think I am mainly challenged by my typical day, which has me coding in several if not all of the following languages:

  • English (both in long form and in short texts)
  • T-SQL
  • R (including the domain specific languages created by some very creative people)
  • PowerShell
  • JavaScript
  • HTML
  • DAX
  • Excel (formulas)
  • VBA
  • Wolfram
  • Tableau (formulas)
  • Emoji (unfortunately)
  • Batch (*.bat)
  • Bash

And yes, I do consider some of my everyday writing in English and Emoji to be coding. The challenge is that I'm trying to "program" others to do what I want them to do. Let's just say I am more successful with the computer languages.

And being 56, and having learned way too many other languages (C, C#, VB.Net, Prolog, Lisp, Java, BASIC, PowerBuilder, assembler, PL1, FORTRAN and Latin), it might be a while before I'm comfortable with postfix.

But I am trying. I do see the elegance of the pipeline approach of postfix. Thank you for your patience.

Have a great and safe rest of your weekend.

POSTED BY: Mike Besso

You have earned the Featured Contributor Badge. Your exceptional post has been selected for our editorial column Staff Picks http://wolfr.am/StaffPicks and your profile is now distinguished by a Featured Contributor Badge and is displayed on the Featured Contributor Board. Thank you!

POSTED BY: EDITORIAL BOARD