Scraping HTML Using StringCases

Posted 4 years ago
POSTED BY: Mike Besso
9 Replies

You have earned a Featured Contributor Badge. Your exceptional post has been selected for our editorial column Staff Picks http://wolfr.am/StaffPicks and your profile is now distinguished by a Featured Contributor Badge and is displayed on the Featured Contributor Board. Thank you!

POSTED BY: EDITORIAL BOARD
Posted 4 years ago

Rohit:

Thank you again for your constructive feedback, help and kind words. They are very much appreciated. I am refactoring the code on my current project. Using what you have taught me, I will be replacing 50+ lines of code with fewer than 10 simpler lines that will be easier to maintain.

I am still struggling with the Wolfram way of doing things, especially the postfix style. I think I am mainly challenged by my typical day, which has me coding in several if not all of the following languages:

  • English (both in long form and in short texts)
  • T-SQL
  • R (including the domain specific languages created by some very creative people)
  • PowerShell
  • JavaScript
  • HTML
  • DAX
  • Excel (formulas)
  • VBA
  • Wolfram
  • Tableau (formulas)
  • Emoji (unfortunately)
  • Batch (*.bat)
  • Bash

And yes, I do consider some of my everyday writing in English and Emoji to be coding. The challenge is that I'm trying to "program" others to do what I want them to do. Let's just say I am more successful with the computer languages.

And being 56, and having learned way too many other languages (C, C#, VB.Net, Prolog, Lisp, Java, BASIC, PowerBuilder, assembler, PL1, FORTRAN and Latin), it might be a while before I'm comfortable with postfix.

But I am trying. I do see the elegance of the pipeline approach of postfix. Thank you for your patience.

Have a great and safe rest of your weekend.

POSTED BY: Mike Besso
Posted 4 years ago

Rohit:

I did my test backwards. That is, I went looking for an uppercase tag and missed that the XML parser was converting tags to lowercase.

Using XML definitely simplifies the solution.

Using the following test data:

html = "
  <html>
  <input type=\"radio\" id=\"field_1\" name=\"choices\" value=\"1\">
  <label for=\"field_1\">Field 1</label>
  <input type=\"radio\" id=\"field_2\" name=\"choices\" value=\"2\">
  <label 
  for=\"field_2\">Field 2</label>
  <input type=\"radio\" id=\"field_3\" name=\"choices\" value=\"3\">
  <label for
  =\"field_3\">Field 3</label>

  <input type=\"radio\" id=\"field_4\" name=\"choices\" value=\"4\">
  <label for='field_4'>Field 4</label>
  <input type=\"radio\" id=\"field_5\" name=\"choices\" value=\"5\">
  <label 
  for='field_5'>Field 5</label>
  <input type=\"radio\" id=\"field_6\" name=\"choices\" value=\"6\">
  <lAbel for
  ='field_6'>Field 6</label>

  <input type=\"radio\" id=\"field_7\" name=\"choices\" value=\"7\">
  <label for=field_7>Field 7</label>
  <input type=\"radio\" id=\"field_8\" name=\"choices\" value=\"8\">
  <label 
  for=field_8>Field 8</label>
  <input type=\"radio\" id=\"field_8\" name=\"choices\" value=\"9\">
  <label for
  =field_9>Field 9</label>


  <input type=\"radio\" id=\"field_NoMatch\" name=\"choices\" \
value=\"0\">
  <label for=\"No_Match_1\">No Match 1</label>


  </html>
  ";

My solution now looks like:

xml = ImportString[html, {"HTML", "XMLObject"}];

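(* SameQ (===) always yields True or False; Equal (==) can stay unevaluated for non-matching heads. *)
matchXMLObjectQ = Head[#] === XMLObject["Document"] &;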
matchStringPatternQ = MatchQ[#, _String | _StringExpression] &;

htmlSelectElementByAttributeValue[xml_?matchXMLObjectQ, 
   element_String, attribute_String, 
   valuePattern_?matchStringPatternQ] := Module[
   {},
   Select[
    Map[
     Append[
       Association[#[[1]]],
       "inner" -> #[[2]]
       ] &,
     Cases[xml, XMLElement[element, l___] :> l, Infinity] // 
      Partition[#, 2] &
     ],
    StringMatchQ[#[attribute] , valuePattern] &
    ]

   ];

htmlSelectElementByAttributeValue[xml, "label", "for",  
 "field_" ~~ DigitCharacter]

I like this solution much better, though I will run some timing benchmarks to see whether bringing in the XML parser carries a performance penalty. Since we only need to parse the HTML once, I expect the performance to be good enough.
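A minimal sketch of how such a benchmark might look, assuming a small stand-in string (the name sample is hypothetical; the real page source would go in its place). It times the one-off parse separately from the query against the already-parsed object, using AbsoluteTiming, which returns the elapsed seconds together with the result.

```wolfram
(* Hypothetical small sample; substitute the real page source. *)
sample = "<html><body><label for=\"field_1\">Field 1</label><label for='field_2'>Field 2</label></body></html>";

(* Time the parse, which only has to happen once per page. *)
{parseTime, parsed} = AbsoluteTiming[ImportString[sample, {"HTML", "XMLObject"}]];

(* Time the query against the already-parsed XMLObject. *)
{queryTime, labels} = AbsoluteTiming[Cases[parsed, XMLElement["label", l___] :> l, Infinity]];

(* Elapsed seconds for each step. *)
{parseTime, queryTime}
```

For very short inputs a single measurement is noisy; RepeatedTiming averages over several runs and may give a steadier number.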

Thank you Rohit for this insight.

I'm still very interested in hearing how others are tackling challenges like this.

THANKS

POSTED BY: Mike Besso
Posted 4 years ago

Nice!

Personally, I find it hard to understand deeply nested expressions. My brain gets lost in parenthesis hell, even when the code is nicely indented. Where possible, I try to use a postfix style that reads naturally from left to right. It is also easier to debug: if something is not working right, it is easy to add a // Echo[#]& anywhere in the pipeline. Here is a postfix implementation of your code, which also simplifies the inner Association and Append.

postfixVersion[xml_?matchXMLObjectQ, element_String, 
  attribute_String, valuePattern_?matchStringPatternQ] :=
 Cases[xml, XMLElement[element, l___] :> l, Infinity] //
    Partition[#, 2] & //
    Map[<|First@#, "inner" -> Last@#|> &, #] & //
    Select[#, StringMatchQ[#[attribute], valuePattern] &] &
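
To illustrate the Echo debugging technique, here is a small sketch on stand-in data. The list flat is hypothetical; it only mimics the shape Cases returns (an attribute list followed by a child list for each element). Echo prints its argument and passes it along unchanged, so it can sit between any two stages.

```wolfram
(* Hypothetical stand-in shaped like the Cases output: attributes, then children. *)
flat = {{"for" -> "field_1"}, {"Field 1"}, {"for" -> "other"}, {"Other"}};

flat //
  Partition[#, 2] & //
  Echo //  (* prints the {attributes, children} pairs, then passes them on *)
  Map[<|First@#, "inner" -> Last@#|> &, #] & //
  Select[#, StringMatchQ[#["for"], "field_" ~~ DigitCharacter] &] &
```

Once the pipeline behaves, the Echo stage can be deleted without touching the other steps.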
POSTED BY: Rohit Namjoshi
Posted 4 years ago

Hi Mike,

I am unable to reproduce your result. The HTML in your first post only has lowercase "input". I tried with a mixed case example and it gets converted to lowercase when parsed as XML.

input = "<InPUt type=\"radio\" id=\"field_8\" name=\"choices\" value=\"9\">";
ImportString[input, {"HTML", "XMLObject"}]

XMLObject[
  "Document"][{XMLObject["Declaration"]["Version" -> "1.0", "Standalone" -> "yes"]}, 
 XMLElement[
  "html", {{"http://www.w3.org/2000/xmlns/", "xmlns"} -> 
    "http://www.w3.org/1999/xhtml"}, {XMLElement[
    "body", {}, {XMLElement[
      "form", {"enctype" -> "application/x-www-form-urlencoded", 
       "method" -> "get"}, {XMLElement[
        "input", {"type" -> "radio", "id" -> "field_8", 
         "name" -> "choices", "value" -> "9"}, {}]}]}]}], {}]
POSTED BY: Rohit Namjoshi
Posted 4 years ago

Just to follow up, Wolfram does handle the ill-formed XML.

Using the test data above,

Cases[xml, XMLElement["input", l___] :> l, Infinity]

Returns:

{{"type" -> "radio", "id" -> "field_1", "name" -> "choices", 
  "value" -> "1"}, {}, {"type" -> "radio", "id" -> "field_2", 
  "name" -> "choices", "value" -> "2"}, {}, {"type" -> "radio", 
  "id" -> "field_3", "name" -> "choices", 
  "value" -> "3"}, {}, {"type" -> "radio", "id" -> "field_4", 
  "name" -> "choices", "value" -> "4"}, {}, {"type" -> "radio", 
  "id" -> "field_5", "name" -> "choices", 
  "value" -> "5"}, {}, {"type" -> "radio", "id" -> "field_6", 
  "name" -> "choices", "value" -> "6"}, {}}

However, the match is still case-sensitive, whereas HTML is not. StringCases accepts the IgnoreCase option. Can we tell Cases and XMLElement to ignore case?
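
Cases itself has no IgnoreCase option, but a condition on the tag name is one possible way to get the same effect. The sketch below is only an illustration: the names fragment, doc, tag, attrs and children are made up for the example, and it assumes the tag name arrives as a plain string.

```wolfram
(* Hypothetical fragment with a mixed-case tag. *)
fragment = "<html><body><LaBeL for=\"field_1\">Field 1</LaBeL></body></html>";
doc = ImportString[fragment, {"HTML", "XMLObject"}];

(* Compare the element name with StringMatchQ, ignoring case. *)
Cases[doc,
 XMLElement[tag_String /; StringMatchQ[tag, "label", IgnoreCase -> True],
   attrs_, children_] :> {attrs, children},
 Infinity]
```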

Thanks.

POSTED BY: Mike Besso
Posted 4 years ago

Rohit:

Thank you for that suggestion. I did not go that way because the HTML in the source files does not follow the XHTML standard.

For example, my test case HTML is ill-formed in that the input tags are not closed. That said, it seems that Wolfram can handle it.

I will check it out.

THANKS

POSTED BY: Mike Besso
Posted 4 years ago

Hi Mike,

It might be easier to manipulate by parsing the HTML as XML, e.g.

xml = ImportString[html, {"HTML", "XMLObject"}];
Cases[xml, XMLElement["label", l___] :> l, Infinity]

(*
{{"for" -> "field_1"}, {"Field 1"}, {"for" -> "field_2"}, {"Field 2"}, {"for" -> "field_3"}, {"Field 3"}, 
 {"for" -> "field_4"}, {"Field 4"}, {"for" -> "field_5"}, {"Field 5"}, {"for" -> "field_6"}, {"Field 6"}, 
 {"for" -> "field_7"}, {"Field 7"}, {"for" -> "field_8"}, {"Field 8"}, {"for" -> "field_9"}, {"Field 9"}}
*)
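
The flat result above alternates attribute lists and child lists. As a variation, offered only as a sketch, naming the two parts in the pattern keeps each label's attributes and contents together. The names sample, doc, attrs and children are illustrative, and the short sample string stands in for the html variable.

```wolfram
(* Hypothetical sample; the thread's html string would work the same way. *)
sample = "<html><body><label for=\"field_1\">Field 1</label><label for='field_2'>Field 2</label></body></html>";
doc = ImportString[sample, {"HTML", "XMLObject"}];

(* One association per label: its attributes plus its child nodes under "inner". *)
Cases[doc,
 XMLElement["label", attrs_, children_] :> <|attrs, "inner" -> children|>,
 Infinity]
```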
POSTED BY: Rohit Namjoshi