Scraping HTML Using StringCases

Posted 5 years ago
POSTED BY: Mike Besso
9 Replies

You have earned the Featured Contributor Badge. Your exceptional post has been selected for our editorial column Staff Picks (http://wolfr.am/StaffPicks), and your profile is now distinguished by a Featured Contributor Badge and displayed on the Featured Contributor Board. Thank you!

POSTED BY: EDITORIAL BOARD
Posted 5 years ago
POSTED BY: Mike Besso
Posted 5 years ago
POSTED BY: Mike Besso
Posted 5 years ago
POSTED BY: Rohit Namjoshi
Posted 5 years ago
POSTED BY: Mike Besso
Posted 5 years ago

Hi Mike,

I am unable to reproduce your result. The HTML in your first post only has lowercase "input". I tried with a mixed case example and it gets converted to lowercase when parsed as XML.

input = "<InPUt type=\"radio\" id=\"field_8\" name=\"choices\" value=\"9\">";
ImportString[input, {"HTML", "XMLObject"}]

XMLObject[
  "Document"][{XMLObject["Declaration"]["Version" -> "1.0", "Standalone" -> "yes"]}, 
 XMLElement[
  "html", {{"http://www.w3.org/2000/xmlns/", "xmlns"} -> 
    "http://www.w3.org/1999/xhtml"}, {XMLElement[
    "body", {}, {XMLElement[
      "form", {"enctype" -> "application/x-www-form-urlencoded", 
       "method" -> "get"}, {XMLElement[
        "input", {"type" -> "radio", "id" -> "field_8", 
         "name" -> "choices", "value" -> "9"}, {}]}]}]}], {}]
POSTED BY: Rohit Namjoshi
Posted 5 years ago
POSTED BY: Mike Besso
Posted 5 years ago

Rohit:

Thank you for that suggestion. I did not go that way because the HTML in the source files does not follow the XHTML standard.

For example, my test case HTML is ill-formed in that the input tags are not closed. That said, it seems that Wolfram can handle it.

I will check it out.

THANKS

POSTED BY: Mike Besso
Posted 5 years ago

Hi Mike,

It might be easier to manipulate the data by parsing the HTML as XML, e.g.

xml = ImportString[html, {"HTML", "XMLObject"}];
Cases[xml, XMLElement["label", l___] :> l, Infinity]

(*
{{"for" -> "field_1"}, {"Field 1"}, {"for" -> "field_2"}, {"Field 2"}, {"for" -> "field_3"}, {"Field 3"}, 
 {"for" -> "field_4"}, {"Field 4"}, {"for" -> "field_5"}, {"Field 5"}, {"for" -> "field_6"}, {"Field 6"}, 
 {"for" -> "field_7"}, {"Field 7"}, {"for" -> "field_8"}, {"Field 8"}, {"for" -> "field_9"}, {"Field 9"}}
*)
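
If only the id-to-text pairing is needed, a tighter pattern on the same XMLElement structure might help. This is a minimal sketch, not tested against your actual files: `sampleHtml` and `labels` are hypothetical names, and the HTML string is a made-up stand-in modeled on the fields in the output above.

```wolfram
(* Hypothetical stand-in for the real page; the input tags are deliberately left unclosed *)
sampleHtml = "<form><label for=\"field_1\">Field 1</label><input type=\"radio\" id=\"field_1\" name=\"choices\" value=\"1\"><label for=\"field_2\">Field 2</label><input type=\"radio\" id=\"field_2\" name=\"choices\" value=\"2\"></form>";

(* Parse the HTML into a symbolic XML tree, as in the snippet above *)
xml = ImportString[sampleHtml, {"HTML", "XMLObject"}];

(* Match label elements that carry a "for" attribute and a single text child,
   and return rules of the form id -> label text *)
labels = Cases[xml,
   XMLElement["label", {___, "for" -> id_, ___}, {text_String}] :> (id -> text),
   Infinity]
```

Because the pattern names the attribute and the text child explicitly, labels without a "for" attribute or with nested markup are skipped rather than returned as raw sequences.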
POSTED BY: Rohit Namjoshi