Intro
I am helping a friend gather the same data points from a large number of web pages that share the same structure but use inconsistent HTML syntax. I thought the Wolfram Language's pattern-matching capabilities would make this easy, and they have, to a certain extent.
I have a solution now, but I would like to know if there is a better way. I would appreciate any constructive criticism you might have.
Challenge
Pull a handful of field labels from over a thousand HTML pages. For testing purposes, here are a few of the variations that appear in the files:
html = "
<input type=\"radio\" id=\"field_1\" name=\"choices\" value=\"1\">
<label for=\"field_1\">Field 1</label>
<input type=\"radio\" id=\"field_2\" name=\"choices\" value=\"2\">
<label
for=\"field_2\">Field 2</label>
<input type=\"radio\" id=\"field_3\" name=\"choices\" value=\"3\">
<label for
=\"field_3\">Field 3</label>
<input type=\"radio\" id=\"field_4\" name=\"choices\" value=\"4\">
<label for='field_4'>Field 4</label>
<input type=\"radio\" id=\"field_5\" name=\"choices\" value=\"5\">
<label
for='field_5'>Field 5</label>
<input type=\"radio\" id=\"field_6\" name=\"choices\" value=\"6\">
<label for
='field_6'>Field 6</label>
<input type=\"radio\" id=\"field_7\" name=\"choices\" value=\"7\">
<label for=field_7>Field 7</label>
<input type=\"radio\" id=\"field_8\" name=\"choices\" value=\"8\">
<label
for=field_8>Field 8</label>
<input type=\"radio\" id=\"field_9\" name=\"choices\" value=\"9\">
<label for
=field_9>Field 9</label>
";
Solution
Our first challenge is quoting: an attribute's name (key) or value might be single-quoted, double-quoted, or not quoted at all. (In the sample above, only the values vary this way.) To handle this, let's create a pattern for HTML quote characters:
htmlQuoteCharacterPattern = "\"" | "'";
Next, we know that an attribute name is always followed by whitespace, an equals sign, or a closing >. So we need another pattern, this time a lookahead regular expression:
htmlEndOfAttributeKeyPattern = RegularExpression["(?=(>|\\s|=))"];
We also know that attribute values end with either whitespace or a >. Again, we use a lookahead regular expression:
htmlEndOfAttributeValuePattern = RegularExpression["(?=(>|\\s))"];
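To illustrate what the lookahead contributes, here is a minimal sketch. The test string and the name valueChars are made up for this illustration and are not part of the solution. The lookahead checks the next character without consuming it, so the closing > is still available to whatever pattern element follows.

```wolfram
(* Illustration only: valueChars is a hypothetical helper, not part of the solution. *)
valueChars = RegularExpression["[^>\\s]+"];

(* The lookahead confirms that ">" or whitespace follows the value but leaves it
   unconsumed, so the literal ">" after it can still match. *)
StringCases[
  "<label for=field_7>Field 7</label>",
  "for=" ~~ v : valueChars ~~ htmlEndOfAttributeValuePattern ~~ ">" :> v
]
(* {"field_7"} *)
```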
Now we have almost everything we need to create patterns for attribute keys. To future-proof the solution, I want the attribute key pattern to accept either a string or a string pattern for the key name:
htmlQuotedAttributeKeyPattern[pattern_?matchStringPatternQ] := StringExpression[
WhitespaceCharacter ...,
htmlQuoteCharacterPattern,
pattern,
htmlQuoteCharacterPattern,
htmlEndOfAttributeKeyPattern
];
htmlUnquotedAttributeKeyPattern[pattern_?matchStringPatternQ] := StringExpression[
WhitespaceCharacter ...,
pattern,
htmlEndOfAttributeKeyPattern
];
htmlAttributeKeyPattern[pattern_?matchStringPatternQ] :=
(htmlQuotedAttributeKeyPattern[pattern] | htmlUnquotedAttributeKeyPattern[pattern]);
As you can see, these functions accept either a String or a StringExpression. The predicate that enforces this, matchStringPatternQ, has to be defined before any of them is called, so let's create it:
matchStringPatternQ = MatchQ[#, _String | _StringExpression] &;
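The tests in the next section call htmlAttributeKeyValuePattern, which is not defined anywhere above. What follows is only a sketch of what such a definition might look like, built from the pieces already shown. The function name comes from the test calls; the body is an assumption. It allows optional whitespace around the equals sign and accepts a double-quoted, single-quoted, or bare value.

```wolfram
(* Hypothetical definition: the name matches the calls in the tests,
   but the body is an assumption built from the patterns defined above. *)
htmlAttributeKeyValuePattern[key_?matchStringPatternQ, value_?matchStringPatternQ] :=
  StringExpression[
    htmlAttributeKeyPattern[key],
    WhitespaceCharacter ...,
    "=",
    WhitespaceCharacter ...,
    (* The value may be double-quoted, single-quoted, or bare.
       Spelling out each case avoids accepting mismatched quotes. *)
    ("\"" ~~ value ~~ "\"") | ("'" ~~ value ~~ "'") | value,
    htmlEndOfAttributeValuePattern
  ];
```

One design note: htmlAttributeKeyPattern starts with WhitespaceCharacter ..., which also matches zero whitespace characters, so a string such as "<labelfor=..." would be accepted too. If that matters for the real pages, requiring at least one whitespace character after the tag name would tighten the match.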
Testing the Solution
We can now test our solution against the data above:
StringCases[
html,
"<label" ~~ htmlAttributeKeyValuePattern["for", "field_" ~~ DigitCharacter] ~~ ">" ~~ s : Except["<"] .. :> s
]
returns, as expected:
{"Field 1", "Field 2", "Field 3", "Field 4", "Field 5", "Field 6", "Field 7", "Field 8", "Field 9"}
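One caveat about the value pattern used here: DigitCharacter matches exactly one digit, and the lookahead at the end of the value then requires whitespace or >. The sample only goes up to field_9, but if the real pages contain ids with more digits (a hypothetical field_10, say), those labels would be skipped. A variant that assumes ids can have any number of digits might look like this:

```wolfram
(* Assumption: ids may have more than one digit, e.g. a hypothetical field_10. *)
StringCases[
  html,
  "<label" ~~ htmlAttributeKeyValuePattern["for", "field_" ~~ DigitCharacter ..] ~~ ">" ~~ s : Except["<"] .. :> s
]
```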
And,
StringCases[
html,
"<label" ~~ htmlAttributeKeyValuePattern["for", "field_5"] ~~ ">" ~~ s : Except["<"] .. :> s
]
returns, as expected:
{"Field 5"}
Conclusion
I am mostly happy with the above solution, but I know I will need to continue to extend it. While I will not have to handle every type of HTML, I do need a robust way to surgically extract interesting pieces of information.
Before I spend too much additional time extending this solution, I would like to know if I'm on a good track.
I would also like to know whether this is something others have to do often. Would it be worth polishing these functions to the point that they can be published in the Resource Function Repository?
Thanks for reading. Feedback on this approach will be greatly appreciated.
Have a great and safe week.