CSS Selectors 3 for Symbolic XML

Posted 1 month ago

Introduction

This is an implementation of CSS Selectors 3 for Wolfram Language SymbolicXML expressions. This work is motivated by the Stack Exchange post css-selectors-for-symbolic-xml. The full package, with the examples shown below, can be downloaded from the GitHub repo here. The standalone package is attached to this community post but may not remain as up to date as the repo.

The CSS Selectors 3 specification is followed as far as possible, but I make no claims to be absolutely conformant. For example, because a WL SymbolicXML expression is static, dynamic pseudo-classes (e.g. active/hover/focus) and pseudo-elements (e.g. before/after) never match anything.

Load Package

In[] := Needs["Selectors3`"]

Testing on HTML source

Being a little meta, let's test this against the W3C page for Selectors Level 3.

In[] := document = Import["https://www.w3.org/TR/selectors-3/", "XMLObject"];

Look for elements whose class attribute contains the letter 'h'.

In[] := Position[document, Selector["[class*=h]"]]
Out[] = {{2, 3, 2, 3, 2}, {2, 3, 2, 3, 484}, {2, 3, 2, 3, 488}, {2, 3, 2, 3, 2, 3, 11}}

In[] := Extract[document, %][[All, 1 ;; 2]] // Column
Out[] = {
 {XMLElement["div", {"class" -> "head"}]},
 {XMLElement["dl", {"class" -> "bibliography"}]},
 {XMLElement["dl", {"class" -> "bibliography"}]},
 {XMLElement["p", {"class" -> "copyright"}]}
}
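
For readers curious what an attribute-substring selector amounts to at the expression level, here is a minimal sketch that uses only built-in pattern matching. It is not how Selectors3 is implemented; the sample `xml` expression and the `classHasH` pattern are assumptions made up for illustration.

```wolfram
(* Hypothetical sample data, not taken from the W3C page *)
xml = XMLElement["body", {}, {
    XMLElement["div", {"class" -> "head"}, {"Title"}],
    XMLElement["p", {"class" -> "note"}, {"Text"}],
    XMLElement["p", {"class" -> "copyright"}, {"Notice"}]}];

(* Rough equivalent of [class*=h]: a class attribute whose value contains "h" *)
classHasH = XMLElement[_,
   {___, "class" -> (c_String /; StringContainsQ[c, "h"]), ___}, _];

positions = Position[xml, classHasH]
(* {{3, 1}, {3, 3}} *)

(* Keep only the tag and attribute list, as in the example above *)
Extract[xml, positions][[All, 1 ;; 2]]
(* {XMLElement["div", {"class" -> "head"}],
    XMLElement["p", {"class" -> "copyright"}]} *)
```

A hand-written pattern like this quickly becomes unwieldy once combinators or several attributes are involved, which is where a selector string is more convenient.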

Look for elements with the class 'no-num'.

In[] := Extract[document, Position[document, Selector[".no-num"]]] // Column
Out[] = {
 {XMLElement["h2", {"class" -> "no-num no-toc", "id" -> "abstract"}, {"Abstract"}]},
 {XMLElement["h2", {"class" -> "no-num no-toc", "id" -> "status"}, {"Status of this document"}]},
 {XMLElement["h2", {"class" -> "no-num no-toc", "id" -> "contents"}, {"Table of contents"}]},
 {XMLElement["h2", {"class" -> "no-num no-toc"}, {"W3C Recommendation 06 November 2018"}]}
}

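Check specificity of the selector

Specificity is reported as {a, b, c}: a counts ID selectors, b counts class, attribute and pseudo-class selectors, and c counts type selectors and pseudo-elements. The universal selector and the negation pseudo-class itself are not counted, but the argument of :not() is.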

In[] := Selector[document, "[class~=a] b > *:link"]["Specificity"]
Out[] = {0, 2, 1}

In[] := Selector[document, "[class~=a] b > :not(p)"]["Specificity"]
Out[] = {0, 1, 2}

In[] := Selector[document, "#welcome"]["Specificity"]
Out[] = {1, 0, 0}
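
To make the three outputs above easier to check by hand, here is a toy counter that reproduces them. It is only a sketch under simplifying assumptions: `toySpecificity` is a hypothetical helper, not part of Selectors3, and it ignores namespace prefixes (such as `m|mo`) and selector characters inside quoted attribute values.

```wolfram
(* Toy specificity counter for simple selectors; an illustrative
   assumption, not the Selectors3 implementation *)
toySpecificity[sel_String] := Module[{tokens},
  tokens = StringCases[
    (* :not( itself adds nothing; its argument is counted normally *)
    StringReplace[sel, ":not(" -> " "],
    {RegularExpression["#[\\w-]+"] -> "a",          (* ID *)
     RegularExpression["\\.[\\w-]+"] -> "b",        (* class *)
     RegularExpression["\\[[^\\]]*\\]"] -> "b",     (* attribute *)
     RegularExpression["::[\\w-]+"] -> "c",         (* pseudo-element *)
     RegularExpression[":[\\w-]+"] -> "b",          (* pseudo-class *)
     RegularExpression["[A-Za-z][\\w-]*"] -> "c"}]; (* type *)
  (* Tally the tokens; categories that never occur count as 0 *)
  Lookup[Counts[tokens], {"a", "b", "c"}, 0]]

toySpecificity["[class~=a] b > *:link"]   (* {0, 2, 1} *)
toySpecificity["[class~=a] b > :not(p)"]  (* {0, 1, 2} *)
toySpecificity["#welcome"]                (* {1, 0, 0} *)
```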

Testing on XML source

In[] := str = "<html xml:lang='zh'><head><title>Test</title></head><body \
        xmlns='http://www.w3.org/1999/xhtml'><p lang='en' class='red' \
        myid='unique'>Here is some math.</p><p><m:math \
        xmlns:m='http://www.w3.org/1998/Math/MathML'><m:mi \
        m:title='cat'>x</m:mi><m:mo>+</m:mo><m:mn>1</m:mn></m:math></p></body>\
        \n</html>";
     
In[] := obj = ImportString[str, "XML"];

Namespace

If the selector does not specify a namespace, then the namespace is ignored:

In[] := Selector[str, "mo"]
Out[] = <|"Specificity" -> {0, 0, 1}, "Elements" -> {{2, 3, 2, 3, 2, 3, 1, 3, 2}}|>

If a namespace is given in the selector, then you need to provide the prefix's expansion rule. Otherwise the selector won't match any element.

In[] := Selector[str, "m|mo"]
Out[] = <|"Specificity" -> {0, 0, 1}, "Elements" -> {}|>

In[] := Selector[str, "m|mo", "Namespaces" -> {"m" -> "http://www.w3.org/1998/Math/MathML"}]
Out[] = <|"Specificity" -> {0, 0, 1}, "Elements" -> {{2, 3, 2, 3, 2, 3, 1, 3, 2}}|>

ID

XML documents can define their own ID attribute. Use the "ID" option to indicate which attribute name is in use. This is equivalent to using the attribute selector but with higher specificity.

In[] := Selector[str, "#unique", "ID" -> "myid"]
Out[] = <|"Specificity" -> {1, 0, 0}, "Elements" -> {{2, 3, 2, 3, 1}}|>

In[] := Selector[str, "[myid=unique]"]
Out[] = <|"Specificity" -> {0, 1, 0}, "Elements" -> {{2, 3, 2, 3, 1}}|>

Case sensitivity

XML is case-sensitive, but the Selectors3 package is not by default. Set the "CaseInsensitive" option to False to enforce case sensitivity.

In[] := Selector[str, "[myID=unique]", "CaseInsensitive" -> True]
Out[] = <|"Specificity" -> {0, 1, 0}, "Elements" -> {{2, 3, 2, 3, 1}}|>

In[] := Selector[str, "[myID=unique]", "CaseInsensitive" -> False]
Out[] = <|"Specificity" -> {0, 1, 0}, "Elements" -> {}|>

You can specify the case sensitivity separately for the attribute name and the attribute value.

In[] := Selector[str, "[myID=Unique]", "CaseInsensitive" -> {"AttributeName" -> True, "AttributeValue" -> False}]
Out[] = <|"Specificity" -> {0, 1, 0}, "Elements" -> {}|>

In[] := Selector[str, "[myID=Unique]", "CaseInsensitive" -> {"AttributeName" -> False, "AttributeValue" -> True}]
Out[] = <|"Specificity" -> {0, 1, 0}, "Elements" -> {}|>

In[] := Selector[str, "[myID=Unique]", "CaseInsensitive" -> {"AttributeName" -> True, "AttributeValue" -> True}]
Out[] = <|"Specificity" -> {0, 1, 0}, "Elements" -> {{2, 3, 2, 3, 1}}|>
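
The three results above follow from comparing the attribute name and the attribute value independently. A minimal sketch of that logic with built-ins only, assuming a hypothetical helper `attrMatchQ` (not a Selectors3 function) and plain string attribute names:

```wolfram
(* Attributes of the first <p> element in the test string above *)
attrs = {"lang" -> "en", "class" -> "red", "myid" -> "unique"};

(* Hypothetical helper: compare name and value with independent case folding *)
attrMatchQ[list_List, name_String, value_String, ignoreName_, ignoreValue_] :=
  AnyTrue[list,
   StringMatchQ[First[#], name, IgnoreCase -> ignoreName] &&
     StringMatchQ[Last[#], value, IgnoreCase -> ignoreValue] &]

attrMatchQ[attrs, "myID", "Unique", True, False]  (* False: value differs in case *)
attrMatchQ[attrs, "myID", "Unique", False, True]  (* False: name differs in case *)
attrMatchQ[attrs, "myID", "Unique", True, True]   (* True *)
```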

You can also specify the case sensitivity for the element type.

In[] := Selector[str, "P", "CaseInsensitive" -> {"Type" -> True}]
Out[] = <|"Specificity" -> {0, 0, 1}, "Elements" -> {{2, 3, 2, 3, 1}, {2, 3, 2, 3, 2}}|>

In[] := Selector[str, "P", "CaseInsensitive" -> {"Type" -> False}]
Out[] = <|"Specificity" -> {0, 0, 1}, "Elements" -> {}|>
3 Replies
Posted 1 month ago

I like the package. Clean implementation too.

On a related note here are two other CSS selector methods, one using a real XML processing library in Java rather than any WL hacks: https://mathematica.stackexchange.com/a/183970/38205

This will be as robust as JLink is (i.e. very robust). It is object-oriented and so much, much nicer to work with than the standard Mathematica XML headaches.

And another one I wrote up that uses pure Graph methods to implement selectors in terms of a DFS: https://mathematica.stackexchange.com/a/184417/38205

This one still performs quite well, though, and for very complex queries it is conceivably cleaner than a pure Cases/Position method. It also references a proper Graph, so it can be object-oriented as well, and attribute and property lookup from nodes is nearly instantaneous.

Very useful links, thanks. In my earlier implementation I had initially created a graph as well by first tagging all XMLElements with a unique ID:

parentXML //. XMLElement[el_, at_, chld_] :> XMLElement[el, at, chld, CreateUUID["XML-"]];

and then extracting the parent-child relations into a graph. I had kept the XMLElement nesting intact instead of making a flatter association of properties, though now I wonder why I was so keen on holding on to it...

But then I limited the scope of this project. I didn't need a full traversal of the DOM. At worst I needed to look upward towards the root, or look one generation higher to calculate sibling elements. So things like the "nth-child" selector were cumbersome but do-able. And Position + Extract seems fast enough for what I was aiming for.
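
As a rough illustration of why positions are enough for this, here is a sketch of the position arithmetic for looking upward and for counting siblings. It is not the package's code; the sample tree and the helper names `ancestorPositions` and `elementIndex` are made up for this example.

```wolfram
(* Hypothetical sample tree *)
xml = XMLElement["html", {}, {
    XMLElement["head", {}, {XMLElement["title", {}, {"Test"}]}],
    XMLElement["body", {}, {
      "stray text",
      XMLElement["p", {}, {"one"}],
      XMLElement["p", {}, {"two"}]}]}];

(* Children live in part 3 of an XMLElement, so the parent of a position
   is found by dropping its last two indices *)
ancestorPositions[pos_List] :=
  Table[Drop[pos, -2 k], {k, Quotient[Length[pos], 2]}]

(* 1-based index among element siblings, skipping text nodes *)
elementIndex[expr_, pos_List] :=
  Count[Take[Extract[expr, Most[pos]], Last[pos]], _XMLElement]

ancestorPositions[{3, 2, 3, 3}]
(* {{3, 2}, {}} *)

elementIndex[xml, {3, 2, 3, 3}]
(* 2 *)
```

With an imported document the root element sits at position {2} inside the XMLObject wrapper, so positions have odd length and the same `Quotient` bound stops at the root element rather than at the wrapper.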

One thing that I couldn't see offhand in the sources you shared is how they deal with namespaces. Symbolic XML doesn't always hang on to the namespaces unless you specifically set "IncludeNamespaces" -> True during XML import, which I found a little frustrating. Instead of trying to work around that, I put the onus on the user to include this option. Otherwise I simply fail to find a match: if the namespace information didn't make its way into the symbolic XML expression on import, then there's nothing that I can look for.

Congratulations! This post is now a Staff Pick as distinguished by a badge on your profile! Thank you, keep it coming, and consider contributing your work to The Notebook Archive!
