CSS Selectors 3 for Symbolic XML

Posted 1 year ago | 3674 Views | 10 Replies | 14 Total Likes

Introduction

This is an implementation of CSS Selectors 3 for Wolfram Language SymbolicXML expressions. The work is motivated by the Stack Exchange post css-selectors-for-symbolic-xml. The full package, with the examples shown below, can be downloaded from the GitHub repo here. The standalone package is attached to this community post but may not remain as up to date as the repo.

The CSS Selectors 3 specification is followed as far as possible, but I make no claims of full conformance. For example, because WL SymbolicXML is a static expression, selectors that depend on dynamic state or generated content, such as dynamic pseudo-classes (e.g. active/hover/focus) and pseudo-elements (e.g. before/after), never match.

Load Package

In[] := Needs["Selectors3`"]

Testing on HTML source

Being a little meta, let's test this against the W3C page for Selectors Level 3.

In[] := document = Import["https://www.w3.org/TR/selectors-3/", "XMLObject"];

Look for elements whose class attribute contains the letter 'h'.

In[] := Position[document, Selector["[class*=h]"]]
Out[] = {{2, 3, 2, 3, 2}, {2, 3, 2, 3, 484}, {2, 3, 2, 3, 488}, {2, 3, 2, 3, 2, 3, 11}}

In[] := Extract[document, %][[All, 1 ;; 2]] // Column
Out[] = {
 {XMLElement["div", {"class" -> "head"}]},
 {XMLElement["dl", {"class" -> "bibliography"}]},
 {XMLElement["dl", {"class" -> "bibliography"}]},
 {XMLElement["p", {"class" -> "copyright"}]}
}
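
For intuition, a substring attribute selector such as `[class*=h]` can be approximated with a plain pattern over SymbolicXML. This is only a rough sketch on a small hand-built expression, not the package's implementation; `classContainsQ` and `smallDoc` are hypothetical names introduced for illustration.

```wolfram
(* Small hand-built SymbolicXML expression; this is not the W3C page. *)
smallDoc = XMLElement["body", {}, {
    XMLElement["div", {"class" -> "head"}, {"Title"}],
    XMLElement["p", {"class" -> "copyright"}, {"Notice"}],
    XMLElement["p", {"class" -> "note"}, {"Text"}]}];

(* Hypothetical helper: True if the class attribute contains the substring. *)
(* Association[...] makes the lookup safe for an empty attribute list. *)
classContainsQ[attrs_List, sub_String] :=
  StringContainsQ[Lookup[Association[attrs], "class", ""], sub];

(* Positions of all elements whose class attribute contains "h". *)
positions = Position[smallDoc,
   XMLElement[_, attrs_ /; classContainsQ[attrs, "h"], _]];

(* Keep only the tag and attributes of each match. *)
matches = Extract[smallDoc, positions][[All, 1 ;; 2]];
```

Here `positions` holds the two elements with classes "head" and "copyright"; the element with class "note" is skipped because its class has no 'h'.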

Look for elements with the class 'no-num'.

In[] := Extract[document, Position[document, Selector[".no-num"]]] // Column
Out[] = {
 {XMLElement["h2", {"class" -> "no-num no-toc", "id" -> "abstract"}, {"Abstract"}]},
 {XMLElement["h2", {"class" -> "no-num no-toc", "id" -> "status"}, {"Status of this document"}]},
 {XMLElement["h2", {"class" -> "no-num no-toc", "id" -> "contents"}, {"Table of contents"}]},
 {XMLElement["h2", {"class" -> "no-num no-toc"}, {"W3C Recommendation 06 November 2018"}]}
}
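
Unlike the substring match, a class selector compares whole whitespace-separated tokens of the class attribute. A minimal sketch of that idea follows, again on a hand-built expression; `hasClassQ` and `classDoc` are hypothetical names and this is not how the package is necessarily written.

```wolfram
(* Hand-built SymbolicXML expression for illustration only. *)
classDoc = XMLElement["body", {}, {
    XMLElement["h2", {"class" -> "no-num no-toc", "id" -> "abstract"}, {"Abstract"}],
    XMLElement["h2", {"class" -> "no-number"}, {"Other"}],
    XMLElement["p", {}, {"Text"}]}];

(* Hypothetical helper: split the class attribute on whitespace and test membership. *)
hasClassQ[attrs_List, class_String] :=
  MemberQ[StringSplit[Lookup[Association[attrs], "class", ""]], class];

(* Elements carrying the class token "no-num". *)
classMatches = Extract[classDoc,
   Position[classDoc, XMLElement[_, attrs_ /; hasClassQ[attrs, "no-num"], _]]];
```

Only the first `h2` is returned. The class "no-number" contains the substring "no-num" but is a different token, so it does not match.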

Check specificity of the selector

In[] := Selector[document, "[class~=a] b > *:link"]["Specificity"]
Out[] = {0, 2, 1}

In[] := Selector[document, "[class~=a] b > :not(p)"]["Specificity"]
Out[] = {0, 1, 2}

In[] := Selector[document, "#welcome"]["Specificity"]
Out[] = {1, 0, 0}
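
The triples above follow the Selectors 3 counting rule: the first slot counts ID selectors, the second counts class, attribute and pseudo-class selectors, and the third counts type selectors and pseudo-elements. The universal selector is ignored, and `:not()` itself is not counted although its argument is. A small sketch of the tally, assuming the selector has already been broken into hand-labelled component kinds (the `specificity` function and the labels are hypothetical, and no parsing is attempted):

```wolfram
(* Hypothetical tally over hand-labelled selector components. *)
specificity[kinds_List] := {
   Count[kinds, "id"],
   Count[kinds, "class" | "attribute" | "pseudo-class"],
   Count[kinds, "type" | "pseudo-element"]};

(* "[class~=a] b > *:link": attribute, type b, universal *, pseudo-class :link *)
spec1 = specificity[{"attribute", "type", "universal", "pseudo-class"}];

(* "[class~=a] b > :not(p)": attribute, type b, type p inside :not() *)
spec2 = specificity[{"attribute", "type", "type"}];

(* "#welcome": a single ID selector *)
spec3 = specificity[{"id"}];
```

These give {0, 2, 1}, {0, 1, 2} and {1, 0, 0}, in agreement with the outputs shown above.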

Testing on XML source

In[] := str = "<html xml:lang='zh'><head><title>Test</title></head><body \
        xmlns='http://www.w3.org/1999/xhtml'><p lang='en' class='red' \
        myid='unique'>Here is some math.</p><p><m:math \
        xmlns:m='http://www.w3.org/1998/Math/MathML'><m:mi \
        m:title='cat'>x</m:mi><m:mo>+</m:mo><m:mn>1</m:mn></m:math></p></body>\
        \n</html>";
     
In[] := obj = ImportString[str, "XML"];

Namespace

If the selector does not specify a namespace, then the namespace is ignored:

In[] := Selector[str, "mo"]
Out[] = <|"Specificity" -> {0, 0, 1}, "Elements" -> {{2, 3, 2, 3, 2, 3, 1, 3, 2}}|>

If a namespace is given in the selector, then you need to provide the prefix's expansion rule. Otherwise the selector won't match any element.

In[] := Selector[str, "m|mo"]
Out[] = <|"Specificity" -> {0, 0, 1}, "Elements" -> {}|>

In[] := Selector[str, "m|mo", "Namespaces" -> {"m" -> "http://www.w3.org/1998/Math/MathML"}]
Out[] = <|"Specificity" -> {0, 0, 1}, "Elements" -> {{2, 3, 2, 3, 2, 3, 1, 3, 2}}|>
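
To illustrate why the expansion rule matters: when SymbolicXML keeps namespace information, an element's tag is a {namespace URI, local name} pair rather than a plain string, so a prefix like `m` means nothing until it is mapped to a URI. The following is a sketch on a hand-built expression; `qualifiedName` and `mathExpr` are hypothetical names and not part of the package.

```wolfram
(* Hand-built SymbolicXML in which each tag is a {namespace URI, local name} pair. *)
mathml = "http://www.w3.org/1998/Math/MathML";
mathExpr = XMLElement[{mathml, "math"}, {}, {
    XMLElement[{mathml, "mi"}, {}, {"x"}],
    XMLElement[{mathml, "mo"}, {}, {"+"}],
    XMLElement[{mathml, "mn"}, {}, {"1"}]}];

(* Hypothetical helper: expand a prefix using the supplied rules. *)
(* An unknown prefix is left as-is and therefore matches no URI. *)
qualifiedName[prefix_String, name_String, rules_List] := {prefix /. rules, name};

(* With the expansion rule, "m|mo" finds the operator element. *)
withRule = Position[mathExpr,
   XMLElement[qualifiedName["m", "mo", {"m" -> mathml}], _, _]];

(* Without the rule, the prefix cannot be resolved and nothing matches. *)
withoutRule = Position[mathExpr,
   XMLElement[qualifiedName["m", "mo", {}], _, _]];
```

`withRule` contains the single position of the `mo` element, while `withoutRule` is empty, mirroring the two results above.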

ID

XML documents can define their own ID attribute. Use the "ID" option to indicate which attribute name is in use. This is equivalent to using the attribute selector but with higher specificity.

In[] := Selector[str, "#unique", "ID" -> "myid"]
Out[] = <|"Specificity" -> {1, 0, 0}, "Elements" -> {{2, 3, 2, 3, 1}}|>

In[] := Selector[str, "[myid=unique]"]
Out[] = <|"Specificity" -> {0, 1, 0}, "Elements" -> {{2, 3, 2, 3, 1}}|>

Case sensitivity

XML is case-sensitive, but the Selectors3 package is not by default. Set the "CaseInsensitive" option to False to enforce case sensitivity.

In[] := Selector[str, "[myID=unique]", "CaseInsensitive" -> True]
Out[] = <|"Specificity" -> {0, 1, 0}, "Elements" -> {{2, 3, 2, 3, 1}}|>

In[] := Selector[str, "[myID=unique]", "CaseInsensitive" -> False]
Out[] = <|"Specificity" -> {0, 1, 0}, "Elements" -> {}|>

You can specify the case-sensitivity separately for attribute name and value.

In[] := Selector[str, "[myID=Unique]", "CaseInsensitive" -> {"AttributeName" -> True, "AttributeValue" -> False}]
Out[] = <|"Specificity" -> {0, 1, 0}, "Elements" -> {}|>

In[] := Selector[str, "[myID=Unique]", "CaseInsensitive" -> {"AttributeName" -> False, "AttributeValue" -> True}]
Out[] = <|"Specificity" -> {0, 1, 0}, "Elements" -> {}|>

In[] := Selector[str, "[myID=Unique]", "CaseInsensitive" -> {"AttributeName" -> True, "AttributeValue" -> True}]
Out[] = <|"Specificity" -> {0, 1, 0}, "Elements" -> {{2, 3, 2, 3, 1}}|>
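
The three results above can be reproduced with a simple comparison that treats the attribute name and the attribute value independently. This is a sketch under the assumption that attribute names are plain strings; `sameStringQ` and `attributeMatchQ` are hypothetical helpers, not the package's code.

```wolfram
(* Hypothetical helper: compare two strings with or without regard to case. *)
sameStringQ[a_String, b_String, ignoreCase_] :=
  If[TrueQ[ignoreCase], ToLowerCase[a] === ToLowerCase[b], a === b];

(* Hypothetical helper: does any attribute rule match both the name and the value? *)
attributeMatchQ[attrs_List, name_String, value_String, nameIC_, valueIC_] :=
  AnyTrue[attrs,
   sameStringQ[First[#], name, nameIC] && sameStringQ[Last[#], value, valueIC] &];

(* Attributes of the first <p> element in the XML string above. *)
pAttrs = {"lang" -> "en", "class" -> "red", "myid" -> "unique"};

(* Same three combinations as the "[myID=Unique]" examples. *)
caseResults = {
   attributeMatchQ[pAttrs, "myID", "Unique", True, False],
   attributeMatchQ[pAttrs, "myID", "Unique", False, True],
   attributeMatchQ[pAttrs, "myID", "Unique", True, True]};
```

Only the last combination succeeds, because both the name (myID vs. myid) and the value (Unique vs. unique) differ in case from the document.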

You can also set the case sensitivity of the element type separately.

In[] := Selector[str, "P", "CaseInsensitive" -> {"Type" -> True}]
Out[] = <|"Specificity" -> {0, 0, 1}, "Elements" -> {{2, 3, 2, 3, 1}, {2, 3, 2, 3, 2}}|>

In[] := Selector[str, "P", "CaseInsensitive" -> {"Type" -> False}]
Out[] = <|"Specificity" -> {0, 0, 1}, "Elements" -> {}|>
10 Replies
Posted 1 year ago

I like the package. Clean implementation too.

On a related note, here are two other CSS selector methods. The first uses a real XML processing library in Java rather than any WL hacks: https://mathematica.stackexchange.com/a/183970/38205

This will be as robust as JLink is (i.e. very robust). It is object-oriented and so much, much nicer to work with than the standard Mathematica XML headaches.

And another one I wrote up that uses pure Graph methods to implement selectors in terms of a DFS: https://mathematica.stackexchange.com/a/184417/38205

This one still performs quite well, though, and for very complex queries it is conceivably cleaner than a pure Cases/Position method. It also references a proper Graph, so it can be object-oriented too, and attribute and property lookup from nodes is nearly instantaneous.

Very useful links, thanks. In my earlier implementation I had initially created a graph as well by first tagging all XMLElements with a unique ID:

parentXML //. XMLElement[el_, at_, chld_] :> XMLElement[el, at, chld, CreateUUID["XML-"]];

and then extracting the parent-child relations into a graph. I had kept the XMLElement nesting intact instead of making a flatter association of properties, though now I wonder why I was so keen on holding on to it...

But then I limited the scope of this project. I didn't need a full traversal of the DOM. At worst I needed to look upward towards the root, or look one generation higher to calculate sibling elements. So things like the "nth-child" selector were cumbersome but do-able. And Position + Extract seems fast enough for what I was aiming for.
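
To make the "look upward" idea concrete, here is a rough sketch of how ancestors and sibling indices can be derived from positions alone. It is not the package's code; `parentPosition`, `ancestorPositions`, `childIndex` and `tree` are hypothetical names for illustration.

```wolfram
(* Hand-built SymbolicXML expression for illustration only. *)
tree = XMLElement["html", {}, {
    XMLElement["head", {}, {XMLElement["title", {}, {"Test"}]}],
    XMLElement["body", {}, {
      "intro text",
      XMLElement["p", {}, {"first"}],
      XMLElement["p", {}, {"second"}]}]}];

(* A non-root element sits in part 3 (the child list) of its parent, *)
(* so dropping the last two indices gives the parent's position. *)
parentPosition[pos_List] := Drop[pos, -2];

(* Walk upward until the root position {} is reached. *)
ancestorPositions[pos_List] :=
  Rest[NestWhileList[parentPosition, pos, Length[#] >= 2 &]];

(* 1-based index among element siblings, ignoring text nodes. *)
(* Intended for non-root elements only. *)
childIndex[expr_, pos_List] :=
  Count[Take[Extract[expr, Append[parentPosition[pos], 3]], Last[pos]], _XMLElement];

secondP = {3, 2, 3, 3};  (* position of the second <p> *)
ancestors = ancestorPositions[secondP];
index = childIndex[tree, secondP];
```

For the second `p`, `ancestors` lists the positions of `body` and then the root, and `index` is 2 because the leading text node is not counted as an element.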

One thing that I couldn't see offhand in the sources you shared is how they deal with namespaces. Symbolic XML doesn't always hang on to the namespaces unless you specifically set "IncludeNamespaces" -> True during XML import, which I found a little frustrating. Instead of trying to work around that, I put the onus on the user to include this option. Otherwise I simply fail to find a match: if the namespace information didn't make its way into the symbolic XML expression on import, then there's nothing that I can look for.

Congratulations! This post is now a Staff Pick as distinguished by a badge on your profile! Thank you, keep it coming, and consider contributing your work to The Notebook Archive!

Hi Kevin, Nice post! Did you ever publish your CSSTools` package?

The CSSTools package is soooo close to a solid first release. It makes this post almost obsolete since I've reworked the selector parsing to match the tokens from my CSS tokenizer, but that all should be hidden from the end-user. Slight name changes, though. In any case, I'm continuing to update the documentation for the CSSTools package with feedback from my colleagues. (Documentation never really feels complete, right?) Priorities are still the 12.1 release, but I'm going to push for it getting onto Wolfram's Github page after the release.

Posted 7 months ago

Hi @Kevin Daily please keep us posted when it is released.

Congrats on publishing! I got bonus workflow #1 in your slides to work but am having an issue with #2. I think it should just be CSSTargets[doc, "body"][[1]] (not [[1,1]]). If I change these lines in your presentation's last slide:

body = styleDataCell["Notebook", Notebook,  CSSTargets[doc, "body"][[1]]]
h3 = styleDataCell["question", Cell, CSSTargets[doc, "h3"][[1]]]
h1 = styleDataCell["h1", Cell, CSSTargets[doc, "h1"][[1]]]
li = styleDataCell["li", Cell, CSSTargets[doc, "li"][[1]]]
p = styleDataCell["p", Cell, CSSTargets[doc, "p"][[1]]]
h2 = styleDataCell["h2", Cell, CSSTargets[doc, "h2"][[1]]]

I get the output on the left (which doesn't look exactly right):

[screenshot]

My presentation notebook took parts from the tutorial page of the CSSTools package and is behind the state of the package. The "Bonus Workflow 2" slide is the same material found in the tutorial documentation, specifically the section "Making a Custom Stylesheet from CSS Data", but the tutorial is up to date.

I'm not saying that I can't improve the package, or that bugs aren't present, but there will almost always be slight differences between a web browser's rendering of HTML with the CSS data and the Wolfram Desktop's rendering of cells with an imported CSS stylesheet.

  1. Web browsers like Chrome and Safari have default style sheets. If properties are not specified in the user-level CSS, then the defaults will inherit through to the rendered HTML in the browser. The imported CSS will not be able to pick up the browser's default stylesheet, and Wolfram Desktop has its own default styles that may differ from the browser's. Interestingly, in your example the "SmallCaps" font style is not picked up in your web browser's font, but it is in the notebook. That may be a font variation not available to the font family the browser is using.
  2. CSS styles are separate from HTML content. My example is rather contrived in that I created HTML and Notebook content to have the same top-down structure and wording. The example Notebook cells are already given style names like "h1" so the imported stylesheet can immediately apply. The point is that the styles have (supposedly) imported correctly, whether or not the Notebook content can be faithfully drawn to match the rendered HTML content. One such cheat is I put a newline in the notebook content after the section titles (in the boxed cells) because that matched Chrome's defaults. It looks like your browser was more faithful to the original HTML and does not have the newline.

Ah, I posted before seeing your beautiful tutorial (somehow the paclet documentation didn't index until I restarted the front-end). All the examples in 'paclet:CSSTools/tutorial/CSSTools' work just fine!

Here's the output of your tutorial example (I'm using Chrome). It only misses minor things: the FontFamily looks wrong for "h2", the small-caps of the text in "question" cells are missing, and the color of hyperlinks is a bit off...

[screenshot]

Two follow-up questions:

  1. Are there any methods in CSSTools to better capture these missing pieces of typographic information?
  2. When it fails on some important property e.g. border-radius, how could I add support for that property? Thanks!