Can you help me navigate an XML tree?

Posted 3 years ago

I am working on developing a tool in Wolfram Language to correlate user ratings on the website Board Game Geek (BGG, www.boardgamegeek.com). Users on the site can rate games with a value from 1-10, and my initial goal is to allow someone to check their ratings against those of other users to find other users whose tastes generally match their own.

Part of this involves, of course, grabbing all of a user's rated games. BGG has an API which allows this. To access it, one does this:

userName = "skutsch";
urlUser = "https://www.boardgamegeek.com/xmlapi2/collection?username=" <> userName <> "&rated=1";
r1 = Import[urlUser, "XML"];

When that call is made, BGG checks if a data file already exists for that user. If so, it serves the data as XML. If not, it sends an HTTP code of 202 and returns a message that it is preparing the data. Accessing the link again a few seconds later usually then results in an HTTP code of 200 and the XML data. (The data is saved by BGG until either the user makes changes to their collection or a number of days pass by. I've been hitting these users as test cases so they will probably serve data on the first try.)
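The retry behaviour described above could be sketched roughly as follows. This is only a sketch under assumptions: `pollForXML` is a hypothetical helper name, and the retry cap and wait time are arbitrary choices, not anything BGG documents. The fetch step is passed in as a function so the loop can be tried without touching the network.

```wolfram
(* pollForXML is a hypothetical helper. fetch is any zero-argument function
   whose result answers ["StatusCode"] and ["Body"], e.g. URLRead[url] & *)
pollForXML[fetch_, maxTries_Integer : 5, wait_ : 5] :=
 Module[{resp = fetch[], tries = 1},
  (* 202 means the data is still being prepared: wait, then ask again *)
  While[resp["StatusCode"] === 202 && tries < maxTries,
   Pause[wait];
   resp = fetch[];
   tries++];
  (* Only a 200 response carries the XML; anything else is a failure *)
  If[resp["StatusCode"] === 200,
   ImportString[resp["Body"], "XML"],
   $Failed]]
```

With the variables above, a call would look like `r1 = pollForXML[URLRead[urlUser] &]`, which gives either the parsed XML or `$Failed` once the tries run out.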

When it comes to the XML results, I am clueless. I've read the documentation for XML in WL and haven't been able to digest it.

The data very roughly looks like this:

<items totalitems="318" ... >
  <item objecttype="thing" objectid="177590" subtype="boardgame"...>
    <name ...>
    <blah>
    <blah>
    <stats ...>
      <rating value="7.5">
        <blah/>
        <blah/>
      </rating>
    </stats>
    <blah>
  </item>
  <item objecttype="thing" objectid="68448" subtype="boardgame"...>
    <name ...>
    <blah>
    <blah>
    <stats ...>
      <rating value="6.5">
        <blah/>
        <blah/>
      </rating>
    </stats>
    <blah>
  </item>
  <...many more items...>
</items>

As I have editorially indicated, I'm only interested in a few values here. What I want to parse out is:

  • for every <item> where <item subtype="boardgame">:
    • get <item objectid="xxxxx">
    • get <rating value="yy">

I was trying to do this in a very brute-force manner, like so:

    a = <|"gameID" -> Values[r1[[2, 3, i, 2, 2]]], "userName" -> userName,
       "rating" -> ToExpression @@ Values[r1[[2, 3, i, 3, 5, 3, 1, 2]]]|>

basically going to the exact location of the data and iterating through it (i = 1 to <items totalitems="i">). Not great, but seemed to work. I'm lazy, so I liked that it kept me from having to figure out the XML.

Unfortunately there's a snag. If you look at the data where userName="Legomancer", you hit a problem at item 129 (<item objectid="1231">). That item has an additional element that others don't have:

<item objecttype="thing" objectid="1231" subtype="boardgame" collid="6163073">
  <name sortindex="1">Bandu</name>
  <originalname>Bausack</originalname>
  <yearpublished>1987</yearpublished>

That <originalname> element shifts the rest of the fields, wrecking my brainless strategy and throwing an error. So I guess I need to learn how to use XML after all.

So while I read up again on XML and knock at it with some trial and error, if anyone could point me towards a path, that would be super helpful.

POSTED BY: Dave Lartigue
3 Replies
Posted 3 years ago

There are several ways to go about this, depending on how rigid the schema is and what you want to do with the data. The simplest way, and the one that makes the fewest assumptions about the schema, is probably to use Cases. If r1 is your data, then

boardgames = Cases[r1, XMLElement["item", {___, "subtype" -> "boardgame", ___}, ___], Infinity]

will give you a list of all XMLElements that are boardgames.

Now, for each boardgame, we can play the same trick to look for the rating:

Cases[#, XMLElement["rating", ___], Infinity] & /@ boardgames

Cases produces a list, so what you now have is a list of lists. Each list contains the rating substructures for each boardgame structure, which means they no longer contain anything else from the boardgame structure, including the name, which I'm assuming will be important at some point.
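To see the shape of these results without hitting the API, here is a small sketch on a made-up two-item sample. The XML string, the name "Game A", the rating values, and the variable names `sample`, `doc`, and `ratingLists` are my own for illustration. The second item includes the extra `<originalname>` element from the question.

```wolfram
(* Made-up sample mirroring the structure in the question. The second item
   has the extra <originalname> element that broke positional indexing. *)
sample = StringJoin[
   "<items totalitems='2'>",
   "<item objecttype='thing' objectid='177590' subtype='boardgame'>",
   "<name sortindex='1'>Game A</name>",
   "<stats><rating value='7.5'/></stats>",
   "</item>",
   "<item objecttype='thing' objectid='1231' subtype='boardgame'>",
   "<name sortindex='1'>Bandu</name>",
   "<originalname>Bausack</originalname>",
   "<stats><rating value='6.5'/></stats>",
   "</item>",
   "</items>"];
doc = ImportString[sample, "XML"];

(* All <item> elements whose attributes include subtype="boardgame" *)
boardgames = Cases[doc,
   XMLElement["item", {___, "subtype" -> "boardgame", ___}, ___],
   Infinity];

(* One inner list per boardgame, holding that game's <rating> element(s) *)
ratingLists = Cases[#, XMLElement["rating", ___], Infinity] & /@ boardgames;

Length[boardgames]      (* number of matching items *)
Length /@ ratingLists   (* number of ratings found in each item *)
```

Because the patterns match on element names and attributes rather than positions, the extra element in the second item makes no difference.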

If you were to save each of these lists (the boardgames list and the list of ratings) in variables, you could do some further processing to pair them or whatever. You could also just try to put this whole thing in an Association. Let's say that boardgames is the variable for the boardgames. Then you could have done this:

ratings = AssociationMap[Cases[#, XMLElement["rating", ___], Infinity] &, boardgames]

This will be a big hairy thing. You might want just the name of each boardgame to be the key, instead of the whole boardgame structure itself. To get that, you can map yet another Cases function onto the keys:

KeyMap[Cases[#, XMLElement["name", _, name_] -> name, Infinity] &, ratings]

I imagine that all of the lists generated by Cases will be superfluous to your purpose, but I'm not making any assumptions about how many ratings or names there will be. You can further clean the data to suit your purpose.
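As one possible way to do that cleanup, here is a sketch that uses FirstCase to pull a single value per game instead of a list. It assumes each item has only one `<name>` and one `<rating>` you care about. The helper names `gameName` and `gameRating` and the sample XML are made up for illustration.

```wolfram
(* Hypothetical helpers: return the first match, or a Missing[...] marker *)
gameName[item_] := FirstCase[item,
   XMLElement["name", _, {name_String, ___}] :> name,
   Missing["NoName"], Infinity];
gameRating[item_] := FirstCase[item,
   XMLElement["rating", {___, "value" -> v_, ___}, _] :> v,
   Missing["NoRating"], Infinity];

(* Made-up sample data *)
cleanDoc = ImportString[StringJoin[
    "<items>",
    "<item subtype='boardgame'><name sortindex='1'>Game A</name>",
    "<stats><rating value='7.5'/></stats></item>",
    "<item subtype='boardgame'><name sortindex='1'>Bandu</name>",
    "<originalname>Bausack</originalname>",
    "<stats><rating value='6.5'/></stats></item>",
    "</items>"], "XML"];
cleanItems = Cases[cleanDoc,
   XMLElement["item", {___, "subtype" -> "boardgame", ___}, ___],
   Infinity];

(* name -> rating value (still a string), with no wrapping lists *)
nameToRating = AssociationThread[gameName /@ cleanItems, gameRating /@ cleanItems]
```

Note that two games with the same name would collide as keys here, so this only suits data where names are unique.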

POSTED BY: Eric Rimbey

Thanks Eric! As multiple games can share the same name, I am using <item objectid="xxxx"> to identify the game. It's a unique identifier for BGG data.
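A rough sketch of keying on objectid instead: read the ID from the item's attribute list with Lookup, so attribute order doesn't matter, and take the rating from the same item so the two values can't get out of step. `ratingRecord`, the sample XML, and the Missing markers are my own illustrative choices.

```wolfram
(* Hypothetical helper: build one record from a single <item> element *)
ratingRecord[XMLElement["item", attrs_List, children_List], user_String] :=
  <|"gameID" -> Lookup[attrs, "objectid"],
   "userName" -> user,
   "rating" -> FirstCase[children,
     XMLElement["rating", {___, "value" -> v_, ___}, _] :> v,
     Missing["NotRated"], Infinity]|>;

(* Made-up sample data *)
idDoc = ImportString[StringJoin[
    "<items>",
    "<item objecttype='thing' objectid='177590' subtype='boardgame'>",
    "<name sortindex='1'>Game A</name>",
    "<stats><rating value='7.5'/></stats></item>",
    "<item objecttype='thing' objectid='1231' subtype='boardgame'>",
    "<name sortindex='1'>Bandu</name>",
    "<originalname>Bausack</originalname>",
    "<stats><rating value='6.5'/></stats></item>",
    "</items>"], "XML"];
idItems = Cases[idDoc,
   XMLElement["item", {___, "subtype" -> "boardgame", ___}, ___],
   Infinity];

records = ratingRecord[#, "Legomancer"] & /@ idItems
```

Each record has the same keys as the association in my original attempt, with the rating left as a string.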

This will definitely get me started. Many thanks!

POSTED BY: Dave Lartigue

This now seems to be working!

getUserRatings[username_String] := Module[
   {urlUser, listReady, k1, games, gameIDs, id, ratings, ra},
   urlUser = 
    "https://www.boardgamegeek.com/xmlapi2/collection?username=" <> 
     username <> "&rated=1&stats=1";
   (* Poll until the root element is "items" (data) or "errors";
      "message" means BGG is still preparing the data *)
   listReady = 0;
   While[listReady == 0,
    k1 = Import[urlUser, "XML"];
    Which[
     k1[[2, 1]] == "items", listReady = 1,
     k1[[2, 1]] == "message", listReady = 0; Pause[8],
     k1[[2, 1]] == "errors", listReady = 1
     ];
    ];
   If[k1[[2, 1]] == "errors", Print["Invalid Username"],
    games = 
     Cases[k1, 
      XMLElement["item", {___, "subtype" -> "boardgame", ___}, ___], 
      Infinity];
    (* id_ and ra_ each match exactly one attribute value *)
    gameIDs = 
     Cases[games, 
      XMLElement["item", {___, "objectid" -> id_, ___}, ___] :> id, 
      Infinity];
    ratings = 
     Cases[games, XMLElement["rating", {"value" -> ra_}, ___] :> ra,
       Infinity];
    (* Pair IDs and ratings by position, one association per game *)
    MapThread[
     <|"gameID" -> #1, "userName" -> username, "rating" -> #2|> &,
     {gameIDs, ratings}]
    ]
   ];

The only place where I am explicitly poking into the structure of the XML is checking [[2,1]] to see if I got back an error, a "please wait", or actual data. It's fast, too. I tried it on a user with >2000 ratings and it finished almost immediately after getting the data! Thank you for your help! You saved me a lot of time and grief!
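For what it's worth, the [[2,1]] check could also be written as a pattern match on the document's root element, with a fallback for anything unexpected so a polling loop has a way out. This is just a sketch: `rootTag` and `classifyResponse` are made-up names, and the "data"/"wait"/"error"/"unknown" labels are arbitrary.

```wolfram
(* Hypothetical helper: name of the root element of an imported XML document *)
rootTag[XMLObject["Document"][_, XMLElement[tag_String, _, _], _]] := tag;
rootTag[_] := Missing["NoRoot"];

(* Map the three root elements described above to a status label;
   anything else, including a failed import, becomes "unknown" *)
classifyResponse[doc_] := Switch[rootTag[doc],
   "items", "data",
   "message", "wait",
   "errors", "error",
   _, "unknown"];

classifyResponse[ImportString["<items totalitems='0'></items>", "XML"]]
```

A loop could then keep retrying only while the label is "wait" and stop on every other label.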

POSTED BY: Dave Lartigue