Group Abstract Group Abstract

Message Boards Message Boards

3
|
12.5K Views
|
15 Replies
|
16 Total Likes
View groups...
Share
Share this post:

Scraping the Rio2016.com Website for Summer 2016 Olympics Medal Results

Hi all,

I had an idea that I wanted to use the 2016 Summer Olympics medal results by event as an example for a pivot table in Excel. This is for a class that I teach, so it ultimately has to end up in Excel. Naturally, I made an attempt to scrape the data using Excel itself, but I ran into problems with that. So, I figured that I would use Mathematica to grab the data instead, partly as a learning exercise for myself.

Well, that turned out to be a bit more involved than I thought it would be. I ended up having to do a few different steps, and making heavy use of Part in order to massage the data into the form that I wanted. I often find Part to be confusing with nested lists, so this was a useful exercise. What I wanted was a list such as:

Summer 2016 Olympics Results

This list could then be exported to Excel for use in my pivot table. I have attached my notebook file for anyone to look at or use, if interested. I'm also open to (hoping for) some alternative ideas for how to accomplish the same result in an easier way.

EDIT: I found an error with the country association, so I uploaded a fixed version of the notebook.

Thanks,

Tim

Attachments:
POSTED BY: Tim Mayes
15 Replies
Posted 9 years ago
Attachments:
POSTED BY: David Proffer

As I always say: "Excel should not be on the computer of a scientist/data scientist". It seems to do some kind of job for people working in administration though - I am told.

Here a slightly different way of cleaning the data in Mathematica.

(*Import*)
dataimport = Import["https://www.rio2016.com/en/medal-count-sports", "Data"];
(*Generate first table*)
tableresults = 
  Select[Map[Flatten, 
 Transpose[{Flatten[ConstantArray[#[[1]], #[[2]]] & /@ 
 Transpose[{dataimport[[4, 2]][[All, 1]], Length /@ dataimport[[4, 2]][[All, 2]]}]], Flatten[dataimport[[4, 2]][[All, 2]], 1]}]], Length[#] == 8 &];
(*Generate table with type of medal*)
TableForm[
 Flatten /@ Transpose[{tableresults[[All, 1 ;; 4]], ConstantArray["Gold", Length[tableresults]], tableresults[[All, 5 ;; 6]], 
 ConstantArray["Silver", Length[tableresults]], tableresults[[All, 7 ;; 8]], ConstantArray["Bronze", Length[tableresults]]}]]

To me that seems to be much shorter than the endless list of Excel instructions.

Cheers,

M.

POSTED BY: Marco Thiel

Marco,

Thanks. Your solution is the kind of thing that I was looking for, though the final table doesn't match the format that I need exactly (I need each medal on its own row). Your repeated use of Map is exactly the kind of thing that I need to practice on, so thanks for showing that solution.

POSTED BY: Tim Mayes

Hi Tim,

would something like this work for you?

As before

jointTable = 
  Flatten /@ 
     Transpose[{tableresults[[All, 1 ;; 4]], 
     ConstantArray["Gold", Length[tableresults]], 
     tableresults[[All, 5 ;; 6]], 
     ConstantArray["Silver", Length[tableresults]], 
     tableresults[[All, 7 ;; 8]], 
     ConstantArray["Bronze", Length[tableresults]]}];

This table contains all the data. You now only need to pick the columns you need and join them together:

Join[jointTable[[All, 1 ;; 5]], jointTable[[All, {1, 2, 6, 7, 8}]], jointTable[[All, {1, 2, 9, 10, 11}]]] // TableForm

You can, of course, sort this now according to your favourite criterion. If you call

singleTable=Join[jointTable[[All, 1 ;; 5]], jointTable[[All, {1, 2, 6, 7, 8}]], jointTable[[All, {1, 2, 9, 10, 11}]]]

Then

SortBy[singleTable, First] // TableForm

or

SortBy[singleTable, #[[3]] &] // TableForm

or

SortBy[singleTable, #[[4]] &] // TableForm

might give you representation that you prefer.

Cheers,

Marco

POSTED BY: Marco Thiel

Marco,

Thanks, that is very helpful!

Tim

POSTED BY: Tim Mayes
Posted 9 years ago

Make sure you audit your results carefully. It would be unfortunate for people to not know of Singapore's first ever Gold medal this year!

POSTED BY: David Proffer

David,

I've got that (Mens 100m Butterfly). :-) Also, I have double checked my medal counts by country against the Rio2016 site and they match, so I'm confident that I have the results presented accurately.

Tim

POSTED BY: Tim Mayes

This is a brilliant response, @Marco Thiel, and I second that "Excel should not be on the computer of a scientist/data scientist".

POSTED BY: Vitaliy Kaurov

Indeed, but still a lot of people in biology do so:

http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7

POSTED BY: Sander Huisman

Hi Sander,

what a nice reference. I very much agree that in many areas, like biology, they do tend to use Excel. I do work at the interface with other areas of science and have experienced similar issues a lot.

Thanks a lot for that reference. I'll make sure to use it...

Cheers,

Marco

POSTED BY: Marco Thiel

It just came out 5-6 days ago. Most shocking was that for gene-papers in Nature 30% has some issue with Excel. I was horrified!

POSTED BY: Sander Huisman
Posted 9 years ago

Very elegant, thank you for the lesson!

Not that I was able to end up with a 100% correct solution either, but unfortunately for the swimmers in the women's 100m freestyle and the men's 100m butterfly the results are lost in your solution. Due to the corner case in these races where ties occurred for 1st place in the women's race and silver in the men's race.

I completely agree with your sentiment on the use of Excel. The original author's goal was to create an Excel spreadsheet. My ugly solution was a poor example of using the targeted tool.

Scraping web data is a real challenge!

And, IMHO, cleaning scrapped data requires a robust visual inspection tool. While Mathematica 11 has made good steps to improve display of Datasets, it is still far short of a tool that is good for visual inspection.

POSTED BY: David Proffer

David,

Thanks for your effort. I know how to get the data into Excel and manipulate it there. The problem is that I wanted a very specific format that is useful for a pivot table (as shown in my image), and without a bunch of manual "munging" it wasn't going to happen. Also, the Get & Transform query that I was running on the Rio2016 website was occasionally crashing Excel. Even after getting the data with Mathematica, I still need to clean it up a bit in Excel due to some ties (there are sometimes several winners of a particular medal), but those are pretty easy to locate.

POSTED BY: Tim Mayes
Posted 9 years ago

Tim, Good hunting on a solution that works to your goal!

For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

POSTED BY: David Proffer
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard