Message Boards Message Boards

3
|
9577 Views
|
15 Replies
|
16 Total Likes
View groups...
Share
Share this post:

Scraping the Rio2016.com Website for Summer 2016 Olympics Medal Results

Hi all,

I had an idea that I wanted to use the 2016 Summer Olympics medal results by event as an example for a pivot table in Excel. This is for a class that I teach, so it ultimately has to end up in Excel. Naturally, I made an attempt to scrape the data using Excel itself, but I ran into problems with that. So, I figured that I would use Mathematica to grab the data instead, partly as a learning exercise for myself.

Well, that turned out to be a bit more involved than I thought it would be. I ended up having to do a few different steps, and making heavy use of Part in order to massage the data into the form that I wanted. I often find Part to be confusing with nested lists, so this was a useful exercise. What I wanted was a list such as:

Summer 2016 Olympics Results

This list could then be exported to Excel for use in my pivot table. I have attached my notebook file for anyone to look at or use, if interested. I'm also open to (hoping for) some alternative ideas for how to accomplish the same result in an easier way.

EDIT: I found an error with the country association, so I uploaded a fixed version of the notebook.

Thanks,

Tim

Attachments:
POSTED BY: Tim Mayes
15 Replies
Posted 8 years ago

Web scraping is more an art than a science, IMHO. Not sure Mathematica is a good tool to accomplish your goal. It is not a very good tool to do data munging, cleaning.

In your results, you may want to do some data review. I ran your notebook on OS X 10.10.5 with Mathematica 11.0 and in the resulting spreadsheet the following problems seem to have occurred:

Row 523 Row 829 Row 853

919 total rows including header row 1

which implies 919 medals awarded, however if I just use copy and paste on the following URL: http://www.nbcolympics.com/medals I end up with a total medal count of 975 (notes: the resulting copy and paste yielded an error on row 26 for country Ukraine, the columns shifted left one column which I fixed manually)

Below are a set of steps in Excel only to get you to a spreadsheet that has correct number of awarded medals per event (note, I am off by 1 medal and my results need testing). The steps below deal with ties and related events in some events. My results at the intermediate step where I stopped are off by 1 metal in total count from the above count of 975.

These steps are all simple excel operations, result is spreadsheet of values, each row contains column for medal count for 1st, 2nd and 3rd.

To continue any further you will need to move to more advanced Excel Visual Basic code or move to a more powerful tool to find where the one metal was off, to pivot the data and continue to make it more tidy. However, it will become more difficult to parse programmatically as in few steps the athlete names are not parseable by ties, countries or team members.

Good hunting!

Data munging sucks then we die.... ;-)

Steps to create the attached spreadsheet, done on OS X 10.10.5 Excel for Mac 2011 V14.6.6.

Highlight results from web page:

https://www.rio2016.com/en/medal-count-sports

copy and paste to Excel

Delete 1st four rows

insert blank column in front of left most column

insert following formula in cell A1:

=IF(EXACT(C1,""),B1,"")

copy and paste this formula down to remainder of column A

highlight entire spreadsheet and copy

create new blank spreadsheet and paste values

insert new column to the left of column A

in cell A1 enter following formula:

=IF(EXACT(B1,C1),B1,INDIRECT(ADDRESS(ROW()-1,COLUMN())))

copy and paste this formula down to remainder of column A

highlight entire spreadsheet and copy

create new blank spreadsheet and paste values

delete column B in new spreadsheet

hightlight all cells

sort ascending by column C

scroll down and delete bottom rows that contain only values in column A and B

highlight entire spreadsheet

sort ascending by column A

highlight all rows with data and copy

scroll down to first empty row and paste

scroll down again to first empty row and paste

highlight all cells

sort ascending by column A and add second sort ascending by column B

insert column before column C

in cell C1 insert formula: =LEN(D517)/3

copy cell C1 down to all rows in column C with data

insert column before column E

in cell D1 insert formula:

=LEN(G1)/3

copy cell D1 down to all rows in column D with data

insert column before column F

in cell E1 insert formula:

=LEN(J1)/3

copy cell E1 down to all rows in column E with data

highlight entire spreadsheet and copy

create new blank spreadsheet and paste values

highlight all cells

under data function, select delete duplicates, select ALL columns

create a sum for column C create a sum for column D create a sum for column E

These are Gold, Silver and Bronze counts

sum these three values to get 974

This number is 1 medal less than what was found by the count from scraping web page: http://www.nbcolympics.com/medals

Attachments:
POSTED BY: David Proffer

As I always say: "Excel should not be on the computer of a scientist/data scientist". It seems to do some kind of job for people working in administration though - I am told.

Here a slightly different way of cleaning the data in Mathematica.

(*Import*)
dataimport = Import["https://www.rio2016.com/en/medal-count-sports", "Data"];
(*Generate first table*)
tableresults = 
  Select[Map[Flatten, 
 Transpose[{Flatten[ConstantArray[#[[1]], #[[2]]] & /@ 
 Transpose[{dataimport[[4, 2]][[All, 1]], Length /@ dataimport[[4, 2]][[All, 2]]}]], Flatten[dataimport[[4, 2]][[All, 2]], 1]}]], Length[#] == 8 &];
(*Generate table with type of medal*)
TableForm[
 Flatten /@ Transpose[{tableresults[[All, 1 ;; 4]], ConstantArray["Gold", Length[tableresults]], tableresults[[All, 5 ;; 6]], 
 ConstantArray["Silver", Length[tableresults]], tableresults[[All, 7 ;; 8]], ConstantArray["Bronze", Length[tableresults]]}]]

To me that seems to be much shorter than the endless list of Excel instructions.

Cheers,

M.

POSTED BY: Marco Thiel

Marco,

Thanks. Your solution is the kind of thing that I was looking for, though the final table doesn't match the format that I need exactly (I need each medal on its own row). Your repeated use of Map is exactly the kind of thing that I need to practice on, so thanks for showing that solution.

POSTED BY: Tim Mayes

Hi Tim,

would something like this work for you?

As before

jointTable = 
  Flatten /@ 
     Transpose[{tableresults[[All, 1 ;; 4]], 
     ConstantArray["Gold", Length[tableresults]], 
     tableresults[[All, 5 ;; 6]], 
     ConstantArray["Silver", Length[tableresults]], 
     tableresults[[All, 7 ;; 8]], 
     ConstantArray["Bronze", Length[tableresults]]}];

This table contains all the data. You now only need to pick the columns you need and join them together:

Join[jointTable[[All, 1 ;; 5]], jointTable[[All, {1, 2, 6, 7, 8}]], jointTable[[All, {1, 2, 9, 10, 11}]]] // TableForm

You can, of course, sort this now according to your favourite criterion. If you call

singleTable=Join[jointTable[[All, 1 ;; 5]], jointTable[[All, {1, 2, 6, 7, 8}]], jointTable[[All, {1, 2, 9, 10, 11}]]]

Then

SortBy[singleTable, First] // TableForm

or

SortBy[singleTable, #[[3]] &] // TableForm

or

SortBy[singleTable, #[[4]] &] // TableForm

might give you representation that you prefer.

Cheers,

Marco

POSTED BY: Marco Thiel

Marco,

Thanks, that is very helpful!

Tim

POSTED BY: Tim Mayes
Posted 8 years ago

Make sure you audit your results carefully. It would be unfortunate for people to not know of Singapore's first ever Gold medal this year!

POSTED BY: David Proffer

David,

I've got that (Mens 100m Butterfly). :-) Also, I have double checked my medal counts by country against the Rio2016 site and they match, so I'm confident that I have the results presented accurately.

Tim

POSTED BY: Tim Mayes

This is a brilliant response, @Marco Thiel, and I second that "Excel should not be on the computer of a scientist/data scientist".

POSTED BY: Vitaliy Kaurov

Indeed, but still a lot of people in biology do so:

http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7

POSTED BY: Sander Huisman

Hi Sander,

what a nice reference. I very much agree that in many areas, like biology, they do tend to use Excel. I do work at the interface with other areas of science and have experienced similar issues a lot.

Thanks a lot for that reference. I'll make sure to use it...

Cheers,

Marco

POSTED BY: Marco Thiel

It just came out 5-6 days ago. Most shocking was that for gene-papers in Nature 30% has some issue with Excel. I was horrified!

POSTED BY: Sander Huisman
Posted 8 years ago

Very elegant, thank you for the lesson!

Not that I was able to end up with a 100% correct solution either, but unfortunately for the swimmers in the women's 100m freestyle and the men's 100m butterfly the results are lost in your solution. Due to the corner case in these races where ties occurred for 1st place in the women's race and silver in the men's race.

I completely agree with your sentiment on the use of Excel. The original author's goal was to create an Excel spreadsheet. My ugly solution was a poor example of using the targeted tool.

Scraping web data is a real challenge!

And, IMHO, cleaning scrapped data requires a robust visual inspection tool. While Mathematica 11 has made good steps to improve display of Datasets, it is still far short of a tool that is good for visual inspection.

POSTED BY: David Proffer

David,

Thanks for your effort. I know how to get the data into Excel and manipulate it there. The problem is that I wanted a very specific format that is useful for a pivot table (as shown in my image), and without a bunch of manual "munging" it wasn't going to happen. Also, the Get & Transform query that I was running on the Rio2016 website was occasionally crashing Excel. Even after getting the data with Mathematica, I still need to clean it up a bit in Excel due to some ties (there are sometimes several winners of a particular medal), but those are pretty easy to locate.

POSTED BY: Tim Mayes
Posted 8 years ago

Tim, Good hunting on a solution that works to your goal!

For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

POSTED BY: David Proffer
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract