Group Abstract Group Abstract

Message Boards Message Boards

Add M49 standard country codes to a dataset?

Posted 7 years ago

Hey everyone,

I have to work on a university project where we have a dataset with all countries of the world are listed. I would like to add another row with countries/geographical regions which are defined by the UN in the M49 code (find more here). Is there a way to do this automatically in Mathematica?

I hope I could describe my problem in an understandable way. :) Greetings from the beginner side

POSTED BY: Tester Trying
7 Replies
POSTED BY: Hans Michel
Posted 7 years ago
POSTED BY: Tester Trying
Posted 7 years ago

Hi Tester Trying,

Can you be more specific regarding

I tried both versions but it didn't work for me

What exactly did not work? Was an error reported?

My answer does exactly what you requested, add continent and region columns. The raw data was imported as a Dataset because it provides a lot of useful functionality for working with structured data. I would suggest learning how to use Dataset, take a look at the examples.

If you prefer working with CSV outside of Mathematica then you can always Export the dataset as CSV.

POSTED BY: Rohit Namjoshi

Hi Tester:

The source data file, which is listed in your reply, was downloaded from kaggle.com (one needs to register to get dataset).

master = Import["C:\\suicide-rates-overview-1985-to-2016\\master.csv"];
mastercountriesraw = DeleteDuplicates[Rest[master[[All, 1]]]];

Now one needs the UN's version of their M49 methodology file. In a webbrowser navigate to this URL page

[https://unstats.un.org/unsd/methodology/m49/] with "Geographic Regions" selected from left menu-like items under the "M49 Standard"

Title: Click on "Search and Download:" "Full view" Leads to [https://unstats.un.org/unsd/methodology/m49/overview/]

There is a CSV export (button) which is much better than trying to gather the data from Import[URL, "Data"]

If one were to explore this branch it may lead to more clean up work than currently necessary for purposes of providing a possible solution path.

data = Import["https://unstats.un.org/unsd/methodology/m49/overview/", "Data"];
Part[data, 2, 2, 1, 1]

Gets the header data

Part[data, 2, 2, 1, 2] 

Gets the rest but not all rows contains same amount of columns so this information is ragged. Can be fix but not worth the detour.

So download the CSV file default name is "UNSD -- Methodology.csv".

UNmethoddata = Import["C:\\suicide-rates-overview-1985-to-2016\\UNSD \[LongDash] \
Methodology.csv"];

Now back to mastercountriesraw. Let's see if we can resolve the distinct countries as Entities in Wolfram Language (WL). If possible, then we will have Entity["Country","CountryName"] such that we could query for the WL CountryData property "UNNumber". The simplest way to let the WL automatically resolve the raw country names from the master file to a WL Entity is as follows:

mastercountriesinter = Map[Interpreter["Country"][#] &, mastercountriesraw];

The above is a bit slower for getting country data, but for this particular file yields no relevant issues. However, the following functions get us there and creates a dataset that one can explore

Clear[master, mastercountriesraw, UNmethoddata, addcountryprop, 
  UNcountryprop, groupUNcountryprop, ATgroupUNcountryprop, dataset];
master = Import[
   "C:\\suicide-rates-overview-1985-to-2016\\master.csv"];
mastercountriesraw = DeleteDuplicates[Rest[master[[All, 1]]]];
UNmethoddata = 
  Import["C:\\suicide-rates-overview-1985-to-2016\\UNSD \
\[LongDash] MethodologyFixed.csv"];
addcountryprop[cnty_] := 
  Module[{countryraw, countryent, continent, unnumber, globalcode, 
    globalname, regioncode, regionname, subregioncode, subregionname, 
    intermediateregioncode, intermediateregionname, countryorarea, 
    M49code, bdata}, countryraw = cnty;
   countryent = Quiet[CountryData[cnty]];
   If[ToString[Head[countryent]] == "CountryData", 
    countryent = Interpreter["Country"][cnty]];
   continent = CanonicalName[CountryData[countryent, "Continent"]];
   unnumber = CountryData[countryent, "UNNumber"];
   {globalcode, globalname, regioncode, regionname, subregioncode, 
     subregionname, intermediateregioncode, intermediateregionname, 
     countryorarea, M49code} = 
    Take[SelectFirst[UNmethoddata, (#[[10]] == unnumber) &], 10];
   bdata = Select[master, (#[[1]] == countryraw) &];
   {globalname, regionname, subregionname, 
    If[intermediateregionname == "", subregionname, 
     intermediateregionname] , continent, countryraw, Rest /@ bdata}];
UNcountryprop = Map[addcountryprop[#] &, mastercountriesraw];
groupUNcountryprop = 
  GatherBy[SortBy[
    UNcountryprop, {#[[1]], #[[2]], #[[3]], #[[4]]} &], {#[[1]], \
#[[2]], #[[3]], #[[4]]} &];
makeassoc[item_] := Module[{},
   {Association[
     Rule[item[[1]][[1]], 
        Association[
            Rule[item[[1]][[2]],
                 Association[
                      Rule[item[[1]][[3]], 
                       Association[
                          Rule[item[[1]][[4]],
                                Association[
                                 Rule[item[[1]][[5]],
                                            Association[
                                        Rule[item[[1]][[6]],

                Flatten[
                 Map[AssociationThread[{"year", "sex", "age", 
                    "suicides", "population", "suicidesper100kpop", 
                    "countryyear", "HDIforyear", "gdpforyear", 
                    "gdppercapita", "generation"}, #] &, 
                  item[[1]][[7]]]]

                                    ]
                                   ]
                                  ]
                                 ]
                               ]
                             ]
                       ]
                        ]
                      ]
                  ]
             ]
          ]
    }];
ATgroupUNcountryprop = Map[makeassoc[#] &, groupUNcountryprop];
dataset = Dataset[ATgroupUNcountryprop]

Now in your initial post you really didn't ask to do any calculations but need to add data to the master file. And in your second post I assume from the image you would like an accordion like UI to just navigate the data. WL Dataset[] function produces an accordion like UI, enter image description here

You can click on say "World"

enter image description here

You can click on the Association cell and enter image description here

Leave it to poster to explore Query[] function. The Fixed csv file fix is to put double quotes around some entries which have commas in them the "China, Hong Kong ..." and the like and "Bonaire,...".

POSTED BY: Hans Michel
Posted 7 years ago

Dear Hans Michel,

I have this dataset with all countries and multiple entries for different years per country. Ideally I would like to add two kind of rows. One with the continents and one with geographical regions like 'Caribbean', 'Central America' or 'North/South Europe' for example.

dataset: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016

regions:

enter image description here

POSTED BY: Tester Trying
Posted 7 years ago

Hi Tester,

That dataset is a CSV file. I think you mean add columns for continent and geographical region, not rows. Here is one way to do that

Download and import the dataset

suicideImport = Import["~/Downloads/master.csv", "Dataset", HeaderLines -> 1];

One country's name does not match WL data so rename it

suicide = 
  suicideImport[All, 
   If[#country == "Saint Vincent and Grenadines", <|#, "country" -> "Saint Vincent and the Grenadines"|>, #] &];

Add continent column

suicide = 
  suicide[All, <|#, "continent" -> CanonicalName@CountryData[#country]["Continent"]|> &];

Country and region data is available here, download Excel and import

unsdImport = 
  Import["~/Downloads/UNSD \[LongDash] Methodology.xlsx", {"Dataset", 1}, HeaderLines -> 1];

Generate map from country name to region

regionMap = 
  unsdImport[
     All, <|#"Country or Area" -> 
        If[#"Intermediate Region Name" == "", #"Sub-region Name", #"Intermediate Region Name"]|> &] //
     Normal // Association;

Add region to suicide dataset

suicide = suicide[All, <|#, "region" -> regionMap[#country]|> &]

There are still some issues, country "Czech Republic" has no mapping to region.

Total number of suicides by continent and year

suicide[GroupBy[{#continent, #year} &] /* Values,
 <|
   "continent" -> Query[Max, #continent &], 
   "year" -> Query[Max, #year &], 
   "total_suicides" -> Query[Total, #"suicides_no" &]
 |>]
POSTED BY: Rohit Namjoshi

Hello Tester...: This should be something to get started pending some clarification of your post. Do you want to add another column or row. And what is to be in that column.

data = Import["https://unstats.un.org/unsd/methodology/m49/", "Data"];
Rest[Part[data,2,1,2,1,2,1][[All,2]]]/. Map[CountryData[#, "UNNumber"] -> # &, CountryData[All]] 

Using the UNCode is problematic as there is an entry in the M49 data that is missing there are ways to fix this but the M49Code is all there. Nevertheless not all number match a Wolfram Language CountryData entity.

POSTED BY: Hans Michel
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard