Message Boards Message Boards

Add M49 standard country codes to a dataset?

Posted 5 years ago

Hey everyone,

I have to work on a university project where we have a dataset with all countries of the world are listed. I would like to add another row with countries/geographical regions which are defined by the UN in the M49 code (find more here). Is there a way to do this automatically in Mathematica?

I hope I could describe my problem in an understandable way. :) Greetings from the beginner side

POSTED BY: Tester Trying
7 Replies

Hi Florian (Tester Trying):

I would agree with Rohit on needed clarifications. The initial source data file and UN data in csv format need cleaning. The data files contain some incongruences pertaining to WolframData and the nature of country names over time and multiple regions with undefined sub-regions. As per your title, some pivoting on the M49 code helps. The following code is much the same as before up to UNcountryprop where we diverge away from Association and Dataset and just manipulate list. Instead of running through each line of the "master.csv" file and adding the two elements (columns) {"continent", "region"} it helps to optimize by reducing to a set of distinct countries, then mapping {"continent", "region"}

Clear[master, mastercountriesraw, UNmethoddata, addcountryprop, UNcountryprop, addcontinentandregionforexport, masterout, headings];
master = Import["C:\\suicide-rates-overview-1985-to-2016\\master.csv"];
mastercountriesraw = DeleteDuplicates[Rest[master[[All,1]]]];
UNmethoddata = Import["C:\\suicide-rates-overview-1985-to-2016\\UNSD — MethodologyFixed.csv"];
addcountryprop[cnty_] := Module[
       { countryraw, countryent, continent, unnumber, globalcode, globalname, regioncode, regionname, subregioncode, subregionname, intermediateregioncode, intermediateregionname, countryorarea, M49code, bdata}, 
       countryraw = cnty;
        countryent = Quiet[CountryData[cnty]];
        If[ToString[Head[countryent]] == "CountryData", countryent = Interpreter["Country"][cnty]];
        continent = CanonicalName[CountryData[countryent, "Continent"]];
        unnumber = CountryData[countryent, "UNNumber"];
        { globalcode, globalname, regioncode, regionname, subregioncode, subregionname, intermediateregioncode, intermediateregionname, countryorarea, M49code } = 
       Take[SelectFirst[UNmethoddata, (#[[10]] == unnumber) &], 10];
        bdata = Select[master, (#[[1]] == countryraw) &];
        { globalname, regionname, subregionname, If[intermediateregionname == "", subregionname, intermediateregionname], continent, countryraw, Rest /@ bdata }
       ];
UNcountryprop = Map[addcountryprop[#] &, mastercountriesraw];
addcontinentandregionforexport[item_] := Module[
       { contiregcountr }, 
       contiregcountr = {item[[5]], item[[4]], item[[6]]};
        Map[Join[contiregcountr, #] &, item[[7]] ] 
       ];
masterout = Flatten[Map[addcontinentandregionforexport[#] &, UNcountryprop], 1];
headings = Join[ {"continent","region"}, First[master] ];
Export["C:\\suicide-rates-overview-1985-to-2016\\masterout.csv", masterout, "CSV", "TableHeadings" -> headings]

In reviewing previous code I spotted some issues with the GroupBy; but you don't want a Dataset .Using the Wolfram Language to add two columns to a CSV file is actually straight forward. There are some complications as to how to add those columns in your post questions.

POSTED BY: Hans Michel
Posted 5 years ago

Hi Tester Trying,

Can you be more specific regarding

I tried both versions but it didn't work for me

What exactly did not work? Was an error reported?

My answer does exactly what you requested, add continent and region columns. The raw data was imported as a Dataset because it provides a lot of useful functionality for working with structured data. I would suggest learning how to use Dataset, take a look at the examples.

If you prefer working with CSV outside of Mathematica then you can always Export the dataset as CSV.

POSTED BY: Rohit Namjoshi
Posted 5 years ago

Hey Hans and Rohit,

thank you very much for your help. I tried both versions but it didn't work for me. My objective is to import the file which works perfectly by this:

OriginalData = Import[FileNameJoin[{NotebookDirectory[], "suicide.csv"}]];

Then I would like to add a column for continent and another column for region. This I would like to export to a csv file or work with it in Mathematica:

Export[FileNameJoin[{NotebookDirectory[],"suicide.mx"}], OriginalData]

The problem is that the other structure of the file should not be changed because I later want to test different hypothesis, do time analysis or geocomputation. The solution from you @Hans Michel changes the look of the file. The expanding option to click from World>Africa>Mauritius is nice but this will make further work more complicated. Is my request even possible to achieve? I am really on the beginner level here. :D If it is possible for you to send me a csv file with the added two columns that would be amazing. But of course I would also like to understand the work around behind it. Your help is highly appreciated Florian

POSTED BY: Tester Trying

Hi Tester:

The source data file, which is listed in your reply, was downloaded from kaggle.com (one needs to register to get dataset).

master = Import["C:\\suicide-rates-overview-1985-to-2016\\master.csv"];
mastercountriesraw = DeleteDuplicates[Rest[master[[All, 1]]]];

Now one needs the UN's version of their M49 methodology file. In a webbrowser navigate to this URL page

[https://unstats.un.org/unsd/methodology/m49/] with "Geographic Regions" selected from left menu-like items under the "M49 Standard"

Title: Click on "Search and Download:" "Full view" Leads to [https://unstats.un.org/unsd/methodology/m49/overview/]

There is a CSV export (button) which is much better than trying to gather the data from Import[URL, "Data"]

If one were to explore this branch it may lead to more clean up work than currently necessary for purposes of providing a possible solution path.

data = Import["https://unstats.un.org/unsd/methodology/m49/overview/", "Data"];
Part[data, 2, 2, 1, 1]

Gets the header data

Part[data, 2, 2, 1, 2] 

Gets the rest but not all rows contains same amount of columns so this information is ragged. Can be fix but not worth the detour.

So download the CSV file default name is "UNSD -- Methodology.csv".

UNmethoddata = Import["C:\\suicide-rates-overview-1985-to-2016\\UNSD \[LongDash] \
Methodology.csv"];

Now back to mastercountriesraw. Let's see if we can resolve the distinct countries as Entities in Wolfram Language (WL). If possible, then we will have Entity["Country","CountryName"] such that we could query for the WL CountryData property "UNNumber". The simplest way to let the WL automatically resolve the raw country names from the master file to a WL Entity is as follows:

mastercountriesinter = Map[Interpreter["Country"][#] &, mastercountriesraw];

The above is a bit slower for getting country data, but for this particular file yields no relevant issues. However, the following functions get us there and creates a dataset that one can explore

Clear[master, mastercountriesraw, UNmethoddata, addcountryprop, 
  UNcountryprop, groupUNcountryprop, ATgroupUNcountryprop, dataset];
master = Import[
   "C:\\suicide-rates-overview-1985-to-2016\\master.csv"];
mastercountriesraw = DeleteDuplicates[Rest[master[[All, 1]]]];
UNmethoddata = 
  Import["C:\\suicide-rates-overview-1985-to-2016\\UNSD \
\[LongDash] MethodologyFixed.csv"];
addcountryprop[cnty_] := 
  Module[{countryraw, countryent, continent, unnumber, globalcode, 
    globalname, regioncode, regionname, subregioncode, subregionname, 
    intermediateregioncode, intermediateregionname, countryorarea, 
    M49code, bdata}, countryraw = cnty;
   countryent = Quiet[CountryData[cnty]];
   If[ToString[Head[countryent]] == "CountryData", 
    countryent = Interpreter["Country"][cnty]];
   continent = CanonicalName[CountryData[countryent, "Continent"]];
   unnumber = CountryData[countryent, "UNNumber"];
   {globalcode, globalname, regioncode, regionname, subregioncode, 
     subregionname, intermediateregioncode, intermediateregionname, 
     countryorarea, M49code} = 
    Take[SelectFirst[UNmethoddata, (#[[10]] == unnumber) &], 10];
   bdata = Select[master, (#[[1]] == countryraw) &];
   {globalname, regionname, subregionname, 
    If[intermediateregionname == "", subregionname, 
     intermediateregionname] , continent, countryraw, Rest /@ bdata}];
UNcountryprop = Map[addcountryprop[#] &, mastercountriesraw];
groupUNcountryprop = 
  GatherBy[SortBy[
    UNcountryprop, {#[[1]], #[[2]], #[[3]], #[[4]]} &], {#[[1]], \
#[[2]], #[[3]], #[[4]]} &];
makeassoc[item_] := Module[{},
   {Association[
     Rule[item[[1]][[1]], 
        Association[
            Rule[item[[1]][[2]],
                 Association[
                      Rule[item[[1]][[3]], 
                       Association[
                          Rule[item[[1]][[4]],
                                Association[
                                 Rule[item[[1]][[5]],
                                            Association[
                                        Rule[item[[1]][[6]],

                Flatten[
                 Map[AssociationThread[{"year", "sex", "age", 
                    "suicides", "population", "suicidesper100kpop", 
                    "countryyear", "HDIforyear", "gdpforyear", 
                    "gdppercapita", "generation"}, #] &, 
                  item[[1]][[7]]]]

                                    ]
                                   ]
                                  ]
                                 ]
                               ]
                             ]
                       ]
                        ]
                      ]
                  ]
             ]
          ]
    }];
ATgroupUNcountryprop = Map[makeassoc[#] &, groupUNcountryprop];
dataset = Dataset[ATgroupUNcountryprop]

Now in your initial post you really didn't ask to do any calculations but need to add data to the master file. And in your second post I assume from the image you would like an accordion like UI to just navigate the data. WL Dataset[] function produces an accordion like UI, enter image description here

You can click on say "World"

enter image description here

You can click on the Association cell and enter image description here

Leave it to poster to explore Query[] function. The Fixed csv file fix is to put double quotes around some entries which have commas in them the "China, Hong Kong ..." and the like and "Bonaire,...".

POSTED BY: Hans Michel
Posted 5 years ago

Hi Tester,

That dataset is a CSV file. I think you mean add columns for continent and geographical region, not rows. Here is one way to do that

Download and import the dataset

suicideImport = Import["~/Downloads/master.csv", "Dataset", HeaderLines -> 1];

One country's name does not match WL data so rename it

suicide = 
  suicideImport[All, 
   If[#country == "Saint Vincent and Grenadines", <|#, "country" -> "Saint Vincent and the Grenadines"|>, #] &];

Add continent column

suicide = 
  suicide[All, <|#, "continent" -> CanonicalName@CountryData[#country]["Continent"]|> &];

Country and region data is available here, download Excel and import

unsdImport = 
  Import["~/Downloads/UNSD \[LongDash] Methodology.xlsx", {"Dataset", 1}, HeaderLines -> 1];

Generate map from country name to region

regionMap = 
  unsdImport[
     All, <|#"Country or Area" -> 
        If[#"Intermediate Region Name" == "", #"Sub-region Name", #"Intermediate Region Name"]|> &] //
     Normal // Association;

Add region to suicide dataset

suicide = suicide[All, <|#, "region" -> regionMap[#country]|> &]

There are still some issues, country "Czech Republic" has no mapping to region.

Total number of suicides by continent and year

suicide[GroupBy[{#continent, #year} &] /* Values,
 <|
   "continent" -> Query[Max, #continent &], 
   "year" -> Query[Max, #year &], 
   "total_suicides" -> Query[Total, #"suicides_no" &]
 |>]
POSTED BY: Rohit Namjoshi
Posted 5 years ago

Dear Hans Michel,

I have this dataset with all countries and multiple entries for different years per country. Ideally I would like to add two kind of rows. One with the continents and one with geographical regions like 'Caribbean', 'Central America' or 'North/South Europe' for example.

dataset: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016

regions:

enter image description here

POSTED BY: Tester Trying

Hello Tester...: This should be something to get started pending some clarification of your post. Do you want to add another column or row. And what is to be in that column.

data = Import["https://unstats.un.org/unsd/methodology/m49/", "Data"];
Rest[Part[data,2,1,2,1,2,1][[All,2]]]/. Map[CountryData[#, "UNNumber"] -> # &, CountryData[All]] 

Using the UNCode is problematic as there is an entry in the M49 data that is missing there are ways to fix this but the M49Code is all there. Nevertheless not all number match a Wolfram Language CountryData entity.

POSTED BY: Hans Michel
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract