Message Boards Message Boards

[WSS19] Analyzing and visualizing data about the world's oldest people

Data about the oldest humans that have lived allow a small data set to reveal larger patterns about human life expectancy and general quality of life. The data used in this brief analysis can be found at http://archive.is/4kwbk.The first step in my data analysis was to extract data and reformat it. First, I import the data and define the headers of the dataset.

orig = Import["http://archive.is/4kwbk", "Data"]
heads = {"#", "Birthplace", "Name", "Born", "Died", "Age", "Days", 
  "Race", "Sex", "Deathplace"}

Next, I convert the data to the Dataset form.

dataSimplified = 
 Dataset[Association[Thread[heads -> #[[;; 10]]]] & /@ 
   Take[orig[[2, 1, 1]], {8, 72}]]

The data is relatively disorganized and complex, so many substitutions were necessary so that the Interpreter function was able to understand the data.

dataCorrected = 
 dataSimplified[
  All, {"Birthplace" -> 
    StringReplace[{"U.S." ~~ ___ -> "USA", ___ ~~ "(UK)" -> "UK", 
      "Czechoslovakia" -> "Czech Republic", 
      "Germany (now Poland)" ~~ ___ -> "Poland", 
      "Canada (Que)" -> "Canada", 
      "Cape Verde (Portugal) [8]" -> "Portugal", ___ ~~ "Jamaica)" -> 
       "Jamaica"}], "Born" -> StringReplace[{"[1]" -> ""}]}]

The last line of the data is offset as Kane Tanaka is the current oldest person alive.

In[ ] = Normal[dataCorrected1[65, All]]
Out[ ] = <|"#" -> 65, "Birthplace" -> "Japan", "Name" -> "Kane Tanaka", 
 "Born" -> "Jan. 2, 1903", "Died" -> "115*", "Age" -> "214*", 
 "Days" -> "EA", "Race" -> "F", "Sex" -> "Japan (Fukuoka)", 
 "Deathplace" -> "2018-"|>

We'll have to manually edit this data.

row65 = Dataset[<|"#" -> 65, "Birthplace" -> "Japan", 
   "Name" -> "Kane Tanaka", "Born" -> "Jan. 2, 1903", "Died" -> "n/a",
    "Age" -> "n/a", "Days" -> "n/a", "Race" -> "EA", "Sex" -> "F"|>]
dataCorrected2 = Append[Drop[dataCorrected1, -1], row65]

Using the Interpreter function, I was able to convert birthplaces to a format that is easily computable.

dataInterpreted = 
 dataCorrected2[
  All, {"Birthplace" -> Interpreter["Country"], 
   "Born" -> Interpreter["Date"], "Died" -> Interpreter["Date"]}]

I then used the GeoHistogram command to create the following figure.

GeoHistogram[
 Normal[dataInterpreted[All, "Birthplace"]], {"Rectangle", 15}, 
 PlotTheme -> "Scientific"]

Geo histogram

The figure displays the density of the worlds oldest people in different regions. Analysis: This figure supports the hypothesis of the Global South. This controversial hypothesis that is often said to overgeneralize suggests that individuals living South of the equator are more likely to have a lower quality of life than those living North of the equator. This figure supports the hypothesis as better quality of life is associated with longer life expectancy, and the figure suggests that those who reign as the oldest humans are more likely to be found North of the equator.

We can also observe similar disparities across races. The following figure was generated using the PieChart function.

racecounts = Counts[Normal[dataInterpreted[All, "Race"]]]
PieChart[Values[racecounts], 
 ChartLabels -> {"White", "Japanese", "Black", "Hispanic", 
   "East Asian", "Multiracial"}]

enter image description here This figure displays the proportion of those who have reigned as the oldest human on earth by race. Note Japanese is separated as a different race in accordance with the original data. Analysis: This figure displays that a large majority of those who have reigned as oldest person are white. This data is reflective of general, global racial inequity.

Finally, we can look at some statistics of this elite group. The distribution of their ages can be generated using the Histogram function.

lifeLengthData = 
 Drop[Quantity[Normal[dataInterpreted[All, "Age"]], "Years"] + 
   Quantity[Normal[dataInterpreted[All, "Days"]], "Days"], -1]
Histogram[lifeLengthData, {600}, Frame -> True, 
 FrameLabel -> {"Age (days)", "Frequency"}]

enter image description here Note the last value in the is dropped as Kane Tanaka is still living. Analysis: What is particularly interesting about this data is the relative symmetry it exhibits. Using the Skewness function, I found that the skewness of this distribution is only .174, relatively low.

N@Skewness[lifeLengthData]

This is unexpected as age distributions typically have strong positive skew. If we interpret this data as a sample from a theoretical distribution, we can conduct inference tests. The inference test of interest would be the Mardia skewness test for normality. This test allows us to assess whether the skewness observed is significantly different than the skewness measurements produced by random sampling of the normal distribution. The Wolfram Language is able to run this test with the MardiaSkewnessTest. Testing the data returns a p-value of .569, above almost all significance thresholds.

MardiaSkewnessTest[N@lifeLengthData]

Running the function DistributionFitTest on the data returns the following table.

DistributionFitTest[
 N@lifeLengthData, Automatic, {"TestDataTable", All}]

enter image description here

For all tests appropriate for the data, none flagged the data as significantly different than variates of the normal distribution. Again, this is unexpected because of the typical strong positive skew of age distributions.

POSTED BY: Emmy Blumenthal

enter image description here - Congratulations! This post is now featured in our Staff Pick column as distinguished by a badge on your profile of a Featured Contributor! Thank you, keep it coming, and consider contributing your work to the The Notebook Archive!

POSTED BY: Moderation Team
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract