Hello everyone,
A colleague here at Wolfram who happened to see another post of mine asked me yesterday why I don't post something for Indian Republic Day, which falls on the 26th of January (and has already started in India). Today at lunch, when the comment came back to me, I thought I might use a Kaggle dataset relevant to India and do something based on machine learning or neural networks. However, after a talk I heard today about our Entity framework, I took it as a challenge to explore that thoroughly instead. Although I have worked with entities before to build time series for analysis (mostly from WeatherData, FinancialData and GDP/population data), I had never searched extensively for what is available. So today I decided to explore just the options we have for Entity["Country", "India"]. I was very surprised by the amount of data we have for India, which is just one of the many countries in CountryData. There is more in AdministrativeDivisionData and CityData, which I have not even touched in this post. (A disclaimer: this is not a carefully planned project, so the data and analysis presented may not follow a strict workflow.)
Visualization
Let me start with some visualizations. As kids, on Independence Day and Republic Day we used to get pins and paper cutouts that had the flag in the shape of the country. That was the first thing that came to my mind when I decided to write this post. It turns out this is pretty simple to do in Wolfram Language. In fact, it takes just a single expression:
With[{india = Entity["Country", "India"]},
GeoGraphics[{GeoStyling[{"Image",
ImageCrop[
EntityValue[india, "Flag"], {98, 85},
Left]}], EdgeForm[Black], Polygon[india]},
GeoBackground -> None]]
Since the map of India is not symmetric (it is more stretched out in the east), I decided to crop the Indian flag on the left-hand side. The GeoGraphics visualization below gave me a childish happiness and all the enthusiasm I needed to continue with the more serious analysis.
![enter image description here][1]
So, the first step was to see what is available in the Entity framework for India. A simple call
EntityValue[
Entity["Country", "India"], "NonMissingPropertyAssociation"]
gave me a list of over 450 properties. I was stunned to see this, and there was no way I could analyze everything in a few hours. So I decided to prioritize what I could do in this limited time while still telling a story. Let us start with the states (AdministrativeDivisions).
Once that idea came to mind, all I could think of was the atlas maps from middle and high school geography that assigned different colors to different states. So let's do that first:
admin = Entity["Country", "India"][
EntityProperty["Country", "AdministrativeDivisions"]]
(color[#] = RandomColor[]) & /@ admin
GeoGraphics[
{GeoStyling["OutlineMap",
Directive[Opacity[0.4], EdgeForm[Black], color[#]]],
Tooltip[EntityValue[#, "Polygon"], CommonName[#]]} & /@ admin,
GeoRangePadding -> Full]
Again, this was created by directly following an example in the GeoGraphics documentation. I was tempted to put a SeedRandom call before the code, but hesitated, thinking how much joy it would have given me if my atlas from middle school geography lessons had changed colors every time I flipped the page.
![enter image description here][2]
Population By State
The grown-up in me suddenly realised it was time for some serious analysis! One of the major issues in India is population. So let us start by analysing the population by state:
dat = EntityValue[Reverse@admin, {"Name", "Population", "Polygon"}];
rng = Through[{Min, Max}@QuantityMagnitude[dat[[All, 2]]]];
Labeled[GeoGraphics[{GeoStyling[None], EdgeForm@GrayLevel[0, 0.5],
Tooltip[{ColorData["AvocadoColors"]@
Rescale[QuantityMagnitude[#2], rng], #3},
Column[{Style[#1, Bold], #2}]] & @@@ dat}],
BarLegend[{"AvocadoColors", rng}, 8], Right]
I saw images like this quite a lot while growing up. What I realized now was how easy it is to create such graphics in Wolfram Language. The image clearly shows the population of each state, with Uttar Pradesh having the largest population.
![enter image description here][3]
However, most people really care about population density, which is also available for the AdministrativeDivisions in India. I thought plotting the population density would be as simple as plotting the population, so I just replaced "Population" with "PopulationDensity". However, the result was counter-intuitive: the scale of the legend went up to much higher values, but I couldn't see the corresponding colors on the map. So I decided to take a deeper look at the data rather than just visualize it.
popdensity = EntityValue[admin, "PopulationDensity"]
Select[QuantityMagnitude[popdensity], # > 5000 &]
Position[QuantityMagnitude[popdensity], n_ /; n > 5000]
This quickly revealed that 5 AdministrativeDivisions had a population density greater than the threshold I chose, and they were throwing the BarLegend off. (I deliberately used both Select and Position here to highlight that in Wolfram Language there are multiple ways to achieve the same or a similar goal.) Looking at the positions, I realized they were union territories, each mostly a single city (where density is high). It is better to leave them out of the analysis here, because our aim was to look at the states only. With the union territories left out, I get the following geographics:
![enter image description here][4]
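The filtering behind this map can be sketched as follows. This is a minimal sketch, not necessarily the exact code used for the image: the helper keepBelow and the names densdat and densrng are hypothetical, admin is the list of administrative divisions defined earlier, and the 5000 cutoff is the threshold chosen above.

```wolfram
(* Hypothetical helper: keep rows whose 2nd element (the density) is at most cutoff. *)
(* Rows with a missing density are dropped too, because the comparison does not evaluate to True. *)
keepBelow[rows_List, cutoff_] :=
  Select[rows, QuantityMagnitude[#[[2]]] <= cutoff &]

(* admin comes from the earlier code; 5000 is the threshold used in the text *)
densdat = keepBelow[
   EntityValue[admin, {"Name", "PopulationDensity", "Polygon"}], 5000];
densrng = MinMax[QuantityMagnitude[densdat[[All, 2]]]];

(* Same map code as for the population, with the filtered data and its own range *)
Labeled[GeoGraphics[{GeoStyling[None], EdgeForm@GrayLevel[0, 0.5],
   Tooltip[{ColorData["AvocadoColors"]@
        Rescale[QuantityMagnitude[#2], densrng], #3},
      Column[{Style[#1, Bold], #2}]] & @@@ densdat}],
 BarLegend[{"AvocadoColors", densrng}, 8], Right]
```

Because the color range is recomputed after filtering, the legend only spans the densities of the divisions that remain.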
Population by Age and Gender
The next analysis with population data that came to my mind was a breakdown of the population by age and gender.
A very simple way to crunch the numbers is to create a grid like the one Wolfram Alpha shows. So let's just use the capabilities of the WolframAlpha function here:
WolframAlpha["India age distribution",
{{"AgeDistributionGrid:AgeDistributionData", 1}, "Content"}]
![enter image description here][5]
Note the 2010 estimates mentioned in the grid. Let us see if we can get something more recent than that. I think this can be obtained from the following Entity call:
EntityValue[
Entity["Country", "India"], #] & /@ {"FemaleChildPopulation",
"FemaleAdultPopulation", "FemaleElderlyPopulation",
"MaleChildPopulation", "MaleAdultPopulation",
"MaleElderlyPopulation"}
So I decided to go ahead and do a visualization. I had seen a brilliant post by @Vitaliy Kaurov where he used similar styles. So instead of reinventing the wheel, I decided to take his styling options and modify them for the data here:
PieChart[EntityValue[Entity["Country", "India"], #],
PlotTheme -> "Marketing", ChartLabels -> Placed[#, "RadialCallout"],
PlotRange -> {{-1.8, 1.8}, {-1.0, 1.0}}, BaseStyle -> {12, Bold},
ImageSize -> 800, ChartStyle -> "SolarColors",
PlotLabel ->
Style["Demographics", 34, Darker[Red],
FontFamily -> "Phosphate"]] &@
{"FemaleChildPopulation", "FemaleAdultPopulation",
"FemaleElderlyPopulation", "MaleChildPopulation",
"MaleAdultPopulation", "MaleElderlyPopulation"}
![enter image description here][6]
When will India be the most populous country?
You often see this analysis on the web, rephrased in different ways, and recently I have seen an increase in the number of such posts. This is because the population of India has increased steadily with quite a steep slope, while that of China has not increased as much. To do such an analysis, we will use the statistical tools in Wolfram Language (mostly regression analysis). First, let us get the data for India and visualize it:
popdat = Normal@
EntityValue[Entity["Country", "India"],
EntityProperty["Country",
"Population", {"Date" ->
Interval[{DateObject[{1967}, "Year", "Gregorian", -5.],
DateObject[{2017}, "Year", "Gregorian", -5.]}]}]]
Pop = Table[{1966 + i, QuantityMagnitude@popdat[[i, 2]]}, {i, 1,
Length[popdat]}]
ListPlot[Pop, Joined -> False, PlotStyle -> Red]
The plot shows a fairly linear trend, so LinearModelFit is a good option in this case.
popfit = LinearModelFit[Pop, t, t]
Show[ListPlot[Pop, Joined -> False, PlotStyle -> Red],
Plot[popfit[t], {t, 1967, 2017}]]
Doing some analysis with the ANOVA table:
{popfit["ANOVATable"], popfit["AdjustedRSquared"], popfit["AICc"]}
It revealed that 99.93% of the variance was explained, the p-value was very small, and the AICc was 1752.84.
Now moving on to China's population data. I used a similar call, just changing the entity to China. This data did not show a linear trend, but a more quadratic one. So instead of a single linear term, the basis for the LinearModelFit included a quadratic term.
LinearModelFit[Popchina, t^Range[0, 2], t]
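The call that builds Popchina is not shown in the post. A sketch of what it might look like, mirroring the India code: populationSeries and toYearValuePairs are hypothetical helper names, and the 1966 + i year labeling assumes, as the India code does, one value per year starting in 1967.

```wolfram
(* Hypothetical helper: the same Population query as for India, for any country entity *)
populationSeries[country_] := Normal@EntityValue[country,
   EntityProperty["Country", "Population", {"Date" ->
      Interval[{DateObject[{1967}, "Year", "Gregorian", -5.],
        DateObject[{2017}, "Year", "Gregorian", -5.]}]}]]

(* Hypothetical helper: same reshaping as Pop, giving {year, magnitude} pairs. *)
(* Assumes one value per year, with the first entry belonging to 1967. *)
toYearValuePairs[series_List] :=
  MapIndexed[{1966 + First[#2], QuantityMagnitude[Last[#1]]} &, series]

Popchina = toYearValuePairs[populationSeries[Entity["Country", "China"]]];
ListPlot[Popchina, Joined -> False, PlotStyle -> Blue]
```

Wrapping the query in a function keeps the India and China data on exactly the same date range, which matters when the two fits are compared.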
Plotting the fits and the data simultaneously, we see the curves crossing very close to 2021. ![enter image description here][7]
Many blogs and posts say that this crossover could take place anywhere between 2020 and 2024. A simple analysis with LinearModelFit in Wolfram Language using Entity data predicts the same. Using NSolve to find where the two models meet further confirmed this, giving t->2020.54, or roughly 2021.
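The NSolve step is not shown in the post; one way it might be written is sketched below. The helper crossingYears, the name chinafit and the 2017 to 2050 search window are assumptions for illustration, while popfit and Popchina are the fit and data from earlier in the post.

```wolfram
(* Hypothetical helper: real solutions of f[t] == g[t] with lo < t < hi *)
crossingYears[f_, g_, {lo_, hi_}] := Module[{t},
  Flatten[Values[NSolve[f[t] == g[t] && lo < t < hi, t, Reals]]]]

(* chinafit is an assumed name for the quadratic fit. *)
(* LinearModelFit adds a constant term by default, so {t, t^2} gives a full quadratic. *)
chinafit = LinearModelFit[Popchina, {t, t^2}, t];

(* The window is an assumption: only look for a crossing beyond the fitted data *)
crossingYears[popfit, chinafit, {2017, 2050}]
```

Restricting the search to a window matters because a line and a parabola can meet twice, and only the crossing after the last data point is of interest here.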
Languages
Finally, let me end this discussion by noting that India is a land of varied cultures and languages. I took the languages spoken, with the fraction of the population speaking each (stored in assoc), and kept just the first 10 of those:
association2 = Take[assoc, 10]
BarChart[Values[association2]*
EntityValue[Entity["Country", "India"], "Population"],
ChartElementFunction -> "GlassRectangle", BaseStyle -> {18, Bold},
ChartStyle -> "Pastel",
PlotLabel ->
Style["Approximate Native Speakers Per Language", 30, Darker[Red],
FontFamily -> "Phosphate"],
ChartLabels ->
Placed[Keys[association2], Below, Rotate[#, Pi/2.4] &]]
Multiplying each fraction by the population of India gives the approximate number of native speakers. If you go through the list, you will find at least two of the world's 10 most spoken languages (4th and 7th in the list in Babble), and the majority of native speakers are already accounted for in the following visualization:
![enter image description here][8]
Conclusion
I have not yet analysed even 20% of the data available for the India entity (let alone CityData, WeatherData, FinancialData and all the other entity/data classes I am not aware of). So there is enough material left to cover for the 15th of August (Independence Day in India).