Group Abstract

Message Boards

WOLFRAM COMMUNITY

18.1K Views

2 Replies

9 Total Likes

View groups...

Follow this post

Share this post:

GROUPS:

Staff Picks Data Science Recreation Social Science Curated Data Graphics and Visualization Wolfram Language Computational Linguistics Computational Humanities

Analysing wikipedia articles per language & its native speakers population

Jofre Espigule-Pons

Jofre Espigule-Pons, Wolfram Research

Posted 7 years ago

Today (21st of February) is UNESCO International Mother Language Day and I decided to celebrate it by exploring a bit LanguageData function. In particular, I will show how to create the top BubbleChart using two properties of LanguageData: "NativePopulation" and "WikipediaArticleCount". The goal behind using these properties is to explore the "fitness" of languages by measuring the ratio "number of wikipedia articles"/"number native speakers" which will be represented by the size and color of the bubbles. This way I can easily illustrate how well-protected languages have a bigger internet presence (wikipedia articles counts per native speaker). And languages from poor countries like Tigrigna from Ethiopia and Eritrea (Africa) are underrepresented in wikipedia. Interestingly languages from small European countries like Sweden, Netherlands, Scotland, Catalonia, Basque Country, are among the highest in terms of wikipedia activity. For this purpose, I preselected languages that have at least some native speakers alive and some wikipedia articles. Here there is a list of such languages (Disclaimer: some languages fulfilling such conditions might be missing): languages = {"Abkhaz", "Aceh", "Adyghe", "Afar", "Afrikaans", "Akan", "AlbanianTosk", "Amharic", "Arabic", "ArabicEgyptianSpoken", "Aragonese", "Armenian", "Assamese", "Asturian", "Atikamekw", "Avar", "AzerbaijaniSouth", "Bamanankan", "Banjar", "Bashkir", "Basque", "Bavarian", "Belarusan", "Beng", "Bengali", "BicolanoCentral", "Bishnupriya", "Bislama", "Bosnian", "Breton", "Bugis", "Bulgarian", "BuriatChina", "BuriatRussia", "Burmese", "BwamuCwi", "CatalanValencianBalear", "Cebuano", "Chamorro", "Chavacano", "Chechen", "Cherokee", "Cheyenne", "ChineseGan", "ChineseHakka", "ChineseMandarin", "ChineseMinDong", "ChineseMinNan", "ChineseWu", "ChineseYue", "Choctaw", "Chuvash", "Corsican", "CrimeanTurkish", "Croatian", "Czech", "Danish", "Dimli", "Dutch", "Dzongkha", "English", "Erzya", "Ewe", "Extremaduran", "Faroese", "FarsiEastern", "Fijian", "Finnish", "FrancoProvencal", "French", "FrisianEastern", "FrisianNorthern", "Friulian", "Gagauz", "Galician", "Ganda", "Georgian", "German", "GermanPennsylvania", "Gikuyu", "Gilaki", "Greek", "Gujarati", "HaitianCreoleFrench", "Hausa", "Hawaiian", "Hebrew", "Hindi", "HindustaniFijian", "Hungarian", "Icelandic", "Igbo", "Ilocano", "Indonesian", "InuktitutGreenlandic", "IrishGaelic", "Italian", "JamaicanCreoleEnglish", "Japanese", "Javanese", "Kabardian", "Kabiye", "KalmykOirat", "Kannada", "KarachayBalkar", "Karakalpak", "Kashmiri", "Kashubian", "Kazakh", "KhmerCentral", "Kirghiz", "Kolsch", "KomiPermyak", "KonkaniGoanese", "Koongo", "Korean", "KurdishCentral", "Kwanyama", "Ladino", "Lak", "Lao", "Lezgi", "Ligurian", "Limburgisch", "Lingala", "Lithuanian", "Livvi", "Lombard", "LuriNorthern", "Luxembourgeois", "Macedonian", "Maithili", "Malayalam", "Maldivian", "Maltese", "Maori", "Marathi", "MariEastern", "MariWestern", "Marshallese", "Mazanderani", "Minangkabau", "Mingrelian", "MirandaDoDouro", "Moksha", "Muskogee", "NahuatlCentral", "NapoletanoCalabrese", "Narom", "Nauruan", "Navajo", "Ndonga", "Newar", "Nyanja", "OjibwaSevern", "Osetin", "Pampangan", "Pangasinan", "PanjabiEastern", "PanjabiWestern", "Papiamentu", "PashtoCentral", "Piemontese", "PitcairnNorfolk", "Polish", "Pontic", "Portuguese", "Ravula", "Romanian", "RomanianMacedo", "RomaniVlax", "Romansch", "Rundi", "Russian", "Rusyn", "Rwanda", "SaamiNorth", "SaintLucianCreoleFrench", "Samoan", "Sango", "Sanskrit", "Saterfriesisch", "SaxonLow", "Schwyzerdutsch", "Scots", "ScottishGaelic", "Serbian", "Shona", "Sicilian", "Sindhi", "Sinhala", "Slovak", "Slovenian", "Somali", "SorbianLower", "SorbianUpper", "SothoNorthern", "SothoSouthern", "Spanish", "Sranan", "Sunda", "Swahili", "Swati", "Swedish", "Tagalog", "Tahitian", "Tajiki", "Tamil", "Tatar", "Telugu", "Tetun", "Thai", "TibetanCentral", "Tigrigna", "TokPisin", "Tongan", "Tsonga", "Tswana", "Tulu", "Tumbuka", "Turkish", "Turkmen", "Tuvin", "Udmurt", "Ukrainian", "Urdu", "Uyghur", "Venda", "Venetian", "Veps", "Vietnamese", "Vlaams", "Walloon", "WarayWaray", "Welsh", "Wolof", "Xhosa", "Yakut", "YiddishEastern", "YiSichuan", "Yoruba", "Zeeuws", "Zulu"}; Then, using LanguageData it's quite straightforward to get the native speakers population and the number of wikipedia articles. We can also easily compute the aforementioned ratio: bubbles = Map[Callout[{#[[2]], #[[3]], #[[3]]/#[[2]]}, #[[1]]] &, LanguageData[ languages, {"Name", "NativePopulation", "WikipediaArticleCount"}]] Finally we can plot the BubbleChart: BubbleChart[ bubbles, ScalingFunctions -> {"Log", "Log", Automatic}, ColorFunction -> Function[{x, y, z}, Hue[Log[1 + z]]], ColorFunctionScaling -> False, PlotLabel -> Style["Language Wikipedia Articles Per Native Speaker", Bold, 24], FrameLabel -> {Style["Number Of Native Speakers", 20], Style["Number Of Wikipedia Articles", 20]}, PlotTheme -> "Detailed", ImageSize -> 800] (See Top BubbleChart) It's really interesting to see that most of the biggest bubbles tend to be from languages spoken in developed countries but they don't have their own state yet; i.e. Basque, Scots, Catalan, Breton... My mother tongue is Catalan and I'm quite happy to see that it's still quite healthy (at least according to its wikipedia activity). PS: Two years ago @Vitaliy Kaurov wrote a really nice post about the same celebration day. You can read it here. Happy International Mother Language Day!

enter image description here

Today (21st of February) is UNESCO International Mother Language Day and I decided to celebrate it by exploring a bit LanguageData function.

In particular, I will show how to create the top BubbleChart using two properties of LanguageData: "NativePopulation" and "WikipediaArticleCount". The goal behind using these properties is to explore the "fitness" of languages by measuring the ratio "number of wikipedia articles"/"number native speakers" which will be represented by the size and color of the bubbles.

This way I can easily illustrate how well-protected languages have a bigger internet presence (wikipedia articles counts per native speaker). And languages from poor countries like Tigrigna from Ethiopia and Eritrea (Africa) are underrepresented in wikipedia. Interestingly languages from small European countries like Sweden, Netherlands, Scotland, Catalonia, Basque Country, are among the highest in terms of wikipedia activity.

For this purpose, I preselected languages that have at least some native speakers alive and some wikipedia articles. Here there is a list of such languages (Disclaimer: some languages fulfilling such conditions might be missing):

languages = {"Abkhaz", "Aceh", "Adyghe", "Afar", "Afrikaans", "Akan", 
  "AlbanianTosk", "Amharic", "Arabic", "ArabicEgyptianSpoken", 
  "Aragonese", "Armenian", "Assamese", "Asturian", "Atikamekw", 
  "Avar", "AzerbaijaniSouth", "Bamanankan", "Banjar", "Bashkir", 
  "Basque", "Bavarian", "Belarusan", "Beng", "Bengali", 
  "BicolanoCentral", "Bishnupriya", "Bislama", "Bosnian", "Breton", 
  "Bugis", "Bulgarian", "BuriatChina", "BuriatRussia", "Burmese", 
  "BwamuCwi", "CatalanValencianBalear", "Cebuano", "Chamorro", 
  "Chavacano", "Chechen", "Cherokee", "Cheyenne", "ChineseGan", 
  "ChineseHakka", "ChineseMandarin", "ChineseMinDong", 
  "ChineseMinNan", "ChineseWu", "ChineseYue", "Choctaw", "Chuvash", 
  "Corsican", "CrimeanTurkish", "Croatian", "Czech", "Danish", 
  "Dimli", "Dutch", "Dzongkha", "English", "Erzya", "Ewe", 
  "Extremaduran", "Faroese", "FarsiEastern", "Fijian", "Finnish", 
  "FrancoProvencal", "French", "FrisianEastern", "FrisianNorthern", 
  "Friulian", "Gagauz", "Galician", "Ganda", "Georgian", "German", 
  "GermanPennsylvania", "Gikuyu", "Gilaki", "Greek", "Gujarati", 
  "HaitianCreoleFrench", "Hausa", "Hawaiian", "Hebrew", "Hindi", 
  "HindustaniFijian", "Hungarian", "Icelandic", "Igbo", "Ilocano", 
  "Indonesian", "InuktitutGreenlandic", "IrishGaelic", "Italian", 
  "JamaicanCreoleEnglish", "Japanese", "Javanese", "Kabardian", 
  "Kabiye", "KalmykOirat", "Kannada", "KarachayBalkar", "Karakalpak", 
  "Kashmiri", "Kashubian", "Kazakh", "KhmerCentral", "Kirghiz", 
  "Kolsch", "KomiPermyak", "KonkaniGoanese", "Koongo", "Korean", 
  "KurdishCentral", "Kwanyama", "Ladino", "Lak", "Lao", "Lezgi", 
  "Ligurian", "Limburgisch", "Lingala", "Lithuanian", "Livvi", 
  "Lombard", "LuriNorthern", "Luxembourgeois", "Macedonian", 
  "Maithili", "Malayalam", "Maldivian", "Maltese", "Maori", "Marathi",
   "MariEastern", "MariWestern", "Marshallese", "Mazanderani", 
  "Minangkabau", "Mingrelian", "MirandaDoDouro", "Moksha", "Muskogee",
   "NahuatlCentral", "NapoletanoCalabrese", "Narom", "Nauruan", 
  "Navajo", "Ndonga", "Newar", "Nyanja", "OjibwaSevern", "Osetin", 
  "Pampangan", "Pangasinan", "PanjabiEastern", "PanjabiWestern", 
  "Papiamentu", "PashtoCentral", "Piemontese", "PitcairnNorfolk", 
  "Polish", "Pontic", "Portuguese", "Ravula", "Romanian", 
  "RomanianMacedo", "RomaniVlax", "Romansch", "Rundi", "Russian", 
  "Rusyn", "Rwanda", "SaamiNorth", "SaintLucianCreoleFrench", 
  "Samoan", "Sango", "Sanskrit", "Saterfriesisch", "SaxonLow", 
  "Schwyzerdutsch", "Scots", "ScottishGaelic", "Serbian", "Shona", 
  "Sicilian", "Sindhi", "Sinhala", "Slovak", "Slovenian", "Somali", 
  "SorbianLower", "SorbianUpper", "SothoNorthern", "SothoSouthern", 
  "Spanish", "Sranan", "Sunda", "Swahili", "Swati", "Swedish", 
  "Tagalog", "Tahitian", "Tajiki", "Tamil", "Tatar", "Telugu", 
  "Tetun", "Thai", "TibetanCentral", "Tigrigna", "TokPisin", "Tongan",
   "Tsonga", "Tswana", "Tulu", "Tumbuka", "Turkish", "Turkmen", 
  "Tuvin", "Udmurt", "Ukrainian", "Urdu", "Uyghur", "Venda", 
  "Venetian", "Veps", "Vietnamese", "Vlaams", "Walloon", "WarayWaray",
   "Welsh", "Wolof", "Xhosa", "Yakut", "YiddishEastern", "YiSichuan", 
  "Yoruba", "Zeeuws", "Zulu"};

Then, using LanguageData it's quite straightforward to get the native speakers population and the number of wikipedia articles. We can also easily compute the aforementioned ratio:

bubbles = 
 Map[Callout[{#[[2]], #[[3]], #[[3]]/#[[2]]}, #[[1]]] &, 
  LanguageData[
   languages, {"Name", "NativePopulation", "WikipediaArticleCount"}]]

Finally we can plot the BubbleChart:

BubbleChart[ bubbles, 
             ScalingFunctions -> {"Log", "Log", Automatic},  
             ColorFunction -> Function[{x, y, z}, Hue[Log[1 + z]]], 
             ColorFunctionScaling -> False, 
             PlotLabel -> Style["Language Wikipedia Articles Per Native Speaker", Bold, 24], 
             FrameLabel -> {Style["Number Of Native Speakers", 20], Style["Number Of Wikipedia Articles", 20]},                     
             PlotTheme -> "Detailed", 
             ImageSize -> 800]

(See Top BubbleChart)

It's really interesting to see that most of the biggest bubbles tend to be from languages spoken in developed countries but they don't have their own state yet; i.e. Basque, Scots, Catalan, Breton...

My mother tongue is Catalan and I'm quite happy to see that it's still quite healthy (at least according to its wikipedia activity).

PS: Two years ago @Vitaliy Kaurov wrote a really nice post about the same celebration day. You can read it here.

Happy International Mother Language Day!

POSTED BY: Jofre Espigule-Pons

2 Replies

Sort By:

Jofre Espigule-Pons

Jofre Espigule-Pons, Wolfram Research

Posted 7 years ago

A nice key finding pointed out by a redditor here is the fact that Swedish, Cebuano and Waray Wikipedias have thousands of automated Wikipedia articles generated by Lsjbot, an article-creating program, developed by Sverker Johansson for the Swedish Wikipedia. The bot primarily focused and focuses on articles about living organisms and geographical entities (such as rivers, dams and mountains). And between 80 % and 99 % of the total articles on those languages were automatically generated. So it would be interesting to analyse the number of automatic articles per language and compare them with the human written ones. Here there is a list of bots creating wikipedia articles in different languages that might serve as a starting point. If you are interested in creating similar bubble charts check my recent post on the "Growth of the Internet Users": https://community.wolfram.com/groups/-/m/t/1631212

POSTED BY: Jofre Espigule-Pons

EDITORIAL BOARD

EDITORIAL BOARD, WOLFRAM

Posted 7 years ago

- Congratulations! This post is now a Staff Pick as distinguished by a badge on your profile! Thank you, keep it coming, and consider contributing your work to the The Notebook Archive!

POSTED BY: EDITORIAL BOARD

Reply to this discussion

Reply Preview

Attachments

Remove Add a file to this post

Follow this discussion

or Discard

Feedback