
[WSSA16] Prediction of the Global Distribution of Tuberculosis


Science and technology are advancing rapidly, and medicine is advancing with them. Nevertheless, many people still suffer from infectious diseases, and tuberculosis is one of the most widespread of them worldwide. Its prevalence varies across countries with ethnicity and socioeconomic status. Statistics show that several factors contribute to the distribution of tuberculosis, for instance drug and tobacco abuse, immigration, and HIV/AIDS cases. The aim of this project is to use machine learning algorithms to design a basic prediction model as a step toward understanding the underlying mechanisms of the epidemiology of tuberculosis.

The main phases of my project were:

  • Importing the dataset on the distribution of tuberculosis, which has four indicators: number of deaths due to tuberculosis, excluding HIV; number of prevalent tuberculosis cases; deaths due to tuberculosis among HIV-negative people (per 100 000 population); and the prevalence of tuberculosis (per 100 000 population).
  • Importing the dataset on the ratio of health expenditure to GDP for all countries.
  • Importing relevant information about the properties of countries worldwide from the Wolfram Alpha database, deleting the missing data, and converting the information into Wolfram Language input.
  • Combining the information about the properties of the countries with the tuberculosis data to create a generalized database.
  • Selecting the properties that may be significant for predicting the distribution of tuberculosis.
  • Using machine learning functions to train a model to predict the prevalence of tuberculosis.
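
The combining step can be sketched on toy data as follows. This is only an illustration, not the code used in the project: the country labels, the keys "Prevalence", "HealthShare", and "LifeExpectancy", and all numbers are invented, and JoinAcross is just one possible way to merge the two tables.

```wolfram
(* Hypothetical toy tables; names and values are invented for illustration *)
tb = {
   <|"Country" -> "A", "Prevalence" -> 150.|>,
   <|"Country" -> "B", "Prevalence" -> 40.|>,
   <|"Country" -> "C", "Prevalence" -> 310.|>};
props = {
   <|"Country" -> "A", "HealthShare" -> 5.1, "LifeExpectancy" -> 68.|>,
   <|"Country" -> "B", "HealthShare" -> 9.4, "LifeExpectancy" -> 80.|>};

(* Inner join on the shared key: country C has no properties and drops out *)
joined = JoinAcross[tb, props, "Country"];

(* Numeric rows with the target (prevalence) in the last position *)
finaldata = 
  Lookup[#, {"HealthShare", "LifeExpectancy", "Prevalence"}] & /@ joined;
```

Keeping the target in the last column matches the Most/Last split used when the training examples are built later in the post.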

Acquiring the data

After importing the datasets with SemanticImport, I manipulated them to convert them into Wolfram Language input. For that I used various functions for lists, strings, and associations, for example Thread, Select, DeleteDuplicates, ReplaceAll, KeyTake, and KeyValueMap.

url = "http://apps.who.int/gho/athena/data/xmart.csv?target=GHO/MDG_\
0000000017,TB_e_mort_exc_tbhiv_num,MDG_0000000023,TB_e_prev_num&\
profile=crosstable&filter=COUNTRY:*;YEAR:2014;YEAR:2013;YEAR:2012;\
YEAR:2011;YEAR:2010;YEAR:2009;YEAR:2008;YEAR:2007;REGION:*&x-sideaxis=\
COUNTRY;YEAR&x-topaxis=GHO";
dataraw = SemanticImport[url];
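(* Replace country entries that came back as Missing with explicit entities *)
dataraw = Module[
   {countryNames, wrongNames, fixNames},
   countryNames = DeleteDuplicates[Normal[dataraw][[All, "Country"]]];
   (* entries that were not imported as Country entities *)
   wrongNames = Select[countryNames, MissingQ];
   (* pair each failed entry with its entity; this relies on exactly three
      failures occurring in this order *)
   fixNames = 
    Thread[wrongNames -> {Entity["Country", "Bolivia"], 
       Entity["Country", "Micronesia"], 
       Entity["Country", "Venezuela"]}];
   (* apply the replacements to every row *)
   dataraw[All, ReplaceAll[#, fixNames] &]
   ];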

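
KeyTake and KeyValueMap are mentioned above but do not appear in the snippet. A minimal sketch of how they might be used on records shaped like rows of an imported table, assuming invented keys ("Country", "Year", "Prevalence") and invented values:

```wolfram
(* Hypothetical rows, shaped like records of an imported table *)
rows = {
   <|"Country" -> "A", "Year" -> 2013, "Prevalence" -> 160.|>,
   <|"Country" -> "A", "Year" -> 2014, "Prevalence" -> 150.|>,
   <|"Country" -> "B", "Year" -> 2013, "Prevalence" -> 44.|>,
   <|"Country" -> "B", "Year" -> 2014, "Prevalence" -> 40.|>};

(* Select keeps the 2014 records; KeyTake keeps only the needed columns *)
latest = KeyTake[#, {"Country", "Prevalence"}] & /@ 
   Select[rows, #Year == 2014 &];

(* Average over years per country, then turn the association into pairs *)
meanByCountry = GroupBy[rows, (#Country &) -> (#Prevalence &), Mean];
pairs = KeyValueMap[List, meanByCountry];
```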

The original datasets were not practical to use as imported, so I had to make many modifications before and after importing them to standardize them for convenient future use. I then collected information on various country properties that I considered relevant to the prevalence of tuberculosis, importing it from the Wolfram Alpha database with functions such as EntityProperty, EntityValue, and EntityList. Data were missing for many countries, and I had to fill the gaps in through interpolation.
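
How the gaps were filled is not shown in the post. As one possible reading, here is a minimal sketch of linear interpolation over years for a single country. The series, its values, and the names series, known, f, and filled are all hypothetical.

```wolfram
(* Hypothetical yearly series with two gaps *)
series = <|2007 -> 120., 2008 -> Missing["NotAvailable"], 2009 -> 100.,
   2010 -> 95., 2011 -> Missing["NotAvailable"], 2012 -> 85.|>;

(* Build a piecewise-linear interpolant from the known points only *)
known = Select[series, NumericQ];
f = Interpolation[KeyValueMap[List, known], InterpolationOrder -> 1];

(* Keep known values and fill each gap from the interpolant *)
filled = Association@
   KeyValueMap[#1 -> If[MissingQ[#2], f[#1], #2] &, series];
```

In this toy series the gaps lie between known years. A gap at the start or end of a series would need extrapolation instead, which is less reliable.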

Analysis

I used visualization functions such as ListPlot and ListPointPlot3D to get a hint of the correlation between the prevalence of tuberculosis and other properties of a country, for example its GDP and annual health spending.

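
A visual impression of this kind can be complemented with a correlation coefficient. The sketch below uses invented numbers, not the project's data, and the variable names are hypothetical. It computes a Pearson correlation on the log scale and a rank correlation, then draws a log-log scatter plot.

```wolfram
(* Hypothetical values: GDP per capita (USD) and TB prevalence per 100 000 *)
gdp = {500., 1500., 5000., 20000., 45000.};
prevalence = {400., 250., 120., 30., 8.};

(* Pearson correlation on the log scale, and a rank correlation *)
pearsonLog = Correlation[Log[gdp], Log[prevalence]];
spearman = SpearmanRho[gdp, prevalence];

(* Log-log scatter plot of the same pairs *)
ListLogLogPlot[Transpose[{gdp, prevalence}],
 AxesLabel -> {"GDP per capita", "Prevalence per 100 000"}]
```

Because the toy prevalence falls strictly as GDP rises, the rank correlation here is -1. Real country data would be much noisier.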

The last part of my work was dedicated to designing a basic prediction model. I ran the model with different sets of properties, using machine learning functions, mainly Predict and PredictorMeasurements, to train it to predict the prevalence of tuberculosis. The final data consisted of the numerical values of the selected properties. I used a random 80% of the final data as a training set and the remaining 20% as a validation set. For every set of properties I evaluated the length of the final data and made a comparison plot to assess the quality of the prediction model. The best result was achieved with the employment to population ratio and life expectancy.

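(* each example is features -> target, with the target in the last column *)
(* 80/20 split: shuffle the examples, then take the first n for training *)
n = Round[.8 Length[finaldata]];
{train, test} = 
  TakeDrop[RandomSample[Map[Rule[Most[#], Last[#]] &]@finaldata], n];
pred = Predict[train, Method -> "RandomForest"]
(* evaluate on the held-out examples *)
pm = PredictorMeasurements[pred, test];
pm["ComparisonPlot"]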

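
Besides the comparison plot, feature sets can also be compared numerically. The following is a sketch under stated assumptions rather than the procedure used in the project: finaldata is replaced by synthetic data, evaluate is a hypothetical helper, and the "RSquared" and "StandardDeviation" measurements are used as the score.

```wolfram
SeedRandom[16];

(* Synthetic stand-in for finaldata: two features and a target in the last
   column; by construction the target depends only on the second feature *)
finaldata = Table[
   With[{x = RandomReal[{40, 80}], y = RandomReal[{50, 85}]},
    {x, y, 900 - 10 y + RandomReal[{-20, 20}]}], 200];

(* Train on 80% using only the feature columns in cols, score on the rest *)
evaluate[cols_] := Module[{data, n, train, test, pred, pm},
   data = RandomSample[(#[[cols]] -> Last[#]) & /@ finaldata];
   n = Round[0.8 Length[data]];
   {train, test} = TakeDrop[data, n];
   pred = Predict[train, Method -> "RandomForest"];
   pm = PredictorMeasurements[pred, test];
   <|"RSquared" -> pm["RSquared"], 
    "StandardDeviation" -> pm["StandardDeviation"]|>];

(* One set of measurements per candidate feature set *)
results = AssociationMap[evaluate, {{1}, {2}, {1, 2}}]
```

Wrapping the result in Dataset[results] shows the scores as a table. Each call draws a new random split, so with a dataset as small as one row per country the scores can vary noticeably between runs.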

This prediction model is of course approximate, because the distribution of an infectious disease cannot depend on only a few factors. However, I tried to find the most reasonable set of available parameters for a workable prediction model. I would be glad to hear your comments on how to improve this model.

POSTED BY: Seda Mirzoyan