Quantitative Structure-Activity Relationship (QSAR) modeling is used to optimize the medicinal value of drugs. It seeks to explain variation in the biological activity of related molecules through variation in their structure and physicochemical properties. Synthesizing and testing new compounds is time-consuming and expensive, so the faster we can reach a mathematical model with predictive power, the better. Recently there have been rich discussions and analyses of which types of chemical descriptors best predict biochemical activity. As a chemistry PhD student, I have become interested in this topic and have used and built tools in the Wolfram Language to facilitate my analysis of multivariate drug data.
Data Normalization
It is often practical to store data as tables in Excel and import them into Mathematica, which represents each table as a matrix (a list of lists, where each inner list is a row vector). The following functions aid processing:
To delete rows that consist entirely of zeros or entirely of nulls:
deleteZeroRows[m_]:=DeleteCases[m,{0..}|{Null..},1]
To delete columns that consist entirely of zeros or entirely of nulls:
deleteZeroColumns[m_]:=Transpose[DeleteCases[Transpose[m],{0..}|{Null..},1]]
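As a quick check, the two cleaning helpers can be run on a small matrix. The matrix raw below is made up for illustration and is not from my data; the definitions are repeated (with the column helper written as a single Transpose call) so that the snippet runs on its own.

```wolfram
(* Definitions repeated so the snippet is self-contained *)
deleteZeroRows[m_] := DeleteCases[m, {0 ..} | {Null ..}, 1]
deleteZeroColumns[m_] :=
  Transpose[DeleteCases[Transpose[m], {0 ..} | {Null ..}, 1]]

(* Made-up example matrix: one all-zero row, one all-Null row, one all-zero column *)
raw = {{1, 0, 2}, {0, 0, 0}, {3, 0, 4}, {Null, Null, Null}};

deleteZeroColumns[deleteZeroRows[raw]]
(* {{1, 2}, {3, 4}} *)
```

One caveat: the pattern 0 matches only the exact integer zero, so a row of machine-precision zeros (0.) or of empty strings would survive. If an import produces such values, a broader pattern such as {(0 | 0. | Null | "") ..} is one option; that is an assumption about the imported data, not part of the helpers above.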
To normalize each element of a row vector to the element at a specific position:
ratioNormalize[list_,pos_:43]:=list/list[[pos]]
To normalize each element of a row vector by taking the natural log of that element plus a constant (added to make everything positive):
logNormalize[n_,c_]:=Log[n+c]
That last function proved useful for a data set with too large a spread of values to use the sigmoid function; I was able to make sure the results from regression modeling were similar before and after normalization.
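A small sketch of both normalizers on made-up rows follows. The input values and the rule used to pick the constant (shifting so that the smallest value becomes 1) are assumptions for illustration, not the values I used.

```wolfram
(* Definitions repeated so the snippet is self-contained *)
ratioNormalize[list_, pos_ : 43] := list/list[[pos]]
logNormalize[n_, c_] := Log[n + c]

row = {2, 4, 8};              (* made-up row *)
ratioNormalize[row, 2]        (* {1/2, 1, 2} *)

spread = {-3., 0., 5.};       (* made-up values, including a negative one *)
shift = 1 - Min[spread];      (* assumed choice: the smallest shifted value becomes 1 *)
logNormalize[spread, shift]   (* {0., 1.38629, 2.19722} *)
```

Because Log is Listable, logNormalize works on a whole row or a whole matrix at once without any explicit mapping.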
Testing all variable sets of a given size to find out which set best describes the data
Inspired by Aouidate et al.'s analysis of traditional physicochemical vs Density Functional Theory-based descriptors for the QSAR modeling of phenylalkylamines, I wrote functions that build linear and nonlinear regression models from every combination of five variables and compare them by R^2 value.
The linear version, where reformatted12 is the 12-variable data set (its last column is the response vector):
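(* All 5-element subsets of the predictor columns, i.e. every column except the last *)
whichParameters = Subsets[Range[Length@First@reformatted12 - 1], {5}];
(* Append -1 so that each column selection ends with the response column *)
whichParameters2 = Append[#, -1] & /@ whichParameters;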
allRSquared = MapIndexed[
  Function[{parameters, index},
    (* Rule: position of the subset -> R^2 of the fit on those columns *)
    First@index ->
      LinearModelFit[Part[reformatted12, All, parameters],
        Table[x[i], {i, 1, 5}], Table[x[i], {i, 1, 5}]]["RSquared"]
  ],
  whichParameters2
];
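To make the idea concrete, here is a minimal sketch of the same exhaustive search on synthetic data. Everything in it (the data set toy, the names subsets, vars and toyRSquared, and the choice of four descriptors taken two at a time) is invented for illustration; it is not my analysis, only a scaled-down instance of the technique.

```wolfram
SeedRandom[1];  (* arbitrary seed, for reproducibility *)

(* Synthetic data: 4 descriptors; the response depends only on columns 1 and 3 *)
toy = Table[
   With[{v = RandomReal[{0, 1}, 4]},
    Append[v, 2 v[[1]] - 3 v[[3]] + RandomReal[{-0.05, 0.05}]]],
   {30}];

subsets = Subsets[Range[4], {2}];  (* all 6 pairs of descriptor columns *)
vars = {x1, x2};                   (* assumed to be unassigned symbols *)

(* Association: column pair -> R^2 of the linear fit on those two columns *)
toyRSquared = AssociationMap[
   Function[cols,
    LinearModelFit[toy[[All, Append[cols, -1]]], vars, vars]["RSquared"]],
   subsets];

toyRSquared[{1, 3}]
```

Since the synthetic response was built from columns 1 and 3, the value for {1, 3} should be close to 1 and the other pairs should score lower. Keying the results by the subset itself, rather than by its position, is a design choice of this sketch that avoids a later lookup.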
It is then possible to call for the best five models by R^2 value. Here's the non-linear version:
allRSquaredNonlinear = MapIndexed[
  Function[{parameters, index},
    (* Rule: position of the subset -> R^2 of the quadratic fit on those columns *)
    First@index ->
      NonlinearModelFit[Part[reformatted12, All, parameters],
        a + b x1 + c x2 + d x3 + e x4 + f x5 +
          l x1^2 + m x2^2 + n x3^2 + o x4^2 + p x5^2,
        {a, b, c, d, e, f, l, m, n, o, p},
        {x1, x2, x3, x4, x5}]["RSquared"]
  ],
  whichParameters2
];
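Retrieving the best models from either result list might look like the sketch below. The lists toySubsets and toyRules are hypothetical stand-ins for whichParameters2 and allRSquared (or allRSquaredNonlinear), with made-up R^2 values; with the real lists, the 2 would become 5.

```wolfram
(* Hypothetical stand-ins for whichParameters2 and allRSquared *)
toySubsets = {{1, 2, -1}, {1, 3, -1}, {2, 3, -1}, {1, 4, -1}};
toyRules = {1 -> 0.62, 2 -> 0.91, 3 -> 0.48, 4 -> 0.87};

(* Keep the rules with the largest R^2 values *)
best = TakeLargestBy[toyRules, Last, 2]
(* {2 -> 0.91, 4 -> 0.87} *)

(* Map the winning indices back to their column selections *)
toySubsets[[Keys[best]]]
(* {{1, 3, -1}, {1, 4, -1}} *)
```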
I was able to confirm that the best models indeed made heavy use of DFT descriptors, and that trend was unchanged when I analyzed rigorously balanced data sets with more traditional physicochemical/pharmacological descriptors like Total Polar Surface Area and "druglikeness" score.
Finally, when I split the data randomly into 35 training compounds and 10 test compounds (omitting mescaline, as it had been used as a reference) and fed the training data into the machine learning function Predict[], the resulting model had a slightly lower Mean Squared Error than Aouidate et al.'s Multiple Non-Linear Regression model. This hints at the possible utility of machine learning in QSAR analysis down the road.
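The split-and-predict workflow can be sketched roughly as follows. The data set synthetic is randomly generated as a stand-in for the real compounds, and the names training, test, toExample, predictor and mse are illustrative; Predict is left on its default settings, which may differ from what suits a real QSAR data set.

```wolfram
SeedRandom[2024];  (* arbitrary seed so the split is reproducible *)

(* Synthetic stand-in: 45 compounds, 5 descriptors, response in the last column *)
synthetic = Table[
   With[{v = RandomReal[{0, 1}, 5]},
    Append[v, Total[v] + RandomReal[{-0.1, 0.1}]]],
   {45}];

(* Random 35/10 split into training and test rows *)
{training, test} = TakeDrop[RandomSample[synthetic], 35];

(* Predict expects examples of the form descriptors -> response *)
toExample = (Most[#] -> Last[#]) &;
predictor = Predict[toExample /@ training];

(* Mean squared error on the held-out compounds *)
predicted = predictor /@ (Most /@ test);
mse = Mean[(predicted - (Last /@ test))^2]
```

With only 10 test compounds, the error estimate depends noticeably on the particular random split, so repeating the split with several seeds gives a fairer comparison.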
In the near future I plan to generalize the aforementioned regression functions, and post about ways to analyze 3-dimensional structural data.
Aouidate et al. "Combining DFT and QSAR studies for predicting psychotomimetic activity of substituted phenethylamines using statistical methods." Journal of Taibah University for Science 10 (2016): 787-796.