One of the regular tasks in statistical arbitrage is to compute correlations between a large universe of stocks, such as the S&P500 index members, for example.
Mathematica/WL has some very nice features for obtaining financial data and manipulating time series. And of course it offers all the commonly required statistical functions, including correlation. But the WL Correlation function is missing one vital feature - the ability to handle data series of unequal length. This arises, of course, because stock data series do not all share a common start date and (very occasionally) omit data for dates in the middle of the series. This creates an issue for the Correlation function, which can only handle series of equal length.
The usual way of handling this is to apply pairwise correlation, in which each pair of data vectors is truncated to include only the dates common to both series. Of course this can easily be done in WL; but it is very inefficient.
Let's take an example. We start with the last 10 symbols in the S&P 500 index membership:
In[1]:= tickers = Take[FinancialData["^GSPC", "Members"], -10]
Out[1]= {"NASDAQ:WYNN", "NASDAQ:XEL", "NYSE:XRX", "NASDAQ:XLNX", \
"NYSE:XYL", "NYSE:YUM", "NASDAQ:ZBRA", "NYSE:ZBH", "NASDAQ:ZION", \
"NYSE:ZTS"}
Next we obtain the returns series for these stocks, over the last several years. By default, FinancialData retrieves the data as TimeSeries Objects. This is very elegant, but slows the processing of the data, as we shall see.
tsStocks =
FinancialData[tickers, "Return",
DatePlus[Today, {-2753, "BusinessDay"}]];
Not all the series contain the same number of date-return pairs. So using Correlation is out of the question:
In[282]:= Table[Length@tsStocks[[i]]["Values"], {i, 10}]
Out[282]= {2762, 2762, 2762, 2762, 2388, 2762, 2762, 2762, 2762, 2060}
Since Correlation doesn't offer a pairwise option, we have to create the required functionality in WL. Let's start with:
PairsCorrelation[ts_] := Module[{td, correl},
If[ts[[1]]["PathLength"] == ts[[2]]["PathLength"],
correl = Correlation @@ ts,
td = TimeSeriesResample[ts, "Intersection"];
correl = Correlation @@ td[[All, All, 2]]]];
We first check to see if the two arguments are of equal length, in which case we can Apply the Correlation function directly. If not, we use the "Intersection" option of the TSResample function to reduce the series to a set of common observation dates. The function is designed to be deployed using parallelization, as follows:
PairsListCorrelation[tslist_] := Module[{pairs, i, td, c, correl = {}},
pairs = Subsets[Range[Length@tslist], {2}];
correl =
ParallelTable[
PairsCorrelation[tslist[[pairs[[i]]]]], {i, 1, Length@pairs}];
{correl, pairs}]
The Subsets function is used to generate a non-duplicative list of index pairs and then a correlation table is built in parallel using PairsCorrelation function on each pair of series.
When we apply the function to the ten stock time series, we get the following results:
In[263]:= AbsoluteTiming[{correl, pairs} =
PairsListCorrelation[tsStocks];]
Out[263]= {13.4791, Null}
In[270]:= Length@correl
Out[270]= 45
In[284]:= Through[{Mean, Median, Min, Max}[correl]]
Out[284]= {0.381958, 0.396429, 0.200828, 0.536383}
So far, so good. But look again at the timing of the PairsListCorrelation function. It takes 13.5 seconds to calculate the 45 correlation coefficients for 10 series. To carry out an equivalent exercise for the entire S&P 500 universe would entail computing 124,750 coefficients, taking approximately 10.5 hours! This is far too slow to be practically useful in the given context.
Some speed improvement is achievable by retrieving the stock returns data in legacy (i.e. list rather than time series) format, but it still takes around 10 seconds to calculate the coefficients for our 10 stocks. Perhaps further speed improvements are possible through other means (e.g. compilation), but what is really required is a core language function to handle series of unequal length (or a Pairwise method for the Correlation function).
For comparison, I can produce the correlation coefficients for all 500 S&P member stocks in under 3 seconds using the 'Rows', 'pairwise' options of the equivalent correlation function in another scientific computing language.
UPDATE
Another Mathematica user suggested a way to speed up the pairwise correlation algorithm using associations.
We begin by downloading returns data for the S&P500 membership in legacy (i.e. list) format:
tickers = Take[FinancialData["^GSPC", "Members"]];
stockdata =
FinancialData[tickers, "Return",
DatePlus[Today, {-753, "BusinessDay"}], Method -> "Legacy"];
Then define:
PairwiseCorrelation[stockdata_] :=
Module[{assocStocks, pairs, correl},
assocStocks = Apply[Rule, stockdata, {2}] // Map[Association];
pairs = Subsets[Range@Length@assocStocks, {2}];
correl =
Map[Correlation @@ Values@KeyIntersection[assocStocks[[#]]] &,
pairs];
{correl, pairs}]
Here we are using the KeyIntersection function to identify common dates between two series, which is much faster than other methods. Accordingly:
In[317]:= AbsoluteTiming[{correl, pairs} =
PairwiseCorrelation[stockdata];]
Out[317]= {112.836, Null}
In[318]:= Length@correl
Out[318]= 127260
In[319]:= Through[{Mean, Median, Min, Max}[correl]]
Out[319]= {0.428747, 0.43533, -0.167036, 0.996379}
This is many times faster than the original algorithm and, although much slower (40x to 50x) than equivalent algorithms in other languages, gets the job done in reasonable time.
So I still think we need a Method-> "Pairwise" option for the Correlation function.