# Pairwise Correlation of Financial Data

Posted 4 months ago | 1693 Views | 11 Replies | 25 Total Likes

A regular task in statistical arbitrage is computing correlations across a large universe of stocks, such as the members of the S&P 500 index. Mathematica/WL has some very nice features for obtaining financial data and manipulating time series, and of course it offers all the commonly required statistical functions, including correlation. But the WL Correlation function is missing one vital feature: the ability to handle data series of unequal length. Such series arise because stocks do not all share a common start date and (very occasionally) lack data for dates in the middle of the series, whereas Correlation can only handle series of equal length.
The usual way of handling this is pairwise correlation, in which each pair of data vectors is truncated to include only the dates common to both series. This is easy to do in WL, but the straightforward implementation is very inefficient.
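To make the idea concrete, here is a minimal sketch on made-up data. The series `a` and `b`, their integer "dates", and the variable names are all invented for illustration and are not part of the analysis in this post. It truncates two series to their common dates by brute force and then calls Correlation on the aligned values:

```wolfram
(* Toy data: {date, return} pairs; series b has no entries for dates 1 and 4 *)
a = {{1, 0.01}, {2, -0.02}, {3, 0.03}, {4, 0.00}, {5, 0.02}};
b = {{2, -0.01}, {3, 0.02}, {5, 0.01}};

(* Dates present in both series *)
common = Intersection[a[[All, 1]], b[[All, 1]]];

(* Keep only the rows whose date is in the common set *)
aCommon = Select[a, MemberQ[common, First[#]] &];
bCommon = Select[b, MemberQ[common, First[#]] &];

(* Both value vectors now have the same length and the same dates *)
Correlation[aCommon[[All, 2]], bCommon[[All, 2]]]
```

The repeated membership tests in this naive version are one reason a straightforward approach scales poorly once it is repeated over many thousands of pairs.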

Let's take an example. We start with the last 10 symbols in the S&P 500 index membership:

In[1]:= tickers = Take[FinancialData["^GSPC", "Members"], -10]

Out[1]= {"NASDAQ:WYNN", "NASDAQ:XEL", "NYSE:XRX", "NASDAQ:XLNX", \
"NYSE:XYL", "NYSE:YUM", "NASDAQ:ZBRA", "NYSE:ZBH", "NASDAQ:ZION", \
"NYSE:ZTS"}


Next we obtain the returns series for these stocks over the last several years. By default, FinancialData retrieves the data as TimeSeries objects. This is very elegant, but it slows the processing of the data, as we shall see.

tsStocks =
FinancialData[tickers, "Return",


Not all the series contain the same number of date-return pairs. So using Correlation is out of the question:

In[282]:= Table[Length@tsStocks[[i]]["Values"], {i, 10}]

Out[282]= {2762, 2762, 2762, 2762, 2388, 2762, 2762, 2762, 2762, 2060}


Since Correlation doesn't offer a pairwise option, we have to create the required functionality in WL. Let's start with:

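PairsCorrelation[ts_] := Module[{td, correl},
  (* ts is a list of two TimeSeries objects *)
  If[ts[[1]]["PathLength"] == ts[[2]]["PathLength"],
   (* equal lengths: the dates are assumed to match, so correlate directly *)
   correl = Correlation @@ ts,
   (* unequal lengths: keep only the dates common to both series *)
   td = TimeSeriesResample[ts, "Intersection"];
   correl = Correlation @@ td[[All, All, 2]]]];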


We first check whether the two series are of equal length, in which case we can apply the Correlation function directly. If not, we use the "Intersection" setting of the TimeSeriesResample function to reduce the series to a set of common observation dates. Note that the length check treats two series of equal length as sharing the same dates. The function is designed to be deployed using parallelization, as follows:

PairsListCorrelation[tslist_] := Module[{pairs, correl},
  (* all unordered index pairs {i, j} with i < j *)
  pairs = Subsets[Range[Length@tslist], {2}];
  (* correlate each pair of series on the parallel kernels *)
  correl = ParallelTable[
    PairsCorrelation[tslist[[pairs[[i]]]]], {i, 1, Length@pairs}];
  {correl, pairs}]


The Subsets function generates a non-duplicative list of index pairs, and a correlation table is then built in parallel by applying the PairsCorrelation function to each pair of series.
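As a small illustration of the index pairs involved (not part of the original code), this is what Subsets produces for four items, together with a check that ten series give Binomial[10, 2] pairs:

```wolfram
(* Every unordered pair of indices appears exactly once *)
Subsets[Range[4], {2}]
(* {{1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}} *)

(* For 10 series the number of pairs is Binomial[10, 2] *)
Length@Subsets[Range[10], {2}] == Binomial[10, 2]
(* True: 45 pairs *)
```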

When we apply the function to the ten stock time series, we get the following results:

In[263]:= AbsoluteTiming[{correl, pairs} =
PairsListCorrelation[tsStocks];]

Out[263]= {13.4791, Null}

In[270]:= Length@correl

Out[270]= 45

In[284]:= Through[{Mean, Median, Min, Max}[correl]]

Out[284]= {0.381958, 0.396429, 0.200828, 0.536383}
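If a full correlation matrix is wanted rather than a flat list of coefficients, the two outputs can be recombined. The following is only a sketch: `toCorrelationMatrix` is a hypothetical helper name introduced here, not something defined in the original code, and the toy coefficients are made up.

```wolfram
(* Hypothetical helper: rebuild the symmetric n x n matrix from the flat results *)
toCorrelationMatrix[correl_List, pairs_List, n_Integer] :=
 Module[{upper},
  (* place each coefficient at its {i, j} position above the diagonal *)
  upper = SparseArray[Thread[pairs -> correl], {n, n}];
  (* mirror it below the diagonal and put ones on the diagonal *)
  Normal[upper + Transpose[upper] + IdentityMatrix[n]]]

(* Toy check with three series and made-up coefficients *)
toCorrelationMatrix[{0.5, 0.2, 0.1}, Subsets[Range[3], {2}], 3] // MatrixForm
```

With the results above, the call would be `toCorrelationMatrix[correl, pairs, Length@tsStocks]`.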


So far, so good. But look again at the timing of the PairsListCorrelation function. It takes 13.5 seconds to calculate the 45 correlation coefficients for 10 series, or about 0.3 seconds per pair. To carry out an equivalent exercise for the entire S&P 500 universe would entail computing 124,750 coefficients, taking approximately 10.4 hours! This is far too slow to be practically useful in the given context.
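The extrapolation can be reproduced with a few lines of arithmetic. It assumes the per-pair cost measured on 10 series stays constant for 500 series, which is only a rough approximation:

```wolfram
(* Number of distinct pairs among 500 series *)
nPairs = Binomial[500, 2]
(* 124750 *)

(* Measured cost per coefficient: 13.4791 s for 45 pairs, about 0.3 s each *)
secondsPerPair = 13.4791/45;

(* Projected run time in hours: a little over 10 *)
hours = secondsPerPair*nPairs/3600
```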

Some speed improvement is achievable by retrieving the stock returns data in legacy (i.e. list rather than time series) format, but it still takes around 10 seconds to calculate the coefficients for our 10 stocks. Perhaps further speed improvements are possible through other means (e.g. compilation), but what is really required is a core language function to handle series of unequal length (or a Pairwise method for the Correlation function).

For comparison, I can produce the correlation coefficients for all 500 S&P member stocks in under 3 seconds using the 'Rows', 'pairwise' options of the equivalent correlation function in another scientific computing language.

# UPDATE

Another Mathematica user suggested a way to speed up the pairwise correlation algorithm using associations. We begin by downloading returns data for the S&P500 membership in legacy (i.e. list) format:

tickers = FinancialData["^GSPC", "Members"];

stockdata =
FinancialData[tickers, "Return",
DatePlus[Today, {-753, "BusinessDay"}], Method -> "Legacy"];


Then define:

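PairwiseCorrelation[stockdata_] :=
 Module[{assocStocks, pairs, correl},
  (* turn each list of {date, return} pairs into an Association date -> return *)
  assocStocks = Apply[Rule, stockdata, {2}] // Map[Association];
  (* all unordered index pairs *)
  pairs = Subsets[Range@Length@assocStocks, {2}];
  (* for each pair, keep only the common dates, then correlate the returns *)
  correl =
   Map[Correlation @@ Values@KeyIntersection[assocStocks[[#]]] &,
    pairs];
  {correl, pairs}]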


Here we use the KeyIntersection function to identify the dates common to two series, which is much faster than resampling TimeSeries objects. Accordingly:

In[317]:= AbsoluteTiming[{correl, pairs} =
PairwiseCorrelation[stockdata];]

Out[317]= {112.836, Null}

In[318]:= Length@correl

Out[318]= 127260

In[319]:= Through[{Mean, Median, Min, Max}[correl]]

Out[319]= {0.428747, 0.43533, -0.167036, 0.996379}


This is many times faster than the original algorithm (roughly 0.9 milliseconds per pair, against about 300 milliseconds before) and, although still much slower (40x to 50x) than equivalent algorithms in other languages, gets the job done in a reasonable time.

So I still think we need a Method -> "Pairwise" option for the Correlation function.
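For readers unfamiliar with KeyIntersection, here is a toy illustration of the alignment step used in PairwiseCorrelation. The associations `a` and `b` and their string keys are made up for the example; in the real data the keys are dates.

```wolfram
(* Two toy series keyed by "date"; only d2 and d3 appear in both *)
a = <|"d1" -> 0.01, "d2" -> -0.02, "d3" -> 0.03|>;
b = <|"d2" -> -0.01, "d3" -> 0.02, "d4" -> 0.01|>;

(* Keep only the keys common to both associations *)
KeyIntersection[{a, b}]
(* {<|"d2" -> -0.02, "d3" -> 0.03|>, <|"d2" -> -0.01, "d3" -> 0.02|>} *)

(* Values threads over the list, giving two aligned vectors for Correlation *)
Values[KeyIntersection[{a, b}]]
(* {{-0.02, 0.03}, {-0.01, 0.02}} *)
```

Because each association keeps its own key order, the aligned vectors match up only if both series are stored in the same (for example chronological) order.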

11 Replies
Posted 4 months ago
Great post, Jonathan! Looking forward to more of them in the near future.
Posted 3 months ago
You have earned a Featured Contributor Badge. Your exceptional post has been selected for our editorial column Staff Picks (http://wolfr.am/StaffPicks), and your profile is now distinguished by a Featured Contributor Badge and displayed on the Featured Contributor Board. Thank you!
Posted 15 days ago
Posted 15 days ago
Chris, very well done! That would indeed be a significant step forward. Yes, I think many of us would be very interested to review and test the code. Also, you might want to consider adding the function to the Wolfram Function Repository. Again, great job. Jonathan
Posted 14 days ago
Here it goes!
Posted 12 days ago
 Great solution, Chris. Impressive work.
Posted 14 days ago
Posted 12 days ago
 Another excellent implementation!