
Pairwise Correlation of Financial Data

One of the regular tasks in statistical arbitrage is to compute correlations across a large universe of stocks, such as the members of the S&P 500 index. Mathematica/WL has some very nice features for obtaining financial data and manipulating time series, and of course it offers all the commonly required statistical functions, including correlation. But the WL Correlation function is missing one vital feature: the ability to handle data series of unequal length. The need arises because stock data series do not all share a common start date and (very occasionally) omit data for dates in the middle of the series, whereas Correlation can only handle series of equal length.
The usual way of handling this is to apply pairwise correlation, in which each pair of data vectors is truncated to include only the dates common to both series. This is easy to do in WL, but it is very inefficient.
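
To make the idea concrete, here is a minimal sketch of the truncation step for a single pair. The two series, their integer "dates" and their values are made up for illustration, and the names seriesA, seriesB, common, a and b are hypothetical.

```wolfram
(* Two toy return series as {date, value} pairs; dates and values are invented *)
seriesA = {{1, 0.010}, {2, -0.020}, {3, 0.015}, {4, 0.005}};
seriesB = {{2, -0.010}, {3, 0.020}, {4, 0.000}, {5, 0.030}};

(* Dates present in both series *)
common = Intersection[seriesA[[All, 1]], seriesB[[All, 1]]];

(* Keep only the observations that fall on the common dates *)
a = Select[seriesA, MemberQ[common, First[#]] &][[All, 2]];
b = Select[seriesB, MemberQ[common, First[#]] &][[All, 2]];

(* Both vectors now have the same length, so Correlation applies *)
Correlation[a, b]
```

Repeating this for every pair is what makes the approach expensive: the intersection has to be recomputed for each of the n(n-1)/2 pairs.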

Let's take an example. We start with the last 10 symbols in the S&P 500 index membership:

In[1]:= tickers = Take[FinancialData["^GSPC", "Members"], -10]

Out[1]= {"NASDAQ:WYNN", "NASDAQ:XEL", "NYSE:XRX", "NASDAQ:XLNX", \
"NYSE:XYL", "NYSE:YUM", "NASDAQ:ZBRA", "NYSE:ZBH", "NASDAQ:ZION", \
"NYSE:ZTS"}

Next we obtain the returns series for these stocks over the last several years. By default, FinancialData retrieves the data as TimeSeries objects. This is very elegant, but slows the processing of the data, as we shall see.

tsStocks = 
 FinancialData[tickers, "Return", 
  DatePlus[Today, {-2753, "BusinessDay"}]];

Not all the series contain the same number of date-return pairs, so applying Correlation directly is out of the question:

In[282]:= Table[Length@tsStocks[[i]]["Values"], {i, 10}]

Out[282]= {2762, 2762, 2762, 2762, 2388, 2762, 2762, 2762, 2762, 2060}

Since Correlation doesn't offer a pairwise option, we have to create the required functionality in WL. Let's start with:

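PairsCorrelation[ts_] := Module[{td, correl},
   (* ts is a pair of TimeSeries; equal path lengths are taken to mean the dates already match *)
   If[ts[[1]]["PathLength"] == ts[[2]]["PathLength"], 
    correl = Correlation @@ ts,
    (* otherwise restrict both series to their common dates first *)
    td = TimeSeriesResample[ts, "Intersection"];
    correl = Correlation @@ td[[All, All, 2]]]];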

We first check whether the two series are of equal length, in which case we can Apply the Correlation function directly (this assumes that series of equal length also share the same dates). If not, we use the "Intersection" setting of the TimeSeriesResample function to reduce the series to a set of common observation dates. The function is designed to be deployed using parallelization, as follows:

PairsListCorrelation[tslist_] := Module[{pairs, i, correl = {}},
  pairs = Subsets[Range[Length@tslist], {2}];
  correl = 
   ParallelTable[
    PairsCorrelation[tslist[[pairs[[i]]]]], {i, 1, Length@pairs}];
  {correl, pairs}]

The Subsets function is used to generate a non-duplicative list of index pairs, and a correlation table is then built in parallel by applying the PairsCorrelation function to each pair of series.
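
As a small illustration of the index pairs, and of one way the {correl, pairs} output could be turned back into a full correlation matrix, consider the sketch below. The coefficients are placeholders rather than real results, and upper and corrMatrix are hypothetical names.

```wolfram
(* Index pairs for 4 series: each unordered pair appears exactly once *)
pairs = Subsets[Range[4], {2}]
(* {{1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}} *)

(* Placeholder coefficients, one per pair; invented for illustration *)
correl = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6};

(* Fill the upper triangle, mirror it, and put ones on the diagonal *)
upper = SparseArray[Thread[pairs -> correl], {4, 4}];
corrMatrix = Normal[upper + Transpose[upper]] + IdentityMatrix[4];
```

Because Subsets lists each pair once, only the upper triangle is computed; the symmetry of correlation supplies the rest.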

When we apply the function to the ten stock time series, we get the following results:

In[263]:= AbsoluteTiming[{correl, pairs} = 
   PairsListCorrelation[tsStocks];]

Out[263]= {13.4791, Null}

In[270]:= Length@correl

Out[270]= 45

In[284]:= Through[{Mean, Median, Min, Max}[correl]]

Out[284]= {0.381958, 0.396429, 0.200828, 0.536383}

So far, so good. But look again at the timing of the PairsListCorrelation function. It takes 13.5 seconds to calculate the 45 correlation coefficients for 10 series, or about 0.3 seconds per pair. To carry out an equivalent exercise for the entire S&P 500 universe would entail computing 124,750 coefficients, taking approximately 10.4 hours! This is far too slow to be practically useful in the given context.
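
The extrapolation is simple arithmetic on the timing reported above. A quick check, assuming the cost per pair stays constant as the universe grows (nPairs, secondsPerPair and hours are hypothetical names):

```wolfram
(* Number of distinct pairs among 500 series *)
nPairs = Binomial[500, 2]   (* 124750 *)

(* Observed cost per pair: 13.4791 seconds for 45 pairs *)
secondsPerPair = 13.4791/45;

(* Projected run time in hours for the full universe *)
hours = nPairs*secondsPerPair/3600.
```

The projection comes out at a little over 10 hours, and it ignores any change in parallel overhead at the larger scale.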

Some speed improvement is achievable by retrieving the stock returns data in legacy (i.e. list rather than time series) format, but it still takes around 10 seconds to calculate the coefficients for our 10 stocks. Perhaps further speed improvements are possible through other means (e.g. compilation), but what is really required is a core language function to handle series of unequal length (or a Pairwise method for the Correlation function).
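
One possible route, sketched here under assumptions rather than benchmarked, is to drop the per-pair intersection entirely: align every series on the union of all dates, record which observations are present in a 0/1 mask, and obtain all the pairwise sums from a few matrix products. The function name maskedPairwiseCorrelation is hypothetical, and the input is assumed to be a list of series, each a list of {date, return} pairs with numeric returns.

```wolfram
(* Hypothetical helper; data is a list of series, each a list of {date, return} pairs *)
maskedPairwiseCorrelation[data_] :=
 Module[{dates, rules, nS, nD, x, m, n, sx, sxx, sxy, cov, var},
  dates = Union @@ data[[All, All, 1]];
  rules = Dispatch[Thread[dates -> Range[Length[dates]]]];
  nS = Length[data]; nD = Length[dates];
  x = ConstantArray[0., {nS, nD}];  (* returns, zero where missing *)
  m = ConstantArray[0., {nS, nD}];  (* 1 where an observation exists *)
  Do[
   With[{idx = Replace[data[[i, All, 1]], rules, {1}]},
    x[[i, idx]] = data[[i, All, 2]];
    m[[i, idx]] = 1.],
   {i, nS}];
  n = m . Transpose[m];        (* n[[i, j]]: number of common dates *)
  sx = x . Transpose[m];       (* sum of series i over dates shared with j *)
  sxx = (x^2) . Transpose[m];  (* sum of squares of series i over those dates *)
  sxy = x . Transpose[x];      (* sum of products over common dates *)
  cov = n sxy - sx Transpose[sx];
  var = n sxx - sx^2;
  cov/Sqrt[var Transpose[var]]]

(* Toy data with integer dates; values are invented *)
toy = {{{1, 0.010}, {2, -0.020}, {3, 0.015}, {4, 0.005}},
       {{2, -0.010}, {3, 0.020}, {4, 0.000}, {5, 0.030}}};
r = maskedPairwiseCorrelation[toy];
```

The result is the full symmetric matrix rather than a list of coefficients. Pairs with fewer than two common dates, or with a constant series on the common dates, lead to a division by zero and would need separate handling.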

For comparison, I can produce the correlation coefficients for all 500 S&P member stocks in under 3 seconds using the 'Rows', 'pairwise' options of the equivalent correlation function in another scientific computing language.


UPDATE

Another Mathematica user suggested a way to speed up the pairwise correlation algorithm using associations. We begin by downloading returns data for the S&P500 membership in legacy (i.e. list) format:

tickers = FinancialData["^GSPC", "Members"];

stockdata = 
  FinancialData[tickers, "Return", 
   DatePlus[Today, {-753, "BusinessDay"}], Method -> "Legacy"];

Then define:

PairwiseCorrelation[stockdata_] := 
 Module[{assocStocks, pairs, correl}, 
  assocStocks = Apply[Rule, stockdata, {2}] // Map[Association];
  pairs = Subsets[Range@Length@assocStocks, {2}];
  correl = 
   Map[Correlation @@ Values@KeyIntersection[assocStocks[[#]]] &, 
    pairs];
  {correl, pairs}]

Here we are using the KeyIntersection function to identify common dates between two series, which is much faster than other methods. Accordingly:
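
A toy example of what KeyIntersection does for one pair, using integer keys in place of dates. The associations a and b and their values are invented for illustration.

```wolfram
(* Toy series keyed by integer "dates"; all values are illustrative *)
a = <|1 -> 0.010, 2 -> -0.020, 3 -> 0.015, 4 -> 0.005|>;
b = <|2 -> -0.010, 3 -> 0.020, 4 -> 0.000, 5 -> 0.030|>;

(* KeyIntersection keeps only the keys present in every association *)
{a2, b2} = KeyIntersection[{a, b}];
Keys[a2]   (* {2, 3, 4} *)

(* The remaining values line up date by date *)
Correlation[Values[a2], Values[b2]]
```

The values only line up if the common keys appear in the same order in both associations, which holds when each series is sorted by date; applying KeySort first would be a cheap safeguard.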

In[317]:= AbsoluteTiming[{correl, pairs} = 
   PairwiseCorrelation[stockdata];]

Out[317]= {112.836, Null}

In[318]:= Length@correl

Out[318]= 127260

In[319]:= Through[{Mean, Median, Min, Max}[correl]]

Out[319]= {0.428747, 0.43533, -0.167036, 0.996379}

Per pair, this is roughly 300 times faster than the original algorithm (about 0.9 ms against 0.3 s, albeit over a shorter history of 753 rather than 2753 business days) and, although much slower (40x to 50x) than equivalent algorithms in other languages, gets the job done in reasonable time.

So I still think we need a Method -> "Pairwise" option for the Correlation function.

POSTED BY: Jonathan Kinlay
11 Replies

Another excellent implementation!

POSTED BY: Jonathan Kinlay

Great solution, Chris. Impressive work.

POSTED BY: Jonathan Kinlay

Hi Sam, no, it was a very general question that I raised independently.

POSTED BY: Jonathan Kinlay

Guys, just FYI, is this discussion related to the blog post "Graph Theory and Finance in Mathematica"?

https://blog.wolfram.com/2012/06/01/graph-theory-and-finance-in-mathematica

POSTED BY: Sam Carrettie

POSTED BY: Martijn Froeling
Posted 3 years ago

Here it goes!

POSTED BY: Chris P

Chris, very well done! That would indeed be a significant step forward.

Yes, I think many of us would be very interested to review and test the code. Also, you might want to consider adding the function to the Wolfram Function Repository.

Again, great job.

Jonathan

POSTED BY: Jonathan Kinlay
Posted 3 years ago

Hi,

for fun and as an exercise, I was curious to see whether I could speed up the pairwise correlation by trying some alternative approaches. There was indeed some room for improvement.

In short, using some rather basic Mathematica code and the (old) compiler (but without compilation to C), I was finally able, with some fine tuning, to reach a 100x speed-up (compared to your last approach and with the same data).

For example, running the computation for all 505 S&P 500 members over 753 business days took me 1.1 s instead of 120 s (your PairwiseCorrelation) in the free basic Wolfram Cloud (Mathematica v12.3), and 2 s instead of 200 s in Wolfram Player 12.0.0 on my old desktop.

Also for comparison, you said that in another scientific language it takes you under 3 seconds to produce the correlation coefficients for all 500 S&P members, but it is not clear for how many business days (in your first example you get the stock data for 2753 days). In my case, the computation (in the free basic Wolfram Cloud) took about 2 s for 1500 days, 4 s for 2000 days, 6 s for 2500 days and 9 s for 3000 days, so it scales worse than linearly, but the timings remain almost acceptable, I guess.

These results simply suggest that a core function in Mathematica (as you wish existed) would probably speed up the computation even further, so that the timings would be comparable to those of other programming languages optimized for speed (with no need for external evaluations).

Concerning my approach, the "challenge" was mostly to speed up the computation of the intersections. As you can easily check, this takes 75% of the total time in your PairwiseCorrelation function, which is very long in absolute terms. If I am not mistaken, the speed-up here is about 180x, using a few tricks and basic Mathematica code without any compilation. I also did some fine tuning at every step (comparing which Mathematica code gives the best computation time for the same task), which allowed me to gain fractions of a second here and there. I used compilation only to speed up the correlation computation, where the speed-up factor is about 25x. (I couldn't experiment with the new compiler for technical reasons, but it looks very promising.)

Of course, I don't claim to have the most elegant or fastest approach, and if you want I can publish my code here. Tell me if you are interested; I will just have to make it more readable and add comments ;)

Chris

POSTED BY: Chris P

You have earned the Featured Contributor Badge. Your exceptional post has been selected for our editorial column Staff Picks http://wolfr.am/StaffPicks and your profile is now distinguished by a Featured Contributor Badge and is displayed on the Featured Contributor Board. Thank you!

POSTED BY: EDITORIAL BOARD

Great post Jonathan! Looking forward to more of them in the near future
