Message Boards Message Boards

Clustering techniques for a large multi-asset portfolio


Clustering decomposition is a data-mining and data processing technique for large datasets grouping into simple representative objects. Although clustering is less known approach than other grouping methods (factor analysis in particular), it comes particularly handy when we face large sets of data and trying to identify patterns and features common to each cluster. We demonstrate the case on the financial data where we analyse a large multi-asset portfolio with unique financial returns. We compare decomposition methods and provide cluster characteristics using statistical and probabilistic analysis including risk.

Clustering image

Cluster analysis - brief introduction

Cluster analysis is a data-mining and data processing approach for determining grouping patterns in the data structure.

It is less known than other exploratory techniques, but still powerful and valid method for large dataset classification, pattern recognition, information retrieval and image analysis.

It is generally referred as is an unsupervised learning technique used for classification of data where individual elements are partitioned into groups called clusters that represent proximate collections of data elements based on a distance or dissimilarity function. Identical element pairs have zero distance or dissimilarity, and all others have positive distance or dissimilarity.

What is special about clusters? It is a method for determining how cases rather than variables group together. Structural features are handled though the distance function. Elements with similar proximity are grouped into clusters with consistent distance characteristics.

Clusters in financial data analysis

Clusters are ideal method for classifying large sets of data into meaningful groups that can be further analysed using standard exploratory techniques.

Clusters can be effectively used to reduce the dimensionality of data to be analysed. Each cluster can be effectively used as a combined variable with its own grouped characteristic.

Clustering is closely linked to statistical distributions and represents multi-objective optimisation problems that naturally arise in finance.

Clustering of financial returns

We look at clustering application on financial returns and consider the following case:

We have collected large multi-asset portfolios with computed financial returns. Our objective is to divide the total portfolio into groups with similar characteristics and obtain meaningful statistics of each group.

Simulation of financial returns

portA = RandomVariate[NormalDistribution[-0.01, 0.008], {1600, 2}];
portB = RandomVariate[LogisticDistribution[0.007, 0.01], {1200, 2}];
portC = RandomVariate[StudentTDistribution[0.015, 0.01, 4], {1200, 2}];
portD = RandomVariate[LaplaceDistribution[-0.007, 0.01], {1000, 2}];

Creating combined portfolio

portret = Join[portA, portB, portC, portD];

Total size of combined portfolio:

Total[Length /@ portret]


With distributional patterns showing skewed distribution of combined portfolio returns

Histogram[Flatten[portret], 40, ChartStyle -> "Rainbow"]

enter image description here

This is confirmed in the 3D image

Histogram3D[portret, 50, ColorFunction -> "Rainbow"]

enter image description here

Defining clustering of financial returns:

Clusters are built on the notion of similarity / proximity amongst individual returns. Mathematica 10 offers several measures to choose from - we test four of them:

  1. Euclidean distance:

  2. Squared Euclidean distance:

  3. Manhattan distance:

  4. Canberra distance:



    iSize = 250;
      FindClusters[portret, 4, DistanceFunction -> EuclideanDistance], 
      PlotLabel -> "Eucld Dist", ImageSize -> iSize], 
     ListPlot[FindClusters[portret, 4, 
       DistanceFunction -> SquaredEuclideanDistance], 
      PlotLabel -> "Sq Eucld Dist", ImageSize -> iSize], 
     ListPlot[FindClusters[portret, 4, 
       DistanceFunction -> ManhattanDistance], PlotLabel -> "Manht Dist", 
      ImageSize -> iSize], 
     ListPlot[FindClusters[portret, 4, 
       DistanceFunction -> CanberraDistance], 
      PlotLabel -> "Canberra Dist", ImageSize -> iSize]}

enter image description here

We can see that each distance function partitions clusters in a distinct way.

Working with clusters

Selecting the cluster

Let's select the Manhattan distance method for clustering of our mega-portfolio:

mclust = FindClusters[portret, 4, 
   DistanceFunction -> ManhattanDistance];
ListPlot[mclust, PlotMarkers -> Automatic, 
 PlotLegends -> {"Clust 1", "Clust 2", "Clust 3", "Clust 4"}]

enter image description here

Analysing each cluster profile

Clusters size:

BarChart[Map[Length, mclust], PlotLabel -> Style["Clusters size", 16],
  ChartStyle -> "DarkRainbow", BarOrigin -> Left, 
 LabelingFunction -> After]

enter image description here

We can observe unequal distribution of cluster sizes => cluster 1 is just half of the cluster 4. This is due to the distancing distribution of clusters specification.

Cluster Norm

Cluster norm provides valuable information about the distance function inside each cluster.

BarChart[Map[Norm, mclust], PlotLabel -> Style["Clusters norm", 16], 
 ChartStyle -> "RoseColors", BarOrigin -> Right, 
 LabelingFunction -> Before]

enter image description here

Statistical analysis of clusters

Clusters are treated as 'objects' and we can perform any statistical analysis on them. For example - we can compute descriptive statistics to identify moments of each cluster.

Location and dispersion measures

size = Length[mclust];
legnd = {"Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"};
clmean = Table[Mean[Flatten[mclust[[i]]]], {i, size}];
clvol = Table[StandardDeviation[Flatten[mclust[[i]]]], {i, size}];
{BarChart[clmean, ChartStyle -> "Pastel", 
  PlotLabel -> "Clusters means", LabelingFunction -> Above, 
  ImageSize -> 320, ChartLegends -> legnd], 
 BarChart[clvol, ChartStyle -> "FallColors", 
  PlotLabel -> "Clusters vols", LabelingFunction -> Above, 
  ImageSize -> 320, ChartLegends -> legnd]}

enter image description here

First two clusters reside in the negative return territory, but are less volatile than the ones which show positive returns.

We can get further insight into each cluster dispersion by looking at quartile deviation and inter quartile range:

clqdev = Table[QuartileDeviation[Flatten[mclust[[i]]]], {i, size}];
cliqrng = Table[InterquartileRange[Flatten[mclust[[i]]]], {i, size}];
{BarChart[clqdev, ChartStyle -> "LakeColors", 
  PlotLabel -> Style["Clusters Quartile Deviation", 16], 
  LabelingFunction -> Above, ImageSize -> 320, ChartLegends -> legnd],
  BarChart[cliqrng, ChartStyle -> "SunsetColors", 
  PlotLabel -> Style["Clusters vols", 16], LabelingFunction -> Above, 
  ImageSize -> 320, ChartLegends -> legnd]}

enter image description here

The above dispersion statistics is confirmed also in the quartile domain - quartile deviation is the lowest for negative returns clusters, but higher for the positive one. This indicates higher risk in the positive return segments.

Clusters shape measures

cskew = Table[Skewness[Flatten[mclust[[i]]]], {i, size}];
ckurt = Table[Kurtosis[Flatten[mclust[[i]]]], {i, size}];
{BarChart[cskew, ChartStyle -> "MintColors", 
  PlotLabel -> Style["Clusters skewness", 16], 
  LabelingFunction -> Above, ImageSize -> 320, ChartLegends -> legnd],
  BarChart[ckurt, ChartStyle -> "LakeColors", 
  PlotLabel -> Style["Clusters kurtosis", 16], 
  LabelingFunction -> Above, ImageSize -> 320, ChartLegends -> legnd]}

enter image description here Moments statistics reveal non-normality pattern in all clustered data.

Probabilistic analysis of clustered returns

Extension to the probabilistic calculations on the clusters is trivial: For example - we may want to determine each cluster probability of financial return < 0

Table[Probability[x < 0, x \[Distributed] Flatten[mclust[[i]]]], {i, 
    size}] // N;
BarChart[%, PlotLabel -> Style["Clusters return probability < 0", 16],
  ChartStyle -> "Rainbow", LabelingFunction -> Above]

enter image description here Being in the negative return domain, the first two clusters show the highest probability of negative returns.

Clusters risk measure

We can easily calculate Value at Risk threshold for each cluster:

varT = Table[Quantile[Flatten[mclust[[i]]], 0.01], {i, size}];
BarChart[varT, ChartStyle -> "SolarColors", 
 PlotLabel -> Style["VaR threshold at 99% level", 16], 
 LabelingFunction -> Below]

enter image description here

and then using the above measure to obtain the Tail VaR statistics:

tVaR = Table[
   NExpectation[x \[Conditioned] x < varT[[i]], 
    x \[Distributed] Flatten[mclust[[i]]]], {i, size}];
BarChart[tVaR, ChartStyle -> "TemperatureMap", 
 PlotLabel -> Style["Tail VaR threshold at 99% level", 16], 
 LabelingFunction -> Below]
![enter image description here][13]

Tail VaR (Expected shortfall) reveals the same risk as the VaR above. The first cluster is the most riskiest whilst the last one in the least riskiest cluster.

Cluster distributions

Creating clusters objects allow us to visualise and work with cluster distributions

Table[Histogram[Flatten[mclust[[i]]], 25, 
  ColorFunction -> Function[{height}, ColorData["Rainbow"][height]], 
  ImageSize -> 225], {i, size}]

enter image description here Clusters one and four generate highly skewed profiles - cluster 1 negative and cluster 4 positive. Clusters two are three are more symmetrical.

The distribution patterns can also be analysed in the 3D format as they provide richer insight into the distributions in higher dimensions. As we can see below, the low-frequency outliers are still present in each cluster and add to the risk of each group.

clr = {"Pastel", "LightTemperatureMap", "RedBlueTones", 
   Histogram3D[mclust[[i]], 50, "PDF", ImageSize -> 250, 
    ColorFunction -> clr[[i]]], {i, size}] // Quiet

enter image description here


Why clusters?

  • Ideal tool for knowledge discovery and pattern recognition in datasets.
  • Great technique for automatic data classification and numerical taxonomy.
  • Well-suited tool for financial data analysis:
    • Patterns detection
    • Drivers recognition
    • Characteristics extraction

Great technique for automatic data classification and numerical taxonomy

POSTED BY: Igor Hlivka
2 years ago

Hi Igor,

Thanks for sharing this. I'd like to learn how to construct multidimensional clusters, for example, observations comprised of the value of a company, its industry (Standard Industrial Classification), the state or city of its headquarters, and the names of its beneficial owners.


POSTED BY: Gregory Lypny
2 years ago

Hello Gregory

Yes, clusters can be multi-dimensional. Below is a simple example using numerical data - in fact just 3 dimensional array to be able to visualise it. I am asking <FindClusters> command to generate 4 clusters

cdata = RandomVariate[NormalDistribution[0.01, 0.01], {100, 3}];
FindClusters[cdata, 4];

This will look as follow: Clusters for 3D data

As you can see, you can work with multi-dimensional data, but visualisation of the clustered objects can be more difficult if you move beyond 3 dimensions. If the data is non-numeric, you are likely to work with strings. You can still try using the default distance function or select specific method such as DamerauLevenshteinDistance or HammingDistance.

To execute it: build your data as nested list where each sub list contains the data you want. Then simply apply the FindClusters command to your data object.

Hope this answers your query. Kind regards Igor

POSTED BY: Igor Hlivka
2 years ago

Hi Igor,

Thanks for the example. Crystal clear. The catch with qualitative data will be to transform it and/or to modify the clustering methods so that the clusters are meaningful. It will be interesting.


POSTED BY: Gregory Lypny
2 years ago


Thank you for including the notebook as well. That way I can play around with the results.


POSTED BY: Michael Kelly
2 years ago

enter image description here - Congratulations! This post is now a Staff Pick! Thank you for your wonderful contributions. Please, keep them coming!

POSTED BY: Moderation Team
12 days ago

Group Abstract Group Abstract