Group Abstract

Message Boards

WOLFRAM COMMUNITY

30.3K Views

5 Replies

16 Total Likes

View groups...

Follow this post

Share this post:

GROUPS:

Staff Picks Computer Science Data Science Economics Finance Graphics and Visualization Wolfram Language Optimization

Clustering techniques for a large multi-asset portfolio

Igor Hlivka

Igor Hlivka, MThree Consulting

Posted 11 years ago

Clustering decomposition is a data-mining and data processing technique for large datasets grouping into simple representative objects. Although clustering is less known approach than other grouping methods (factor analysis in particular), it comes particularly handy when we face large sets of data and trying to identify patterns and features common to each cluster. We demonstrate the case on the financial data where we analyse a large multi-asset portfolio with unique financial returns. We compare decomposition methods and provide cluster characteristics using statistical and probabilistic analysis including risk. Cluster analysis - brief introduction Cluster analysis is a data-mining and data processing approach for determining grouping patterns in the data structure. It is less known than other exploratory techniques, but still powerful and valid method for large dataset classification, pattern recognition, information retrieval and image analysis. It is generally referred as is an unsupervised learning technique used for classification of data where individual elements are partitioned into groups called clusters that represent proximate collections of data elements based on a distance or dissimilarity function. Identical element pairs have zero distance or dissimilarity, and all others have positive distance or dissimilarity. What is special about clusters? It is a method for determining how cases rather than variables group together. Structural features are handled though the distance function. Elements with similar proximity are grouped into clusters with consistent distance characteristics. Clusters in financial data analysis Clusters are ideal method for classifying large sets of data into meaningful groups that can be further analysed using standard exploratory techniques. Clusters can be effectively used to reduce the dimensionality of data to be analysed. Each cluster can be effectively used as a combined variable with its own grouped characteristic. Clustering is closely linked to statistical distributions and represents multi-objective optimisation problems that naturally arise in finance. Clustering of financial returns We look at clustering application on financial returns and consider the following case: We have collected large multi-asset portfolios with computed financial returns. Our objective is to divide the total portfolio into groups with similar characteristics and obtain meaningful statistics of each group. Simulation of financial returns portA = RandomVariate[NormalDistribution[-0.01, 0.008], {1600, 2}]; portB = RandomVariate[LogisticDistribution[0.007, 0.01], {1200, 2}]; portC = RandomVariate[StudentTDistribution[0.015, 0.01, 4], {1200, 2}]; portD = RandomVariate[LaplaceDistribution[-0.007, 0.01], {1000, 2}]; Creating combined portfolio portret = Join[portA, portB, portC, portD]; Total size of combined portfolio: Total[Length /@ portret] 10000 With distributional patterns showing skewed distribution of combined portfolio returns Histogram[Flatten[portret], 40, ChartStyle -> "Rainbow"] This is confirmed in the 3D image Histogram3D[portret, 50, ColorFunction -> "Rainbow"] Defining clustering of financial returns: Clusters are built on the notion of similarity / proximity amongst individual returns. Mathematica 10 offers several measures to choose from - we test four of them: Euclidean distance: Sqrt[\[Sum](u-v)^2] Squared Euclidean distance: \[Sum](u-v)^2 Manhattan distance: \[Sum]\[LeftBracketingBar]u-v\[RightBracketingBar] Canberra distance: \[Sum]\[LeftBracketingBar]u-v\[RightBracketingBar]/(\[LeftBracketingBar]u\[RightBracketingBar]+\[LeftBracketingBar]v\[RightBracketingBar]) . iSize = 250; {ListPlot[ FindClusters[portret, 4, DistanceFunction -> EuclideanDistance], PlotLabel -> "Eucld Dist", ImageSize -> iSize], ListPlot[FindClusters[portret, 4, DistanceFunction -> SquaredEuclideanDistance], PlotLabel -> "Sq Eucld Dist", ImageSize -> iSize], ListPlot[FindClusters[portret, 4, DistanceFunction -> ManhattanDistance], PlotLabel -> "Manht Dist", ImageSize -> iSize], ListPlot[FindClusters[portret, 4, DistanceFunction -> CanberraDistance], PlotLabel -> "Canberra Dist", ImageSize -> iSize]} We can see that each distance function partitions clusters in a distinct way. Working with clusters Selecting the cluster Let's select the Manhattan distance method for clustering of our mega-portfolio: mclust = FindClusters[portret, 4, DistanceFunction -> ManhattanDistance]; ListPlot[mclust, PlotMarkers -> Automatic, PlotLegends -> {"Clust 1", "Clust 2", "Clust 3", "Clust 4"}] Analysing each cluster profile Clusters size: BarChart[Map[Length, mclust], PlotLabel -> Style["Clusters size", 16], ChartStyle -> "DarkRainbow", BarOrigin -> Left, LabelingFunction -> After] We can observe unequal distribution of cluster sizes => cluster 1 is just half of the cluster 4. This is due to the distancing distribution of clusters specification. Cluster Norm Cluster norm provides valuable information about the distance function inside each cluster. BarChart[Map[Norm, mclust], PlotLabel -> Style["Clusters norm", 16], ChartStyle -> "RoseColors", BarOrigin -> Right, LabelingFunction -> Before] Statistical analysis of clusters Clusters are treated as 'objects' and we can perform any statistical analysis on them. For example - we can compute descriptive statistics to identify moments of each cluster. Location and dispersion measures size = Length[mclust]; legnd = {"Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"}; clmean = Table[Mean[Flatten[mclust[[i]]]], {i, size}]; clvol = Table[StandardDeviation[Flatten[mclust[[i]]]], {i, size}]; {BarChart[clmean, ChartStyle -> "Pastel", PlotLabel -> "Clusters means", LabelingFunction -> Above, ImageSize -> 320, ChartLegends -> legnd], BarChart[clvol, ChartStyle -> "FallColors", PlotLabel -> "Clusters vols", LabelingFunction -> Above, ImageSize -> 320, ChartLegends -> legnd]} First two clusters reside in the negative return territory, but are less volatile than the ones which show positive returns. We can get further insight into each cluster dispersion by looking at quartile deviation and inter quartile range: clqdev = Table[QuartileDeviation[Flatten[mclust[[i]]]], {i, size}]; cliqrng = Table[InterquartileRange[Flatten[mclust[[i]]]], {i, size}]; {BarChart[clqdev, ChartStyle -> "LakeColors", PlotLabel -> Style["Clusters Quartile Deviation", 16], LabelingFunction -> Above, ImageSize -> 320, ChartLegends -> legnd], BarChart[cliqrng, ChartStyle -> "SunsetColors", PlotLabel -> Style["Clusters vols", 16], LabelingFunction -> Above, ImageSize -> 320, ChartLegends -> legnd]} The above dispersion statistics is confirmed also in the quartile domain - quartile deviation is the lowest for negative returns clusters, but higher for the positive one. This indicates higher risk in the positive return segments. Clusters shape measures cskew = Table[Skewness[Flatten[mclust[[i]]]], {i, size}]; ckurt = Table[Kurtosis[Flatten[mclust[[i]]]], {i, size}]; {BarChart[cskew, ChartStyle -> "MintColors", PlotLabel -> Style["Clusters skewness", 16], LabelingFunction -> Above, ImageSize -> 320, ChartLegends -> legnd], BarChart[ckurt, ChartStyle -> "LakeColors", PlotLabel -> Style["Clusters kurtosis", 16], LabelingFunction -> Above, ImageSize -> 320, ChartLegends -> legnd]} Moments statistics reveal non-normality pattern in all clustered data. Probabilistic analysis of clustered returns Extension to the probabilistic calculations on the clusters is trivial: For example - we may want to determine each cluster probability of financial return < 0 Table[Probability[x < 0, x \[Distributed] Flatten[mclust[[i]]]], {i, size}] // N; BarChart[%, PlotLabel -> Style["Clusters return probability < 0", 16], ChartStyle -> "Rainbow", LabelingFunction -> Above] Being in the negative return domain, the first two clusters show the highest probability of negative returns. Clusters risk measure We can easily calculate Value at Risk threshold for each cluster: varT = Table[Quantile[Flatten[mclust[[i]]], 0.01], {i, size}]; BarChart[varT, ChartStyle -> "SolarColors", PlotLabel -> Style["VaR threshold at 99% level", 16], LabelingFunction -> Below] and then using the above measure to obtain the Tail VaR statistics: tVaR = Table[ NExpectation[x \[Conditioned] x < varT[[i]], x \[Distributed] Flatten[mclust[[i]]]], {i, size}]; BarChart[tVaR, ChartStyle -> "TemperatureMap", PlotLabel -> Style["Tail VaR threshold at 99% level", 16], LabelingFunction -> Below] ![enter image description here][13] Tail VaR (Expected shortfall) reveals the same risk as the VaR above. The first cluster is the most riskiest whilst the last one in the least riskiest cluster. Cluster distributions Creating clusters objects allow us to visualise and work with cluster distributions Table[Histogram[Flatten[mclust[[i]]], 25, ColorFunction -> Function[{height}, ColorData["Rainbow"][height]], ImageSize -> 225], {i, size}] Clusters one and four generate highly skewed profiles - cluster 1 negative and cluster 4 positive. Clusters two are three are more symmetrical. The distribution patterns can also be analysed in the 3D format as they provide richer insight into the distributions in higher dimensions. As we can see below, the low-frequency outliers are still present in each cluster and add to the risk of each group. clr = {"Pastel", "LightTemperatureMap", "RedBlueTones", "AvocadoColors"}; Table[Evaluate@ Histogram3D[mclust[[i]], 50, "PDF", ImageSize -> 250, ColorFunction -> clr[[i]]], {i, size}] // Quiet Summary Why clusters? Ideal tool for knowledge discovery and pattern recognition in datasets. Great technique for automatic data classification and numerical taxonomy. Well-suited tool for financial data analysis: Patterns detection Drivers recognition Characteristics extraction Great technique for automatic data classification and numerical taxonomy Attachments:

POSTED BY: Igor Hlivka

5 Replies

Sort By:

EDITORIAL BOARD

EDITORIAL BOARD, WOLFRAM

Posted 8 years ago

- Congratulations! This post is now a Staff Pick! Thank you for your wonderful contributions. Please, keep them coming!

POSTED BY: EDITORIAL BOARD

Michael Kelly

Michael Kelly, Wolfram Research Inc.

Posted 11 years ago

Igor Thank you for including the notebook as well. That way I can play around with the results. Michael

POSTED BY: Michael Kelly

Gregory Lypny

Posted 11 years ago

Hi Igor, Thanks for the example. Crystal clear. The catch with qualitative data will be to transform it and/or to modify the clustering methods so that the clusters are meaningful. It will be interesting. Gregory

POSTED BY: Gregory Lypny

Igor Hlivka

Igor Hlivka, MThree Consulting

Posted 11 years ago

Hello Gregory Yes, clusters can be multi-dimensional. Below is a simple example using numerical data - in fact just 3 dimensional array to be able to visualise it. I am asking <FindClusters> command to generate 4 clusters cdata = RandomVariate[NormalDistribution[0.01, 0.01], {100, 3}]; FindClusters[cdata, 4]; ListPlot3D[%] This will look as follow: As you can see, you can work with multi-dimensional data, but visualisation of the clustered objects can be more difficult if you move beyond 3 dimensions. If the data is non-numeric, you are likely to work with strings. You can still try using the default distance function or select specific method such as DamerauLevenshteinDistance or HammingDistance. To execute it: build your data as nested list where each sub list contains the data you want. Then simply apply the FindClusters command to your data object. Hope this answers your query. Kind regards Igor

POSTED BY: Igor Hlivka

Gregory Lypny

Posted 11 years ago

Hi Igor, Thanks for sharing this. I'd like to learn how to construct multidimensional clusters, for example, observations comprised of the value of a company, its industry (Standard Industrial Classification), the state or city of its headquarters, and the names of its beneficial owners. Gregory

POSTED BY: Gregory Lypny

Reply to this discussion

Reply Preview

Attachments

Remove Add a file to this post

Follow this discussion

or Discard

Feedback

Clustering techniques for a large multi-asset portfolio

Cluster analysis - brief introduction

Clusters in financial data analysis

Clustering of financial returns

Simulation of financial returns

Creating combined portfolio

This is confirmed in the 3D image

Defining clustering of financial returns:

Working with clusters

Selecting the cluster

Analysing each cluster profile

Cluster Norm

Statistical analysis of clusters

Location and dispersion measures

Clusters shape measures

Probabilistic analysis of clustered returns

Clusters risk measure

Cluster distributions

Summary

Why clusters?