Message Boards Message Boards

Sweeping parameter spaces using gzip-Entropy

This post is actually a reply to a question I was asked in one of the other dicussions on "Detecting copy-move forgery in images". I had indicated that
There is, by the way, a very similar and simple trick, which one can use to estimate the complexity of time series based on compression rates. Using an algorithm such as gzip one can very crudely estimate the so called algorithmic complexity. It is very easy and fast to study the parameter space of a dynamical system and look e.g. for chaotic regions.
I was then asked how this works. The idea is actually quite simple and well known. A time series generated from a map, such as the logistic map, or from a set of ODEs, such as the chaotic Rössler system, can show very different types of behaviour. For some parameter values they might behave rather regular and for others they might exhibit a quite complex type of behaviour. Let's consider the logistic map first.
ListLinePlot[NestList[4*#*(1 - #) &, RandomReal[], 100]]
The resulting time series looks like this:

When we change the paramter (in the example above "4") we obtain the well known bifurcation diagram.
M = 400; listlogistic =
Table[{\[Mu], NestList[\[Mu] *#*(1 - #) &, RandomReal[],M]}, {\[Mu], 3, 4, 0.0001}]; allcoords = Flatten[Table[Table[{listlogistic[[i, 1]], listlogistic[[i, 2, j]]}, {j, 380, M}], {i, 1, Length[listlogistic]}], 1]; ListPlot[allcoords, PlotStyle -> {Black, AbsolutePointSize[0.01]}, ImageSize -> 750]

For different values of the paramter \ it shows the range of values of the resulting time series. The "black bands" indicate chaotic behaviour. A rather standard way to quantify this type of behaviour is the so-called Lyapunov exponent, which can be calculated easily in this case for all values of the paramter. (This will be more complicated in the higher dimensional example I will discuss later.)
 list5 = {};
 For[a = 3.0, a <= 4, a = a + (4 - 1.5)/5000,
 (*Generate data*)
 data = NestList[a*#*(1 - #) &, RandomReal[], 10000];
 smpl = 2000;
 (*Evaluate Log[D[a*#*(1 - #) ,#]] at every point of the trajectory, sum up and add result to list*)
 AppendTo[list5, {a, Total[Table[Log[Abs[a - 2.*a*data[[i]]]], {i, Length[data] - smpl, Length[data]}]]/smpl}]];
ListLinePlot[list5, ImageSize -> 750]

Positive values indicate chaotic behaviour; this corresponds to the black bands in the bifucation diagram. In higher dimensions it is more complicated to compute the Lyapunov exponent, particularly if we do not know the equations, but only have a time series. The following simple idea can help us to qualitatively assess the "complexity" of the time series. We first save the time series in a binary file, and determine its size - say in Bytes. We then compress the file using gzip, and again determine the size of the compressed file. The ratio of the two numbers, i.e. the compression rate, has information about the complexity of the time series. The code is very simple.
 list0 = {};
 (*Change Paramter from 3 to 4 in 5000 steps *)
 For[a = 3.0, a <= 4, a = a + (4. - 3.)/5000,
 (*Generate data for paramter a*)
 data = NestList[a*#*(1 - #) &, RandomReal[], 15000];
(*Write data to binary file*)
BinaryWrite["/Users/thiel/Desktop/sample.dat", data[[10000 ;;]],"Real32"]; Close["/Users/thiel/Desktop/sample.dat"];
(*Determine length of file*)
fsizebefore = FileByteCount["~/Desktop/sample.dat"];
(*gzip the file and determine new length*)
Run["gzip ~/Desktop/sample.dat"];
fsizeafter = FileByteCount["~/Desktop/sample.dat.gz"];
(*Clean up*)
Run["rm ~/Desktop/sample.dat.gz"]; Run["rm ~/Desktop/sample.dat"];
(*Collect data*)
AppendTo[list0, {a, N[fsizeafter/fsizebefore]}]]; ListLinePlot[list0,ImageSize -> 750]
This gives the following figure:

(Note that I am using a Mac and terminal commands. This will have to modified on a Windows machine.)
The compression rate can obviously not be negative. Low values indicate regular time series (which correspond to negative Lyapunov exponents in the figure above) and positive values indicate complex, potentially chaotic behaviour. The horizontal lines clearly indicate the periodic windows in the bifurcation diagram. The qualitative structure is very similar to the one of the diagram of the Lyaponov exponents. There are also peaks at the bifurcation points, e.g. at \=3; these are due to long transients at these points. In the literature it is claimed that the "gzip-entropy" is an estimate of the so-called algorithmic complexity. 
I have tried out to get rid of the requirement to write the files to the hard drive using native Mathematica commands. There are two alternative approaches and the resulting figures.
list2 = {}; For[a = 3.0, a <= 4, a = a + (4 - 1.5)/5000, data = NestList[a*#*(1 - #) &, RandomReal[], 15000]; AppendTo[list2, {a, Entropy[data[[10000 ;;]]]}]]; ListLinePlot[list2,ImageSize -> 750]

Monitor[list4 = {}; For[a = 3.0, a <= 4, a = a + (4 - 1.5)/5000, data = NestList[a*#*(1 - #) &, RandomReal[], 15000];
AppendTo[list4, {a, N[ByteCount[Compress[Flatten[RealDigits[data[[10000 ;;]]]]]]/ByteCount[Flatten[RealDigits[data[[10000 ;;]]]]]]}]], a]; ListLinePlot[list4, ImageSize -> 750]

It appears that the two approaches that use Mathematica functions and do not rely on the external "gzip" give reasonable results as well. I am using solid state drives, so the speed up is not very large.
One might wonder, why we do not directly estimate the Lyapunov exponents, which can be done using e.g. Wolf's algorithm:
The point is that the gzip method is very easy to implement and rather fast.  Let's consider the chaotic Roessler system. 
sols = NDSolve[
{D[x[t], t] == -y[t] - z[t], D[y[t], t] == x[t] + a y[t], D[z[t], t] == b + (x[t] - c) z[t], x[0] == RandomReal[], y[0] == RandomReal[], z[0] == RandomReal[]} /. {a -> 0.2, b -> 0.2, c -> 5.7}, {x, y, z}, {t, 0, 3000}, MaxSteps -> Infinity];
(*Plot; do not show transient, i.e. start at t=2000*)
ParametricPlot3D[{x[t], y[t], z[t]} /. sols, {t, 2000, 3000}, PlotRange -> All, ImageSize -> 500, PlotPoints -> 1000]
which gives

Let's sweep the paramter space from where a=b range from 0.2 to 0.3 and c goes from 20 to 45. The following command runs for slightly less then 6300 seconds on my laptop.
 AbsoluteTiming[Monitor[listros = {};
 (*Set up the loops for the paramter sweep, use 100 steps for a and for c *)
 For[c = 20.0, c <= 45.0, c = c + (45. - 20.)/100.,
 For[a = 0.2, a <= 0.3, a = a + (0.3 - 0.2)/100.,
 (*Integrate the ODEs*)
 sols = NDSolve[{D[x[t], t] == -y[t] - z[t], D[y[t], t] == x[t] + a y[t], D[z[t], t] == b + (x[t] - c) z[t], x[0] == RandomReal[], y[0] == RandomReal[], z[0] == RandomReal[]} /. b -> a, {x, y,z}, {t, 0, 2000}, MaxSteps -> Infinity];
(*Sample the time series, discard transient of 100 points*)
data = Table[{t, Part[x[t] /. sols, 1]}, {t, 1000, 2000, 0.2}];
(*Write the data to binary file*)
BinaryWrite["/Users/thiel/Desktop/sample.dat", data[[2000 ;;]],"Real64"]; Close["/Users/thiel/Desktop/sample.dat"];
(*Determine file size*)
fsizebefore = FileByteCount["~/Desktop/sample.dat"];
(*Compress and determine new file size*)
Run["gzip ~/Desktop/sample.dat"];
fsizeafter = FileByteCount["~/Desktop/sample.dat.gz"];
(*Clean up*)
Run["rm ~/Desktop/sample.dat.gz"];
Run["rm ~/Desktop/sample.dat"];
(*Collect results*)
AppendTo[listros, {c, a, N[fsizeafter/fsizebefore]}]]], {c, a}];]
Note that I only use the x-component of the system. The result can be plotted using 
ListDensityPlot[listros, InterpolationOrder -> 0, ColorFunction -> "Rainbow", PlotRange -> All, PlotLegends -> Placed[Automatic, Right]]
which gives

The structures do not become very clear, but after a bit of  "creative (and clumsy) rescaling"
listrosnorm = (listros[[All, 3]] - Min[listros[[All, 3]]])/(Max[listros[[All, 3]]] - Min[listros[[All, 3]]]);
listrosnorm2 = (listrosnorm - Mean[listrosnorm])/StandardDeviation[listrosnorm];
listnorm = Table[{listros[[k, 1]], listros[[k, 2]], Exp[listrosnorm2[[k]]]}, {k,1, Length[listrosnorm]}]
they become much clearer:

The blue regions indicate regular, e.g. periodic behaviour, and the green/red regions mean irregular, e.g. chaotic. This method allows us to screen large ranges of paramters to better understand the behaviour of a system. Note that in the middle of the image we start recognising a periodic (blue) structure that is called "shrimp". The following image uses a more precise method (Recurrence Plots) to esimate the correlation entropy of the same paramter region. 

The periodic (blue) regions are much clearer now and the colours indicate how chaotic/erratic the system is, but that comes at the cost of CPU time. This figure was obtained using the computer cluster of my home institute. In this case the magnitude of the values that are colour-coded actually have meaning and are not only "proxies" as in the case of the gzip entropies above. I am currently optimising some Mathematica code and hope to be able to provide a piece of code soon, which will generate similar high quality pictures on a laptop. Mathematica's ability to use multiple CPUs, and other built in functions, such as curve-fitting, will be very useful to speed up the program.

There are several interesting features in this figure. First, there is the dominating "shrimp". Note that there are smaller shrimp below its "antennae". This shrimp structure is in fact fractal. A careful study of the figure allows us to guess what types of bifurcations occur and can provide useful information for a thorough bifurcation analysis. 

None of the codes presented here is optimised at all. I just wanted to answer the question how to use gzip to estimate the behaviour of time series and how to screen larger regions of the paramter space. The method is quite quick and very dirty...
POSTED BY: Marco Thiel
Dear Marco:

Very neat indeed. Algorithmic complexity approximated by uncompressibility tests has proven to be a quite successful method in several practical areas of research. I would only add for the sake of others, that it is not only because it is easy and fast that one may want to approximate the Kolmogorov complexity of an object. Lossless compression is a sufficient test of non-algorithmic randomness because the result of lossless compressing an object is effectively an upper bound of its Kolmogorov complexity. But more important, algorithmic information theory proves that no computable measure can spot all effective regularities in, for example, a(n infinite) time series. Despite and thanks to its uncomputability, Kolmogorov complexity (K) is a measure powerful enough to deal with any computable regularity (regularities for which an effective statistical test can be devised and implemented with a Turing machine).

I have myself applied uncompressibility to the behaviour of various systems, particularly space-time diagrams of abstract computing machines such as cellular automata and Turing machines:

- "Asymptotic Behaviour and Ratios of Complexity in Cellular Automata Rule Spaces" Int. Journal of Bifurcation and Chaos vol. 13, no. 9, 2013.

- "Computation and Universality: Class IV versus Class III Cellular Automata", J. of Cellular Automata, vol. 7, no. 5-6, p. 393-430, 2013.

- "Compression-based Investigation of the Dynamical Properties of Cellular Automata and Other Systems", Complex Systems, vol. 19, No. 1, pages 1-28, 2010.

And a Wolfram Demonstration about it to play with:
Cellular Automaton Compressibility

And I have also proposed the only other alternative to lossless compression to approximate K by means of the coding theorem method (Levin), specially useful for small objects, such as short time series, where lossless compression algorithms fail:

- "Numerical Evaluation of the Complexity of Short Strings: A Glance Into the Innermost Structure of Algorithmic Randomness", Applied Mathematics and Computation 219, pp. 63-77, 2012.

We recently used this method as a psychological metric for subjective randomness and before to price market time series:

- Algorithmic complexity for short binary strings applied to psychology: a primer, Behavior Research Methods (in press)

- An Algorithmic Information-theoretic Approach to the Behaviour of Financial Markets, Journal of Economic Surveys, vol. 25-3, pp. 463, 2011.

I apologize for so much self-advertising, I am actually restraining myself from listing even more.

Concerning the function Compress[] in Mathematica, it is based on DEFLATE. Just in case people wonder. For images, doing Export[] to PNG can also serve as an approximation given that PNG lossly compress images.

Thanks for your great post!
POSTED BY: Hector Zenil
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
or Discard

Group Abstract Group Abstract