Message Boards Message Boards

Benchmarking: unexplained switching between high and low performance

GROUPS:

This is a cross-post of a question I asked today on StackExchange.


tl;dr I am trying to accurately benchmark some vectorized operations, and compare them between systems. But benchmarking is hard to do well, and I am getting inconsistent results: performance is switching, apparently randomly, between "slow" and "fast". Why?

Here is some code that benchmarks adding two packed arrays of size n, where n is just above a million. The timing is measured 5 times, to ensure consistency, then n is increased a bit, then the summation is timed again, etc. The whole benchmark is repeated twice.

Table[
 n = 1000000 + k;
 SeedRandom[120];
 a = RandomReal[1, n];
 b = RandomReal[1, n];
 {k, Table[First@RepeatedTiming[a + b;], {5}]},
 {2}, {k, 20000, 200000, 20000}
 ]

The results are below. In each row, the first number is the array size, the rest are the 5 timings.

{{ {20000, {0.000799, 0.000801, 0.000797, 0.000804, 0.000800}}, 
   {40000, {0.00224, 0.00225, 0.00223, 0.00224, 0.00223}}, 
   {60000, {0.00226, 0.00226, 0.00227, 0.00226, 0.00226}}, 
   {80000, {0.00229, 0.00229, 0.00229, 0.00229, 0.00229}}, 
  {100000, {0.00087, 0.000868, 0.000874, 0.000873, 0.00089}}, 
  {120000, {0.00235, 0.00236, 0.00235, 0.00236, 0.00235}}, 
  {140000, {0.00240, 0.00240, 0.00240, 0.00239, 0.00240}}, 
  {160000, {0.00245, 0.00246, 0.00245, 0.00246, 0.00245}}, 
  {180000, {0.00097, 0.000964, 0.000965, 0.000961, 0.000963}}, 
  {200000, {0.00255, 0.00258, 0.00254, 0.00256, 0.00254}}}, 

   {{20000, {0.00224, 0.00224, 0.00224, 0.00220, 0.00221}}, 
    {40000, {0.00224, 0.00224, 0.00223, 0.00224, 0.00223}},
    {60000, {0.00227, 0.00227, 0.00227, 0.00226, 0.00227}},
    {80000, {0.00234, 0.00235, 0.00233, 0.00230, 0.00230}},
   {100000, {0.00233, 0.00232, 0.00232, 0.00233, 0.00233}}, 
   {120000, {0.00234, 0.00238, 0.00235, 0.00239, 0.00237}}, 
   {140000, {0.00238, 0.00238, 0.00238, 0.00238, 0.00238}}, 
   {160000, {0.00247, 0.00245, 0.00245, 0.00246, 0.00245}}, 
   {180000, {0.000965, 0.000961, 0.000962, 0.000967, 0.000968}}, 
   {200000, {0.00254, 0.00259, 0.00255, 0.00254, 0.00254}}}}

Things to notice:

  • The 5 timings for the same array are always consistent.
  • The timings are generally proportional to the array size.
  • However, I see some "fast" (about 0.0008 s) and some "slow" (about 0.002 s) timings.
  • Between the two runs, it is not always the same array size that is fast. Look at 20,000, 80,000 and 180,000 in the first run and 180,000 in the second run. These change randomly between runs.
  • "Slow" and "fast" differ by a very significant factor of about 2.5-2.6.

Why do I see this switching between fast and slow timing? What is causing it? It prevents me from getting consistent benchmark results.

enter image description here


The measurements were done with Mathematica 11.1.0 on a 2014 MacBook Pro (Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz, 4 cores) connected to AC power. Turbo Boost is disabled using this tool.

I closed other programs as much as possible (but there are always many background tasks on a modern OS).


You might think that such short timings are not relevant in real-world applications. But remember that RepeatedTiming repeats the operation enough times to run for at least a second. I still get some fast timings if I increase this to as many as 15 seconds, or if I run the test many times consecutively.

I compared RepeatedTiming with AbsoluteTiming@Do[..., {bigNumber}], and there is no difference (other than the occasional fast-slow switching).

I noticed that longer arrays are less likely to produce fast timings than shorter ones, but I am not sure. I also noticed that running a long one tends to cause the subsequent short one to be slow again. Due to the fickle nature of the results, it is hard to be sure about these things.


On first sight, this may not look like a Mathematica question. But benchmarking is hard, and many things can go wrong. If I post it on another site, people may rightfully suspect that it is something specific to Mathematica that is causing it.

I believe that vector arithmetic is parallelized in Mathematica. There is an interesting talk here by Mark Sofroniou from WTC2016, which also discusses how the stragety used to distribute parlallel threads between cores can have a significant impact on performance.

POSTED BY: Szabolcs Horvát
Answer
2 months ago

Hi I got these results. Not much difference because they are all slow (although 0.0046 is slowest) compared to your beast machine ;) Win10 MM11.1

{{{20000,{0.0035,0.0037,0.0035,0.0036,0.0037}}, {40000,{0.0036,0.0034,0.0036,0.0035,0.0039}}, {60000,{0.0041,0.0040,0.0039,0.0036,0.0037}}, {80000,{0.0036,0.0037,0.0037,0.0036,0.0038}}, {100000,{0.00362,0.0038,0.0037,0.0042,0.0045}}, {120000,{0.0040,0.0038,0.0038,0.0045,0.0038}}, {140000,{0.0038,0.0038,0.0044,0.0038,0.0038}}, {160000,{0.0039,0.0038,0.0041,0.0039,0.0041}}, {180000,{0.0042,0.0040,0.0040,0.0040,0.0039}}, {200000,{0.0042,0.0040,0.0040,0.0043,0.0041}}},

{{20000,{0.0034,0.0035,0.0034,0.0034,0.0035}}, {40000,{0.0036,0.0036,0.0035,0.0035,0.0041}}, {60000,{0.0040,0.0037,0.0037,0.0036,0.0037}}, {80000,{0.0037,0.0037,0.0037,0.0037,0.0037}}, {100000,{0.0039,0.0038,0.0038,0.0037,0.0044}}, {120000,{0.0046,0.0046,0.0038,0.0038,0.0037}}, {140000,{0.0040,0.0039,0.0038,0.0042,0.0040}}, {160000,{0.0039,0.0039,0.0038,0.0039,0.0039}}, {180000,{0.0040,0.0040,0.0040,0.0040,0.0040}}, {200000,{0.0041,0.0041,0.0040,0.0043,0.0041}}}}

POSTED BY: l van Veen
Answer
2 months ago
  • First thought: RepeatedTiming uses a trimmed mean, such that outliers (either fast or slow) are removed, such that catching outliers might be even harder...
  • Second thought: 2 million floats (16MB?) will be generally be larger than the CPU cache which is generally, 4, 6 of 8 MB for desktop/laptop machines. Could that influence it? If it needs just 2 full caches, or if other processes have remnants, it needs 3 caches? just a thought, not an export on this...
POSTED BY: Sander Huisman
Answer
2 months ago

I slightly changed your code:

timings=Table[
    n=1000000+8000 j;
    SeedRandom[120];
    a=RandomReal[1,n];
    b=RandomReal[1,n];
    First@AbsoluteTiming[Do[a+b;,{5}]]
,
    {10}
,
    {j,20}
]
ListPlot[Flatten[timings],PlotRange->{0,All}]

giving:

enter image description here

very reproducible behavior... Not yet sure what the cause is though...

POSTED BY: Sander Huisman
Answer
2 months ago

Thanks! This change to the code is useful. The only problem is that I do not trust very short timings. As I remember, on Windows XP, the timer's resolution was more than 10 ms. This shouldn't be an issue on today's OSs, but short timings are still much more affected by fluctuations. Here's the same with more repetitions in the Do. The result is basically the same what you showed:

timings = Table[n = 1000000 + 8000 j;
  SeedRandom[120];
  a = RandomReal[1, n];
  b = RandomReal[1, n];
  First@AbsoluteTiming[Do[a + b;, {500}]], {10}, {j, 20}]
ListPlot[Flatten[timings], PlotRange -> {0, All}]

With Turbo Boost on:

enter image description here

With Turbo Boost off:

enter image description here

It is strange that in addition to the slow-fast pattern, each subsequent evaluation gets slower than the previous one.

POSTED BY: Szabolcs Horvát
Answer
2 months ago

That's because of the 1000000+ 8000*j right? the calculation becomes slightly bigger each time?

POSTED BY: Sander Huisman
Answer
2 months ago

Running Sanders code:

In[8]:= $Version

Out[8]= "11.1.0 for Microsoft Windows (64-bit) (March 13, 2017)"

enter image description here

enter image description here

POSTED BY: David Keith
Answer
2 months ago

Your Xeon CPU probably has more cache, could you try:

n=1000000+20000 j;

?

POSTED BY: Sander Huisman
Answer
2 months ago

With n=1000000+20000 j;

enter image description here

Why would more cache always give me the slow timings? Or are these all fast timings on a slower system?

POSTED BY: David Keith
Answer
2 months ago

UPDATE 2:

I did a comparison with MATLAB 2017a. I believe that both Mathematica and MATLAB use the MKL, and this should perform similarly on vector arithmetic.

a=rand(1,1000000);
b=rand(1,1000000);

tic; for i=1:1000
c=a+b;
end;toc

The timing (divided by 1000 to account for the repetitions) is 0.00065 s. This happens to be the same as the "fast" timing I get from Mathematica. But in MATLAB, I get the "fast" timing consistently.

This suggests that what I see is "a bug" in Mathematica, or rather that there is a potential for a speedup by a factor of 2.5 or more on my hardware and OS.

POSTED BY: Szabolcs Horvát
Answer
2 months ago

THIS IS NOT MY POST! THE FORUM HAS GONE FUNNY!

UPDATE 2:

I did a comparison with MATLAB 2017a. I believe that both Mathematica and MATLAB use the MKL, and this should perform similarly on vector arithmetic.

a=rand(1,1000000);
b=rand(1,1000000);

tic; for i=1:1000
c=a+b;
end;toc

The timing (divided by 1000 to account for the repetitions) is 0.00065 s. This happens to be the same as the "fast" timing I get from Mathematica. But in MATLAB, I get the "fast" timing consistently.

This suggests that what I see is "a bug" in Mathematica, or rather that there is a potential for a speedup by a factor of 2.5 or more on my hardware and OS.

POSTED BY: David Keith
Answer
2 months ago

The forum has a notoriously droll sense of humor.

POSTED BY: Daniel Lichtblau
Answer
2 months ago

This has happened to me before. I tried to make a new post, and instead it duplicated the previous one.

POSTED BY: Szabolcs Horvát
Answer
2 months ago

Hi Szabolcs, I have no solution but I did find some consistent behaviour on my system.(win10 mm11.1). I timed just the same a+b a lot of times and the first run is not always faster than the subsequent runs. For small arrays the first run is slower and for larger it is faster. On my computer it seems to "switch" at an array size of 130564. I ran it multiple times and it seems "consistent". Perhaps this helps in the troubleshooting process.

g[seed_, array_] :=
 Block[{out = 
    Table[SeedRandom[seed]; a = RandomReal[1, array]; 
     b = RandomReal[1, array]; 
     Reap[For[i = 0, i < 200, i++, Sow[AbsoluteTiming[(a + b);]]]][[2,
        1, All, 1]], {30}]}, Map[Mean, out\[Transpose]]]
list = Table[
  Block[{data = g[12567, 130530 + x]}, 
   Show[ListPlot[data, PlotRange -> {{0, 200}, {0, 0.0005}}, 
     AxesOrigin -> {0, 0}, AspectRatio -> Full, 
     PlotLabel -> 130540 + x], 
    Graphics[{PointSize[Large], Red, Point[{1, data[[1]]}]}]]], {x, 0,
    35}]

The run time also rises at that point. enter image description here

POSTED BY: l van Veen
Answer
2 months ago

Group Abstract Group Abstract