# Benchmarking: unexplained switching between high and low performance

Posted 1 year ago
2547 Views
|
13 Replies
|
10 Total Likes
|
This is a cross-post of a question I asked today on StackExchange.

**tl;dr:** I am trying to accurately benchmark some vectorized operations and compare them between systems. But benchmarking is hard to do well, and I am getting inconsistent results: performance switches, apparently at random, between "slow" and "fast". Why?

Here is some code that benchmarks adding two packed arrays of size `n`, where `n` is just above a million. The timing is measured 5 times, to ensure consistency, then `n` is increased a bit, then the summation is timed again, etc. The whole benchmark is repeated twice.

```
Table[
 n = 1000000 + k;
 SeedRandom[120];
 a = RandomReal[1, n];
 b = RandomReal[1, n];
 {k, Table[First@RepeatedTiming[a + b;], {5}]},
 {2},
 {k, 20000, 200000, 20000}
]
```

The results are below. In each row, the first number is the size increment `k` (so the array size is `1000000 + k`); the rest are the 5 timings.

```
{{{20000, {0.000799, 0.000801, 0.000797, 0.000804, 0.000800}},
  {40000, {0.00224, 0.00225, 0.00223, 0.00224, 0.00223}},
  {60000, {0.00226, 0.00226, 0.00227, 0.00226, 0.00226}},
  {80000, {0.00229, 0.00229, 0.00229, 0.00229, 0.00229}},
  {100000, {0.00087, 0.000868, 0.000874, 0.000873, 0.00089}},
  {120000, {0.00235, 0.00236, 0.00235, 0.00236, 0.00235}},
  {140000, {0.00240, 0.00240, 0.00240, 0.00239, 0.00240}},
  {160000, {0.00245, 0.00246, 0.00245, 0.00246, 0.00245}},
  {180000, {0.00097, 0.000964, 0.000965, 0.000961, 0.000963}},
  {200000, {0.00255, 0.00258, 0.00254, 0.00256, 0.00254}}},
 {{20000, {0.00224, 0.00224, 0.00224, 0.00220, 0.00221}},
  {40000, {0.00224, 0.00224, 0.00223, 0.00224, 0.00223}},
  {60000, {0.00227, 0.00227, 0.00227, 0.00226, 0.00227}},
  {80000, {0.00234, 0.00235, 0.00233, 0.00230, 0.00230}},
  {100000, {0.00233, 0.00232, 0.00232, 0.00233, 0.00233}},
  {120000, {0.00234, 0.00238, 0.00235, 0.00239, 0.00237}},
  {140000, {0.00238, 0.00238, 0.00238, 0.00238, 0.00238}},
  {160000, {0.00247, 0.00245, 0.00245, 0.00246, 0.00245}},
  {180000, {0.000965, 0.000961, 0.000962, 0.000967, 0.000968}},
  {200000, {0.00254, 0.00259, 0.00255, 0.00254, 0.00254}}}}
```

Things to notice:

- The 5 timings for the same array are always consistent.
- The timings are generally proportional to the array size.
- However, I see some "fast" (about 0.0008 s) and some "slow" (about 0.002 s) timings.
- Between the two runs, it is not always the same array size that is fast. Look at 20,000, 100,000 and 180,000 in the first run and 180,000 in the second run. These change randomly between runs.
- "Slow" and "fast" differ by a very significant factor of about 2.5-2.6.

Why do I see this switching between fast and slow timings? What is causing it? It prevents me from getting consistent benchmark results.

The measurements were done with Mathematica 11.1.0 on a 2014 MacBook Pro (Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz, 4 cores) connected to AC power. Turbo Boost is disabled using this tool. I closed other programs as much as possible (but there are always many background tasks on a modern OS).

You might think that such short timings are not relevant in real-world applications. But remember that `RepeatedTiming` repeats the operation enough times to run for at least a second. I still get some fast timings if I increase this to as many as 15 seconds, or if I run the test many times consecutively. I compared `RepeatedTiming` with `AbsoluteTiming@Do[..., {bigNumber}]`, and there is no difference (other than the occasional fast-slow switching).

I noticed that longer arrays seem less likely to produce fast timings than shorter ones, but I am not sure. I also noticed that running a long one tends to make the subsequent short one slow again. Due to the fickle nature of the results, it is hard to be sure about these things.

At first sight, this may not look like a Mathematica question. But benchmarking is hard, and many things can go wrong. If I posted it on another site, people might rightfully suspect that something specific to Mathematica is causing it. I believe that vector arithmetic is parallelized in Mathematica.
There is an interesting talk here by Mark Sofroniou from WTC2016, which also discusses how the strategy used to distribute parallel threads between cores can have a significant impact on performance.
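As a cross-check outside Mathematica, here is a minimal sketch in plain Python (a hypothetical analogue of the benchmark above, not anyone's actual code; the name `raw_timings` is made up). It keeps every repetition's wall-clock time, so a bimodal fast/slow split would show up directly in the raw samples rather than being collapsed into a single summary figure:

```python
import random
import time

def raw_timings(n, reps=20):
    """Time an elementwise add of two length-n lists, returning raw samples."""
    random.seed(120)                        # fixed seed, like SeedRandom[120]
    a = [random.random() for _ in range(n)]
    b = [random.random() for _ in range(n)]
    samples = []
    for _ in range(reps):
        t0 = time.perf_counter()
        [x + y for x, y in zip(a, b)]       # scalar loop, not SIMD; methodology only
        samples.append(time.perf_counter() - t0)
    return samples

s = sorted(raw_timings(100_000))
print("min:", s[0], "median:", s[len(s) // 2], "max:", s[-1])
```

The point is methodological: reporting min, median and max of the raw samples makes mode-switching visible, whereas one averaged number hides it.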
13 Replies
Posted 1 year ago
Hi, I got these results. Not much difference, because they are all slow compared to your beast machine ;) (although 0.0046 is the slowest). Win10, MM 11.1:

```
{{{20000, {0.0035, 0.0037, 0.0035, 0.0036, 0.0037}},
  {40000, {0.0036, 0.0034, 0.0036, 0.0035, 0.0039}},
  {60000, {0.0041, 0.0040, 0.0039, 0.0036, 0.0037}},
  {80000, {0.0036, 0.0037, 0.0037, 0.0036, 0.0038}},
  {100000, {0.00362, 0.0038, 0.0037, 0.0042, 0.0045}},
  {120000, {0.0040, 0.0038, 0.0038, 0.0045, 0.0038}},
  {140000, {0.0038, 0.0038, 0.0044, 0.0038, 0.0038}},
  {160000, {0.0039, 0.0038, 0.0041, 0.0039, 0.0041}},
  {180000, {0.0042, 0.0040, 0.0040, 0.0040, 0.0039}},
  {200000, {0.0042, 0.0040, 0.0040, 0.0043, 0.0041}}},
 {{20000, {0.0034, 0.0035, 0.0034, 0.0034, 0.0035}},
  {40000, {0.0036, 0.0036, 0.0035, 0.0035, 0.0041}},
  {60000, {0.0040, 0.0037, 0.0037, 0.0036, 0.0037}},
  {80000, {0.0037, 0.0037, 0.0037, 0.0037, 0.0037}},
  {100000, {0.0039, 0.0038, 0.0038, 0.0037, 0.0044}},
  {120000, {0.0046, 0.0046, 0.0038, 0.0038, 0.0037}},
  {140000, {0.0040, 0.0039, 0.0038, 0.0042, 0.0040}},
  {160000, {0.0039, 0.0039, 0.0038, 0.0039, 0.0039}},
  {180000, {0.0040, 0.0040, 0.0040, 0.0040, 0.0040}},
  {200000, {0.0041, 0.0041, 0.0040, 0.0043, 0.0041}}}}
```
Posted 1 year ago
First thought: `RepeatedTiming` uses a trimmed mean, so outliers (either fast or slow) are removed, which might make catching them even harder...

Second thought: 2 million floats (16 MB?) will generally be larger than the CPU cache, which is typically 4, 6 or 8 MB for desktop/laptop machines. Could that influence it? Does it need just 2 full caches, or, if other processes have left remnants in there, 3 caches? Just a thought; I'm not an expert on this...
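The first thought can be illustrated with a toy calculation in plain Python (the exact trimming `RepeatedTiming` applies is not shown here; this `trimmed_mean` is only an illustrative stand-in, not Mathematica's actual algorithm):

```python
def trimmed_mean(xs, frac=0.1):
    """Mean after dropping the lowest and highest `frac` of the samples."""
    xs = sorted(xs)
    k = int(len(xs) * frac)
    kept = xs[k:len(xs) - k] if k else xs
    return sum(kept) / len(kept)

# 18 "slow" repetitions (~2.3 ms) and 2 "fast" ones (~0.8 ms)
timings = [0.0023] * 18 + [0.0008] * 2

print(trimmed_mean(timings))  # ~0.0023: the fast samples are trimmed away
print(min(timings))           # 0.0008: the minimum still reveals them
```

So if the machine occasionally runs the sum in "fast" mode, a trimmed mean can report a purely "slow" figure, which is consistent with the clean-looking but bimodal results above.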
Posted 1 year ago
I slightly changed your code:

```
timings = Table[
  n = 1000000 + 8000 j;
  SeedRandom[120];
  a = RandomReal[1, n];
  b = RandomReal[1, n];
  First@AbsoluteTiming[Do[a + b;, {5}]],
  {10}, {j, 20}
]
ListPlot[Flatten[timings], PlotRange -> {0, All}]
```

giving: (plot omitted)

Very reproducible behavior... Not yet sure what the cause is, though...
Posted 1 year ago
Thanks! This change to the code is useful. The only problem is that I do not trust very short timings. As I remember, the timer resolution on Windows XP was more than 10 ms. This shouldn't be an issue on today's OSes, but short timings are still much more affected by fluctuations. Here's the same with more repetitions in the `Do`; the result is basically the same as what you showed:

```
timings = Table[
  n = 1000000 + 8000 j;
  SeedRandom[120];
  a = RandomReal[1, n];
  b = RandomReal[1, n];
  First@AbsoluteTiming[Do[a + b;, {500}]],
  {10}, {j, 20}
]
ListPlot[Flatten[timings], PlotRange -> {0, All}]
```

With Turbo Boost on: (plot omitted)

With Turbo Boost off: (plot omitted)

It is strange that, in addition to the slow-fast pattern, each subsequent evaluation gets slower than the previous one.
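As an aside, the timer-resolution worry is easy to quantify. A small stand-alone Python sketch (not part of the benchmark; `estimate_resolution` is a made-up helper) estimates the effective resolution of a wall clock by looking for the smallest observable tick between back-to-back readings:

```python
import time

def estimate_resolution(samples=10_000):
    """Smallest nonzero difference between consecutive clock readings."""
    diffs = []
    prev = time.perf_counter()
    for _ in range(samples):
        now = time.perf_counter()
        if now != prev:
            diffs.append(now - prev)
        prev = now
    return min(diffs)

print(f"approximate timer resolution: {estimate_resolution():.2e} s")
```

On current OSes this typically comes out in the nanosecond-to-microsecond range, far below the >10 ms granularity of the old Windows XP timer, so resolution alone cannot explain millisecond-scale fast/slow switching.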
Posted 1 year ago
That's because of the `1000000 + 8000 j`, right? The calculation becomes slightly bigger each time.
Posted 1 year ago
Running Sander's code:

```
In[8]:= $Version
Out[8]= "11.1.0 for Microsoft Windows (64-bit) (March 13, 2017)"
```
Posted 1 year ago
Your Xeon CPU probably has more cache; could you try `n = 1000000 + 20000 j;`?
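The cache argument amounts to simple arithmetic: two packed arrays of `n` machine doubles occupy 16n bytes. A small Python sketch (the cache sizes quoted in the comment are typical published figures, not measured values):

```python
def working_set_mb(n, arrays=2, bytes_per_elem=8):
    """Total size in MiB of `arrays` packed arrays of n 8-byte doubles."""
    return arrays * bytes_per_elem * n / 2**20

for j in (1, 10, 20):
    n = 1_000_000 + 20_000 * j
    print(f"n = {n:>9,}: {working_set_mb(n):5.1f} MB")

# Typical last-level cache sizes: laptop i7 around 4-8 MB (the i7-4870HQ
# reportedly has a 6 MB L3), desktop/server Xeons often 8-20+ MB. At 15-22 MB
# the working set of a + b never fits in L3 on either machine anyway.
```

So a bigger L3 alone would not obviously flip the result between "fast" and "slow"; the whole working set is out of cache in both cases.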
Posted 1 year ago
With `n = 1000000 + 20000 j;`: (plot omitted)

Why would more cache always give me the slow timings? Or are these all fast timings on a slower system?
Posted 1 year ago
UPDATE 2: I did a comparison with MATLAB 2017a. I believe that both Mathematica and MATLAB use the MKL, which should perform similarly on vector arithmetic.

```
a = rand(1,1000000);
b = rand(1,1000000);
tic; for i = 1:1000 c = a + b; end; toc
```

The timing (divided by 1000 to account for the repetitions) is 0.00065 s. This happens to be the same as the "fast" timing I get from Mathematica. But in MATLAB, I get the "fast" timing consistently.

This suggests that what I see is "a bug" in Mathematica, or rather that there is potential for a speedup by a factor of 2.5 or more on my hardware and OS.
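For what it's worth, the same cross-check can be run in NumPy as a hypothetical third data point (assuming NumPy is installed; note that NumPy's elementwise add uses its own vectorized loops rather than MKL, so this probes memory-bandwidth-bound addition, not the MKL path):

```python
import time
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility
a = rng.random(1_000_000)
b = rng.random(1_000_000)

reps = 1000
t0 = time.perf_counter()
for _ in range(reps):
    c = a + b                           # elementwise sum, as in the MATLAB loop
elapsed = (time.perf_counter() - t0) / reps
print(f"{elapsed:.6f} s per addition")
```

If this figure is also stable across runs, it would further suggest the fast/slow switching is specific to how the Mathematica kernel dispatches the operation, not to the hardware.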
Posted 1 year ago
THIS IS NOT MY POST! THE FORUM HAS GONE FUNNY!
Posted 1 year ago
 The forum has a notoriously droll sense of humor.
Hi Szabolcs, I have no solution, but I did find some consistent behaviour on my system (Win10, MM 11.1). I timed just the same a + b a lot of times, and the first run is not always faster than the subsequent runs. For small arrays the first run is slower, and for larger ones it is faster. On my computer it seems to "switch" at an array size of 130564. I ran it multiple times and it seems "consistent". Perhaps this helps in the troubleshooting process.

```
g[seed_, array_] := Block[{out = Table[
     SeedRandom[seed];
     a = RandomReal[1, array];
     b = RandomReal[1, array];
     Reap[For[i = 0, i < 200, i++, Sow[AbsoluteTiming[(a + b);]]]][[2, 1, All, 1]],
     {30}]},
  Map[Mean, out\[Transpose]]]

list = Table[
  Block[{data = g[12567, 130530 + x]},
   Show[
    ListPlot[data, PlotRange -> {{0, 200}, {0, 0.0005}},
     AxesOrigin -> {0, 0}, AspectRatio -> Full, PlotLabel -> 130530 + x],
    Graphics[{PointSize[Large], Red, Point[{1, data[[1]]}]}]]],
  {x, 0, 35}]
```

(Note: I made the `PlotLabel` use the same `130530 + x` offset as the data, so the label matches the array size.) The run time also rises at that point.