
9845 Views | 13 Replies | 10 Total Likes

Benchmarking: unexplained switching between high and low performance

Posted 7 years ago

This is a cross-post of a question I asked today on StackExchange.


tl;dr I am trying to accurately benchmark some vectorized operations, and compare them between systems. But benchmarking is hard to do well, and I am getting inconsistent results: performance is switching, apparently randomly, between "slow" and "fast". Why?

Here is some code that benchmarks adding two packed arrays of size n, where n is just above a million. The timing is measured 5 times, to ensure consistency, then n is increased a bit, then the summation is timed again, etc. The whole benchmark is repeated twice.

Table[
 n = 1000000 + k;
 SeedRandom[120];
 a = RandomReal[1, n];
 b = RandomReal[1, n];
 {k, Table[First@RepeatedTiming[a + b;], {5}]},
 {2}, {k, 20000, 200000, 20000}
 ]
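For readers outside Mathematica, the same experiment can be sketched with NumPy. This is an illustrative analogue, not the original test: the sizes and the seed mirror the code above, while the `benchmark` function name and the 20 inner loops per measurement are my own choices. (The results quoted below are from the Mathematica code, not from this sketch.)

```python
# Sketch: NumPy analogue of the Mathematica benchmark above.
# Times a + b for float arrays just above one million elements,
# five measurements per size, two full passes.
import timeit
import numpy as np

def benchmark(passes=2, sizes=range(20000, 200001, 20000), reps=5):
    results = []
    for _ in range(passes):
        run = []
        for k in sizes:
            n = 1_000_000 + k
            rng = np.random.default_rng(120)   # fixed seed, like SeedRandom[120]
            a = rng.random(n)
            b = rng.random(n)
            # mean time of a single a + b, measured `reps` times
            timings = [timeit.timeit(lambda: a + b, number=20) / 20
                       for _ in range(reps)]
            run.append((k, timings))
        results.append(run)
    return results

results = benchmark()
```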

The results are below. In each row, the first number is the array size, the rest are the 5 timings.

{{ {20000, {0.000799, 0.000801, 0.000797, 0.000804, 0.000800}}, 
   {40000, {0.00224, 0.00225, 0.00223, 0.00224, 0.00223}}, 
   {60000, {0.00226, 0.00226, 0.00227, 0.00226, 0.00226}}, 
   {80000, {0.00229, 0.00229, 0.00229, 0.00229, 0.00229}}, 
  {100000, {0.00087, 0.000868, 0.000874, 0.000873, 0.00089}}, 
  {120000, {0.00235, 0.00236, 0.00235, 0.00236, 0.00235}}, 
  {140000, {0.00240, 0.00240, 0.00240, 0.00239, 0.00240}}, 
  {160000, {0.00245, 0.00246, 0.00245, 0.00246, 0.00245}}, 
  {180000, {0.00097, 0.000964, 0.000965, 0.000961, 0.000963}}, 
  {200000, {0.00255, 0.00258, 0.00254, 0.00256, 0.00254}}}, 

   {{20000, {0.00224, 0.00224, 0.00224, 0.00220, 0.00221}}, 
    {40000, {0.00224, 0.00224, 0.00223, 0.00224, 0.00223}},
    {60000, {0.00227, 0.00227, 0.00227, 0.00226, 0.00227}},
    {80000, {0.00234, 0.00235, 0.00233, 0.00230, 0.00230}},
   {100000, {0.00233, 0.00232, 0.00232, 0.00233, 0.00233}}, 
   {120000, {0.00234, 0.00238, 0.00235, 0.00239, 0.00237}}, 
   {140000, {0.00238, 0.00238, 0.00238, 0.00238, 0.00238}}, 
   {160000, {0.00247, 0.00245, 0.00245, 0.00246, 0.00245}}, 
   {180000, {0.000965, 0.000961, 0.000962, 0.000967, 0.000968}}, 
   {200000, {0.00254, 0.00259, 0.00255, 0.00254, 0.00254}}}}

Things to notice:

  • The 5 timings for the same array are always consistent.
  • The timings are generally proportional to the array size.
  • However, I see some "fast" (about 0.0008 s) and some "slow" (about 0.002 s) timings.
  • Between the two runs, it is not always the same array size that is fast. Look at 20,000, 100,000 and 180,000 in the first run and 180,000 in the second run. These change randomly between runs.
  • "Slow" and "fast" differ by a very significant factor of about 2.5-2.6.

Why do I see this switching between fast and slow timing? What is causing it? It prevents me from getting consistent benchmark results.

[attached plot of the timings]


The measurements were done with Mathematica 11.1.0 on a 2014 MacBook Pro (Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz, 4 cores) connected to AC power. Turbo Boost is disabled using this tool.

I closed other programs as much as possible (but there are always many background tasks on a modern OS).


You might think that such short timings are not relevant in real-world applications. But remember that RepeatedTiming repeats the operation enough times to run for at least a second. I still get some fast timings if I increase this to as many as 15 seconds, or if I run the test many times consecutively.
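The strategy RepeatedTiming uses, as described above (repeat the operation until at least a second has elapsed, then average), can be imitated in any language. A minimal Python sketch of the idea, where the function name `repeated_timing` and the test arrays are my own, and RepeatedTiming's trimming of outliers is omitted:

```python
# Sketch of a RepeatedTiming-style measurement: repeat the operation
# until at least `target` seconds have elapsed in total, then report
# the mean time per evaluation.
import time
import numpy as np

def repeated_timing(f, target=1.0):
    count = 0
    start = time.perf_counter()
    while True:
        f()
        count += 1
        elapsed = time.perf_counter() - start
        if elapsed >= target:
            return elapsed / count

a = np.random.default_rng(120).random(1_020_000)
b = np.random.default_rng(121).random(1_020_000)
per_call = repeated_timing(lambda: a + b)   # seconds per a + b
```

Because many evaluations are averaged, a single-millisecond operation still gets roughly a thousand samples per measurement, which is why such short timings can be stable at all.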

I compared RepeatedTiming with AbsoluteTiming@Do[..., {bigNumber}], and there is no difference (other than the occasional fast-slow switching).

I noticed that longer arrays are less likely to produce fast timings than shorter ones, but I am not sure. I also noticed that running a long one tends to cause the subsequent short one to be slow again. Due to the fickle nature of the results, it is hard to be sure about these things.


At first sight, this may not look like a Mathematica question. But benchmarking is hard, and many things can go wrong. If I post it on another site, people may rightfully suspect that something specific to Mathematica is causing it.

I believe that vector arithmetic is parallelized in Mathematica. There is an interesting talk by Mark Sofroniou from WTC2016, which also discusses how the strategy used to distribute parallel threads between cores can have a significant impact on performance.

POSTED BY: Szabolcs Horvát

Hi Szabolcs, I have no solution, but I did find some consistent behaviour on my system (Win10, MM 11.1). I timed the same a + b many times, and the first run is not always faster than the subsequent runs. For small arrays the first run is slower, and for larger ones it is faster. On my computer it seems to "switch" at an array size of 130564. I ran it multiple times and it seems consistent. Perhaps this helps in the troubleshooting process.

g[seed_, array_] := Block[{out = Table[
      SeedRandom[seed];
      a = RandomReal[1, array];
      b = RandomReal[1, array];
      Reap[For[i = 0, i < 200, i++, Sow[AbsoluteTiming[(a + b);]]]][[2, 1, All, 1]],
      {30}]},
   Map[Mean, out\[Transpose]]]

list = Table[
  Block[{data = g[12567, 130530 + x]},
   Show[
    ListPlot[data, PlotRange -> {{0, 200}, {0, 0.0005}},
     AxesOrigin -> {0, 0}, AspectRatio -> Full,
     PlotLabel -> 130530 + x],
    Graphics[{PointSize[Large], Red, Point[{1, data[[1]]}]}]]],
  {x, 0, 35}]

The run time also rises at that point. [attached plot]

POSTED BY: l van Veen

This has happened to me before. I tried to make a new post, and instead it duplicated the previous one.

POSTED BY: Szabolcs Horvát

That's because of the 1000000 + 8000 j, right? The calculation becomes slightly bigger each time?

POSTED BY: Sander Huisman

The forum has a notoriously droll sense of humor.

POSTED BY: Daniel Lichtblau
Posted 7 years ago

With n=1000000+20000 j;

[attached timing plot]

Why would more cache always give me the slow timings? Or are these all fast timings on a slower system?

POSTED BY: David Keith
Posted 7 years ago

THIS IS NOT MY POST! THE FORUM HAS GONE FUNNY!

UPDATE 2:

I did a comparison with MATLAB 2017a. I believe that both Mathematica and MATLAB use the MKL, and this should perform similarly on vector arithmetic.

a=rand(1,1000000);
b=rand(1,1000000);

tic
for i = 1:1000
    c = a + b;
end
toc

The timing (divided by 1000 to account for the repetitions) is 0.00065 s. This happens to be the same as the "fast" timing I get from Mathematica. But in MATLAB, I get the "fast" timing consistently.

This suggests that what I see is "a bug" in Mathematica, or rather that there is a potential for a speedup by a factor of 2.5 or more on my hardware and OS.

POSTED BY: David Keith

Your Xeon CPU probably has more cache; could you try:

n=1000000+20000 j;

?

POSTED BY: Sander Huisman

UPDATE 2:

I did a comparison with MATLAB 2017a. I believe that both Mathematica and MATLAB use the MKL, and this should perform similarly on vector arithmetic.

a=rand(1,1000000);
b=rand(1,1000000);

tic
for i = 1:1000
    c = a + b;
end
toc

The timing (divided by 1000 to account for the repetitions) is 0.00065 s. This happens to be the same as the "fast" timing I get from Mathematica. But in MATLAB, I get the "fast" timing consistently.

This suggests that what I see is "a bug" in Mathematica, or rather that there is a potential for a speedup by a factor of 2.5 or more on my hardware and OS.

POSTED BY: Szabolcs Horvát
Posted 7 years ago

Running Sander's code:

In[8]:= $Version

Out[8]= "11.1.0 for Microsoft Windows (64-bit) (March 13, 2017)"

[attached timing plots]

POSTED BY: David Keith

Thanks! This change to the code is useful. The only problem is that I do not trust very short timings. As I remember, on Windows XP the timer's resolution was more than 10 ms. That shouldn't be an issue on today's OSs, but short timings are still much more affected by fluctuations. Here's the same with more repetitions in the Do. The result is basically the same as what you showed:

timings = Table[n = 1000000 + 8000 j;
  SeedRandom[120];
  a = RandomReal[1, n];
  b = RandomReal[1, n];
  First@AbsoluteTiming[Do[a + b;, {500}]], {10}, {j, 20}]
ListPlot[Flatten[timings], PlotRange -> {0, All}]

With Turbo Boost on: [attached plot]

With Turbo Boost off: [attached plot]

It is strange that in addition to the slow-fast pattern, each subsequent evaluation gets slower than the previous one.

POSTED BY: Szabolcs Horvát

I slightly changed your code:

timings = Table[
   n = 1000000 + 8000 j;
   SeedRandom[120];
   a = RandomReal[1, n];
   b = RandomReal[1, n];
   First@AbsoluteTiming[Do[a + b;, {5}]],
   {10}, {j, 20}]
ListPlot[Flatten[timings], PlotRange -> {0, All}]

giving:

[attached plot]

Very reproducible behavior... Not yet sure what the cause is, though...

POSTED BY: Sander Huisman
  • First thought: RepeatedTiming uses a trimmed mean, so outliers (either fast or slow) are removed, which might make catching them even harder...
  • Second thought: 2 million floats (16 MB?) will generally be larger than the CPU cache, which is typically 4, 6 or 8 MB for desktop/laptop machines. Could that influence it? If it needs just 2 full caches, or if other processes have remnants, it needs 3 caches? Just a thought, not an expert on this...
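The footprint arithmetic behind the second thought can be checked directly. A back-of-envelope sketch (the helper name and the "two inputs plus one result, no temporaries" assumption are mine):

```python
# Back-of-envelope: working-set size of the a + b benchmark.
# Two input arrays plus the result, at 8 bytes per machine float.
def working_set_mb(n, arrays=3, bytes_per_double=8):
    return arrays * n * bytes_per_double / 2**20

inputs_only = working_set_mb(1_020_000, arrays=2)   # a and b: ~15.6 MB
with_result = working_set_mb(1_020_000)             # a, b and a + b: ~23.3 MB
```

Either way the working set is well above a typical 4-8 MB last-level cache, so the operation is plausibly memory-bandwidth-bound rather than cache-resident, which is consistent with the cache hypothesis above.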
POSTED BY: Sander Huisman

Hi, I got these results. Not much difference, because they are all slow (although 0.0046 is the slowest) compared to your beast machine ;) Win10, MM 11.1.

{{ {20000, {0.0035, 0.0037, 0.0035, 0.0036, 0.0037}},
   {40000, {0.0036, 0.0034, 0.0036, 0.0035, 0.0039}},
   {60000, {0.0041, 0.0040, 0.0039, 0.0036, 0.0037}},
   {80000, {0.0036, 0.0037, 0.0037, 0.0036, 0.0038}},
  {100000, {0.00362, 0.0038, 0.0037, 0.0042, 0.0045}},
  {120000, {0.0040, 0.0038, 0.0038, 0.0045, 0.0038}},
  {140000, {0.0038, 0.0038, 0.0044, 0.0038, 0.0038}},
  {160000, {0.0039, 0.0038, 0.0041, 0.0039, 0.0041}},
  {180000, {0.0042, 0.0040, 0.0040, 0.0040, 0.0039}},
  {200000, {0.0042, 0.0040, 0.0040, 0.0043, 0.0041}}},

 { {20000, {0.0034, 0.0035, 0.0034, 0.0034, 0.0035}},
   {40000, {0.0036, 0.0036, 0.0035, 0.0035, 0.0041}},
   {60000, {0.0040, 0.0037, 0.0037, 0.0036, 0.0037}},
   {80000, {0.0037, 0.0037, 0.0037, 0.0037, 0.0037}},
  {100000, {0.0039, 0.0038, 0.0038, 0.0037, 0.0044}},
  {120000, {0.0046, 0.0046, 0.0038, 0.0038, 0.0037}},
  {140000, {0.0040, 0.0039, 0.0038, 0.0042, 0.0040}},
  {160000, {0.0039, 0.0039, 0.0038, 0.0039, 0.0039}},
  {180000, {0.0040, 0.0040, 0.0040, 0.0040, 0.0040}},
  {200000, {0.0041, 0.0041, 0.0040, 0.0043, 0.0041}}}}

POSTED BY: l van Veen