This article primarily focuses on how to enhance the performance of highly intensive computational tasks through minor techniques, including functional programming, kernel parallelization, compilation optimization, and ultimately, utilizing the capability of GPU acceleration using CUDA.
Linking Mathematica with C++ is also a great way to boost performance, but we will not cover it here, as Luyan Yu has authored another excellent article on this topic, available at:
Linking C++ code with Mathematica using LibraryLink - Luyan Yu
Let’s begin with a simple task, find out the sum of all number digits within 1,000,000.:
for example, sum of all number digits within 12 will be :
1+2+3+4+5+6+7+8+9+1+0+1+1+1+2=51.
To solve the problem itself is fairly easy, we first consider number in this form from 000,000 to 999,999, each digit (0 through 9) appears an equal number of times in each position. For each position, the sum of every 10 digits is 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 = 45. Thus, The sum of the digits of all numbers in any given position is 45 * 100,000 = 4,500,000. Then, since there are six positions, the total sum of all digits from 000,000 to 999,999 is 6 * 4,500,000 = 27,000,000. Then we add 1,000,000 as a final 1 to get the result 27,000,001.
The formula of total digits sum with in 10^x can be easily deduced, x*45*10^(x-1)+1
or
$$ f(x)=\frac{9}{2}\ 10^x x+1 $$
However, solving this specific problem is not our focus here, nor is the solving algorithm itself. This article aims to teach you techniques for enhancing the performance of any general function or algorithm you might write in the future. We will focus on optimizing performance by using different implementations of the same algorithm in Mathematica. Thus, the algorithm used for testing should be intentionally kept consistent for control variable purposes.
Let's start with the simplest and most basic approach: C-style coding. We will use this as our baseline performance test algorithm.
If you are a C programmer, you might write Mathematica code like this:
cDigitSum[nn_] := Module[{sum = 0, n = nn},
While[
n > 0
,
sum += Mod[n, 10];
n = Quotient[n, 10];
];
sum
]
AbsoluteTiming[
total = 0;
For[n = 0, n <= 10^6, n++,
total += cDigitSum[n]
];
total
]
{11.9438, 27000001}
Costs around 12 seconds, which is not very efficient. but don’t be discoursed. Let’s see what Mathematica can do.
Surprisingly, Mathematica has tons of magic built-in function, even include this one : DigitSum
. Even more surprising is that it is slower than our humble and trivial cDigitSum
function.
AbsoluteTiming[
total = 0;
For[n = 0, n <= 10^6, n++,
total += DigitSum[n]
];
total
]
{13.3594, 27000001}
It costs 13 seconds."
Congratulations, you have beaten Mathematica in this case. We will no longer use Mathematica's built-in DigitSum
function and will continue working based on our cDigitSum
function."
In Mathematica, using C-style coding is very inefficient. You can use a Do
statement instead of For
to make the code more elegant.
AbsoluteTiming[
total = 0;
Do[total += cDigitSum[n], {n, 10^6}];
total
]
{10.9844, 27000001}
What can make code even more elegant by using Map
operator form/@
, and postfix form //
.
cDigitSum/@Range[10^6]//Total//AbsoluteTiming
{10.7969, 27000001}
Functional programming usually makes code more elegant and easier to read and often provides better performance. However, since our performance here is limited by cDigitSum
, it doesn’t boost the speed significantly.
Nevertheless, Mathematica provides a way to boost functional-style code like Do
or Map
without modifying cDigitSum
itself: kernel parallelization. Simply by applying the function Parallelize
to the code, you can achieve a tremendous improvement in speed.
Warning:
Parallelization using Do
involves setting up shared variables, which can cause a deadly performance drop due to rapid, concurrent writes to the same variable from all parallel kernels. Do not try this practice.
However, I can demonstrate what happens if you do so. Please note that the quantity has been reduced to 10^4
to prevent the computation from hanging or crashing.
(*DO NOT TRY THIS CODE*)
AbsoluteTiming[
total = 0;
SetSharedVariable[total];
Parallelize@Do[total += cDigitSum[n], {n, 10^4}];
total]
(*DO NOT TRY THIS CODE*)
{20.9006, 180001}
Here is the correct implement using a parallelizedMap
, as you can see there is a vast performance boost, we successfully achieved 2.2 seconds
Parallelize[cDigitSum/@Range[10^6]]//Total//AbsoluteTiming
{2.26036, 27000001}
Or even using a parallelized Sum
:
Parallelize@Sum[cDigitSum[n],{n,10^6}]//AbsoluteTiming
{2.23265, 27000001}
We have achieved a amazing speed boost without change cDigitSum
at all.
The next technique is to tweak the implementation of function, without change the algorithm , what you can do is called Compile
and FunctionComile
.
By using Compile
we can write a compiled version of our cDigitSum
:
cDigitSumCompile = Compile[
{{nn, _Integer}}
,
Module[{sum = 0, digit, n = nn},
While[
n > 0
,
digit = Mod[n, 10];
sum += digit;
n = Quotient[n, 10];
];
sum
]
]
By default, the Compile
function will compile to Wolfram Virtual Machine
, it is equivalent to
cDigitSumCompileWVM = Compile[
{{nn, _Integer}}
,
Module[{sum = 0, digit, n = nn},
While[
n > 0
,
digit = Mod[n, 10];
sum += digit;
n = Quotient[n, 10];
];
sum
]
,
CompilationTarget -> "WVM"
]
Let’s test it
cDigitSumCompile/@Range[10^6]//Total//AbsoluteTiming
{0.327939, 27000001}
cDigitSumCompileWVM/@Range[10^6]//Total//AbsoluteTiming
{0.323876, 27000001}
Without using parallelization or modify the algorithm, we achieved 0.3 seconds !
From now, we will raise our n for this problem from 10^6
to 10^7
, let’s try it again:
cDigitSumCompileWVM/@Range[10^7]//Total//AbsoluteTiming
{3.68126, 315000001}
Instead of compile to Wolfram Virtual Machine
, what you can do to achieve further improvement is compile to C
cDigitSumCompileC = Compile[
{{nn, _Integer}}
,
Module[{sum = 0, digit, n = nn},
While[
n > 0
,
digit = Mod[n, 10];
sum += digit;
n = Quotient[n, 10];
];
sum
]
,
CompilationTarget -> "C"
]
cDigitSumCompileC/@Range[10^7]//Total//AbsoluteTiming
{2.36898, 315000001}
However, you can achieve more with compilation optimization options.
Please note that in this specific case, since our test function is relatively simple, most optimization options would have little effect or meaningless. Nevertheless, I will enable all of them for demonstration purpose.
cDigitSumCompileCOptimize = Compile[
{{nn, _Integer}}
,
Module[{sum = 0, digit, n = nn},
While[
n > 0
,
digit = Mod[n, 10];
sum += digit;
n = Quotient[n, 10];
];
sum
]
,
CompilationTarget -> "C"
,
CompilationOptions -> {"ExpressionOptimization" -> True}
,
RuntimeAttributes -> {Listable}
,
RuntimeOptions -> "Speed"
,
Parallelization -> True
]
cDigitSumCompileCOptimize/@Range[10^7]//Total//AbsoluteTiming
{2.03582, 315000001}
Another compile method is called FunctionCompile
,
cDigitSumFunctionCompile = FunctionCompile[
Function[Typed[nn, "MachineInteger"],
Module[{sum = 0, digit, n = nn},
While[
n > 0
,
digit = Mod[n, 10];
sum += digit;
n = Quotient[n, 10];
];
sum
]
]
]
We will not delve deeply into how to fine-tune FunctionCompile
as it is currently in the experimental testing phase. However, its default settings generally perform quite well.
cDigitSumFunctionCompile/@Range[10^7]//Total//AbsoluteTiming
{0.538493, 315000001}
Once again, we achieved a speed of within 1 second. From now on, we will increase our test quantities from 10^7
to 10^8
and try it again.:
cDigitSumFunctionCompile/@Range[10^8]//Total//AbsoluteTiming
{5.41658, 3600000001}
We can certainly integrate the techniques of kernel parallelization with Compile
or FunctionCompile
to further enhance the speed.
Parallelize[cDigitSumFunctionCompile/@Range[10^8]]//Total//AbsoluteTiming
{2.22658, 3600000001}
What next? Can we further improve performance without altering the algorithm?"
Yes, the answer is CUDA.
First, you will need to load CUDALink
package
Needs["CUDALink`"]
Let’s write a CUDA function code cudaDigitSumCode
first, this CUDA function takes a list of value as input, then for each element we apply our classic test algorithm to calculate the digit sum, then replace its original value. Notice here the mint
data or _Integer
in the specification of CUDAFunctionLoad
is Wolfram Language integer
cudaDigitSumCode = "
__global__ void cudaDigitSum(mint *inout) {
mint index = threadIdx.x + blockIdx.x * blockDim.x;
mint n = inout[index];
mint sum = 0;
while (n > 0) {
sum += n % 10;
n /= 10;
}
inout[index] = sum;
}
";
we can then load this CUDA code by using CUDAFunctionLoad
to get our cudaDigitSum
function.
cudaDigitSum = CUDAFunctionLoad[cudaDigitSumCode,
"cudaDigitSum", {{_Integer}}, 1024]
Then we simply pass a list from 1 to 10^8 to this CUDA function
cudaDigitSum[Range[10^8]]//First//Total//AbsoluteTiming
{1.31861, 3600000001}
Can we go even faster?
Yes, the data transfer time between the host (CPU) and the device (GPU) can be a significant bottleneck. Loading the data onto the GPU first using CUDAMemoryLoad
can save considerable time during calculations. It's also crucial to remember to release the allocated GPU memory using CUDAMemoryUnload
when it's no longer needed. CUDA functions operate directly on the memory space using pointers, which is why you need CUDAMemoryGet
to retrieve the results.
range=CUDAMemoryLoad[Range[10^8]];
AbsoluteTiming[cudaDigitSum[range];CUDAMemoryGet[range]//Total]
CUDAMemoryUnload[range];
{1.06409, 3600000001}
Can we go even faster?
Another speed optimization will be using the correct data type :unsigned int
(32-bit unsigned integer): This type can hold values from 0 to 4,294,967,295. So, it can hold 3,600,000,001.
we need to change the CUDA code slightly:
cudaDigitSumUnsignedIntCode = "
__global__ void cudaDigitSum(unsigned int *inout) {
unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
unsigned int n = inout[index];
unsigned int sum = 0;
while (n > 0) {
sum += n % 10;
n /= 10;
}
inout[index] = sum;
}
";
and recompile our CUDA function with the correct data specification:
cudaDigitSumUnsignedInt = CUDAFunctionLoad[cudaDigitSumUnsignedIntCode,
"cudaDigitSum", {{"UnsignedInteger32"}}, 1024]
Let’s retest the performance after optimized data type, remember to load the correct data type UnsignedInteger32
in to memory:
range=CUDAMemoryLoad[Range[10^8],"UnsignedInteger32"];
AbsoluteTiming[cudaDigitSumUnsignedInt[range];CUDAMemoryGet[range]//Total]
CUDAMemoryUnload[range];
{0.785289, 3600000001}
Can we go even faster?
Yes.
You might notice that we keep using the Total
function to calculate the total sum of the digit sums in the list.
What if we perform this summation within CUDA as well? In other words, we pass a list of data to CUDA, and it returns the total sum directly. In this case, the computation is entirely offloaded to CUDA; Mathematica only serves to pass the raw data to CUDA and receive the computed result.
We can write a new CUDA function called cudaDigitSumsTotal
.
cudaDigitSumsTotalCode = "
__global__ void cudaDigitSum(unsigned int* in, unsigned int* out) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
unsigned int n = in[index];
unsigned int sum = 0;
while (n > 0) {
sum += n % 10;
n /= 10;
}
atomicAdd(out, sum);
}
";
The atomicAdd
function is used here to ensure the correctness of the variable during massive concurrent summation. You will almost certainly get incorrect results if you don't use it.
Compile this to cudaDigitSumsTotal
.
cudaDigitSumsTotal = CUDAFunctionLoad[cudaDigitSumsTotalCode, "cudaDigitSum",
{{"UnsignedInteger32", "Input"}, {"UnsignedInteger32", "Output"}}, 1024
];
Load the data into memory.
in = CUDAMemoryLoad[Range[10^8], "UnsignedInteger32"];
out = CUDAMemoryLoad[{0}, "UnsignedInteger32"];
And here are the results:
AbsoluteTiming[
cudaDigitSumsTotal[in, out];
CUDAMemoryGet[out]
]
{0.27297, {3600000001}}
Always remember to free memory.
CUDAMemoryUnload[in];
CUDAMemoryUnload[out];
0.27297 seconds! for a quantity of 10^8
.
Let's try a quantity of 10^9
, but we need to change our data type from unsigned int
, as the result will overflow its range (0 to 4,294,967,295).
atomicAdd
only supports unsigned int
and unsigned long long
.
With the extended range of unsigned long long
(0 to 18,446,744,073,709,551,615), we should handle the sum correctly.
`Needs["CUDALink`"]
cudaDigitSumsTotalExtendCode = "
__global__ void cudaDigitSum(unsigned int* in, unsigned long long * out) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
unsigned int n = in[index];
unsigned long long sum = 0;
while (n > 0) {
sum += n % 10;
n /= 10;
}
atomicAdd(out, sum);
}
";
cudaDigitSumsTotalExtend = CUDAFunctionLoad[cudaDigitSumsTotalExtendCode,
"cudaDigitSum", {{"UnsignedInteger32", "Input"}, {"Integer64", "Output"
}}, 1024];
in = CUDAMemoryLoad[Range[10^9], "UnsignedInteger32"];
out = CUDAMemoryLoad[{0}, "Integer64"];
AbsoluteTiming[
cudaDigitSumsTotalExtend[in, out];
CUDAMemoryGet[out]
]
CUDAMemoryUnload[in];
CUDAMemoryUnload[out];
{2.73938, {40500000001}}
2.7 seconds. This is where we end up.
Let's do a quick review. Compared to the initial native approach, which took 12 seconds for
n = 10^6
, we've now reached 2.7 seconds for n = 10^9
. That's more than 4,000 times faster, without changing the algorithm itself at all.
In conclusion, we've journeyed from a straightforward but inefficient C-style solution to a highly optimized CUDA implementation, achieving a remarkable 4000x speedup. By maintaining a consistent digit-summing algorithm throughout, we've clearly demonstrated the profound impact of various optimization techniques. Functional programming, kernel parallelization, function compilation with optimization, and CUDA GPU acceleration, when strategically applied, can unlock the true potential of Mathematica for computationally intensive tasks. I hope this exploration has provided valuable insights and inspires you to optimize your own code. Thank you for joining me on this journey.