Group Abstract

Message Boards

WOLFRAM COMMUNITY

1.2K Views

0 Replies

3 Total Likes

View groups...

Follow this post

Share this post:

GROUPS:

Mathematica High Performance Computing (HPC)

eo iles

Posted 8 months ago

This article primarily focuses on how to enhance the performance of highly intensive computational tasks through minor techniques, including functional programming, kernel parallelization, compilation optimization, and ultimately, utilizing the capability of GPU acceleration using CUDA. Linking Mathematica with C++ is also a great way to boost performance, but we will not cover it here, as Luyan Yu has authored another excellent article on this topic, available at: Linking C++ code with Mathematica using LibraryLink - Luyan Yu Let’s begin with a simple task, find out the sum of all number digits within 1,000,000.: for example, sum of all number digits within 12 will be : 1+2+3+4+5+6+7+8+9+1+0+1+1+1+2=51. To solve the problem itself is fairly easy, we first consider number in this form from 000,000 to 999,999, each digit (0 through 9) appears an equal number of times in each position. For each position, the sum of every 10 digits is 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 = 45. Thus, The sum of the digits of all numbers in any given position is 45 * 100,000 = 4,500,000. Then, since there are six positions, the total sum of all digits from 000,000 to 999,999 is 6 * 4,500,000 = 27,000,000. Then we add 1,000,000 as a final 1 to get the result 27,000,001. The formula of total digits sum with in 10^x can be easily deduced, `x4510^(x-1)+1` or $$ f(x)=\frac{9}{2}\ 10^x x+1 $$ However, solving this specific problem is not our focus here, nor is the solving algorithm itself. This article aims to teach you techniques for enhancing the performance of any general function or algorithm you might write in the future. We will focus on optimizing performance by using different implementations of the same algorithm in Mathematica. Thus, the algorithm used for testing should be intentionally kept consistent for control variable purposes. Let's start with the simplest and most basic approach: C-style coding. We will use this as our baseline performance test algorithm. If you are a C programmer, you might write Mathematica code like this: cDigitSum[nn_] := Module[{sum = 0, n = nn}, While[ n > 0 , sum += Mod[n, 10]; n = Quotient[n, 10]; ]; sum ] AbsoluteTiming[ total = 0; For[n = 0, n <= 10^6, n++, total += cDigitSum[n] ]; total ] {11.9438, 27000001} Costs around 12 seconds, which is not very efficient. but don’t be discoursed. Let’s see what Mathematica can do. Surprisingly, Mathematica has tons of magic built-in function, even include this one : `DigitSum`. Even more surprising is that it is slower than our humble and trivial `cDigitSum` function. AbsoluteTiming[ total = 0; For[n = 0, n <= 10^6, n++, total += DigitSum[n] ]; total ] {13.3594, 27000001} It costs 13 seconds." Congratulations, you have beaten Mathematica in this case. We will no longer use Mathematica's built-in `DigitSum` function and will continue working based on our `cDigitSum` function." In Mathematica, using C-style coding is very inefficient. You can use a `Do` statement instead of `For` to make the code more elegant. AbsoluteTiming[ total = 0; Do[total += cDigitSum[n], {n, 10^6}]; total ] {10.9844, 27000001} What can make code even more elegant by using `Map` operator form`/@`, and postfix form `//`. cDigitSum/@Range[10^6]//Total//AbsoluteTiming {10.7969, 27000001} Functional programming usually makes code more elegant and easier to read and often provides better performance. However, since our performance here is limited by `cDigitSum`, it doesn’t boost the speed significantly. Nevertheless, Mathematica provides a way to boost functional-style code like `Do` or `Map` without modifying `cDigitSum` itself: kernel parallelization. Simply by applying the function `Parallelize` to the code, you can achieve a tremendous improvement in speed. `Warning:` Parallelization using `Do` involves setting up shared variables, which can cause a deadly performance drop due to rapid, concurrent writes to the same variable from all parallel kernels. Do not try this practice. However, I can demonstrate what happens if you do so. Please note that the quantity has been reduced to `10^4` to prevent the computation from hanging or crashing. (DO NOT TRY THIS CODE) AbsoluteTiming[ total = 0; SetSharedVariable[total]; Parallelize@Do[total += cDigitSum[n], {n, 10^4}]; total] (DO NOT TRY THIS CODE) {20.9006, 180001} Here is the correct implement using a parallelized`Map` , as you can see there is a vast performance boost, we successfully achieved 2.2 seconds Parallelize[cDigitSum/@Range[10^6]]//Total//AbsoluteTiming {2.26036, 27000001} Or even using a parallelized `Sum`: Parallelize@Sum[cDigitSum[n],{n,10^6}]//AbsoluteTiming {2.23265, 27000001} We have achieved a amazing speed boost without change `cDigitSum` at all. The next technique is to tweak the implementation of function, without change the algorithm , what you can do is called `Compile` and `FunctionComile` . By using `Compile` we can write a compiled version of our `cDigitSum` : cDigitSumCompile = Compile[ {{nn, _Integer}} , Module[{sum = 0, digit, n = nn}, While[ n > 0 , digit = Mod[n, 10]; sum += digit; n = Quotient[n, 10]; ]; sum ] ] By default, the `Compile` function will compile to `Wolfram Virtual Machine`, it is equivalent to cDigitSumCompileWVM = Compile[ {{nn, _Integer}} , Module[{sum = 0, digit, n = nn}, While[ n > 0 , digit = Mod[n, 10]; sum += digit; n = Quotient[n, 10]; ]; sum ] , CompilationTarget -> "WVM" ] Let’s test it cDigitSumCompile/@Range[10^6]//Total//AbsoluteTiming {0.327939, 27000001} cDigitSumCompileWVM/@Range[10^6]//Total//AbsoluteTiming {0.323876, 27000001} Without using parallelization or modify the algorithm, we achieved 0.3 seconds ! From now, we will raise our n for this problem from `10^6` to `10^7` , let’s try it again: cDigitSumCompileWVM/@Range[10^7]//Total//AbsoluteTiming {3.68126, 315000001} Instead of compile to `Wolfram Virtual Machine`, what you can do to achieve further improvement is compile to `C` cDigitSumCompileC = Compile[ {{nn, _Integer}} , Module[{sum = 0, digit, n = nn}, While[ n > 0 , digit = Mod[n, 10]; sum += digit; n = Quotient[n, 10]; ]; sum ] , CompilationTarget -> "C" ] cDigitSumCompileC/@Range[10^7]//Total//AbsoluteTiming {2.36898, 315000001} However, you can achieve more with compilation optimization options. Please note that in this specific case, since our test function is relatively simple, most optimization options would have little effect or meaningless. Nevertheless, I will enable all of them for demonstration purpose. cDigitSumCompileCOptimize = Compile[ {{nn, _Integer}} , Module[{sum = 0, digit, n = nn}, While[ n > 0 , digit = Mod[n, 10]; sum += digit; n = Quotient[n, 10]; ]; sum ] , CompilationTarget -> "C" , CompilationOptions -> {"ExpressionOptimization" -> True} , RuntimeAttributes -> {Listable} , RuntimeOptions -> "Speed" , Parallelization -> True ] cDigitSumCompileCOptimize/@Range[10^7]//Total//AbsoluteTiming {2.03582, 315000001} Another compile method is called `FunctionCompile` , cDigitSumFunctionCompile = FunctionCompile[ Function[Typed[nn, "MachineInteger"], Module[{sum = 0, digit, n = nn}, While[ n > 0 , digit = Mod[n, 10]; sum += digit; n = Quotient[n, 10]; ]; sum ] ] ] We will not delve deeply into how to fine-tune `FunctionCompile` as it is currently in the experimental testing phase. However, its default settings generally perform quite well. cDigitSumFunctionCompile/@Range[10^7]//Total//AbsoluteTiming {0.538493, 315000001} Once again, we achieved a speed of within 1 second. From now on, we will increase our test quantities from `10^7` to `10^8` and try it again.: cDigitSumFunctionCompile/@Range[10^8]//Total//AbsoluteTiming {5.41658, 3600000001} We can certainly integrate the techniques of kernel parallelization with `Compile` or `FunctionCompile` to further enhance the speed. Parallelize[cDigitSumFunctionCompile/@Range[10^8]]//Total//AbsoluteTiming {2.22658, 3600000001} What next? Can we further improve performance without altering the algorithm?" Yes, the answer is CUDA. First, you will need to load `CUDALink` package Needs["CUDALink`"] Let’s write a CUDA function code `cudaDigitSumCode` first, this CUDA function takes a list of value as input, then for each element we apply our classic test algorithm to calculate the digit sum, then replace its original value. Notice here the `mint` data or `_Integer` in the specification of `CUDAFunctionLoad` is `Wolfram Language integer` cudaDigitSumCode = " __global__ void cudaDigitSum(mint inout) { mint index = threadIdx.x + blockIdx.x blockDim.x; mint n = inout[index]; mint sum = 0; while (n > 0) { sum += n % 10; n /= 10; } inout[index] = sum; } "; we can then load this CUDA code by using `CUDAFunctionLoad` to get our `cudaDigitSum` function. cudaDigitSum = CUDAFunctionLoad[cudaDigitSumCode, "cudaDigitSum", {{_Integer}}, 1024] Then we simply pass a list from 1 to 10^8 to this CUDA function cudaDigitSum[Range[10^8]]//First//Total//AbsoluteTiming {1.31861, 3600000001} Can we go even faster? Yes, the data transfer time between the host (CPU) and the device (GPU) can be a significant bottleneck. Loading the data onto the GPU first using `CUDAMemoryLoad` can save considerable time during calculations. It's also crucial to remember to release the allocated GPU memory using `CUDAMemoryUnload` when it's no longer needed. CUDA functions operate directly on the memory space using pointers, which is why you need `CUDAMemoryGet` to retrieve the results. range=CUDAMemoryLoad[Range[10^8]]; AbsoluteTiming[cudaDigitSum[range];CUDAMemoryGet[range]//Total] CUDAMemoryUnload[range]; {1.06409, 3600000001} Can we go even faster? Another speed optimization will be using the correct data type :`unsigned int` (32-bit unsigned integer): This type can hold values from 0 to 4,294,967,295. So, it can hold 3,600,000,001. we need to change the CUDA code slightly: cudaDigitSumUnsignedIntCode = " __global__ void cudaDigitSum(unsigned int inout) { unsigned int index = threadIdx.x + blockIdx.x blockDim.x; unsigned int n = inout[index]; unsigned int sum = 0; while (n > 0) { sum += n % 10; n /= 10; } inout[index] = sum; } "; and recompile our CUDA function with the correct data specification: cudaDigitSumUnsignedInt = CUDAFunctionLoad[cudaDigitSumUnsignedIntCode, "cudaDigitSum", {{"UnsignedInteger32"}}, 1024] Let’s retest the performance after optimized data type, remember to load the correct data type `UnsignedInteger32` in to memory: range=CUDAMemoryLoad[Range[10^8],"UnsignedInteger32"]; AbsoluteTiming[cudaDigitSumUnsignedInt[range];CUDAMemoryGet[range]//Total] CUDAMemoryUnload[range]; {0.785289, 3600000001} Can we go even faster? Yes. You might notice that we keep using the `Total` function to calculate the total sum of the digit sums in the list. What if we perform this summation within CUDA as well? In other words, we pass a list of data to CUDA, and it returns the total sum directly. In this case, the computation is entirely offloaded to CUDA; Mathematica only serves to pass the raw data to CUDA and receive the computed result. We can write a new CUDA function called `cudaDigitSumsTotal`. cudaDigitSumsTotalCode = " __global__ void cudaDigitSum(unsigned int* in, unsigned int* out) { int index = threadIdx.x + blockIdx.x * blockDim.x; unsigned int n = in[index]; unsigned int sum = 0; while (n > 0) { sum += n % 10; n /= 10; } atomicAdd(out, sum); } "; The `atomicAdd` function is used here to ensure the correctness of the variable during massive concurrent summation. You will almost certainly get incorrect results if you don't use it. Compile this to `cudaDigitSumsTotal`. cudaDigitSumsTotal = CUDAFunctionLoad[cudaDigitSumsTotalCode, "cudaDigitSum", {{"UnsignedInteger32", "Input"}, {"UnsignedInteger32", "Output"}}, 1024 ]; Load the data into memory. in = CUDAMemoryLoad[Range[10^8], "UnsignedInteger32"]; out = CUDAMemoryLoad[{0}, "UnsignedInteger32"]; And here are the results: AbsoluteTiming[ cudaDigitSumsTotal[in, out]; CUDAMemoryGet[out] ] {0.27297, {3600000001}} Always remember to free memory. CUDAMemoryUnload[in]; CUDAMemoryUnload[out]; 0.27297 seconds! for a quantity of `10^8`. Let's try a quantity of `10^9`, but we need to change our data type from `unsigned int`, as the result will overflow its range (0 to 4,294,967,295). `atomicAdd` only supports `unsigned int` and `unsigned long long`. With the extended range of `unsigned long long` (0 to 18,446,744,073,709,551,615), we should handle the sum correctly. `Needs["CUDALink`"] cudaDigitSumsTotalExtendCode = " __global__ void cudaDigitSum(unsigned int* in, unsigned long long * out) { int index = threadIdx.x + blockIdx.x * blockDim.x; unsigned int n = in[index]; unsigned long long sum = 0; while (n > 0) { sum += n % 10; n /= 10; } atomicAdd(out, sum); } "; cudaDigitSumsTotalExtend = CUDAFunctionLoad[cudaDigitSumsTotalExtendCode, "cudaDigitSum", {{"UnsignedInteger32", "Input"}, {"Integer64", "Output" }}, 1024]; in = CUDAMemoryLoad[Range[10^9], "UnsignedInteger32"]; out = CUDAMemoryLoad[{0}, "Integer64"]; AbsoluteTiming[ cudaDigitSumsTotalExtend[in, out]; CUDAMemoryGet[out] ] CUDAMemoryUnload[in]; CUDAMemoryUnload[out]; {2.73938, {40500000001}} 2.7 seconds. This is where we end up. Let's do a quick review. Compared to the initial native approach, which took 12 seconds for `n = 10^6`, we've now reached 2.7 seconds for `n = 10^9`. That's more than 4,000 times faster, without changing the algorithm itself at all. In conclusion, we've journeyed from a straightforward but inefficient C-style solution to a highly optimized CUDA implementation, achieving a remarkable 4000x speedup. By maintaining a consistent digit-summing algorithm throughout, we've clearly demonstrated the profound impact of various optimization techniques. Functional programming, kernel parallelization, function compilation with optimization, and CUDA GPU acceleration, when strategically applied, can unlock the true potential of Mathematica for computationally intensive tasks. I hope this exploration has provided valuable insights and inspires you to optimize your own code. Thank you for joining me on this journey.

POSTED BY: eo iles

Reply to this discussion

Reply Preview

Attachments

Remove Add a file to this post

Follow this discussion

or Discard

Feedback