Message Boards Message Boards

Mathematica High Performance Computing (HPC)

Posted 2 days ago

This article primarily focuses on how to enhance the performance of highly intensive computational tasks through minor techniques, including functional programming, kernel parallelization, compilation optimization, and ultimately, utilizing the capability of GPU acceleration using CUDA.

Linking Mathematica with C++ is also a great way to boost performance, but we will not cover it here, as Luyan Yu has authored another excellent article on this topic, available at:

Linking C++ code with Mathematica using LibraryLink - Luyan Yu

Let’s begin with a simple task, find out the sum of all number digits within 1,000,000.:

for example, sum of all number digits within 12 will be :

1+2+3+4+5+6+7+8+9+1+0+1+1+1+2=51.

To solve the problem itself is fairly easy, we first consider number in this form from 000,000 to 999,999, each digit (0 through 9) appears an equal number of times in each position. For each position, the sum of every 10 digits is 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 = 45. Thus, The sum of the digits of all numbers in any given position is 45 * 100,000 = 4,500,000. Then, since there are six positions, the total sum of all digits from 000,000 to 999,999 is 6 * 4,500,000 = 27,000,000. Then we add 1,000,000 as a final 1 to get the result 27,000,001.

The formula of total digits sum with in 10^x can be easily deduced, x*45*10^(x-1)+1 or

$$ f(x)=\frac{9}{2}\ 10^x x+1 $$

However, solving this specific problem is not our focus here, nor is the solving algorithm itself. This article aims to teach you techniques for enhancing the performance of any general function or algorithm you might write in the future. We will focus on optimizing performance by using different implementations of the same algorithm in Mathematica. Thus, the algorithm used for testing should be intentionally kept consistent for control variable purposes.

Let's start with the simplest and most basic approach: C-style coding. We will use this as our baseline performance test algorithm.

If you are a C programmer, you might write Mathematica code like this:

cDigitSum[nn_] := Module[{sum = 0, n = nn},
    While[
        n > 0
        ,
        sum += Mod[n, 10];
        n = Quotient[n, 10];
    ];
    sum
]

AbsoluteTiming[
    total = 0;
    For[n = 0, n <= 10^6, n++,
        total += cDigitSum[n]
    ];
    total
]

{11.9438, 27000001}

Costs around 12 seconds, which is not very efficient. but don’t be discoursed. Let’s see what Mathematica can do.

Surprisingly, Mathematica has tons of magic built-in function, even include this one : DigitSum. Even more surprising is that it is slower than our humble and trivial cDigitSum function.

AbsoluteTiming[
    total = 0;
    For[n = 0, n <= 10^6, n++,
        total += DigitSum[n]
    ];
    total
]

{13.3594, 27000001}

It costs 13 seconds."

Congratulations, you have beaten Mathematica in this case. We will no longer use Mathematica's built-in DigitSum function and will continue working based on our cDigitSum function."

In Mathematica, using C-style coding is very inefficient. You can use a Do statement instead of For to make the code more elegant.

AbsoluteTiming[
    total = 0;
    
    Do[total += cDigitSum[n], {n, 10^6}];
    
    total
]

{10.9844, 27000001}

What can make code even more elegant by using Map operator form/@, and postfix form //.

cDigitSum/@Range[10^6]//Total//AbsoluteTiming

{10.7969, 27000001}

Functional programming usually makes code more elegant and easier to read and often provides better performance. However, since our performance here is limited by cDigitSum, it doesn’t boost the speed significantly.

Nevertheless, Mathematica provides a way to boost functional-style code like Do or Map without modifying cDigitSum itself: kernel parallelization. Simply by applying the function Parallelize to the code, you can achieve a tremendous improvement in speed.

Warning:

Parallelization using Do involves setting up shared variables, which can cause a deadly performance drop due to rapid, concurrent writes to the same variable from all parallel kernels. Do not try this practice.

However, I can demonstrate what happens if you do so. Please note that the quantity has been reduced to 10^4 to prevent the computation from hanging or crashing.

(*DO NOT TRY THIS CODE*)

AbsoluteTiming[

total = 0;
SetSharedVariable[total];

Parallelize@Do[total += cDigitSum[n], {n, 10^4}];

total]

(*DO NOT TRY THIS CODE*)

{20.9006, 180001}

Here is the correct implement using a parallelizedMap , as you can see there is a vast performance boost, we successfully achieved 2.2 seconds

Parallelize[cDigitSum/@Range[10^6]]//Total//AbsoluteTiming

{2.26036, 27000001}

Or even using a parallelized Sum:

Parallelize@Sum[cDigitSum[n],{n,10^6}]//AbsoluteTiming

{2.23265, 27000001}

We have achieved a amazing speed boost without change cDigitSum at all.

The next technique is to tweak the implementation of function, without change the algorithm , what you can do is called Compile and FunctionComile .

By using Compile we can write a compiled version of our cDigitSum :

cDigitSumCompile = Compile[
    {{nn, _Integer}}
    ,
    Module[{sum = 0, digit, n = nn},
        While[
            n > 0
            ,
            digit = Mod[n, 10];
            sum += digit;
            n = Quotient[n, 10];
        ];
        sum
    ]
]

By default, the Compile function will compile to Wolfram Virtual Machine, it is equivalent to

cDigitSumCompileWVM = Compile[
    {{nn, _Integer}}
    ,
    Module[{sum = 0, digit, n = nn},
        While[
            n > 0
            ,
            digit = Mod[n, 10];
            sum += digit;
            n = Quotient[n, 10];
        ];
        sum
    ]
    ,
    CompilationTarget -> "WVM"
]

Let’s test it

cDigitSumCompile/@Range[10^6]//Total//AbsoluteTiming

{0.327939, 27000001}
cDigitSumCompileWVM/@Range[10^6]//Total//AbsoluteTiming

{0.323876, 27000001}

Without using parallelization or modify the algorithm, we achieved 0.3 seconds !

From now, we will raise our n for this problem from 10^6 to 10^7 , let’s try it again:

cDigitSumCompileWVM/@Range[10^7]//Total//AbsoluteTiming

{3.68126, 315000001}

Instead of compile to Wolfram Virtual Machine, what you can do to achieve further improvement is compile to C

cDigitSumCompileC = Compile[
    {{nn, _Integer}}
    ,
    Module[{sum = 0, digit, n = nn},
        While[
            n > 0
            ,
            digit = Mod[n, 10];
            sum += digit;
            n = Quotient[n, 10];
        ];
        sum
    ]
    ,
    CompilationTarget -> "C"
]
cDigitSumCompileC/@Range[10^7]//Total//AbsoluteTiming

{2.36898, 315000001}

However, you can achieve more with compilation optimization options.

Please note that in this specific case, since our test function is relatively simple, most optimization options would have little effect or meaningless. Nevertheless, I will enable all of them for demonstration purpose.

cDigitSumCompileCOptimize = Compile[
    {{nn, _Integer}}
    ,
    Module[{sum = 0, digit, n = nn},
        While[
            n > 0
            ,
            digit = Mod[n, 10];
            sum += digit;
            n = Quotient[n, 10];
        ];
        sum
    ]
    ,
    CompilationTarget -> "C"
    ,
    CompilationOptions -> {"ExpressionOptimization" -> True}
    ,
    RuntimeAttributes -> {Listable}
    ,
    RuntimeOptions -> "Speed"
    ,
    Parallelization -> True
]
cDigitSumCompileCOptimize/@Range[10^7]//Total//AbsoluteTiming

{2.03582, 315000001}

Another compile method is called FunctionCompile ,

cDigitSumFunctionCompile = FunctionCompile[
    Function[Typed[nn, "MachineInteger"],
        Module[{sum = 0, digit, n = nn},
            While[
                n > 0
                ,
                digit = Mod[n, 10];
                sum += digit;
                n = Quotient[n, 10];
            ];
            sum
        ]
    ]
]

We will not delve deeply into how to fine-tune FunctionCompile as it is currently in the experimental testing phase. However, its default settings generally perform quite well.

cDigitSumFunctionCompile/@Range[10^7]//Total//AbsoluteTiming

{0.538493, 315000001}

Once again, we achieved a speed of within 1 second. From now on, we will increase our test quantities from 10^7 to 10^8 and try it again.:

cDigitSumFunctionCompile/@Range[10^8]//Total//AbsoluteTiming

{5.41658, 3600000001}

We can certainly integrate the techniques of kernel parallelization with Compile or FunctionCompile to further enhance the speed.

Parallelize[cDigitSumFunctionCompile/@Range[10^8]]//Total//AbsoluteTiming

{2.22658, 3600000001}

What next? Can we further improve performance without altering the algorithm?"

Yes, the answer is CUDA.

First, you will need to load CUDALink package

Needs["CUDALink`"]

Let’s write a CUDA function code cudaDigitSumCode first, this CUDA function takes a list of value as input, then for each element we apply our classic test algorithm to calculate the digit sum, then replace its original value. Notice here the mint data or _Integer in the specification of CUDAFunctionLoad is Wolfram Language integer

cudaDigitSumCode = "
__global__ void cudaDigitSum(mint *inout) { 
    mint index = threadIdx.x + blockIdx.x * blockDim.x;

        mint n = inout[index];
        mint sum = 0;

        while (n > 0) {
            sum += n % 10;
            n /= 10;
        }

        inout[index] = sum;
}
";

we can then load this CUDA code by using CUDAFunctionLoad to get our cudaDigitSum function.

cudaDigitSum = CUDAFunctionLoad[cudaDigitSumCode, 
	"cudaDigitSum", {{_Integer}}, 1024]

Then we simply pass a list from 1 to 10^8 to this CUDA function

cudaDigitSum[Range[10^8]]//First//Total//AbsoluteTiming

{1.31861, 3600000001}

Can we go even faster?

Yes, the data transfer time between the host (CPU) and the device (GPU) can be a significant bottleneck. Loading the data onto the GPU first using CUDAMemoryLoad can save considerable time during calculations. It's also crucial to remember to release the allocated GPU memory using CUDAMemoryUnload when it's no longer needed. CUDA functions operate directly on the memory space using pointers, which is why you need CUDAMemoryGet to retrieve the results.

range=CUDAMemoryLoad[Range[10^8]];

AbsoluteTiming[cudaDigitSum[range];CUDAMemoryGet[range]//Total]

CUDAMemoryUnload[range];

{1.06409, 3600000001}

Can we go even faster?

Another speed optimization will be using the correct data type :unsigned int (32-bit unsigned integer): This type can hold values from 0 to 4,294,967,295. So, it can hold 3,600,000,001.

we need to change the CUDA code slightly:

cudaDigitSumUnsignedIntCode = "
__global__ void cudaDigitSum(unsigned int *inout) { 
    unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;

        unsigned int n = inout[index];
        unsigned int sum = 0;

        while (n > 0) {
            sum += n % 10;
            n /= 10;
        }

        inout[index] = sum;
}
";

and recompile our CUDA function with the correct data specification:

cudaDigitSumUnsignedInt = CUDAFunctionLoad[cudaDigitSumUnsignedIntCode, 
	"cudaDigitSum", {{"UnsignedInteger32"}}, 1024]

Let’s retest the performance after optimized data type, remember to load the correct data type UnsignedInteger32 in to memory:

range=CUDAMemoryLoad[Range[10^8],"UnsignedInteger32"];

AbsoluteTiming[cudaDigitSumUnsignedInt[range];CUDAMemoryGet[range]//Total]

CUDAMemoryUnload[range];

{0.785289, 3600000001}

Can we go even faster?

Yes.

You might notice that we keep using the Total function to calculate the total sum of the digit sums in the list.

What if we perform this summation within CUDA as well? In other words, we pass a list of data to CUDA, and it returns the total sum directly. In this case, the computation is entirely offloaded to CUDA; Mathematica only serves to pass the raw data to CUDA and receive the computed result.

We can write a new CUDA function called cudaDigitSumsTotal.

cudaDigitSumsTotalCode = "
__global__ void cudaDigitSum(unsigned int* in, unsigned int* out) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;

    unsigned int n = in[index];
    unsigned int sum = 0;

    while (n > 0) {
        sum += n % 10;
        n /= 10;
    }

    atomicAdd(out, sum);
}
";

The atomicAdd function is used here to ensure the correctness of the variable during massive concurrent summation. You will almost certainly get incorrect results if you don't use it.

Compile this to cudaDigitSumsTotal.

cudaDigitSumsTotal = CUDAFunctionLoad[cudaDigitSumsTotalCode, "cudaDigitSum",
     {{"UnsignedInteger32", "Input"}, {"UnsignedInteger32", "Output"}}, 1024
    ];

Load the data into memory.

in = CUDAMemoryLoad[Range[10^8], "UnsignedInteger32"];
out = CUDAMemoryLoad[{0}, "UnsignedInteger32"];

And here are the results:

AbsoluteTiming[
    cudaDigitSumsTotal[in, out];
    CUDAMemoryGet[out]
]

{0.27297, {3600000001}}

Always remember to free memory.

CUDAMemoryUnload[in];
CUDAMemoryUnload[out];

0.27297 seconds! for a quantity of 10^8.

Let's try a quantity of 10^9, but we need to change our data type from unsigned int, as the result will overflow its range (0 to 4,294,967,295).

atomicAdd only supports unsigned int and unsigned long long.

With the extended range of unsigned long long (0 to 18,446,744,073,709,551,615), we should handle the sum correctly.

`Needs["CUDALink`"]

cudaDigitSumsTotalExtendCode = "
__global__ void cudaDigitSum(unsigned int* in, unsigned long long * out) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;

    unsigned int n = in[index];
    unsigned long long sum = 0;

    while (n > 0) {
        sum += n % 10;
        n /= 10;
    }

    atomicAdd(out, sum);
}
";

cudaDigitSumsTotalExtend = CUDAFunctionLoad[cudaDigitSumsTotalExtendCode,
     "cudaDigitSum", {{"UnsignedInteger32", "Input"}, {"Integer64", "Output"
    }}, 1024];

in = CUDAMemoryLoad[Range[10^9], "UnsignedInteger32"];

out = CUDAMemoryLoad[{0}, "Integer64"];

AbsoluteTiming[
    cudaDigitSumsTotalExtend[in, out];
    CUDAMemoryGet[out]
]

CUDAMemoryUnload[in];

CUDAMemoryUnload[out];

{2.73938, {40500000001}}

2.7 seconds. This is where we end up.

Let's do a quick review. Compared to the initial native approach, which took 12 seconds for

n = 10^6, we've now reached 2.7 seconds for n = 10^9. That's more than 4,000 times faster, without changing the algorithm itself at all.

In conclusion, we've journeyed from a straightforward but inefficient C-style solution to a highly optimized CUDA implementation, achieving a remarkable 4000x speedup. By maintaining a consistent digit-summing algorithm throughout, we've clearly demonstrated the profound impact of various optimization techniques. Functional programming, kernel parallelization, function compilation with optimization, and CUDA GPU acceleration, when strategically applied, can unlock the true potential of Mathematica for computationally intensive tasks. I hope this exploration has provided valuable insights and inspires you to optimize your own code. Thank you for joining me on this journey.

POSTED BY: eo iles
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract