Group Abstract

Message Boards

WOLFRAM COMMUNITY

5.5K Views

6 Replies

23 Total Likes

View groups...

Follow this post

Share this post:

GROUPS:

MetalLink: GPU computations on Apple Silicon

Soumya Sambeet Mohapatra

Soumya Sambeet Mohapatra, ADDITIVE GmbH

Posted 1 year ago

*MetalLink Project GitHub:* https://github.com/s4m13337/MetalLink Introduction While working on some machine learning projects recently, I encountered a significant limitation: Mathematica’s inability to use Apple’s M-Series GPU for training neural networks. I had to resort to TensorFlow to speed up the training and and advance my tasks. However, my curiosity led me on a slight detour to investigate potential methods for utilizing Apple’s GPU within Mathematica. Background A quick internet search brings up this community post: https://community.wolfram.com/groups/-/m/t/2113522 The post is about 3 years old at the time of writing this and despite that, the absence of support for training neural networks on Apple’s GPU remains a puzzle. I am very certain that there is demand for this from many users of Mathematica such as myself. Maybe someone from Wolfram Research could illuminate about the current situation and future plans regarding this. Nevertheless, I have been experimenting with Metal and various ways to access the GPU from Mathematica. In this post I summarize the different things I have found out. I had to refresh my knowledge on several low level functions in Mathematica since I have not worked with them with in a while. I also had to learn a bit of Objective C because as it is requisite for interfacing with Metal. Design Broadly, there are two components: a Metal program that is written in C and Objective C that can interacts with the GPU and there is a layer in Mathematica that establishes connection to this external program to exchange data. The Metal side presents significant challenges as the documentation is hard to follow and does not contain any examples. I had to rely on heavy trial and errors to figure things out. This is the general workflow that needs to be done to pass data to the GPU for computation: Create a reference to the Metal device Load a library of kernels/compute shader functions Build a pipeline against the kernels/compute shader functions Create a command queue Create compute buffers/textures and load the data into them Define work grid size, threadgroup size and threads per thread group Push the compute buffers/textures into the command queue using a command buffer Commit the work to the GPU and wait till it completes the computation Steps 1 to 4 can be done once and each of the objects can be reused. Step 5 is where the data obtained from Mathematica is loaded into a buffer and pushed to the GPU in Step 7. Step 6 is where the work is divided into groups for the GPU to process them in parallel. After computation, result is copied back to the program’s memory from the buffer and it is returned to Mathematica. I am omitting several details on the terminologies mentioned above, as they are quite complex and would render this post excessively lengthy. Detailed explanations can be found in Metal's documentation. After implementing a basic version of the aforementioned program, I needed to interface it with Mathematica. Fortunately, the Mathematica side is relatively straightforward to implement. There are three well-documented options for this: ForeignFunctionInterface WSTP LibraryLink ForeignFunctionInterface is a quick way to load an external program and play around with it. But this does not scale well for developing a bigger library. WSTP is a much more sophisticated framework and provides lot more features. But the main drawback here is that the external program loaded is connected to Mathematica only through a link and it consists of its own memory space. Any data from Mathematica must be pushed through the link and this becomes problem when pushing large amount of data. And this defeats the entire purpose of parallel computation. LibraryLink is the most versatile solution. This requires the external program to be compiled as a dynamic library. When it is loaded into Mathematica, its memory space gets shared i.e. the external program has direct access to data in Mathematica. Development Taking inspiration from the existing CUDALink, I have named this library MetalLink (the most original name I could come up with). To begin with, I developed a simple function that just adds two lists in parallel. Although a trivial task, it serves as a good benchmark to assess the efficiency of GPU computation. And to really put pressure on the GPU, each list consisted of 1 billion items. The initial results I obtained where a bit disappointing, however: Addition on Mathematica took close to 2 minutes (118 seconds to be exact), where as on the GPU, it took 5 minutes - more than twice the time. Upon closer inspection during debugging, I found that the actual computation time on the GPU was less than a second. The majority of the time was consumed in memory allocation and data writing to the buffer. To address this bottleneck, I delved deep into the GPU's architecture to devise a method for accelerating these processes. Subsequently, I implemented a technique to asynchronously chunk the data from Mathematica and create buffers of these chunk sizes. This optimization significantly enhanced performance, reducing the time for adding two lists to just over a minute (72 seconds, to be exact). That's nearly a 50% improvement in speed. Continuing with my development efforts, another function I created is the Map function, inspired by the existing CUDAMap, which I aptly named MetalMap. The table below summarizes the computation times for applying the Sin function to lists of various length: Conclusion Furthermore, to explore the image processing capabilities, I have developed a function that creates a negative of an Image. Although it technically works, its optimization is still a work in progress. I am still figuring out ways to optimize but the plan is to slowly extend the capabilities and make the library ready for general purpose GPU computation on Apple silicon devices. Perhaps one day, this library will mature to the point where I can finally just set the option “TargetDevice” -> “GPU” on the Apple silicon devices to train Neural Networks. The project is available at https://github.com/s4m13337/MetalLink and any contributions are welcome. Remarks Note 1: All the tests were done on a M1 chip MacBook with 8 GB RAM. This device uses a unified memory architecture and the maximum memory shared with the GPU is about 5 GB. Going beyond 5 GB requires further low level memory management strategies. Note 2: Just out of curiosity, I also tried testing the library on a MacBook with NVIDIA GPU. Although it works, subsequent computations produce garbage results. This is probably due to incompatibilities in the automatic reference counter / garbage collector of the compiler used.

POSTED BY: Soumya Sambeet Mohapatra

6 Replies

Sort By:

Kirill Vasin

Kirill Vasin, Augsburg University

Posted 1 year ago

I believe one should try to marry LibraryLink and WebGPU (this one for example). This will be a killer solution for all platforms at once, no issues with compilers, drivers and etc anymore

POSTED BY: Kirill Vasin

Ryan Wu

Posted 1 year ago

Hi, your GitHub link is unavailable.

POSTED BY: Ryan Wu

Soumya Sambeet Mohapatra

Soumya Sambeet Mohapatra, ADDITIVE GmbH

Posted 1 year ago

Hello Ryan, Thank you for your interest in this project. Currently, I am collaborating with my company on this project, which means the code cannot be made public just yet. However, I would be happy to send you an archived copy of the repository if you're interested. Let me know how I can share a copy if you are interested.

POSTED BY: Soumya Sambeet Mohapatra

Ryan Wu

Posted 1 year ago

Hi Soumya, Thank you for your reply. Please send your archived copy of the repository to my email. Thanks Ryan

POSTED BY: Ryan Wu

George Woodrow III

George Woodrow III, lifelong learner

Posted 1 year ago

Great. This is long overdue. I hope that Wolfram will 'seamlessly' link with Apple's Neural engine, and with the Apple Intelligence (I hope that there is a low level non-marketing name for this), so I can deal with LLMs and other things pretty much seamlessly in Wolfram Language.

POSTED BY: George Woodrow III

EDITORIAL BOARD

EDITORIAL BOARD, WOLFRAM

Posted 1 year ago

-- you have earned *Featured Contributor Badge* Your exceptional post has been selected for our editorial column *Staff Picks* http://wolfr.am/StaffPicks and Your Profile is now distinguished by a *Featured Contributor Badge* and is displayed on the Featured Contributor Board. Thank you!

POSTED BY: EDITORIAL BOARD

Reply to this discussion

Reply Preview

Attachments

Remove Add a file to this post

Follow this discussion

or Discard

Feedback

MetalLink: GPU computations on Apple Silicon

Introduction

Background

Design

Development

Conclusion

Remarks