
Perform a ParallelTable operation on a multicore machine?

Posted 7 years ago

Dear Friends,

I am trying to evaluate a table over two variables (theta, phi) on a 4-core i7 Windows computer.

For a 2X2 grid (4 elements; phi and theta take 2 values each), my system takes about 10 minutes to produce the output with ParallelTable.

When I move to a finer 4X4 grid (16 elements; phi and theta each take 4 values in the same range), I expect ParallelTable to finish in roughly 40 minutes, since the number of evaluation points has increased fourfold. However, ParallelTable takes more than 3 hours to produce the output, even though there is no reason to believe that the extra points need any more time to evaluate.
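To check whether the slowdown comes from the computation itself or from the parallel setup, one option is to time a dummy workload with identical cost per point on both grids. A minimal sketch, assuming a hypothetical `slowPoint` that just pauses for one second (it is not part of the original code):

```mathematica
(* Hypothetical stand-in for the expensive per-point calculation:
   every call costs the same (one second), so the grids are directly comparable. *)
slowPoint[phi_, theta_] := (Pause[1]; Sin[phi] Cos[theta])
DistributeDefinitions[slowPoint];

(* Time an n x n grid: phi runs over [0, Pi/4] and theta over
   [0, ArcCot[Cos[phi]]], each sampled at n points. *)
timeGrid[n_Integer /; n >= 2] := First@AbsoluteTiming[
   ParallelTable[slowPoint[phi, theta],
    {phi, 0, Pi/4, Pi/(4 (n - 1))},
    {theta, 0, ArcCot[Cos[phi]], ArcCot[Cos[phi]]/(n - 1)}]]

{timeGrid[2], timeGrid[4]}
```

If the points really are independent, the 4X4 time should grow roughly with the number of points divided by the number of kernels; a much larger jump suggests overhead in the parallel setup rather than in the arithmetic.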

I would appreciate any suggestions on this problem. Thanks

POSTED BY: S G
9 Replies
Posted 7 years ago

I don't know if there's a lot anyone can say without actually seeing the code you're working with. Are you able to post a minimal example of what you're doing that demonstrates what you'd like to solve?

POSTED BY: Kyle Martin

Without any further details, all I can say is: please make sure that your operation is parallelized correctly. There are many reasons why parallelization can fail to be efficient. A very common cause is sharing values between subkernels: the moment the kernels need to start talking to each other (or to the master kernel), performance often drops dramatically. For the best performance, make sure that each task sent to the subkernels is independent of the others.
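The cost of kernels talking to each other can be made visible with a small experiment. A minimal sketch using the built-in SetSharedVariable; the variable `offset` and the toy computation are hypothetical, and actual timings will depend on the machine:

```mathematica
(* Launch the subkernels first so that start-up time is not included in the timings. *)
If[$KernelCount == 0, LaunchKernels[]];

offset = 1;

(* Independent tasks: the value of offset is copied to each subkernel,
   so every evaluation only uses local data. *)
{tLocal, local} = AbsoluteTiming[ParallelTable[offset + k^2, {k, 1, 2000}]];

(* Shared state: every read of a shared variable is a request to the master kernel. *)
SetSharedVariable[offset];
{tShared, shared} = AbsoluteTiming[ParallelTable[offset + k^2, {k, 1, 2000}]];
UnsetShared[offset];

(* Same results; tShared is typically much larger than tLocal. *)
{tLocal, tShared, local === shared}
```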

Another common reason is simply failing to distribute the correct definitions to the subkernels. If that happens, there is effectively no parallelization at all, since the subkernels return the expressions unevaluated and leave it to the master kernel to actually evaluate everything.
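A minimal sketch of distributing a definition explicitly with the built-in DistributeDefinitions; `f` is a hypothetical helper. As far as I know, recent versions already distribute definitions from the current context automatically (governed by `$DistributedContexts`), so an explicit call matters most for definitions that live in other contexts, but it does no harm:

```mathematica
If[$KernelCount == 0, LaunchKernels[]];

(* Hypothetical helper defined on the master kernel. *)
f[x_] := x^2 + 1

(* Copy the definition of f (and of symbols it depends on) to all subkernels. *)
DistributeDefinitions[f];

(* Each subkernel reports whether it now has a definition for f. *)
ParallelEvaluate[DownValues[f] =!= {}]

(* The subkernels can now evaluate f themselves instead of returning f[x] unevaluated. *)
ParallelTable[f[x], {x, 1, 8}]
```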

POSTED BY: Sjoerd Smit
Posted 7 years ago

Thanks for the response. Attached is the original code. I am not giving a simplified version because it would miss the point: in both cases the run time would be too short to tell the difference. Please ignore everything except the last two lines. XX evaluates XXXX at four points, while YY does so at 16 points in the same range of theta and phi. On my 4-core computer (Windows, i7), XX takes 10 minutes, while YY takes about 3 h 5 min. Once again, I appreciate your help.

POSTED BY: S G

By the looks of it, your code can be optimized quite a bit. It's a bit too much to go through everything, but I really recommend you avoid For loops in Mathematica: they are rather inefficient and there are simply better tools to use. Here are some references on good and efficient Mathematica programming practices:

Now as for why your code doesn't work well in parallel: I'm going to hazard a guess and say that one of the problems is your use of the capital K as a variable. Mathematica uses this symbol (like many other capitals, such as N, C and D), which you can tell by the fact that it turns black when you type it in a notebook (unlike other symbols, which start blue and only turn black when you give them a value). I'm guessing that updating K causes the subkernels to talk to each other, though I might be wrong. However, your code is difficult to read and there may be numerous other inefficiencies and problems that I can't find without really taking it apart.
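Besides the notebook colouring, one way to check whether a name is already taken by the system is to ask whether it exists in the System` context. A small sketch; `builtInNameQ` is a hypothetical helper name:

```mathematica
(* Hypothetical helper: True if name refers to a built-in System` symbol.
   Working with strings avoids evaluating (or accidentally creating) the symbol. *)
builtInNameQ[name_String] := Names["System`" <> name] =!= {}

builtInNameQ /@ {"K", "N", "C", "D", "myK"}
```

Since every built-in symbol starts with a capital letter or `$`, names that start with a lowercase letter, such as `k`, can never clash with them.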

POSTED BY: Sjoerd Smit
Posted 7 years ago

Thanks Sjoerd. I went through one of the links and tried to fix a few of the things you suggested: I changed the "For" loops to "Do" and replaced the variable "K", but neither made any major difference to the processing speed. I would be thankful for any further suggestions. Attached is the modified *.nb file.

One more question: if some inefficiency in the code is causing ParallelTable to be slow, why is the effect so much more pronounced for the 4X4 grid (taking > 3 h) than for the 2X2 grid (taking only 10 minutes)?

POSTED BY: S G

Like I explained earlier, you have to be absolutely sure that there are no dependencies between the calculations you're trying to parallelize. So far I haven't been able to analyze your code in enough detail to be able to rule this out as a problem. Currently I don't have the time, but I hope I'll have another shot later.

However, here's another quick check you can do in the meantime: use your operating system's monitor to check how much of your CPU and RAM is being used during the computations. It might be something as simple as running out of RAM, in which case splitting the load over different CPUs isn't going to do you any good (which also illustrates how inefficient code can undermine your parallelization attempt).
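As a complement to the operating system's monitor, the kernels themselves can report how many of them are running and how much memory each one uses. A minimal sketch using built-in functions:

```mathematica
(* Launch the configured subkernels if none are running yet. *)
If[$KernelCount == 0, LaunchKernels[]];

(* Number of subkernels actually available for ParallelTable. *)
$KernelCount

(* Memory currently in use, in MB, on the master kernel and on each subkernel. *)
N[MemoryInUse[]/2^20]
ParallelEvaluate[{$KernelID, N[MemoryInUse[]/2^20]}]
```

If `$KernelCount` is lower than the number of cores, or one subkernel's memory keeps growing while the others stay flat, that may hint that the work is not being spread evenly across the kernels.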

POSTED BY: Sjoerd Smit
Posted 7 years ago

Hi Sjoerd, I have been noting down the CPU and RAM usage for each run. CPU usage varies between 58% and 67% for both the 2X2 and 4X4 grid calculations, while RAM usage is stable at about 320 MB and 430 MB, respectively (the system has 16 GB of RAM in total). Thanks

POSTED BY: S G
Posted 7 years ago

Dear Sjoerd, I am still struggling with this code; it does not run the way it should on a 4-core computer. I have tried many things in the meantime, but nothing seems to work.

I would appreciate it if you could spare some time to offer some advice on this. Thanks

POSTED BY: S G

Hi S G. I appreciate your problem, but unfortunately I don't have that much time to spare and your code is simply quite difficult to read. The only way I could really assess the problem is by taking it apart completely and rewriting it, which is a lot of work.

However, I noticed one more thing in your code that is almost certainly problematic: the definition of XXXX contains an assignment to the function Et[m]. This function is used to define other functions elsewhere, and I'm fairly certain that these assignments get communicated back and forth between kernels. Assignments like these shouldn't be made inside code that is evaluated in parallel. Your code has a slew of other side effects (meaning that non-local variables and functions are defined during the parallel evaluation), and any of them could be the cause of the poor performance.
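The difference between a helper that is assigned globally during the parallel run and one that is localized can be sketched as follows. Both functions are hypothetical stand-ins: `Et` only mimics the name used in the original code, and the formula is made up:

```mathematica
(* Side-effecting version: every call (re)defines the global function Et
   on whichever kernel happens to evaluate it. *)
pointValueGlobal[phi_, theta_] := (
  Et[m_] := Cos[m phi] Sin[theta];
  Sum[Et[m], {m, 1, 5}])

(* Side-effect-free version: the helper et and the index m are local to Module,
   so nothing outside the call is modified while the subkernels run. *)
pointValue[phi_, theta_] := Module[{et, m},
  et[m_] := Cos[m phi] Sin[theta];
  Sum[et[m], {m, 1, 5}]]

(* Only the side-effect-free version is evaluated in parallel here. *)
grid = ParallelTable[pointValue[0.2 i, 0.2 j], {i, 0, 3}, {j, 0, 3}];
```

After a parallel run of `pointValueGlobal`, each subkernel would be left holding its own definitions for `Et` (visible with `ParallelEvaluate[DownValues[Et]]`), whereas `pointValue` leaves nothing behind that later evaluations could depend on.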

I recommend you restructure your code into a single function of phi and theta. To be on the safe side, this function should not depend on any global variables, so every internal variable and function should be localized with Module. You'd end up with code that looks roughly like this:

myFunction[phi_, theta_] := Module[{
  var1, var2, ...,
  fun1, fun2, ....
},
  ...
];

In the definition of myFunction there shouldn't be any blue symbols left at the end; it should be entirely self-contained. After that, you can just call it with:

ParallelTable[{myFunction[phi, theta], phi, theta}, {phi, 0, Pi/4, Pi/12}, {theta, 0, ArcCot[Cos[phi]], ArcCot[Cos[phi]]/3}]
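A minimal, runnable instance of this template, assuming a made-up integral in place of the real calculation (`g`, `a`, `b`, `x` and the formula are hypothetical). The `?NumericQ` restrictions are a design choice that keeps NIntegrate from being called with symbolic arguments:

```mathematica
(* Hypothetical self-contained per-point function: every helper and
   intermediate value is localized with Module, and nothing global is assigned. *)
myFunction[phi_?NumericQ, theta_?NumericQ] := Module[{g, a, b, x},
  g[u_] := Exp[-u^2];                          (* local helper function *)
  a = NIntegrate[g[x] Cos[phi x], {x, 0, 5}];  (* made-up integral *)
  b = Sin[theta]^2;
  a b]

(* Make sure the subkernels know myFunction (harmless if this happens automatically). *)
DistributeDefinitions[myFunction];

grid = ParallelTable[{myFunction[phi, theta], phi, theta},
   {phi, 0, Pi/4, Pi/12},
   {theta, 0, ArcCot[Cos[phi]], ArcCot[Cos[phi]]/3}];

Dimensions[grid]
```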
POSTED BY: Sjoerd Smit
