
Use ParallelTable to improve calculation speed?

Posted 6 years ago
POSTED BY: Jürgen Kanz
24 Replies

My apologies, I missed the functionality of SemanticImport to handle the dates correctly.

But yes, I agree with your statement, hence my explanation that using all of Mathematica's advanced functions can indeed make things very slow! For example, your 15 MB dataset becomes over 400 MB after SemanticImport, and once distributed over all parallel kernels it managed to eat up a whopping 3.5 GB of memory.

[Screenshots: memory usage of the main kernel and the parallel kernels]

Processing large datasets is, in my opinion, possible, but in my experience not with the "make my life easy" functions. I work with medical imaging data, several GB at a time, and Mathematica does the job fine as long as I keep treating the data as plain numbers, strings, and lists.
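
As a minimal sketch of that plain-data approach (the file name and date format here are hypothetical, not from this thread): import the CSV as raw lists instead of using SemanticImport, check the size difference, and convert the date strings to plain numbers once.

raw = Import["data.csv", "CSV"]; (* plain lists of numbers and strings *)
ByteCount[raw] (* compare with ByteCount of the SemanticImport result *)
times = AbsoluteTime[{#, {"Year", "-", "Month", "-", "Day"}}] & /@ raw[[2 ;;, 2]]; (* numeric seconds, no DateObject *)

From there on everything is plain machine numbers, which both Map and the parallel functions handle cheaply.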

POSTED BY: Martijn Froeling

Yes, doing things in parallel doesn't always make them faster. There's overhead involved in parallelizing the task. By the looks of it, your parallel timing includes the launch time of the kernels. Have you tried evaluating LaunchKernels[] first, before doing the timing?

On my system:

dates = DateRange[Now - 200 Quantity[1, "Years"], Now];
LaunchKernels[];
Map[AbsoluteTime, dates]; // AbsoluteTiming
ParallelMap[AbsoluteTime, dates]; // AbsoluteTiming

Out[12]= {0.220474, Null}
Out[13]= {1.38644, Null}

So the ParallelMap is indeed slower. What I think is going on is that the date objects are being sent over to the parallel kernels, which then have to re-interpret them (since date objects are both containers and constructors) before they can do anything with them. So the parallel kernels spend a lot of time just receiving the data before they can actually get to work. Overhead like that is why parallelization is not always faster.
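
One way to probe that hypothesis (my own sketch, not from the original thread): send the dates as plain ISO strings, which are cheap to transfer, and compare again.

strs = DateString[#, "ISODateTime"] & /@ dates; (* plain strings instead of DateObject expressions *)
Map[AbsoluteTime, strs]; // AbsoluteTiming
ParallelMap[AbsoluteTime, strs]; // AbsoluteTiming

If the string version parallelizes noticeably better, the transfer and re-interpretation of the DateObject expressions is indeed the bottleneck.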

POSTED BY: Sjoerd Smit

It is pretty much impossible to determine what the issues might be without actual input that shows the problem.

POSTED BY: Daniel Lichtblau

Not a bug! Are you kidding? This kind of "new" functionality is completely unusable.


Dataset is obviously not designed for small datasets (a 15 MB CSV file???!!!) either.

Once again, welcome to Wolfram's world of big data processing!

Just for your info, MATLAB and Maple offer direct parallelization, of course.

Posted 6 years ago
POSTED BY: Updating Name

Thank you! Based on your suggestion it is now possible for me to run all the parallel commands on the entire file in acceptable times. So, what is the difference, apart from using the Quiet command? I had worked entirely with Datasets, starting with the use of SemanticImport, whereas you imported the CSV file as is (a List). My Mathematica knowledge and experience are still at a low level, but it seems to me now that Mathematica gets into trouble with large files, Datasets, and parallelization.

Are there different opinions?

POSTED BY: Jürgen Kanz

No, I did not launch the kernels before the calculation, but it is a good idea, especially when there are more tasks that should run in parallel.

Nevertheless, I am still looking for a (fast) ParallelTable solution. With the following two lines of code I extract the month name and day name from the dataset. Without any other changes, these two lines currently deliver the needed results:

mvtMonth = Table[DateString[data[[n, 2]], {"MonthName"}], {n, 1, nrows}];
mvtWeekday = Table[DateString[data[[n, 2]], {"DayName"}], {n, 1, nrows}];

Map has difficulties with the Dataset, and ParallelTable runs "forever". So I can continue with my work, but I am still not happy with the calculation speed.
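
A possible speedup, sketched along the lines of the Normal[data[[All, 2]]] suggestion elsewhere in this thread (untested on this particular file): pull the column out of the Dataset once, then map over the resulting plain list, so the Dataset wrapper is not re-entered for every row.

col = Normal[data[[All, 2]]]; (* extract the date column once as a plain list *)
mvtMonth = Map[DateString[#, {"MonthName"}] &, col];
mvtWeekday = Map[DateString[#, {"DayName"}] &, col];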

POSTED BY: Jürgen Kanz

I am rather interested to know why the Dataset is the problem. Anyhow, I have made two tests:

  1. with the following command and result: [screenshot]

  2. with the ParallelMap command: [screenshot]

My observation is that this parallel command also takes more time compared to the single-kernel command.

POSTED BY: Jürgen Kanz

Aha! So it's Dataset that's being annoying here. Can you try it again with Normal[data[[All, 2]]]?

POSTED BY: Sjoerd Smit

The command:

times = Map[AbsoluteTime, data[[All, 2]]] // AbsoluteTiming

runs very well and fast: 0.927 sec.

Regarding the suggested ParallelMap command, I got the following message: [screenshot of error message]

For your own investigations I have attached the CSV input file (15 MB).

Please keep me informed about your investigations.

Attachments:
POSTED BY: Jürgen Kanz
  • My desktop is running with 16 GB RAM.
  • The CSV file has a size of 15 MB and was imported via SemanticImport.
  • Until now I have not worked with the above-mentioned options; I will try them later.
  • Attached is a screenshot of my Task Manager: [screenshot]
POSTED BY: Jürgen Kanz

How much memory (RAM) does your data consume? How much memory is consumed during the actual computation? One guess is that the computation you are requesting inflates the memory so much that the paging file on your hard drive is being used to complete the computation.
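
To check this, the kernel's own bookkeeping functions can help (a generic sketch, not specific to this dataset):

ByteCount[data] (* in-memory size of the imported data *)
MemoryInUse[] (* bytes currently in use by this kernel *)
MaxMemoryUsed[] (* peak bytes used since the session started *)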

Have you tried the ParallelTable Method settings "FinestGrained" and "CoarsestGrained"? Either way, if the data-communication overhead is significant, ParallelTable will not be faster than Table.
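
For reference, those settings are values of the Method option; f and the iterator range below are placeholders:

ParallelTable[f[n], {n, 1, 10^6}, Method -> "CoarsestGrained"] (* few large batches, low communication *)
ParallelTable[f[n], {n, 1, 10^6}, Method -> "FinestGrained"] (* many small pieces, better load balancing *)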

POSTED BY: Todd Allen