Message Boards

Use ParallelTable to improve calculation speed?

Posted 3 months ago | 1119 Views | 24 Replies | 10 Total Likes

Mathematica V12:

The snapshot below is taken from a notebook. I want to improve the calculation speed of my notebook by changing the function Table to ParallelTable. Unfortunately, ParallelTable does not finish in a reasonable time. [screenshot: notebook with the Table/ParallelTable comparison]

Mathematica told me that 4 Kernels have been launched, but after more than 30 minutes I aborted the calculation. Why does it take so much time to create the table with several kernels?

24 Replies

How much memory (ram) does your data consume? How much memory is consumed during the actual computation? One guess is that the computation you are requesting inflates the memory so much that the paging file on your hard drive is being used to complete the computation.

Have you tried the ParallelTable Method option settings "FinestGrained" and "CoarsestGrained"? Either way, if the data-communication overhead is significant, ParallelTable will not be faster than Table.
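For reference, the grain size is controlled via the Method option. A minimal sketch, where `data` and `nrows` stand in for the poster's imported table and row count (illustrative only, not tested on the actual file):

```mathematica
(* Illustrative sketch: "data" and "nrows" are assumed to be the poster's
   imported table and its row count. "CoarsestGrained" sends one large
   chunk per kernel; "FinestGrained" dispatches one element at a time. *)
ParallelTable[AbsoluteTime[data[[n, 2]]], {n, 1, nrows},
   Method -> "CoarsestGrained"]; // AbsoluteTiming
ParallelTable[AbsoluteTime[data[[n, 2]]], {n, 1, nrows},
   Method -> "FinestGrained"]; // AbsoluteTiming
```

"CoarsestGrained" minimizes communication overhead and is usually the better choice when all iterations take roughly the same time.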

Posted 3 months ago
  • My desktop has 16 GB of RAM.
  • The CSV file is 15 MB in size and was imported via SemanticImport.
  • I have not worked with the above-mentioned options yet; I will try them later.
  • Attached is a screenshot of my Task Manager: [screenshot]

It is pretty much impossible to determine what the issues might be without actual input that reproduces the problem.

Have you tried something like:

times = ParallelMap[AbsoluteTime, data[[All, 2]]]

I would expect that to work. Also, what happens when you run your ParallelTable on a much smaller dataset?
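One way to test on a smaller dataset, assuming `data` is the Dataset produced by SemanticImport (a sketch, not verified against the actual file):

```mathematica
(* Take the first 1000 entries of column 2 as a plain list, then time
   both the serial and the parallel version on that small sample. *)
small = Normal[data[[;; 1000, 2]]];
Map[AbsoluteTime, small]; // AbsoluteTiming
ParallelMap[AbsoluteTime, small]; // AbsoluteTiming
```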

Posted 3 months ago

The command:

times = Map[AbsoluteTime, data[[All, 2]]] // AbsoluteTiming

runs very well and fast: 0.927 s.

Regarding the suggested ParallelMap command, I got the following message: [screenshot of the error message]

For your own investigations I have attached the CSV input file (15MB).

Please keep me informed about your investigations.

Attachments:

Aha! So it's Dataset that's being annoying here. Can you try it again with Normal[data[[All, 2]]]?

Posted 3 months ago

I am rather interested to know why the Dataset is the problem. Anyhow, I have made two tests:

  1. with the following command and result: [screenshot]

  2. with the ParallelMap command: [screenshot]

My observation is that this parallel command also takes more time than the single-kernel command.

Yes, doing things in parallel doesn't always make it faster. There's overhead involved with parallelizing the task. By the looks of it, your parallel timing includes the launch time of the kernels. Have you tried evaluating LaunchKernels[] first before doing the timing?

On my system:

dates = DateRange[Now - 200 Quantity[1, "Years"], Now];
LaunchKernels[];
Map[AbsoluteTime, dates]; // AbsoluteTiming
ParallelMap[AbsoluteTime, dates]; // AbsoluteTiming

Out[12]= {0.220474, Null}
Out[13]= {1.38644, Null}

So the ParallelMap is indeed slower. What I think is going on, is that the date objects are being sent over to the parallel kernels and then the kernels have to re-interpret them (since date objects are both containers and constructors) before they can do anything with them. So the parallel kernels spend a lot of time just receiving the data before they can actually get to work. Overhead like that is why parallelization is not always faster.

Posted 3 months ago

No, I did not launch the kernels before the calculation, but it is a good idea, especially when there are more tasks that should run in parallel.

Nevertheless, I am still looking for a (fast) ParallelTable solution. With the following lines of code I extract the month name and day name from the dataset. Without other changes, only these two lines of code currently deliver the needed results.

mvtMonth = Table[DateString[data[[n, 2]], {"MonthName"}], {n, 1, nrows}];
mvtWeekday = Table[DateString[data[[n, 2]], {"DayName"}], {n, 1, nrows}];

Map has difficulties with the Dataset and ParallelTable runs "forever". So, I can continue with my work, but I am still not happy with the calculation speed.
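One possible workaround, sketched under the assumption that `data` is the Dataset from SemanticImport: convert the date column to a plain list once with Normal, then map over it instead of indexing the Dataset row by row inside Table.

```mathematica
(* Convert the Dataset column to an ordinary list once... *)
dates = Normal[data[[All, 2]]];
(* ...then extract both fields with plain Maps, avoiding repeated
   Dataset part extraction inside a Table loop. *)
mvtMonth   = Map[DateString[#, {"MonthName"}] &, dates];
mvtWeekday = Map[DateString[#, {"DayName"}] &, dates];
```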

In parallel computing, much time can be spent distributing the data to all the kernels (making copies in memory for each kernel), so for large datasets this can take a while, sometimes even longer than the computation itself. I'm not sure why Mathematica takes so long for this, but I have experienced minutes for large datasets. And somehow the second run is always faster.

[screenshot: timings of the first and second parallel runs]

Also, your code produces errors that have to be printed to the screen; in parallel computing this can get ugly. Suppressing the messages makes ParallelTable faster.

[screenshots: ParallelTable timings with and without Quiet]

Posted 3 months ago

Thank you! Based on your suggestion it is now possible for me to run all parallel commands on the entire file in reasonable time. So, what is the difference apart from using the Quiet command? I worked entirely with Datasets, starting with the use of SemanticImport; you just imported the CSV file as-is (a List). My Mathematica knowledge and experience is still at a low level, but it seems to me now that Mathematica runs into trouble with large files, Datasets, and parallelization.

Are there different opinions?

A simple question, but hard to answer.

  1. The function Dataset is basically a wrapper that puts your data in a form that uses Association to label each value in your dataset, so that it becomes easier to extract data without being bothered to count columns and rows or to know where or what your data is. Basically, Mathematica figures it out for you.
  2. Your code has a date format that is ambiguous, e.g. 12/10 can be the 12th of October or the 10th of December, so it generates warnings that it wants to print to the front end (suppressed by Quiet), making it slow. In my code I explicitly provide the date format to expect, so Mathematica does not have to figure it out by itself; therefore it becomes faster.
  3. Mathematica is a smart language that can figure out a lot about your data and handle it accordingly. However, this takes time. Data is just numbers, and if you tell Mathematica how to treat those numbers, it does not have to figure it out by itself; therefore it becomes faster.

So basically, the more you think for yourself and provide Mathematica with what it needs to know, the faster it becomes. If you want Mathematica to figure it out on its own, it will most of the time, but it becomes slower and unexpected things can happen.
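A tiny illustration of point 1 above (with made-up data):

```mathematica
(* Dataset wraps a list of Associations, so columns are addressed
   by name instead of by position. *)
ds = Dataset[{
    <|"City" -> "Berlin",  "Population" -> 3669000|>,
    <|"City" -> "Hamburg", "Population" -> 1845000|>}];
ds[All, "Population"]   (* extract a column by name, no index counting *)
```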

Without being given the correct input, Mathematica interprets your data as year/month/day, if I'm not mistaken. But your data is formatted as month/day/year, so your code actually gives wrong output. If you define a month interpreter yourself, you can do it much faster. Observe that this code runs the entire dataset, 190K values, in the same time as the DateString function takes for only 10K values. But without being given the correct input, DateString's output is wrong.

[screenshot: timing with a custom month interpreter vs. DateString]
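For completeness, the documented way to pass an explicit date format looks like this (the sample string is made up; the file is assumed to use month/day/year order):

```mathematica
(* An explicit format specification removes the month/day ambiguity
   and skips the automatic format detection. *)
AbsoluteTime[{"12/10/2015", {"Month", "/", "Day", "/", "Year"}}]
```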

Posted 3 months ago

In general I agree with your statement, BUT please have a look at the next screenshot. I used SemanticImport for a good reason: this command creates a Dataset from the CSV input and automatically transforms the date column to a Mathematica date representation. Therefore, at first I do not have to think about the date and time formats. I have copied the CSV file's content format into the notebook as well.

If we limit the number of rows to a certain amount, all the discussed functions work pretty well. There is also no need to apply the Quiet function, because there are no warnings. The parallel commands work without any problem.

So, this leads me to repeat my statement: "… Mathematica is going into trouble with large files, datasets and parallelization…." [screenshot]

My apologies, I missed the ability of SemanticImport to handle the dates correctly.

But yes, I agree with your statement, and hence my explanation that using all of Mathematica's advanced functions can indeed make it very slow! For example, your 15 MB dataset becomes over 400 MB using SemanticImport, and once distributed over all parallel kernels it was able to eat up a whopping 3.5 GB of memory.

[screenshots: memory usage after SemanticImport and after distribution to parallel kernels]

Processing large datasets is, in my opinion, possible, but in my experience not by using the "make my life easy" functions. I work with medical imaging data, several GB at a time, and Mathematica does the job fine as long as I keep treating the data as plain numbers, strings, and lists.

I think you just hit the boundary of Mathematica's applicability. In this case it is represented by a "large" (15 MB... yes, fifteen megabytes!!!???) dataset. So, welcome to Wolfram's world of "big data".

This is the main reason why it is not possible to use Mathematica for nearly any kind of serious work. But on the other hand, you can be happy with all those bells and whistles, which are presented as the result of 30 years of continuous development.

Your problem is a typical example of the well-known fact that simple commands (presented in tutorials) are often not suitable for any kind of generalisation. If you need something a bit faster, bigger, etc., you must use a surprisingly complicated solution.

Posted 3 months ago

I hope your statement is incorrect, because I have made an investment to buy state-of-the-art software that allows me to do everything up to deep learning.

We are still talking here about the basics of data analysis, not rocket science. The CSV file comes from an R course I took many years ago. In the meantime I have done the same analysis with Matlab and Maple without any problem, and I thought I could redo this training unit with Mathematica. Okay, Matlab and Maple do not offer direct parallel commands, at least as far as my current knowledge goes.

I have to admit that I assumed I was buying a stable and robust software package, but it seems to me that this assumption is wrong. On the other hand, I am wondering why the Wolfram team does not raise its hand to convince all readers that your (and my) statement is incorrect.

Posted 3 months ago

Personally, I am afraid that you will be disappointed by Wolfram support, which typically responds with total silence when you hit a real issue.

Maybe your problem will be solved in the next release, maybe not. There are some bugs that have remained unsolved for many years.

Just for your info, Matlab and Maple do offer direct parallelization, of course.

As of now (start of 2019), Dataset is not really designed for large datasets. It has a lot of overhead, which is what you are experiencing. Having rectangular arrays all of the same type, and using packed arrays, is the best way to handle large datasets.
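A sketch of that advice: once the dates are reduced to machine reals, they can be stored as a packed array (here `dates` is assumed to be a list of DateObjects):

```mathematica
(* Packed arrays store homogeneous machine numbers compactly and are
   cheap to ship to parallel kernels. *)
times = Developer`ToPackedArray[N[Map[AbsoluteTime, dates]]];
Developer`PackedArrayQ[times]   (* True if packing succeeded *)
```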

Datasets are convenient and very flexible, but that convenience comes at a cost. The same holds for DateObjects. If you leave them as strings, the speed-up is as expected:

dates = DateRange["jan. 1st, 2001", "2019"];
Map[AbsoluteTime, dates]; // AbsoluteTiming
ParallelMap[AbsoluteTime, dates, Method -> "CoarsestGrained"]; // AbsoluteTiming

But note that interpreting a date string into an AbsoluteTime is a much tougher operation than converting a DateObject to an AbsoluteTime. The heavy lifting (the string interpretation) had already been done, e.g. by SemanticImport.

Apparently Dataset is not designed for small datasets either (a 15 MB CSV file???!!!).

Once again, welcome to Wolfram's world of big data processing!

Posted 3 months ago

Thank you, Sander,

I have no problem importing and working with data in a different way rather than using Datasets. But as a novice in Mathematica programming I would be happy to see an official statement somewhere that everything related to Datasets is still under development and should only be applied to files up to ...kByte. From my point of view it is neither good nor fair to avoid all communication about known bugs. Each and every piece of software has bugs, but it is a matter of customer support, and ultimately customer satisfaction, whether you let customers fall into a trap or make clear what the product boundaries are. But for the Wolfram team it seems to have been impossible so far to admit there is an issue. In Mathematica we would describe it in the following way:

Association[Key -> "Communication"]

The result of the entire discussion for me is that I will simply forget the Wolfram Dataset philosophy, and I hope the next (to me unknown) known bug is not already waiting for me around the corner.

The higher the level of programming you go to, the more memory and overhead you need. Dataset is the newest data-storage mechanism and is very high level, so expect very convenient querying but heavy memory use… I would not call it a bug, just the nature of the beast.

Good luck!

Not a bug! Are you kidding? This kind of "new" functionality is completely unusable.

You may find the Wolfram-U material available via the following links helpful.

Advanced programming

Parallel computing
