
Use ParallelTable to improve calculation speed?

Posted 5 years ago

Mathematica V12:

The snapshot below is taken from my notebook. I want to improve the calculation speed of the notebook by changing the function Table to ParallelTable. Unfortunately, ParallelTable does not finish in a reasonable amount of time.

[screenshot: the notebook code using Table and ParallelTable]

Mathematica told me that 4 Kernels have been launched, but after more than 30 minutes I aborted the calculation. Why does it take so much time to create the table with several kernels?

POSTED BY: Jürgen Kanz
24 Replies

My apologies, I missed the functionality of SemanticImport to handle the dates correctly.

But yes, I agree with your statement, and hence my explanation: using all of Mathematica's advanced convenience functions can indeed make it very slow! For example, your 15 MB dataset becomes over 400 MB after SemanticImport, and once distributed over all parallel kernels it was able to eat up a whopping 3.5 GB of memory.

[screenshots: memory usage of the Dataset before and after distribution to the parallel kernels]

Processing large datasets is, in my opinion, possible, but in my experience not with the "make my life easy" functions. I work with medical imaging data, several GB at a time, and Mathematica does the job fine as long as I keep treating the data as plain numbers, strings, and lists.

POSTED BY: Martijn Froeling

As of now (early 2019), Dataset is not really designed for large data. It has a lot of overhead, which is what you are feeling. Keeping rectangular arrays of a single type, stored as packed arrays, is the best way to handle large datasets.
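For instance, you can check whether data is stored as a packed array (a small sketch with made-up data):

arr = N@Range[10^6];
Developer`PackedArrayQ[arr]            (* True: flat machine-real storage *)
Developer`PackedArrayQ[{1, "a", 2.}]   (* False: mixed types cannot be packed *)
ByteCount[arr]                         (* roughly 8 bytes per element plus a small header *)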

Datasets are convenient and very flexible, but that convenience comes at a cost. The same holds for DateObjects. If you leave them as strings, the speed-up is as expected:

dates = DateRange["jan. 1st, 2001", "2019"];
Map[AbsoluteTime, dates]; // AbsoluteTiming
ParallelMap[AbsoluteTime, dates, Method -> "CoarsestGrained"]; // AbsoluteTiming

But note that interpreting a date string into an AbsoluteTime is a much tougher operation than converting a DateObject to an AbsoluteTime. That heavy lifting (the string interpretation) you had already done, e.g. via SemanticImport.
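A rough sketch of that difference on a single date (both forms give the same number, only the work involved differs):

AbsoluteTime[DateObject[{2019, 1, 1}]] // RepeatedTiming   (* cheap: already a structured date *)
AbsoluteTime["January 1, 2019"] // RepeatedTiming          (* slower: full string interpretation *)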

POSTED BY: Sander Huisman

It is pretty much impossible to determine what might be issues without actual input that shows the problem.

POSTED BY: Daniel Lichtblau

Have you tried something like:

times = ParallelMap[AbsoluteTime, data[[All, 2]]]

I would expect that to work. Also, what happens when you run your ParallelTable on a much smaller dataset?
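For instance, something like this (a sketch; the data symbol and column index are taken from the line above):

(* time the same kind of parallel call on just the first 1000 rows *)
ParallelTable[AbsoluteTime[data[[n, 2]]], {n, 1, 1000}] // AbsoluteTiming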

POSTED BY: Sjoerd Smit

Yes, doing things in parallel doesn't always make it faster. There's overhead involved with parallelizing the task. By the looks of it, your parallel timing includes the launch time of the kernels. Have you tried evaluating LaunchKernels[] before doing the timing?

On my system:

dates = DateRange[Now - 200 Quantity[1, "Years"], Now];
LaunchKernels[];
Map[AbsoluteTime, dates]; // AbsoluteTiming
ParallelMap[AbsoluteTime, dates]; // AbsoluteTiming

Out[12]= {0.220474, Null}
Out[13]= {1.38644, Null}

So the ParallelMap is indeed slower. What I think is going on is that the date objects are being sent over to the parallel kernels, and then the kernels have to re-interpret them (since date objects are both containers and constructors) before they can do anything with them. So the parallel kernels spend a lot of time just receiving the data before they can actually get to work. Overhead like that is why parallelization is not always faster.

POSTED BY: Sjoerd Smit

In parallel computing much time can be spent distributing the data to all the kernels (making copies in memory for each kernel), so for large datasets this can take a while, sometimes even longer than the computation itself. I'm not sure why Mathematica takes so long for this, but I have experienced minutes for large datasets. And somehow the second run is always faster.
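A sketch of paying that cost only once, up front (assuming data has already been imported in the main kernel and nrows is its row count):

LaunchKernels[];
DistributeDefinitions[data];   (* copy the data to every subkernel once *)
ParallelTable[AbsoluteTime[data[[n, 2]]], {n, 1, nrows}] // AbsoluteTiming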

[screenshot: timings of the first and second parallel runs]

Also, your code produces errors that have to be printed to the screen; in parallel computing this can get ugly. Suppressing the errors makes ParallelTable faster.

[screenshots: ParallelTable timings with and without the error messages suppressed]
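Something along these lines (a sketch; data, nrows and the DateString call are assumed from your notebook):

(* Quiet inside the parallel body keeps the subkernels from relaying warnings to the front end *)
mvtMonth = ParallelTable[Quiet@DateString[data[[n, 2]], {"MonthName"}], {n, 1, nrows}];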

POSTED BY: Martijn Froeling

You may find the Wolfram-U material available via the following links helpful.

Advanced programming

Parallel computing

POSTED BY: Ian Williams

How much memory (RAM) does your data consume? How much memory is consumed during the actual computation? One guess is that the computation you are requesting inflates the memory so much that the paging file on your hard drive is being used to complete the computation.

Have you tried the ParallelTable Method option values "FinestGrained" and "CoarsestGrained"? Either way, if the data communication overhead is significant, ParallelTable will not be faster than Table.
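A sketch of how they would be used (assuming dates is a plain list of DateObjects):

ParallelMap[AbsoluteTime, dates, Method -> "CoarsestGrained"] // AbsoluteTiming  (* few large batches *)
ParallelMap[AbsoluteTime, dates, Method -> "FinestGrained"] // AbsoluteTiming    (* many small batches *)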

POSTED BY: Todd Allen
  • My desktop is running with 16 GB of RAM.
  • The CSV file has a size of 15 MB and was imported via SemanticImport.
  • Until now I have not worked with the above-mentioned options. I will do so later.
  • Attached is a screenshot of my Task Manager: [screenshot: Task Manager memory usage]
POSTED BY: Jürgen Kanz

The command:

times = Map[AbsoluteTime, data[[All, 2]]] // AbsoluteTiming

runs well and fast: 0.927 sec.

Regarding the suggested ParallelMap command, I got the following message:

[screenshot: error message from the ParallelMap call]

For your own investigations I have attached the CSV input file (15MB).

Please keep me informed about your investigations.

Attachments:
POSTED BY: Jürgen Kanz

Aha! So it's Dataset that's being annoying here. Can you try it again with Normal[data[[All, 2]]]?
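Something like this (a sketch; same column index as before):

(* pull the column out of the Dataset into a plain list first *)
times = ParallelMap[AbsoluteTime, Normal[data[[All, 2]]]]; // AbsoluteTiming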

POSTED BY: Sjoerd Smit

I am rather interested to know why the Dataset is the problem. Anyhow, I have made two tests:

  1. with the following command and result: [screenshot: command and timing]

  2. with the ParallelMap command: [screenshot: ParallelMap command and timing]

My observation is that this parallel command also takes more time than the single-kernel command.

POSTED BY: Jürgen Kanz

No, I did not launch the kernels before the calculation, but it is a good idea, especially when there are more tasks that should run in parallel.

Nevertheless, I am still looking for a (fast) ParallelTable solution. With the following lines of code I extract the month name and day name from the dataset. Without other changes, only these two lines of code currently deliver the needed results.

mvtMonth = Table[DateString[data[[n, 2]], {"MonthName"}], {n, 1, nrows}];
mvtWeekday = Table[DateString[data[[n, 2]], {"DayName"}], {n, 1, nrows}];

Map has difficulties with the Dataset and ParallelTable runs "forever". So I can continue with my work, but I am still not happy with the calculation speed.

POSTED BY: Jürgen Kanz

Thank you! Based on your suggestion it is possible for me to run all parallel commands on the entire file in reasonable time. So, what is the difference, apart from using the Quiet command? I worked entirely with Datasets, starting with the use of SemanticImport; you just imported the CSV file as is (a List). My Mathematica knowledge and experience is still at a low level, but it now seems to me that Mathematica gets into trouble with large files, Datasets, and parallelization.

Are there different opinions?

POSTED BY: Jürgen Kanz

I think you just hit the boundary of Mathematica's applicability. In this case it is represented by a "large" (15 MB... yes, fifteen megabytes!!!???) dataset. So, welcome to Wolfram's world of "big data".

This is the main reason why it is not possible to use Mathematica for nearly any kind of serious work. But on the other hand, you can be happy with all the bells and whistles that are presented as the result of 30 years of continuous development.

Your problem is a typical example of the well-known fact that the simple commands (presented in tutorials) are often not suitable for any kind of generalisation. If you need something a bit faster, bigger, etc., you must use a surprisingly complicated solution.

A simple question but hard to answer.

  1. The function Dataset is basically a wrapper that puts your data in a form that uses Association to label each value in your dataset, so that it becomes easier to extract data without having to count columns and rows or to know where or what your data is. Basically, Mathematica figures it out for you.
  2. Your code has a date format that is ambiguous, e.g. 12/10 can be the 12th of October or the 10th of December, so it generates warnings that it wants to print to the front end (suppressed by Quiet), making it slow. In my code I explicitly provide the date format to expect, so that Mathematica does not have to figure it out by itself; therefore it becomes faster.
  3. Mathematica is a smart language that can figure out a lot about your data and handle it accordingly. However, this takes time. Data is just numbers, and if you tell Mathematica how to treat those numbers, it does not have to figure it out by itself; therefore it becomes faster.

So basically, the more you think for yourself and provide Mathematica with what it needs to know, the faster it becomes. If you want Mathematica to figure it out on its own, it will most of the time, but it becomes slower and unexpected things can happen.

Without being given the correct input, Mathematica interprets your data as year/month/day, if I'm not mistaken. But your data is formatted as month/day/year, so your code actually gives wrong output. If you define a month interpreter yourself, you can do it much faster. Observe that this code runs over the entire dataset, 190K values, in the same time as the DateString function takes for only 10K values. But without being given the correct input, the DateString output is wrong.

[screenshot: custom month interpreter timed against DateString]
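A minimal sketch of that idea, assuming the raw column is a list of month/day/year strings (the file name and column position here are placeholders):

raw = Import["movements.csv"][[2 ;;, 2]];   (* hypothetical file name and column *)
(* give the parser the explicit date format instead of letting it guess *)
times = AbsoluteTime[{#, {"Month", "/", "Day", "/", "Year"}}] & /@ raw;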

POSTED BY: Martijn Froeling

In general I agree with your statement, BUT please have a look at the next screenshot. I used SemanticImport for a good reason: this command creates a Dataset from the CSV input and automatically transforms the date column into a Mathematica date representation. Therefore, at first, I do not have to think about the date and time formats. I have copied the CSV file's content format into the notebook as well.

If we limit the number of rows to a certain amount, all of the discussed functions work pretty well. There is also no need to apply the Quiet function, because there are no warnings. The parallel commands work without any problem.

So this leads me to repeat my statement: "... Mathematica gets into trouble with large files, Datasets, and parallelization ..."

[screenshot: the same commands run on a row-limited dataset]

POSTED BY: Jürgen Kanz

I hope your statement is incorrect, because I have made an investment in state-of-the-art software that is supposed to let me do everything up to deep learning.

We are still talking here about the basics of data analysis, not rocket science. The CSV file comes from an R course I took many years ago. In the meantime I have done the same analysis with Matlab and Maple without any problem, and I thought I could redo this training unit with Mathematica. Okay, Matlab and Maple do not offer direct parallel commands, at least as far as I currently know.

I have to admit that I assumed I was buying a stable and robust software package, but it seems to me that this assumption is wrong. On the other hand, I am wondering why the Wolfram team does not raise its hand to convince all readers that your (and my) statement is incorrect.

POSTED BY: Jürgen Kanz
Posted 5 years ago

Personally, I am afraid that you will be disappointed by Wolfram support, which typically responds with total silence when you hit a real issue.

Maybe your problem will be solved in the next release, maybe not. There are some bugs that have remained unsolved for many years.

POSTED BY: Updating Name

Just for your info, Matlab and Maple do offer direct parallelization, of course.

Apparently Dataset is not designed for small datasets (a 15 MB CSV file???!!!) either.

Once again, welcome to Wolfram's world of big data processing!

Thanks, Sander,

I do not have a problem with importing and working with the data in a different way than using Datasets. But as a novice in Mathematica programming I would be happy to find somewhere an official statement that everything related to Datasets is still under development and should only be applied to files up to ... kByte. From my point of view it is not good and not fair to avoid all communication about known bugs. Each and every piece of software has bugs, but it is a matter of customer support, and finally customer satisfaction, either to let users fall into a trap or to make clear what the product boundaries are. But for the Wolfram team it seems to have been impossible until now to admit that there is an issue. In Mathematica we would describe this in the following way:

Association[Key -> "Communication"]

The result of the entire discussion for me is that I will simply forget the Wolfram Dataset philosophy, and I hope the next (to me unknown) bug is not already waiting for me around the corner.

POSTED BY: Jürgen Kanz

The higher the level of programming you use, the more memory and overhead you need. Dataset is the newest data-storage mechanism and is very high level, so expect very convenient querying but a very memory-hungry representation… I would not call it a bug, just the nature of the beast.
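A quick way to get a feel for that overhead (a sketch with made-up numbers):

vals = RandomReal[1, 10^5];
ByteCount[vals]                              (* packed list of machine reals *)
ByteCount[Dataset[<|"x" -> #|> & /@ vals]]   (* the same numbers wrapped in a Dataset of associations *)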

Good luck!

POSTED BY: Sander Huisman

Not a bug! Are you kidding? This kind of "new" functionality is completely unusable.
