WOLFRAM COMMUNITY

GROUPS: Data Science, Software Development, Mathematica, Import and Export

Fast CSV reader needed

Szabolcs Horvát

Posted 8 years ago

Has anyone created a fast and reliable CSV reader for Mathematica? I have been meaning to do this for a while, but did not have the time yet. I thought I would post this as a question in case someone else has already done it (or someone is willing to do it in the near future). It would need to be a LibraryLink-based implementation for good performance.

Desired features:

- Reliable, no surprises. I have a strong dislike of the current `Import` implementation because surprises like this have messed up my results in the past.
- The ability to explicitly specify data types for each column. `SemanticImport` has this, but it is very slow. `Import` is much faster, but it detects the type of each element separately, and does not return type-consistent columns. I don't want mixed data in my columns and I don't want to import `12` as `12.0` when I know that column only has strings.
- Fast. I was watching this Python visualization tutorial that worked with the NYC taxi dataset, and thought that unfortunately Mathematica just can't do this ... the data is too big for it.
- Memory efficient. Each CSV column should be imported as a separate array. When the data type allows, it should be a packed array. `Dataset` was not memory efficient last time I checked. It stored each row as a mixed-type association.
- Handle header lines (each column has a name).

Nice to have, but not required:

- Handle quoting/strings. In automatic data type detection mode, it should know that `"123"` is a string, not a number.

To sum up, I want a CSV reader which trades off some of the flexibility of `Import` and `SemanticImport` for performance, memory efficiency and reliability/predictability. R/Pandas/Julia and now even MATLAB all have this. Why doesn't Mathematica?

Is there anyone else here who is missing such functionality? More importantly, is there anyone else here who missed it so much that they already implemented it? Perhaps Boost Spirit would be a good starting point.

Or am I mistaken and is it already possible to handle very large CSV or TSV files in a type-consistent, performant and reliable way?

POSTED BY: Szabolcs Horvát

9 Replies

Hans Michel, Michel Information Services

Posted 8 years ago

POSTED BY: Hans Michel

Hans Michel, Michel Information Services

Posted 8 years ago

Have you considered ODBC or JDBC?

Desired features:

- Reliable, no surprises. This is a weakness: there are multiple drivers available, and a solution that is not OS-specific is needed. There are open-source drivers and commercial ones. Commercial ones may not be a solution for distributing a Mathematica package. For example: https://docs.microsoft.com/en-us/sql/odbc/microsoft/schema-ini-file-text-file-driver
- The ability to explicitly specify data types for each column. Many drivers allow a separate schema definition file to be created for reading a particular CSV file.
- Fast. Meh! Needs further testing. Fast loading or fast searching? Some drivers support creating an index of the file. Depending on system RAM capacity, one can load the CSV file into memory.
- Memory efficient. I can't test all the ODBC or JDBC drivers out there. Loading the CSV file into memory could be a drain on resources.
- Handle header lines (each column has a name). Yes, through the schema definition or other properties of the connection string, etc.
- Nice to have, but not required: a good driver would take care of quoted strings. Plus, using SQL or Mathematica's wrapper for SQL can give access to column-based SELECT statements.

On a related solution path: why not use HSQL? One issue when using JDBC is remembering to increase the heap size: https://mathematica.stackexchange.com/questions/28019/giving-jlink-huge-memory-by-default

HSQL supports the CSV file format as a database store. Using JDBC or ODBC, depending on the driver, could allow leaving the data store as CSV while running SQL commands such as INSERT, UPDATE and DELETE, not just SELECT.
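As an illustration of the separate schema definition file mentioned above, here is a minimal sketch of a `schema.ini` for the Microsoft text file driver linked in this post. The file name `data.csv` and the column names are hypothetical, and other drivers use different mechanisms, so treat this as an example of the idea rather than a recipe.

```ini
; schema.ini, placed in the same folder as the CSV file it describes
; "data.csv" and the column names x, name, y are hypothetical
[data.csv]
Format=CSVDelimited
ColNameHeader=True
Col1=x Double
Col2=name Text Width 50
Col3=y Double
```

With a file like this, the driver is told the type of each column up front instead of guessing it from the data, which is the per-column type control asked for in the original post.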

POSTED BY: Hans Michel

George Wolfe, Syntax Indices & Data

Posted 8 years ago

ReadList seems many times faster than Import. Is there a reason you don't mention it as an alternative?

POSTED BY: George Wolfe

Szabolcs Horvát

Posted 8 years ago

How would you use ReadList to read a CSV that has both numerical and string data?

Example:

data = "1.2,foo,4.5
6,bar,0.5";

str = StringToStream[data]
ReadList[str, ???]
Close[str]
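For this simple case (no quoted fields, no empty fields), one possible way to fill in the `???` is to read `Word` objects with a comma as the word separator. This is only a sketch: every field comes back as a string, so the numeric columns would still need to be converted afterwards.

```mathematica
data = "1.2,foo,4.5
6,bar,0.5";

str = StringToStream[data];
(* Each line is a record; each record is split into comma-separated words. *)
(* All fields are returned as strings, including the numeric ones. *)
rows = ReadList[str, Word, WordSeparators -> {","}, RecordLists -> True];
Close[str];

rows
(* {{"1.2", "foo", "4.5"}, {"6", "bar", "0.5"}} *)
```

This approach knows nothing about quoting, so a quoted field containing a comma is split in two.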

What if some of the strings include commas?

data = "1.2,\"foo\",4.5
4,\"bar,baz\",0.5";

The desired result is

ImportString[data, "CSV"]
(* {{1.2, "foo", 4.5}, {4, "bar,baz", 0.5}} *)

EDIT:

To explain further, the function I am looking for would be able to do exactly what SemanticImport does in the following example, but it would perform much better on large data:

data = "x,name,y
1.2,foo,4.5
6,\"bar,baz\",0.5
9,666,0";

SemanticImportString[data, {"Number", "String", "Number"}, "NamedColumns", HeaderLines -> 1]

(* <|"x" -> {1.2`, 6, 9}, "name" -> {"foo", "bar,baz", "666"}, "y" -> {4.5`, 0.5`, 0}|> *)
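To make the requested interface concrete, here is a minimal top-level sketch of a column-typed reader. The names `toReals` and `typedColumns` are hypothetical helpers, not built-in functions. The sketch assumes a rectangular file with a one-line header, no quoted fields and no empty fields, so the example data uses `bar` instead of `"bar,baz"`. It is not the LibraryLink implementation asked for above and makes no performance claims.

```mathematica
(* Hypothetical helper: parse a list of numeric strings with the stream reader *)
(* and return a packed array of reals. Assumes every string is a valid number. *)
toReals[strs_List] := Module[{stream, values},
  stream = StringToStream[StringRiffle[strs, "\n"]];
  values = ReadList[stream, Real];
  Close[stream];
  Developer`ToPackedArray[values]
]

(* Hypothetical helper: split on newlines and commas (no quoting support), *)
(* then convert each column according to the requested type. *)
typedColumns[csv_String, types_List] := Module[{lines, names, cols},
  lines = StringSplit[csv, "\n"];
  names = StringSplit[First[lines], ","];
  cols = Transpose[StringSplit[Rest[lines], ","]];
  AssociationThread[names,
   MapThread[If[#2 === "Number", toReals[#1], #1] &, {cols, types}]]
]

data = "x,name,y
1.2,foo,4.5
6,bar,0.5
9,666,0";

typedColumns[data, {"Number", "String", "Number"}]
(* <|"x" -> {1.2, 6., 9.}, "name" -> {"foo", "bar", "666"}, "y" -> {4.5, 0.5, 0.}|> *)
```

Note that `666` in the `name` column stays a string because the column type says so, and that each numeric column is a separate packed array.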

POSTED BY: Szabolcs Horvát

Sean Cheren, Wolfram

Posted 8 years ago

Hi Szabolcs,

Good news! Wolfram Research is working on an updated CSV Import/Export, due in the near future, which will fix a number of bugs with escaped characters and provide speed/memory improvements via a LibraryLink paclet. We will consider the suggestions regarding column-wise data types for later releases.

Also, have you seen the option "HeaderLines"? It skips over header rows, which contain things like the column names you mentioned.

Thanks for the suggestions!

-S
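A minimal sketch of the "HeaderLines" option mentioned above, using a small made-up string instead of a file. The exact shape of the result may differ between versions, so no output is shown here.

```mathematica
(* Made-up example data: one header line followed by two data rows *)
data = "x,name,y
1.2,foo,4.5
6,bar,0.5";

(* Declare the first line to be a header rather than data *)
ImportString[data, "CSV", "HeaderLines" -> 1]
```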

POSTED BY: Sean Cheren

Szabolcs Horvát

Posted 8 years ago

POSTED BY: Szabolcs Horvát

Alexey Popkov

Posted 8 years ago

Thank you for the good news! I think it is appropriate to mention here this long-standing bug of `"TextDelimiters"`: https://mathematica.stackexchange.com/a/140789/280

POSTED BY: Alexey Popkov

Pedro Fonseca, SUEZ Treatment Solutions

Posted 8 years ago

POSTED BY: Pedro Fonseca

Sander Huisman, University of Twente

Posted 8 years ago

It could indeed be improved; speed is the main problem for me. I think it is slow in the general case because Mathematica allows for ragged arrays, though I can't confirm this. It would indeed be nice to have more control over the CSV import and have it try fewer things (like your currency 'bug'). I have never compared what happens in e.g. Python/Matlab if there is a ragged CSV file: Python can probably handle it fine, Matlab might inject NaN or "" or ...

My modus operandi is currently to convert any CSV/TSV file (sometimes in pieces) to one or more HDF5 files if the columns are of the same type (and the data is rectangular). The file size is much smaller and the files are very fast to read. CSV is kind of a poor man's file type for large files; for small files (let's say a few MB) the import is OK, I think...
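The conversion workflow described above could look roughly like this. It is a sketch for rectangular, purely numeric data; the file locations and the dataset names `/x` and `/y` are made up for the illustration. The `"Datasets"` element specifications are written as I recall them from the HDF5 format documentation and may need adjusting for your version.

```mathematica
(* Hypothetical file locations, used only for this illustration *)
csvFile = FileNameJoin[{$TemporaryDirectory, "example.csv"}];
h5File = FileNameJoin[{$TemporaryDirectory, "example.h5"}];

(* Write a small rectangular, purely numeric CSV to stand in for real data *)
Export[csvFile, {{1.2, 4.5}, {6., 0.5}, {9., 0.}}, "CSV"];

(* One-off conversion: read the CSV, split it into columns, pack each column *)
rows = Import[csvFile, "CSV"];
cols = Developer`ToPackedArray[N[#]] & /@ Transpose[rows];

(* Store each column as its own HDF5 dataset; the names are arbitrary *)
Export[h5File, cols, {"Datasets", {"/x", "/y"}}];

(* Later sessions read only the columns they need *)
x = Import[h5File, {"Datasets", "/x"}]
```

The slow CSV import is paid once, and each later read pulls a single typed column from the HDF5 file.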

POSTED BY: Sander Huisman
