Message Boards Message Boards

16
|
15927 Views
|
9 Replies
|
29 Total Likes
View groups...
Share
Share this post:

Fast CSV reader needed

Posted 7 years ago

Has anyone created a fast and reliable CSV reader for Mathematica? I have been meaning to do this for a while, but did not have the time yet. I thought I would post this as a question in case someone else has already done it (or someone is willing to do it in the near future). It would need to be a LibraryLink-based implementation for good performance.

Desired features:

  • Reliable, no surprises. I have a strong dislike of the current Import implementation because surprises like this have messed up my results in the past.

  • The ability to explicitly specify data types for each column. SemanticImport has this, but it is very slow. Import is much faster, but it detects the type of each element separately, and does not return type-consistent columns. I don't want mixed data in my columns and I don't want to import 12 as 12.0 when I know that column only has strings.

  • Fast. I was watching this Python visualization tutorial that worked with the NYC taxi dataset, and thought that unfortunately Mathematica just can't do this ... the data is too big for it.

  • Memory efficient. Each CSV column should be imported as a separate array. When the data type allows, it should be a packed array. Dataset was not memory efficient last time I checked. It stored each row as a mixed-type association.

  • Handle header lines (each column has a name)

  • Nice to have, but not required: Handle quoting/strings. In automatic data type detection mode, it should know that "123" is a string, not a number.

To sum up, I want a CSV reader which trades off some of the flexibility of Import and SemanticImport for performance, memory efficiency and reliability/predictability.

R/Pandas/Julia and now even MATLAB all have this. Why doesn't Mathematica?

Is there anyone else here who is missing such functionality? More importantly, is there anyone else here who missed it so much that they already implemented it? Perhaps Boost Spirit would be a good starting point.

Or am I mistaken and is it already possible to handle very large CSV or TSV files in a type-consistent, performant and reliable way?

POSTED BY: Szabolcs Horvát
9 Replies

Hi Szabolcs:

Previous reply did not have any examples mostly because you are asking for a fast LibraryLink solution. What comes to mind is a a Mathematica equivalent to csvkit (python project). Expanding on the HSQL JDBC suggestions here is some code.

The following csv file is not large but it is what I could find from quick google search. http://ourairports.com/data/airports.csv

I downloaded the file and save locally.

csvfile = "C:\\stuff\\airports.csv";
namedDir = 
  ToUpperCase[
   StringReplace[Last[FileNameSplit[csvfile]], ".csv" -> ""]];
dir = CreateDirectory[namedDir];
fileDir = CreateDirectory[FileNameJoin[{dir, "file"}]];
CopyFile[csvfile, 
  FileNameJoin[{fileDir, Last[FileNameSplit[csvfile]]}]];
dbfile = Export[
   FileNameJoin[{dir, 
     StringJoin[
      StringReplace[Last[FileNameSplit[csvfile]], ".csv" -> ""], 
      ".hsqldb"]}], ExportString["", "Text"], "Text"];
Needs["DatabaseLink`"]
dbconn = OpenSQLConnection[JDBC["HSQL(Standalone)", dbfile]]
SQLExecute[dbconn, "CREATE TEXT TABLE PUBLIC.AIRPORTS(
                ID VARCHAR(10),
                IDENT VARCHAR(8),
                TYPE VARCHAR(14),
                NAME VARCHAR(256),
                LATITUDE_DEG VARCHAR(16),
                LONGITUDE_DEG VARCHAR(16),
                ELEVATION_FT VARCHAR(8),
                CONTINENT VARCHAR(4),
                ISO_COUNTRY VARCHAR(4),
                ISO_REGION VARCHAR(4),
                MUNICIPALITY VARCHAR(8),
                SCHEDULED_SERVICE VARCHAR(4),
                GPS_CODE VARCHAR(8),
                IATA_CODE VARCHAR(4),
                LOCAL_CODE VARCHAR(10),
                HOME_LINK VARCHAR(256),
                WIKIPEDIA_LINK VARCHAR(256),
                KEYWORDS VARCHAR(256)
 )"]
SQLExecute[dbconn, "SET TABLE PUBLIC.AIRPORTS SOURCE \
\"./file/airports.csv;ignore_first=true;all_quoted=true;encoding=UTF-\
8\";"]

The schema for sql table AIRPORTS was deduced through inspection. But I would believe that a Mathematica csv Importer would need to do some auto-inspection to deduce the datatypes. Unfortunately the csv file chosen did not have any datetime values. If successful then do:

SQLExecute[dbconn, "SELECT * FROM PUBLIC.AIRPORTS"]

When done

CloseSQLConnection[dbconn]
POSTED BY: Hans Michel

Have you considered ODBC OR JDBC

Desired features:

?Reliable, no surprises. Weakness due to multiple available drivers and need non-OS specific solution. There are open source Driver and commercial. Commercial ones may not be a solution for distributing a Mathematica package. For example https://docs.microsoft.com/en-us/sql/odbc/microsoft/schema-ini-file-text-file-driver

?The ability to explicitly specify data types for each column. Many drivers allow for a seperate schema definition file to be created when reading that particular csv file.

?Fast. Meh! Needs further testing. Fast loading or fast seaching. Some drivers support creating an index of the file. Depending on system RAM capacity one can load csv file to memory.

?Memory efficient. Can't test all ODBC or JDBC drivers out there. Loading csv file to memory could be a drain on resources.

?Handle header lines (each column has a name) Yes. Through schema definition or other properties of connection string etc.

?Nice to have, but not required: A good driver would take care of this quoted strings. Plus using SQL or Mathematica's wrapper for SQL can give access to column based select statement.

On a related solution path. Why not use HSQL.

Issues faced using JDBC is to remember to increase heap size. https://mathematica.stackexchange.com/questions/28019/giving-jlink-huge-memory-by-default

HSQL supports the csv file format as a database store. Using jdbc or odbc depenting on driver could allow for leaving the data store as csv and do some SQL commands such as INSERT, UPDATE, DELETE not just SELECT.

POSTED BY: Hans Michel

ReadList seems many times faster than Import. Is there a reason you don't mention it as an alternative?

POSTED BY: George Wolfe

How would you use ReadList to read a CSV that has both numerical and string data?

Example:

data = "1.2,foo,4.5
6,bar,0.5";

str = StringToStream[data]
ReadList[str, ???]
Close[str]

What if some of the strings include commas?

data = "1.2,\"foo\",4.5
4,\"bar,baz\",0.5";

The desired result is

ImportString[data, "CSV"]
(* {{1.2, "foo", 4.5}, {4, "bar,baz", 0.5}} *)

EDIT:

To explain further, the function I am looking for would be able to do exactly what SemanticImport does in the following example, but it would perform much better on large data:

data = "x,name,y
1.2,foo,4.5
6,\"bar,baz\",0.5
9,666,0";

SemanticImportString[data, {"Number", "String", "Number"}, "NamedColumns", HeaderLines -> 1]

(* <|"x" -> {1.2`, 6, 9}, "name" -> {"foo", "bar,baz", "666"}, "y" -> {4.5`, 0.5`, 0}|> *)
POSTED BY: Szabolcs Horvát
Posted 7 years ago

Hi Szabolcs,

Good news! In the near future, Wolfram Research is working on an updated CSV Import/Export which will fix a number of bugs with escaped characters, as well as provide speed/memory improvements via a LibraryLink paclet. We will consider the suggestions regarding column-wise data types for later releases.

Also, Have you seen the option "HeaderLines"? This will skip over header rows, which have things like a column name like you mentioned.

Thanks for the suggestions!

-S

POSTED BY: Sean Cheren

Hi Sean,

This is great news. Personally, I would already be quite happy if I could have the consistency of SemanticImport (column-wise data types, no surprises like when importing 24c) together with the current performance of Import[..., "CSV"]. If the speed of Import will be improved in the future, that is excellent news.

Of course, I understand that there is always a tradeoff between flexibility and performance. SemanticImport (and Interpreter) offer great flexibility at the cost of performance. Handling ragged arrays is always going to be slower than rectangular ones. (And yes, maybe HDF is better for rectangular numerical ones.) But the reality of the situation is that there are many large, rectangular and type-consistent datasets in CSV format out there. This very common type of data should be handled very well. Maybe it should have its own function, which is not flexible and will error out on ragged data. Then we can use more flexible functions for smaller and messier data. Such a set of functions is not as beautiful as "one Import for everything", but it is very practical.

POSTED BY: Szabolcs Horvát
Posted 7 years ago

Thank you for the good news!

I think it is appropriate to mention here this long-standing bug of "TextDelimiters":

POSTED BY: Alexey Popkov

I'm with you Szabolcs!

I definitely prefer the "reliability" version. I have also been previously bitten by the "smart" interpretation of my data.

I would definitely prefer to customize the interpretation I want, eventually having as an option the full semantic interpretation..., but I guess that this is not what I would be choosing for most of my work.

Cheers,

POSTED BY: Pedro Fonseca

It could indeed be improved, mostly speed is the problem for me. I think because Mathematica allows for ragged arrays which makes it slow in the general case, however I can't confirm this. It would be nice to indeed have more control of the CSV import, try less things (like your currency 'bug'). I never compared with e.g. Python/Matlab what happens is there is a ragged csv file, python can probably handle it fine, Matlab might inject NaN or "" or ... My modus operandi is currently to convert any csv/tsv file (sometimes in pieces) to a HDF5 file(s) if the columns are of the same type (and the data rectangular). Much smaller file size and very fast to read.

CSV is kinda poor-man's file type for large files, for small files (let's say few MB) the import is ok I think...

POSTED BY: Sander Huisman
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract