# Message Boards

Szabolcs Horvát 16 Votes

Has anyone created a fast and reliable CSV reader for Mathematica? I have been meaning to do this for a while, but have not had the time yet. I thought I would post this as a question in case someone else has already done it (or someone is willing to do it in the near future). It would need to be a LibraryLink-based implementation for good performance.

Desired features:

- Reliable, no surprises. I strongly dislike the current Import implementation because surprises like this have messed up my results in the past.
- The ability to explicitly specify data types for each column. SemanticImport has this, but it is very slow. Import is much faster, but it detects the type of each element separately and does not return type-consistent columns. I don't want mixed data in my columns, and I don't want to import 12 as 12.0 when I know that column contains only strings.
- Fast. I was watching a Python visualization tutorial that worked with the NYC taxi dataset, and thought that unfortunately Mathematica just can't do this: the data is too big for it.
- Memory efficient. Each CSV column should be imported as a separate array; when the data type allows, it should be a packed array. Dataset was not memory efficient last time I checked: it stored each row as a mixed-type association.
- Handle header lines (each column has a name).

Nice to have, but not required:

- Handle quoting/strings. In automatic data type detection mode, it should know that "123" is a string, not a number.

To sum up, I want a CSV reader which trades off some of the flexibility of Import and SemanticImport for performance, memory efficiency and reliability/predictability. R, pandas, Julia and now even MATLAB all have this. Why doesn't Mathematica? Is there anyone else here who is missing such functionality? More importantly, is there anyone else here who missed it so much that they have already implemented it?
Perhaps Boost Spirit would be a good starting point. Or am I mistaken, and is it already possible to handle very large CSV or TSV files in a type-consistent, performant and reliable way?
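A minimal illustration of the type-consistency problem described above (only the built-in ImportString is assumed; the behaviour of the per-element type detection is as described in the post):

```mathematica
(* Import detects the type of each element independently, so a single
   column can come back with mixed types *)
ImportString["id,value\n1,12\n2,abc", "CSV"]
(* the "value" column mixes the number 12 with the string "abc";
   there is no way to insist that the whole column be strings *)
```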
10 months ago
9 Replies
It could indeed be improved; for me the problem is mostly speed. I think it is because Mathematica allows ragged arrays, which makes it slow in the general case, though I can't confirm this. It would indeed be nice to have more control over the CSV import, and to have it try fewer things (like your currency 'bug'). I never compared with e.g. Python/MATLAB what happens if there is a ragged CSV file; Python can probably handle it fine, MATLAB might inject NaN or "" or something else. My current modus operandi is to convert any CSV/TSV file (sometimes in pieces) to HDF5 file(s) if the columns are of the same type (and the data rectangular): much smaller file size and very fast to read. CSV is a kind of poor man's format for large files; for small files (say, a few MB) the import is OK, I think...
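The conversion workflow described above can be sketched roughly as follows. The file names are hypothetical, and this assumes the default dataset name that Export uses for HDF5:

```mathematica
(* one-time conversion: pay the slow CSV import once *)
data = Import["big.csv", "CSV"];      (* rectangular, numeric data *)
Export["big.h5", data, "HDF5"];       (* stored as /Dataset1 by default *)

(* subsequent reads of the HDF5 file are fast and type-consistent *)
fast = Import["big.h5", {"Datasets", "/Dataset1"}];
```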
10 months ago
I'm with you, Szabolcs! I definitely prefer the "reliability" version. I have also been bitten previously by the "smart" interpretation of my data. I would definitely prefer to customize the interpretation myself, perhaps with full semantic interpretation available as an option..., but I guess that is not what I would choose for most of my work. Cheers,
10 months ago
Sean Cheren 6 Votes Hi Szabolcs, good news! Wolfram Research is working on an updated CSV Import/Export which will fix a number of bugs with escaped characters, as well as provide speed/memory improvements via a LibraryLink paclet. We will consider the suggestions regarding column-wise data types for later releases. Also, have you seen the option "HeaderLines"? It will skip over header rows, which contain things like the column names you mentioned. Thanks for the suggestions! -S
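For reference, the "HeaderLines" option mentioned above is used like this (the file name is hypothetical):

```mathematica
(* skip the first row (the column names) when importing *)
Import["data.csv", "CSV", "HeaderLines" -> 1]
```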
10 months ago
Szabolcs Horvát 2 Votes Hi Sean, this is great news. Personally, I would already be quite happy if I could have the consistency of SemanticImport (column-wise data types, no surprises like when importing 24c) together with the current performance of Import[..., "CSV"]. If the speed of Import will also be improved in the future, that is excellent news. Of course, I understand that there is always a tradeoff between flexibility and performance. SemanticImport (and Interpreter) offer great flexibility at the cost of performance, and handling ragged arrays is always going to be slower than handling rectangular ones. (And yes, maybe HDF5 is better for rectangular numerical data.) But the reality of the situation is that there are many large, rectangular and type-consistent datasets in CSV format out there. This very common type of data should be handled very well. Maybe it should have its own function, one which is not flexible and will error out on ragged data; then we can use the more flexible functions for smaller and messier data. Such a set of functions is not as beautiful as "one Import for everything", but it is very practical.
10 months ago
 Alexey Popkov 2 Votes Thank you for the good news! I think it is appropriate to mention here this long-standing bug of "TextDelimiters":
9 months ago
 George Wolfe 1 Vote ReadList seems many times faster than Import. Is there a reason you don't mention it as an alternative?
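For purely numeric, whitespace-separated data, ReadList can indeed be used along these lines (the file name is hypothetical):

```mathematica
(* read numbers from a file, returning one list per line *)
ReadList["numbers.dat", Number, RecordLists -> True]
```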
Szabolcs Horvát 2 Votes How would you use ReadList to read a CSV that has both numerical and string data? Example:

```mathematica
data = "1.2,foo,4.5\n6,bar,0.5";
str = StringToStream[data]
ReadList[str, ???]
Close[str]
```

What if some of the strings include commas?

```mathematica
data = "1.2,\"foo\",4.5\n4,\"bar,baz\",0.5";
```

The desired result is

```mathematica
ImportString[data, "CSV"]
(* {{1.2, "foo", 4.5}, {4, "bar,baz", 0.5}} *)
```

EDIT: To explain further, the function I am looking for would do exactly what SemanticImport does in the following example, but it would perform much better on large data:

```mathematica
data = "x,name,y\n1.2,foo,4.5\n6,\"bar,baz\",0.5\n9,666,0";
SemanticImportString[data, {"Number", "String", "Number"}, "NamedColumns", HeaderLines -> 1]
(* <|"x" -> {1.2, 6, 9}, "name" -> {"foo", "bar,baz", "666"}, "y" -> {4.5, 0.5, 0}|> *)
```
Hi Szabolcs: My previous reply did not have any examples, mostly because you are asking for a fast LibraryLink solution. What comes to mind is a Mathematica equivalent to csvkit (a Python project). Expanding on the HSQL JDBC suggestion, here is some code. The following CSV file is not large, but it is what I could find from a quick Google search: http://ourairports.com/data/airports.csv

I downloaded the file and saved it locally:

```mathematica
csvfile = "C:\\stuff\\airports.csv";
namedDir = ToUpperCase[StringReplace[Last[FileNameSplit[csvfile]], ".csv" -> ""]];
dir = CreateDirectory[namedDir];
fileDir = CreateDirectory[FileNameJoin[{dir, "file"}]];
CopyFile[csvfile, FileNameJoin[{fileDir, Last[FileNameSplit[csvfile]]}]];
dbfile = Export[
   FileNameJoin[{dir,
     StringJoin[StringReplace[Last[FileNameSplit[csvfile]], ".csv" -> ""], ".hsqldb"]}],
   ExportString["", "Text"], "Text"];

Needs["DatabaseLink`"]
dbconn = OpenSQLConnection[JDBC["HSQL(Standalone)", dbfile]]

SQLExecute[dbconn, "CREATE TEXT TABLE PUBLIC.AIRPORTS(
  ID VARCHAR(10), IDENT VARCHAR(8), TYPE VARCHAR(14), NAME VARCHAR(256),
  LATITUDE_DEG VARCHAR(16), LONGITUDE_DEG VARCHAR(16), ELEVATION_FT VARCHAR(8),
  CONTINENT VARCHAR(4), ISO_COUNTRY VARCHAR(4), ISO_REGION VARCHAR(4),
  MUNICIPALITY VARCHAR(8), SCHEDULED_SERVICE VARCHAR(4), GPS_CODE VARCHAR(8),
  IATA_CODE VARCHAR(4), LOCAL_CODE VARCHAR(10), HOME_LINK VARCHAR(256),
  WIKIPEDIA_LINK VARCHAR(256), KEYWORDS VARCHAR(256))"]

SQLExecute[dbconn,
 "SET TABLE PUBLIC.AIRPORTS SOURCE \
\"./file/airports.csv;ignore_first=true;all_quoted=true;encoding=UTF-8\";"]
```

The schema for the SQL table AIRPORTS was deduced by inspection, but I would expect a Mathematica CSV importer to do some auto-inspection to deduce the data types. Unfortunately, the CSV file chosen does not have any datetime values. If the above succeeds, then do:

```mathematica
SQLExecute[dbconn, "SELECT * FROM PUBLIC.AIRPORTS"]
```

When done:

```mathematica
CloseSQLConnection[dbconn]
```