GROUPS: Import and Export, Wolfram Language

Why are exported files so different in size?

Gregory Lypny

Posted 4 years ago

I created a large matrix (721,314 x 12) in Mathematica and exported it as CSV on my Mac. The file is 47.9 MB. I then opened that file in Apple's Numbers spreadsheet app and exported the spreadsheet as CSV, and the resulting file is 33.2 MB. Why the difference of more than 14 MB? I imported both files into Mathematica and confirmed that the data are identical. Any thoughts? Greg

POSTED BY: Gregory Lypny

12 Replies

Gregory Lypny

Posted 4 years ago

I suppose, although I've always reserved ZIP for delivery or archiving. I've never really considered repeatedly compressing and decompressing thousands of big files that are part of an ongoing workflow or project. Greg

POSTED BY: Gregory Lypny

Daniel Carvalho, WOLFRAM

Posted 4 years ago

In fact, if you ZIP the CSV file it will take up something like 80% less space, and the size difference between CSV styles becomes irrelevant: https://reference.wolfram.com/language/ref/format/ZIP.html
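To make the compression point concrete, here is a minimal Python sketch (not Wolfram Language, and not the thread's actual data). It builds a small synthetic table, writes it once with every string quoted and once with minimal quoting, and compresses both with zlib, which uses the same DEFLATE method that ZIP archives typically use. The data, the `make_csv` helper, and the row count are all illustrative assumptions; the exact savings will differ for real files.

```python
import csv
import io
import zlib


def make_csv(rows, quoting):
    """Serialize rows to CSV bytes using the given csv quoting style."""
    buf = io.StringIO(newline="")
    csv.writer(buf, quoting=quoting).writerows(rows)
    return buf.getvalue().encode("utf-8")


# Synthetic stand-in data (hypothetical): 6 text and 6 numeric columns per row.
rows = [
    [f"id{(i * 7919 + j) % 1000}" for j in range(6)]
    + [((i * 104729 + j * 31) % 100000) / 100 for j in range(6)]
    for i in range(5000)
]

quoted = make_csv(rows, csv.QUOTE_NONNUMERIC)   # every string wrapped in quotes
unquoted = make_csv(rows, csv.QUOTE_MINIMAL)    # quotes only where required

quoted_z = zlib.compress(quoted, 9)
unquoted_z = zlib.compress(unquoted, 9)

raw_gap = len(quoted) - len(unquoted)
zip_gap = abs(len(quoted_z) - len(unquoted_z))

print("raw sizes:       ", len(quoted), len(unquoted), "gap", raw_gap)
print("compressed sizes:", len(quoted_z), len(unquoted_z), "gap", zip_gap)
```

The repeated quote-and-comma pattern is highly predictable, so the compressor encodes it cheaply and the gap between the two styles shrinks once both files are compressed.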

POSTED BY: Daniel Carvalho

Sean Cheren, Wolfram

Posted 4 years ago

POSTED BY: Sean Cheren

Gregory Lypny

Posted 4 years ago

Thanks for the tip, Sean. Looking back, I should have included "CSV" in the subject header of my post, but at the time I thought the issue might be something more than the export format.

POSTED BY: Gregory Lypny

Gregory Lypny

Posted 4 years ago

Good points, Hans. I agree that wrapping all fields in quotes is the safest bet for CSV, despite the increase in file size. My students who set their computer's OS to a language other than English often encounter problems opening data files in Excel because their language uses the comma to represent the decimal point in real numbers.
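As an illustration of one way a decimal comma can clash with a comma delimiter, here is a small Python sketch using the standard `csv` module. The two sample lines are made up; they hold the same pair of values (3,14 and 2,71 in decimal-comma notation), once unquoted and once quoted.

```python
import csv

# Two values written with a decimal comma (3,14 means 3.14 in such locales).
unquoted_line = "3,14,2,71"
quoted_line = '"3,14","2,71"'


def parse(line):
    """Parse a single CSV line into its list of fields."""
    return next(csv.reader([line]))


unquoted_fields = parse(unquoted_line)
quoted_fields = parse(quoted_line)

print(unquoted_fields)  # the two numbers are split into four fragments
print(quoted_fields)    # the two numbers survive as two fields
```

Quoting does not convert the values to numbers; it only keeps each field intact so that the reader can decide how to interpret it afterwards.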

POSTED BY: Gregory Lypny

Hans Michel, Michel Information Services

Posted 4 years ago

Greg: There is no need to try CharacterEncoding->"Unicode". It would only show that exporting the same data with that character encoding produces a larger file (double-byte). In your case it was quoted versus unquoted fields that made the difference.

I would like the CSV issues on all systems to be put to bed, but CSV is not yet a true standard (?). [Not looking for a debate.] Many systems implement it poorly, not just on export but also on import. I believe WRI is doing the export correctly by putting quotes around fields. If one reads the RFC for CSV and the many other notes and opinions on CSV, it is hard for me to see how one would not conclude that the safest way to protect fields from confusion is to wrap them in quotes and worry about types later. Also note that the quote character can be substituted and escape characters can be used. Many EDI systems (formats) declare the record, field, and escape separators and delimiters in a header/init or first line of the file being transformed or consumed. So CSV is no different, just badly implemented in many systems.

WL provides ways to create your own import/export converter, so I think that could be the best approach if you want to export your CSV without quotes. I am not certain of all the pitfalls (if any). When time permits I may come back with code examples.
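The trade-off between the quoting styles can be sketched in Python with the standard `csv` module (this is not the Wolfram Language converter Hans mentions, and the sample rows are invented). `QUOTE_NONNUMERIC` quotes every string and leaves numbers bare, `QUOTE_MINIMAL` quotes only fields that need it, and `quotechar` shows that the quote character itself can be substituted.

```python
import csv
import io


def to_csv(rows, **fmt):
    """Write rows to a CSV string; fmt is passed through to csv.writer."""
    buf = io.StringIO(newline="")
    csv.writer(buf, **fmt).writerows(rows)
    return buf.getvalue()


def from_csv(text, **fmt):
    """Read a CSV string back into a list of rows (all fields as strings)."""
    return list(csv.reader(io.StringIO(text, newline=""), **fmt))


# Invented sample data: one field contains a comma and must always be quoted.
rows = [
    ["ticker", "name", "price"],
    ["ABC", "Acme, Inc.", 12.5],
    ["XYZ", "Xylo", 7],
]

minimal = to_csv(rows)                                   # quotes only where needed
quoted = to_csv(rows, quoting=csv.QUOTE_NONNUMERIC)      # quotes every string
alt = to_csv(rows, quoting=csv.QUOTE_NONNUMERIC, quotechar="'")

print(minimal)
print(quoted)
print(alt)
print("extra characters from quoting:", len(quoted) - len(minimal))
```

All three variants read back to the same fields, provided the reader is told which quote character was used. The always-quoted form is larger by two characters for each string that did not strictly need quotes.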

POSTED BY: Hans Michel

Hans Michel, Michel Information Services

Posted 4 years ago

POSTED BY: Hans Michel

Gregory Lypny

Posted 4 years ago

Hans to the rescue! Hi Hans. You're right. It is the character encoding. The Numbers app uses Unicode (UTF-8) by default, and strings are not wrapped in quotation marks unless they contain commas. In Mathematica's exported CSV output, all strings and null elements are quoted. The difference in file size is accounted for by the presence of quotation marks in Mathematica's output. I confirmed this by opening Mathematica's CSV in TextEdit and deleting all of the quotation marks using find-and-replace. Incidentally, the help page for Export mentions "options" as an argument to the function, but none are discussed or listed in the Details section of the help page, and "Unicode" is not among the encodings listed under $CharacterEncodings. But I will add CharacterEncoding->"Unicode" to Export and see what happens. Best regards, Greg
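A back-of-the-envelope check of the quotation-mark explanation can be sketched in Python from the figures quoted in this thread. It assumes 1 MB = 10**6 bytes and that quotes are the only difference between the files; line endings or number formatting could also contribute, so the result is an estimate. The `sample` bytes are an invented stand-in for the find-and-replace test.

```python
# Figures from the thread (MB taken as 10**6 bytes, which is an assumption).
mathematica_bytes = 47_900_000
numbers_bytes = 33_200_000
n_rows, n_cols = 721_314, 12

total_fields = n_rows * n_cols
extra_bytes = mathematica_bytes - numbers_bytes

# Each quoted field costs two extra bytes: the opening and closing quote.
implied_quoted_fields = extra_bytes // 2
share_quoted = implied_quoted_fields / total_fields

print("total fields:         ", total_fields)
print("implied quoted fields:", implied_quoted_fields)
print("share of fields quoted: {:.0%}".format(share_quoted))

# Miniature version of the find-and-replace check on invented data:
# removing every quote shrinks the file by exactly the number of quotes.
sample = b'"a","b",1.5\n"","c",2\n'
stripped = sample.replace(b'"', b"")
quote_bytes = sample.count(b'"')
print("bytes saved by stripping quotes:", len(sample) - len(stripped))
```

Under these assumptions the size gap corresponds to roughly 85% of the fields being quoted. Stripping every quote is only safe when no field contains a comma or an embedded quote.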

POSTED BY: Gregory Lypny

Rohit Namjoshi

Posted 4 years ago

POSTED BY: Rohit Namjoshi

Rohit Namjoshi

Posted 4 years ago

While they might be identical when imported into Mathematica, are they identical on the filesystem? Clearly not, given the difference in size. Examine the first few lines of each file to see what is different. In Terminal:

    head file1.csv
    head file2.csv
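For anyone who prefers a script to eyeballing `head` output, here is a minimal Python sketch that compares two files line by line as raw bytes and reports the first few lines that differ. The function name `first_differences` and the two tiny temporary files are hypothetical stand-ins for file1.csv and file2.csv.

```python
import os
import tempfile


def first_differences(path_a, path_b, limit=5):
    """Return up to `limit` tuples (line_number, raw_a, raw_b) where the files differ.

    Lines are compared as bytes, so quoting and line endings stay visible.
    Comparison stops at the end of the shorter file.
    """
    diffs = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        for number, (line_a, line_b) in enumerate(zip(fa, fb), start=1):
            if line_a != line_b:
                diffs.append((number, line_a, line_b))
                if len(diffs) == limit:
                    break
    return diffs


# Demo on two tiny temporary files holding the same data with different quoting.
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "file1.csv")
    path_b = os.path.join(tmp, "file2.csv")
    with open(path_a, "wb") as f:
        f.write(b'"x",1\r\n"y",2\r\n')
    with open(path_b, "wb") as f:
        f.write(b"x,1\r\ny,2\r\n")
    demo = first_differences(path_a, path_b)
    size_gap = os.path.getsize(path_a) - os.path.getsize(path_b)

for number, line_a, line_b in demo:
    print(number, line_a, line_b)
print("size difference in bytes:", size_gap)
```

Printing the byte strings makes quotation marks and `\r\n` versus `\n` line endings explicit, which an import into Mathematica would hide.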

POSTED BY: Rohit Namjoshi

Gregory Lypny

Posted 4 years ago

Hi Rohit, Thanks for the suggestion. I will compare the files in terminal, but see my response to Hans. Regards, Greg

POSTED BY: Gregory Lypny

Hans Michel, Michel Information Services

Posted 4 years ago

POSTED BY: Hans Michel
