Group Abstract Group Abstract

Message Boards Message Boards

0
|
10.6K Views
|
12 Replies
|
3 Total Likes
View groups...
Share
Share this post:

Why are exported files so different in size?

Posted 4 years ago

I created a large matrix (721,314 x 12) in Mathematica and exported it as CSV on my Mac. The file is 47.9MB. I opened the file in Apple's Numbers spreadsheet app and exported the spreadsheet as CSV and the size of the exported file is 33.2MB. Why the 14MB difference in size? I imported both into Mathematica and confirmed that they are identical.

Any thoughts?

Greg

POSTED BY: Gregory Lypny
12 Replies
Posted 4 years ago

I suppose, although I've always reserved ZIP for delivery or archiving. I've never really considered repeatedly compressing and decompressing thousands of big files that are part of an ongoing workflow or project.

Greg

POSTED BY: Gregory Lypny
Posted 4 years ago

Thanks for the tip, Sean. Looking back, I should have included "CSV" in the subject header of my post, but at the time I thought the issue my be something more than export format.

POSTED BY: Gregory Lypny

In fact if you ZIP the csv file, it gonna get stuff like 80% < space, and the size difference from CSV styles will be irrelevant

https://reference.wolfram.com/language/ref/format/ZIP.html

POSTED BY: Daniel Carvalho
Posted 4 years ago

I agree this isn't related to encoding, it's best to stick to UTF-8. There is a hidden option to CSV and TSV Export "ForceDelimieters" that controls whether the setting for "TextDelimiters" is wrapped around text fields always, or only when needed. Since this isn't documented yet, this might change in future versions. The reason we always wrap non-numeric values in the delimiter is export would be much slower if it has to test each string for the presence of a comma, newline, or delimiter. Compare these outputs:

In[2382]:= ExportString[{{1, "B", "test,comma"}}, "CSV"]

Out[2382]= "1,\"B\",\"test,comma\"
"

In[2383]:= ExportString[{{1, "B", "test,comma"}}, "CSV", 
 "ForceDelimiters" -> False]

Out[2383]= "1,B,\"test,comma\"
"

In[2385]:= ExportString[{{1, "B", "test,comma"}}, "CSV", 
 "ForceDelimiters" -> False, "TextDelimiters" -> "'"]

Out[2385]= "1,B,'test,comma'
"

If your data does not contain any commas or newlines, then the best approach is to just disable text delimiters all together so that the string matching doesn't kick in (notice this mangles the field with a comma):

In[2387]:= ExportString[{{1, "B", "test,comma"}}, "CSV", 
 "TextDelimiters" -> None]

Out[2387]= "1,B,test,comma
"

P.S. I stumbled on this post by accident, and probably would have seen it sooner if CSV was in the title :)

POSTED BY: Sean Cheren
Posted 4 years ago

Good points, Hans. I agree that wrapping all fields in quotes is the safest bet for CSV, despite the increase in file size. My students who set their computer's OS to a language other than English often encounter problems opening data files in Excel because their language uses the comma to represent the decimal point in real numbers.

POSTED BY: Gregory Lypny
POSTED BY: Hans Michel
Posted 4 years ago

Import and Export options for CSV are documented here. UTF-8 encoding is the default.

POSTED BY: Rohit Namjoshi
Posted 4 years ago

Hi Rohit,

Thanks for the suggestion. I will compare the files in terminal, but see my response to Hans.

Regards,

Greg

POSTED BY: Gregory Lypny
Posted 4 years ago
POSTED BY: Gregory Lypny

Greg:

I tried a few things:

outs = Table[{RandomReal[{-10,10}],RandomReal[{-10,10}],RandomReal[{-10,10}],RandomReal[{-10,10}],RandomReal[{-10,10}],RandomReal[{-10,10}],RandomReal[{-10,10}],RandomReal[{-10,10}],RandomReal[{-10,10}],RandomReal[{-10,10}],RandomReal[{-10,10}],RandomReal[{-10,10}]},{721314}];
Dimensions[outs]
{721314, 12}
Export["couts.csv", outs]

The file size without any options on my system is 155 MB (163,184,848 bytes) for this default Export. This is Windows 10 M 12.1. The file size is the same even with some additional options which M does not bark at:

CharacterEncoding -> "PrintableASCII", RecordSeparators -> "\r\n"

CharacterEncoding -> "ASCII", RecordSeparators -> "\r\n"

CharacterEncoding -> "ASCII", RecordSeparators -> "\n"

CharacterEncoding -> "UTF8"

CharacterEncoding -> "UTF-8"

"Export encodes line separator characters using the convention of the computer system on which the Wolfram Language is being run." From the Help files

The file size changed to 309 MB (324,927,070 bytes) when CharacterEncoding -> "Unicode" was used.

It could be the RecordSeparators or even quoted vs unquoted CSV format that may account for size dfference.

POSTED BY: Hans Michel
Posted 4 years ago

While they might be identical when imported into Mathematica are they identical on the filesystem? Clearly not because of the difference in size. Examine the first few lines of each file to see what is different. In Terminal

head file1.csv

head file2.csv
POSTED BY: Rohit Namjoshi

Hello Greg:

Please provide some sample code.

Using the Export and Import functions without options is asking for WRI magic to happen. In most cases this works very well if not even powerfull and keeps user from worry on/about details.

But without seeing any code on the Export and knowing what the default export options are on Apple Numbers, I would have to guess, ASCII, UTF-8, Unicode (character encoding) can account for the difference.

https://medium.com/@apiltamang/unicode-utf-8-and-ascii-encodings-made-easy-5bfbe3a1c45a

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

POSTED BY: Hans Michel
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard