Message Boards Message Boards

[?] Avoid XLSX import failure on first row in small file?

Posted 8 years ago

Hello,

I have an Excel workbook with one sheet containing 86370 rows and 28 columns of numeric cells. File size is ~ 27MB.

I cannot successfully import this file, even if I ask for just the first row. The Java heap always explodes, regardless of JVM heap parameters -Xmx and -Xms (or where they're specified).

For example,

Import["file.xlsx",{Data,1,1}]

fails.

It seems that WRI does not take the Excel file format seriously.

My workaround saves to and imports from binary. But since I'm often interacting with internal/external customers via Excel, I'd strongly prefer not to introduce additional data products, to make the workflow simpler, more transparent, and less error-prone.

Here's hoping that XLSX import/export gets some TLC from WRI in the near future, with full/efficient support for its maximum specs (~ 1M rows, 65K cols, unlimited sheets as of ~ Excel 2016).

Am I missing something?

Vince

POSTED BY: Vincent Virgilio
5 Replies

Thank you Mr. Mertig. I'll take this for a spin soonest.

I did dabble with NETLink/COM a few days ago. I wasn't able to get something like "var = sh@Range[...]@Value" to complete on a single column of height 86370; no doubt mostly due to my inexperience with NETLink.

Vince

POSTED BY: Vincent Virgilio

Java is a bit of pain when it comes to efficiency.

On Windows I would use NETLink (which is also painful because of lack of methods documentation).

The code below never needs more than half a GB of RAM (Kernel and FE) and runs in less than half a minute.

Needs["NETLink`"];
InstallNET[];
LoadNETType["System.GC"];

bigExport[fileName_String, data_,chunk_:5000]:=NETBlock@Module[{xl,book,sheet,range,datachunk,writeSubMatrix},
writeSubMatrix[sh_,xl_,{{ri_,ci_},{rf_,cf_}}, subMat_?MatrixQ] :=
sh@Range[xl@Cells[ri,ci],xl@Cells[rf,cf]]@Value=subMat;
xl=CreateCOMObject["Excel.Application"];
book=xl@Workbooks@Add[];
sheet=book@Worksheets@Item[1];
Do[Print["writing rows ", row," to ",Min[row+chunk-1,Length@data]];
datachunk = Take[data, {row,Min[row+chunk-1,Length@data]}];
Print["datachunk Dimensions = ", datachunk//Dimensions];
writeSubMatrix[sheet,xl,{{row,1},{row-1,0}+Dimensions[datachunk]},datachunk],{row,1,Length[data],chunk}
];
book@SaveAs[fileName];book@Close[];xl@Quit[];
fileName
];

$outputFile=FileNameJoin[{NotebookDirectory[],"demo.xlsx"}];
Quiet@DeleteFile@$outputFile;
numRows=86370; numCols=28;
data = RandomReal[1, {numRows,numCols}];
bigExport[$outputFile,data] // AbsoluteTiming
(*SystemOpen @ $outputFile*)
POSTED BY: Rolf Mertig

My machine is Windows 10 with 32 GB RAM and the same version of Mathematica. I'm not sure what is causing the problem for you.

POSTED BY: Chad Knutson

Thank you for the response.

I did try 5g (my machine: 64GB RAM, 16 3.5 GHz cores).

And I did try your code. The Export blows the heap

enter image description here

POSTED BY: Vincent Virgilio

Using my machine, I had to set -Xmx to 5 GB to make it work. I agree that is excessive. What was the max that you tried?

In[138]:= data = RandomReal[1,{86370,28}];
In[142]:= Needs["JLink`"]
In[149]:= ReinstallJava[JVMArguments->"-Xmx5g"]
In[146]:= Export["bigd.xlsx",data]
Out[146]= bigd.xlsx
In[151]:= d2=Import["bigd.xlsx"];
In[152]:= Dimensions[d2]
Out[152]= {1,86370,28}
POSTED BY: Chad Knutson
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract