Confusion when exporting TSV file within a zip file if it's called CSV

Posted 8 months ago

Hi there,

To clarify up front: I have a zip file containing a TSV file, but the TSV file's name ends in .CSV when it should probably end in .TSV.

I'm playing with GDELT data and want to bulk insert the file data into a SQL database. To do this I'm exporting a data file that comes inside a zip file supplied by GDELT. I have attached one of these files; just rename it to "20230101.export.CSV.zip".

I use this code to extract the TSV file from the zip file so I can use it in a bulk insert query:

Export["20230101.export.CSV",
  Import["20230101.export.CSV.zip", "20230101.export.CSV"], "TSV"];

In the resulting file, each row is enclosed in double quotes, and the bulk insert fails because the file is not properly formatted.

If I rename the file inside the zip to "20230101.export.TSV" and export it using similar code, it works properly and does not quote each line.

Firstly, is this a bug? I would have expected Mathematica to export this properly, since I specified "TSV" in the Export statement.

Since the files inside the supplied zip files have a slightly misleading ".CSV" extension (even though they are really TSV files), I am stuck with this naming convention. Any suggestions on how I can get a properly formatted TSV file that I can use for bulk inserts?

Regards,
Sean.

POSTED BY: Sean G
4 Replies
Posted 8 months ago

Hi Rohit,

Many thanks for the reply and solution. I also found another fix: if I extract the file from the zip using the "Byte" element, the lines are not quoted (presumably because Import then returns raw bytes instead of parsing the data as CSV based on the file name):

Import["20230101.export.CSV.zip", "20230101.export.CSV", "Byte"]

The resulting code was really fast: I was able to process about 6000 files in a few hours. The bulk inserts also worked well, with only 3 bad files (representing 3 days in about 20 years of data), and took a few hours in total. The resulting table has over 1B rows. Before I start playing with the DB, I'm going to look at other potential data sources to add. Thanks again for your suggestion.
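For anyone batch processing these archives the same way, here is a minimal sketch of the raw-byte round trip (the directory names, file layout, and output folder are assumptions, not from the original post):

```mathematica
(* Extract each GDELT archive as raw bytes and write a .TSV copy,
   avoiding the CSV parsing that caused the quoting problem. *)
files = FileNames["*.export.CSV.zip", "gdelt"];
Do[
 With[{inner = StringDrop[FileNameTake[f], -4]}, (* e.g. "20230101.export.CSV" *)
  bytes = Import[f, inner, "Byte"];              (* raw bytes, no CSV parsing *)
  Export[FileNameJoin[{"out", StringDrop[inner, -3] <> "TSV"}],
   bytes, "Byte"]                                (* write bytes back verbatim *)
  ], {f, files}]
```

Each output file is then a byte-for-byte copy of the archived file, just with a .TSV name.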

POSTED BY: Sean G
Posted 8 months ago

There is no need to do it manually. You could write WL code that generates the series of unzip/mv commands and executes them using RunProcess or ExternalEvaluate.
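A rough sketch of that idea (the file pattern, working directory, and unzip flags are assumptions):

```mathematica
(* Run unzip/mv for every archive via RunProcess *)
Do[
 With[{inner = StringDrop[FileNameTake[f], -4]}, (* strip ".zip" *)
  RunProcess[{"unzip", "-o", f}];                 (* extract, overwriting *)
  RunProcess[{"mv", inner, StringDrop[inner, -3] <> "TSV"}] (* CSV -> TSV *)
  ], {f, FileNames["*.export.CSV.zip"]}]
```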

POSTED BY: Rohit Namjoshi
Posted 8 months ago

The only issue I have with this is that there are hundreds of these files, and more are added on a regular basis, so I want a Mathematica script that processes them automatically rather than doing it manually.

POSTED BY: Sean G
Posted 8 months ago

If the only problem is the wrong extension, I think it would be much easier, and much faster, to do this from the command line:

unzip 20230101.export.CSV.zip 20230101.export.CSV
mv 20230101.export.CSV 20230101.export.TSV
POSTED BY: Updating Name