Message Boards Message Boards

0
|
1538 Views
|
3 Replies
|
0 Total Likes
View groups...
Share
Share this post:

Problem doing a bulk import into SQL with GDELT data

Posted 10 months ago

Hi There,

This is a strange one (for me). I'm trying to import data from the GDELT project (http://gdeltproject.org/), one of the files is attached.

I have MS SQL installed on my PC. When I try to do bulk import from these files, I get errors about data truncating. On examining the files, it looks like there is some problems with commas in certain columns (37, 44, and 51).

If I open the files in Excel, Excel misinterprets these columns with a comma (even though this is supposed to be a tab delimited document).

When I try to do things like the following, it still fails to format the files in a way that SQL can read. removeCommasInColumns[rowList, colsList] := MapAt[StringReplace[#, "," -> ""] &, row, List /@ cols] dataworldevents = Import["mypath\20201114.export.CSV", "TSV"]; processedData = removeCommasInColumns[#, {37, 44, 51}] & /@ dataworldevents; Export["mypath\yourfile.csv", processedData, "TSV"];

This is the bulk upload query I'm using: dbQuery = "BULK INSERT worldevents FROM 'mypath\20201114.export.CSV' WITH ( FIRSTROW = 1, FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n', );";

Any help would be appreciated. I'd like to import a lot of data so this seems like it would be the fastest way to do it.

Attachments:
POSTED BY: Sean G
3 Replies
Posted 10 months ago

What is the exact error message from SQL Server? "data truncation" errors are generated when the column type is not wide enough to accommodate the data. E.g. the column is varchar(8) but the corresponding column in the TSV has more than 8 characters.

POSTED BY: Rohit Namjoshi
Posted 10 months ago

Hi Rohit,

I've been playing around with it for another while and it looks like there are a few problems with the table setup. I downloaded the table SQL from https://www.gdeltproject.org/data/lookups/SQL.tablecreate.txt

The column definitions seem to be inconsistent with the data. I started increasing the size of some columns from char(3) to char(6) and got the first 2 rows to insert (I also had to delete all the other rows in the TSV file for it to work). There were no issues with the part of the query that had commas so it looks like its just a badly defined or out-of-date table structure provided by GDELT.

The next issue I'm seeing is that one of the rows has text where it should have a number so it looks like I'll have to do some data cleaning before making the insert. My guess was that the data generated in this project would be a little more robust but maybe its prone to some errors?

Have you worked with GDELT data before?

Regards,
Sean.

POSTED BY: Sean G
Posted 9 months ago

Attached is the modified query that I used to create the daily update tables for GDELT. So far it seems to work (for MS SQL). Looking at the data, it looks like the one supplied by GDELT is not correct as a number of fields appear to be the wrong size. Hopefully this is useful to someone. This seems to fix the issues above.

Attachments:
POSTED BY: Sean G
Reply to this discussion
Community posts can be styled and formatted using the Markdown syntax.
Reply Preview
Attachments
Remove
or Discard

Group Abstract Group Abstract