GROUPS: Data Science, Wolfram Language, Geographic Information System

Spatial data in Tabular

Gareth Russell, New Jersey Institute of Technology

Posted 5 months ago

As someone who deals with large spatial datasets, I am pleased to see that GeoPosition is now a supported column type in Tabular. But a couple of words of warning to others:

1. It's not very efficient. In some basic testing, a column of GeoPosition objects takes around ten times as much storage as two machine-precision columns just holding the latitudes and longitudes! This seems a bit odd to me, as I thought the point of Tabular was to maximize efficiency by bundling all the type and 'wrapper' information into the header so that raw data are stored in an optimized way. And the raw data are just the two coordinates, so…

2. Only GeoPosition is supported, not GeoGridPosition, so there is no way to do a lot of common geo processing activities within the Tabular structure (taking advantage of its efficiencies). Again, the raw data for GeoGridPosition are just pairs of numbers: everything else (e.g., the projection) is 'wrapper.'

So it's a step in the right direction, but I hope that more steps are taken to make it a powerhouse for spatial data. Being able to do, for example, fast conversion between projections would be amazing. Oddly, both GeoPosition and GeoGridPosition can be used as 'wrappers' already in the sense that their arguments can be lists of coordinates, and this makes operations such as projection conversions much more efficient.

One final note: it's not necessarily obvious, but you can store lists in Tabular columns, so you can make one column that stores pairs of coordinates. This makes it relatively easy (though not especially fast) to apply geo processing functions to Tabular data.
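To illustrate the point about list-valued arguments, here is a minimal sketch (not the code from my own testing). It assumes random sample coordinates and the "Mercator" projection, and the column names "x" and "y" are made up for the example. It projects the points one at a time, then projects them all in a single call by handing GeoPosition the whole list of pairs, and finally stores the projected coordinates as two plain numeric columns.

```wolfram
(* Hypothetical sample data: 1000 random {latitude, longitude} pairs *)
latlong = Table[{RandomReal[{-80, 80}], RandomReal[{-180, 180}]}, {1000}];

(* Per-point version: wrap and project each pair individually,
   then take the first argument of GeoGridPosition, i.e. the {x, y} pair *)
perPoint = Map[First[GeoGridPosition[GeoPosition[#], "Mercator"]] &, latlong];

(* Array version: one GeoPosition wrapping the whole list of pairs,
   projected in a single call; First extracts the list of {x, y} pairs *)
batch = First[GeoGridPosition[GeoPosition[latlong], "Mercator"]];

(* Store the projected coordinates as two numeric columns.
   "x" and "y" are illustrative column names, not anything built in. *)
gridTab = ToTabular[batch, "Rows", <|"ColumnKeys" -> {"x", "y"}|>];
```

Wrapping each line that builds perPoint and batch in AbsoluteTiming should show how much the array form helps on a given machine.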

POSTED BY: Gareth Russell

4 Replies

Gareth Russell, New Jersey Institute of Technology

Posted 5 months ago

Thanks again for these clarifications. I think it is important to be transparent about this. I know that the Wolfram philosophy is to do things under the hood to make code 'just work' as expected, which is what seems to be going on with the caching here. But given Tabular's efficiency claims in both speed and memory footprint, I actually think it would be better to explain that things like unit conversions may mean that Normal won't return exactly the same expression, rather than to rely on some kind of default caching that may subvert the efficiencies. Maybe make "CacheOriginalExpression" -> False the default?

And in the case of GeoPosition, this argument for caching doesn't make sense to me, as GeoPosition is just a wrapper for a pair of floats. Surely one can 're-wrap' them and get back the exact same expression without having to cache a memory-intensive version in which each pair is an individual GeoPosition object?

Please understand that these comments are only because I am super excited about the possibilities Tabular offers for handling big data, including spatial data.

POSTED BY: Gareth Russell

Jose Martin-Garcia, Wolfram Research

Posted 5 months ago

Tabular stores data in highly efficient ways, and sometimes it needs to change the data for homogeneous storage. Then we want to preserve the original input for normalization. For example, columns of quantities store all magnitudes in the same unit. Compare these two cases:

In[]:= Normal@ToTabular[{{Quantity[1, "Kilometers"]}, {Quantity[1, "Meters"]}}]
Out[]= {{Quantity[1, "Kilometers"]}, {Quantity[1, "Meters"]}}

In[]:= Normal@ToTabular[{{Quantity[1, "Kilometers"]}, {Quantity[1, "Meters"]}}, Automatic, <|"CacheOriginalExpression" -> False|>]
Out[]= {{Quantity[1000, "Meters"]}, {Quantity[1, "Meters"]}}

In the first case we cached the data, and therefore Normal recovered the original expression, even though internally it was changed to meters. In the second case we didn't cache, and therefore Normal recovered an equivalent expression, but not the same one.

I may have misspoken about doing this only for small cases. I will check with my colleagues to see if we currently always do this when Tabular believes Normal may return an output that is equivalent but not identical.

POSTED BY: Jose Martin-Garcia

Jose Martin-Garcia, Wolfram Research

Posted 5 months ago

The difference in size is due to the fact that, for small Tabular objects, we cache the original expression. Try this instead:

latlong = Table[{RandomReal[{-80, 80}], RandomReal[{-180, 180}]}, {1000}];
tab1 = ToTabular[latlong, "Rows", <|"ColumnKeys" -> {"lat", "long"}|>];
tab2 = ToTabular[Map[{GeoPosition[#]} &, latlong], "Rows", <|"ColumnKeys" -> {"geo"}, "CacheOriginalExpression" -> False|>];

In[24]:= N[ByteCount[tab2]/ByteCount[tab1]]
Out[24]= 1.10757
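
To see how much of the footprint comes from the cache itself, one could build the same GeoPosition column twice, once with the default settings and once with caching turned off, and compare both against the plain two-column table. This is only a sketch that reuses the calls above; the names tabCached and tabUncached are illustrative, and the ratios will depend on the version and the data, so no output is shown.

```wolfram
(* Same sample data as above: 1000 random {latitude, longitude} pairs *)
latlong = Table[{RandomReal[{-80, 80}], RandomReal[{-180, 180}]}, {1000}];

(* Baseline: two plain numeric columns *)
tab1 = ToTabular[latlong, "Rows", <|"ColumnKeys" -> {"lat", "long"}|>];

(* GeoPosition column with default settings (the original expression may be cached) *)
tabCached = ToTabular[Map[{GeoPosition[#]} &, latlong], "Rows",
   <|"ColumnKeys" -> {"geo"}|>];

(* GeoPosition column with caching explicitly disabled *)
tabUncached = ToTabular[Map[{GeoPosition[#]} &, latlong], "Rows",
   <|"ColumnKeys" -> {"geo"}, "CacheOriginalExpression" -> False|>];

(* Storage of each GeoPosition table relative to the two-column baseline *)
ratios = N[{ByteCount[tabCached], ByteCount[tabUncached]}/ByteCount[tab1]]
```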

POSTED BY: Jose Martin-Garcia

Gareth Russell, New Jersey Institute of Technology

Posted 5 months ago

Interesting: thanks for that. Can I ask why? Is there some overhead to using the efficient data structures that makes Tabular slower for small datasets? And roughly where is the cutoff for "small"?

POSTED BY: Gareth Russell
