When to use Dataset vs. EntityStore for representing custom curated data

Posted 6 years ago

Dear community,

When pulling an external dataset into the Wolfram Language, a few steps help make the data computable in WL: wrapping longitude/latitude pairs in GeoPosition, using Quantity with units, using DateObject for dates and times, and referring to real-world entities where possible (for example, Country entities instead of plain country names). See the WDF guide.
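As a minimal sketch of those steps, the following wraps one raw record in computable objects. The record, its field names, and its values are hypothetical and only serve as illustration; the country name is assumed to already match the canonical entity name.

```wolfram
(* Hypothetical raw record, e.g. one row of an imported spreadsheet *)
raw = <|"Site" -> "Example Site", "Lat" -> 48.2, "Lon" -> 16.4,
   "Depth" -> 1.5, "Excavated" -> "2015-06-01", "Country" -> "Austria"|>;

(* Wrap each raw value in the matching computable object *)
computable = <|
   "Site" -> raw["Site"],
   "Location" -> GeoPosition[{raw["Lat"], raw["Lon"]}],  (* latitude first, then longitude *)
   "Depth" -> Quantity[raw["Depth"], "Meters"],          (* assumes the depth is given in meters *)
   "Excavated" -> DateObject[raw["Excavated"]],          (* parses the date string *)
   "Country" -> Entity["Country", raw["Country"]]        (* assumes a canonical entity name *)
   |>;

Dataset[{computable}]
```

For names that are not already canonical, something like Interpreter["Country"] may be a more robust way to obtain the entity.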

However, with the new EntityStore functionality (still flagged as Experimental in WL 11.3), there is a more fundamental decision to be made first: whether to represent the overall data as a structured Dataset or as an EntityStore.

To explain the problem, consider the two obvious boundary cases: 1) Tabular data, e.g. data that comes in an Excel sheet -> use a Dataset. 2) Relational data, such as data from an SQL database -> use an EntityStore. Am I getting this right?

However, there are often murkier cases. For example, consider radiocarbon date estimates aggregated from several publications, so that every row in the dataset contains a date for some archaeological sample, plus a field citing a publication. Each publication, in turn, has additional properties, such as co-authors, journal information, publication date, etc., some of which we might want to co-analyse with the primary data. While it is in principle possible to include this information as additional columns in a single dataset, that might cause huge redundancies if the number of unique publications is much smaller than the number of rows in the dataset. So it might be better in this case to add an EntityStore for publications, and then use the corresponding publication Entity as the entry in the primary Dataset. Alternatively, it might be better to use an EntityStore for both the primary and the publication data, to keep things more consistent.
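The mixed approach described above could be sketched roughly as follows. All names and values (the "Publication" type, the entity keys, journals, and ages) are made up for illustration. EntityRegister is assumed to be available, which as far as I know means version 12 or later; in 11.x the store would instead be added to $EntityStores.

```wolfram
(* Hypothetical publications: each one is stored once, with its own properties *)
pubs = EntityStore["Publication" -> <|
     "Entities" -> <|
       "Smith2010" -> <|"Label" -> "Smith et al. (2010)",
         "Journal" -> "Journal A", "Year" -> 2010|>,
       "Lee2014" -> <|"Label" -> "Lee (2014)",
         "Journal" -> "Journal B", "Year" -> 2014|>
       |>
     |>];
EntityRegister[pubs];

(* Primary data: rows refer to a publication entity instead of repeating its columns *)
samples = Dataset[{
    <|"Sample" -> "S1", "Age" -> Quantity[5200, "Years"],
     "Source" -> Entity["Publication", "Smith2010"]|>,
    <|"Sample" -> "S2", "Age" -> Quantity[5350, "Years"],
     "Source" -> Entity["Publication", "Smith2010"]|>,
    <|"Sample" -> "S3", "Age" -> Quantity[6100, "Years"],
     "Source" -> Entity["Publication", "Lee2014"]|>
    }];

(* Pull a publication property into the rows only when an analysis needs it *)
samples[All, Append[#, "Journal" -> #Source["Journal"]] &]
```

The publication details live in one place, and the Dataset stays narrow; whether this is preferable to putting both levels into an EntityStore is exactly the open question here.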

Part of my confusion also comes from the fact that Datasets in the Wolfram Language are much more flexible than the typical two-dimensional tables you get in other systems. A Dataset in WL supports arbitrarily nested hierarchical data, which means that it can cater for more complex use cases as well.
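To illustrate what a hierarchical Dataset can look like, here is a small sketch with hypothetical sites, layers, and ages nested three levels deep:

```wolfram
(* Hypothetical nested data: site -> layer -> list of sample records *)
nested = Dataset[<|
    "SiteA" -> <|
      "Layer1" -> {<|"Sample" -> "S1", "Age" -> 5200|>,
        <|"Sample" -> "S2", "Age" -> 5350|>},
      "Layer2" -> {<|"Sample" -> "S3", "Age" -> 6100|>}
      |>,
    "SiteB" -> <|
      "Layer1" -> {<|"Sample" -> "S4", "Age" -> 4800|>}
      |>
    |>];

(* Drill down by key at each level *)
nested["SiteA", "Layer1", All, "Age"]

(* Aggregate at the innermost level: number of samples per site and layer *)
nested[All, All, Length]
```

Queries address each level of the hierarchy in turn, so no flattening into a single table is needed.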

Are there best practices about this? Allowing users to custom curate datasets seems to be a heavy focus of development at Wolfram, given the EntityStore functionality, the Wolfram Data Repository, and new exciting developments with WL version 12 revealed in several recent blog posts, such as connecting EntityStores with SQL Databases and such. So I'd like to understand better how to best curate and publish my data for inclusion in the Wolfram Language.

Thanks for comments and ideas about this.

2 Replies

I have only a glancing familiarity with EntityStores (and not much more with Datasets). EntityStores resemble objects/types to me. Datasets are more what they sound like: perhaps repeated instances of certain data types (though that set of types can be irregular). The semantics of EntityStores would seem to have to bend to apply in the same way. By the same token, Datasets seem much more appropriate for SQL data than EntityStores. I think the Dataset documentation even mentions SQL-like operations in passing.

Thanks for an interesting question. I'd welcome corrections.

POSTED BY: Vincent Virgilio