WOLFRAM COMMUNITY

GROUPS:

Data Science Curated Data Wolfram Language

When to use Dataset vs. EntityStore for representing custom curated data

Stephan Schiffels

Posted 6 years ago

Dear community,

When pulling an external dataset into Wolfram Language, a couple of steps help make the data computable with WL: wrapping longitudes/latitudes in GeoPosition, using Quantity with units, using DateObject for dates, and referring to real-world entities where possible (such as countries). See the WDF guide.

However, with the new EntityStore functionality (still flagged as Experimental in WL 11.3) there is a more fundamental decision to be made first: whether to represent the overall data as a structured Dataset or as an EntityStore. To explain the problem, consider the two obvious boundary cases:

1) Tabular data, e.g. data that comes in an Excel sheet -> use Datasets.
2) Relational data, such as data from an SQL database -> use EntityStores.

Am I getting this right?

There are often murkier cases, however. For example, consider radiocarbon date estimates aggregated from several publications: every row in the dataset contains a date for some archaeological sample, and one field cites a publication. Each publication, in turn, has additional properties, such as co-authors, journal information, publication date, etc., some of which we might want to co-analyse with the primary data. While it is possible in principle to include this information as additional columns in a single dataset, that would cause huge redundancy if the number of unique publications is much smaller than the number of rows. So in this case it might be better to add an EntityStore for publications and then use the corresponding publication Entity as an entry in the primary Dataset. Alternatively, it might be better to use an EntityStore for both the primary and the publication data, to keep things consistent.

Part of my confusion also comes from the fact that `Dataset[]`s in the Wolfram Language are much more flexible than the typical two-dimensional layout you get in other systems. A `Dataset[]` in WL supports arbitrarily nested hierarchical data, not just two-dimensional tables, so it can cater for more complex use cases as well.

Are there best practices for this? Letting users curate custom datasets seems to be a heavy focus of development at Wolfram, given the EntityStore functionality, the Wolfram Data Repository, and the new developments in WL version 12 revealed in several recent blog posts, such as connecting EntityStores with SQL databases. So I'd like to understand better how best to curate and publish my data for inclusion in the Wolfram Language.

Thanks for comments and ideas about this.
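For concreteness, the mixed approach described above (a publication EntityStore referenced from a primary Dataset) might look roughly like the following. This is only a minimal sketch, not an established best practice: the entity type "Publication", the identifiers "PubA" and "PubB", the property names, and all values are made up for illustration, and it assumes a version in which EntityRegister is available.

```wolfram
(* Hypothetical publication records; every name and value here is illustrative only *)
pubStore = EntityStore[
   "Publication" -> <|
     "Entities" -> <|
       "PubA" -> <|"Journal" -> "Journal X", "Year" -> 2010|>,
       "PubB" -> <|"Journal" -> "Journal Y", "Year" -> 2015|>
     |>
   |>];

(* Make Entity["Publication", ...] resolvable in the current session *)
EntityRegister[pubStore];

(* Primary data stays tabular; each row only holds a reference to a publication *)
dates = Dataset[{
    <|"Sample" -> "S1", "Age" -> Quantity[3200, "Years"],
      "Location" -> GeoPosition[{48.1, 11.6}],
      "Publication" -> Entity["Publication", "PubA"]|>,
    <|"Sample" -> "S2", "Age" -> Quantity[4100, "Years"],
      "Location" -> GeoPosition[{48.3, 11.9}],
      "Publication" -> Entity["Publication", "PubA"]|>,
    <|"Sample" -> "S3", "Age" -> Quantity[2900, "Years"],
      "Location" -> GeoPosition[{50.9, 11.6}],
      "Publication" -> Entity["Publication", "PubB"]|>
   }];

(* Pull a publication property into the rows only when it is needed *)
withJournal = dates[All, Append[#, "Journal" -> #["Publication"]["Journal"]] &]
```

With this layout the publication details are stored once per publication rather than once per sample, which is the redundancy concern raised above.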


2 Replies


Stephan Schiffels

Posted 6 years ago


Vincent Virgilio

Posted 6 years ago

I have only a glancing familiarity with EntityStores (and not much more with Datasets). EntityStores resemble objects/types to me. Datasets are more what they sound like: perhaps repeated instances of certain data types (though that set of types can be irregular). The semantics of EntityStores would seem to have to bend to apply in the same way. By the same token, Datasets seem much more appropriate for SQL data than EntityStores. I think the Dataset documentation even mentions SQL-like operations in passing. Thanks for an interesting question. I'd welcome corrections.
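To illustrate what SQL-like operations on a Dataset can look like, here is a small sketch. The column names and values are hypothetical, and the SQL in the comments is only a loose analogy, not an exact equivalence.

```wolfram
(* Hypothetical rows; all values are illustrative only *)
samples = Dataset[{
    <|"Sample" -> "S1", "Age" -> 3200, "Source" -> "PubA"|>,
    <|"Sample" -> "S2", "Age" -> 4100, "Source" -> "PubA"|>,
    <|"Sample" -> "S3", "Age" -> 2900, "Source" -> "PubB"|>
   }];

(* Roughly: SELECT Sample, Age FROM samples WHERE Age > 3000 *)
older = samples[Select[#Age > 3000 &], {"Sample", "Age"}]

(* Roughly: SELECT Source, COUNT(*) FROM samples GROUP BY Source *)
counts = samples[GroupBy["Source"], Length]
```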

