Dear community,
When pulling an external dataset into the Wolfram Language, there are a couple of steps that help make the data computable with WL. These include wrapping longitudes/latitudes in GeoPosition, using Quantity with units, using DateObject for dates, and referring to real-world entities where possible (such as country names). See the WDF guide.
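For concreteness, here is a minimal sketch of what I mean by these steps, with made-up field names and values:

    (* hypothetical sample row; field names and values are made up *)
    row = <|
      "Country"  -> Entity["Country", "Germany"],   (* real-world entity *)
      "Location" -> GeoPosition[{52.52, 13.40}],    (* latitude, longitude *)
      "Age"      -> Quantity[3450, "Years"],        (* quantity with units *)
      "Sampled"  -> DateObject[{2017, 6, 15}]       (* date object *)
    |>;
    Dataset[{row}]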
However, with the new EntityStore functionality (still flagged as Experimental in WL 11.3), there is a more fundamental decision to be made first: whether to represent the overall data as a structured Dataset, or as an EntityStore.
To explain the problem, consider the two obvious boundary cases: 1) tabular data, e.g. data that comes in an Excel sheet -> use a Dataset; 2) relational data, such as data from an SQL database -> use an EntityStore. Am I getting this right? (See the sketch below for how I picture the two cases.)
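Roughly, I imagine the two cases looking like this (hypothetical sample data, entity type names made up by me):

    (* 1) tabular data: a flat list of associations works naturally as a Dataset *)
    tabular = Dataset[{
       <|"Sample" -> "S1", "Age" -> Quantity[3450, "Years"]|>,
       <|"Sample" -> "S2", "Age" -> Quantity[2890, "Years"]|>
     }];

    (* 2) relational data: each "table" becomes an entity type in an EntityStore *)
    store = EntityStore[
       "Sample" -> <|
         "Entities" -> <|
           "S1" -> <|"Age" -> Quantity[3450, "Years"]|>,
           "S2" -> <|"Age" -> Quantity[2890, "Years"]|>
         |>
       |>
     ];
    EntityRegister[store];
    EntityValue[Entity["Sample", "S1"], "Age"]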
However, there are often murkier cases. For example, consider radiocarbon-date estimates aggregated from several publications: every row in the dataset contains a date for some archaeological sample, together with a field citing a publication. Each publication, in turn, has additional properties, such as co-authors, journal information, publication date, etc., some of which we might want to co-analyse with the primary data. While it is in principle possible to include this information as additional columns in a single dataset, that would cause huge redundancy if the number of unique publications is much smaller than the number of rows in the dataset. So it might be better in this case to add an EntityStore for publications and then use the corresponding Publication Entity as the entry in the primary Dataset (sketched below). Alternatively, it might be more consistent to use an EntityStore for both the primary data and the publication data.
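A rough sketch of that mixed approach, with hypothetical publication and sample data:

    (* publications live in an EntityStore; names and values are made up *)
    pubStore = EntityStore[
       "Publication" -> <|
         "Entities" -> <|
           "Smith2010" -> <|"Authors" -> {"Smith", "Jones"},
                            "Journal" -> "J. Arch. Sci.",
                            "Year"    -> 2010|>
         |>
       |>
     ];
    EntityRegister[pubStore];

    (* the primary Dataset refers to publications by Entity *)
    radiocarbon = Dataset[{
       <|"Sample" -> "S1", "Age" -> Quantity[3450, "Years"],
         "Source" -> Entity["Publication", "Smith2010"]|>,
       <|"Sample" -> "S2", "Age" -> Quantity[2890, "Years"],
         "Source" -> Entity["Publication", "Smith2010"]|>
     }];

    (* publication properties can then be looked up on demand instead of being
       duplicated across rows, e.g. the publication year for each sample *)
    radiocarbon[All, EntityValue[#Source, "Year"] &]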
Part of my confusion also comes from the fact that Dataset[]s in the Wolfram Language are much more flexible than the typical two-dimensional layout you get in other systems. A Dataset[] in WL supports arbitrarily hierarchical data, not just flat tables, so it can cater for more complex use cases as well.
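For example, something like this (hypothetical values) is a perfectly valid Dataset, and queries can aggregate across the hierarchy:

    (* samples grouped under sites, with per-sample subrecords *)
    hierarchical = Dataset[<|
       "SiteA" -> <|
         "S1" -> <|"Age" -> Quantity[3450, "Years"], "Material" -> "Charcoal"|>,
         "S2" -> <|"Age" -> Quantity[2890, "Years"], "Material" -> "Bone"|>
       |>,
       "SiteB" -> <|
         "S3" -> <|"Age" -> Quantity[5120, "Years"], "Material" -> "Shell"|>
       |>
     |>];

    (* e.g. the mean age per site *)
    hierarchical[All, Mean, "Age"]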
Are there best practices for this? Allowing users to curate custom datasets seems to be a heavy focus of development at Wolfram, given the EntityStore functionality, the Wolfram Data Repository, and the exciting developments coming with WL version 12 revealed in several recent blog posts, such as connecting EntityStores to SQL databases. So I'd like to better understand how to curate and publish my data for inclusion in the Wolfram Language.
Thanks for comments and ideas about this.