When to use Dataset vs. EntityStore for representing custom curated data

Posted 6 years ago

Dear community,

When pulling an external dataset into the Wolfram Language, a few steps help make the data computable with WL: wrapping longitude/latitude pairs in GeoPosition, using Quantity with units, using DateObject for dates and times, and referring to real-world entities where possible (such as countries). See the WDF guide.
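As a minimal sketch of those steps, here is how one raw record might be converted. The record and its field names are made up for illustration, and the country string is assumed to already match the canonical entity name (for free-form names, Interpreter["Country"] would be the more robust route).

```wolfram
(* Hypothetical raw record, e.g. one row of an imported spreadsheet *)
raw = <|"Site" -> "Site A", "Lat" -> 48.21, "Lon" -> 16.37,
   "Country" -> "Austria", "Depth" -> 1.5, "Excavated" -> {2015, 7, 14}|>;

(* Wrap each raw value in the matching symbolic form *)
computable = <|
   "Site" -> raw["Site"],
   "Position" -> GeoPosition[{raw["Lat"], raw["Lon"]}],
   (* assumes the string is already the canonical entity name *)
   "Country" -> Entity["Country", raw["Country"]],
   "Depth" -> Quantity[raw["Depth"], "Meters"],
   "Excavated" -> DateObject[raw["Excavated"]]
   |>;
```

Once wrapped this way, the values can go straight into functions such as GeoDistance, UnitConvert or DateDifference without further parsing.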

However, with the new EntityStore functionality (still flagged as Experimental in WL 11.3), there is a more fundamental decision to be made first: whether to represent the overall data as a structured Dataset or as an EntityStore.

To explain the problem, consider the two obvious boundary cases: 1) Tabular data, e.g. data that comes in an Excel Sheet -> Use Datasets. 2) Relational data, such as data from an SQL database -> Use EntityStores. Am I getting this right?

However, there are often murkier cases. For example, consider radiocarbon date estimates aggregated from several publications: every row in the dataset contains a date for some archaeological sample, and there is a field citing a publication. Each publication, in turn, has additional properties, such as co-authors, journal information, publication date, etc., some of which we might want to co-analyse with the primary data. While it is in principle possible to include this information as additional columns in a single dataset, that might cause huge redundancy if the number of unique publications is much smaller than the number of rows in the dataset. So it might be better in this case to add an EntityStore for publications, and then use the corresponding publication Entity as an entry in the primary Dataset. Alternatively, it might be better to use an EntityStore for both the primary and the publication data, to keep things more consistent.
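To make the hybrid option concrete, here is a rough sketch. The type name "SourcePublication", the entity names and all values are hypothetical, and registration uses $EntityStores as in 11.3 (version 12 also offers EntityRegister).

```wolfram
(* Hypothetical publication store: each publication is stored once *)
pubStore = EntityStore[
   "SourcePublication" -> <|
     "Entities" -> <|
       "PubA" -> <|"Journal" -> "Journal X", "Year" -> 2010|>,
       "PubB" -> <|"Journal" -> "Journal Y", "Year" -> 2014|>
       |>,
     "Properties" -> <|
       "Journal" -> <|"Label" -> "journal"|>,
       "Year" -> <|"Label" -> "publication year"|>
       |>
     |>];

(* Make the custom entities resolvable in this session *)
PrependTo[$EntityStores, pubStore];

(* Primary data stays tabular; each row only points at its publication *)
samples = Dataset[{
    <|"Sample" -> "S1", "Age" -> Quantity[3200, "Years"],
     "Source" -> Entity["SourcePublication", "PubA"]|>,
    <|"Sample" -> "S2", "Age" -> Quantity[2950, "Years"],
     "Source" -> Entity["SourcePublication", "PubA"]|>,
    <|"Sample" -> "S3", "Age" -> Quantity[4100, "Years"],
     "Source" -> Entity["SourcePublication", "PubB"]|>
    }];

(* Co-analysis: look up a publication property through the entity *)
samples[All, Append[#, "PubYear" -> #Source["Year"]] &]
```

The publication details live in one place, so correcting a journal name or adding a co-author list does not touch the sample rows.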

Part of my confusion also comes from the fact that Dataset[]s in the Wolfram Language are much more flexible than the typical two-dimensional layout you get in other systems. For example, a Dataset[] in WL supports arbitrarily hierarchical datasets, not just two-dimensional layouts, which means that it can cater for more complex use cases as well.
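For illustration, a small nested Dataset could look like the sketch below; the site names, sample IDs and ages are invented.

```wolfram
(* Hypothetical two-level data: sites, each holding its own table of samples *)
nested = Dataset[<|
    "SiteA" -> <|"Country" -> "Austria",
      "Samples" -> {<|"ID" -> "S1", "AgeBP" -> 3200|>,
        <|"ID" -> "S2", "AgeBP" -> 2950|>}|>,
    "SiteB" -> <|"Country" -> "Italy",
      "Samples" -> {<|"ID" -> "S3", "AgeBP" -> 4100|>}|>
    |>];

(* Descend through the hierarchy: the ages of all samples, grouped by site *)
nested[All, "Samples", All, "AgeBP"]

(* Aggregate inside each site: mean sample age per site *)
nested[All, Mean[#Samples[[All, "AgeBP"]]] &]
```

Note that the sub-tables may have different lengths per site, which a flat two-dimensional table could only express by repeating the site columns on every sample row.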

Are there best practices about this? Allowing users to custom curate datasets seems to be a heavy focus of development at Wolfram, given the EntityStore functionality, the Wolfram Data Repository, and new exciting developments with WL version 12 revealed in several recent blog posts, such as connecting EntityStores with SQL Databases and such. So I'd like to understand better how to best curate and publish my data for inclusion in the Wolfram Language.

Thanks for comments and ideas about this.

2 Replies

Well, while Datasets do allow some pretty nice querying functionality similar to SQL, they lack the relational structure needed to map more complex datasets. For example, this blog post about MathOverflow shows how to represent a very large dataset, with about 3 million entities, in a Wolfram EntityStore. That example is really impressive, with entity types for Votes, Users, Posts, Tags, Comments and more. These entities refer to each other through their properties, so you can very quickly query a post's user and its comments, or the users of all the comments for a post, and so on. What you can do with such a dataset is quite amazing, ranging from simple descriptive statistics to modelling to graph-based analyses such as user-user interaction networks.
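A toy version of such cross-referencing might look like the following. This is only a sketch: the type names "QAUser" and "QAPost" and all entities are made up here and are not taken from the blog post.

```wolfram
(* Hypothetical store with two entity types that refer to each other *)
qaStore = EntityStore[{
    "QAUser" -> <|
      "Entities" -> <|
        "u1" -> <|"Name" -> "Alice"|>,
        "u2" -> <|"Name" -> "Bob"|>
        |>,
      "Properties" -> <|"Name" -> <|"Label" -> "name"|>|>
      |>,
    "QAPost" -> <|
      "Entities" -> <|
        "p1" -> <|"Title" -> "First question",
          "Owner" -> Entity["QAUser", "u1"],
          "Commenters" -> {Entity["QAUser", "u2"]}|>
        |>,
      "Properties" -> <|
        "Title" -> <|"Label" -> "title"|>,
        "Owner" -> <|"Label" -> "owner"|>,
        "Commenters" -> <|"Label" -> "commenters"|>
        |>
      |>
    }];

PrependTo[$EntityStores, qaStore];

(* Follow a reference from a post to its user *)
Entity["QAPost", "p1"]["Owner"]["Name"]

(* Names of everyone who commented on the post *)
EntityValue[Entity["QAPost", "p1"]["Commenters"], "Name"]
```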

I guess part of my question also relates to the Wolfram Data Repository, which hosts curated datasets, contributed by users. It allows for both tabular datasets and EntityStores. Here is what Stephen Wolfram wrote in his blog about the Wolfram Data Repository:

Many of the data resources currently in the Wolfram Data Repository are quite tabular in nature. But unlike traditional spreadsheets or tables in databases, they’re not restricted to having just one level of rows and columns—because they’re represented using symbolic Wolfram Language Dataset constructs, which can handle arbitrarily ragged structures, of any depth. [...] But what about data that normally lives in relational or graph databases? Well, there’s a construct called EntityStore that was recently added to the Wolfram Language. We’ve actually been using something like it for years inside Wolfram|Alpha. But what EntityStore now does is to let you set up arbitrary networks of entities, properties and values, right in the Wolfram Language. It typically takes more curation than setting up something like a Dataset—but the result is a very convenient representation of knowledge, on which all the same functions can be used as with built-in Wolfram Language knowledge.

So when curating a dataset for release on the Wolfram Data Repository, one needs to decide on the basic structure: a tabular dataset (perhaps with nested subtables, which is possible with Datasets, unlike, for example, with R data frames), an EntityStore, which allows for arbitrarily complex relationships between different types of data, or perhaps a hybrid approach, with a tabular data table that includes entities for which you provide an EntityStore. I'm not sure whether the hybrid approach is possible on the Wolfram Data Repository, or whether it would be considered best practice.

Finally, with WL version 12 things seem to become even more complicated. Consider this blog post where SPARQL queries and graph databases are introduced, which seem to be somewhat similar to Entity Stores, but using a separate GraphStore` package. I don't know its relationship to the Entity Framework, but it looks a bit similar.

I have only a glancing familiarity with EntityStores (and not much more with Datasets). They resemble objects/types to me. Datasets are more how they sound: perhaps repeated instances of certain data types (though that set of types can be irregular). The semantics of EntityStores would seem to have to bend to apply in the same way. By the same token, Datasets seem much more appropriate for SQL data than EntityStores. I think the Dataset documentation even mentions SQL-like operations in passing.

Thanks for an interesting question. I'd welcome corrections.

POSTED BY: Vincent Virgilio