Data transformation workflows
General description
The Data Transformation (DT) lectures outlined below are to be recorded for Wolfram U in October-December 2020.
Here is the Wolfram U recordings playlist at YouTube
Each tutorial lecture is between 25 and 40 minutes.
Most of the workflows in the lectures have (easy to derive) corresponding code in the programming languages: Julia, Python, R, and WL. (In Python and Julia dedicated data transformation packages are utilized.)
For didactic purposes we are going to use the Domain Specific Language (DSL) Raku-programmed system DSL::English::DataQueryWorkflows, [AAp1], that generates code for the programming languages R and WL and the software packages:
- Julia-DataFrames
- Python-pandas
- R-tidyverse (mostly dplyr)
The WL generated code uses functions from the Wolfram Function Repository.
(See the references section "Wolfram Function Repository functions".)
The tutorials sequence has an "inverted" structure: the most conceptually rich and technical sessions are presented first.
Detailed tutorial plan
Introduction through questions and answers
- What are the simplifying assumptions for the family of workflows considered in this tutorial?
- We do not consider hard to generalize- or too specialized transformations.
- For example, we do not consider:
- Data gathering/harvesting
- Data ingestion by parsing of, say, XML structures
- Date-time specific transformations
- Natural text specific transformations
- Etc.
What are the simplifying assumptions about the targeted data and transformations?
- Tabular data and collections of tabular data. (E.g. lists of datasets.)
- Transformation workflows that can be expressed with a certain "standard" or "well-known" subset of SQL.
- The flow chart given below illustrates well the targeted DT.
Do these workflows apply to other programming languages and/or data systems?
- Yes, with the presented DT know-how we target multiple "data science" programming languages.
Are additional packages needed to run the codes?
- Yes and no: depends on the target data transformation system.
- Sometimes the WL code uses Wolfram Function Repository functions.
Are there similar tutorials dedicated to other programming languages or packages?
- Yes, many. Both for WL and the rest (Julia/Python/R.)
- That said, in this tutorial we use a certain simplified and streamlined data-and-transformations model that allows the development of cross-system, transferable know-how and workflows.
What are the most important concepts for a newcomer to data wrangling?
- Cross tabulation (or contingency matrices)
- (Inner) joins
- The so called Split-apply-combine pattern
- Long form and wide form (or data pivoting)
Reshaping of data -- detailed examples
Basic operations and joins
Basic operations
- Data load
- Column selection
- Row filtering
- Summarization
- Ways to summarize data
- "Global" summary
- Summarization per group and per column
- Summarization vs Cross tabulation
Joins
- Types of joins
- Inner joins
- Left joins
- Semi-joins
Additional topics
- Example combinations with ML workflows
- Making a recommender system
- Making a classifier
Visual aids
Here is a flow chart that encompasses a large fraction of typical tabular data transformation workflows:
This call of ToDataQueryWorkflow
through ExternalParsersHookUp
, [AAp2], over a sequence of natural commands:
ToDataQueryWorkflowCode["use dfTitanic;
filter with passengerSurvival is 'survived';
rename passengerClass, passengerSex as class and sex;
cross tabulate class and sex", "Execute" -> False]
generates this WL code:
Hold[obj = dfTitanic;
obj = Select[obj, #1["passengerSurvival"] == "survived" &];
obj = (Join[KeyDrop[#1, {"passengerClass", "passengerSex"}], Association["class" -> #1["passengerClass"], "sex" -> #1["passengerSex"]]] &) /@ obj;
obj = ResourceFunction["CrossTabulate"][({#1["class"], #1["sex"]} &) /@ obj]]
which when executed gives the dataset:
Dataset[<|
"1st" -> <|"female" -> 139, "male" -> 61|>,
"2nd" -> <|"female" -> 94, "male" -> 25|>,
"3rd" -> <|"female" -> 106, "male" -> 75|>|>,
TypeSystem`Assoc[TypeSystem`Atom[String], TypeSystem`Struct[{"female", "male"}, {TypeSystem`Atom[Integer], TypeSystem`Atom[Integer]}], 3],
<|"ID" -> 177940708079520|>]
References
Articles
[HW1] Hadley Wickham, "The Split-Apply-Combine Strategy for Data Analysis", (2011), Journal of Statistical Software.
[AA1] Anton Antonov, "Contingency tables creation examples", (2016), MathematicaForPrediction at WordPress.
[AA2] Anton Antonov, "Pareto principle adherence examples", (2016), MathematicaForPrediction at WordPress.
Packages
[AAp1] Anton Antonov, DSL::English::DataQueryWorkflows Raku package, (2020), GitHub/antononcube.
[AAp2] Anton Antonov, External Parsers Hookup Mathematica package, (2020), ConversationalAgents at GitHub.
[AAp3] Anton Antonov, Cross tabulation implementation in Mathematica, (2017), MathematicaForPrediction at GitHub.
[AAp4] Anton Antonov, Data reshaping Mathematica package, (2018), MathematicaForPrediction at GitHub.
Wolfram Function Repository functions
Anton Antonov, CrossTabulate
.
Anton Antonov, ExampleDataset
. (Submitted)
Anton Antonov, ParetoPrinciplePlot
.
Anton Antonov, RecordsSummary
.
Anton Antonov, ToLongForm
. (Submitted)
Anton Antonov, ToWideForm
. (Submitted)
Seth Chandler, ResetDataset
.
Attachments: