Data transformation workflows
General description
The Data Transformation (DT) lectures outlined below are to be recorded for Wolfram U in October-December 2020.
Here is the Wolfram U recordings playlist at YouTube
Each tutorial lecture is between 25 and 40 minutes.
Most of the workflows in the lectures have (easy to derive) corresponding code in the programming languages: Julia, Python, R, and WL. (In Python and Julia dedicated data transformation packages are utilized.)
For didactic purposes we are going to use the Domain Specific Language (DSL) Raku-programmed system DSL::English::DataQueryWorkflows, [AAp1], that generates code for the programming languages R and WL and the software packages:
- Julia-DataFrames
- Python-pandas
- R-tidyverse (mostly dplyr)
The WL generated code uses functions from the Wolfram Function Repository.
(See the references section "Wolfram Function Repository functions".)
The tutorials sequence has an "inverted" structure: the most conceptually rich and technical sessions are presented first.
Detailed tutorial plan
Introduction through questions and answers
- What are the simplifying assumptions for the family of workflows considered in this tutorial?
- We do not consider hard to generalize- or too specialized transformations.
- For example, we do not consider:
- Data gathering/harvesting
- Data ingestion by parsing of, say, XML structures
- Date-time specific transformations
- Natural text specific transformations
- Etc.
What are the simplifying assumptions about the targeted data and transformations?
- Tabular data and collections of tabular data. (E.g. lists of datasets.)
- Transformation workflows that can be expressed with a certain "standard" or "well-known" subset of SQL.
- The flow chart given below illustrates well the targeted DT.
Do these workflows apply to other programming languages and/or data systems?
- Yes, with the presented DT know-how we target multiple "data science" programming languages.
Are additional packages needed to run the codes?
- Yes and no: depends on the target data transformation system.
- Sometimes the WL code uses Wolfram Function Repository functions.
Are there similar tutorials dedicated to other programming languages or packages?
- Yes, many. Both for WL and the rest (Julia/Python/R.)
- That said, in this tutorial we use a certain simplified and streamlined data-and-transformations model that allows the development of cross-system, transferable know-how and workflows.
What are the most important concepts for a newcomer to data wrangling?
- Cross tabulation (or contingency matrices)
- (Inner) joins
- The so called Split-apply-combine pattern
- Long form and wide form (or data pivoting)
Reshaping of data -- detailed examples
Basic operations and joins
Basic operations
- Data load
- Column selection
- Row filtering
- Summarization
- Ways to summarize data
- "Global" summary
- Summarization per group and per column
- Summarization vs Cross tabulation
Joins
- Types of joins
- Inner joins
- Left joins
- Semi-joins
Additional topics
- Example combinations with ML workflows
- Making a recommender system
- Making a classifier
Visual aids
Here is a flow chart that encompasses a large fraction of typical tabular data transformation workflows:
This call of ToDataQueryWorkflow through ExternalParsersHookUp, [AAp2], over a sequence of natural commands:
ToDataQueryWorkflowCode["use dfTitanic;
filter with passengerSurvival is 'survived';
rename passengerClass, passengerSex as class and sex;
cross tabulate class and sex", "Execute" -> False]
generates this WL code:
Hold[obj = dfTitanic;
obj = Select[obj, #1["passengerSurvival"] == "survived" &];
obj = (Join[KeyDrop[#1, {"passengerClass", "passengerSex"}], Association["class" -> #1["passengerClass"], "sex" -> #1["passengerSex"]]] &) /@ obj;
obj = ResourceFunction["CrossTabulate"][({#1["class"], #1["sex"]} &) /@ obj]]
which when executed gives the dataset:
Dataset[<|
"1st" -> <|"female" -> 139, "male" -> 61|>,
"2nd" -> <|"female" -> 94, "male" -> 25|>,
"3rd" -> <|"female" -> 106, "male" -> 75|>|>,
TypeSystem`Assoc[TypeSystem`Atom[String], TypeSystem`Struct[{"female", "male"}, {TypeSystem`Atom[Integer], TypeSystem`Atom[Integer]}], 3],
<|"ID" -> 177940708079520|>]
References
Articles
[HW1] Hadley Wickham, "The Split-Apply-Combine Strategy for Data Analysis", (2011), Journal of Statistical Software.
[AA1] Anton Antonov, "Contingency tables creation examples", (2016), MathematicaForPrediction at WordPress.
[AA2] Anton Antonov, "Pareto principle adherence examples", (2016), MathematicaForPrediction at WordPress.
Packages
[AAp1] Anton Antonov, DSL::English::DataQueryWorkflows Raku package, (2020), GitHub/antononcube.
[AAp2] Anton Antonov, External Parsers Hookup Mathematica package, (2020), ConversationalAgents at GitHub.
[AAp3] Anton Antonov, Cross tabulation implementation in Mathematica, (2017), MathematicaForPrediction at GitHub.
[AAp4] Anton Antonov, Data reshaping Mathematica package, (2018), MathematicaForPrediction at GitHub.
Wolfram Function Repository functions
Anton Antonov, CrossTabulate.
Anton Antonov, ExampleDataset. (Submitted)
Anton Antonov, ParetoPrinciplePlot.
Anton Antonov, RecordsSummary.
Anton Antonov, ToLongForm. (Submitted)
Anton Antonov, ToWideForm. (Submitted)
Seth Chandler, ResetDataset.
Attachments: