WOLFRAM COMMUNITY

GROUPS:

Data Science Wolfram Language

Data cleaning, wrangling, munging with Mathematica

David Proffer

Posted 9 years ago

I browsed through Mathematica StackExchange and the Data Science group at the Wolfram Community and was not able to find any comprehensive discussion of this important topic. Clearly all the tools are available in Mathematica, and once clean data is in, exploratory data analysis can be done much better than with environments like R and Python. However, this critical first step does not seem to have been addressed in a comprehensive way, as it has been in books such as:

Data Wrangling with R by Boehmke
Data Wrangling with Python by Kazil & Jarmul

Having a simple guide for Mathematica operations, along the lines of the Data Wrangling with dplyr and tidyr Cheat Sheet by RStudio, would be a good start.

I have not seen any news or motion from Wolfram on their Data Science Platform. Is there any information on how it might make learning and using data cleaning procedures less 'exploratory'?


3 Replies



Anton Antonov, Accendo Data LLC

Posted 9 years ago

Summarization of data

Some time ago I programmed a function, `RecordsSummary`, inspired by R's `summary`. Here is an example of its usage: Census data summary. As the name implies, it is assumed that we have a list of records, all with the same length, and we want the columns to be summarized. (Each record is a row.)

You can get the package MathematicaForPredictionUtilities.m from MathematicaForPrediction at GitHub or simply run this command:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MathematicaForPredictionUtilities.m"]

Let us create random data.

data = RandomInteger[{0, 100}, 100];
dataCat = RandomChoice[Characters["azbuka"], 100];
data2 = RandomInteger[{0, 100}, {100, 4}];

Here are examples of using RecordsSummary over the created data.

1. Call on a 1D array:

RecordsSummary[data]

2. Summary of a 2D numeric array. The columns are named automatically.

RecordsSummary[N[data2]]

3. Fancy output of numerical and categorical data summary. The column names are the second argument given to RecordsSummary.

Grid[{RecordsSummary[Transpose[{N[data], dataCat}], {"Numeric", "Categorical"}]}, Alignment -> Top, Dividers -> All]

Mosaic plots

Using mosaic plots for data exploration/visualization was described in these WordPress blog posts and this Community discussion. Here is an example:

Import["https://raw.githubusercontent.com/antononcube/MathematicaForPrediction/master/MosaicPlot.m"]

titanicDataset = Map[Flatten, List @@@ ExampleData[{"MachineLearning", "Titanic"}, "Data"]];
Dimensions[titanicDataset]
(* {1309, 4} *)

titanicVarNames = Flatten[List @@ ExampleData[{"MachineLearning", "Titanic"}, "VariableDescriptions"]]
(* {"passenger class", "passenger age", "passenger sex", "passenger survival"} *)

MosaicPlot[titanicDataset[[All, {1, 3, 4}]], ColorRules -> {3 -> ColorData[7, "ColorList"]}]



Anton Antonov, Accendo Data LLC

Posted 9 years ago

I think what you are looking for is covered by Sander Huisman's answer, and the documentation and examples of `Dataset`.

"I browsed through Mathematica StackExchange and the Data Science group at the Wolfram Community and was not able to find any comprehensive discussion of this important topic."

In my opinion the reason is that in Mathematica data manipulation and massaging is not that hard compared to, say, R. This is somewhat similar to "The Lisp Curse": you see lots of articles on data massaging in R because the base R commands are not that easy or intuitive to use. Judging by the articles that introduce it, plyr seems to have been created to address those deficiencies.
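As a rough illustration of the kind of `Dataset` queries the documentation covers, here is a minimal sketch. The records, the field names ("name", "group", "value") and the choice of `NumericQ` as the completeness test are all invented for this example and are not taken from the thread.

```wolfram
(* Hypothetical records, invented for illustration only. *)
ds = Dataset[{
    <|"name" -> "a", "group" -> "x", "value" -> 1.5|>,
    <|"name" -> "b", "group" -> "y", "value" -> 2|>,
    <|"name" -> "c", "group" -> "x", "value" -> Missing["NotAvailable"]|>,
    <|"name" -> "d", "group" -> "y", "value" -> 4|>
   }];

(* Keep only the rows whose "value" field is an actual number. *)
complete = ds[Select[NumericQ[#value] &]];

(* Restrict every remaining row to two of its columns. *)
complete[All, {"name", "value"}]

(* Group the rows by "group", then take the mean of "value" in each group. *)
complete[GroupBy["group"], Mean, "value"]
```

Each query reads left to right: an operator applied to the rows, then an optional aggregation, then the column to extract.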



Sander Huisman, University of Twente

Posted 9 years ago


Without an explicit example it is hard to give you much insight into data cleaning. However, there are common functions that are used to select/convert/transform data:

Part (* to select parts of something based on indices *)
Select (* to select something based on a True/False criterion *)
Cases/FirstCase (* to 'select' something based on its structure *)
ToExpression (* to convert a string to an expression *)
UnitConvert/Quantity/QuantityMagnitude (* to add/remove/convert quantities *)
StringSplit (* split string data into parts *)
StringTake/StringDrop (* take parts of strings *)
Map/Apply (* used in conjunction with ToExpression, to convert a whole collection of items to expressions *)
Delete (* delete based on indices *)
DeleteCases (* delete based on a pattern match *)
DeleteDuplicates/DeleteDuplicatesBy (* delete duplicates *)
ArrayReshape/Flatten/Partition/Transpose/Reverse (* flipping/flattening/changing dimensions et cetera *)
Replace/ReplaceAll/StringReplace (* replace items based on replacement rules *)
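To make a few of these concrete, here is a minimal sketch that chains some of the functions above into a small cleaning pipeline. The input strings, the ";" separator and the "NA" placeholder are hypothetical and chosen only for illustration.

```wolfram
(* Hypothetical raw input: "name;value" strings with stray whitespace,
   a placeholder for a missing value, and a duplicated record. *)
raw = {"a; 1.5", "b;2", "c;NA", " a; 1.5", "d;4"};

(* StringSplit breaks each record into fields;
   StringTrim, mapped at level 2, strips the padding from every field. *)
fields = Map[StringTrim, StringSplit[raw, ";"], {2}];

(* DeleteCases drops the records whose value field is the placeholder "NA". *)
valid = DeleteCases[fields, {_, "NA"}];

(* DeleteDuplicates removes the repeated record. *)
unique = DeleteDuplicates[valid];

(* ToExpression converts the value field from a string to a number. *)
clean = {#[[1]], ToExpression[#[[2]]]} & /@ unique
(* {{"a", 1.5}, {"b", 2}, {"d", 4}} *)
```

Note that ToExpression evaluates whatever the string contains, so it is best reserved for input you trust.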

I think with those, you can get quite far. Of course then you can 'group' the data using:

Gather/GatherBy/GroupBy
Split/SplitBy

Or count items:

Tally/Count/Counts/CountsBy

Sort items:

Sort/SortBy

Then you can do some statistics on it to reduce the data:

Min
Max
Mean/TrimmedMean/Total
Median
StandardDeviation/Variance/RootMeanSquare
Skewness
Kurtosis
Length
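The grouping, counting, sorting and reduction steps above can be sketched together as follows. The records are hypothetical {category, value} pairs made up for this example.

```wolfram
(* Hypothetical cleaned records of the form {category, value}. *)
records = {{"b", 6}, {"a", 1.5}, {"c", 4}, {"a", 3.5}, {"b", 2}};

(* Group by the first element and keep only the second element of each record. *)
grouped = GroupBy[records, First -> Last]
(* <|"b" -> {6, 2}, "a" -> {1.5, 3.5}, "c" -> {4}|> *)

(* Count how often each category occurs. *)
Counts[records[[All, 1]]]
(* <|"b" -> 2, "a" -> 2, "c" -> 1|> *)

(* Sort the records by their value. *)
SortBy[records, Last]
(* {{"a", 1.5}, {"b", 2}, {"a", 3.5}, {"c", 4}, {"b", 6}} *)

(* Reduce each group to a single statistic; Map acts on the values of the association. *)
Mean /@ grouped
(* <|"b" -> 4, "a" -> 2.5, "c" -> 4|> *)
```

Any of the other reductions listed above (Median, StandardDeviation, Length and so on) can be substituted for Mean in the last step, although some of them need more than one value per group.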

