Data cleaning, wrangling, munging with Mathematica

Posted 10 years ago

I browsed through Mathematica StackExchange and the Data Science group at the Wolfram Community and could not find a comprehensive discussion of this important topic. Clearly all the tools are available in Mathematica, and once the data is clean, exploratory data analysis can be done much better than in environments like R and Python. However, this critical first step does not seem to have been addressed in a comprehensive way, as it has been in books such as:

Data Wrangling with R by Boehmke
Data Wrangling with Python by Kazil & Jarmul

A simple guide to the equivalent Mathematica operations, along the lines of this one:

Data Wrangling with dplyr and tidyr Cheat Sheet by RStudio

would be a good start. I have not seen any news or movement from Wolfram on their Data Science Platform. Is there any information on how it might make learning and using data cleaning procedures less 'exploratory'?

POSTED BY: David Proffer
3 Replies
POSTED BY: Anton Antonov

Without an explicit example, it is hard to give specific advice on data cleaning. However, there are common functions for selecting, converting, and transforming data:

Part (* select parts of something based on indices *)
Select (* select something based on a True/False criterion *)
Cases/FirstCase (* 'select' something based on its structure *)
ToExpression (* convert a string to an expression *)
UnitConvert/Quantity/QuantityMagnitude (* add/remove/convert quantities *)
StringSplit (* split string data into parts *)
StringTake/StringDrop (* take or drop parts of strings *)
Map/Apply (* used with ToExpression to convert a whole list of items to expressions *)
Delete (* delete based on indices *)
DeleteCases (* delete based on a pattern match *)
DeleteDuplicates/DeleteDuplicatesBy (* delete duplicates *)
ArrayReshape/Flatten/Partition/Transpose/Reverse (* flip, flatten, or change dimensions, et cetera *)
Replace/ReplaceAll/StringReplace (* replace items based on replacement rules *)
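As a rough illustration of how a few of these combine, here is a minimal sketch. The `raw` records and their `name;height;unit` layout are made up for the example, as is the use of "NA" as a missing-value marker; real data will need its own rules.

```mathematica
(* Hypothetical raw records: "name;height;unit", with stray whitespace,
   a missing value ("NA"), and a duplicated row *)
raw = {"alice; 1.70 ;m", "bob;180;cm", "carol;NA;cm", "bob;180;cm"};

(* StringSplit threads over the list; StringTrim is mapped at level 2,
   i.e. onto every individual field *)
fields = Map[StringTrim, StringSplit[raw, ";"], {2}];

(* DeleteCases removes rows whose second field is the marker "NA" *)
valid = DeleteCases[fields, {_, "NA", _}];

(* DeleteDuplicates drops the repeated "bob" row *)
unique = DeleteDuplicates[valid];

(* Replace at level 1 rewrites each row; ToExpression turns the
   numeric strings into numbers *)
parsed = Replace[unique, {name_, h_, u_} :> {name, ToExpression[h], u}, {1}]
(* {{"alice", 1.7, "m"}, {"bob", 180, "cm"}} *)
```

Note that ToExpression evaluates whatever string it is given, so it is best kept for input you trust.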

I think you can get quite far with those. You can then group the data using:

Gather/GatherBy/GroupBy
Split/SplitBy
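A small sketch of the difference between these, using made-up `{city, temperature}` records (the data and variable names are only for illustration):

```mathematica
(* Hypothetical records: {city, temperature} *)
data = {{"Oslo", 3}, {"Rome", 15}, {"Oslo", 5}, {"Rome", 17}, {"Oslo", 4}};

(* GroupBy returns an Association keyed by the result of the function *)
byCity = GroupBy[data, First];

(* key -> value form plus a reducer: mean temperature per city *)
meanByCity = GroupBy[data, First -> Last, Mean]
(* <|"Oslo" -> 4, "Rome" -> 16|> *)

(* GatherBy gives the same groups as a plain nested list, without keys *)
gathered = GatherBy[data, First];

(* SplitBy only groups adjacent runs, so the input order matters *)
runs = SplitBy[{1, 1, 2, 2, 2, 1}, Identity]
(* {{1, 1}, {2, 2, 2}, {1}} *)
```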

Or count items:

Tally/Count/Counts/CountsBy
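For instance, on a toy list (the `items` list is just an example), the counting functions differ mainly in the shape of what they return:

```mathematica
items = {"a", "b", "a", "c", "a", "b"};

tally = Tally[items]        (* {{"a", 3}, {"b", 2}, {"c", 1}} *)
counts = Counts[items]      (* <|"a" -> 3, "b" -> 2, "c" -> 1|> *)
nA = Count[items, "a"]      (* 3; the second argument is a pattern *)

(* CountsBy counts after applying a function to each element *)
parity = CountsBy[{1, 2, 3, 4, 5}, EvenQ]
(* <|False -> 3, True -> 2|> *)
```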

Sort items:

Sort/SortBy
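A quick sketch with made-up `{name, height}` records: Sort uses the canonical ordering of the whole record, while SortBy orders by a key you compute.

```mathematica
(* Hypothetical records: {name, height in cm} *)
records = {{"bob", 180}, {"alice", 170}, {"carol", 165}};

(* Sort compares the records element by element, so names decide here *)
byName = Sort[records]
(* {{"alice", 170}, {"bob", 180}, {"carol", 165}} *)

(* SortBy orders by the value of a function, here the height *)
byHeight = SortBy[records, Last]
(* {{"carol", 165}, {"alice", 170}, {"bob", 180}} *)

(* Reverse gives descending order *)
tallestFirst = Reverse[byHeight];
```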

Then you can compute summary statistics to reduce the data:

Min
Max
Mean/TrimmedMean/Total
Median
StandardDeviation/Variance/RootMeanSquare
Skewness
Kurtosis
Length
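As a sketch, several of these can be collected into one summary for a column of numbers. The `values` list and the key names in `summary` are invented for the example.

```mathematica
values = {2, 4, 4, 4, 5, 5, 7, 9};

(* Exact input gives exact results; wrap in N[...] for decimals *)
summary = <|
  "n" -> Length[values],
  "min" -> Min[values],
  "max" -> Max[values],
  "total" -> Total[values],
  "mean" -> Mean[values],
  "median" -> Median[values],
  "variance" -> Variance[values] (* sample variance, n - 1 denominator *)
|>
(* <|"n" -> 8, "min" -> 2, "max" -> 9, "total" -> 40, "mean" -> 5,
     "median" -> 9/2, "variance" -> 32/7|> *)
```

The same functions can be passed as the reducer in GroupBy to get one summary value per group.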
POSTED BY: Sander Huisman
POSTED BY: Anton Antonov