WOLFRAM COMMUNITY

7.7K Views

5 Replies

10 Total Likes

Generating Random Datasets

Mike Besso

Posted 6 years ago

POSTED BY: Mike Besso

Mike Besso

Posted 5 years ago

Anton: Thank you for continuing the discussion and providing resource functions for the more general use case. Have a great and safe holiday.

POSTED BY: Mike Besso

Anton Antonov

Anton Antonov, Accendo Data LLC

Posted 5 years ago

Implementations

I find this to be a great discussion topic! Of course, it is best to have a Wolfram Function Repository (WFR) function that generates random datasets. I implemented such a function for WFR; see `RandomTabularDataset`. Another, closely related WFR function is `ExampleDataset`.

Remark: Note that I prefer the name `RandomTabularDataset` to `RandomDataset`. In Mathematica / WL, datasets can be (deeply) hierarchical objects. Tabular datasets are simpler than general WL datasets, but tabular data is very common and is easier to explain and reason about.

Motivations

My motivations are very similar to those of the OP: rapid prototyping (of proofs of concept), thorough testing of algorithms, and making unit tests. More specifically, I want to:

- Be able to quickly produce example datasets for my Data Wrangling classes
- Have a large corpus of datasets to test the Data Transformations Workflows Conversational Agent I develop
- Have a large corpus of datasets to illustrate data quality verification algorithms or frameworks, like this Data Quality Monitoring Module

Demonstration

The resource function `ExampleDataset` makes datasets from `ExampleData`. Here is an example dataset:

    dsAW = ResourceFunction["ExampleDataset"][{"Statistics", "AnimalWeights"}]

Here is a similar random dataset:

    SeedRandom[23];
    dsCW = ResourceFunction["RandomTabularDataset"][
       {60, {"Creature", "BodyWeight", "BrainWeight"}},
       "Generators" -> <|
         1 -> (Table[StringJoin[RandomChoice[CharacterRange["a", "z"], 5]], #] &),
         2 -> FindDistribution[Normal@dsAW[All, "BodyWeight"]],
         3 -> FindDistribution[Normal@dsAW[All, "BrainWeight"]]|>];
    IQB = Interval[Quartiles[N@Normal[dsAW[All, #BrainWeight/#BodyWeight &]]][[{1, 3}]]];
    dsCW[Select[IntervalMemberQ[IQB, #BrainWeight/#BodyWeight] &]]

Remark: Instead of filtering by quartile boundaries, we can filter with `AnomalyDetection[Normal[dsAW[All, #BrainWeight/#BodyWeight &]]]`, but the latter is prone to produce results that are "too far off."

Neat example

A random dataset with values produced by resource functions that generate random objects:

    SeedRandom[3];
    ResourceFunction[
      "https://www.wolframcloud.com/obj/antononcube/DeployedResources/Function/RandomTabularDataset"][
      {5, {"Mondrian", "Mandala", "Haiku", "Scribble", "Maze", "Fortune"}},
      "Generators" -> <|
        1 -> (ResourceFunction["RandomMondrian"][] &),
        2 -> (ResourceFunction["RandomMandala"][] &),
        3 -> (ResourceFunction["RandomEnglishHaiku"][] &),
        4 -> (ResourceFunction["RandomScribble"][] &),
        5 -> (ResourceFunction["RandomMaze"][12] &),
        6 -> (ResourceFunction["RandomFortune"][] &)|>,
      "PointwiseGeneration" -> True]

POSTED BY: Anton Antonov

Anton Antonov

Anton Antonov, Accendo Data LLC

Posted 5 years ago

A couple of updates:

- The WFR function `RandomTabularDataset` was approved yesterday. (I updated the code and links above accordingly.)
- I implemented an R package with very similar functionality: see `RandomDataFrameGenerator`.

POSTED BY: Anton Antonov

Mike Besso

Posted 6 years ago

@Rohit: Thank you for finding my typo and the additional use case suggestions. I have updated the notebook to include the use of specific distributions. Per your feedback, I will add the load and performance testing use cases in the next version. THANKS

POSTED BY: Mike Besso

Rohit Namjoshi

Posted 6 years ago

Hi Mike, Thanks for sharing. Generating test data like this is a great idea, not just for TDD but also for using a small data sample to generate a larger sample. For features in the data that are not correlated, I use `FindDistribution` or `LearnDistribution` to generate a distribution from the data sample and then use `RandomVariate` to generate additional data according to the distribution. I have used this a few times where a client provided a small data sample and I needed a much larger sample to see how the solution would scale (SQL query performance, ML algorithms, ...). BTW, there is a mismatch between the function name `randomDataset` used in the example usage and the function name `dsGenerateRandomDataset`.
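A minimal sketch of the workflow described above, for a single uncorrelated numeric feature. The variable names, the seed, the sample sizes, and the normally distributed stand-in for the "client" sample are all assumptions made for illustration; they do not come from the original notebook.

```wolfram
(* Hypothetical stand-in for a small client-provided sample.
   The normal distribution and its parameters are assumptions for illustration. *)
SeedRandom[42];
smallSample = RandomVariate[NormalDistribution[10, 2], 50];

(* Estimate a distribution from the small sample. *)
dist = FindDistribution[smallSample];

(* Draw a much larger sample from the estimated distribution,
   e.g. for load or performance testing. *)
largeSample = RandomVariate[dist, 10000];

(* Check the size of the generated sample. *)
Length[largeSample]
```

`LearnDistribution[smallSample]` could be used in place of `FindDistribution[smallSample]` here, since `RandomVariate` also accepts the resulting learned distribution. Which of the two fits a given sample better would need to be checked case by case.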

POSTED BY: Rohit Namjoshi
