Generating Random Datasets

Posted 3 years ago

POSTED BY: Mike Besso
5 Replies
Posted 3 years ago

Anton:

Thank you for continuing the discussion and providing resource functions for the more general use case.

Have a great and safe holiday.

POSTED BY: Mike Besso

Implementations

I find this to be a great discussion topic!

Of course, it is best to have a Wolfram Function Repository (WFR) function that generates random datasets. I implemented such a function for WFR; see RandomTabularDataset.

Another, closely related WFR function is ExampleDataset.

Remark: I prefer the name RandomTabularDataset over RandomDataset. In Mathematica / WL, datasets can be (deeply) hierarchical objects. Tabular datasets are simpler than general WL datasets, but tabular data is very common and easier to explain and reason about.
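To make the tabular versus hierarchical distinction concrete, here is a minimal sketch. The variable names and all values are made up for illustration and are not taken from any real dataset.

```wolfram
(* Tabular: a list of associations that all share the same keys.
   The values are illustrative only. *)
dsTabular = Dataset[{
    <|"Creature" -> "cat", "BodyWeight" -> 3.3|>,
    <|"Creature" -> "cow", "BodyWeight" -> 465.|>}];

(* Hierarchical: nested associations whose branches differ in shape.
   The values are illustrative only. *)
dsNested = Dataset[<|
    "mammals" -> <|"cat" -> <|"BodyWeight" -> 3.3|>|>,
    "birds" -> <|"owl" -> <|"Weights" -> {1.2, 1.5}|>|>|>];

(* A tabular dataset supports simple column access. *)
dsTabular[All, "BodyWeight"]
```

The first form is the kind of object RandomTabularDataset is meant to produce; the second shows why a general "random dataset" would be harder to specify.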

Motivations

My motivations are very similar to those of the OP: rapid prototyping (of proofs of concept), thorough testing of algorithms, and writing unit tests.

More specifically I want to:

Demonstration

The resource function ExampleDataset makes datasets from ExampleData. Here is an example dataset:

dsAW = ResourceFunction["ExampleDataset"][{"Statistics", "AnimalWeights"}]


Here is a similar random dataset:

SeedRandom[23];
dsCW = ResourceFunction["RandomTabularDataset"][
   {60, {"Creature", "BodyWeight", "BrainWeight"}},
   "Generators" -> <| 
     1 -> (Table[StringJoin[RandomChoice[CharacterRange["a", "z"], 5]], #] &),
     2 -> FindDistribution[Normal@dsAW[All, "BodyWeight"]],
     3 -> FindDistribution[Normal@dsAW[All, "BrainWeight"]]|>];
IQB = Interval[Quartiles[N@Normal[dsAW[All, #BrainWeight/#BodyWeight &]]][[{1, 3}]]];
dsCW[Select[IntervalMemberQ[IQB, #BrainWeight/ #BodyWeight] &]]


Remark: Instead of filtering by quartile boundaries, we can filter with AnomalyDetection[Normal[dsAW[All, #BrainWeight/#BodyWeight &]]], but the latter is prone to produce results that are "too far off."

Neat example

A random dataset with values produced by resource functions that generate random objects:

SeedRandom[3];
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/RandomTabularDataset"][{5, {"Mondrian", "Mandala", "Haiku", "Scribble", "Maze", "Fortune"}},
 "Generators" ->
  <|
   1 -> (ResourceFunction["RandomMondrian"][] &),
   2 -> (ResourceFunction["RandomMandala"][] &),
   3 -> (ResourceFunction["RandomEnglishHaiku"][] &),
   4 -> (ResourceFunction["RandomScribble"][] &),
   5 -> (ResourceFunction["RandomMaze"][12] &),
   6 -> (ResourceFunction["RandomFortune"][] &)|>,
 "PointwiseGeneration" -> True]


POSTED BY: Anton Antonov

A couple of updates:

POSTED BY: Anton Antonov
Posted 3 years ago

@Rohit:

Thank you for finding my typo and the additional use case suggestions.

I have updated the notebook to include the use of specific distributions.

Per your feedback, I will add the load and performance testing use cases in the next version.

THANKS

POSTED BY: Mike Besso
Posted 3 years ago

Hi Mike,

Thanks for sharing.

Generating test data like this is a great idea, not just for TDD but also for using a small data sample to generate a larger one. For features in the data that are not correlated, I use FindDistribution or LearnDistribution to generate a distribution from the data sample and then use RandomVariate to generate additional data according to that distribution.

Have used this a few times where a client provided a small data sample and I needed a much larger sample to see how the solution would scale (SQL query performance, ML algorithms, ...).
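The FindDistribution-then-RandomVariate approach described above can be sketched roughly as follows. The small sample here is simulated and only stands in for real client data; the names sample, dist, and bigSample are placeholders.

```wolfram
(* Hypothetical small sample standing in for a client-provided column;
   the values are simulated, not real data. *)
SeedRandom[42];
sample = RandomVariate[NormalDistribution[50, 5], 40];

(* Fit a symbolic distribution to the small sample. *)
dist = FindDistribution[sample];

(* Draw a much larger sample from the fitted distribution. *)
bigSample = RandomVariate[dist, 10000];

Length[bigSample]
```

LearnDistribution[sample] could be swapped in for FindDistribution here, since RandomVariate also accepts the learned distribution it returns. This treats each feature independently, so it only suits features that are not correlated.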

BTW. There is a mismatch between the function name randomDataset used in example usage and the function name dsGenerateRandomDataset.

POSTED BY: Rohit Namjoshi