How to understand the three parameters in Dataset

Posted 25 days ago

Hello,

A dataset is

dataset = Dataset[{
   <|"a" -> 1, "b" -> "x", "c" -> {1}|>,
   <|"a" -> 2, "b" -> "y", "c" -> {2, 3}|>,
   <|"a" -> 3, "b" -> "z", "c" -> {3}|>,
   <|"a" -> 4, "b" -> "x", "c" -> {4, 5}|>,
   <|"a" -> 5, "b" -> "y", "c" -> {5, 6, 7}|>,
   <|"a" -> 6, "b" -> "z", "c" -> {}|>}]

Apply a function f to every element in every row:

dataset[All, All, f]

Partition the dataset based on a column, applying further operators to each group:

dataset[GroupBy["b"], Catenate, "c"]

If Dataset can accept three parameters and the first one is the operation on the row, how should we understand the second and third ones?

And why does this not work?

dataset[GroupBy["b"], "c"]
POSTED BY: Zhenyu Zeng
Posted 23 days ago

Let's start from scratch.

data =
 Dataset[{
   <|"A" -> "a-1", "B" -> "b-1", "C" -> "c-1"|>,
   <|"A" -> "a-2", "B" -> "b-2", "C" -> "c-2"|>}]


Now let's do a query with just one operator.

data[f]
(* f[{<|"A" -> "a-1", "B" -> "b-1", "C" -> "c-1"|>, <|"A" -> "a-2", "B" -> "b-2", "C" -> "c-2"|>}] *)

The function f was applied to the entire collection. It was not applied to each row. That first operator can be used to filter or to aggregate.

Integers are interpreted as a Part specification. Anything that looks like a Part specification will be interpreted as a Part specification.

data[1]
(* Gives you the first "row" *)

You can provide an explicit function.

data[Most]
(* Gives you all but the last "row", which in this case will just be a single row dataset--notice that the result is slightly different than the previous. *)

The function can be a selector.

data[Select[#["A"] == "a-2" &]]
(* Selects the second "row" *)

The function can be an aggregator.

data[Length]
(* 2 *)

Now let's add a second operator.

data[f, g]
(* f[{g[<|"A" -> "a-1", "B" -> "b-1", "C" -> "c-1"|>], g[<|"A" -> "a-2", "B" -> "b-2", "C" -> "c-2"|>]}] *)

Notice that g is applied to each record. Just like with the first operator, g can be a Part or a selector or an aggregator or any function you want.

data[f, "C"]
(* f[{"c-1", "c-2"}] *)

In this case f isn't doing anything. A more typical example would be:

data[All, "C"]

You can keep going deeper with operators.

data[f, g, h]

Notice how h gets applied to each value in each record. Again, we can do whatever we want with that operator:

data[All, "A", Capitalize]
data[All, "A", StringDelete["-"]]
data[All, "A", StringLength]

Here's an example where we aggregate at the top level after selecting at the record level and applying a function at the "field" level:

data[Total, "A", StringLength]
(* 6 *)
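
To see where each operator lands without reading it off a rendered Dataset, the same query can be run with Query on the underlying list. This is only a sketch: rows is a name introduced here for the plain-list version of data, and f, g and h are deliberately left undefined so that they stay visible in the result.

```wolfram
(* rows: hypothetical name for the plain list behind data *)
rows = {
   <|"A" -> "a-1", "B" -> "b-1", "C" -> "c-1"|>,
   <|"A" -> "a-2", "B" -> "b-2", "C" -> "c-2"|>};

(* h wraps each field value, g wraps each record, f wraps the whole list *)
Query[f, g, h][rows]
(* f[{g[<|"A" -> h["a-1"], "B" -> h["b-1"], "C" -> h["c-1"]|>],
      g[<|"A" -> h["a-2"], "B" -> h["b-2"], "C" -> h["c-2"]|>]}] *)
```

Reading the result from the inside out gives the order of application: h first, then g, then f.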

That is all very general because your question was very general. Maybe it would be clearer if you told us what kind of query you want to do on your data. Working through a specific example might clarify how this all works.

POSTED BY: Eric Rimbey
Posted 22 days ago

Thanks a lot. In the cases of data[f], what kind of f function can operate on dataset? Could you give me an example? Could you also give an example with four or five parameters?

POSTED BY: Zhenyu Zeng
Posted 21 days ago

In the cases of data[f], what kind of f function can operate on dataset?

It can be any kind. It just depends on what you're trying to analyze. Let's use a very general dataset:

dataset = 
  Dataset[
    <|"b0a1140" -> <|"a" -> 1, "b" -> "x", "c" -> {1}|>, 
      "a250c" -> <|"a" -> 2, "b" -> "y", "c" -> {2, 3}|>, 
      "d74df75" -> <|"a" -> 3, "b" -> "z", "c" -> {3}|>, 
      "f93bdfe2" -> <|"a" -> 4, "b" -> "x", "c" -> {4, 5}|>, 
      "a78710f" -> <|"a" -> 5, "b" -> "y", "c" -> {5, 6, 7}|>, 
      "976c" -> <|"a" -> 6, "b" -> "z", "c" -> {}|>|>]

Maybe you just want to know how many records there are:

dataset[Length]
(* 6 *)

Maybe you're interested in the keys for some reason:

dataset[Keys]


Maybe you want to filter on the keys:

dataset[KeySelect[StringMatchQ["a*"]]]


POSTED BY: Eric Rimbey
Posted 24 days ago

This is too difficult for me to understand. Could you first explain what this means?

dataset[All, "c", 1]
POSTED BY: Zhenyu Zeng
Posted 24 days ago

In this case, the 1 is treated as an index or part so the first element of each c is extracted. Try

dataset[All, "c", 2]

I highly recommend reading Seth Chandler's book Query: Getting Information from Data with the Wolfram Language. A free notebook edition is available for download.

POSTED BY: Rohit Namjoshi
Posted 24 days ago

You should really just play around with it. Try this:

dataset[z, y, x, w]

Notice where each operator is applied. If an operator can be interpreted as a part specification (e.g. All or an integer), then it is applied that way. If an operator can be interpreted as a filter, then it is applied that way (and this is a "descending" operator). If an operator can be interpreted as an aggregator, then it is applied that way after the lower level operators are applied (and this is an "ascending" operator). And there are a few other special forms. But there is no point duplicating the documentation here. You really need to just wrestle with it for a while.
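
As a small illustration of the descending/ascending distinction, here is a sketch that uses Query on a plain list of rows. The name rows and the threshold 3 are chosen only for this example. In both queries the first operator sits at the top level and "a" sits at the row level, but the order in which they run differs.

```wolfram
(* rows: the six records from the original question, as a plain list *)
rows = {
   <|"a" -> 1, "b" -> "x", "c" -> {1}|>,
   <|"a" -> 2, "b" -> "y", "c" -> {2, 3}|>,
   <|"a" -> 3, "b" -> "z", "c" -> {3}|>,
   <|"a" -> 4, "b" -> "x", "c" -> {4, 5}|>,
   <|"a" -> 5, "b" -> "y", "c" -> {5, 6, 7}|>,
   <|"a" -> 6, "b" -> "z", "c" -> {}|>};

(* Select is a filter (descending): it runs first, while the rows are
   still associations, and only then is "a" taken from the survivors *)
Query[Select[#a > 3 &], "a"][rows]
(* {4, 5, 6} *)

(* Total is an aggregator (ascending): "a" is taken from every row first,
   and Total is applied to the resulting list on the way back up *)
Query[Total, "a"][rows]
(* 21 *)
```

If Select ran after "a" had been extracted, #a would have nothing to look up; if Total ran before, it would be handed whole records rather than numbers.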

POSTED BY: Eric Rimbey
Posted 24 days ago

I have tried this

dataset[z, y, x, w]

This is very difficult to understand.

POSTED BY: Zhenyu Zeng
Posted 24 days ago

Don't think of it as "accepting three parameters". You can use as many parameters as you need. Just think of which level each parameter is operating on. The first is on the whole collection (not "the row" as you stated). The next is on each "row", the next is on each "field", etc.
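
One way to check which level a parameter acts on is to put the same function, here Length, in each position. This is a sketch on a plain list (rows is a name introduced for the example) run through Query, so the results print as ordinary expressions.

```wolfram
(* rows: the six records from the original question, as a plain list *)
rows = {
   <|"a" -> 1, "b" -> "x", "c" -> {1}|>,
   <|"a" -> 2, "b" -> "y", "c" -> {2, 3}|>,
   <|"a" -> 3, "b" -> "z", "c" -> {3}|>,
   <|"a" -> 4, "b" -> "x", "c" -> {4, 5}|>,
   <|"a" -> 5, "b" -> "y", "c" -> {5, 6, 7}|>,
   <|"a" -> 6, "b" -> "z", "c" -> {}|>};

(* first position: Length of the whole collection *)
Query[Length][rows]
(* 6 *)

(* second position: Length of each row, i.e. its number of keys *)
Query[All, Length][rows]
(* {3, 3, 3, 3, 3, 3} *)

(* third position: Length of the "c" field inside each row *)
Query[All, "c", Length][rows]
(* {1, 2, 1, 2, 3, 0} *)
```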

So, when you start with

dataset[GroupBy["b"]]

you're basically doing something equivalent to

Dataset[GroupBy[#["b"] &][Normal[dataset]]]

Notice that this immediately changes the structure of the data: the result is an association keyed by the values of "b" ("x", "y", "z"), and each value is the list of rows in that group.

And so now if you try

dataset[GroupBy["b"], "c"]

you can see why it doesn't work. There is no key "c" available to the next level of the query. You need to first extract the values from the lists.

dataset[GroupBy["b"], All, "c"]

As for

dataset[GroupBy["b"], Catenate, "c"]

you need to understand the difference between ascending operators and descending operators. That is discussed in the documentation. And GroupBy is specifically called out as a special descending operator.
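
The grouped structure can also be built by hand with ordinary functions, which may make the two working queries easier to follow. This is a sketch, not how Dataset implements them; rows and grouped are names introduced here for illustration.

```wolfram
(* rows: the six records from the original question, as a plain list *)
rows = {
   <|"a" -> 1, "b" -> "x", "c" -> {1}|>,
   <|"a" -> 2, "b" -> "y", "c" -> {2, 3}|>,
   <|"a" -> 3, "b" -> "z", "c" -> {3}|>,
   <|"a" -> 4, "b" -> "x", "c" -> {4, 5}|>,
   <|"a" -> 5, "b" -> "y", "c" -> {5, 6, 7}|>,
   <|"a" -> 6, "b" -> "z", "c" -> {}|>};

(* grouping adds a level: an association whose values are lists of rows *)
grouped = GroupBy[rows, #["b"] &];
Keys[grouped]
(* {"x", "y", "z"} *)

(* by-hand counterpart of [GroupBy["b"], All, "c"]:
   inside each group, take "c" from every row *)
Map[Lookup[#, "c"] &, grouped]
(* <|"x" -> {{1}, {4, 5}}, "y" -> {{2, 3}, {5, 6, 7}}, "z" -> {{3}, {}}|> *)

(* by-hand counterpart of [GroupBy["b"], Catenate, "c"]:
   take "c" from every row first, then join the lists within each group *)
Map[Catenate[Lookup[#, "c"]] &, grouped]
(* <|"x" -> {1, 4, 5}, "y" -> {2, 3, 5, 6, 7}, "z" -> {3}|> *)
```

Each value of grouped is a list of rows, not a single row, which is why "c" cannot be looked up directly on it.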

POSTED BY: Eric Rimbey
Posted 24 days ago

I tried it and found that Normal can be removed in

Dataset[GroupBy[#["b"] &][Normal[dataset]]]
POSTED BY: Zhenyu Zeng
Posted 24 days ago

I think that in dataset[GroupBy["b"], "c"], GroupBy is working with rows and "c" is working with a column. What is the meaning of

There is no key "c" available to the next level of the query. You need to first extract the values from the lists.

And why can All in dataset[GroupBy["b"], All, "c"] extract the values from the list? What does "the values" mean here?

POSTED BY: Zhenyu Zeng