[✓] Simple way to add a column to a 2-D dataset?

Is there a simpler way to add a column to a Dataset than what I'm doing? I've defined a function to combine two (or more) separately created Datasets:

joinDataset[x_List] := Transpose[Join @@ Transpose /@ x]

But all I want to do is use data from a couple of columns of the dataset and append the result to the dataset as a new column. The current process seems too laborious:

dataset = Dataset[{
   <|"a" -> 1, "b" -> "x", "c" -> {1}|>,
   <|"a" -> 2, "b" -> "y", "c" -> {2, 3}|>,
   <|"a" -> 3, "b" -> "z", "c" -> {3}|>,
   <|"a" -> 4, "b" -> "x", "c" -> {4, 5}|>}]
dataset = joinDataset[{dataset, dataset[All, <|"d" -> #a + #b|> &]}]

That will add a column "d" that adds "a" and "b", but I suspect there's something built-in that will do this better.

POSTED BY: Eric Smith
3 months ago

Perhaps this?:

dataset = Join[dataset, dataset[All, <|"d" -> #a + #b|> &], 2]
POSTED BY: Michael Rogers
3 months ago

Not quite, but what I'm asking for isn't much different from adding a column to a 2-D array, which can be a bit tedious with

MapThread[Append,{twoDarray,newColumn}]

I can use MapThread if I convert the datasets to lists of associations, do the MapThread, then turn the result back into a Dataset:

Dataset@MapThread[Append, 
  Normal /@ {dataset, dataset[All, <|"d" -> #a + #b|> &]}]

Using Transpose instead

The route that seems a little more intuitive is to use Transpose. So for adding a column to a 2-D array:

Transpose[Append[Transpose[twoDarray],newColumn]]

This is analogous to the approach I'm using with "joinDataset". I think Datasets are a great way to keep track of a lot of data in a human-usable form.
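For comparison, here is a small sketch of the MapThread and Transpose approaches on made-up data (the values of twoDarray and newColumn are purely illustrative). It also shows Join at level 2, which should give the same result once the new column is wrapped into single-element rows:

```wolfram
(* Illustrative data, not from the original post *)
twoDarray = {{1, 2}, {3, 4}, {5, 6}};
newColumn = {10, 20, 30};

(* Append one new element to each row *)
viaMapThread = MapThread[Append, {twoDarray, newColumn}];

(* Treat the new column as an extra row of the transposed array *)
viaTranspose = Transpose[Append[Transpose[twoDarray], newColumn]];

(* Join row by row; List /@ turns {10, 20, 30} into {{10}, {20}, {30}} *)
viaJoin = Join[twoDarray, List /@ newColumn, 2];

viaMapThread === viaTranspose === viaJoin
(* True; each is {{1, 2, 10}, {3, 4, 20}, {5, 6, 30}} *)
```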


Extracting column/row with key

One other thing I wonder about: how do I retain the key when extracting a single column from the Dataset? For instance, say I want to take dataset[All, "b"] and add it to a different Dataset, dataset2. I can't do this:

dataset2 = dataset[All, {"a", "c"}]
joinDataset[{dataset2, dataset[All, "b"]}]

I have to map an association onto each element first:

joinDataset[{dataset2, <|"b" -> #|> & /@ dataset[All, "b"]}]
POSTED BY: Eric Smith
3 months ago

Eric,

joinDataset[{dataset2, dataset[All, {"b"}]}]

will work. You have to use the {} to keep the key around, just as you did with

dataset2 = dataset[All, {"a", "c"}]
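To make the difference concrete, here is a minimal sketch on a two-row Dataset (ds is an illustrative name, not one used above). Normal shows what each form of the column spec returns:

```wolfram
(* Small illustrative Dataset *)
ds = Dataset[{<|"a" -> 1, "b" -> "x"|>, <|"a" -> 2, "b" -> "y"|>}];

(* A bare key extracts the values and drops the key *)
bare = Normal[ds[All, "b"]]
(* {"x", "y"} *)

(* A list of keys keeps each row as an association *)
keyed = Normal[ds[All, {"b"}]]
(* {<|"b" -> "x"|>, <|"b" -> "y"|>} *)
```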

Regards,

Neil

POSTED BY: Neil Singer
3 months ago

This is perfect! Thanks, Neil. I didn't pick this up from the documentation.

Operators using Right Composition

I hope you don't mind me using this thread for another Dataset question. I've struggled with the logic behind

dataset[All, Key["c"] /* <|"ctotal" -> Total, "clength" -> Length|>]

RightComposition is being used, so I should be able to write this in another form. If (f /* g /* h)[x] = h[g[f[x]]], then I should be able to use

dataset[All, 
 Function[x, <|"ctotal" -> Total, "clength" -> Length|>[Key["c"][x]]]]

But this doesn't work.
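One possible explanation, offered as a sketch rather than a definitive account: inside a Dataset query, an association of functions is interpreted as an operator that applies each function to the data, but outside a query, applying an association to an argument is an ordinary key lookup. The names ops, row and ds below are illustrative assumptions:

```wolfram
ops = <|"ctotal" -> Total, "clength" -> Length|>;
row = <|"a" -> 2, "b" -> "y", "c" -> {2, 3}|>;

(* Plain application looks up the key {2, 3}, which is absent *)
lookup = ops[Key["c"][row]]
(* Missing["KeyAbsent", {2, 3}] *)

(* Wrapping the association in Query makes it act as an operator *)
viaQuery = Query[ops][row["c"]]

(* Alternatively, apply the functions explicitly inside Function *)
ds = Dataset[{row}];
explicit = Normal[
  ds[All, Function[x, <|"ctotal" -> Total[x["c"]], "clength" -> Length[x["c"]]|>]]]
(* {<|"ctotal" -> 5, "clength" -> 2|>} *)
```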

Pure functions using "&" instead of Function

Last question (I think). This doesn't work:

dataset[All, {"a" ->( #["a"] + 1) &, "b" -> g, "c" -> h}]

but this does

dataset[All, {"a" -> Function[x, (x + 1)], "b" -> g, "c" -> h}]

I've been using Datasets for a while but I feel like I'm not using them with full understanding of what's going on. Same goes for &, Function, and RightComposition.
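A likely cause of the "&" problem is operator precedence: "&" binds more loosely than "->", so without its own parentheses the pure function swallows the whole rule. The sketch below uses Hold and FullForm to show the parse; ds is an illustrative Dataset. Note also that in the {"a" -> f} form the function appears to receive the value of column "a" rather than the whole row, so # + 1 is used here instead of #["a"] + 1:

```wolfram
ds = Dataset[{<|"a" -> 1, "b" -> "x"|>, <|"a" -> 2, "b" -> "y"|>}];

(* Without extra parentheses, & wraps the entire rule *)
FullForm[Hold["a" -> (# + 1) &]]
(* Hold[Function[Rule["a", Plus[Slot[1], 1]]]] *)

(* With the pure function in its own parentheses, the rule survives *)
FullForm[Hold["a" -> ((# + 1) &)]]
(* Hold[Rule["a", Function[Plus[Slot[1], 1]]]] *)

(* Apply the function to column "a" only; other columns are left alone *)
result = Normal[ds[All, {"a" -> ((# + 1) &)}]]
```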

I appreciate all the help I've gotten on this so far.

POSTED BY: Updating Name
3 months ago

Shortest way of doing it might be:

dataset[All, Append[#, "d" -> #["a"] + #["b"]] &]

but might not be fastest.

Or simpler notation:

dataset[All, <|#, "e" -> #a + #b|> &]
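As a quick check, here is the same idea on a small Dataset with two numeric columns (ds and its values are illustrative), so that the sum in "d" is easy to verify:

```wolfram
ds = Dataset[{<|"a" -> 1, "b" -> 10|>, <|"a" -> 2, "b" -> 20|>}];

(* Append form *)
viaAppend = Normal[ds[All, Append[#, "d" -> #["a"] + #["b"]] &]];

(* Association-splicing form *)
viaSplice = Normal[ds[All, <|#, "d" -> #a + #b|> &]]
(* {<|"a" -> 1, "b" -> 10, "d" -> 11|>, <|"a" -> 2, "b" -> 20, "d" -> 22|>} *)

viaAppend === viaSplice
(* True *)
```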
POSTED BY: Sander Huisman
3 months ago

Good call, Sander! It's obvious now. I suppose if the dataset is very big I'm better off doing the operation first and the join later?

POSTED BY: Eric Smith
3 months ago

If your dataset is very big you probably don't want to use Dataset ;-) It is very handy and flexible, but that comes at the expense of memory usage and speed in some cases…

POSTED BY: Sander Huisman
3 months ago

I see. I was thinking Dataset was similar to a hash table. What structure would you recommend?

POSTED BY: Eric Smith
3 months ago

It is very similar to a hash table indeed. But finding/manipulating data in "named" columns just takes more time than if it can simply be accessed by an index. You can store the data as a 'matrix' (list of lists) and 'remember' the columns yourself, which is generally much faster. Dataset can also handle data as plain matrices rather than associations.
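A minimal sketch of what 'remembering' the columns might look like: the data lives in a plain matrix, and a small association (here called cols, a hypothetical name) maps column names to positions. The data values are illustrative:

```wolfram
(* Column names mapped to matrix positions *)
cols = <|"a" -> 1, "b" -> 2|>;

(* Purely numeric data, so it can be stored as a packed array *)
data = Developer`ToPackedArray[{{1, 10}, {2, 20}, {3, 30}}];

(* Extract column "b" by index *)
colB = data[[All, cols["b"]]]
(* {10, 20, 30} *)

(* Add a computed column "d" = "a" + "b" and record its position *)
data = Join[data, List /@ (data[[All, cols["a"]]] + colB), 2];
cols["d"] = Length[cols] + 1;

data[[All, cols["d"]]]
(* {11, 22, 33} *)
```

The name-to-index lookup only happens once per column access, which keeps the convenience of names while the arithmetic runs on the matrix itself.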

POSTED BY: Sander Huisman
3 months ago

That's what I did for years and it was error-prone. Associations are probably more efficient, right?

POSTED BY: Eric Smith
3 months ago

That's the balance: error-prone/flexible/convenient vs. speed/less memory. Associations are implemented very efficiently, but if you can avoid them and use packed arrays instead, then that is (generally) much faster.

POSTED BY: Sander Huisman
3 months ago
