Best way to process data using SemanticImport?

Hi, I am trying to import some large datasets with mixed types using SemanticImport. I'd like to normalise many of the columns that contain numeric data, and I'm wondering what the best way to do this is?

I understand that I can apply a specific function to columns, e.g.:

dataset[All, {"C2" -> f, "C3" -> f, "C4" -> f}]

However, the data has hundreds of columns, a large range of which need transforming, so manually specifying the transformation function for each column is very tedious. I could generate the transformation list first and then apply that, but I'm wondering whether there is an easier way to apply a single function to a range of columns, rather than having to specify each one individually.

I also tried using MapAt. However, while

dataset[All,MapAt[f,{2}]]

applies f to each of the elements in the second column,

dataset[All,MapAt[f,{{2},{3}}]]

fails. Thanks in advance for any help.

Jon

POSTED BY: Jon McCormack
4 Replies
Posted 9 years ago

Hi Jon, I found a way to do the trick with associations:

MapAt[f, Association[{1 -> a, 2 -> b, 3 -> c, 4 -> d}], {{3}, {4}}]

This returns all elements, with the function f applied only at the positions given in the last argument.
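
As a minimal sketch of how this could extend to a longer run of columns, the position list can be generated instead of typed out. The association `row`, the two-row `dataset`, and the undefined symbol `f` are placeholders for illustration, not the original data; the last line goes through `Normal` rather than the `dataset[All, ...]` form from the question.

```wolfram
(* Hypothetical row; f is left undefined so the result shows where it is applied *)
row = <|"C1" -> a, "C2" -> b, "C3" -> c, "C4" -> d|>;

(* MapAt needs one position per column: {{2}, {3}, {4}}, not {2, 3, 4} *)
positions = List /@ Range[2, 4];

MapAt[f, row, positions]
(* <|"C1" -> a, "C2" -> f[b], "C3" -> f[c], "C4" -> f[d]|> *)

(* One way to use this on a Dataset: transform each row of the underlying
   list of associations, then wrap the result in Dataset again *)
dataset = Dataset[{row, row}];
Dataset[Map[MapAt[f, #, positions] &, Normal[dataset]]]
```

Integer positions count the entries of each row in order, so this relies on every row having its columns in the same order.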

POSTED BY: Michael Helmle

Is this what you're aiming for?

dataset[All, MapAt[f, Range[2, 5]]]
POSTED BY: Jesse Friedman
POSTED BY: Jon McCormack
Posted 9 years ago

Hello Jon,

I found one way of doing this (as usual, the Mathematica language is so rich that there may be other solutions). First, I configure my data set:

nrows = 3;
ncolumns = 10;

Then the data values are generated:

(data = Table[
    n + 100 (m - 1), {m, 1, nrows}, {n, 1, ncolumns}]) // TableForm

Next the keys are defined:

keys = Array["k" <> ToString[#] &, ncolumns]

Here the dataset is built:

dataset = Dataset[Map[Association[Thread[keys -> #]] &, data]]

Here is how to define the columns that need transformation. You can use Range or list them explicitly:

col2trf = Join[Range[1, 4], {6}, {8, 9}]

The selected columns will be transformed using f1. The remaining columns are passed to f2, which should simply leave the values as they are:

f10 = Map[If[MemberQ[col2trf, #], f1, f2] &, Range[ncolumns]] 

f2 can be defined in the following way (the built-in Identity does the same thing):

f2[x_] := x

Finally, f10 is applied to the dataset (of course, you first have to define f1 according to your transformation):

dataset[All, Thread[keys -> f10]]
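
A shorter variant may also work, assuming that columns not named in the rule list are left unchanged, as in the `"C2" -> f` form from the original question. In that case the pass-through function is not needed, and rules are built for the selected columns only. The sketch below reuses the names `keys`, `col2trf`, and `f1` from above; the smaller dataset and the definition of `f1` are placeholders.

```wolfram
(* Small hypothetical dataset: 2 rows, 5 columns keyed "k1" ... "k5" *)
keys = Array["k" <> ToString[#] &, 5];
data = Table[n + 100 (m - 1), {m, 1, 2}, {n, 1, 5}];
dataset = Dataset[Map[AssociationThread[keys, #] &, data]];

(* Indices of the columns to transform *)
col2trf = {1, 2, 4};

(* Placeholder transformation; substitute the real normalisation here *)
f1[x_] := x/100.;

(* One rule per selected column only *)
rules = Thread[keys[[col2trf]] -> f1]
(* {"k1" -> f1, "k2" -> f1, "k4" -> f1} *)

(* Assumption: columns without a rule pass through unchanged *)
dataset[All, rules]
```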

Regards,

Michael

POSTED BY: Michael Helmle