Best way to process data using SemanticImport?

Hi, I am trying to import some large datasets with mixed types using SemanticImport. I'd like to normalise many of the columns that contain numeric data, and I'm wondering what the best way to do this is?

I understand that I can apply a specific function to columns, e.g.:

dataset[All, {"C2" -> f, "C3" -> f, "C4" -> f}]

However, the data has hundreds of columns, and a large range of them need transforming, so manually specifying the transformation function for each column is very tedious. I could generate the transformation list first and then apply that, but I'm wondering if there is an easier way to apply a single function to a range of columns, rather than having to specify each column individually?

I also tried using MapAt. However, while:


applies f to each of the elements in the second column, but


fails. Thanks in advance for any help.


POSTED BY: Jon McCormack
4 Replies
Posted 9 years ago

Hello Jon,

I found one way of doing this (as usual, the Mathematica language is so rich that there may be other solutions). First I set the dimensions of a test data set:

nrows = 3;
ncolumns = 10;

Then the data values are generated:

(data = Table[
    n + 100 (m - 1), {m, 1, nrows}, {n, 1, ncolumns}]) // TableForm

Next the keys are defined:

keys = Array["k" <> ToString[#] &, ncolumns]

Here the dataset is built:

dataset = Dataset[Map[Association[Thread[keys -> #]] &, data]]

Here is how to define the columns that need transformation. You can use Range or list them explicitly:

col2trf = Join[Range[1, 4], {6}, {8, 9}]

The selected columns will be transformed using f1; the remaining columns are passed to f2, which should just leave the values as they are:

f10 = Map[If[MemberQ[col2trf, #], f1, f2] &, Range[ncolumns]] 

f2 can be defined in the following way:

f2[x_] := x

Finally, f10 is applied to the dataset (of course, you first have to define f1 according to your transformation):

dataset[All, Thread[keys -> f10]]
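
A possibly shorter variant, offered only as a sketch: since the question already notes that a list of rules such as {"C2" -> f, ...} can be applied to named columns, it may be enough to build rules for the selected keys only, which would make the identity function f2 unnecessary. The name rules is introduced here for illustration, and f1 is deliberately left undefined so the result shows where it is applied.

```wolfram
(* Rebuild the small test dataset from the steps above *)
nrows = 3;
ncolumns = 10;
data = Table[n + 100 (m - 1), {m, 1, nrows}, {n, 1, ncolumns}];
keys = Array["k" <> ToString[#] &, ncolumns];
dataset = Dataset[Map[Association[Thread[keys -> #]] &, data]];

(* Columns to transform, as before *)
col2trf = Join[Range[1, 4], {6}, {8, 9}];

(* Hypothetical helper name: one rule per selected key, all pointing to f1 *)
rules = Thread[keys[[col2trf]] -> f1];

(* Columns without a rule are assumed to pass through unchanged *)
result = dataset[All, rules]
```

The design choice here is simply to select the keys with Part (keys[[col2trf]]) before threading, so the rule list only grows with the number of columns that actually change.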



POSTED BY: Michael Helmle

Is this what you're aiming for?

dataset[All, MapAt[f, Range[2, 5]]]
POSTED BY: Jesse Friedman

Hi Jesse and Michael, Thanks for your suggestions. I tried Jesse's suggestion, but I get "Failure: Part {All, 2, 3, 4, 5} ... does not exist". The association is much longer than 5 elements. My current solution is to do this:

dataset = SemanticImport["filename"]
dataset[All, Map[Rule[#, f] &, Normal[Keys[dataset[1][[2 ;; 5]]]]]]

While this works, it seems an odd and slow way to apply the same function over a large matrix of elements.

The problem seems to me to be that Part selection behaves slightly differently on lists and on associations. Running MapAt on a list applies the function to the selected elements only, but returns all the elements of the list. Part on the associations in the dataset returns only the selected elements, so the size changes.

Here's what I mean:

MapAt[f, {a, b, c, d, e}, {2 ;; All}]
{a, f[b], f[c], f[d], f[e]}
MapAt[f, Association[{1 -> a, 2 -> b, 3 -> c, 4 -> d, 5->e}], {2 ;; All}]
<|2 -> f[b], 3 -> f[c], 4 -> f[d], 5-> f[e]|>

Notice how the first element is dropped in the second example.


POSTED BY: Jon McCormack
Posted 9 years ago

Hi Jon, I found a way to do the trick with associations:

MapAt[f, Association[{1 -> a, 2 -> b, 3 -> c, 4 -> d}], {{3}, {4}}]

This returns all elements, with the function f applied only at the positions given in the last argument.
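
Carried over to a range of columns, the same idea might look like the sketch below. It assumes that a flat list such as Range[2, 5] is read by MapAt as one nested position {2, 3, 4, 5}, which would match the "Part {All, 2, 3, 4, 5} ... does not exist" failure reported above, whereas a list of one-element lists names several separate positions. The names rows, positions and transformed are made up for this example, and f is left undefined.

```wolfram
(* Two sample rows with string keys, standing in for imported data *)
rows = {
   <|"a" -> 1, "b" -> 2, "c" -> 3, "d" -> 4|>,
   <|"a" -> 5, "b" -> 6, "c" -> 7, "d" -> 8|>
   };

(* Wrap each column index in its own list: {{2}, {3}} *)
positions = List /@ Range[2, 3];

(* Apply f at those positions in every row; other entries are kept *)
transformed = Map[MapAt[f, positions], rows]

(* The same operator should also be usable inside a Dataset query *)
Dataset[rows][All, MapAt[f, positions]]
```

Because the positions are generated from Range, a long run of columns needs no hand-written rule list.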

POSTED BY: Michael Helmle