Input and output variable correspondence of FeatureExtraction

This question about FeatureExtraction arose while training a neural network on data that mixes nominal and numeric variables. In a nutshell, I am trying to find a map between the parts of the input before FeatureExtraction and the corresponding parts of the output after FeatureExtraction. Without that map, I don't see how to interpret the gradients of the net's output with respect to the net's inputs, which is what I want to do. The issue will likely arise in a variety of other contexts as well.

Here's an example where it's easy to figure out the correspondence without the use of programming.

fe = FeatureExtraction[{{1.4, "A"}, {1.5, "A"}, {2.3, "B"}, {5.4, "B"}}]

If you then run the following code, it's easy to see that the second column in the input corresponds to the second and third columns in the output.

fe[{{2.4, "A"}, {3.7, "B"}, {2.4, "B"}}]

You could say {1->1,2->{2;;3}} or something like that. But if you have more variables, it isn't obvious to me how to figure out the correspondence. I've been hacking away at Normal representations of the FeatureExtractorFunction and Normal representations of its various parts, but even after going pretty deep, the answer isn't obvious to me.

So, is there a programmatic way of taking a FeatureExtractorFunction and figuring out the correspondence between the input parts and the output parts? And, if so, what is it?

POSTED BY: Seth Chandler
Posted 3 years ago

Crossposted here.

POSTED BY: Rohit Namjoshi

I went spelunking for some useful utility, but I could not immediately find one so I put something together quickly. It's bound not to be perfect.

The main idea is to take the FeatureExtractorFunction processor information and graph the feature evolution. Let's start with your example:

fe = FeatureExtraction[{{1.4, "A"}, {1.5, "A"}, {2.3, "B"}, {5.4, "B"}}]

and extract the name and input/output part of each processor

processor$edges = 
 Rule @@@ Keys@Values@fe[[1, "Processor", 2, "Processors", All, 2, {"Input", "Output"}]]
processor$names = fe[[1, "Processor", 2, "Processors", All, 1]]

(* {{"f1", "f2"} -> {"f1", "f2"}, {"f2"} -> {"f2"}, {"f1", 
   "f2"} -> {"(f1f2)"}, {"(f1f2)"} -> {"(f1f2)"}, {"(f1f2)"} -> {"(f1f2)"}} *)

(* {"Threads", "EmbedNominalVector", "MergeVectors", "DimensionReduceNumericalVector", "Standardize"} *)
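To read the two lists side by side, it can help to put each processor name next to the features it consumes and produces. This is only a small sketch: it uses the literal values printed above rather than the internal structure of fe, and the names edgesExample and namesExample are made up for illustration.

```wolfram
(* values copied from the output shown above *)
edgesExample = {{"f1", "f2"} -> {"f1", "f2"}, {"f2"} -> {"f2"},
   {"f1", "f2"} -> {"(f1f2)"}, {"(f1f2)"} -> {"(f1f2)"},
   {"(f1f2)"} -> {"(f1f2)"}};
namesExample = {"Threads", "EmbedNominalVector", "MergeVectors",
   "DimensionReduceNumericalVector", "Standardize"};

(* one row per processor: name, input features, output features *)
Grid[
 Prepend[
  MapThread[{#1, First[#2], Last[#2]} &, {namesExample, edgesExample}],
  {"Processor", "Input", "Output"}],
 Frame -> All, Alignment -> Left]
```

Read this way, EmbedNominalVector only touches f2 (the nominal column), and MergeVectors is the step where f1 and f2 are joined into the single feature (f1f2).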

Now with some replacement we can show how the transformed features are related

Flatten@Replace[
   Transpose[{processor$edges, processor$names}],
   {a_ -> b_, name_} :> {Thread[a \[DirectedEdge] name], Thread[name \[DirectedEdge] b]},
   {1}]
Graph[%, VertexLabels -> Automatic]


This already tells us something, but the evolution of the pipeline is not easy to follow. To make it easier, we need to process the feature names a bit so that they change whenever a feature is processed (the internal representation only modifies the name when splitting or merging features).

This code adds a subscript to the feature name that is incremented when the same name appears on both sides of a processor

resetIndices := (Clear[index]; index[x_] := 1)
rename[x_List] := rename /@ x
rename[old_List -> x_List] := (rename[#, {}] & /@ old) -> (rename[#, old] & /@ x)
rename[x_, old_List] := Subscript[x, If[MemberQ[old, x], index[x] += 1, index[x]]]
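As a quick check of what the renaming does, here is a sketch that applies it to the edge list printed earlier. It assumes the four definitions above have been evaluated; edgesExample is just an illustrative name for the copied values. Note that resetIndices has to be called first, otherwise index has no definition.

```wolfram
(* edge list copied from the output shown earlier *)
edgesExample = {{"f1", "f2"} -> {"f1", "f2"}, {"f2"} -> {"f2"},
   {"f1", "f2"} -> {"(f1f2)"}, {"(f1f2)"} -> {"(f1f2)"},
   {"(f1f2)"} -> {"(f1f2)"}};

resetIndices;  (* every feature starts at index 1 *)
rename[edgesExample] // Column

(* result obtained by tracing the definitions by hand:
   {Subscript["f1", 1], Subscript["f2", 1]} -> {Subscript["f1", 2], Subscript["f2", 2]}
   {Subscript["f2", 2]} -> {Subscript["f2", 3]}
   {Subscript["f1", 2], Subscript["f2", 3]} -> {Subscript["(f1f2)", 1]}
   {Subscript["(f1f2)", 1]} -> {Subscript["(f1f2)", 2]}
   {Subscript["(f1f2)", 2]} -> {Subscript["(f1f2)", 3]} *)
```

Each processor that keeps a feature name now bumps its subscript, so the output of one step and the input of the next share the same labelled vertex.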

I am also going to throw in a couple of functions to style the graph vertices and keep those long processor names from messing up the graph layout

labelFeature[x_] := Framed[Style[x, Black], Background -> LightBlue]
labelProcessor[x_] := 
 Framed[Pane[Style[StringReplace[a_?LowerCaseQ ~~ b_?UpperCaseQ :> a <> "\[InvisibleSpace]" <> b]@x,
     Black], {{60}, Automatic}], Background -> White, RoundingRadius -> 5]

This is the updated replacement code

resetIndices; (* start every feature index at 1 before renaming *)
edges = Flatten@Replace[
    Transpose[{rename@processor$edges, processor$names}],
    {
     (* same number of inputs and outputs: connect them pairwise *)
     {a_ -> b_, name_} /; Length[a] == Length[b] :> MapThread[
       {labelFeature[#1] \[DirectedEdge] Annotation[labelProcessor@name, "type" -> {##}],
         Annotation[labelProcessor@name, "type" -> {##}] \[DirectedEdge] labelFeature[#2]} &,
       {a, b}],
     (* split or merge: connect every input and every output to the processor *)
     {a_ -> b_, name_} :> {Thread[
        labelFeature /@ a \[DirectedEdge] Annotation[labelProcessor@name, "type" -> {a, b}]],
       Thread[Annotation[labelProcessor@name, "type" -> {a, b}] \[DirectedEdge] labelFeature /@ b]}
     },
    {1}];

And this is the new graph

Graph[edges,
 VertexShapeFunction -> Function[{center, name, size}, Inset[name, center]],
 EdgeShapeFunction -> Function[{pts, name}, {Arrowheads[{{.01, 0.6}}], Arrow[pts]}],
 GraphLayout -> {"LayeredDigraphEmbedding", "Orientation" -> Left}]


In order to track what happens to individual feature components this will have to be expanded a little, but I believe the current version of the code is already useful for getting some insights.
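Until that is done, one empirical way to get at individual components is to perturb one input column at a time and see which output positions move. The sketch below is not part of the graph code above and does not use any internals: changedOutputs is a hypothetical helper, and the base row and replacement values are arbitrary choices. It assumes the extractor returns a flat numeric vector for a single example.

```wolfram
fe = FeatureExtraction[{{1.4, "A"}, {1.5, "A"}, {2.3, "B"}, {5.4, "B"}}];

(* hypothetical helper: output positions that change when input column j
   of the row base is replaced by newValue *)
changedOutputs[extractor_, base_List, j_Integer, newValue_] :=
 Module[{perturbed, delta},
  perturbed = ReplacePart[base, j -> newValue];
  (* Chop turns tiny floating-point differences into exact zeros *)
  delta = Chop[extractor[perturbed] - extractor[base]];
  Flatten@Position[delta, _?(# != 0 &), {1}, Heads -> False]]

base = {2.4, "A"};  (* arbitrary reference row *)
<|1 -> changedOutputs[fe, base, 1, 3.7],
  2 -> changedOutputs[fe, base, 2, "B"]|>
```

The result maps each input column to the output positions it affected for this particular perturbation. It is only a local probe: a value that happens to leave an output unchanged will hide a dependence, and a dimension-reduction step that actually mixes the merged vector will make every input show up in every output.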

