
Visualize Machine Learning Data: From Python to Wolfram Language


I was reading an online article about using the Pandas package in Python. I thought it would be fun to see how Wolfram Language handles the same tasks in a manageable amount of code. Let's begin:

(*code source*)
url = ""

Import Data


import matplotlib.pyplot as plt
import pandas
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)

Wolfram Language

(* $Version 11.1.0 for Mac OS X x86 (64-bit) (March 16, 2017)*)
data = Import[url];
names = {"preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"};
dataset = Dataset[Map[Association @@ Thread[names -> #] &, data]] (* turns each row, e.g. {6, 148, ...}, into <| preg -> 6, plas -> 148, ... |> *)
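
For Python readers, the Thread/Association step above is roughly what zipping the column names with each row does. This is only a sketch of the idea, not what pandas does internally, and the two rows are illustrative values rather than the imported file:

```python
names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]

# Two illustrative rows standing in for the imported CSV (assumed values).
rows = [
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
]

# Pair each column name with the value in the same position, row by row.
records = [dict(zip(names, row)) for row in rows]

print(records[0]["preg"], records[0]["plas"])
```

Each element of records plays the role of one association in the Dataset.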


Univariate Plots




Wolfram Language

width["class"] = {0.1}; (* the bin width is easy to adjust per column on the fly *)
width[item_] := Automatic;
Histogram[dataset[All, #], width[#], PlotLabel -> #] & /@ names
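
The two width definitions above are a lookup with a fallback: "class" gets its own bin width and every other column gets the default. A rough Python sketch of the same pattern follows. The names BIN_WIDTHS, bin_width and histogram_counts are made up for this illustration, and the fixed default of 1.0 only stands in for Automatic, which in Wolfram Language picks the width from the data:

```python
import math
from collections import Counter

# Hypothetical per-column overrides; any column not listed uses the default.
BIN_WIDTHS = {"class": 0.1}

def bin_width(column, default=1.0):
    """Return the override for `column` if there is one, else `default`."""
    return BIN_WIDTHS.get(column, default)

def histogram_counts(values, width):
    """Count values per bin, where bin k covers [k*width, (k+1)*width)."""
    return Counter(math.floor(v / width) for v in values)

classes = [0, 1, 1, 0, 1]  # illustrative values
counts = histogram_counts(classes, bin_width("class"))
print(counts)
```

With the narrow width, the 0s and 1s land in two well-separated bins instead of one wide bar.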



Density Plot

The plots look like an abstracted histogram with a smooth curve drawn through the top of each bin, much like your eye tried to do with the histograms.

In Wolfram Language this is done with SmoothHistogram, which applies an automatically chosen smoothing kernel. To be clear, DensityHistogram in Mathematica means something different: it is a 2D density plot with a discrete color scale.


data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)

Wolfram Language

SmoothHistogram[dataset[All, #], PlotRange -> Full, PlotLabel -> #] & /@ names
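
To make the "smooth curve" idea concrete, here is a minimal Gaussian kernel density estimate in plain Python. It is a sketch of the general technique, not the exact algorithm used by SmoothHistogram or pandas. The bandwidth of 5.0 is hand-picked for the illustration and the ages are made-up values:

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function x -> estimated density at x (Gaussian kernel)."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centered on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

ages = [21, 22, 22, 25, 29, 31, 33, 41, 50, 62]  # made-up values
density = gaussian_kde(ages, bandwidth=5.0)

# Sanity check: a density should integrate to about 1 (Riemann sum on [-50, 150)).
step = 0.5
area = sum(density(x * step) * step for x in range(-100, 300))
print(round(area, 3))
```

A larger bandwidth gives a smoother, flatter curve; a smaller one follows the individual samples more closely.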


Box Whisker Plot


data.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)

Wolfram Language

BoxWhiskerChart[dataset[All, #], PlotLabel -> #] & /@ names


Multivariate Plot

Correlation Matrix Plot


import numpy
correlations = data.corr()
# plot correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
plt.show()

Wolfram Language

Here you can work on the data directly with a little code, which gives you more control:

n = Length[names]
corr[dataset_, tuple_] := Correlation[
  N@Normal@dataset[All, tuple[[1]]],
  N@Normal@dataset[All, tuple[[2]]]
]

Generate all ordered pairs of names, partitioned into an n-by-n grid:

grid = Partition[Tuples[names, 2], n]
(*{{{preg,preg},{preg,plas},{preg,pres},{preg,skin},{preg,test} ... }}}*)
res = Map[corr[dataset, #] &, grid, {2}]
(* create correlation matrix : {{1., 0.129459, 0.141282, -0.0816718,...},...} *)
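
For comparison, the same pairwise idea can be spelled out in plain Python: compute the Pearson coefficient for every ordered pair of columns and arrange the results in a square matrix. This is a sketch with three columns of made-up values, not the data from the URL. The helper pearson is written here for illustration; it is the textbook formula and is not taken from pandas:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up columns standing in for the dataset.
columns = {
    "preg": [6, 1, 8, 1, 0],
    "plas": [148, 85, 183, 89, 137],
    "age": [50, 31, 32, 21, 33],
}
names = list(columns)

# One row per first name, one entry per second name.
matrix = [[pearson(columns[a], columns[b]) for b in names] for a in names]

for row in matrix:
    print([round(v, 3) for v in row])
```

The diagonal is 1 (each column against itself) and the matrix is symmetric, which is also what the res matrix above shows.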

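Control the label display yourself, for example by pairing each position with its name:

xLabel = yLabel = Transpose[{Range[n], names}]
 (* {{1,preg},{2,plas},{3,pres},{4,skin},{5,test},{6,mass},{7,pedi},{8,age},{9,class}} *)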

Use MatrixPlot to create a color grid:

MatrixPlot[res,
 FrameTicks -> {{yLabel, None}, {None, xLabel}},
 FrameStyle -> Directive[14, Italic]]


Scatterplot Matrix


scatter_matrix(data) #default settings

Wolfram Language

Following the Python article I was reading, the plot breaks down into a combination of ListPlot and Histogram. For each factor pair, I use the following function to build the data points to plot:

pair[dataset_, tuple_] := Transpose[{
   N@Normal@dataset[All, tuple[[1]]],
   N@Normal@dataset[All, tuple[[2]]]
}]

ListPlot[pair[dataset, {"age", "skin"}], PlotStyle -> PointSize[0.02], AspectRatio -> 1] (*customized settings*)


So I created a plot function that wraps pair and has the same signature as pair and corr:

plotfun[dataset_, tuple_] := If[tuple[[1]] === tuple[[2]],
  Histogram[dataset[All, tuple[[1]]], Ticks -> None],(*histogram on diagonal*)
  ListPlot[pair[dataset, tuple], PlotStyle -> PointSize[0.03], AspectRatio -> 1, Ticks -> None, Axes -> None, ImageSize -> {80, 80}]
]

The ListPlot call is somewhat longer than the Python plotting call because I wanted to show a few styling options that give the plot a professional look in Mathematica. Python users are welcome to point out in the comments how to do this with matplotlib. The code that lays the plots out over a grid is a one-liner:

Grid[Map[plotfun[dataset, #] &, grid, {2}]] // Rasterize
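
The layout logic itself (every ordered pair of names, cut into rows, with a different panel on the diagonal) can be sketched in plain Python. No plotting happens here; panel_kind is a made-up helper that only reports which kind of panel each cell would get, and the name list is shortened to keep the output small:

```python
from itertools import product

names = ["preg", "plas", "pres"]  # shortened for the sketch
n = len(names)

# All ordered pairs, like Tuples[names, 2], then cut into rows of n,
# like Partition[..., n].
pairs = list(product(names, repeat=2))
grid = [pairs[i:i + n] for i in range(0, len(pairs), n)]

def panel_kind(pair):
    """Diagonal cells get a histogram, all other cells a scatter plot."""
    return "histogram" if pair[0] == pair[1] else "scatter"

layout = [[panel_kind(p) for p in row] for row in grid]

for row in layout:
    print(row)
```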


The working notebook is attached; please download it and play around. The test data is also provided in testdata.m in the form of Dataset[...] in case the URL above is dead. You can import this file with data=Import["<path>/<to>/testdata.m"] and the dataset is ready to be used in Mathematica.

POSTED BY: Shenghui Yang
1 month ago

I think your solutions are nice, but definitely not the easiest. For example the histogram example can be succinctly written as:

Transpose[dataset][All, Histogram]
POSTED BY: Sander Huisman
1 month ago

This is great! I always learn something new from the community. For the histogram of the "class" field, though, Mathematica seems to make the bins so wide that the result is misleading. I am not sure whether options can be passed directly in your style.

POSTED BY: Shenghui Yang
1 month ago

I'm not sure if I entirely see what you are saying, but if you want to pass arguments to Histogram like a bin specification you can do something like this:

Transpose[dataset][All, Histogram[#, {0.1}]&]

where {0.1} is just a random bin specification.

POSTED BY: Christopher Wolfram
1 month ago

If you do

Transpose[dataset][All, f]

you have the following after you click row "class":


Here f does not take Key["class"] as a parameter, so the output cannot be customized based on the key. In particular, I need width = 0.1 only for the class row and the automatic width for the rest.
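
The distinction being made here, a function that sees only the values versus one that also sees the key, can be illustrated in Python with a plain dictionary. This is an analogy only, with made-up column values and hypothetical helper names; it does not show how Dataset queries work:

```python
columns = {
    "age": [50, 31, 32, 21, 33],
    "class": [1, 0, 1, 0, 1],
}

def describe(values):
    """Value-only function: it cannot tell which column it was given."""
    return {"bins": "auto", "n": len(values)}

def describe_with_key(key, values):
    """Key-aware function: it can special-case the 'class' column."""
    width = 0.1 if key == "class" else "auto"
    return {"bins": width, "n": len(values)}

by_value = {key: describe(values) for key, values in columns.items()}
by_key = {key: describe_with_key(key, values) for key, values in columns.items()}

print(by_value["class"], by_key["class"])
```

Only the key-aware version can give "class" a bin width of 0.1 while leaving the other columns on the default.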


POSTED BY: Shenghui Yang
1 month ago

Congratulations! This post is now a Staff Pick! Thank you for your wonderful contributions. Please, keep them coming!

POSTED BY: Moderation Team
26 days ago
