Visualize Machine Learning Data: From Python to Wolfram Language

I was reading an online article about using Python's pandas package, and I thought it would be fun to see how the Wolfram Language handles the same tasks in a manageable amount of code. Let's begin:

(*code source*)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"

Import Data

Python

import matplotlib.pyplot as plt
import pandas
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)

Wolfram Language

(* $Version 11.1.0 for Mac OS X x86 (64-bit) (March 16, 2017)*)
data = Import[url];
names = {"preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"};
dataset = Dataset[Map[Association @@ Thread[names -> #] &, data]] (* turns each row {6, 148, ...} into <|"preg" -> 6, "plas" -> 148, ...|> *)
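For readers coming from the Python side, the row-wise conversion above corresponds roughly to zipping the column names with each row. A minimal sketch using only the standard library; the helper name to_records and the sample row are made up for illustration and are not taken from the data file:

```python
names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]

def to_records(rows, names):
    """Pair each row's values with the column names, giving one dict per row."""
    return [dict(zip(names, row)) for row in rows]

# Placeholder row in column order (illustrative values only, not real data).
rows = [[1, 2, 3, 4, 5, 6, 7, 8, 9]]
records = to_records(rows, names)
print(records[0]["plas"])
```

Each dict plays the role of one Association in the Dataset, so a column is simply the list of values stored under one key.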

[Image: ds]

Univariate Plots

Histogram

Python

data.hist()
plt.show()

Wolfram Language

width["class"] = {0.1}; (* the bin width can be overridden per column on the fly *)
width[item_] := Automatic;
Histogram[dataset[All, #], width[#], PlotLabel -> #] & /@ names

[Image: conv]

[Image: plot0]

Density Plot

The plots look like an abstracted histogram, with a smooth curve drawn through the top of each bin, much as your eye tries to do with the histograms.

In the Wolfram Language this is done by SmoothHistogram, which uses an automatically chosen smoothing kernel. To be clear, DensityHistogram in Mathematica means something different: it is a 2D density plot with a discrete color scale.
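To make "smoothing kernel" concrete, here is a minimal sketch of a Gaussian kernel density estimate in plain Python. The function name gaussian_kde and the fixed bandwidth argument are assumptions for illustration. SmoothHistogram and pandas pick a bandwidth automatically, and this sketch does not try to reproduce either rule.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function x -> estimated density, using a Gaussian kernel.

    Each sample contributes one bell curve of width `bandwidth`;
    the estimate is the average of those curves.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Illustrative values only, not taken from the diabetes data.
f = gaussian_kde([1.0, 2.0, 4.0], bandwidth=0.5)
print(f(2.0))
```

A larger bandwidth gives a smoother, flatter curve. A smaller one follows the individual samples more closely.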

Python

data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
plt.show()

Wolfram Language

SmoothHistogram[dataset[All, #], PlotRange -> Full, PlotLabel -> #] & /@ names

[Image: plt2]

Box Whisker Plot

Python

data.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
plt.show()

Wolfram Language

BoxWhiskerChart[dataset[All, #], PlotLabel -> #] & /@ names

[Image: plt3]

Multivariate Plot

Correlation Matrix Plot

Python

import numpy  # needed for numpy.arange below
correlations = data.corr()
# plot correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()

Wolfram Language

Here you can manipulate the data with a little code to get more control:

n = Length[names]
corr[dataset_, tuple_] := Correlation[
  N@Normal@dataset[All, tuple[[1]]],
  N@Normal@dataset[All, tuple[[2]]]
  ]

Generate the grid of all name pairs:

grid = Partition[Tuples[names, 2], n]
(*{{{preg,preg},{preg,plas},{preg,pres},{preg,skin},{preg,test} ... }}}*)
res = Map[corr[dataset, #] &, grid, {2}]
(* create correlation matrix : {{1., 0.129459, 0.141282, -0.0816718,...},...} *)
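For comparison, the same "one correlation per name pair" idea can be sketched in plain Python without pandas. The helper names pearson and correlation_grid and the toy columns are hypothetical and are not part of the original article:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_grid(columns):
    """Build the square matrix of pairwise correlations, row by row."""
    names = list(columns)
    return [[pearson(columns[a], columns[b]) for b in names] for a in names]

# Toy columns for illustration only (not the diabetes data).
columns = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]}
matrix = correlation_grid(columns)
```

The nested list has the same shape as res above: entry [i][j] is the correlation between column i and column j, so the diagonal is 1 and the matrix is symmetric.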

Set the tick labels yourself:

xLabel = yLabel = Transpose[{Range[n], names}]
(* {{1,preg},{2,plas},{3,pres},{4,skin},{5,test},{6,mass},{7,pedi},{8,age},{9,class}} *)

Use MatrixPlot to create a color grid:

MatrixPlot[res,
 FrameTicks -> {{yLabel, None}, {None, xLabel}},
 FrameStyle -> Directive[14, Italic]
 ]

[Image: plt4]

Scatterplot Matrix

Python

from pandas.plotting import scatter_matrix
scatter_matrix(data)  # default settings
plt.show()

Wolfram Language

Following the Python article I was reading, the plot breaks down into a combination of ListPlot and Histogram. For each pair of factors, I use the following function to build the data points, which are then plotted together:

pair[dataset_, tuple_] := Transpose[{
   N@Normal@dataset[All, tuple[[1]]],
   N@Normal@dataset[All, tuple[[2]]]
   }]

ListPlot[pair[dataset, {"age", "skin"}], PlotStyle -> PointSize[0.02], AspectRatio -> 1] (*customized settings*)

[Image: scatter1]

So I create a plotting function that wraps pair and has a signature similar to pair and corr:

plotfun[dataset_, tuple_] := If[tuple[[1]] === tuple[[2]],
  Histogram[dataset[All, tuple[[1]]], Ticks -> None],(*histogram on diagonal*)
  ListPlot[pair[dataset, tuple], PlotStyle -> PointSize[0.03], AspectRatio -> 1, Ticks -> None, Axes -> None, ImageSize -> {80, 80}] 
] 

The ListPlot call is somewhat longer than the Python plot function because I wanted to show a few styling options that give the plot a more professional look in Mathematica. Python users are welcome to point out in the comments how to do the same with matplotlib. The code that lays the plots out on a grid is a one-liner:

Grid[Map[plotfun[dataset, #] &, grid, {2}]] // Rasterize

[Image: scattergrid]

The working notebook is attached; please download it and play around. In case the URL above is dead, the test data is also provided in testdata.m as a Dataset[...] expression. Just import the file with dataset = Import["<path>/<to>/testdata.m"] and the dataset is ready to use in Mathematica.

POSTED BY: Shenghui Yang
2 months ago

I think your solutions are nice, but definitely not the easiest. For example the histogram example can be succinctly written as:

Transpose[dataset][All, Histogram]

POSTED BY: Sander Huisman
2 months ago

This is great! I always learn something new from the community. For the histogram of the "class" field, it seems Mathematica makes the bins so wide that the result is misleading. I am not sure whether options can be passed directly in your style.

POSTED BY: Shenghui Yang
2 months ago

I'm not sure if I entirely see what you are saying, but if you want to pass arguments to Histogram like a bin specification you can do something like this:

Transpose[dataset][All, Histogram[#, {0.1}]&]

where {0.1} is just a random bin specification.

POSTED BY: Christopher Wolfram
2 months ago

If you do

Transpose[dataset][All, f]

you have the following after you click row "class":

[Image: plt1]

Here f does not receive the key (for example Key["class"]) as an argument, so the output cannot be customized per key. In particular, I need a bin width of 0.1 only for the "class" row and an automatic width for the rest.
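The requirement can be stated in Python terms: the per-column function needs to see the column name as well as the values. Below is a minimal sketch under stated assumptions. The names BIN_WIDTH, default_width and bin_counts are hypothetical, the fallback of ten equal bins is an arbitrary stand-in for "automatic", and the sample values are made up.

```python
import math
from collections import Counter

# Per-column override; any column not listed falls back to default_width.
BIN_WIDTH = {"class": 0.1}

def default_width(values, bins=10):
    """Stand-in for an automatic rule: split the data range into `bins` parts."""
    span = max(values) - min(values)
    return span / bins if span else 1.0

def bin_counts(name, values):
    """Count values per bin index, using a width chosen from the column name."""
    width = BIN_WIDTH.get(name) or default_width(values)
    return Counter(math.floor(v / width) for v in values)

# Made-up sample columns, keyed by name so the key is available to the function.
columns = {"class": [0, 1, 1, 0, 1], "age": [20, 30, 40]}
result = {name: bin_counts(name, values) for name, values in columns.items()}
```

The point is only that the mapping passes both the key and the values, which is what the width[#] lookup in the original Histogram code does.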

[Image: hist2]

POSTED BY: Shenghui Yang
2 months ago

Congratulations! This post is now a Staff Pick! Thank you for your wonderful contributions. Please, keep them coming!

POSTED BY: Moderation Team
2 months ago
