Visualize Machine Learning Data: From Python to Wolfram Language

Posted 1 year ago

I was reading an online article about using the Pandas package for Python. I thought it might be fun to see how Wolfram Language handles the same tasks in a manageable amount of code. Let's begin:

(*code source*)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"

Import Data

Python

import matplotlib.pyplot as plt
import pandas
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)

Wolfram Language

(* $Version 11.1.0 for Mac OS X x86 (64-bit) (March 16, 2017)*)
data = Import[url, "CSV"]; (* the ".data" extension is not a registered format, so name the format explicitly *)
names = {"preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"};
dataset = Dataset[AssociationThread[names, #] & /@ data] (* turns {6, 148, ...} into <|"preg" -> 6, "plas" -> 148, ...|>, row by row *)
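For readers coming from Python, the row-labelling step above can be sketched with the standard library alone. This is only an illustration: `label_rows` is a hypothetical helper name, and the two sample rows are illustrative values rather than the full file.

```python
# Column names as used in the post.
names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]

# Illustrative sample rows (assumption: the real file has many more rows).
rows = [
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
]


def label_rows(names, rows):
    """Pair each row's values with the column names, one dict per row."""
    return [dict(zip(names, row)) for row in rows]


records = label_rows(names, rows)
print(records[0]["preg"], records[0]["plas"])
```

Each dict plays the role of one association in the Dataset, so a column can be pulled out with a comprehension such as `[r["age"] for r in records]`.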


Univariate Plots

Histogram

Python

data.hist()
plt.show()

Wolfram Language

width["class"] = {0.1}; (* Very flexible to adjust the width of the bins on the fly *)
width[item_] := Automatic;
Histogram[dataset[All, #], width[#], PlotLabel -> #] & /@ names
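The override-with-fallback pattern used for `width` can be sketched in plain Python as well. The names `BIN_WIDTHS`, `auto_width` and `histogram` are hypothetical, and Sturges' rule is an assumed stand-in for the automatic choice; it is not claimed to be what Mathematica or pandas uses.

```python
import math
from collections import Counter

# Per-column overrides; any column not listed falls back to an automatic rule.
BIN_WIDTHS = {"class": 0.1}


def auto_width(values):
    """Assumed default: Sturges' rule, ceil(log2(n)) + 1 equal-width bins."""
    bins = math.ceil(math.log2(len(values))) + 1
    span = max(values) - min(values)
    return span / bins if span else 1.0


def histogram(column, values):
    """Return {bin_left_edge: count} using the column's override or the default."""
    width = BIN_WIDTHS.get(column) or auto_width(values)
    counts = Counter(math.floor(v / width) for v in values)
    return {round(i * width, 10): c for i, c in sorted(counts.items())}


print(histogram("class", [0, 1, 1, 0, 0]))
print(histogram("age", [21, 25, 33, 50, 62, 41, 29, 38]))
```

With the narrow override, the two class values land in two clearly separated bins instead of being merged into one wide bar.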


Density Plot

A density plot looks like an abstracted histogram, with a smooth curve drawn through the top of each bin, much as your eye tries to do when reading a histogram.

In Wolfram Language, SmoothHistogram does this with an automatically chosen smoothing kernel. To be clear, DensityHistogram in Mathematica is something different: a 2D density plot with a discrete color scale.

Python

data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
plt.show()

Wolfram Language

SmoothHistogram[dataset[All, #], PlotRange -> Full, PlotLabel -> #] & /@ names
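To make the idea of a smoothing kernel concrete, here is a minimal Gaussian kernel density sketch in plain Python. It illustrates the general technique only. The bandwidth is picked by hand here, whereas SmoothHistogram and pandas choose one automatically, and `gaussian_kde` is a hypothetical helper name (not the SciPy function).

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density as an average of Gaussian bumps."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Illustrative ages and a hand-picked bandwidth (both are assumptions).
density = gaussian_kde([21, 25, 33, 50], bandwidth=5.0)

# A crude Riemann sum over a wide range; a density should integrate to about 1.
step = 0.1
total = sum(density(-50 + i * step) * step for i in range(2000))
print(round(total, 3))
```

A smaller bandwidth follows the individual samples more closely, while a larger one gives a flatter, smoother curve.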


Box Whisker Plot

Python

data.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False)
plt.show()

Wolfram Language

BoxWhiskerChart[dataset[All, #], PlotLabel -> #] & /@ names


Multivariate Plot

Correlation Matrix Plot

Python

import numpy
correlations = data.corr()
# plot correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = numpy.arange(0,9,1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()

Wolfram Language

A little extra code gives you more control over the data here:

n = Length[names]
corr[dataset_, tuple_] := Correlation[
  N@Normal@dataset[All, tuple[[1]]],
  N@Normal@dataset[All, tuple[[2]]]
  ]

Generate all pairs of column names, arranged as an n-by-n grid:

grid = Partition[Tuples[names, 2], n]
(* {{{preg, preg}, {preg, plas}, {preg, pres}, {preg, skin}, {preg, test}, ...}, ...} *)
res = Map[corr[dataset, #] &, grid, {2}]
(* create correlation matrix : {{1., 0.129459, 0.141282, -0.0816718,...},...} *)
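The same pair-by-pair construction can be written out in plain Python, which shows what each entry of the matrix is. This is a sketch with hypothetical helper names (`pearson`, `correlation_matrix`) and toy columns, not the data from the post.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Apply pearson to every ordered pair of columns, row by row."""
    keys = list(columns)
    return [[pearson(columns[a], columns[b]) for b in keys] for a in keys]


# Toy columns chosen so the expected correlations are obvious.
toy = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]}
matrix = correlation_matrix(toy)
print([[round(v, 3) for v in row] for row in matrix])
```

Here `y` is an exact multiple of `x` and `z` is `x` reversed, so the off-diagonal entries come out at the extremes of the scale.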

Define the tick labels yourself:

xLabel = yLabel = Transpose[{Range[n], names}]
(* {{1,preg},{2,plas},{3,pres},{4,skin},{5,test},{6,mass},{7,pedi},{8,age},{9,class}} *)

Use MatrixPlot to create a color grid:

MatrixPlot[res,
 FrameTicks -> {{yLabel, None}, {None, xLabel}},
 FrameStyle -> Directive[14, Italic]
 ]


Scatterplot Matrix

Python

from pandas.plotting import scatter_matrix
scatter_matrix(data)  # default settings
plt.show()

Wolfram Language

Following the Python article I was reading, the plot breaks down into a combination of ListPlot and Histogram. For each pair of factors, I use the following function to build the data points, which can then be plotted together:

pair[dataset_, tuple_] := Transpose[{
   N@Normal@dataset[All, tuple[[1]]],
   N@Normal@dataset[All, tuple[[2]]]
   }]

ListPlot[pair[dataset, {"age", "skin"}], PlotStyle -> PointSize[0.02], AspectRatio -> 1] (*customized settings*)


So I define a plotting function that wraps pair and has the same signature as pair and corr:

plotfun[dataset_, tuple_] := If[tuple[[1]] === tuple[[2]],
  Histogram[dataset[All, tuple[[1]]], Ticks -> None],(*histogram on diagonal*)
  ListPlot[pair[dataset, tuple], PlotStyle -> PointSize[0.03], AspectRatio -> 1, Ticks -> None, Axes -> None, ImageSize -> {80, 80}] 
] 

The ListPlot call is somewhat longer than the Python plotting call because I wanted to show some of the styling options that give a Mathematica plot a professional look. Python users are welcome to point out in the comments how to do this with matplotlib. The code that lays the plots out on a grid is a one-liner:

Grid[Map[plotfun[dataset, #] &, grid, {2}]] // Rasterize
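The layout logic (a histogram on the diagonal, a scatter plot everywhere else) can be sketched in plain Python without drawing anything. `panel_plan` is a hypothetical name, and the tuples it returns are placeholders for whatever plotting calls you would actually make.

```python
from itertools import product


def panel_plan(names):
    """Decide, for every cell of the n-by-n grid, which kind of panel to draw."""
    n = len(names)
    pairs = list(product(names, repeat=2))               # all ordered name pairs
    grid = [pairs[i:i + n] for i in range(0, len(pairs), n)]  # n pairs per row
    return [
        [("histogram", a) if a == b else ("scatter", a, b) for a, b in row]
        for row in grid
    ]


plan = panel_plan(["age", "skin"])
for row in plan:
    print(row)
```

Keeping the decision separate from the drawing means the same plan could drive matplotlib subplots or any other backend.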


The working notebook is attached. Please download it and play around. The test data is also provided in testdata.m as a Dataset[...] expression in case the URL above is dead. You can import this file with data=Import["<path>/<to>/testdata.m"] and the dataset is ready to use in Mathematica.

6 Replies

I think your solutions are nice, but definitely not the easiest. For example the histogram example can be succinctly written as:

Transpose[dataset][All, Histogram]

This is great! I always learn something new from the community. For the histogram of the "class" field, though, Mathematica seems to make the bins so wide that the result is misleading. I am not sure whether options can be passed directly in your style.

I'm not sure if I entirely see what you are saying, but if you want to pass arguments to Histogram like a bin specification you can do something like this:

Transpose[dataset][All, Histogram[#, {0.1}]&]

where {0.1} is just an arbitrary bin specification.

If you do

Transpose[dataset][All, f]

you get the following after clicking the "class" row:

[Image: plt1]

Here f does not receive the key "class" as an argument, so you cannot customize the output based on the key. In particular, I need width = 0.1 only for the class row and the automatic width for the rest.
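The distinction can be shown with a small Python sketch: a function mapped over values alone versus one that also receives the column name. All names and sample values here are hypothetical and only illustrate the point being made.

```python
WIDTHS = {"class": 0.1}  # hypothetical per-key overrides


def value_only(values):
    # Sees only the column values, so it cannot tell "class" from "age".
    return ("auto", len(values))


def key_aware(key, values):
    # Receives the column name too, so it can look up a per-key setting.
    return (WIDTHS.get(key, "auto"), len(values))


# Illustrative columns (assumed sample values).
columns = {"age": [50, 31, 32], "class": [1, 0, 1]}

by_value = {k: value_only(v) for k, v in columns.items()}
by_key = {k: key_aware(k, v) for k, v in columns.items()}
print(by_value)
print(by_key)
```

Only the key-aware version can give the "class" column its own bin width while leaving the others on the default.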


Congratulations! This post is now a Staff Pick! Thank you for your wonderful contributions. Please, keep them coming!

I appreciated this post for practicing visual representations in Mathematica. I think I have a simpler way to make the MatrixPlot, however:

MatrixPlot[Correlation[Normal@dataset[Values]]]