Can NetChain and its associated layers support 3-dimensional video data (an MRI scan) as input, or do they really only support 2D inputs (i.e. images)?
I am trying to apply the new Mathematica 11 deep neural network tools to process cardiac MRI slices from the Second Annual Data Science Bowl.
The data is provided as a collection of "studies" (i.e. a unique patient heart) where each study has multiple scans each ~ 30x256x192 in size. That is, each patient heart is scanned multiple times from different angles/positions using MRI. Therefore each scan is a short "video clip" of a single heartbeat 30 frames long (i.e. about 1 second at 30 fps).
The ultimate goal is to predict the ejection fraction, i.e. (Vmax - Vmin)/Vmax, where Vmin and Vmax are the minimum and maximum blood volumes of the heart over the heartbeat. That is, what percentage of the blood in the heart is pumped out on each beat. An ejection fraction that is too high or too low is indicative of a serious heart problem.
There are about 11000 total MRI scans (each 30x256x192 in size) with associated Vmin, Vmax labels.
I preprocessed all the DICOM files for the scans to normalize and resize the scans into 30x32x32.
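The preprocessing is roughly this (a minimal sketch; `preprocessScan` is a hypothetical helper, `files` is the list of 30 DICOM frame files for one scan, and the normalization is simplified to `ImageAdjust`):

```mathematica
(* hedged sketch: load one scan's 30 DICOM frames, normalize, and resize to 32x32 *)
preprocessScan[files_List] := Module[{frames},
  frames = Import[#, "Image"] & /@ files;  (* 30 frames, each 256x192 *)
  (* normalize intensities and downsample each frame, then take raw pixel data *)
  ImageData[ImageAdjust[ImageResize[#, {32, 32}]]] & /@ frames
]
(* the result has dimensions {30, 32, 32} *)
```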
Just to get something going, I used the LeNet architecture from the Mathematica documentation with the Vmin labels:
net = NetChain[
  {
    ConvolutionLayer[20, {5, 5}],
    ElementwiseLayer[Ramp],
    PoolingLayer[{2, 2}, {2, 2}],
    ConvolutionLayer[50, {5, 5}],
    ElementwiseLayer[Ramp],
    PoolingLayer[{2, 2}, {2, 2}],
    FlattenLayer[],
    DotPlusLayer[500],
    ElementwiseLayer[Ramp],
    DotPlusLayer[1]
  },
  "Input" -> {30, 32, 32}, "Output" -> "Scalar"
]
However, I'm not sure the dimensions on the layers are correct.
For example, when I create just the convolution layer

ConvolutionLayer[20, {5, 5}, "Input" -> {30, 32, 32}]

the result says that the weights are a tensor of dimensions {20, 30, 5, 5}, which seems right, but the biases are only of dimension {20} rather than {30, 20} (?). The ConvolutionLayer output is then of dimension {20, 28, 28}, but shouldn't the output be of dimension {30, 20, 28, 28}?
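These dimensions can be checked directly; a minimal sketch of the checks (the layer must be initialized before its arrays can be extracted, and the reported dimensions below are the ones I observed):

```mathematica
(* inspect the layer's arrays and its output shape *)
conv = NetInitialize@ConvolutionLayer[20, {5, 5}, "Input" -> {30, 32, 32}];
Dimensions@NetExtract[conv, "Weights"]        (* {20, 30, 5, 5} *)
Dimensions@NetExtract[conv, "Biases"]         (* {20} *)
Dimensions@conv[RandomReal[1, {30, 32, 32}]]  (* {20, 28, 28} *)
```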
At any rate, the network trains fine (and fast with the GPU as target device) but does not generalize at all, i.e. it performs very poorly on held-out test scans. I'm wondering whether the network is actually configured to process the full 30x32x32 input, or whether it is confused and only processing the first of the 30 frames or something like that. Or maybe everything is fine and this is just how the dimensions are reported. I'm just guessing.
I should probably be using a SoftmaxLayer to predict the "class" of the volume instead. That is, just bin the volume and predict the bin rather than have the network reproduce the numerical value of the volume.
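A minimal sketch of that classification variant, assuming the volumes have been discretized into nBins classes (nBins and the binning scheme are placeholders, not something from the competition data):

```mathematica
nBins = 20;  (* hypothetical number of volume bins *)
classNet = NetChain[
  {
    ConvolutionLayer[20, {5, 5}], ElementwiseLayer[Ramp],
    PoolingLayer[{2, 2}, {2, 2}],
    ConvolutionLayer[50, {5, 5}], ElementwiseLayer[Ramp],
    PoolingLayer[{2, 2}, {2, 2}],
    FlattenLayer[],
    DotPlusLayer[500], ElementwiseLayer[Ramp],
    DotPlusLayer[nBins],
    SoftmaxLayer[]  (* outputs a probability per volume bin *)
  },
  "Input" -> {30, 32, 32},
  "Output" -> NetDecoder[{"Class", Range[nBins]}]
]
```

The labels would then be the bin index of each Vmin value rather than the raw volume.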
Any help/advice/direction would be very much appreciated. Happy to share notebook and ResourceObject if anyone is interested.
Thanks