Great! As a developer on the Wolfram ML team, it's always gratifying to see people doing interesting things with what we provide.
There are a couple of comments I'd like to make about this:
First, you evaluate the final performance using pixelwise accuracy, but in semantic segmentation there is a more informative measure, namely the mean intersection over union (IoU). Geometrically, it corresponds to measuring the intersection / union ratio of the "blobs" belonging to a fixed class in the prediction and ground truth masks, and then averaging those ratios over all classes. In formulas, for a given image:
$$\mathrm{IoU}_{c} = \frac{TP_c}{TP_c + FP_c + FN_c}$$
$$\mathrm{IoU} = \mathrm{Mean}_c(\mathrm{IoU}_c)$$
where $c$ is a class and $TP_c$, $FP_c$ and $FN_c$ are, respectively, the numbers of true positive, false positive and false negative predictions for class $c$. The true positives measure the intersection of the blobs, while the sum in the denominator gives you their union. A reasonable (but probably not the best) implementation of IoU might be:
classIOU[pred_, gt_, class_] :=
 Block[{positionP, positionN, tp, fp, fn},
  (* positions where the prediction assigns the given class *)
  positionP = Flatten@Position[pred, class];
  (* all remaining positions *)
  positionN = Complement[Range@Length[pred], positionP];
  tp = Count[gt[[positionP]], class]; (* predicted class, ground truth agrees *)
  fp = Length[positionP] - tp; (* predicted class, ground truth disagrees *)
  fn = Count[gt[[positionN]], class]; (* class pixels missed by the prediction *)
  N[tp/(tp + fp + fn)]
 ]
IOU[pred_, gt_, nClasses_] := Mean@Table[classIOU[pred, gt, c], {c, nClasses}]
This assumes that your data is flattened and your classes are identified with integers starting from 1.
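With real 2-D masks you would simply flatten them before calling it, along these lines (predMask and gtMask are placeholder names for one predicted and one ground truth mask, and two classes are assumed here):

IOU[Flatten[predMask], Flatten[gtMask], 2]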
In general, IoU is preferable to pixelwise accuracy because it compensates for class imbalance in the masks by averaging the class-wise scores. Suppose, in a 1-D example, that "1" is background and "2" is gastruloid, and that your prediction and ground truth masks look like this:
pred = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1}
groundTruth = {1, 1, 1, 1, 2, 1, 1, 1, 1, 1}
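Evaluating the functions above on this toy example gives, as a quick sanity check:

classIOU[pred, groundTruth, 1] (* 0.9 *)
classIOU[pred, groundTruth, 2] (* 0. *)
IOU[pred, groundTruth, 2] (* 0.45 *)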
Pixel-level accuracy would be 90% here, but IoU gives you 45% (90% for the background class, 0% for the gastruloid class), because you completely failed to segment the gastruloid. That said, in your particular case it looks like you have a good balance between background and object pixels in your data, so IoU shouldn't be far from the pixelwise accuracy.
The second comment is a purely technical one: when evaluating the accuracy, you run the trained network inside a Table, i.e. on each input separately. The framework also supports batch evaluation (or listable evaluation, if you want to say it à la WL). In that case, our neural network framework will figure out a suitable parallelization strategy, and the computation will be much faster than a serial one. So you could, more efficiently, pre-compute net[data] outside the Table and then compare it with the ground truths.
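Schematically, using hypothetical names (trainedNet for your trained network, testData and testMasks for the evaluation set), the difference is just:

(* serial: the net is evaluated once per Table iteration *)
predictions = Table[trainedNet[testData[[i]]], {i, Length[testData]}];

(* batched: a single listable call, parallelized internally by the framework *)
predictions = trainedNet[testData];

and you can then compare predictions against testMasks in one go.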
Again, congratulations on your work; the results look very good!