Hope this helps
We have one input channel, Nchan = 1.
Step 1: split the labels into classes. For this example Nclass = 4 (background, sphere, rectangle and overlap), so the network looks for 4 output labels.
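Splitting a label image into one binary map per class can be sketched like this (the tiny 3×3 mask and the class numbering are made up for illustration):

```python
import numpy as np

Nclass = 4  # hypothetical classes: 0=background, 1=sphere, 2=rectangle, 3=overlap
labels = np.array([[0, 1, 1],
                   [0, 3, 2],
                   [0, 2, 2]])

# One-hot split: one binary channel per class
one_hot = np.stack([(labels == c).astype(np.float32) for c in range(Nclass)])
print(one_hot.shape)  # (4, 3, 3): one map per class
```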

Step 2: The first encoding layer splits our input channel into 32 channels (1 -> 32).
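A minimal PyTorch sketch of that first 1 -> 32 step (the 64×64 input size and the plain 3×3 convolution are assumptions, not the exact network here):

```python
import torch
import torch.nn as nn

# Hypothetical first encoder convolution: 1 input channel -> 32 feature channels
first_conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1)

x = torch.randn(1, 1, 64, 64)   # one single-channel input image
features = first_conv(x)
print(features.shape)           # torch.Size([1, 32, 64, 64])
```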

Step 3: By the time we arrive at the deepest level, the feature maps are heavily downsampled, but we have many more channels.
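The usual pattern is to halve the spatial size and double the channel count at each level. A sketch with three made-up down-steps:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 32, 64, 64)  # output of the first encoder block
channels = 32
for _ in range(3):  # three hypothetical encoder levels
    x = nn.MaxPool2d(2)(x)                                   # halve spatial size
    x = nn.Conv2d(channels, channels * 2, 3, padding=1)(x)   # double channel count
    channels *= 2

print(x.shape)  # torch.Size([1, 256, 8, 8]): small maps, many channels
```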

Step 4: After decoding we are back to 32 channels. Hopefully by now each channel contains unique information about the image.
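The decoder mirrors the encoder: each up-step doubles the spatial size and halves the channel count. A sketch with transposed convolutions (the layer choice and sizes are assumptions):

```python
import torch
import torch.nn as nn

deep = torch.randn(1, 256, 8, 8)  # hypothetical bottleneck features
x = deep
channels = 256
for _ in range(3):  # three hypothetical decoder levels
    # Upsample 2x and halve the channels
    x = nn.ConvTranspose2d(channels, channels // 2, kernel_size=2, stride=2)(x)
    channels //= 2

print(x.shape)  # torch.Size([1, 32, 64, 64]): back to 32 channels at full resolution
```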

Step 5: We are looking for 4 labels, so the 32 channels are mapped to 4 channels.
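This mapping is typically a 1×1 convolution acting as the output head; a sketch under that assumption:

```python
import torch
import torch.nn as nn

# 1x1 convolution: mix the 32 feature channels down to 4 class channels
head = nn.Conv2d(in_channels=32, out_channels=4, kernel_size=1)

features = torch.randn(1, 32, 64, 64)   # decoder output
logits = head(features)                 # one score map per class
print(logits.shape)                     # torch.Size([1, 4, 64, 64])
```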

Step 6: The 4 channels are converted to probability maps.
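The conversion is usually a softmax over the channel dimension, so that at every pixel the 4 class scores sum to 1 (the random logits here stand in for real network output):

```python
import torch

logits = torch.randn(1, 4, 64, 64)       # stand-in for the 4-channel output
probs = torch.softmax(logits, dim=1)     # softmax across the class channels

# At every pixel the 4 probabilities sum to 1
print(torch.allclose(probs.sum(dim=1), torch.ones(1, 64, 64)))  # True
```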

Step 7: The probability maps become our 4 labels. In this case the network fails to correctly segment our example.
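Going from probability maps to a label image is just an argmax over the class channels, picking the most probable class per pixel:

```python
import torch

probs = torch.softmax(torch.randn(1, 4, 64, 64), dim=1)  # stand-in probability maps
pred = probs.argmax(dim=1)   # per-pixel class index in 0..3
print(pred.shape)            # torch.Size([1, 64, 64])
```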

Our background and overlap are segmented with very high probability because they have a very distinct contrast that is easy to detect. However, our circle and rectangle have the same contrast and have to be labeled based on their shape rather than their contrast. This is much more difficult, and as such mistakes are made.