Today I was trying to understand Transformer architectures, and while watching a video explanation of the seminal paper "Attention Is All You Need", it occurred to me that the hardest part to understand is the attention layer. So I thought, "Wouldn't it be nice if there were an AttentionLayer function available in Mathematica?", and then I checked whether there happens to be one, just in case.
And sure enough, there is one.
I think the way Wolfram Research models machine learning concepts is very cool, and I just wanted to share my enthusiasm. You guys make implementing machine learning clean and elegant. Keep it up!