Thanks a lot for this excellent post, which prompted me to dig deeper into the internals of the network.
In doing so, I came up with two questions that I was not able to figure out by myself:
1) How can I visualize the word embedding space for the Shakespeare vocabulary, in order to test whether words that are "semantically near" also end up nearby in the embedding map, similar to what Stephen Wolfram showed in his blog post "What Is ChatGPT Doing..."? I have the feeling that it should be quite simple, but all my attempts to access the embedding layer of the trained network have failed, probably due to my very limited understanding of the tools of net surgery.
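For what it's worth, here is the rough approach I had in mind, as a minimal sketch: pull out the embedding table as a matrix and project it to 2D with PCA. The layer name `transformer.wte` is an assumption based on nanoGPT-style code (please check your model's actual `state_dict` keys); below I substitute a random matrix so the snippet runs standalone.

```python
import numpy as np

# Hypothetical stand-in for the trained embedding table. With a real
# nanoGPT-style model, I'd expect something like:
#   emb = model.transformer.wte.weight.detach().cpu().numpy()
# (the attribute path is an assumption, not verified against this post's code).
vocab_size, n_embd = 65, 64  # character-level Shakespeare sizes
rng = np.random.default_rng(0)
emb = rng.standard_normal((vocab_size, n_embd))

# PCA via SVD: center the rows, then project onto the top-2 right singular
# vectors to get one (x, y) plot coordinate per vocabulary token.
centered = emb - emb.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T  # shape (vocab_size, 2)

print(coords.shape)
```

Scattering `coords` (e.g. with matplotlib, labeling each point with its token) should then show whether semantically near tokens cluster, which is what I'd like to check.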
2) Is it easily possible to visualize the attention matrix for a particular input sentence?
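To make question 2 concrete, this is the kind of matrix I mean, sketched here as a toy single-head causal attention in plain numpy (dimensions and random Q/K are made up for illustration). In the real network I assume one would need a forward hook, or a small modification of the attention module, to capture this matrix during a forward pass.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy single-head attention over T tokens of an input sentence.
T, head_dim = 6, 8
rng = np.random.default_rng(1)
q = rng.standard_normal((T, head_dim))  # queries
k = rng.standard_normal((T, head_dim))  # keys

scores = q @ k.T / np.sqrt(head_dim)
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[mask] = -np.inf        # causal mask: no attending to future tokens
att = softmax(scores, axis=-1)  # T x T attention matrix; each row sums to 1

print(att.shape)
```

Visualizing `att` as a heatmap (rows = query positions, columns = key positions) is exactly the picture I'm hoping to produce for a real input sentence.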
Any help is appreciated...