
GPT-2 NetModel encoder issue?

Posted 3 years ago

It seems that the GPT-2 net model does not encode input words correctly. Trying the official Wolfram examples for text generation in the help section mostly gives me random words.

I spent some time on this, and I now believe the issue is that the encoder for this model does not encode words correctly. For example, I looked at the encoder vocabulary and made sure that the word "Hitman" is in there. I then gave the encoder the word "Hitman". Interestingly, token indices are generated for "Hit" and "man" separately.

lm = NetModel[{"GPT2 Transformer Trained on WebText Data", "Task" -> "LanguageModeling", "Size" -> "774M"}]

NetExtract[lm, "Input"]["Hitman"]

Output: {17634, 550}

According to the decoder, these two indices correspond to the tokens "Hit" and "man":

NetExtract[lm, {"Output", "Labels"}][[{17634, 550}]]

Output: {"Hit", "man"}

Try other words and you will see the same type of behavior; for example, "Gorgeous" splits into {"G", "orge", "ous"}.
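The same two-step lookup (encode, then index into the label list) shows that split directly:

NetExtract[lm, {"Output", "Labels"}][[NetExtract[lm, "Input"]["Gorgeous"]]]

Output: {"G", "orge", "ous"}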

I am relatively new to Mathematica... Am I doing something wrong, or is there really something wrong with the encoder?

-Ethan

POSTED BY: Ethan H.
2 Replies

You have to pay attention to spaces:

Select[NetExtract[lm, {"Output", "Labels"}], StringContainsQ["hitman", IgnoreCase -> True]]

(* {" Whitman", " Hitman"} *)

The vocabulary token is " Hitman", with a capital H and a space at the beginning. In GPT tokenization the space matters because it distinguishes a token at the beginning of a word from a token in the middle or at the end. You can see that this string corresponds to a single token:

NetExtract[lm, "Input"][" Hitman"]

(* {49990} *)
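As a quick check, only the space-prefixed form is a vocabulary entry (consistent with the Select output above), and mapping a whole string back to its labels shows the effect in context (the sentence here is just an example I made up):

labels = NetExtract[lm, {"Output", "Labels"}];

MemberQ[labels, " Hitman"]
(* True *)

MemberQ[labels, "Hitman"]
(* False, hence the {"Hit", "man"} split *)

labels[[NetExtract[lm, "Input"]["The Hitman arrived"]]]
(* " Hitman" should come back as a single token here, since it is preceded by a space *)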
Posted 2 years ago

It is actually not an issue. GPT-2 uses Byte Pair Encoding (BPE). Quoting the paper: "Byte Pair Encoding (BPE) (Sennrich et al., 2015) is a practical middle ground between character and word level language modeling which effectively interpolates between word level inputs for frequent symbol sequences and character level inputs for infrequent symbol sequences." That is why you see this seemingly strange behaviour.

You can read more here: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
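A rough way to see that interpolation with the same model (the words below are just examples I picked, and the exact splits depend on the learned merges):

enc = NetExtract[lm, "Input"];
labels = NetExtract[lm, {"Output", "Labels"}];

labels[[enc[" the"]]]
(* a very frequent word maps to a single token *)

labels[[enc[" Gorgeousness"]]]
(* a rarer word falls back to several subword pieces *)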

POSTED BY: Test Account
