
GPT-2 NetModel encoder issue?

Posted 3 years ago

It seems that the GPT-2 net model does not encode input words correctly. Trying the official Wolfram examples for text generation in the help section mostly gives me random words.

I spent some time on this, and I now believe the issue is that the encoder for this model does not encode words correctly. For example, I looked at the encoder vocabulary and made sure that the word "Hitman" is in there. I then gave the encoder the word "Hitman". Interestingly, token indices are generated for "Hit" and "man" separately.

lm = NetModel[{"GPT2 Transformer Trained on WebText Data", "Task" -> "LanguageModeling", "Size" -> "774M"}]

NetExtract[lm, "Input"]["Hitman"]

Output: {17634, 550}

According to the decoder, these two indices correspond to the tokens "Hit" and "man":

NetExtract[lm, {"Output", "Labels"}][[{17634, 550}]]

Output: {"Hit", "man"}

Try other words and you will see the same type of behavior; for example, "Gorgeous" splits into {"G", "orge", "ous"}.
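The same two-step lookup (encode, then index into the label list) shows that split directly:

NetExtract[lm, {"Output", "Labels"}][[NetExtract[lm, "Input"]["Gorgeous"]]]

Output: {"G", "orge", "ous"}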

I am relatively new to Mathematica... Am I doing something wrong, or is there really something wrong with the encoder?

-Ethan

POSTED BY: Ethan H.
2 Replies

You have to pay attention to spaces:

Select[NetExtract[lm, {"Output", "Labels"}], StringContainsQ["hitman", IgnoreCase -> True]]

(* {" Whitman", " Hitman"} *)

The vocabulary token is " Hitman", with a capital H and a space at the beginning. In GPT tokenization the space matters because it distinguishes a token at the beginning of a word from a token in the middle or at the end. You can see that this string corresponds to a single token:

NetExtract[lm, "Input"][" Hitman"]

(* {49990} *)
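As a quick check, only the space-prefixed form is a vocabulary entry (consistent with the Select output above), and mapping a whole string back to its labels shows the effect in context (the sentence here is just an example I made up):

labels = NetExtract[lm, {"Output", "Labels"}];

MemberQ[labels, " Hitman"]
(* True *)

MemberQ[labels, "Hitman"]
(* False, hence the {"Hit", "man"} split *)

labels[[NetExtract[lm, "Input"]["The Hitman arrived"]]]
(* " Hitman" should come back as a single token here, since it is preceded by a space *)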
Posted 2 years ago

It is actually not an issue. GPT-2 uses Byte Pair Encoding (BPE). Quoting the paper: "Byte Pair Encoding (BPE) (Sennrich et al., 2015) is a practical middle ground between character and word level language modeling which effectively interpolates between word level inputs for frequent symbol sequences and character level inputs for infrequent symbol sequences." That is why you see this seemingly strange behaviour.

You can read more here: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
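A rough way to see that interpolation with the same model (the words below are just examples I picked, and the exact splits depend on the learned merges):

enc = NetExtract[lm, "Input"];
labels = NetExtract[lm, {"Output", "Labels"}];

labels[[enc[" the"]]]
(* a very frequent word maps to a single token *)

labels[[enc[" Gorgeousness"]]]
(* a rarer word falls back to several subword pieces *)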

POSTED BY: Test Account
