"BPESubwordTokens" (Net Encoder)
NetEncoder[{"BPESubwordTokens","param"->value,…}]
represents an encoder that segments text into a sequence of integer codes using the Byte Pair Encoding (BPE) algorithm, which iteratively merges the characters of a string into subword tokens taken from a given token list.
Details
- NetEncoder[…][input] applies the encoder to a string to produce an output.
- NetEncoder[…][{input1,input2,…}] applies the encoder to a list of strings to produce a list of outputs.
- The output of the encoder is a sequence of integers between 1 and d, where d is the number of elements in the token list.
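As an illustrative sketch (the specific model name is one example of a net whose input uses a BPE encoder), an encoder can be extracted from a pretrained net and applied to a string:

```wl
(* extract the input encoder of a pretrained net; GPT-2 uses a BPE tokenizer *)
enc = NetExtract[NetModel["GPT2 Transformer Trained on WebText Data"], "Input"];
enc["Hello world!"]  (* a list of integer codes between 1 and the token count *)
```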
- BPE is a tokenization scheme between word level and character level in which tokens are usually either full words or word subparts. BPE tokenizers are created by training on a given text corpus.
- The encoder currently does not support training of BPE models. Pre-trained models are typically obtained via NetModel or by importing a model created by the SentencePiece library.
- A BPE model from the SentencePiece library is imported using NetEncoder[{"BPESubwordTokens",assoc}], where the Association assoc has the following keys:
-
"ModelPath"	path to a SentencePiece .model file
"VocabularyPath"	path to a SentencePiece .vocab file
"VocabularyThreshold"	the threshold for acceptance of vocabulary tokens
- If no "VocabularyPath" key is specified when importing a SentencePiece model, no vocabulary will be used.
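A minimal import sketch, with hypothetical file names:

```wl
enc = NetEncoder[{"BPESubwordTokens",
   <|"ModelPath" -> "m.model", "VocabularyPath" -> "m.vocab"|>}]
```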
- If a vocabulary is specified, the encoder will not produce integer codes of tokens absent from the vocabulary. If the tokenization produces an out-of-vocabulary token, the BPE merge operations that produced that token are reversed until the token is split into either in-vocabulary tokens or single characters.
- SentencePiece BPE vocabulary files associate an integer score with each token. The score reflects the token's frequency in the training data: the most frequent token has a score of zero, and every other token has a negative integer score. Setting the key "VocabularyThreshold" to a number n accepts only tokens with a score of at least n into the vocabulary.
- If no "VocabularyThreshold" key is specified when importing a SentencePiece model, the entire vocabulary will be imported.
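For example, a vocabulary restricted to relatively frequent tokens might be specified as follows (the file names and threshold value are illustrative):

```wl
(* accept only tokens whose SentencePiece score is at least -100 *)
enc = NetEncoder[{"BPESubwordTokens",
   <|"ModelPath" -> "m.model", "VocabularyPath" -> "m.vocab",
     "VocabularyThreshold" -> -100|>}]
```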
- An encoder can be attached to an input port of a net by specifying "port"->NetEncoder[…] when constructing the net.
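For instance, assuming enc is a "BPESubwordTokens" encoder, it can be attached to the "Input" port of a net at construction time:

```wl
net = NetChain[{EmbeddingLayer[64]}, "Input" -> enc]
```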
- NetDecoder[NetEncoder[{"BPESubwordTokens",…}]] produces a NetDecoder[{"BPESubwordTokens",…}] with settings inherited from those of the given encoder.
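Assuming enc is a "BPESubwordTokens" encoder, a round trip through the derived decoder can be sketched as:

```wl
dec = NetDecoder[enc];        (* inherits the token list and settings of enc *)
dec[enc["a sample string"]]   (* decodes the integer codes back into text *)
```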
- The following parameters are supported:
-
"CombineWhitespace"	False	whether to combine multiple adjacent whitespace characters
"IgnoreCase"	False	whether to ignore case when matching tokens from the string
"IncludeTerminalTokens"	False	whether to include the StartOfString and EndOfString tokens in the output
"TargetLength"	All	the length of the final sequence to crop or pad to
"UnicodeNormalization"	None	the Unicode normalization scheme to use
"WhitespacePadding"	None	control the insertion of whitespace into the input string
- When importing a SentencePiece BPE model file, any parameter specification will override settings from the file (if present).
- With the default parameter setting "TargetLength"->All, all tokens found in the input string are encoded.
- With the parameter "TargetLength"->n, the first n tokens found in the input string are encoded, with padding applied if fewer than n tokens are found. If EndOfString is present in the token list, the padding value is the integer code associated with it; otherwise, the code associated with the last token is used.
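A sketch of fixing the output length, assuming parameter rules can be appended after the import association as in the general form above (the file name is illustrative):

```wl
(* crop or pad the encoded sequence to exactly 16 integer codes *)
enc16 = NetEncoder[{"BPESubwordTokens", <|"ModelPath" -> "m.model"|>,
   "TargetLength" -> 16}]
```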
- The following settings for the parameter "UnicodeNormalization" are available:
-
"NFKC"	the NFKC Unicode normalization scheme
"ModifiedNFKC"	the NFKC scheme with additional normalization around whitespace characters
None	no normalization is performed
- Unicode normalization is the process of resolving ambiguities in the representation of equivalent characters. For example, the character "Å" can be encoded either as the single decimal code 197 or as the combination of the character "A" and the combining ring character " ̊", with decimal codes 65 and 778.
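The two representations of "Å" can be inspected directly; CharacterNormalize (available in recent Wolfram Language versions) applies a given normalization form:

```wl
ToCharacterCode["Å"]                           (* {197} *)
ToCharacterCode[FromCharacterCode[{65, 778}]]  (* {65, 778} *)
CharacterNormalize[FromCharacterCode[{65, 778}], "NFC"]  (* composes to "Å" *)
```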
- The parameter "WhitespacePadding" can be set to Left or Right to add a whitespace character to the beginning or the end of the input string, respectively, before encoding. With the default value None, no additional whitespace is inserted.
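For example, a sketch of prepending whitespace before encoding (the file name is illustrative):

```wl
enc = NetEncoder[{"BPESubwordTokens", <|"ModelPath" -> "m.model"|>,
   "WhitespacePadding" -> Left}]
```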