"BPESubwordTokens" (Net Encoder)

NetEncoder[{"BPESubwordTokens","param"->value,…}]

represents an encoder that segments text into subword tokens using the Byte Pair Encoding (BPE) algorithm, converting a string to a sequence of integer codes based on a list of tokens.

Details

  • NetEncoder[…][input] applies the encoder to a string to produce an output.
  • NetEncoder[…][{input1,input2,…}] applies the encoder to a list of strings to produce a list of outputs.
  • The output of the encoder is a sequence of integers between 1 and d, where d is the number of elements in the token list.
  • BPE is a tokenization scheme between word level and character level in which tokens are usually either full words or word subparts. BPE tokenizers are created by training on a given text corpus.
  • The encoder currently does not support training of BPE models. Pre-trained models are typically obtained via NetModel or by importing a model created by the SentencePiece library.
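As an illustration of the scheme, here is a minimal Python sketch of BPE encoding (not the Wolfram Language implementation): starting from single characters, adjacent pairs are repeatedly merged according to an ordered list of learned merge rules. The merge list below is a toy example, not one trained on a real corpus.

```python
def bpe_encode(word, merges):
    """Encode one word into subword tokens using ordered merge rules."""
    tokens = list(word)
    while True:
        # Find the highest-priority merge applicable to an adjacent pair.
        candidates = [
            (merges.index(pair), i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in merges
        ]
        if not candidates:
            return tokens
        _, i = min(candidates)
        # Apply the merge: replace the pair with its concatenation.
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]

# Toy merge list standing in for rules learned from a corpus:
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_encode("lower", merges))   # ['low', 'er']
```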
  • A BPE model from the SentencePiece library is imported using NetEncoder[{"BPESubwordTokens",assoc}], where the Association assoc has the following keys:
  • "ModelPath"    path to a SentencePiece .model file
    "VocabularyPath"    path to a SentencePiece .vocab file
    "VocabularyThreshold"    the threshold for acceptance of vocabulary tokens
  • If no "VocabularyPath" key is specified when importing a SentencePiece model, no vocabulary will be used.
  • If a vocabulary is specified, integer codes of tokens not present in the vocabulary will not be produced by the encoder. In case the tokenization produces an out-of-vocabulary token, the BPE merge operations that produced that token will be reversed until the token is split into either in-vocabulary tokens or single characters.
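The out-of-vocabulary fallback just described can be sketched as follows (a conceptual Python illustration, with a hypothetical record of which merge produced each token): if a token is not in the vocabulary, the merge that built it is reversed and the two halves are processed recursively, down to single characters if necessary.

```python
def split_oov(token, vocab, merge_of):
    """Return in-vocabulary pieces (or single characters) for `token`."""
    if token in vocab or len(token) == 1:
        return [token]
    left, right = merge_of[token]   # reverse the merge that built the token
    return split_oov(left, vocab, merge_of) + split_oov(right, vocab, merge_of)

# Hypothetical merge record: "low" was built from "lo" + "w", "lo" from "l" + "o".
merge_of = {"low": ("lo", "w"), "lo": ("l", "o")}
vocab = {"lo", "w"}                        # "low" itself is out of vocabulary
print(split_oov("low", vocab, merge_of))   # ['lo', 'w']
```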
  • SentencePiece BPE vocabulary files associate an integer score with each token. The score reflects the token's frequency in the training data: the most frequent token has a score of zero, and every other token has a negative integer score. Setting the key "VocabularyThreshold" to a number n restricts the vocabulary to tokens with a score of at least n.
  • If no "VocabularyThreshold" key is specified when importing a SentencePiece model, the entire vocabulary will be imported.
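The effect of the threshold can be sketched in Python (the token scores below are made up for illustration): only tokens scoring at least the threshold remain in the vocabulary.

```python
def filter_vocabulary(scored_tokens, threshold):
    """Keep only tokens whose score is at least `threshold`."""
    return {tok for tok, score in scored_tokens if score >= threshold}

# Hypothetical .vocab contents: (token, score) pairs, most frequent first.
scored = [("the", 0), ("ing", -5), ("tion", -12), ("zq", -40)]
print(filter_vocabulary(scored, -12))   # keeps 'the', 'ing' and 'tion'
```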
  • An encoder can be attached to an input port of a net by specifying "port"->NetEncoder[…] when constructing the net.
  • NetDecoder[NetEncoder[{"BPESubwordTokens",…}]] produces a NetDecoder[{"BPESubwordTokens",…}] with settings inherited from those of the given encoder.
Parameters
  • The following parameters are supported:
  • "CombineWhitespace"    False    whether to combine multiple adjacent whitespace characters
    "IgnoreCase"    False    whether to ignore case when matching tokens from the string
    "IncludeTerminalTokens"    False    whether to include the StartOfString and EndOfString tokens in the output
    "TargetLength"    All    the length of the final sequence to crop or pad to
    "UnicodeNormalization"    None    the Unicode normalization scheme to use
    "WhitespacePadding"    None    control the insertion of whitespace to the input string
  • When importing a SentencePiece BPE model file, any parameter specification will override settings from the file (if present).
  • With the default parameter setting "TargetLength"->All, all tokens found in the input string are encoded.
  • With the parameter "TargetLength"->n, the first n tokens found in the input string are encoded, with padding applied if fewer than n tokens are found. If EndOfString is present in the token list, the padding value is the integer code associated with it; otherwise, the code associated with the last token is used.
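The crop-or-pad behavior of "TargetLength"->n amounts to the following sketch (illustrative Python, with the padding code passed in explicitly; per the description above, it would be the EndOfString code if that token is present, and otherwise the code of the last token):

```python
def to_target_length(codes, n, pad_code):
    """Crop the code sequence to length n, or pad it with pad_code."""
    return codes[:n] + [pad_code] * (n - len(codes))

codes = [5, 17, 3]
print(to_target_length(codes, 5, pad_code=2))   # [5, 17, 3, 2, 2]
print(to_target_length(codes, 2, pad_code=2))   # [5, 17]
```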
  • The following settings for the parameter "UnicodeNormalization" are available:
  • "NFKC"    NFKC UTF-8 normalization scheme
    "ModifiedNFKC"    NFKC scheme with additional normalization around whitespace characters
    None    no normalization is performed
  • Unicode normalization is the process of resolving ambiguities in the representation of equivalent characters. For example, the character "Å" can be encoded either as the single decimal code 197 or as the character "A" followed by the combining ring character "̊", with decimal codes 65 and 778.
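The "Å" example above can be checked with Python's standard unicodedata module, which exposes the same NFKC normalization (the Wolfram encoder performs the analogous normalization internally):

```python
import unicodedata

composed = "\u00c5"     # 'Å' as a single code point (decimal 197)
decomposed = "A\u030a"  # 'A' (65) followed by the combining ring (778)

print(composed == decomposed)                                 # False
print(unicodedata.normalize("NFKC", decomposed) == composed)  # True
```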
  • The parameter "WhitespacePadding" can be set to Left or Right to add a whitespace to the beginning or to the end of the input string, respectively, before encoding. The default value, None, does not insert additional whitespace.
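The whitespace-related parameters described above can be sketched together in Python (a conceptual illustration of the preprocessing, not the Wolfram implementation): "CombineWhitespace"->True collapses runs of whitespace into a single space, and "WhitespacePadding"->Left or Right adds one space before or after the string prior to tokenization.

```python
import re

def preprocess(s, combine_whitespace=False, whitespace_padding=None):
    """Apply whitespace preprocessing before tokenization."""
    if combine_whitespace:
        s = re.sub(r"\s+", " ", s)     # collapse adjacent whitespace
    if whitespace_padding == "Left":
        s = " " + s
    elif whitespace_padding == "Right":
        s = s + " "
    return s                           # padding None: string unchanged

print(repr(preprocess("a  b\tc", combine_whitespace=True,
                      whitespace_padding="Left")))   # ' a b c'
```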

Examples


Basic Examples  (2)

Create a BPE encoder from a pre-trained SentencePiece model file:


Create a BPE encoder from a pre-trained SentencePiece model file:


Encode a string of characters:


Scope  (2)

Options  (6)

Applications  (1)

Properties & Relations  (1)

Possible Issues  (1)