"BPESubwordTokens" (Net Encoder)

NetEncoder[{"BPESubwordTokens","param"value,}]

represents an encoder that segments text into subword tokens using the Byte Pair Encoding (BPE) algorithm and converts them to a sequence of integer codes, using a list of tokens.

Details

  • NetEncoder[…][input] applies the encoder to a string to produce an output.
  • NetEncoder[…][{input1,input2,…}] applies the encoder to a list of strings to produce a list of outputs.
  • The output of the encoder is a sequence of integers between 1 and d, where d is the number of elements in the token list.
  • BPE is a tokenization scheme between word level and character level in which tokens are usually either full words or word subparts. BPE tokenizers are created by training on a given text corpus.
  • The encoder currently does not support training of BPE models. Pre-trained models are typically obtained via NetModel or by importing a model created by the SentencePiece library.
  • A BPE model from the SentencePiece library is imported using NetEncoder[{"BPESubwordTokens",assoc}], where the Association assoc has the following keys (a combined usage sketch follows this list):
  • "ModelPath"	path to a SentencePiece .model file
    "VocabularyPath"	path to a SentencePiece .vocab file
    "VocabularyThreshold"	the threshold for acceptance of vocabulary tokens
  • If no "Vocabulary" key is specified when importing a SentencePiece model, no vocabulary will be used.
  • If a vocabulary is specified, integer codes of tokens not present in the vocabulary will not be produced by the encoder. If the tokenization produces an out-of-vocabulary token, the BPE merge operations that produced it are reversed until it is split into either in-vocabulary tokens or single characters.
  • SentencePiece BPE vocabulary files associate an integer score with each token. The score reflects the token's frequency in the training data: the most frequent token has a score of zero, and all other tokens have negative integer scores. Setting the key "VocabularyThreshold" to a number n causes only tokens with a score of at least n to be accepted into the vocabulary.
  • If no "VocabularyThreshold" key is specified when importing a SentencePiece model, the entire vocabulary will be imported.
  • An encoder can be attached to an input port of a net by specifying "port"->NetEncoder[…] when constructing the net.
  • NetDecoder[NetEncoder[{"BPESubwordTokens",…}]] produces a NetDecoder[{"BPESubwordTokens",…}] with settings inherited from those of the given encoder.
  • Parameters
  • The following parameters are supported:
  • "IgnoreCase"Falsewhether to ignore case when matching tokens from the string
    "IncludeTerminalTokens"Falsewhether to include the StartOfString and EndOfString tokens in the output
    "TargetLength"Allthe length of the final sequence to crop or pad to
    "UnicodeNormalization"Nonethe Unicode normalization scheme to use
  • When importing a SentencePiece BPE model file, any parameter specification will override settings from the file (if present).
  • With the default parameter setting "TargetLength"->All, all tokens found in the input string are encoded.
  • With the parameter "TargetLength"->n, the first n tokens found in the input string are encoded, with padding applied if fewer than n tokens are found. If EndOfString is present in the token list, the padding value is the integer code associated with it; otherwise, the code associated with the last token is used.
  • The following settings for the parameter "UnicodeNormalization" are available:
  • "NFKC"NFKC UTF-8 normalization scheme
    "ModifiedNFKC"NFKC scheme with additional normalization around whitespace characters
    NoneNo normalization is performed
  • Unicode normalization is the process of resolving ambiguities in the representation of equivalent characters. For example, the character "Å" can be encoded either as the single character with decimal code 197 or as the combination of the character "A" and the combining ring character " ̊", with decimal codes 65 and 778.
  • The parameter "WhitespacePadding" can be set to Left or Right to add a whitespace to the beginning or to the end of the input string, respectively, before encoding. The default value, None, does not insert additional whitespace.

Examples


Basic Examples  (2)

Create a BPE encoder from a pre-trained SentencePiece model file:


Encode a string of characters:
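
A sketch of these two steps, assuming a hypothetical pre-trained SentencePiece model file "sp_bpe.model":

    (* create the encoder from the model file *)
    enc = NetEncoder[{"BPESubwordTokens", <|"ModelPath" -> "sp_bpe.model"|>}]

    (* encode a string; the result is a list of integer codes, one per BPE token *)
    enc["Hello world!"]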

Scope  (2)

Create a BPE encoder from a SentencePiece BPE model and vocabulary files:

Encode a string of characters with the vocabulary constraint. The word "world" is segmented into subwords because it is not in the vocabulary:
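
A sketch of this example, assuming hypothetical files "sp_bpe.model" and "sp_bpe.vocab" and that "world" is absent from the vocabulary:

    enc = NetEncoder[{"BPESubwordTokens",
       <|"ModelPath" -> "sp_bpe.model", "VocabularyPath" -> "sp_bpe.vocab"|>}]

    (* out-of-vocabulary tokens are split back into in-vocabulary pieces or single characters *)
    enc["Hello world"]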

Import a BPE model with a vocabulary and specify a vocabulary threshold:

Encode a string of characters. No compound token will be produced due to the vocabulary restriction:
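
A sketch with an illustrative threshold value; only tokens scoring at least -20 remain in the vocabulary:

    enc = NetEncoder[{"BPESubwordTokens",
       <|"ModelPath" -> "sp_bpe.model",
         "VocabularyPath" -> "sp_bpe.vocab",
         "VocabularyThreshold" -> -20|>}]

    (* with such a restrictive vocabulary, the segmentation falls back to small pieces *)
    enc["Hello world"]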

Options  (6)

"CombineWhitespace"  (1)

Import a SentencePiece BPE model and override its "CombineWhitespace" setting:

Encode a string of characters. Multiple whitespace characters are not combined into one before encoding:

The default setting for this model is "CombineWhitespace"->True:

"IgnoreCase"  (1)

Import a SentencePiece BPE model and override its "IgnoreCase" setting:

Encode a string of characters. The encoder does not distinguish between uppercase and lowercase letters:
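
A sketch, again assuming the hypothetical model file "sp_bpe.model":

    enc = NetEncoder[{"BPESubwordTokens",
       <|"ModelPath" -> "sp_bpe.model"|>, "IgnoreCase" -> True}]

    (* uppercase and lowercase input now produce the same codes *)
    enc["Hello World"] === enc["hello world"]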

"IncludeTerminalTokens"  (1)

SentencePiece BPE models do not include terminal tokens by default. Import a model and enable "IncludeTerminalTokens":

Encode a string of characters. StartOfString and EndOfString tokens are now produced:
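
A sketch, assuming the same hypothetical model file:

    enc = NetEncoder[{"BPESubwordTokens",
       <|"ModelPath" -> "sp_bpe.model"|>, "IncludeTerminalTokens" -> True}]

    (* the output now begins with the StartOfString code and ends with the EndOfString code *)
    enc["Hello world"]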

"TargetLength"  (1)

Import a SentencePiece BPE model and specify that the sequence should be padded or trimmed to be 12 elements long:

Encode a string of characters. Outputs with fewer than 12 elements are padded with the EndOfString token:
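
A sketch, assuming the same hypothetical model file:

    enc = NetEncoder[{"BPESubwordTokens",
       <|"ModelPath" -> "sp_bpe.model"|>, "TargetLength" -> 12}]

    (* shorter inputs are padded with the EndOfString code; longer inputs are cropped to 12 codes *)
    enc["Hello world"]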

"UnicodeNormalization"  (1)

Import a SentencePiece BPE model that does not perform any Unicode normalization:

This model encodes the capital omega character, whose Unicode decimal is 937, to ID 29:

The ohm symbol, identified by the Unicode decimal 8486, is not recognized and is encoded to the unknown token, whose ID for this model is 1:

By specifying a Unicode normalization setting for the model, the ohm character is normalized to a capital omega before encoding:
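
A sketch of these steps, assuming the same hypothetical model file; the specific integer codes depend on the model:

    enc = NetEncoder[{"BPESubwordTokens",
       <|"ModelPath" -> "sp_bpe.model"|>, "UnicodeNormalization" -> None}]

    (* capital omega, Unicode decimal 937 *)
    enc[FromCharacterCode[937]]

    (* the ohm sign, Unicode decimal 8486, falls back to the unknown token without normalization *)
    enc[FromCharacterCode[8486]]

    (* with NFKC normalization, the ohm sign is mapped to capital omega before encoding *)
    encNFKC = NetEncoder[{"BPESubwordTokens",
       <|"ModelPath" -> "sp_bpe.model"|>, "UnicodeNormalization" -> "NFKC"}];
    encNFKC[FromCharacterCode[8486]]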

"WhitespacePadding"  (1)

Import a SentencePiece BPE model and override its "WhitespacePadding" setting:

Encode a string of characters. The initial word "great" is segmented into subwords:

The default setting for this model is "WhitespacePadding"Left, which inserts a leading whitespace, creating a match for the token "great":

Applications  (1)

Train a classifier that classifies movie review snippets as "positive" or "negative". First, obtain the training and test data:

Using NetModel, obtain an EmbeddingLayer using a "BPESubwordTokens" encoder from the Wolfram Neural Net Repository:

Use the embeddings to define a net that takes a string of words as input and returns either "positive" or "negative":

Train the net for five training rounds. Keep the weights of the EmbeddingLayer fixed using the option LearningRateMultipliers:

Evaluate the trained net on an example from the test set, obtaining the probabilities:
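
A compressed sketch of this workflow. It is not the repository model: it substitutes a hypothetical local SentencePiece file "sp_bpe.model" and a freshly initialized EmbeddingLayer for the pre-trained embeddings obtained via NetModel, and it assumes the "TrainingData" and "TestData" properties of the "MovieReview" example data:

    (* movie-review snippets labeled "positive" or "negative" *)
    train = ExampleData[{"MachineLearning", "MovieReview"}, "TrainingData"];
    test = ExampleData[{"MachineLearning", "MovieReview"}, "TestData"];

    (* a BPE encoder standing in for the one attached to the repository model *)
    enc = NetEncoder[{"BPESubwordTokens", <|"ModelPath" -> "sp_bpe.model"|>}];

    (* a net taking a string and returning "positive" or "negative" *)
    net = NetChain[<|
        "embedding" -> EmbeddingLayer[50],   (* pre-trained weights would be copied in here *)
        "lstm" -> LongShortTermMemoryLayer[32],
        "last" -> SequenceLastLayer[],
        "linear" -> LinearLayer[2],
        "softmax" -> SoftmaxLayer[]|>,
       "Input" -> enc,
       "Output" -> NetDecoder[{"Class", {"negative", "positive"}}]];

    (* five training rounds with the embedding weights kept fixed *)
    trained = NetTrain[net, train, MaxTrainingRounds -> 5,
       LearningRateMultipliers -> {"embedding" -> 0}];

    (* class probabilities for one test snippet *)
    trained[test[[1, 1]], "Probabilities"]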

Properties & Relations  (1)

Create a "BPESubwordTokens" decoder with analogous specifications to a given encoder:

Encode and decode a string of characters:
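
A sketch, assuming the hypothetical model file "sp_bpe.model":

    enc = NetEncoder[{"BPESubwordTokens", <|"ModelPath" -> "sp_bpe.model"|>}];
    dec = NetDecoder[enc]

    (* the decoder turns the integer codes back into a string *)
    dec[enc["Hello world"]]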

Possible Issues  (1)

Most of the parameters of the encoder are not needed by the "BPESubwordTokens" decoder. As a result, settings may be lost in a round trip. The following encoder performs Unicode normalization:

The setting is lost in a round trip:
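
A sketch of the round trip, assuming the hypothetical model file and that the derived NetDecoder can be converted back into an encoder with NetEncoder:

    enc = NetEncoder[{"BPESubwordTokens",
       <|"ModelPath" -> "sp_bpe.model"|>, "UnicodeNormalization" -> "NFKC"}];

    (* encoder -> decoder -> encoder: the "UnicodeNormalization" setting is not carried over *)
    enc2 = NetEncoder[NetDecoder[enc]]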