"Tokens" (Net Encoder)
NetEncoder["Tokens"]
represents an encoder that converts the words in a string to a sequence of integer codes using a standard English vocabulary.
NetEncoder[{"Tokens","language"}]
represents an encoder that uses a standard vocabulary for the given language.
NetEncoder[{"Tokens",{token1,token2,…}}]
represents an encoder that uses a specified list of tokens as the vocabulary.
NetEncoder[{"Tokens",…,"param"->value}]
represents an encoder in which additional parameters have been specified.
Details
- NetEncoder[…][input] applies the encoder to an input to produce an output.
- NetEncoder[…][{input1,input2,…}] applies the encoder to a list of strings to produce a list of outputs.
- The input to the encoder must be a string or a TextElement with a sequence of strings that represents tokens. If it is a string, the segmentation into tokens will be done using a regular expression based on the value of "SplitPattern".
- The output of the encoder is a sequence of integers between 1 and d+1, where d is the number of tokens in the vocabulary. The integer d+1 represents tokens in the input that do not occur in the vocabulary.
- The type of the output NumericArray is the smallest unsigned integer type that can represent all possible output integer values.
- An encoder can be attached to an input port of a net by specifying "port"->NetEncoder[…] when constructing the net.
- The following parameters can be specified:
    "IgnoreCase"      True    whether to ignore case when matching tokens from the string
    "SplitPattern"            the string pattern to use in order to split the input string into tokens
    "TargetLength"    All     the length of the final sequence to crop or pad to
- With the parameter "IgnoreCase"->True, tokens are effectively converted to lowercase before encoding.
- With the parameter "TargetLength"->All, all tokens found in the input string are encoded.
- With the parameter "TargetLength"->n, the first n tokens found in the input string are encoded, with padding applied if fewer than n tokens are found. The padding value is d+1, where d is the number of tokens in the vocabulary.
- With the parameter "SplitPattern"->None, the input to the encoder is assumed to be a pre-tokenized list of strings of the form {"token1","token2",…}.
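The details above can be sketched with a small custom vocabulary. This is a minimal illustration, not output copied from a session: the three-token vocabulary, the test strings, and the embedding size 8 are hypothetical choices, and the expected codes in the comments assume that codes follow the order of the token list.

```wl
(* Hypothetical three-token vocabulary, so d = 3 *)
vocab = {"cat", "dog", "fish"};
enc = NetEncoder[{"Tokens", vocab}];

(* "bird" is not in the vocabulary, so it should be encoded as d+1 = 4 *)
enc["dog cat bird"]

(* A list of strings should give one code sequence per string *)
enc[{"cat dog", "fish"}]

(* "IgnoreCase" -> True is the default, so "Dog" should match "dog" *)
enc["Dog"]

(* With "IgnoreCase" -> False, "Dog" should no longer match and should map to d+1 *)
NetEncoder[{"Tokens", vocab, "IgnoreCase" -> False}]["Dog"]

(* Attach the encoder to the input port of a net;
   EmbeddingLayer[8] is an arbitrary choice for illustration *)
net = NetChain[{EmbeddingLayer[8]}, "Input" -> enc]
```

Keeping the vocabulary in a variable makes it easy to build several encoders that differ only in their parameters.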
Examples
Basic Examples (1)
Create a token encoder for English text:
Out-of-vocabulary words are encoded as the maximum code:
By default, words are detected using a simple regular expression:
The list of words can be explicitly passed using TextElement:
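The basic examples above might look like the following sketch. The sentences are made up for illustration, and "zzyzzx" is only assumed to be absent from the standard English vocabulary; no outputs are shown because they depend on that vocabulary.

```wl
(* Create a token encoder with the standard English vocabulary *)
enc = NetEncoder["Tokens"]

(* Encode a sentence; "zzyzzx" is assumed to be out of vocabulary,
   in which case it should receive the maximum code *)
enc["the cat sat on the zzyzzx"]

(* Pass the words explicitly as a TextElement instead of relying on the split pattern *)
enc[TextElement[{"the", "cat", "sat"}]]
```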
Scope (6)
Use the default token encoder to encode a sentence:
Give a specific list of tokens:
Give a specific list of tokens, including a split pattern:
Specify that the sequence should be padded or trimmed to be 4 elements long:
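A minimal sketch of this case, using a hypothetical three-token vocabulary. According to the details of "TargetLength", a short input should be padded with d+1 (here 4) and a long input should keep only its first four tokens.

```wl
(* Hypothetical vocabulary with d = 3; fix the output length at 4 *)
enc4 = NetEncoder[{"Tokens", {"cat", "dog", "fish"}, "TargetLength" -> 4}];

(* Two tokens: the result should be padded to length 4 with d+1 = 4 *)
enc4["cat dog"]

(* Five tokens: only the first four should be encoded *)
enc4["cat dog fish cat dog"]
```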
Use a built-in dictionary for a specific language:
Use a custom tokenization with TextElement:
Use the output of TextStructure to compute a list of token indices:
Parameters (3)
"IgnoreCase" (1)
"SplitPattern" (2)
Create an encoder that isolates digit characters, using "SplitPattern":
The encoder outputs one token for each digit character:
This differs from the default behavior, which groups consecutive digit characters into a single token:
Create an encoder with "SplitPattern"->None and two tokens:
The encoder now expects a list of tokens as input:
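A sketch of these two steps, with "hello" and "world" as hypothetical tokens. With "SplitPattern"->None no segmentation is performed, so the input is given as a list of strings.

```wl
(* Two-token vocabulary with segmentation disabled *)
enc = NetEncoder[{"Tokens", {"hello", "world"}, "SplitPattern" -> None}];

(* The input is already tokenized, so it is passed as a list of strings *)
enc[{"hello", "world", "hello"}]
```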