Wolfram Language & System Documentation Center

"Tokens" (Net Encoder)

See Also
- NetEncoder
- NetDecoder
- NetChain
- NetGraph
- TextElement
- Net Encoders
  - Class
  - Characters
  - SubwordTokens
- Net Decoders
  - Tokens
  - Characters
  - Class
  - SubwordTokens
Related Guides
- Neural Networks
Tech Notes
- Neural Networks in the Wolfram Language

NetEncoder["Tokens"]

represents an encoder that converts the words in a string to a sequence of integer codes using a standard English vocabulary.

NetEncoder[{"Tokens","language"}]

represents an encoder that uses a standard vocabulary for the given language.

NetEncoder[{"Tokens",{token₁,token₂,…}}]

represents an encoder that uses a specified list of tokens as the vocabulary.

NetEncoder[{"Tokens",…,"param"->value}]

represents an encoder in which additional parameters have been specified.

Details

NetEncoder[…][input] applies the encoder to an input to produce an output.
NetEncoder[…][{input₁,input₂,…}] applies the encoder to a list of strings to produce a list of outputs.
The input to the encoder must be a string or a TextElement with a sequence of strings that represents tokens. If it is a string, the segmentation into tokens will be done using a regular expression based on the value of "SplitPattern".
The output of the encoder is a sequence of integers between 1 and d+1, where d is the number of tokens in the vocabulary. The integer d+1 signifies tokens in the input that do not occur in the vocabulary.
The output NumericArray uses the smallest unsigned integer type that can represent all possible output values.
An encoder can be attached to an input port of a net by specifying "port"->NetEncoder[…] when constructing the net.
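
As a minimal sketch of the notes above, the following builds an encoder over a small vocabulary and attaches it to the input port of a toy net. The vocabulary, the choice of layers and the layer sizes are illustrative assumptions and do not come from the original page.

```wl
(* hypothetical three-token vocabulary, so d = 3 *)
enc = NetEncoder[{"Tokens", {"red", "green", "blue"}}];

(* apply the encoder to a single string; "purple" is not in the
   vocabulary, so by the rule above it should map to d + 1 = 4 *)
enc["red blue purple"]

(* apply the encoder to a list of strings to get a list of outputs *)
enc[{"red green", "blue"}]

(* attach the encoder to the "Input" port of a small illustrative net *)
net = NetInitialize@NetChain[
    {EmbeddingLayer[8], SequenceLastLayer[], LinearLayer[2]},
    "Input" -> enc];

(* the net can now be applied directly to a raw string *)
net["green blue"]
```

Attaching the encoder at construction time lets the following layers infer their input type from the encoder, which is why the embedding layer here is given only its output size.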

Parameters

The following parameters can be specified:

"IgnoreCase"	True	whether to ignore case when matching tokens from the string
"SplitPattern"		the string pattern to use in order to split the input string into tokens
"TargetLength"	All	the length of the final sequence to crop or pad to

With the parameter "IgnoreCase"->True, tokens are effectively converted to lowercase before encoding.
With the parameter "TargetLength"->All, all tokens found in the input string are encoded.
With the parameter "TargetLength"->n, the first n tokens found in the input string are encoded, with padding applied if fewer than n tokens are found. The padding value is d+1, where d is the number of tokens in the vocabulary.
With the parameter "SplitPattern"->None, the input to the encoder is assumed to be a pre-tokenized list of strings of the form {"token₁","token₂",…}.

Examples


Basic Examples (1)

Create a token encoder for English text:

Encode an English sentence:

Out-of-vocabulary words are encoded as the maximum code:

By default, words are detected using a simple regular expression:

The list of words can be explicitly passed using TextElement:
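
The steps above might look as follows. This is an illustrative reconstruction, not the page's original input cells: the sample sentences and the nonsense word are made up for the sketch.

```wl
(* encoder with the standard English vocabulary *)
enc = NetEncoder["Tokens"];

(* encode a sentence into a sequence of integer codes *)
enc["the cat sat on the mat"]

(* a made-up word is very unlikely to be in the vocabulary,
   so it should receive the maximum code d + 1 *)
enc["the zqxjvk sat"]

(* pass the tokens explicitly instead of relying on the default splitting *)
enc[TextElement[{"the", "cat", "sat"}]]
```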

Scope (6)

Use the default token encoder to encode a sentence:

Give a specific list of tokens:

Give a specific list of tokens, including a split pattern:

Specify that the sequence should be padded or trimmed to be 4 elements long:

Use a built-in dictionary for a specific language:

Use a custom tokenization with TextElement:

Use the output of TextStructure to compute a list of token indices:

A tree structure gets flattened:
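
A hedged sketch of three of the cases listed above: an explicit token list, a fixed target length, and a language-specific vocabulary. The vocabulary and sample strings are hypothetical, and "French" is an assumed example of a supported language name.

```wl
(* hypothetical vocabulary of three tokens, so d = 3 *)
vocab = {"a", "b", "c"};

(* pad or crop every output to exactly 4 codes *)
enc4 = NetEncoder[{"Tokens", vocab, "TargetLength" -> 4}];

enc4["a b"]          (* 2 tokens found, so 2 padding codes (d + 1 = 4) are appended *)
enc4["a b c a b c"]  (* 6 tokens found, so only the first 4 are encoded *)

(* standard vocabulary for a given language; "French" is an assumed example value *)
encFr = NetEncoder[{"Tokens", "French"}];
encFr["le chat dort"]
```

A fixed target length is convenient when the following layers expect sequences of a known size, at the cost of discarding tokens beyond position n.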

Parameters (3)

"IgnoreCase" (1)

An encoder with "IgnoreCase"->True treats tokens that differ only by the case of their constituent characters as equivalent:

An encoder with "IgnoreCase"->False does not do this:
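
The contrast between the two settings can be sketched with a small vocabulary. The tokens and the test string are hypothetical, and the comments describe what the parameter description above implies rather than recorded output.

```wl
(* hypothetical lowercase vocabulary, so d = 2 *)
vocab = {"cat", "dog"};

encFold = NetEncoder[{"Tokens", vocab, "IgnoreCase" -> True}];
encExact = NetEncoder[{"Tokens", vocab, "IgnoreCase" -> False}];

(* case is ignored, so all three tokens should match vocabulary entries *)
encFold["Cat DOG cat"]

(* case matters, so only the final "cat" should match;
   "Cat" and "DOG" should receive the out-of-vocabulary code d + 1 = 3 *)
encExact["Cat DOG cat"]
```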

"SplitPattern" (2)

Create an encoder that isolates digit characters, using "SplitPattern":

The encoder outputs one token for each digit character:

It is different from the default behavior, which gathers all consecutive digit characters together:

Create an encoder with "SplitPattern"->None and two tokens:

The encoder now expects a list of tokens as input:

The encoder still maps across a batch of examples:
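
A minimal sketch of the "SplitPattern"->None case described above, using a hypothetical two-token vocabulary. The inputs are already lists of token strings, so no splitting is performed by the encoder.

```wl
(* encoder with two tokens that expects pre-tokenized input *)
enc = NetEncoder[{"Tokens", {"yes", "no"}, "SplitPattern" -> None}];

(* a single example is a list of token strings *)
enc[{"yes", "no", "yes"}]

(* a batch is a list of such lists; the encoder maps across it *)
enc[{{"yes", "no"}, {"no"}}]
```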
