AttentionLayer
AttentionLayer[net]
represents a trainable net layer that learns to pay attention to certain portions of its input.
Details and Options




- AttentionLayer[net] takes a set of key vectors, a set of value vectors and one or more query vectors and computes, for each query vector, the weighted sum of the value vectors using softmax-normalized scores from net[<|"Input"->key,"Query"->query|>].
- In its general single-head form, AttentionLayer[net] takes a key array K of dimensions d1×…×dn×k, a value array V of dimensions d1×…×dn×v and a query array Q of dimensions q1×…×qm×q. The key and value arrays can be seen as arrays of size d1×…×dn whose elements are vectors: for K, these vectors are denoted k and have size k, and for V they are denoted v and have size v. Similarly, the query array can be seen as an array of size q1×…×qm whose elements are vectors q of size q. Note that the query array can be a single query vector of size q if m is 0. Then the scoring net f is used to compute a scalar score s=f(k,q) for each combination of the d1×…×dn key vectors k and the q1×…×qm query vectors q. These scalar scores are used to produce an output array O of size q1×…×qm containing weighted sums o=∑i wi vi, where the weights are w=softmax(S), and S is the array of d1×…×dn scalar scores produced for a given query vector.
- A common application of AttentionLayer[net] is when the keys are a matrix of size n×k, the values are a matrix of size n×v and the query is a single vector of size q. Then AttentionLayer will compute a single output vector o that is the weighted sum of the n value row vectors: o=∑i wi vi, where w=softmax(z) and zi=f(ki,q). In the Wolfram Language, this can be written Total[values*weights], where weights is SoftmaxLayer[][Map[net[<|"Input"->#,"Query"->query|>]&,keys]]. A minimal sketch of this computation is given after these notes.
- In AttentionLayer[net], the scoring network net can be one of:
    "Dot"    a NetGraph computing s=Dot[k,q]
    "Bilinear"    a NetGraph computing s=Dot[k,W,q], where W is a learnable matrix (default)
    NetGraph[…]    a specific NetGraph that takes "Input" and "Query" vectors and produces a scalar "Output" value
- The following optional parameters can be included:
    "Dropout"    0    dropout rate for the attention weights
    LearningRateMultipliers    Automatic    learning rate multipliers for the scoring network
    "Mask"    None    prevent certain patterns of attention
    "MultiHead"    False    whether to perform multi-head attention, where the penultimate dimension corresponds to different heads
    "ScoreRescaling"    None    method to scale the scores
- Possible settings for "Mask" are:
    None    no masking
    "Causal"    causal masking
    "Causal"->n    local causal masking with a window of size n
- Specifying "Dropout"->p applies dropout with probability p to the attention weights, where p is a scalar such that 0<=p<1.
- With the setting "Mask""Causal", the query input is constrained to be a sequence of vectors with the same length as the key and the value inputs, and only positions t'<=t of the key and the value inputs are used to compute the output at position t.
- With the setting "Mask""Causal"n, where n is a positive integer, only positions t-n<t'<=t of the key and the value inputs are used to compute the output at position t.
- With the setting "MultiHead"True, key and value inputs must be at least of rank three, the query input must be at least of rank two, and the penultimate dimension should be the same for all inputs, representing the number of attention heads. Each attention head corresponds to a distinct attention mechanism, and the outputs of all heads are joined.
- With the setting "ScoreRescaling""DimensionSqrt", the scores are divided by the square root of the key's input dimension before being normalized by the softmax:
.
- AttentionLayer is typically used inside NetGraph.
- AttentionLayer exposes the following input ports for use in NetGraph etc.:
    "Key"    an array of size d1×…×dn×k (or d1×…×dn×h×k with multi-head attention)
    "Value"    an array of size d1×…×dn×v (or d1×…×dn×h×v with multi-head attention)
    "Query"    an array of size q1×…×qm×q (or q1×…×qm×h×q with multi-head attention)
- AttentionLayer exposes an output port for use in NetGraph etc.:
    "Output"    an array of outputs with dimensions q1×…×qm×v (or q1×…×qm×h×v with multi-head attention)
- AttentionLayer exposes an extra port to access internal attention weights:
    "AttentionWeights"    an array of weights with dimensions d1×…×dn×q1×…×qm (or d1×…×dn×h×q1×…×qm with multi-head attention)
- AttentionLayer[…,"Key"->shape1,"Value"->shape2,"Query"->shape3] allows the shapes of the inputs to be specified. Possible forms for shapei include:
    NetEncoder[…]    encoder producing a sequence of arrays
    {d1,d2,…}    an array of dimensions d1×d2×…
    {"Varying",d1,d2,…}    an array whose first dimension is variable and whose remaining dimensions are d1×d2×…
    {Automatic,…}    an array whose dimensions are to be inferred
    {"Varying",Automatic,…}    a varying number of arrays, each of inferred size
- The sizes of the key, value and query arrays are usually inferred automatically within a NetGraph.
- AttentionLayer[…][<"Key"key,"Value"value,"Query"query >] explicitly computes the output from applying the layer.
- AttentionLayer[…][<"Key"{key1,key2,…},"Value"{value1,value2,…},"Query"{query1,query2,…} >] explicitly computes outputs for each of the keyi, valuei and queryi in a batch of inputs.
- AttentionLayer[…][input,NetPort["AttentionWeights"]] can be used to access the softmax-normalized attention weights on some input.
- When given a NumericArray in the input, the output will be a NumericArray.
- NetExtract[…,"ScoringNet"] can be used to extract net from an AttentionLayer[net] object.
- Options[AttentionLayer] gives the list of default options to construct the layer. Options[AttentionLayer[…]] gives the list of default options to evaluate the layer on some data.
- Information[AttentionLayer[…]] gives a report about the layer.
- Information[AttentionLayer[…],prop] gives the value of the property prop of AttentionLayer[…]. Possible properties are the same as for NetGraph.
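The following is a minimal sketch (not taken from the original documentation cells) of the weighted-sum computation described above, assuming a "Dot" scoring net and small illustrative arrays:

    (* keys: n×k matrix, values: n×v matrix, query: a single vector of size q=k *)
    keys = {{1., 0.}, {0., 1.}, {1., 1.}};
    values = {{1., 2., 3.}, {4., 5., 6.}, {7., 8., 9.}};
    query = {0.5, -0.5};

    attend = NetInitialize@AttentionLayer["Dot"];
    attend[<|"Key" -> keys, "Value" -> values, "Query" -> query|>]

    (* the same result computed explicitly: o = Sum_i w_i v_i with w = softmax(s), s_i = k_i.query *)
    weights = SoftmaxLayer[][keys . query];
    Total[weights*values]

The two results agree up to numerical precision, matching the formula o=∑i wi vi given in the notes.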
Examples
Basic Examples (2)
Summary of the most common use cases
Create an AttentionLayer:

https://wolfram.com/xid/0cpsy2ycnqxv9u-ztseuk

Create a randomly initialized AttentionLayer that takes a sequence of two-dimensional keys, three-dimensional values and a sequence of one-dimensional queries:

https://wolfram.com/xid/0cpsy2ycnqxv9u-icc83o


https://wolfram.com/xid/0cpsy2ycnqxv9u-gl237w

The layer threads across a batch of sequences of different lengths:

https://wolfram.com/xid/0cpsy2ycnqxv9u-42njd7
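The cells above are linked rather than reproduced; as a rough sketch under the same assumptions (varying-length sequences of two-dimensional keys, three-dimensional values and one-dimensional queries, with random illustrative data), the construction and application might look like this:

    attention = NetInitialize@AttentionLayer[
       "Key" -> {"Varying", 2}, "Value" -> {"Varying", 3}, "Query" -> {"Varying", 1}];

    (* apply to a single set of inputs *)
    attention[<|"Key" -> RandomReal[1, {4, 2}], "Value" -> RandomReal[1, {4, 3}],
      "Query" -> RandomReal[1, {6, 1}]|>]

    (* thread across a batch of sequences of different lengths *)
    attention[<|"Key" -> {RandomReal[1, {4, 2}], RandomReal[1, {7, 2}]},
      "Value" -> {RandomReal[1, {4, 3}], RandomReal[1, {7, 3}]},
      "Query" -> {RandomReal[1, {6, 1}], RandomReal[1, {2, 1}]}|>]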

Scope (4)
Survey of the scope of standard use cases
Scoring Net (2)
Create an AttentionLayer using a "Dot" scoring net:

https://wolfram.com/xid/0cpsy2ycnqxv9u-obppxt

Extract the "Dot" scoring net:

https://wolfram.com/xid/0cpsy2ycnqxv9u-dwv4nl

Create a new AttentionLayer explicitly specifying the scoring net as a NetGraph object:

https://wolfram.com/xid/0cpsy2ycnqxv9u-g6y5et

Create a custom scoring net with trainable parameters:

https://wolfram.com/xid/0cpsy2ycnqxv9u-vw4k89

Create and initialize an AttentionLayer that makes use of the custom scoring net:

https://wolfram.com/xid/0cpsy2ycnqxv9u-hcx59j

Apply the layer with a single query vector:

https://wolfram.com/xid/0cpsy2ycnqxv9u-pcugbj

Apply the layer with a sequence of queries:

https://wolfram.com/xid/0cpsy2ycnqxv9u-x84ea6
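As an illustration of what such a custom scoring net can look like, here is an additive-style scorer sketched under assumed sizes (key vectors of size 2, query vectors of size 3, value vectors of size 4); this is not the net used in the linked cells:

    (* the scoring net must expose "Input" (key) and "Query" ports and return a scalar "Output" *)
    scorer = NetGraph[<|
        "key" -> LinearLayer[10, "Input" -> 2],
        "query" -> LinearLayer[10, "Input" -> 3],
        "add" -> ThreadingLayer[Plus],
        "tanh" -> ElementwiseLayer[Tanh],
        "score" -> LinearLayer[{}]|>,
       {NetPort["Input"] -> "key", NetPort["Query"] -> "query",
        {"key", "query"} -> "add", "add" -> "tanh", "tanh" -> "score"}];

    attention = NetInitialize@AttentionLayer[scorer, "Value" -> {"Varying", 4}];

    (* single query vector against a sequence of five key/value pairs *)
    attention[<|"Key" -> RandomReal[1, {5, 2}], "Value" -> RandomReal[1, {5, 4}],
      "Query" -> RandomReal[1, 3]|>]

    (* a sequence of two queries *)
    attention[<|"Key" -> RandomReal[1, {5, 2}], "Value" -> RandomReal[1, {5, 4}],
      "Query" -> RandomReal[1, {2, 3}]|>]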

Attention Weights (2)
Create an AttentionLayer:

https://wolfram.com/xid/0cpsy2ycnqxv9u-1evex7

Compute attention weights on a given input:

https://wolfram.com/xid/0cpsy2ycnqxv9u-8a4pso

https://wolfram.com/xid/0cpsy2ycnqxv9u-t8dcdj

In this case, the weights correspond to:

https://wolfram.com/xid/0cpsy2ycnqxv9u-ly4eku

Compute both attention weights and outputs of the layer:

https://wolfram.com/xid/0cpsy2ycnqxv9u-ekwihf
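A small sketch of extracting the weights through the extra port, using illustrative data rather than the linked cells:

    attention = NetInitialize@AttentionLayer["Dot"];
    input = <|"Key" -> {{1., 0.}, {0., 1.}, {1., 1.}},
       "Value" -> {{1., 2.}, {3., 4.}, {5., 6.}}, "Query" -> {2., 1.}|>;

    attention[input, NetPort["AttentionWeights"]]  (* softmax-normalized weights over the three keys *)
    attention[input]                               (* the corresponding weighted sum of the values *)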

Take a model based on AttentionLayer:

https://wolfram.com/xid/0cpsy2ycnqxv9u-ypfm3g

This net contains several multi-head self-attention layers with 12 heads, for instance:

https://wolfram.com/xid/0cpsy2ycnqxv9u-4dhue9

Extract the attention weights of this layer for a given input to the net:

https://wolfram.com/xid/0cpsy2ycnqxv9u-clo9qe

https://wolfram.com/xid/0cpsy2ycnqxv9u-lu6cbo

Represent the weights of the first attention head as connection strengths between input tokens:

https://wolfram.com/xid/0cpsy2ycnqxv9u-yfug31


https://wolfram.com/xid/0cpsy2ycnqxv9u-w8wl3w

Options (7)
Common values & functionality for each option
"Dropout" (1)
Define an AttentionLayer with dropout on the attention weights:

https://wolfram.com/xid/0cpsy2ycnqxv9u-0z8bek

Without training-specific behavior, the layer returns the same result as without dropout:

https://wolfram.com/xid/0cpsy2ycnqxv9u-2a0r31

https://wolfram.com/xid/0cpsy2ycnqxv9u-lpvq3y


https://wolfram.com/xid/0cpsy2ycnqxv9u-nvn0ct

With NetEvaluationMode"Train", the layer returns different results:

https://wolfram.com/xid/0cpsy2ycnqxv9u-08sj4a


https://wolfram.com/xid/0cpsy2ycnqxv9u-4s3zu4
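A compact sketch of this behavior with assumed small shapes (illustrative data, not the linked cells); dropout on the weights is only active when the layer is run in training mode:

    dropoutAttention = NetInitialize@AttentionLayer["Dot", "Dropout" -> 0.5];
    input = <|"Key" -> RandomReal[1, {4, 3}], "Value" -> RandomReal[1, {4, 2}],
       "Query" -> RandomReal[1, 3]|>;

    dropoutAttention[input]                               (* deterministic: dropout is inactive *)
    dropoutAttention[input, NetEvaluationMode -> "Train"] (* stochastic: some weights are dropped *)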

Dropout is applied directly on attention weights:

https://wolfram.com/xid/0cpsy2ycnqxv9u-tpri2c

LearningRateMultipliers (1)
Make a scoring net with arbitrary weights:

https://wolfram.com/xid/0cpsy2ycnqxv9u-vt6hyo

https://wolfram.com/xid/0cpsy2ycnqxv9u-60m0c6

Use this scoring net in AttentionLayer, freezing its weights with the option LearningRateMultipliers:

https://wolfram.com/xid/0cpsy2ycnqxv9u-flo561
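A sketch of one plausible way to write this, assuming that a LearningRateMultipliers value of 0 given at construction freezes the scoring net's weights during NetTrain; the scorer here is just an initialized copy of the default "Bilinear" net with assumed sizes:

    scorer = NetInitialize@NetExtract[
       AttentionLayer["Bilinear", "Key" -> {"Varying", 2}, "Query" -> {"Varying", 3}],
       "ScoringNet"];

    (* assumption: a multiplier of 0 leaves the scoring weights unchanged by training *)
    frozen = AttentionLayer[scorer, LearningRateMultipliers -> 0]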

A zero learning rate multiplier will be used for the weights of the scoring net when training:

https://wolfram.com/xid/0cpsy2ycnqxv9u-cf9g8f


https://wolfram.com/xid/0cpsy2ycnqxv9u-c0dwhi

"Mask" (2)
Define an AttentionLayer with causal masking:

https://wolfram.com/xid/0cpsy2ycnqxv9u-hr9get

Apply the attention layer to key, value and query sequences of length five:

https://wolfram.com/xid/0cpsy2ycnqxv9u-nc72ll

https://wolfram.com/xid/0cpsy2ycnqxv9u-lmgw7f

The output at a given step depends only on the keys and the values up to this step. In particular, the first output vector is the first vector of values.
The attention weights form a lower-triangular matrix:

https://wolfram.com/xid/0cpsy2ycnqxv9u-mdvpot
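A minimal sketch of causal masking with illustrative data; with "Mask"->"Causal" the key, value and query sequences must share the same length:

    causal = NetInitialize@AttentionLayer["Dot", "Mask" -> "Causal"];
    k = RandomReal[1, {5, 2}]; v = RandomReal[1, {5, 3}]; q = RandomReal[1, {5, 2}];

    causal[<|"Key" -> k, "Value" -> v, "Query" -> q|>]

    (* the weights form a lower-triangular matrix: step t attends only to steps t' <= t *)
    MatrixForm[causal[<|"Key" -> k, "Value" -> v, "Query" -> q|>, NetPort["AttentionWeights"]]]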

Define an AttentionLayer with local causal masking of window size 3:

https://wolfram.com/xid/0cpsy2ycnqxv9u-rsvnur

Apply the attention layer to key, value and query sequences of length five:

https://wolfram.com/xid/0cpsy2ycnqxv9u-r1thpd

https://wolfram.com/xid/0cpsy2ycnqxv9u-wzuibc

The output at a given step depends only on the keys and the values from the last three steps.
This can be seen in the matrix of attention weights that contains zeros:

https://wolfram.com/xid/0cpsy2ycnqxv9u-63zmvk

"MultiHead" (2)
Define an AttentionLayer with two heads:

https://wolfram.com/xid/0cpsy2ycnqxv9u-g8cf3x

Apply multi-head attention on one query vector and a sequence of length three:

https://wolfram.com/xid/0cpsy2ycnqxv9u-8fmf8o

https://wolfram.com/xid/0cpsy2ycnqxv9u-73apd3

The result is the same as applying single-head attention separately to each head and joining the results:

https://wolfram.com/xid/0cpsy2ycnqxv9u-e2ruy4


https://wolfram.com/xid/0cpsy2ycnqxv9u-tdwnwx

Define a NetGraph to perform multi-head self-attention with six heads:

https://wolfram.com/xid/0cpsy2ycnqxv9u-3ue78t

Apply to a NumericArray with a sequence of length three:

https://wolfram.com/xid/0cpsy2ycnqxv9u-deq4r7
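A sketch of such a graph under assumed sizes (six heads of size-4 vectors, "Dot" scoring); the single graph input is routed to the "Key", "Value" and "Query" ports of the layer:

    selfAttention = NetInitialize@NetGraph[
       {AttentionLayer["Dot", "MultiHead" -> True]},
       {NetPort["Input"] -> NetPort[1, "Key"],
        NetPort["Input"] -> NetPort[1, "Value"],
        NetPort["Input"] -> NetPort[1, "Query"]},
       "Input" -> {"Varying", 6, 4}];

    (* sequence of length three; the output is again a NumericArray *)
    selfAttention[NumericArray[RandomReal[1, {3, 6, 4}], "Real32"]]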

"ScoreRescaling" (1)
Create an AttentionLayer that rescales attention scores with respect to the input dimension:

https://wolfram.com/xid/0cpsy2ycnqxv9u-8ukesf

Evaluate the layer on an input:

https://wolfram.com/xid/0cpsy2ycnqxv9u-u7x781

https://wolfram.com/xid/0cpsy2ycnqxv9u-0zc3ky

The output is less contrasted than without score rescaling:

https://wolfram.com/xid/0cpsy2ycnqxv9u-2fej5n

The attention weights are also less contrasted, even if their ordering remains the same:

https://wolfram.com/xid/0cpsy2ycnqxv9u-tw3vj2
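A sketch comparing the two settings on the same deliberately high-scoring input (illustrative values, not the linked cells):

    plain = NetInitialize@AttentionLayer["Dot"];
    rescaled = NetInitialize@AttentionLayer["Dot", "ScoreRescaling" -> "DimensionSqrt"];
    input = <|"Key" -> {{2., 0.}, {0., 2.}}, "Value" -> {{1., 0.}, {0., 1.}},
       "Query" -> {2., 0.}|>;

    plain[input, NetPort["AttentionWeights"]]     (* weights roughly {0.98, 0.02} *)
    rescaled[input, NetPort["AttentionWeights"]]  (* scores divided by Sqrt[2]: roughly {0.94, 0.06} *)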

Applications (1)
Sample problems that can be solved with this function
To sort lists of numbers, generate a test and training set consisting of lists of integers between 1 and 6:

https://wolfram.com/xid/0cpsy2ycnqxv9u-njdc95
Display three random samples drawn from the training set:

https://wolfram.com/xid/0cpsy2ycnqxv9u-qfv88q

Define a NetGraph with an AttentionLayer:

https://wolfram.com/xid/0cpsy2ycnqxv9u-wvkti7


https://wolfram.com/xid/0cpsy2ycnqxv9u-kwgxp9

Use the net to sort a list of integers:

https://wolfram.com/xid/0cpsy2ycnqxv9u-zm2ga9

Properties & Relations (4)
Properties of the function, and connections to other functions
If the query, key and value inputs are matrices, AttentionLayer[net] computes:

https://wolfram.com/xid/0cpsy2ycnqxv9u-otwpwr
Define an AttentionLayer and extract the scoring subnet:

https://wolfram.com/xid/0cpsy2ycnqxv9u-1vejum


https://wolfram.com/xid/0cpsy2ycnqxv9u-d56ric

Evaluate AttentionLayer on some test data:

https://wolfram.com/xid/0cpsy2ycnqxv9u-2s45m5

https://wolfram.com/xid/0cpsy2ycnqxv9u-iowbak


https://wolfram.com/xid/0cpsy2ycnqxv9u-xih28g
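A sketch, with assumed small random matrices, of reproducing the layer's output from the extracted scoring subnet according to the formula above:

    attention = NetInitialize@AttentionLayer[
       "Key" -> {4, 2}, "Value" -> {4, 3}, "Query" -> {5, 2}];
    scorer = NetExtract[attention, "ScoringNet"];

    keys = RandomReal[1, {4, 2}]; values = RandomReal[1, {4, 3}]; queries = RandomReal[1, {5, 2}];

    (* s_ji = f(k_i, q_j); w_j = softmax over the keys; o_j = w_j . values *)
    scores = Table[scorer[<|"Input" -> k, "Query" -> q|>], {q, queries}, {k, keys}];
    weights = Map[SoftmaxLayer[], scores];
    weights . values                                                     (* explicit computation *)
    attention[<|"Key" -> keys, "Value" -> values, "Query" -> queries|>]  (* same result, up to precision *)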

AttentionLayer[net,"ScoreRescaling""DimensionSqrt"] computes:

https://wolfram.com/xid/0cpsy2ycnqxv9u-e0wuwl
Define an AttentionLayer and extract the scoring subnet:

https://wolfram.com/xid/0cpsy2ycnqxv9u-6cpqak


https://wolfram.com/xid/0cpsy2ycnqxv9u-m8cjbk

Evaluate AttentionLayer on some test data:

https://wolfram.com/xid/0cpsy2ycnqxv9u-85sxge

https://wolfram.com/xid/0cpsy2ycnqxv9u-y80vjp


https://wolfram.com/xid/0cpsy2ycnqxv9u-nyaxgw

AttentionLayer[scorer,"Mask""Causal","ScoreRescaling""DimensionSqrt"] computes:

https://wolfram.com/xid/0cpsy2ycnqxv9u-4if406
Define an AttentionLayer and extract the scoring subnet:

https://wolfram.com/xid/0cpsy2ycnqxv9u-jbvp0l


https://wolfram.com/xid/0cpsy2ycnqxv9u-eqyk34

Evaluate AttentionLayer on some test data:

https://wolfram.com/xid/0cpsy2ycnqxv9u-iaief0

https://wolfram.com/xid/0cpsy2ycnqxv9u-4pp0ef


https://wolfram.com/xid/0cpsy2ycnqxv9u-i5f01w

If "Key" and "Value" inputs are the same, AttentionLayer is equivalent to the deprecated SequenceAttentionLayer:

https://wolfram.com/xid/0cpsy2ycnqxv9u-ih6gjm


https://wolfram.com/xid/0cpsy2ycnqxv9u-cgkw08


Possible Issues (1)
Common pitfalls and unexpected behavior
When using the setting "Dot" for the scoring net net in AttentionLayer[net], the input key and query vectors cannot be different sizes:

https://wolfram.com/xid/0cpsy2ycnqxv9u-fraijv



https://wolfram.com/xid/0cpsy2ycnqxv9u-meoevd

This restriction does not apply when using a "Bilinear" scoring net:

https://wolfram.com/xid/0cpsy2ycnqxv9u-z64492
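A sketch of the two constructions under assumed shapes (two-dimensional keys against three-dimensional queries):

    (* "Dot" requires the key and query vectors to have equal size, so this fails *)
    AttentionLayer["Dot", "Key" -> {"Varying", 2}, "Query" -> {"Varying", 3}]

    (* "Bilinear" learns a 2×3 matrix W, so different key and query sizes are allowed *)
    AttentionLayer["Bilinear", "Key" -> {"Varying", 2}, "Query" -> {"Varying", 3}]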

Text
Wolfram Research (2019), AttentionLayer, Wolfram Language function, https://reference.wolfram.com/language/ref/AttentionLayer.html (updated 2022).
CMS
Wolfram Language. 2019. "AttentionLayer." Wolfram Language & System Documentation Center. Wolfram Research. Last Modified 2022. https://reference.wolfram.com/language/ref/AttentionLayer.html.
APA
Wolfram Language. (2019). AttentionLayer. Wolfram Language & System Documentation Center. Retrieved from https://reference.wolfram.com/language/ref/AttentionLayer.html
BibTeX
@misc{reference.wolfram_2025_attentionlayer, author="Wolfram Research", title="{AttentionLayer}", year="2022", howpublished="\url{https://reference.wolfram.com/language/ref/AttentionLayer.html}", note="[Accessed: 01-April-2025]"}
BibLaTeX
@online{reference.wolfram_2025_attentionlayer, organization={Wolfram Research}, title={AttentionLayer}, year={2022}, url={https://reference.wolfram.com/language/ref/AttentionLayer.html}, note={[Accessed: 01-April-2025]}}