AttentionLayer[] represents a trainable net layer that learns to pay attention to certain portions of its input.


AttentionLayer[net] specifies a particular net to give scores for portions of the input.


AttentionLayer[net,opts] includes options for weight normalization, masking, and other parameters.

Details and Options

  • AttentionLayer[net] takes a set of key vectors, a set of value vectors and one or more query vectors and computes for each query vector the weighted sum of the value vectors using softmax-normalized scores from net[<|"Input"->key,"Query"->query|>].
  • In its most general form, AttentionLayer[net] takes a key array K of dimensions d1×…×dn×k, a value array V of dimensions d1×…×dn×v, and a query array Q of dimensions q1×…×qm×q. The key and value arrays can be seen as arrays of size d1×…×dn whose elements are vectors: for K, these vectors are denoted k and have size k, and for V, they are denoted v and have size v. Similarly, the query array can be seen as an array of size q1×…×qm whose elements are vectors denoted q of size q. In particular, if m is 0, the query array consists of a single query vector of size q. The scoring net f is then used to compute a scalar score s=f(k,q) for each combination of the d1×…×dn key vectors k and the q1×…×qm query vectors q. These scalar scores are used to produce an output array O of size q1×…×qm containing weighted sums o=∑i wi vi, where the weights are w=softmax(S) and S is the array of d1×…×dn scalar scores produced for a given query vector.
  • A common application of AttentionLayer[net] is when the keys are a matrix of size n×k, the values are a matrix of size n×v and the query is a single vector of size q. Then AttentionLayer will compute a single output vector o that is the weighted sum of the n value row vectors: o=∑i softmax(z)i vi, where zi=f(ki,q). In the Wolfram Language, this can be written Total[values*weights], where weights is SoftmaxLayer[][Map[net[<|"Input"->#,"Query"->query|>]&,keys]].
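As an illustration, the common matrix case above can be sketched in NumPy (outside the Wolfram Language; the function names here are ours, not part of the AttentionLayer API), using the "Dot" scoring net so that the key and query sizes coincide:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax: w_i = exp(z_i) / sum_j exp(z_j)
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(keys, values, query):
    # keys: (n, k), values: (n, v), query: (k,) -- "Dot" scoring, so q == k
    scores = keys @ query      # z_i = f(k_i, q) = k_i . q
    weights = softmax(scores)  # w = softmax(z)
    return weights @ values    # o = sum_i w_i v_i
```

When one key scores much higher than the rest, the softmax weights concentrate there and the output approaches the corresponding value row vector.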
  • In AttentionLayer[net], the scoring network net can be one of:
  • "Dot"	a NetGraph computing s=Dot[k,q]
    "Bilinear"	a NetGraph computing s=Dot[k,W,q], where W is a learnable matrix
    NetGraph[…]	a specific NetGraph that takes "Input" and "Query" vectors and produces a scalar "Output" value
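The two built-in scoring forms amount to the following (a NumPy sketch under our own names; W stands in for the learnable matrix):

```python
import numpy as np

def dot_score(k, q):
    # "Dot": s = Dot[k, q]; key and query vectors must have the same size
    return k @ q

def bilinear_score(k, q, W):
    # "Bilinear": s = Dot[k, W, q]; W is a learnable k-by-q matrix,
    # so the key size and query size may differ
    return k @ W @ q
```

The bilinear form is useful precisely when the key and query vectors live in spaces of different dimensions.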
  • NetExtract[…,"ScoringNet"] can be used to extract net from an AttentionLayer[net] object.
  • The following optional parameters can be included:
  • "Mask"	None	prevent certain patterns of attention
    "ScoreRescaling"	None	method to scale the scores
  • With the setting "Mask"->"Causal", the query input is constrained to be a sequence of vectors with the same length as the key and the value inputs, and only positions t'<t of the key and the value inputs are used to compute the output at position t.
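A causal mask can be sketched as follows in NumPy (illustrative only; this uses the conventional inclusive mask, in which position t may attend to itself and to earlier positions):

```python
import numpy as np

def causal_attention(keys, values, queries):
    # keys: (n, k), values: (n, v), queries: (n, k) -- all of the same length n
    n = len(queries)
    scores = queries @ keys.T                           # (n, n) score matrix
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above diagonal
    scores = np.where(future, -np.inf, scores)          # block attention to the future
    scores -= scores.max(axis=1, keepdims=True)         # stabilize the softmax
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)                   # row-wise softmax weights
    return w @ values                                   # (n, v) outputs
```

Since the first output position can only attend to the first key, it reproduces the first value vector exactly.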
  • With the setting "ScoreRescaling"->"LengthSqrt", the scores are divided by the square root of the key vector size k before being normalized by the softmax: w=softmax(S/√k).
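The effect of this rescaling can be sketched in NumPy (the key size k=100 here is just an assumed illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

k = 100                          # assumed key vector size
scores = np.array([10.0, 0.0])   # raw dot-product scores tend to grow with k
unscaled = softmax(scores)             # nearly one-hot: ~[0.99995, 0.00005]
scaled = softmax(scores / np.sqrt(k))  # softer weights: ~[0.73, 0.27]
```

Dividing by √k keeps the softmax from saturating as the key size grows, so the attention weights (and their gradients) stay informative during training.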
  • AttentionLayer is typically used inside NetGraph.
  • AttentionLayer exposes the following input ports for use in NetGraph etc.:
  • "Key"	an array of size d1×…×dn×k
    "Value"	an array of size d1×…×dn×v
    "Query"	an array of size q1×…×qm×q
  • AttentionLayer exposes an output port for use in NetGraph etc.:
  • "Output"	an array of outputs with dimensions q1×…×qm×v
  • AttentionLayer[…][<|"Key"->key,"Value"->value,"Query"->query|>] explicitly computes the output from applying the layer.
  • AttentionLayer[…][<|"Key"->{key1,key2,…},"Value"->{value1,value2,…},"Query"->{query1,query2,…}|>] explicitly computes outputs for each of the keyi, valuei and queryi.
  • When given a NumericArray as input, the output will be a NumericArray.
  • The sizes of the key, value and query arrays are usually inferred automatically within a NetGraph.
  • AttentionLayer[…,"Key"->shape1,"Value"->shape2,"Query"->shape3] allows the shapes of the inputs to be specified. Possible forms for shapei include:
  • NetEncoder[…]	encoder producing a sequence of arrays
    {d1,d2,…}	an array of dimensions d1×d2×…
    {"Varying",d2,d3,…}	an array whose first dimension is variable and remaining dimensions are d2×d3×…
    {Automatic,…}	an array whose dimensions are to be inferred
    {"Varying",Automatic,…}	a varying number of arrays each of inferred size



Basic Examples  (2)

Create an AttentionLayer:


Create a randomly initialized AttentionLayer that takes a sequence of two-dimensional keys, three-dimensional values and a sequence of one-dimensional queries:


Apply the layer to an input:


The layer threads across a batch of sequences of different lengths:


Scope  (2)

Options  (2)

Applications  (1)

Properties & Relations  (4)

Possible Issues  (1)

Introduced in 2019