# AttentionLayer

represents a trainable net layer that learns to pay attention to certain portions of its input.

AttentionLayer[net]

specifies a particular net that gives scores for portions of the input.

AttentionLayer[net,opts]

includes options for weight normalization, masking, and other parameters.

# Details and Options

• AttentionLayer[net] takes a set of key vectors, a set of value vectors and one or more query vectors, and computes, for each query vector, the weighted sum of the value vectors using softmax-normalized scores from net[<|"Input"->key,"Query"->query|>].
• In its most general form, AttentionLayer[net] takes a key array K of dimensions d1×…×dn×k, a value array V of dimensions d1×…×dn×v, and a query array Q of dimensions q1×…×qm×q. The key and value arrays can be seen as arrays of size d1×…×dn whose elements are vectors: for K, these vectors are denoted k and have size k, and for V, these vectors are denoted v and have size v. Similarly, the query array can be seen as an array of size q1×…×qm whose elements are vectors denoted q of size q. When m is 0, the query array consists of a single query vector of size q. The scoring net f is then used to compute a scalar score s=f(k,q) for each combination of the d1×…×dn key vectors k and the q1×…×qm query vectors q. These scalar scores are used to produce an output array O of size q1×…×qm×v containing weighted sums o=∑i wi vi, where the weights are w=softmax(S), and S is the array of d1×…×dn scalar scores produced for a given query vector.
• A common application of AttentionLayer[net] is when the keys are a matrix of size n×k, the values are a matrix of size n×v and the query is a single vector of size q. AttentionLayer then computes a single output vector o that is the weighted sum of the n value row vectors: o=∑i wi vi, where w=softmax(z) and zi=f(ki,q). In the Wolfram Language, this can be written Total[values*weights], where weights is SoftmaxLayer[][Map[net[<|"Input"->#,"Query"->query|>]&,keys]].
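The common case above (keys n×k, values n×v, a single query vector) can be sketched in plain Python rather than the Wolfram Language; `softmax` and `attend` here are illustrative helper names, not part of the documented API:

```python
import math

def softmax(zs):
    # Numerically stable softmax over a list of scalar scores.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(keys, values, query, score=dot):
    # One output vector: the weighted sum of the value rows, with
    # softmax-normalized weights over the scores z_i = f(k_i, q).
    weights = softmax([score(k, query) for k in keys])
    v_size = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(v_size)]

keys = [[1.0, 0.0], [0.0, 1.0]]      # n = 2 keys of size k = 2
values = [[10.0, 0.0, 0.0],
          [0.0, 10.0, 0.0]]          # n = 2 values of size v = 3
query = [1.0, 0.0]                   # one query of size q = 2
out = attend(keys, values, query)
```

The query aligns with the first key, so the output is dominated by (but not equal to) the first value row, since softmax weights are never exactly 0 or 1.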
• In AttentionLayer[net], the scoring network net can be one of:
•  "Dot": a NetGraph computing s=Dot[k,q]
•  "Bilinear": a NetGraph computing s=Dot[k,W,q], where W is a learnable matrix
•  NetGraph[…]: a specific NetGraph that takes "Input" and "Query" vectors and produces a scalar "Output" value
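The two built-in scoring forms can be sketched in plain Python (not the Wolfram Language); `bilinear_score` and `dot_score` are illustrative names, and W stands in for the learnable matrix:

```python
def bilinear_score(W):
    # "Bilinear" scoring: s = Dot[k, W, q], with W a k_size x q_size
    # matrix given as nested lists (learnable in the real layer).
    def score(k, q):
        return sum(k[i] * sum(W[i][j] * q[j] for j in range(len(q)))
                   for i in range(len(k)))
    return score

def dot_score(k, q):
    # "Dot" scoring: s = Dot[k, q]; requires k and q to have the same size.
    return sum(a * b for a, b in zip(k, q))

identity = [[1.0, 0.0], [0.0, 1.0]]
k, q = [1.0, 2.0], [3.0, 4.0]
# With W set to the identity, bilinear scoring reduces to dot scoring.
s_bilinear = bilinear_score(identity)(k, q)
s_dot = dot_score(k, q)
```

"Dot" scoring constrains the key and query sizes to match, while "Bilinear" lets W bridge keys and queries of different sizes.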
• NetExtract[…,"ScoringNet"] can be used to extract net from an AttentionLayer[net] object.
• The following optional parameters can be included:
•  "Mask" (default None): prevent certain patterns of attention
•  "ScoreRescaling" (default None): method used to scale the scores
• With the setting "Mask"->"Causal", the query input is constrained to be a sequence of vectors of the same length as the key and value inputs, and only positions t'≤t of the key and value inputs are used to compute the output at position t.
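Causal masking can be sketched in plain Python (not the Wolfram Language), with `causal_attend` as an illustrative helper name; the output at position t only sums over key/value positions up to t:

```python
import math

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def causal_attend(keys, values, queries):
    # Output at position t attends only to positions t' <= t,
    # mimicking the effect of "Mask" -> "Causal".
    outs = []
    for t, q in enumerate(queries):
        w = softmax([dot(k, q) for k in keys[:t + 1]])
        outs.append([sum(wi * v[j] for wi, v in zip(w, values[:t + 1]))
                     for j in range(len(values[0]))])
    return outs

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[5.0], [7.0]]
queries = keys  # same length as the key/value sequence, as required
outs = causal_attend(keys, values, queries)
```

The first output can only attend to the first position, so it equals the first value exactly; later outputs mix all positions seen so far.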
• With the setting "ScoreRescaling"->"LengthSqrt", the scores are divided by the square root of the key vector size k before being normalized by the softmax: w=softmax(s/√k).
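The rescaling step can be sketched in plain Python (not the Wolfram Language); `scaled_weights` is an illustrative helper name:

```python
import math

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_weights(scores, k_size):
    # Mimics "ScoreRescaling" -> "LengthSqrt": divide each score by
    # sqrt(k_size) before the softmax, keeping logits from growing
    # with the key vector size.
    return softmax([s / math.sqrt(k_size) for s in scores])

flat = scaled_weights([0.0, 0.0], 4)       # equal scores stay uniform
scaled = scaled_weights([2.0, 0.0], 4)     # softmax over [1.0, 0.0]
unscaled = softmax([2.0, 0.0])             # softmax over [2.0, 0.0]
```

Dividing by √k flattens the distribution: the rescaled weights are less peaked than the unscaled ones for the same raw scores.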
• AttentionLayer is typically used inside NetGraph.
• AttentionLayer exposes the following input ports for use in NetGraph etc.:
•  "Key": an array of size d1×…×dn×k
•  "Value": an array of size d1×…×dn×v
•  "Query": an array of size q1×…×qm×q
• AttentionLayer exposes an output port for use in NetGraph etc.:
•  "Output": an array of outputs with dimensions q1×…×qm×v
• AttentionLayer[…][<|"Key"->key,"Value"->value,"Query"->query|>] explicitly computes the output from applying the layer.
• AttentionLayer[…][<|"Key"->{key1,key2,…},"Value"->{value1,value2,…},"Query"->{query1,query2,…}|>] explicitly computes outputs for each of the keyi, valuei and queryi.
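Batched application, where each batch element may have a different sequence length, can be sketched in plain Python (not the Wolfram Language); `attend` is an illustrative helper name:

```python
import math

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(keys, values, query):
    # Single-query attention: softmax-weighted sum of value rows.
    w = softmax([sum(a * b for a, b in zip(k, query)) for k in keys])
    return [sum(wi * v[j] for wi, v in zip(w, values))
            for j in range(len(values[0]))]

# Two batch elements with sequence lengths 1 and 2 respectively.
batch_keys = [[[1.0, 0.0]],
              [[1.0, 0.0], [0.0, 1.0]]]
batch_values = [[[2.0]],
                [[2.0], [4.0]]]
batch_queries = [[1.0, 0.0],
                 [0.0, 1.0]]
outs = [attend(k, v, q)
        for k, v, q in zip(batch_keys, batch_values, batch_queries)]
```

Each element of the batch is processed independently, so the varying sequence lengths pose no problem.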
• When given a NumericArray as input, the output will be a NumericArray.
• The sizes of the key, value and query arrays are usually inferred automatically within a NetGraph.
• AttentionLayer[…,"Key"->shape1,"Value"->shape2,"Query"->shape3] allows the shapes of the inputs to be specified. Possible forms for shapei include:
•  NetEncoder[…]: an encoder producing a sequence of arrays
•  {d1,d2,…}: an array of dimensions d1×d2×…
•  {"Varying",d1,d2,…}: an array whose first dimension is variable and remaining dimensions are d1×d2×…
•  {Automatic,…}: an array whose dimensions are to be inferred
•  {"Varying",Automatic,…}: a varying number of arrays, each of inferred size

# Examples


## Basic Examples (2)

Create an AttentionLayer:

Create a randomly initialized AttentionLayer that takes a sequence of two-dimensional keys, three-dimensional values and a sequence of one-dimensional queries:

Apply the layer to an input:

The layer threads across a batch of sequences of different lengths:

## Possible Issues (1)

Introduced in 2019 (12.0)