Transformers

Under the hood, a seq2seq model is composed of an encoder and a decoder.

The encoder processes each item in the input sequence and compiles the information it captures into a vector (called the context). After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item.

Let the dive down the rabbit hole begin!

http://jalammar.github.io/illustrated-transformer/

points to

https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

which leads to

https://www.youtube.com/watch?v=UNmqTiOnRfg and https://www.youtube.com/watch?v=BR9h47Jtqyw that lead to https://www.youtube.com/watch?v=WCUNPb-5EYI

Study notes

https://www.youtube.com/watch?v=WCUNPb-5EYI

Back to the study material

The size of the context vector is basically the number of hidden units in the encoder RNN.

In models without attention, the last hidden state of the encoder is the context we pass along to the decoder!
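
To make this concrete for myself, here is a minimal NumPy sketch of a plain tanh RNN encoder. The weights are random, and the names (rnn_encode, W_xh, W_hh, b_h) and the toy sizes are my own assumptions, not something taken from the linked posts. It only illustrates that, without attention, the context handed to the decoder is the last hidden state, so its length equals the number of hidden units.

```python
import numpy as np

def rnn_encode(embeddings, W_xh, W_hh, b_h):
    """Run a plain tanh RNN over the input and return every hidden state."""
    hidden_size = W_hh.shape[0]
    h = np.zeros(hidden_size)              # initial hidden state
    states = []
    for x in embeddings:                   # one input item per time step
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)                # shape: (seq_len, hidden_size)

rng = np.random.default_rng(0)
seq_len, embed_size, hidden_size = 3, 4, 8  # toy sizes, chosen arbitrarily

# Random stand-ins for learned embeddings and weights (assumption).
embeddings = rng.normal(size=(seq_len, embed_size))
W_xh = rng.normal(size=(hidden_size, embed_size)) * 0.1
W_hh = rng.normal(size=(hidden_size, hidden_size)) * 0.1
b_h = np.zeros(hidden_size)

encoder_states = rnn_encode(embeddings, W_xh, W_hh, b_h)
context = encoder_states[-1]               # no attention: keep only the last state
print(context.shape)                       # (8,)
```

A real model would learn the weights and would more likely use an LSTM or GRU cell, but the shape of the context would follow the same rule.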

An attention model differs from a classic sequence-to-sequence model in two main ways:

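First, the encoder passes a lot more data to the decoder. Instead of passing only the last hidden state of the encoding stage, the encoder passes all the hidden states to the decoder.

Second, the attention decoder does an extra step before producing its output: at each time step it scores the encoder hidden states and combines them, weighted by their softmaxed scores, into a context vector for that step. Each decoder time step then works like this: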

The attention decoder RNN takes in the embedding of the token, and an initial decoder hidden state.
The RNN processes its inputs, producing an output and a new hidden state vector (h4). The output is discarded.
Attention Step: We use the encoder hidden states and the h4 vector to calculate a context vector (C4) for this time step.
We concatenate h4 and C4 into one vector.
We pass this vector through a feedforward neural network (one trained jointly with the model).
The output of the feedforward neural network indicates the output word of this time step.
Repeat for the next time steps.
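
The decoder steps above can be sketched in NumPy roughly as follows. This is only a toy under stated assumptions: the weights are random, the attention score is a plain dot product between the decoder hidden state and each encoder hidden state (the notes do not say which scoring function is used), the initial decoder hidden state is taken to be the last encoder state, and all names (attention_decoder_step, params, embedding_table, the start token id 0) are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))              # subtract max for numerical stability
    return e / e.sum()

def attention_decoder_step(x_emb, h_prev, encoder_states, params):
    """One decoder time step: RNN, attention, concatenate, feedforward."""
    # RNN step: token embedding + previous hidden state -> new hidden state.
    h_new = np.tanh(params["W_xh"] @ x_emb + params["W_hh"] @ h_prev + params["b_h"])
    # Attention step: score each encoder hidden state (dot product, an assumption).
    scores = encoder_states @ h_new        # shape: (src_len,)
    weights = softmax(scores)              # attention weights, sum to 1
    context = weights @ encoder_states     # weighted sum, shape: (hidden_size,)
    # Concatenate the hidden state and the context vector.
    combined = np.concatenate([h_new, context])
    # Feedforward layer maps the combined vector to one score per vocabulary word.
    logits = params["W_out"] @ combined + params["b_out"]
    return int(np.argmax(logits)), h_new, weights

rng = np.random.default_rng(0)
src_len, embed_size, hidden_size, vocab_size = 5, 4, 8, 10  # toy sizes

encoder_states = rng.normal(size=(src_len, hidden_size))    # stand-in for encoder output
embedding_table = rng.normal(size=(vocab_size, embed_size))
params = {
    "W_xh": rng.normal(size=(hidden_size, embed_size)) * 0.1,
    "W_hh": rng.normal(size=(hidden_size, hidden_size)) * 0.1,
    "b_h": np.zeros(hidden_size),
    "W_out": rng.normal(size=(vocab_size, 2 * hidden_size)) * 0.1,
    "b_out": np.zeros(vocab_size),
}

h = encoder_states[-1]                     # initial decoder hidden state (assumption)
token = 0                                  # hypothetical start token id
for step in range(3):                      # repeat for the next time steps
    token, h, weights = attention_decoder_step(
        embedding_table[token], h, encoder_states, params
    )
    print(step, token, np.round(weights, 2))
```

With a simple RNN cell the output and the hidden state are the same vector, which is why the sketch keeps only h_new. The printed weights show how much each encoder position contributes to the context vector at that step.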

Illustrated transformer

http://jalammar.github.io/illustrated-transformer/

http://nlp.seas.harvard.edu/2018/04/03/attention.html