Mathematical Framework of Transformers

Notes: no MLP layers, no biases, no layer norm.

"Attention and MLP layers each 'read' their input from the residual stream (by performing a linear projection), and then 'write' their result to the residual stream by adding a linear projection back in."
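A minimal numpy sketch of that read/write pattern for a single layer and a single token position. W_read and W_write are illustrative stand-ins (not the paper's notation) for the layer's input and output projections:

```python
import numpy as np

d_model, d_head = 64, 16
rng = np.random.default_rng(0)

# Hypothetical read/write matrices for one layer (illustrative names, not from the paper).
W_read = rng.standard_normal((d_model, d_head))    # "read": linear projection out of the stream
W_write = rng.standard_normal((d_head, d_model))   # "write": linear projection back into the stream

x = rng.standard_normal(d_model)    # residual stream vector for one token position

h = x @ W_read          # the layer reads its input from the residual stream
result = h @ W_write    # the layer's result, projected back to stream dimensions
x = x + result          # the layer writes by adding its result into the stream
```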

Residual stream: the residual stream is simply the sum of the outputs of all previous layers and the original embedding.
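A small sketch of that definition, using random stand-ins for the layer outputs (all names here are illustrative):

```python
import numpy as np

d_model, n_layers = 64, 4
rng = np.random.default_rng(0)

embedding = rng.standard_normal(d_model)          # original token embedding
layer_outputs = [rng.standard_normal(d_model)     # stand-ins for each layer's output
                 for _ in range(n_layers)]

# The residual stream after the last layer: the embedding plus every layer output so far.
stream = embedding + sum(layer_outputs)

# Built up layer by layer, it is the same thing:
x = embedding.copy()
for out in layer_outputs:
    x = x + out

assert np.allclose(x, stream)
```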


Only linear operations are performed on the residual stream.

Virtual weights: because every read from and write to the residual stream is linear, the write projection of an earlier layer composed with the read projection of a later layer forms a single "virtual weight" matrix connecting the two layers directly. A sketch of this under illustrative names (W_out_A and W_in_B are stand-ins, not the paper's notation) follows below.
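```python
import numpy as np

d_model, d_head = 64, 16
rng = np.random.default_rng(0)

# Illustrative matrices: an earlier layer's output ("write") projection and a
# later layer's input ("read") projection.
W_out_A = rng.standard_normal((d_head, d_model))   # layer A writes: d_head -> d_model
W_in_B = rng.standard_normal((d_model, d_head))    # layer B reads:  d_model -> d_head

# Because both steps are linear, their product is a single "virtual weight" matrix
# connecting layer A's output space directly to layer B's input space.
W_virtual = W_out_A @ W_in_B                       # shape (d_head, d_head)

a_out = rng.standard_normal(d_head)                # some output of layer A (in head space)
via_stream = (a_out @ W_out_A) @ W_in_B            # write into the stream, then read
via_virtual = a_out @ W_virtual                    # one matmul with the virtual weights

assert np.allclose(via_stream, via_virtual)
```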