"Attention Is All You Need"[1] is a 2017 landmark[2][3] research paper in machine learning authored by eight scientists working at Google. The paper introduced a new deep learning architecture known as the transformer, based on the attention mechanism proposed in 2014 by Bahdanau et al. It is considered a foundational[4] paper in modern artificial intelligence, as the transformer approach has become the main architecture of large language models like those based on GPT.[5][6] At the time, the focus of the research was on improving Seq2seq techniques for machine translation, but the authors go further in the paper, foreseeing the technique's potential for other tasks like question answering and what is now known as multimodal Generative AI.[1]
The paper's title is a reference to the song "All You Need Is Love" by the Beatles.[7] The name "Transformer" was chosen because Jakob Uszkoreit liked the sound of the word.[8]
An early design document was titled "Transformers: Iterative Self-Attention and Processing for Various Tasks", and included an illustration of six characters from the Transformers animated show. The team was named Team Transformer.[7]
Early tasks on which the team tried the Transformer architecture included English-to-German translation, generating Wikipedia articles on "The Transformer", and parsing. These experiments convinced the team that the Transformer was a general-purpose language model, not merely good at translation.[8]
As of 2024, the paper has been cited more than 100,000 times.[9]
For their 100M-parameter Transformer model, the authors suggested linearly scaling the learning rate up from 0 to its maximal value over the first part of training (i.e. about 2% of the total number of training steps), and using dropout to stabilize training.
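The paper gives this schedule as lrate = d_model⁻⁰·⁵ · min(step⁻⁰·⁵, step · warmup_steps⁻¹·⁵): linear warmup from zero, then inverse-square-root decay. A minimal sketch, with defaults taken from the paper's base configuration (d_model = 512, warmup_steps = 4000):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning-rate schedule from the paper: linear warmup from 0,
    then decay proportional to the inverse square root of the step."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate grows linearly until warmup_steps, then decays as 1/sqrt(step).
for s in (1, 1000, 4000, 10000, 100000):
    print(s, transformer_lr(s))
```

The peak learning rate occurs exactly at step = warmup_steps, where the two terms inside the min coincide.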
The authors of the paper are: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin. All eight authors were "equal contributors" to the paper; the listed order was randomized. The Wired article highlights the group's diversity:[7]
Six of the eight authors were born outside the United States; the other two are children of two green-card-carrying Germans who were temporarily in California and a first-generation American whose family had fled persecution, respectively.
By 2023, all eight authors had left Google; seven founded their own AI start-ups, while Łukasz Kaiser joined OpenAI.[7][9]
Modern Transformers require computation time that is quadratic in the size of the context window. Jürgen Schmidhuber's linearly scaling fast weight controller (1992) learns to compute a weight matrix for further processing depending on the input.[10] One of its two feedforward networks has "fast weights" or "dynamic links" (1981).[11][12][13] A slow neural network learns by gradient descent to generate keys and values for computing the weight changes of the fast neural network, which computes answers to queries.[10] In the 2020s, this was shown to be equivalent to the unnormalized linear Transformer,[14][15] which scales linearly because of its "linearized attention".
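A minimal NumPy sketch of that equivalence, assuming random arrays stand in for the keys, queries, and values that the slow network would produce: each step writes the outer product v_t k_tᵀ into a "fast weight" matrix W, and the query is answered by W q_t, so cost grows linearly with sequence length rather than quadratically.

```python
import numpy as np

def linear_attention(K, Q, V):
    """Unnormalized linear attention via a fast weight matrix.
    Write: W += outer(v_t, k_t); read: output_t = W @ q_t."""
    d_k, d_v = K.shape[1], V.shape[1]
    W = np.zeros((d_v, d_k))            # the fast weight matrix
    outputs = []
    for k_t, q_t, v_t in zip(K, Q, V):
        W += np.outer(v_t, k_t)         # update from the slow net's key/value
        outputs.append(W @ q_t)         # answer the current query
    return np.stack(outputs)

rng = np.random.default_rng(0)
n, d = 8, 4
K, Q, V = (rng.normal(size=(n, d)) for _ in range(3))
out = linear_attention(K, Q, V)         # shape (8, 4), computed in O(n) steps
```

Because the running sum Σ v_i k_iᵀ is a fixed-size matrix, memory and per-step compute do not depend on how many tokens came before, in contrast to softmax attention over the full context.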