Is "attention" really all we need?

With their milestone publication "Attention Is All You Need", Vaswani et al. introduced the Transformer architecture in 2017, heralding the era of large language models (LLMs). Although the Transformer remains the core component of every LLM today, several key concepts have since been added that make a modern LLM what we know it to be.

Instruction Tuning and RLHF

An important step on the way to today's chatbots such as ChatGPT was not only to train models on huge amounts of text, but also to teach them to follow human instructions. This is where instruction tuning comes in: the model is fine-tuned on many examples of tasks phrased as instructions ("Translate this sentence into French", "Write a short summary") and learns to respond in the desired form.
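A minimal sketch of how such instruction data is typically prepared for supervised fine-tuning; the Alpaca-style prompt template and the toy examples below are illustrative, not taken from any specific dataset:

```python
# Alpaca-style prompt template and toy examples, for illustration only.
EXAMPLES = [
    {"instruction": "Translate this sentence into French.",
     "input": "The weather is nice today.",
     "output": "Il fait beau aujourd'hui."},
    {"instruction": "Write a short summary.",
     "input": "Transformers use attention so that every token can weigh "
              "every other token when building its representation.",
     "output": "Transformers relate tokens to each other via attention."},
]

PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def to_training_text(example: dict) -> str:
    """Turn one instruction example into the text the model is fine-tuned on.

    During supervised fine-tuning, the loss is typically computed only on the
    response part, so the model learns to produce answers, not prompts.
    """
    return PROMPT_TEMPLATE.format(**example) + example["output"]

for ex in EXAMPLES:
    print(to_training_text(ex))
    print("=" * 40)
```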

Building on this came reinforcement learning from human feedback (RLHF). Human annotators compare different model responses and rate which one is more helpful or friendly. This feedback is then used to align the model more closely with our expectations. The result: answers that are not only correct, but also helpful, respectful and comprehensible (Ouyang et al., 2022).
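At the heart of RLHF is a reward model trained on those human comparisons. A minimal numerical sketch of the pairwise (Bradley-Terry) loss commonly used for this step; the function and the toy reward values are purely illustrative:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss for training a reward model from human comparisons:
    it is small when the model assigns a higher reward to the answer the
    annotator preferred, and large otherwise.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log(sigmoid(margin))

# Toy numbers: the reward model agrees with the human ranking ...
print(round(preference_loss(reward_chosen=2.0, reward_rejected=0.5), 3))  # ~0.201
# ... and is penalized when it disagrees.
print(round(preference_loss(reward_chosen=0.5, reward_rejected=2.0), 3))  # ~1.701
```

The trained reward model then provides the signal against which the language model itself is optimized with reinforcement learning.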

FlashAttention

The larger the language models became, the greater the computational effort required, especially for the so-called attention layer. FlashAttention, introduced in 2022, is a clever technical solution: the attention is still computed exactly, but the full attention matrix is never materialized in slow GPU memory. Instead, "tiling" and "streaming" of small blocks through fast on-chip memory make optimal use of the memory bandwidth and reduce memory complexity (Dao et al., 2022).
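A simplified NumPy sketch of the tiling idea with an "online" softmax; real FlashAttention fuses these steps into a single GPU kernel, so this only illustrates the arithmetic:

```python
import numpy as np

def tiled_attention(q, k, v, block_size=64):
    """Sketch of the FlashAttention idea: keys/values are streamed in blocks
    and the softmax is renormalized on the fly ("online softmax"), so the
    full (n x n) attention matrix is never materialized at once.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)
    row_max = np.full(n, -np.inf)   # running maximum of the logits per query
    row_sum = np.zeros(n)           # running softmax denominator per query

    for start in range(0, k.shape[0], block_size):
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        scores = (q @ kb.T) * scale                 # only an (n, block) tile
        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)      # rescale previous partials
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max

    return out / row_sum[:, None]

# Check against naive attention on random data.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 32)) for _ in range(3))
logits = (q @ k.T) / np.sqrt(32)
weights = np.exp(logits - logits.max(axis=1, keepdims=True))
naive = (weights / weights.sum(axis=1, keepdims=True)) @ v
print(np.allclose(tiled_attention(q, k, v), naive))  # should print True
```

Only one (n x block_size) tile of scores exists at any time; the running maximum and running sum keep the result identical to standard softmax attention.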

This sounds very technical - but in practice it means that models can be trained more quickly and operated more cheaply. This makes it realistic to use LLMs with longer contexts (e.g. entire books or long conversations) on standard hardware.

RoPE

In order for a language model to not only recognize words but also understand their order, it needs a "feeling" for positions in the text. Originally, simple patterns such as sine and cosine waves were added to the token embeddings for this purpose. Later, Rotary Position Embedding (RoPE) was introduced: a more elegant method that represents position information as rotations of the query and key vectors and thus naturally encodes relative distances (Su et al., 2021).
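A small NumPy sketch of the idea: each pair of dimensions is rotated by an angle proportional to the token position, and the attention score between two rotated vectors then depends only on their relative distance (illustrative code, not a particular library's implementation):

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Apply Rotary Position Embedding to a single vector (after Su et al.,
    2021): consecutive pairs of dimensions are rotated by an angle that
    grows with the token position.
    """
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)   # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                      # split into rotation pairs
    return np.stack([x1 * cos - x2 * sin,
                     x1 * sin + x2 * cos], axis=-1).reshape(d)

# The score between a query and a key depends only on their relative distance:
rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)
score_a = rope(q, position=3) @ rope(k, position=7)        # distance 4
score_b = rope(q, position=103) @ rope(k, position=107)    # distance 4, shifted
print(np.isclose(score_a, score_b))                        # True
```

Because the score is invariant to shifting both positions by the same amount, the model effectively learns relative offsets rather than absolute indices.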

The result: the model can handle long texts better and understands that a word plays a different role at the beginning of a sentence than at the end. Many modern LLMs - the LLaMA family, for example - use RoPE to significantly extend the context length they can handle.

Synthetic data generation

Training data is an often overlooked but hugely important component. At some point, you reach the limit of what the internet has to offer - or there is a lack of examples for specific tasks. This is where synthetic data generation comes into play: existing models are used to generate new, artificial training examples.

One example: researchers have a strong model invent many question-answer pairs and use them to train a smaller model. This creates powerful systems without the need for millions of expensive human annotations. The technique also makes it possible to extend models to rare languages or niche domains. For the open-source community, synthetic data generation has been crucial in developing competitive instruction-tuned models such as Alpaca (Wang et al., 2022; Taori et al., 2023).
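A heavily simplified, Self-Instruct-style sketch of such a pipeline; the teacher_generate function below is a hypothetical stand-in for a call to a strong existing model, since the actual API client, model name and parameters depend on the provider and are not prescribed here:

```python
import json

def teacher_generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to a strong teacher model.
    In practice this would be an API call to an existing LLM.
    """
    return f"<teacher model response to: {prompt!r}>"

def make_synthetic_pairs(topics, n_per_topic=2):
    """Self-Instruct-style loop (Wang et al., 2022), heavily simplified:
    ask the teacher to invent a question per topic, then to answer it,
    and collect the pairs as instruction-tuning data for a smaller model.
    """
    pairs = []
    for topic in topics:
        for _ in range(n_per_topic):
            question = teacher_generate(f"Invent a challenging question about {topic}.")
            answer = teacher_generate(f"Answer this question concisely: {question}")
            pairs.append({"instruction": question, "output": answer})
    return pairs

# The resulting pairs could feed directly into the instruction-tuning
# format sketched earlier in this article.
print(json.dumps(make_synthetic_pairs(["Python", "rare languages"], 1), indent=2))
```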

There are indications that, in the future, LLMs will be used less for their internal knowledge and more for orchestrating agent systems. Specific skills such as tool use, API calls and the like will be central to this. Synthetic data is ideally suited for training such specific interactions.

Conclusion

Even though Vaswani et al. chose the provocative title "Attention Is All You Need" (2017), it is now clear that, while we continue to build on the Transformer architecture, many other pieces of the puzzle are needed to make modern language models work - we have presented four key ones today.

Attention is still the centerpiece, but it is not everything. The combination of architecture, clever training methods and ever-improving data sets has turned language models from mere text prediction engines into versatile, useful and surprisingly competent partners.

References

  1. Vaswani, A. et al. (2017): Attention Is All You Need. NeurIPS.

  2. Ouyang, L. et al. (2022): Training language models to follow instructions with human feedback. arXiv:2203.02155.

  3. Dao, T. et al. (2022): FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS.

  4. Su, J. et al. (2021): RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.

  5. Wang, Y. et al. (2022): Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv:2212.10560.

  6. Taori, R. et al. (2023): Alpaca: A Strong, Replicable Instruction-Following Model. Stanford CRFM Blog.
