Definition of LLM
Language Models
A Language Model (LM) is a model that tries to predict the next word given the prior words in a sequence.
The term Large refers to two concepts: 1. the number of parameters (a large parameter count) 2. the amount of data (a large training corpus)
While the term Large refers only to size (parameters/data), there is no clearly defined threshold at which an LM becomes an LLM. We propose the following two criteria: 1. An LM is an LLM if it needs model parallelism during fine-tuning. 2. An LM is an LLM if it is trained on more than 1 trillion tokens.
Both criteria must be met for a model to be considered an LLM.
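The two criteria above can be sketched as a quick back-of-the-envelope check. The memory model here is a rough rule of thumb, not part of the definition: mixed-precision Adam fine-tuning needs on the order of 16 bytes per parameter (weights, gradients, and optimizer states), and 80 GB is one common accelerator size — both numbers are assumptions for illustration.

```python
# Illustrative check of the two proposed LLM criteria.
# Assumptions (not part of the definition itself):
#   ~16 bytes/parameter for Adam fine-tuning (weights + grads + optimizer states)
#   a single 80 GB accelerator as the model-parallelism threshold
BYTES_PER_PARAM = 16
GPU_MEMORY_BYTES = 80e9

def is_llm(n_params: float, n_train_tokens: float) -> bool:
    """Both criteria must hold: fine-tuning needs model parallelism,
    and the model was trained on more than 1 trillion tokens."""
    needs_model_parallelism = n_params * BYTES_PER_PARAM > GPU_MEMORY_BYTES
    enough_tokens = n_train_tokens > 1e12
    return needs_model_parallelism and enough_tokens

print(is_llm(3e9, 2e12))    # 3B params fit on one GPU -> False
print(is_llm(70e9, 1.4e12)) # 70B params, 1.4T tokens -> True
```

Under these assumptions a 3B-parameter model fails the first criterion even with plenty of training tokens, while a 70B-parameter model trained on 1.4T tokens meets both.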
Definition of LLM:
An LLM is a type of deep learning model that uses the transformer architecture to perform language tasks such as summarization, translation, or text generation, based on knowledge gained from massive datasets.
Are bigger LLMs always better?
Bigger and bigger models tend to work better and better. If you can just keep making models bigger and they keep getting better, then AI is just an engineering problem: figure out how to properly use your hardware accelerators and scale up.
But is this always the case? What is the limit?
In 2020, OpenAI published a paper showing that model performance depends mainly on model size, not model shape. Most importantly, the paper showed that model loss scales as a power law with model size. OpenAI concluded that model size was essentially the most important factor, and the field largely accepted this and raced to build bigger and bigger models.
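The power-law claim can be made concrete with a small numerical sketch. The constants below are the approximate fitted values reported in the 2020 OpenAI scaling-laws paper (Kaplan et al.) for loss as a function of non-embedding parameter count; treat them as rough, illustrative figures.

```python
# Sketch of the 2020 OpenAI scaling law for loss vs. model size:
#   L(N) = (N_c / N) ** alpha_N
# Approximate fitted constants reported by Kaplan et al. (2020):
ALPHA_N = 0.076
N_C = 8.8e13  # in non-embedding parameters

def loss(n_params: float) -> float:
    return (N_C / n_params) ** ALPHA_N

# A power law means each 10x in model size shrinks the loss by the
# same constant factor (10 ** ALPHA_N):
for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} params -> loss {loss(n):.3f}")
```

Note the diminishing returns: every 10x increase in parameters divides the loss by only about 1.19 under these constants.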
In 2022, however, DeepMind (an Alphabet subsidiary) showed that OpenAI's conclusion was incomplete. They published a paper introducing a new model called Chinchilla (blog post), in which they showed that while making bigger models was better, the impact of model size was strongly mediated by the amount of training data. Chinchilla showed how to pick the optimal number of parameters given a fixed budget of compute and data. It thereby demonstrated that simply making bigger and bigger models was suboptimal: Chinchilla achieved better performance than GPT-3 while being much smaller (70B vs. 175B parameters). Chinchilla also confirmed the power-law relationship between model performance and model/data size.
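The Chinchilla allocation can be sketched with two commonly cited rules of thumb (approximations, not the paper's exact fitted laws): training compute is roughly C ≈ 6·N·D FLOPs for N parameters and D tokens, and compute-optimal training uses roughly 20 tokens per parameter. Combining them gives N ≈ √(C/120).

```python
import math

# Back-of-the-envelope Chinchilla-style allocation.
# Assumed rules of thumb (approximate, not the paper's exact fits):
#   training compute C ~= 6 * N * D FLOPs  (N params, D tokens)
#   compute-optimal training: D ~= 20 * N  (about 20 tokens per parameter)
def compute_optimal(c_flops: float) -> tuple[float, float]:
    n = math.sqrt(c_flops / (6 * 20))  # solve C = 6 * N * (20 * N) for N
    d = 20 * n
    return n, d

# With a budget of roughly 5.76e23 FLOPs (about Chinchilla's reported
# training compute), this recovers ~70B parameters and ~1.4T tokens:
n_opt, d_opt = compute_optimal(5.76e23)
print(f"optimal params ~ {n_opt:.1e}, tokens ~ {d_opt:.1e}")
```

Plugging in GPT-3's 175B parameters shows it was trained on far fewer tokens than this rule would prescribe, which is exactly the sense in which the earlier super-big models were suboptimal.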
So, summarizing:
All the previous super-big models were suboptimal: they had too many parameters given their compute and data budgets, so smaller models could outperform bigger ones in some cases. Model performance does increase with more parameters/data/compute, but with diminishing returns to scale. LLMs also exhibit emergent abilities (new skills) as they get larger (slide deck).
Data size is likely to dominate model performance. Thus, a very relevant question is: How much data is there (at Google, in the world), and is there any way of generating more? For example, OpenAI’s Whisper is a highly robust speech-to-text model (likely) aimed at extracting more text tokens out of video data. Extracting text from video is one way that we could further scale the amount of data available to LLMs.