About stochasticity in LLMs

Gen AI
Autoregression, Generation and Autoencoding differences, as well as determinism vs stochasticity in LLMs
Author

Rafa Sanchez

Published

February 22, 2023

Seq2Seq models

A sequence-to-sequence (Seq2Seq) model takes a sequence (for example, text) as input, converts it into a high-dimensional representation, and outputs another sequence, possibly of another kind. Sequence-to-sequence learning has many applications.

Sequence to sequence learning has been successful in many tasks such as machine translation, speech recognition (…) and text summarization (…) amongst others.

Gehring et al. (2017)
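The encode-then-decode pattern can be sketched with a deliberately toy example (everything here is hypothetical; a real Seq2Seq model would produce a learned high-dimensional vector, not a Python list):

```python
def encode(tokens):
    # Toy "encoder": fold the whole input sequence into one state object.
    # A real model would produce a high-dimensional representation here.
    return list(tokens)

def decode(state):
    # Toy "decoder": emit an output sequence from the state, one token at
    # a time, until the state is exhausted. The output tokens are of
    # "another kind" (here, uppercased).
    out = []
    while state:
        out.append(state.pop().upper())
    return out

def seq2seq(tokens):
    return decode(encode(tokens))
```

The input sequence never maps token-for-token to the output; everything flows through the intermediate representation.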

Autoregressive models

An autoregressive model is a model that produces a prediction based on its own previous predictions. This is typically what a language model does, i.e., predicts the next word (token) based on the previously predicted tokens.

In a Transformer architecture, autoregression typically lives in the decoder, which predicts tokens one at a time, each conditioned on the tokens generated so far.

So, models like T5 (and MUM, which uses a similar architecture), GPT-3 and LaMDA are all autoregressive.
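The autoregressive loop can be sketched in a few lines. The `next_token` function below is a hypothetical stand-in for a real model's forward pass; the only point is that each prediction is fed back as input for the next one:

```python
def next_token(tokens):
    # Hypothetical stand-in for a model forward pass: deterministically
    # derive the "most likely" next token from the prefix.
    vocab = ["the", "cat", "sat", "on", "mat", "<eos>"]
    score = sum(len(t) for t in tokens)
    return vocab[score % len(vocab)]

def generate(prompt, max_steps=4):
    tokens = list(prompt)
    for _ in range(max_steps):
        tok = next_token(tokens)  # prediction based on previous predictions
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens
```

Because each step conditions on everything generated before it, the output grows one token at a time rather than in a single forward pass.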

Generative models

GAN-based generative models are usually not autoregressive: they generate an image in a single forward pass rather than iteratively, step by step. So GAN-based models are usually not paired with autoregressive generation, but they can be; Parti, for example, is an autoregressive image generation model.

Autoencoding models

Autoregressive models are very good when you need to generate text. However, there is a task, Natural Language Understanding (NLU), that does not benefit much from either autoregressive models or Seq2Seq models.

Although Seq2Seq models are required to understand language, they use this understanding to perform a different task (usually translation). Natural Language Generation tasks, and hence autoregressive models, do not necessarily require understanding language, as long as generation can be performed successfully.

Autoencoding models can help here.

The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”. Along with the reduction side, a reconstructing side is learnt, where the autoencoder tries to generate from the reduced encoding a representation as close as possible to its original input, hence its name.
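A minimal sketch of that idea, using nothing beyond plain Python (the data, weights and learning rate below are all illustrative choices): 2-D points lying on a line are squeezed through a 1-D bottleneck (the reduced encoding), and trained by gradient descent on the squared reconstruction error to reproduce the input.

```python
import random

random.seed(0)
# Toy data: 2-D points on the line x2 = 2 * x1, so a 1-D code suffices.
data = [(t, 2.0 * t) for t in (random.uniform(-1, 1) for _ in range(200))]

# Encoder (2 -> 1) and decoder (1 -> 2) weights.
w1, w2 = 0.1, 0.1      # encoder
v1, v2 = 0.1, 0.1      # decoder
lr = 0.05

for _ in range(500):
    for x1, x2 in data:
        h = w1 * x1 + w2 * x2        # encode to the 1-D bottleneck
        r1, r2 = v1 * h, v2 * h      # decode (reconstruct)
        e1, e2 = r1 - x1, r2 - x2    # reconstruction errors
        # Gradient descent on the squared reconstruction error.
        dh = e1 * v1 + e2 * v2
        v1 -= lr * e1 * h
        v2 -= lr * e2 * h
        w1 -= lr * dh * x1
        w2 -= lr * dh * x2

recon_error = sum(
    (v1 * (w1 * x1 + w2 * x2) - x1) ** 2 + (v2 * (w1 * x1 + w2 * x2) - x2) ** 2
    for x1, x2 in data
) / len(data)
```

After training, the network reconstructs its input almost exactly from a single number per point: the encoding has captured the structure of the data.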

Autoregressive vs autoencoding depends on the task and training, not on the architecture

Whether a model is Seq2Seq, autoregressive or autoencoding does not depend on the architecture. It depends on the task and the type of training.

If autoencoder models learn an encoding, why can autoregressive models then be used for fine-tuning as well?

The answer is simple. The decoder segment of the original Transformer, traditionally used for autoregressive tasks, can also be used for autoencoding (though, given the causal masking in that segment, it may not be the smartest choice). The same is true of the encoder segment and autoregressive tasks. So what makes a model belong to a particular type? As stated above: the task and the type of training, not the architecture itself.
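One concrete way to see this: the same attention layer becomes "autoregressive" or "autoencoding" purely through its attention mask. A causal mask lets position i attend only to positions up to i (decoder-style), while a full mask lets every position attend everywhere (encoder-style, as in masked-language-model training). A small sketch:

```python
def attention_mask(n, causal):
    # mask[i][j] == 1 means position i may attend to position j.
    if causal:
        # Decoder-style: each token sees only itself and earlier tokens.
        return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
    # Encoder-style: every token sees the whole sequence.
    return [[1] * n for _ in range(n)]
```

The weights and layers can be identical in both cases; only the mask (and hence the training objective it enables) differs.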

Determinism vs stochasticity of LLM models

A typical ARIMA model is autoregressive, but once training is done, its output for a given input will always be the same. LLMs like Bard and ChatGPT need to be deterministic for regulated industries (like finance). How can we achieve that? And where does the stochasticity in the responses that Bard and ChatGPT generate come from?

In the case of autoregressive models, some decoding algorithms (such as greedy decoding or beam search) are deterministic with respect to the model, while others sample randomly. Even with randomly sampled responses, you can make the output deterministic by fixing the random seed. However, the deterministic decoders used in LLMs are complex enough that there is ample opportunity to manipulate them too (see the references below).
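The contrast can be sketched with a toy decoder. `next_token_probs` below is a hypothetical stand-in for a model forward pass; greedy decoding is deterministic by construction, while sampling is stochastic unless the seed is fixed:

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]

def next_token_probs(tokens):
    # Hypothetical stand-in for a model forward pass: a deterministic
    # probability distribution over the vocabulary, given the prefix.
    score = sum(len(t) for t in tokens)
    raw = [(score * (i + 2)) % 7 + 1 for i in range(len(VOCAB))]
    total = sum(raw)
    return [r / total for r in raw]

def decode(prompt, steps=5, greedy=True, seed=None):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(steps):
        probs = next_token_probs(tokens)
        if greedy:
            # Greedy decoding: always take the argmax -- fully deterministic.
            tok = VOCAB[probs.index(max(probs))]
        else:
            # Sampling: stochastic in general, but reproducible once the
            # random seed is fixed.
            tok = rng.choices(VOCAB, weights=probs, k=1)[0]
        tokens.append(tok)
    return tokens
```

Running the sampled variant twice with the same seed yields identical output; with different (or no) seeds, the responses can differ run to run, which is exactly the stochasticity users observe.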

Deep learning models can also be forced to be deterministic at the operation level. This means that if an op is run multiple times with the same inputs on the same hardware, it will produce exactly the same outputs each time, which is useful for debugging models. Note that determinism generally comes at the expense of performance, so your model may run slower when op determinism is enabled.
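In TensorFlow, for example (assuming TF 2.9 or later; other frameworks have analogous switches, such as `torch.use_deterministic_algorithms` in PyTorch), enabling this looks like:

```python
import tensorflow as tf

# Seed the Python, NumPy and TensorFlow RNGs in one call.
tf.keras.utils.set_random_seed(42)

# Force TensorFlow ops to use deterministic kernels. This may be slower,
# and ops without a deterministic implementation will raise an error.
tf.config.experimental.enable_op_determinism()
```

With both the seeds and op determinism set, repeated runs of the same model on the same hardware produce bit-identical results.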

References

[1] Wikipedia’s article, Nothing-up-my-sleeve number, has a good discussion of manipulated seeds in particular. Especially the section on Limitations: “Bernstein and coauthors demonstrate that use of nothing-up-my-sleeve numbers as the starting point in a complex procedure for generating cryptographic objects, such as elliptic curves, may not be sufficient to prevent insertion of back doors. If there are enough adjustable elements in the object selection procedure, the universe of possible design choices and of apparently simple constants can be large enough so that a search of the possibilities allows construction of an object with desired backdoor properties.”
[2] Bernstein, Daniel J., Tung Chou, Chitchanok Chuengsatiansup, Andreas Hülsing, Eran Lambooij, Tanja Lange, Ruben Niederhagen, and Christine van Vredendaal. 2015. “How to Manipulate Curve Standards: A White Paper for the Black Hat http://bada55.cr.yp.to.” In Security Standardisation Research, 109–39. Springer International Publishing. https://doi.org/10.1007/978-3-319-27152-1_6
[3] Goldwasser, Shafi, Michael P. Kim, Vinod Vaikuntanathan, and Or Zamir. 2022. “Planting Undetectable Backdoors in Machine Learning Models.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/2204.06974