Are LLMs deterministic?
No. You can see for yourself: with the temperature parameter set to 0 (meaning you always sample the most likely token from the output distribution), you'd expect GPT-3.5 and GPT-4 to produce the same output every time you call them. However, they don't.
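Here's a minimal way to check, sketched with the openai Python client (v1-style interface); the model name and prompt are just placeholders, and you need an API key in your environment:

```python
# Minimal non-determinism check: call the same model twice with temperature=0
# and compare the outputs. Assumes the `openai` package (v1-style client) and
# OPENAI_API_KEY set in the environment; model and prompt are placeholders.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

prompt = "List five surprising facts about octopuses."
first, second = ask(prompt), ask(prompt)
print("identical" if first == second else "different")
```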
Why do LLMs produce different outputs across different runs? I've heard two plausible reasons -- the details of which I barely grasp but believe to be true.
First, race conditions in GPU floating-point operations. When the GPU runs calculations in parallel, the order in which partial results get combined can depend on which thread or branch finishes first. The order of operations does not matter for exact (arbitrary-precision) arithmetic, but it does for finite-precision (32-bit, 16-bit, or whatever) floating-point arithmetic, which is not associative. So in theory, the same forward pass can produce a very slightly different output distribution. If the top two tokens' likelihoods are very close, small differences in the log-prob values can make the LLM sometimes output one and sometimes the other. And since each token depends on the previous tokens, the remainder of the output can then diverge. Note that this can happen, in theory, with any model whose inference runs on a GPU.
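The crux is that floating-point addition is not associative, so a reduction that groups the same numbers differently (as a parallel one may, depending on which partial sums finish first) can give a slightly different answer. A toy illustration in numpy, with made-up numbers chosen so the effect is obvious at 32-bit precision:

```python
# Floating-point addition is not associative: the same three numbers summed
# with a different grouping give different float32 results. A GPU reduction
# that combines partial sums in whichever order threads finish does exactly this.
import numpy as np

a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)

print((a + b) + c)  # 1.0 -- the 1e8s cancel first, then the 1.0 survives
print(a + (b + c))  # 0.0 -- (-1e8 + 1.0) rounds back to -1e8 at float32 precision
```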
Second, sparse mixture-of-experts (MoE) models' capacity constraints across a batch. This blog post and its follow-up explain this in detail, but my hand-wavy understanding is this: the feedforward pass can take a different path through the compute graph every time, depending on which experts each token gets routed to. For practical reasons, the number of tokens routed to each expert is capped, and the cap applies across a whole batch. So if you send the same prompt twice but the other queries in each batch differ (which they almost certainly do on public APIs and shared inference servers), you may get different results.
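Here is a toy sketch of that routing behaviour (not how any real inference server implements it): top-1 routing with a per-expert token cap shared across the batch. The dimensions, capacity, and gating weights are all made up; the point is only that the same token's routing outcome depends on what else happens to be in its batch.

```python
# Toy top-1 MoE router with a per-expert capacity shared across the batch.
# Everything here (dims, capacity, gating weights) is made up for illustration;
# real systems differ, e.g. overflowing tokens may go to a second-choice expert.
import numpy as np

NUM_EXPERTS = 4
CAPACITY = 2            # max tokens each expert may accept from one batch
DIM = 8

rng = np.random.default_rng(0)
W_gate = rng.standard_normal((DIM, NUM_EXPERTS)).astype(np.float32)

def route(batch: np.ndarray) -> list[int]:
    """Assign each token its top-1 expert, or -1 if that expert is already full."""
    preferred = (batch @ W_gate).argmax(axis=1)
    load = np.zeros(NUM_EXPERTS, dtype=int)
    assignment = []
    for e in preferred:                 # tokens compete for slots in batch order
        if load[e] < CAPACITY:
            load[e] += 1
            assignment.append(int(e))
        else:
            assignment.append(-1)       # expert full for this batch
    return assignment

my_token = rng.standard_normal((1, DIM)).astype(np.float32)

# Put the same token at the end of many different batches: whether it reaches
# its preferred expert depends on the other queries it shares a batch with.
outcomes = set()
for _ in range(20):
    others = rng.standard_normal((7, DIM)).astype(np.float32)
    outcomes.add(route(np.vstack([others, my_token]))[-1])
print("routing outcomes seen for the same token:", outcomes)  # typically more than one
```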
This second reason applies to GPT-3.5, GPT-4, and Mixtral, but not, I think, to the Llama-2 family, since those are not MoE models. The explanation above implies that turning off batching would avoid the problem -- at the cost of much lower throughput, i.e. a higher cost per LLM query, a trade-off I suspect 99.99% of users would not want to make.
I am not sure about the relative importance of the two reasons. If I had to guess, I'd say the second one dominates, because non-determinism has not historically been a noticeable issue with most GPU-accelerated models.
Given this, how do you get deterministic LLM results?
1. Definitely set temperature=0, or, if you want non-greedy sampling, set the random seed (OpenAI has introduced an API parameter for this; see the sketch after this list).
2. and if you believe in the GPU race condition theory, then run the LLM on CPU
3. and if your model is a mixture of experts, don't batch requests (or always run every query as part of the exact same batch).
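For point 1, here is roughly what that looks like with the openai Python client; `seed` and the returned `system_fingerprint` are OpenAI's best-effort (not guaranteed) reproducibility mechanism, and the model name and prompt are again placeholders:

```python
# Sketch of point 1: greedy decoding plus a fixed seed. Assumes the `openai`
# package (v1-style client) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet in one paragraph."}],
    temperature=0,   # greedy sampling: always pick the most likely token
    seed=42,         # best-effort reproducibility, not a hard guarantee
)
print(resp.system_fingerprint)            # changes when the backend configuration changes
print(resp.choices[0].message.content)
```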