Making GPT API responses faster
GPT APIs are slow. Just in the past week, the OpenAI community has had 20+ questions about it. Not only is it rare for users to tolerate 30-second response times in any app; it is also extremely annoying to develop when even basic tests take several minutes to run.
(Before diving into these tricks, you might want to understand why LLM latency is linear in output token count, and see my measurements of GPT and Llama API response times.)
Most of the below also applies to Llama family models and any other LLM, regardless of how it is deployed.
0. Make fewer calls
This is not what you wanted to hear, but... if you can avoid making the API call at all, that's the biggest win. If building an Agent, try to reduce the number of loop iterations it takes to reach the goal. If building something else... can you get the same result with deterministic code, or an out-of-the-box fast NLP model, e.g. from AWS? Even if your app needs intelligence, perhaps some tasks are beneath GPT. Or perhaps you can make a first attempt with regex, then fall back to an LLM if needed?
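As an illustration of the regex-first idea, here is a minimal sketch. The order-ID task, the `ORD-12345` format, and the `extract_order_id` helper are made up, and the call uses the pre-1.0 `openai` Python SDK:

```python
import re
from typing import Optional

import openai  # pre-1.0 openai SDK assumed; expects OPENAI_API_KEY in the environment

# Hypothetical task: pull an order ID like "ORD-12345" out of a support message.
ORDER_ID_RE = re.compile(r"\bORD-\d{5}\b")

def extract_order_id(message: str) -> Optional[str]:
    # First attempt: cheap, deterministic, no API latency at all.
    match = ORDER_ID_RE.search(message)
    if match:
        return match.group(0)

    # Fallback: only pay for an LLM call when the regex comes up empty.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Extract the order ID from the message. Reply with the ID only, or NONE."},
            {"role": "user", "content": message},
        ],
        temperature=0,
        max_tokens=10,
    )
    answer = response.choices[0].message.content.strip()
    return None if answer == "NONE" else answer
```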
1. Output fewer tokens
If there's one thing you remember from this blog post, this is it. Output fewer tokens.
Due to the linearity of response time, you add 30-100ms of latency for every token you generate. For example, the paragraph you're reading right now has about 80 tokens, so it takes about 2.8 seconds to generate on GPT-3.5 or 7.5 seconds on GPT-4 (when hosted on OpenAI).
Counter-intuitively, that means you can have as many input tokens as you like with relatively little effect on latency. So if latency is crucial, use as much context as you like but generate as little as possible.
What are some easy ways to reduce output size?
The simplest is to make the model concise with "be brief" or similar in the prompt. This typically gives a roughly 1/3 reduction in output token count. Explicitly giving word counts has also worked for me: "Respond in 5-10 words."
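For concreteness, a minimal sketch of what that looks like in a chat call (pre-1.0 `openai` SDK assumed; the exact wording of the instruction is just one option):

```python
import openai  # pre-1.0 openai SDK assumed

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        # Both instructions push the model toward much shorter output;
        # tune the word count to your use case.
        {"role": "system", "content": "Be brief. Respond in 5-10 words."},
        {"role": "user", "content": "Why is the sky blue?"},
    ],
)
print(response.choices[0].message.content)
```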
If you are generating structured output (e.g. JSON), you can do even more:
- Consider whether you really need all the output. Test removing irrelevant values.
- Remove newlines and other whitespace from the output format if not relevant.
- Replace JSON keys with compressed versions. Instead of `{"comment": "cool!"}`, use `{"c": "cool!"}`. If needed, explain the semantics of the keys in the prompt (see the sketch after this list).
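Here is a sketch of what a compressed output schema can look like, with the key semantics spelled out in the prompt and expanded back client-side. The schema itself and the pre-1.0 `openai` SDK call are illustrative assumptions:

```python
import json

import openai  # pre-1.0 openai SDK assumed

# Short keys keep the generated output small; the prompt explains what they mean.
SYSTEM_PROMPT = (
    "Analyze the user's comment. Respond with JSON only, no whitespace, "
    'using these keys: "s" = sentiment (pos/neg/neu), "c" = a one-sentence summary.'
)
KEY_MAP = {"s": "sentiment", "c": "summary"}

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "The new release fixed every bug I cared about!"},
    ],
    temperature=0,
)

compact = json.loads(response.choices[0].message.content)
# Expand the short keys back into readable names for the rest of the app.
result = {KEY_MAP[k]: v for k, v in compact.items()}
```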
Another option is to avoid Chain-of-thought prompting. This may come with a severe quality penalty, because LLMs need space to think, but that's easy to evaluate offline. To soften the hit, you may want to put more examples into the prompt, in more detail.
2. Output only one token
This feels like a gimmick, but it can be really useful: if you output only one token, you're effectively doing classification with the LLM. With OpenAI's default `cl100k_base` tokenizer you can theoretically do up to 100,000-class classification this way, though prompt limits might get in the way.
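A minimal sketch of single-token classification, assuming the pre-1.0 `openai` SDK. The ticket-routing labels are made up; single ASCII letters are used because each is one token in `cl100k_base`:

```python
import openai  # pre-1.0 openai SDK assumed

LABELS = {"A": "billing", "B": "technical issue", "C": "other"}

def classify_ticket(text: str) -> str:
    # max_tokens=1 caps generation at a single token, so the latency is
    # essentially just prompt processing.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Classify the support ticket. Answer with exactly one letter: A = billing, B = technical issue, C = other."},
            {"role": "user", "content": text},
        ],
        max_tokens=1,
        temperature=0,
    )
    return LABELS.get(response.choices[0].message.content.strip(), "other")
```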
3. Switch from GPT-4 to GPT-3.5
Some applications don't need the extra capability that GPT-4 provides, so the ~3x speed-up you get is worth the lower output quality. In your app it might be feasible to send only a subset of all API calls to the best model; the simplest use cases might do equally well with 3.5. Alternatively, perhaps even Llama-2-7B is enough?
Either one, when fine-tuned, can do very well on limited-domain tasks, so it may be worth a shot.
4. Switch to Azure API
For GPT-3.5, Azure is 20% faster for the exact same query. Seems easy?
In theory, Azure GPT APIs are a drop-in replacement. In practice, there are some idiosyncrasies:
- Azure usage limits are lower and GPT-4 access takes longer to get (I was recently approved for GPT-4 after waiting several months).
- Azure doesn't support batching in the embedding endpoint (as of May 2023).
- Azure configuration is slightly different. It has the concept of a deployment, which OpenAI-hosted models do not have. In addition, you need to override some defaults such as the API base URL.
Lack of one-to-one API compatibility is not a problem if you're making HTTP calls directly. But if you rely on a third-party library like langchain or a tool like Helicone to make the calls for you, you might need to hack in support for Azure configuration. For us, it took an hour or two of debugging to understand which parameters to override where.
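For reference, this is roughly what the overrides look like with the pre-1.0 `openai` Python SDK; the resource URL, API version, and deployment name are placeholders for your own values:

```python
import openai

# Azure needs a different api_type, base URL and API version, plus a
# deployment name ("engine") instead of a model name.
openai.api_type = "azure"
openai.api_base = "https://my-resource.openai.azure.com/"  # your Azure OpenAI resource
openai.api_version = "2023-05-15"
openai.api_key = "..."  # the Azure key, not your OpenAI one

response = openai.ChatCompletion.create(
    engine="my-gpt35-deployment",  # the deployment you created in Azure
    messages=[{"role": "user", "content": "Hello!"}],
)
```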
5. Parallelize
This is obvious, but if your use case is embarrassingly parallel, then you know what you have to do. You may hit rate limits, but that's a separate worry, and for smaller models (e.g. GPT-3.5) it's not difficult to get rate limits bumped.
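Since the calls are I/O-bound, a plain thread pool is enough. A minimal sketch with the pre-1.0 `openai` SDK (the `summarize` helper and the worker count are arbitrary):

```python
import concurrent.futures

import openai  # pre-1.0 openai SDK assumed

def summarize(text: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"Summarize in one sentence: {text}"}],
    )
    return response.choices[0].message.content

documents = ["first document...", "second document...", "third document..."]

# Fan the independent calls out over a thread pool; keep max_workers
# comfortably below your rate limit.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    summaries = list(pool.map(summarize, documents))
```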
6. Stream output and use stop sequences
If you receive the output as a stream and display it to the user right away, the perceived speed of your app is going to improve. But that's not what we're after. We want actual speedup.
There is a way to use stop sequences to get faster responses in edge cases. For example, you could prompt the model to output `UNSURE` if it is not going to be able to produce a useful answer. If you also set this string as a stop sequence, generation is guaranteed to end the moment it appears, so those answers come back faster.
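A sketch of that pattern with the pre-1.0 `openai` SDK; note that stop sequences are not included in the returned text, so an unsure answer comes back empty:

```python
import openai  # pre-1.0 openai SDK assumed

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer the question using the context. If the context does not contain the answer, output UNSURE."},
        {"role": "user", "content": "Context: ...\n\nQuestion: ..."},
    ],
    stop=["UNSURE"],  # generation halts the moment this string appears
)

answer = response.choices[0].message.content.strip()
if not answer:
    # The stop sequence fired immediately: the model was unsure.
    answer = None
```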
If stop sequences (which are essentially just substring matching on OpenAI's side) are not enough, you could stream the answer and stop the query as soon as you receive a particular sequence by regex match or even some more complicated logic.
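If you need more than substring matching, a streaming variant might look like this (pre-1.0 `openai` SDK assumed; the abort pattern is just an illustration):

```python
import re

import openai  # pre-1.0 openai SDK assumed

# Stop reading as soon as the model starts refusing; the pattern is arbitrary.
ABORT_RE = re.compile(r"\b(I cannot|I'm unable)\b")

stream = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "..."}],
    stream=True,
)

answer = ""
for chunk in stream:
    answer += chunk.choices[0].delta.get("content", "")
    if ABORT_RE.search(answer):
        break  # stop pulling tokens; no point waiting for the rest
```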
7. Use microsoft/guidance?
I have not tried it myself, but the microsoft/guidance library lets you do "guidance acceleration". With it, you define a template and ask the model to fill in the blanks. When generating JSON, you could have it output only the values and none of the JSON syntax.
The impact of this will depend on the ratio of dynamic vs static tokens you are generating. If you are using an LLM to fill in one blank with 3 tokens in a 300-token document, it's going to be worth it (but you'd probably do better optimizing this ratio in other ways). The implied trade-off here is: should I have the LLM generate the static tokens (which is wasteful), or should I take the hit of another round-trip to the API (also wasteful)?
Overall I don't really recommend this approach in most cases, but I wanted to mention it as an option for some niche situations.
8. Fake it
Without changing anything about the LLM call itself, you can make the app feel faster even when the response time is exactly the same. I don't have much experience with this, so I'll only mention the most basic ideas:
- Stream the output. This is a common and well-established way to make your app feel faster.
- Generate in pieces. Instead of generating a whole email, generate it in pieces. If the prompt is "Write an email to Heiki to cancel our dinner", you might generate up to the first line break ("Hey Heiki,") and then wait for the user to continue. If the user continues, generate the next line ("I unfortunately cannot make it on Sunday."), and so on (see the sketch after this list).
- Pre-generate. If you don't actually need interactive user input, you can just magically pre-generate whatever response the user would get. This can get expensive very fast, but combined with premium pricing and clever heuristics around when to pre-generate, you may be able to make this work.
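One way to implement "generate in pieces" is to lean on stop sequences again. A sketch with the pre-1.0 `openai` SDK, where later pieces would be requested by appending the accepted text to the conversation:

```python
import openai  # pre-1.0 openai SDK assumed

# Generate only the first line of the email; later lines are requested
# one by one as the user keeps accepting suggestions.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Write an email to Heiki to cancel our dinner."}],
    stop=["\n"],   # cut generation at the first line break
    max_tokens=30,
)
first_line = response.choices[0].message.content  # e.g. "Hey Heiki,"
```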
As I discover new ways to optimize GPT speed I'll share more. If you have interesting tricks I should add, please send them to me!