Things I've underestimated - Sep 2023

After attending the Ray Summit in San Francisco this week, I realized I had previously discounted several interesting things. Here's what I now want to explore more.

Semantic Kernel

I've gotten so used to LangChain that I haven't really considered switching, all the while loving to hate it. While I was ranting to someone about the problems of taking LangChain to production, he recommended I check out Microsoft's Semantic Kernel, which according to him offers a nicer and more reliable developer experience.

Anyscale & Ray

Given that the conference is centered around Ray (an open-source framework) and sponsored primarily by Anyscale (the company behind Ray), it is no surprise I came away impressed. However, in addition to general intrigue about Ray, there is a specific Anyscale product I want to use more: fine-tuning endpoints. These are in preview (meaning there's a waitlist), but I'm sure a general release is coming soon.

Fine-tuning Llama-2 family models (which is what the announcement was about) is possible but takes annoying infra & compute work when done from scratch. Reducing this effort from "ugh" to "cool, just an API call" might make developers fine-tune 10x or 100x more often. Obviously this has to compete with (and actually is a drop-in replacement for!) OpenAI's fine-tuning endpoints, including the just-released GPT-3.5-turbo fine-tuning service and the upcoming GPT-4 fine-tuning service, but Anyscale seems likely to become the leader on simplicity and cost once this fully launches.

Open models in general

(I think we should call them "open-weight models", or just "Open LLMs"? Because the source is often released anyway -- and what matters is the license, and availability of weights?)

Related to the above, Anyscale seems to be lowering the barrier to relying on Llama-2, which is a very welcome development. Fine-tuning is an important upgrade after you've exhausted prompting and retrieval-augmented generation and are still not at the desired level of quality. Easier fine-tuning means the value/effort calculation is now more favorable for open models, and so on the margin, these will get more use.

One tentative point in support of this comes from an experiment Anyscale did, in which a fine-tuned Llama-2-7B matches vanilla GPT-4's performance on a SQL-generation task. I'm not fully convinced of the generality of this result, but it's promising for LLMs anyway, especially given that the 7B Llama gives faster responses and is >100x cheaper to run inference on, compared to GPT-4.

Another reason to believe this is that "model routing" -- the idea of routing an LLM call to the cheapest/fastest feasible LLM at runtime -- creates a clear niche for open models. Whatever your domain, it's plausible that a large chunk (say 80%) of requests are simple enough for a Llama to handle, and you can delegate the rest to a high-quality, slow, expensive model like GPT-4. This assumes you can route with decent accuracy at runtime, but this classification problem seems feasible.
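
To make the routing idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the model names, the relative cost units, and especially the `looks_simple` heuristic, which stands in for whatever classifier you would actually train. It makes no real LLM calls; it only shows the shape of the decision and how the average cost moves.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Route:
    name: str
    cost_per_request: float  # made-up relative cost units, not real prices


# Hypothetical targets: a cheap open model and an expensive strong model.
CHEAP = Route("small-open-model", 1.0)
STRONG = Route("large-proprietary-model", 100.0)


def looks_simple(prompt: str, max_words: int = 40) -> bool:
    """Toy stand-in for a learned classifier: short prompts with no
    'hard' marker words are treated as simple. Purely illustrative."""
    hard_markers = ("prove", "step by step", "refactor", "multi-step")
    text = prompt.lower()
    short_enough = len(text.split()) <= max_words
    return short_enough and not any(marker in text for marker in hard_markers)


def route(prompt: str, is_simple: Callable[[str], bool] = looks_simple) -> Route:
    """Pick the cheapest model that the classifier thinks is good enough."""
    return CHEAP if is_simple(prompt) else STRONG


def average_cost(prompts: Iterable[str]) -> float:
    """Mean cost per request when every prompt goes through the router."""
    routes = [route(p) for p in prompts]
    return sum(r.cost_per_request for r in routes) / len(routes)


if __name__ == "__main__":
    sample = [
        "What is the capital of France?",
        "Summarize this sentence in five words.",
        "Translate 'hello' to Spanish.",
        "List three fruits.",
        "Prove step by step that the algorithm terminates.",
    ]
    for prompt in sample:
        print(f"{route(prompt).name:>24}  <-  {prompt}")
    print("average cost per request:", average_cost(sample))
```

With these made-up costs, sending four of five requests to the cheap model brings the average to 20.8 units per request instead of 100. The hard part in practice is the classifier, since every misrouted request costs you either quality or money.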

Agents

In our experiments, the Achilles heel of Agents is that current LLMs (including GPT-4) are not good enough at selecting the correct Tool at every iteration. I've heard the same from others.
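
A back-of-the-envelope way to see why "at every iteration" matters so much: if each tool choice is right with probability p, and you assume (simplistically) that steps fail independently and that one wrong choice sinks the task, then an n-step task succeeds with probability p^n. The numbers below are illustrative, not measurements from our experiments.

```python
def task_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Chance that every one of `steps` tool selections is correct,
    assuming independent steps and no recovery from a wrong choice."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    for accuracy in (0.90, 0.95, 0.99):
        for steps in (5, 10, 20):
            p = task_success_probability(accuracy, steps)
            print(f"per-step {accuracy:.2f}, {steps:>2} steps -> {p:.1%} end-to-end")
```

Under these assumptions, 90% per-step accuracy leaves only about a 35% chance of getting through a 10-step task, which is why even modest gains in tool selection matter a lot for agents.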

Gorilla is a Llama-family model fine-tuned for making API calls, enhanced with retrieval-augmented generation. It makes decent progress on the exact problem that needs solving to make agents reliable, so I'm now more bullish on agents.

LlamaIndex

This is only barely related to the conference, but it seems that LlamaIndex has homed in on what IMO is an underinvested problem in LLM apps: retrieval. With perfect retrieval the LLM's job is very easy, and with bad retrieval it is often impossible for the LLM to do the job we expect of it. I last used LlamaIndex more than 6 months ago (which is an eternity in this space right now) and want to understand where the library has gone since.