
Experiment With Different LLM Models with an AI Gateway

Creating a high-performing LLM application requires finding the right model for your specific use case, and the best way to find it is to experiment with multiple LLM models. This is where an AI Gateway comes in, providing a universal API gateway for communicating with any LLM model. By combining an AI Gateway with the right application architecture, you can establish an iterative process to systematically improve your LLM application's performance.

Challenges with using multiple LLM models

There are hundreds of LLM models, each with a unique API signature. When building an application layer that can use multiple LLM models, some of the challenges developers run into include:

  • Defining a uniform API signature: You need a consistent way to communicate with different models
  • Load balancing: Most providers enforce rate limits, and you need creative solutions to work around them
  • Caching: LLM calls are expensive and slow; caching repeated requests reduces both cost and latency

An AI Gateway acts as a central service that tackles these challenges for you, giving you a single place to manage communication with different LLM models.

How to use an AI Gateway

One popular open-source AI Gateway is Portkey. After you have an AI Gateway up and running, you can easily communicate with different LLM models in your code with the following syntax:

import Portkey from 'portkey-ai';

// Client setup (placeholder API key); provider credentials are configured
// in your gateway rather than in application code.
const portkey = new Portkey({ apiKey: process.env.PORTKEY_API_KEY });

// OpenAI
const openaiResponse = await portkey.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello' }],
  model: 'gpt-4',
});

// Anthropic
const anthropicResponse = await portkey.chat.completions.create({
  messages: [{ role: 'user', content: 'Hello' }],
  model: 'claude-3-opus-20240229',
});

The benefit here is that you have a uniform way to communicate with any model, and Portkey handles a lot of the heavy lifting involved in connecting to each model, load balancing, caching, and more.
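
As an example of that heavy lifting, Portkey lets you describe routing behavior declaratively in a gateway config attached to the client. The sketch below is illustrative rather than a drop-in setup: the virtual key names are placeholders for keys configured in your Portkey account, and the exact config options depend on your providers.

import Portkey from 'portkey-ai';

// Illustrative config: split traffic 70/30 across two provider keys and
// cache repeated requests for 60 seconds. Key names are placeholders.
const portkeyWithConfig = new Portkey({
  apiKey: process.env.PORTKEY_API_KEY,
  config: {
    strategy: { mode: 'loadbalance' },
    targets: [
      { virtual_key: 'openai-key-primary', weight: 0.7 },
      { virtual_key: 'openai-key-secondary', weight: 0.3 },
    ],
    cache: { mode: 'simple', max_age: 60 },
  },
});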

Swap models with feature-flags

You are often not only testing different LLM models, but also different prompts, RAG datasets, or completely different architectures. A common pattern for swapping these different components in your application is feature-flagging. If you are building your application with Palico, you can create a feature-flag for your model using the appConfig property:

interface AppConfig {
  model: string;
}

const handler: Chat<unknown, AppConfig> = async ({
  userMessage,
  appConfig,
}): Promise<ChatHandlerResponse> => {
  const response = await portkey.chat.completions.create({
    messages: [{ role: 'user', content: userMessage }],
    // FEATURE-FLAG-EXAMPLE: the model comes from appConfig instead of being hardcoded
    model: appConfig.model,
  });

  return {
    message: response.choices[0].message.content,
  };
};

You can then call your application with different models as follows:

await palico.agent.chat({
  name: "math_agent",
  userMessage: "What's 2+2?",
  appConfig: {
    model: "gpt-4o-mini",
  },
});
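
Because the model is just a flag, you can also sweep the same input across several models in one script. The sketch below is a rough illustration and assumes the chat response exposes the message field returned by the handler:

// Hypothetical comparison loop: run the same question through several models.
const models = ['gpt-4o-mini', 'claude-3-opus-20240229'];

for (const model of models) {
  const response = await palico.agent.chat({
    name: 'math_agent',
    userMessage: "What's 2+2?",
    appConfig: { model },
  });
  console.log(`[${model}]`, response.message);
}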

Feature-flagging helps you create a modular application layer so that you can easily experiment with different variations of your application.

Evaluating performance of different models

How can you know whether switching from one model to another had a positive impact on your application's performance? This requires a two-pronged approach: qualitative evaluation and quantitative evaluation.

Qualitative evaluation involves having someone manually check the responses from your application across a list of test-cases and use their best judgement to determine whether the performance of the application has improved.

Quantitative evaluation involves creating a list of test-cases that model the expected behavior of your application. For example, in Palico you can create a test-case using the following syntax:

{
  input: {
    userMessage: "What is the capital of France?",
  },
  tags: {
    intent: "historical_accuracy",
  },
  metrics: [
    containsEvalMetric({
      substring: "Paris",
    }),
    levensteinEvalMetric({
      expected: "Paris",
    }),
    rougeSkipBigramSimilarityEvalMetric({
      expected: "Paris",
    }),
  ],
}

You can create these test-cases using any evaluation framework. Once you have a large number of test-cases defined, you can aggregate the results to get a more accurate picture of what's working and what's not.
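
As a rough sketch of what that aggregation can look like (the result shape here is hypothetical; adapt it to whatever your evaluation framework reports), you can average each metric across all test-cases and compare the averages between models:

// Hypothetical shape of an evaluation run's output.
interface TestCaseResult {
  tags: Record<string, string>;
  metrics: Record<string, number>; // metric name -> score
}

// Average each metric across all test-cases so model A vs. model B
// can be compared at a glance.
function aggregateMetrics(results: TestCaseResult[]): Record<string, number> {
  const totals: Record<string, { sum: number; count: number }> = {};
  for (const result of results) {
    for (const [name, score] of Object.entries(result.metrics)) {
      totals[name] ??= { sum: 0, count: 0 };
      totals[name].sum += score;
      totals[name].count += 1;
    }
  }
  return Object.fromEntries(
    Object.entries(totals).map(([name, { sum, count }]) => [name, sum / count])
  );
}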

Combining qualitative and quantitative evaluation will help you iteratively improve the performance of your application over time.

Next Steps

Improving the performance of your LLM application is all about being able to experiment with many different combinations of components in your application layer. Combining an AI Gateway with feature-flagging and evaluation significantly expands the experiment-ability of your application. If you are looking for a framework that maximizes experiment-ability, check out Palico.