Named one of the top 100 AI companies by CB Insights

Conversational chatbot

An early-stage startup building a chatbot came to us looking to increase margins. We reduced their costs by 73.4% by routing across different models while still performing on par with their existing state-of-the-art LLM.
PARTNER DETAILS

One of the top 5 most-used AI chat companies, serving billions of tokens.

GOALS

→ Improve quality while maintaining unit economics of a 70% margin

→ Maximize number of conversions from free plan to paid plans

RESULTS

85% cost reduction compared to GPT-4 Turbo

60% of GPT-4 Turbo's quality improvement over GPT-3.5

DETAILS

Martian collaborated with one of the 5 most-used AI chatbot companies to route each query to the best LLM. The company was previously using GPT-3.5 Turbo for all requests and wanted to outperform other leading chatbots by providing higher quality than any other chat application on the market while staying within its target 70% margin. GPT-4 Turbo, OpenAI’s flagship model, is currently the single best chat model, but it is 20x as expensive as GPT-3.5 Turbo. With routing, we got 60% of the way to GPT-4 Turbo’s quality improvement at only 15% of its cost (given the 20x price gap, roughly 3x the cost of GPT-3.5 Turbo). This made their free-plan chatbot the top-performing free AI chatbot on the market while keeping them within their unit economics.

Challenges and Goals

The AI chatbot company faced a significant challenge: maintaining high-quality responses while managing the costs of its free offering. The goals were to:

  • Optimize Performance: Improve the quality of chatbot responses to enhance customer satisfaction.
  • Cost Reduction: Reduce operational costs for delivering AI-driven responses.
  • Increase Conversions: Drive higher conversion rates from free to paid plans by delivering superior value.
  • Subscription Model Optimization: Develop differentiated offerings to ensure profitability across various customer tiers.

Implementation

Routing System Integration

Martian’s routing technology was integrated to dynamically select the most cost-effective AI model that could deliver high-quality responses. This system allowed the chatbot to operate with enhanced efficiency, balancing quality and cost effectively.
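As a rough illustration, a cost-aware router can be sketched as follows: try candidate models from cheapest to most expensive and pick the first one whose predicted quality clears a threshold. The model list, prices, threshold, and predict_quality heuristic below are hypothetical placeholders, not Martian’s actual system.

```python
# Hypothetical sketch of cost-aware routing; model names, prices,
# threshold, and the quality predictor are illustrative placeholders.

# (model, cost per 1M input tokens in USD; illustrative figures)
MODELS = [
    ("claude-3-haiku", 0.25),
    ("gpt-3.5-turbo", 0.50),
    ("gpt-4-turbo", 10.00),
]

QUALITY_THRESHOLD = 0.65  # minimum acceptable predicted quality


def predict_quality(model: str, query: str) -> float:
    """Placeholder for a learned quality predictor trained on judged outputs."""
    base = {"claude-3-haiku": 0.60, "gpt-3.5-turbo": 0.70, "gpt-4-turbo": 0.95}
    penalty = 0.1 if len(query) > 500 else 0.0  # treat long queries as harder
    return base[model] - penalty


def route(query: str) -> str:
    """Return the cheapest model predicted to answer well enough."""
    for model, _cost in sorted(MODELS, key=lambda m: m[1]):
        if predict_quality(model, query) >= QUALITY_THRESHOLD:
            return model
    return "gpt-4-turbo"  # fall back to the strongest model


print(route("What is 2 + 2?"))        # gpt-3.5-turbo
print(route("Prove that ... " * 50))  # gpt-4-turbo (long query, harder)
```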

Measuring Quality and Performance

Martian established a rigorous process to measure the performance of chatbot responses using a judge-model that simulates human preferences. This approach enabled precise quality assessments, facilitating the optimization of responses delivered to users.

How we measure quality

Most benchmarks fail to actually measure the quality of model outputs. They often rely on simplistic metrics, or on synthetically constructed data that is far simpler than real-world user data. So how can we confidently measure the performance of different models on the customer’s use case?

Measuring Quality For General Queries

The ultimate arbiter of whether an output is good or bad is the user. Ideally, we could get our users to rank all the outputs from all the models in order to determine what users care about. However, this is prohibitively expensive.

Instead, we want a reliable stand-in for the user: a way to tell when one LLM’s output is higher quality than another’s.

To simulate user preferences, we created a judge-model that mimics the preferences of human annotators. As long as the judge-model expresses the same preferences as human annotators, this lets us annotate quality at scale. To validate this, we manually annotated a sample of prompts, then computed the correlation between our rankings of model outputs and the rankings from the judge-model. We observed a strong correlation (ρ=0.8), giving us confidence that the model judgements are accurate.
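That validation step can be sketched in a few lines: score the same sample of outputs with humans and with the judge-model, then compute Spearman’s ρ between the two. The scores below are illustrative; only the method follows the text.

```python
# Validate the judge-model against human annotations via Spearman's rho.
# Scores are illustrative; one score per (prompt, output) pair.
from scipy.stats import spearmanr

human_scores = [3, 1, 4, 2, 5, 2, 4, 1]
judge_scores = [3, 2, 4, 1, 5, 2, 4, 1]

rho, p_value = spearmanr(human_scores, judge_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A rho near the 0.8 reported above justifies using the judge at scale.
```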

Establishing A Baseline

With a judge-model, we can compare the performance of any two models on a given prompt. This lets us compare gpt-4-turbo (the current best-in-class model) with gpt-3.5-turbo (the model currently used by the customer).

This comparison provides a baseline that helps us interpret how well the routing system performs. The judge-model tells us that one model’s output is better than another’s, but its raw scores are not easily interpretable: how much better, for example, is a score that is 5% higher? Knowing that the difference between 4-turbo and 3.5-turbo is substantial, we can use that gap as a yardstick and measure all other differences as a percentage of it.
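Concretely, the normalization works like this (the judge scores below are illustrative, not the customer’s actual numbers):

```python
# Express the router's judge-score gain as a fraction of the
# gpt-4-turbo vs gpt-3.5-turbo gap. Scores are illustrative averages.
score_gpt35 = 0.62   # baseline: the customer's previous model
score_gpt4t = 0.88   # best-in-class reference
score_router = 0.78  # routed system

gap = score_gpt4t - score_gpt35
relative_gain = (score_router - score_gpt35) / gap
print(f"Router captures {relative_gain:.0%} of the 3.5 -> 4-turbo gap")
# With these illustrative numbers: ~62%, in line with the ~60% reported.
```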

Results and Outcomes

Measuring Routing Results On Objective Topics

Many of the customer’s users are looking for help with educational material. As a result, their questions tend to have objective answers. We can confirm that the router works well by examining our performance on those questions.

To do so, we separated out math and physics questions from all the user requests (about 7% of total request volume). We then identified an answer to each question by consensus voting among multiple generations from three large models: gpt-4-turbo, gpt-4, and claude-3-opus. If the majority of the generations across all the models were equivalent, as judged by a judge-llm (ρ=0.99), we took that to be the correct answer to the question.
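Here is a sketch of that consensus-voting step, assuming a judge_equivalent interface that stands in for the judge-llm equivalence check (the real check is an LLM comparison, not string matching):

```python
# Hypothetical sketch of consensus voting across pooled generations from
# gpt-4-turbo, gpt-4, and claude-3-opus. judge_equivalent is an assumed
# interface; the real equivalence check is a judge-llm (rho = 0.99).

def judge_equivalent(a: str, b: str) -> bool:
    # Placeholder for the judge-llm equivalence check.
    return a.strip().lower() == b.strip().lower()


def consensus_answer(generations: list[str]) -> str | None:
    """Return an answer a strict majority of generations agree on, else None."""
    clusters: list[list[str]] = []
    for g in generations:
        for cluster in clusters:
            if judge_equivalent(g, cluster[0]):
                cluster.append(g)
                break
        else:
            clusters.append([g])
    best = max(clusters, key=len)
    return best[0] if len(best) > len(generations) / 2 else None


answers = ["x = 4", "x = 4", "X = 4 ", "x = 3"]  # pooled generations
print(consensus_answer(answers))  # "x = 4" (3 of 4 agree)
```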

This improvement can also be illustrated clearly through a series of examples. See examples 3 & 4 in the “Examples” section at the end of this document.

Overall results

  • Quality Improvements: The routing system captured 60% of GPT-4-turbo’s quality improvement over GPT-3.5-turbo.
  • Cost Efficiency: Achieved an order-of-magnitude cost reduction compared to previous models by utilizing Martian’s routing capabilities.
  • Conversion Optimization: By improving response quality during critical interactions, user satisfaction increased, leading to a higher conversion rate from free to paid plans.

Business Impact

  • Driving Usage through Enhanced Engagement: By strategically deploying high-quality responses at moments of high user engagement, the Martian router effectively increased overall platform usage.
  • Cost Management: The use of a cost-effective routing solution allowed the company to maintain a competitive free offering while reducing operational costs.
  • Increasing Conversions: By strategically offering high-quality responses at key interaction points, the company successfully encouraged users to switch to paid plans.

Conclusion

The integration of Martian’s routing technology dramatically improved the AI chatbot company’s ability to deliver high-quality, cost-effective services. This case study demonstrates the power of advanced AI routing systems in enhancing user satisfaction, optimizing costs, and ultimately driving business growth through increased conversion rates. Companies looking to leverage similar technology can see substantial benefits in customer engagement and profitability.