Approximating Human Preferences Using a Multi-Judge Learned System

This month we are highlighting research projects from the recent Apart x Martian Mechanistic Router Interpretability Hackathon. Last week we featured Guardian Loop, a pre-inference safety filter.

Today’s project is Approximating Human Preferences Using a Multi-Judge Learned System.

The Project in a nutshell

Can a single human-generated quality score be predicted from a set of LLM-generated single-dimension quality scores?

The gold standard for subjective quality assessment of LLM output is human labelling. Unfortunately, this doesn’t scale well. This is why LLM-as-judge has become a popular approach to quality assessment. Yet a gap remains; LLM judges are not perfect replacements for human judges.

There are two main challenges to bridging the gap between human and LLM judges:

  • Making a dataset that models the diversity of preferences that would be observed in human-generated preference data.
  • Learning an interpretable reward signal for the preference data. That is, we want to know how much safety, factual correctness, or any other quality criterion weighed in the final rating.

The project team we are highlighting today — Jose Faustino, Eitan Sprejer, Fernando Avalos, and Augusto Bernardi — tries to make progress in these two areas with their hackathon project. To address the first, they simulate human feedback with LLM judges randomly selected for each judgement from a pool of personas. To address the second, they aggregate multiple LLM judges focused on different aspects of quality and train a model to predict human judge scores based on the aggregated LLM scores.

What are human judges doing?

A human judge, considering the output of some LLM (or, really, any sort of writing), is implicitly evaluating along multiple dimensions of quality. Is the text factually correct? Is it interesting? Is it grammatically correct? Is it clear? Is it compelling? Some of these evaluation dimensions are known to the human evaluators (indeed, they may have a clear rubric enumerating them), but others are hidden or unconscious.

Humans are also (implicitly) weighing these different criteria against each other, prioritizing them in sometimes opaque ways. For example — is it more important that the writing is clear, or interesting?

How could you reproduce human evaluation behavior?

You could — and many have — attempt to enumerate a complex, multi-dimensional rubric in your prompt, along with a detailed description of a particular human evaluator persona.

Or you could break your complex, multi-dimensional rubric into a set of simple, single-dimension rubrics and aggregate the scores from those evaluations.

After that you could — and, again, many have — take an average of those evaluations as a final score. You could even weight the average, based on your sense (or business requirements) that some metrics are more or less important (factual correctness is probably more important than, say, how interesting a response is).

Or you could use multi-variable regression.
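
To make the contrast concrete, here is a minimal sketch with made-up numbers: a hand-weighted average of three single-dimension judge scores next to a linear regression that learns the weights from example human ratings. The dimensions, scores, and weights are illustrative, not taken from the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical scores from three single-dimension judges
# (factual correctness, clarity, interestingness) for four responses,
# plus the overall human ratings we would like to reproduce.
judge_scores = np.array([
    [9.0, 7.0, 4.0],
    [6.0, 8.0, 9.0],
    [3.0, 5.0, 8.0],
    [8.0, 9.0, 6.0],
])
human_ratings = np.array([8.0, 7.0, 4.5, 8.5])

# Option 1: a hand-weighted average, with weights chosen by judgment
# (factual correctness counts more than how interesting the text is).
hand_weights = np.array([0.5, 0.3, 0.2])
weighted_avg = judge_scores @ hand_weights

# Option 2: multi-variable regression, which learns the weights
# (and an intercept) from example human ratings.
reg = LinearRegression().fit(judge_scores, human_ratings)

print("hand-weighted averages:", weighted_avg)
print("learned weights:       ", reg.coef_)
print("regression predictions:", reg.predict(judge_scores))
```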

In other words, it’s a two-step problem

Here’s what the team wants:

$$x \;\longrightarrow\; y$$

where $x = (\text{prompt}, \text{response})$ and $y$ is the human rating.

They turn this into a two-step prediction problem:

  1. Judge step:

$$x \;\longrightarrow\; \{ s_1, s_2, \dots, s_m \}$$

where $s_i$ is the score from the $i$th single dimension judge.

  2. Aggregation step:

$$\hat{y} = f(s_1, s_2, \dots, s_m)$$

where $f$ is a learned function that predicts the overall rating, $\hat{y}$, from the set of judge scores.
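
As a concrete illustration of the aggregation step, here is a minimal sketch of one possible $f$: a scikit-learn MLP with a single 16-unit hidden layer (matching the architecture described in the next section), trained on randomly generated stand-in judge scores. The team's actual training code and data are not reproduced here.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Stand-in data: scores s_1..s_m from m = 10 rubric judges for n (prompt, response)
# pairs, plus a simulated overall human rating y for each pair. In the real setup
# these come from the judge step and the persona-based raters, not random numbers.
n, m = 10_000, 10
S = rng.uniform(1, 10, size=(n, m))
y = 0.4 * S[:, 0] + 0.3 * S[:, 1] + 0.1 * S[:, 2] + rng.normal(0, 1.0, n)

S_train, S_test, y_train, y_test = train_test_split(S, y, random_state=0)

# f: a single hidden layer with 16 units, mirroring the MLP described below.
f = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
f.fit(S_train, y_train)

y_hat = f.predict(S_test)  # y_hat = f(s_1, ..., s_m)
print("held-out R^2:", round(r2_score(y_test, y_hat), 3))
```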

Details, details

  • The team started with 10,000 (prompt, answer) pairs from the UltraFeedback dataset, which contains diverse prompts and LLM-generated responses. (It also includes human ratings, which they did not use.)
  • They created 10 rubric judges (using the Martian API), each tasked with evaluating chat responses along a single metric (correctness, conciseness, style, etc.).
  • They generated simulated human feedback for each request-response pair, based on 8 distinct personas. Each pair was assigned only one rating, from a randomly selected persona (see the sketch after this list).
    • This simulated feedback was used instead of the real human ratings available in UltraFeedback because those ratings were based on a clearly defined rubric and four categorical metrics, while the goal of this project was, in a way, to see whether such metrics could be reverse engineered.
  • They trained two models — a Generalized Additive Model (GAM) and a single-layer MLP with 16 hidden units — to predict the simulated human rating from the scores of the ten rubric judges.
  • They compared both models to the much simpler, and widely used, approach of taking the average of judge scores.
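
The persona-sampling step could look roughly like the sketch below. The persona descriptions and the rate_with_persona helper are illustrative placeholders; the team's actual personas and their Martian API calls are not reproduced here.

```python
import random

# Illustrative persona pool; the team's actual 8 personas are not reproduced here.
PERSONAS = [
    "a strict professor who prizes factual accuracy above all",
    "a casual reader who values clarity and brevity",
    "a safety reviewer focused on harmful or biased content",
    # ... more personas would fill out the pool of 8
]

def rate_with_persona(persona: str, prompt: str, response: str) -> float:
    """Placeholder for an LLM call that returns a numeric rating of
    (prompt, response) while role-playing the given persona.
    The team used the Martian API for this call."""
    raise NotImplementedError

def simulate_human_rating(prompt: str, response: str) -> float:
    # Each (prompt, response) pair receives exactly one rating,
    # from one randomly selected persona.
    persona = random.choice(PERSONAS)
    return rate_with_persona(persona, prompt, response)
```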

Results

The MLP and GAM models showed much better predictive power than the Naive Mean Baseline.

| Model               | MSE  | MAE  | R²   |
|---------------------|------|------|------|
| NN Model (MLP)      | 3.06 | 1.35 | 0.57 |
| Model (GAM)         | 3.11 | 1.36 | 0.56 |
| Naive Mean Baseline | 5.36 | 1.83 | 0.24 |

The models capture roughly 56–57% of the variance in human ratings—far from production-ready, but a clear improvement over naive baselines. For an early experiment, this is a promising proof-of-concept that validates the approach and points toward fertile ground for future refinement.
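
For reference, the three reported metrics can be computed with standard scikit-learn helpers. The sketch below uses placeholder numbers and treats the naive baseline as the plain average of each pair's judge scores, as described above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Placeholder arrays; in the experiment these would be held-out simulated human
# ratings, the aggregator's predictions, and each pair's raw judge scores.
y_true = np.array([7.0, 4.5, 8.0, 6.0, 9.0])
y_pred_model = np.array([6.8, 5.0, 7.6, 6.3, 8.5])
judge_scores = np.array([[7, 6, 8], [5, 4, 6], [8, 7, 9], [6, 6, 5], [9, 8, 9]])
y_pred_naive = judge_scores.mean(axis=1)  # naive baseline: plain average of judge scores

for name, y_pred in [("learned aggregator", y_pred_model), ("naive mean", y_pred_naive)]:
    print(f"{name:18s}  MSE={mean_squared_error(y_true, y_pred):.2f}  "
          f"MAE={mean_absolute_error(y_true, y_pred):.2f}  "
          f"R2={r2_score(y_true, y_pred):.2f}")
```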

Model Interpretability: How much did each feature contribute to the predicted score?

The team also ran a partial dependence analysis on the GAM judge aggregator to see how strongly each of the 10 rubric judges influenced the predicted overall rating. The most influential features were factual accuracy and instruction-following, each showing a clear positive correlation with the final score: higher judge ratings on these dimensions reliably pushed the predicted rating upward. Clarity and conciseness also had moderate positive effects.

By contrast, metrics related to safety—such as harmlessness, privacy, and bias—had only small, flat partial dependence curves, suggesting they contributed very little to the aggregated score in this setup. This likely reflects the nature of the simulated human ratings, which were not strongly conditioned on safety attributes, rather than a universal lack of importance for those metrics. (Though this may point to a need to explicitly add weight to safety metrics in evaluations, if further research shows that human judges do not routinely consider these aspects of quality.)
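
A partial dependence analysis of this kind can be reproduced with off-the-shelf tooling. Below is a minimal sketch using scikit-learn's PartialDependenceDisplay on a stand-in MLP aggregator trained on synthetic judge scores; the team ran their analysis on the GAM, and their exact tooling isn't specified here.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.inspection import PartialDependenceDisplay
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Stand-in judge scores and simulated ratings (see the aggregation sketch above).
S = rng.uniform(1, 10, size=(2000, 10))
y = 0.4 * S[:, 0] + 0.3 * S[:, 1] + rng.normal(0, 1.0, 2000)

f = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(S, y)

# One partial dependence curve per rubric judge: how the predicted overall rating
# moves as that judge's score varies, averaging over the other judges' scores.
judge_names = [f"judge_{i}" for i in range(10)]
PartialDependenceDisplay.from_estimator(f, S, features=list(range(10)),
                                        feature_names=judge_names)
plt.tight_layout()
plt.show()
```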

Why is Martian excited about this research?

Martian’s core product — model routing — optimizes for quality responses. In other words, we route each request to the model that is predicted to provide the highest quality within given cost constraints.

To do this, we need to have ways of judging and predicting the quality of model outputs. When building custom routers we often use rubric-based LLM judges to rate responses and train our routing models. We do this for obvious reasons; human annotation on thousands of responses across multiple models simply doesn’t scale.

This work advances that approach by aggregating multiple rubric judges to arrive at results that are (potentially) much closer to human preferences than individual rubric judges, and does so in an interpretable way.

We feel this research direction has the potential to greatly improve our ability to measure and predict model response quality, directly impacting the quality of our routers.