Discussion about this post

Byrel Mitchell

I'm kind of underwhelmed by the study on LLM judging. It's got one major issue, and some important limitations. First, there's a problem with his inputs. He gives very few examples of the randomized 'equivalent' prompts, and doesn't claim to have hand-reviewed them for actually having the characteristics he imputes to them. But let's assume that he did, and they're all of similar quality to the two examples he gives:

> Does the contractual term “other affiliate[s]” from 1961 apply only to affiliates existing at that time, or does it automatically include companies that later become affiliates? Answer only “Existing Affiliates” or “Future Affiliates”, without any other explanation.

And

> Should the phrase “other affiliate[s]” in a 1961 contract about royalty splitting be read to include entities that become affiliates after 1961, or only those existing then? Answer only “Existing Affiliates” or “Future Affiliates”, without any other explanation.

These are not simple rephrasings of each other, and the differences would (contra the article) affect the substantive analysis. The second specifies that the contract is about royalty splitting, providing some important additional context to the interpretive question.

More importantly... does anyone doubt that this question is wildly underspecified? Someone using those words in completely standard ways could intend either to include or to exclude later affiliates. This isn't a question resolvable by a dictionary. When a judge determines whether "other affiliates" includes later affiliates, the first thing they do is not consult a dictionary or some preconceived standard meaning for the term; it is to consult the context in which the language was used for clues about what was meant in this particular situation. By depriving the LLM of the very context a judge would use to make that judgement, we ensure that it cannot exercise similar judgement.

Indeed, without even giving a full sentence of context from the contract, could human judges (blinded from each other's answers) give you a consistent answer to this question? I doubt it, unless they were familiar with this case and were able to incorporate the relevant context from memory. In the end, when the author removed all sound bases for making this judgement from the prompt and then *required* a bare either/or answer from the LLM, he guaranteed that its judgement would be made on unsound bases.

As the ancient lore of the programmer says, Garbage In, Garbage Out.
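To be concrete about what's actually being asked of the model, here's roughly the shape of the experiment as I read it. This is only a sketch: the harness and model name are my own stand-ins assuming an OpenAI-style chat API, not the author's code; the only thing taken from the paper is the pair of prompts quoted above.

```python
# Rough sketch of the paraphrase-consistency setup as I read it; the model
# name and harness are my stand-ins, not the study's actual code.
from collections import Counter
from openai import OpenAI

client = OpenAI()

# The two paraphrases quoted above, minus any of the contract's actual text.
PARAPHRASES = [
    'Does the contractual term "other affiliate[s]" from 1961 apply only to '
    'affiliates existing at that time, or does it automatically include '
    'companies that later become affiliates? Answer only "Existing Affiliates" '
    'or "Future Affiliates", without any other explanation.',
    'Should the phrase "other affiliate[s]" in a 1961 contract about royalty '
    'splitting be read to include entities that become affiliates after 1961, '
    'or only those existing then? Answer only "Existing Affiliates" or '
    '"Future Affiliates", without any other explanation.',
]

def answer(prompt: str) -> str:
    """Force the model to emit one of the two bare labels."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # stand-in; the study's exact model/settings may differ
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# Tally the forced labels across paraphrases. Note what the model never sees:
# the surrounding clause, the parties, or anything a judge would actually consult.
labels = Counter(answer(p) for p in PARAPHRASES)
print(labels)
```

Run that and you get two bare labels to compare, with nothing in the transcript that either label could have been grounded in.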

Less importantly, his model choice limits how relevant this is. First off, I don't think there's any value in the data from the tiny open-source models. They're years behind, in a field where capabilities have (thus far) been increasing exponentially. If you're using Llama for anything other than experimenting on training LLMs, you're doing it wrong. The only open-source model anyone could seriously consider as a professional aid in any field is DeepSeek. And I don't *think* anyone is proposing open-source models as an aid for judges at this point; everyone seems to be using OpenAI's or Anthropic's models.

But even for his large 5-question test of major models, he's still using old ones. He uses Claude 3.5 Sonnet for generating the prompts and GPT-4o for answering them, both of which are a year old, and he explicitly avoids the newer reasoning and deep research models from both vendors because their outputs are less useful for his research. Fair enough, but there's a limit to how far characterizations of older one-shot models can be extrapolated to reasoning models.

All that said, I'd like to make every researcher in this field read his section 4.1 on why repeatability of response to the same prompt is a meaningless statistic. Excellently done, and an important caveat for readers and researchers alike.
