Things to Read This Week (5/5)
Historical oddities and historical revelations, LLMs, and the freedom of the press
A Vow of Silence: Vice-Presidential Inaugural Addresses, by Gerard Magliocca, who has a real knack for topics. “This is the first (and probably the last) article on the (largely) abandoned tradition of vice-presidential inaugural speeches.”
The Press Clause: Important, Remembered, and Equally Shared, by Eugene Volokh — a response to this 100-page report from the Abrams Institute.
Large Language Models Are Unreliable Judges, by Jonathan Choi. Contains a very illuminating discussion of prompt phrasing and reliability that made me reevaluate some of the recent LLM judging literature.
Some Traditional Questions about History and Tradition, by Jonathan Green. (“If the Constitution’s legal content resides in the original meaning of its terms, how might a tradition of political practice that arose long after a constitutional provision’s adoption be legally relevant? Eighteenth-century English jurists had an answer to that question. . . . . The codification of an unwritten right in written law did not alter its status as a customary right, whose limits were set by a tradition that preceded and succeeded the text’s enactment.”) A good addition to the “general law” revival . . .
Reconstructing Section 1983, by Tyler Lindley. Something of a sequel to his Anachronistic Readings of Section 1983. I commented on this paper at a lunch talk in February, and my biggest questions were about Federal Rule of Civil Procedure 2, an issue discussed at pp. 26-29 of the paper.
I'm kind of underwhelmed by the study on LLM judging. It's got one major issue and some important limitations. First, there's a problem with his inputs. He gives very few examples of the randomized 'equivalent' prompts, and doesn't claim to have hand-reviewed them to confirm they actually have the characteristics he imputes to them. But let's assume that he did, and that they're all of similar quality to the two examples he gives:
> Does the contractual term “other affiliate[s]” from 1961 apply only
> to affiliates existing at that time, or does it automatically include
> companies that later become affiliates? Answer only “Existing
> Affiliates” or “Future Affiliates”, without any other explanation.
And
> Should the phrase “other affiliate[s]” in a 1961 contract about royalty
> splitting be read to include entities that become affiliates after 1961,
> or only those existing then? Answer only “Existing Affiliates” or
> “Future Affiliates”, without any other explanation.
These are not simple rephrasings of each other, and the differences would (contra the article) affect the substantive analysis. The second specifies that the contract is about royalty splitting, providing some important additional context to the interpretive question.
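For the curious, this kind of phrasing-sensitivity check is easy to run yourself. Below is a minimal sketch, not the paper's actual pipeline: it assumes the `openai` Python client and GPT-4o (the answering model the paper uses), takes the two prompts verbatim from the excerpts quoted above, and simply tallies the forced-choice label each phrasing produces. The sample count and temperature are my own arbitrary choices.

```python
# Minimal sketch of a phrasing-sensitivity check, assuming the `openai`
# Python client (>=1.0) and an OPENAI_API_KEY in the environment.
from collections import Counter
from openai import OpenAI

client = OpenAI()

PARAPHRASES = [
    'Does the contractual term "other affiliate[s]" from 1961 apply only '
    'to affiliates existing at that time, or does it automatically include '
    'companies that later become affiliates? Answer only "Existing '
    'Affiliates" or "Future Affiliates", without any other explanation.',
    'Should the phrase "other affiliate[s]" in a 1961 contract about royalty '
    'splitting be read to include entities that become affiliates after 1961, '
    'or only those existing then? Answer only "Existing Affiliates" or '
    '"Future Affiliates", without any other explanation.',
]

def ask(prompt: str, n_samples: int = 5) -> Counter:
    """Collect the model's forced-choice answers for one phrasing."""
    answers = Counter()
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="gpt-4o",            # the one-shot model the paper tests
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,           # sampled, not greedy, so answers can vary
        )
        answers[resp.choices[0].message.content.strip()] += 1
    return answers

for i, p in enumerate(PARAPHRASES, 1):
    print(f"Phrasing {i}: {ask(p)}")
```

If the two tallies diverge, that tells you the phrasing matters; it tells you nothing about whether either phrasing gave the model a sound basis for choosing, which is the real problem here.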
More importantly... does anyone doubt that this question is wildly underspecified? Someone using those words in completely standard ways could intend to include or exclude later affiliates. This isn't a question resolvable by a dictionary. When a judge determines whether "other affiliates" includes later affiliates, the first thing they do is not consult a dictionary or some preconceived standard meaning for the term; they look first to the context in which the language was used for clues about what was meant in this particular situation. By depriving the LLM of the very context a judge would use to make their judgement, we ensure that it cannot exercise similar judgement.
Indeed, without even a full sentence of context from the contract, could human judges (blinded to each other's answers) give you a consistent answer to this question? I doubt it, unless they were familiar with this case and could supply the relevant context from memory. In the end, when the author removed all sound bases for making this judgement from the prompt and then *required* a bare either-or answer from the LLM, he guaranteed that its judgement would be made on unsound bases.
As the ancient lore of the programmer says, Garbage In, Garbage Out.
Less importantly, his model choice limits how relevant this is. First off, I don't think there's any value in the data from the tiny, open-source models. They're years behind, in a field where capabilities have (thus far) been increasing exponentially. If you're using Llama for anything other than experimenting on training LLMs, you're doing it wrong. The only open-source model that anyone could seriously consider as a professional aid in any field is DeepSeek. And I don't *think* anyone is proposing open-source models as an aid for judges at this point; everyone seems to be using OpenAI's or Anthropic's models.
But even in his larger 5-question test of major models, he's still using old ones. He uses Claude 3.5 Sonnet for generating the prompts and GPT-4o for answering them, both of which are a year old, and explicitly avoids the newer reasoning and deep-research models from both vendors because their outputs are less useful for his research. Fair enough; but there's a limit on how far characterizations of older one-shot models can be extrapolated to reasoning models.
All that said, I'd like to make every researcher in this field read his section 4.1 on why repeatability of response to the same prompt is a meaningless statistic. Excellent work, and an important caveat for readers and researchers alike.
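To make the intuition concrete (and this illustration is mine, not a summary of his argument): if "reliability" means re-sending the identical prompt, low-temperature decoding will hand you near-perfect agreement almost by construction, regardless of what the answer rests on. Reusing the client and prompts from the sketch above:

```python
# Continuing the sketch above: same-prompt "repeatability" at temperature 0
# is close to tautological, because greedy decoding is (nearly) deterministic.
def repeatability(prompt: str, n_samples: int = 5) -> float:
    answers = Counter()
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,           # greedy-ish decoding
        )
        answers[resp.choices[0].message.content.strip()] += 1
    # Fraction of samples matching the modal answer: this will sit near 1.0
    # whether or not the answer rests on anything a judge would call a reason.
    return answers.most_common(1)[0][1] / n_samples

print(repeatability(PARAPHRASES[0]))
```

A score of 1.0 here measures the determinism of the decoder, not the soundness of the judgement.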