The rapid advancement of Large Language Models (LLMs) necessitates robust and challenging benchmarks.
To address the challenge of ranking LLMs on highly subjective tasks such as emotional intelligence, creative writing, or persuasiveness, the Language Model Council (LMC) operates through a democratic process: 1) formulate a test set with equal participation from every member, 2) administer the test to all council members, and 3) evaluate the responses as a collective jury. A minimal sketch of one council round follows.
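To make the three stages concrete, here is a minimal sketch of one council round in Python. The `CouncilMember` class, its methods, and the placeholder voting rule are hypothetical illustrations of the workflow, not the actual LMC implementation; in practice each method would call the underlying model's API and the jury stage would use an LLM judgment rather than a fixed pick.

```python
# Hypothetical sketch of the three council stages; names and the voting
# rule are illustrative, not the LMC reference implementation.
from collections import Counter
from typing import Dict, List


class CouncilMember:
    """Wraps one LLM; each method stands in for a call to that model."""

    def __init__(self, name: str):
        self.name = name

    def propose_test_item(self) -> str:
        # Stage 1: every member contributes test items equally.
        return f"interpersonal dilemma proposed by {self.name}"

    def respond(self, item: str) -> str:
        # Stage 2: every member answers every test item.
        return f"{self.name}'s response to: {item}"

    def judge(self, item: str, responses: Dict[str, str]) -> str:
        # Stage 3: every member votes for the best response, excluding its own.
        candidates = [m for m in responses if m != self.name]
        return candidates[0]  # placeholder for an actual LLM judgment


def run_council(members: List[CouncilMember]) -> Counter:
    test_set = [m.propose_test_item() for m in members]          # 1) formulate
    votes: Counter = Counter()
    for item in test_set:
        responses = {m.name: m.respond(item) for m in members}   # 2) administer
        for juror in members:                                    # 3) collective jury
            votes[juror.judge(item, responses)] += 1
    return votes


if __name__ == "__main__":
    council = [CouncilMember(f"model_{i}") for i in range(3)]
    print(run_council(council).most_common())
```

Aggregating votes across all jurors, rather than relying on any single judge, is what the ranking is built on.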
Our initial research deploys a council of 20 of the newest LLMs on an open-ended emotional intelligence task: responding to interpersonal dilemmas. Our results show that the LMC produces rankings that are more separable, more robust, and less biased than those from any individual LLM judge, and that are more consistent with a human-established leaderboard than other benchmarks such as Chatbot Arena or MMLU.
Roadmap: