LLMs as Judges: Break Hard Labels into Checklists, Reason, and Combine
LLMs can act like flexible, text-reading “judges” — but they work best when you give them clear rules, break big judgments into simple pieces, and combine their answers smartly. Use them to estimate things people disagree about (mental health scores, whether a text is hateful, how well a child keeps a conversation going) by turning messy labels into concrete questions, letting the model answer each, and then combining those answers with a small, interpretable rule or model.
Concrete examples (what this approach looks like in practice):
- Mental health scoring (PTSD): Give LLMs short narratives plus clear definitions of symptom subscales and interview context. Models do much better when they have those definitions and when you ask them to reason step-by-step. Open models (Llama, DeepSeek) plateau around 70B parameters; closed models (e.g., o3-mini, gpt-5 in the study) tended to improve across generations. Best results come from ensembling a supervised model with zero-shot LLM answers.
- Hate speech diagnosis (xList-Hate): Instead of one yes/no label, break hate into a checklist of concept-level questions (targeted group, intent, dehumanization, slur use, etc.). Ask an LLM each question, then feed the binary answers into a small decision tree. This gives transparency, improves cross-dataset robustness, and handles inconsistent labels better than a single black-box classifier.
- Child utterance assessment (LLM-as-a-judge): First classify the adult’s previous utterance, then score the child reply on two simple axes: Expansion (deepening or elaborating the topic) and Independence (taking initiative and reducing adult scaffolding). These scores match human judgments, track development with age, and detect discourse-level differences that raw length metrics miss.
Why this pattern works (in plain language):
- Complex labels are noisy and context-dependent. Asking many small, concrete questions is easier for a model than forcing it to guess one big label.
- Definitions and context anchor the model to the exact criteria you care about — it stops guessing which rulebook to use.
- Asking for more reasoning (step-by-step explanations or examples) gives the model space to produce better intermediate judgments.
- Combining multiple signals (LLM answers + a small supervised model) reduces mistakes that any single approach would make.
Practical pipeline — what to build, step by step:
- Define the construct clearly. Write short, plain-language definitions and subscale items (e.g., symptoms, hate-features, conversational moves).
- Turn definitions into checklist questions. Make each item answerable with yes/no or a small ordinal scale.
- Prompt the LLM to answer each checklist item. Include context: the original text, the definition for that item, and an instruction to explain reasoning if you want higher-quality outputs.
- Aggregate answers with a small, interpretable model. Use a decision tree, simple rules, or a tiny supervised learner to map checklist answers to the final judgment.
- Optionally ensemble with supervised models. Combine LLM-based diagnostics with in-domain classifiers for best reliability.
- Evaluate broadly. Test across datasets, measure cross-domain robustness, and inspect disagreement cases.
- Deploy carefully. Keep human oversight, logging, and privacy safeguards, especially for sensitive tasks like mental health or moderation.
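Steps 3 and 4 of the pipeline can be sketched in a few lines. This is a minimal illustration, not the method from any of the papers: `ask_llm` is a hypothetical helper standing in for a real LLM call (stubbed here with keyword rules so the example runs offline), and the aggregator is a hand-written rule in place of a trained decision tree:

```python
# Sketch of pipeline steps 3-4: ask the LLM each checklist item,
# then aggregate answers with a small, interpretable rule.
# ask_llm is a stub; a real version would send the question plus the
# item's definition to an LLM and parse its Yes/No reply.

CHECKLIST = {
    "targeted_group": "Does the text target a protected group?",
    "intent": "Does the text show intent to demean or harm?",
    "dehumanization": "Does the text dehumanize its target?",
}

def ask_llm(item: str, question: str, text: str) -> bool:
    """Stub standing in for an LLM call (keyword rules for illustration)."""
    stub = {
        "targeted_group": "those people" in text,
        "intent": "should" in text,
        "dehumanization": "vermin" in text,
    }
    return stub[item]

def run_checklist(text: str) -> dict:
    """Step 3: answer every checklist item for the input text."""
    return {item: ask_llm(item, q, text) for item, q in CHECKLIST.items()}

def aggregate(answers: dict) -> str:
    """Step 4: interpretable rule (a trained decision tree would go here).
    Hateful only if a group is targeted AND an aggravating feature holds."""
    if answers["targeted_group"] and (
        answers["intent"] or answers["dehumanization"]
    ):
        return "hateful"
    return "not hateful"

print(aggregate(run_checklist("those people are vermin")))  # hateful
print(aggregate(run_checklist("I dislike this movie")))     # not hateful
```

Because the aggregation rule is explicit, every final label can be traced back to the specific checklist answers that produced it, which is the transparency benefit the hate-speech example describes.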
Key modeling knobs (what to try and why):
- Context size: Provide definitions, sample distributions, and interview questions — this often boosts accuracy more than raw model size.
- Zero-shot vs few-shot: Few-shot examples can help when the task format is unfamiliar; zero-shot works well when you give clear definitions and structure.
- Reasoning effort: Instructions that ask the model to “think step-by-step” or to explain improve final estimates in many cases.
- Model choice and scale: Open models may plateau after a certain size (~70B in the reported work); closed or newer-generation models often show gains across releases.
- Structured subscales vs direct scalar prediction: Predicting sub-components first and then combining them beats asking the model to produce a single score directly.
- Output rescaling and calibration: Map model outputs into the same numeric range used by your labels; calibrate probabilities if you need reliable confidence scores.
- Ensembling methods: Average, majority vote, stacking, and supervised blending can all help; the best-performing setup in the study combined a supervised model with model-derived diagnostics.
Example prompt style (short):
“Read the text below. For each of these symptom questions, answer Yes or No and give one short sentence of why: (1) intrusive memories? (2) avoidance? (3) negative mood? ... Then map Yes count to this severity scale: 0–3 low, 4–7 moderate, 8–12 high.”
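A prompt like this needs a parser on the way back. Here is a sketch of one, assuming the model answers in the numbered "(1) Yes/No - reason" format the prompt requests; the severity bands match the mapping in the prompt:

```python
import re

def score_reply(reply: str) -> str:
    """Count 'Yes' answers in a numbered checklist reply and map the
    count onto the severity bands from the prompt:
    0-3 low, 4-7 moderate, 8-12 high."""
    yes = len(re.findall(r"^\s*\(?\d+\)?[.)]?\s*Yes\b",
                         reply, re.MULTILINE | re.IGNORECASE))
    if yes <= 3:
        return "low"
    if yes <= 7:
        return "moderate"
    return "high"

# Hypothetical model reply (truncated checklist for illustration):
sample = (
    "(1) Yes - reports intrusive flashbacks.\n"
    "(2) No - no avoidance described.\n"
    "(3) Yes - persistent low mood.\n"
    "(4) Yes - sleep disruption.\n"
    "(5) Yes - hypervigilance.\n"
)
print(score_reply(sample))  # moderate
```

Parsing structured replies this way also lets you log each per-item answer and its one-sentence rationale for later human audit.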
Caveats, limits, and ethical checks you must run:
- Bias and fairness: LLMs reflect training data. Check performance across demographic groups and for systematic errors.
- Annotation noise: Gold labels are often inconsistent. Diagnostic checklists help, but don’t remove the need for careful human validation.
- Domain shift: Models that do well on one dataset may fail on another unless you test cross-dataset robustness.
- Privacy & safety: Particularly for mental health or child data, protect privacy and avoid automated high-stakes decisions without human oversight.
- Hallucination & overconfidence: Ask for explanations and calibrate confidence; don’t trust raw model probabilities without checks.
- Legal & policy mismatch: Hate speech definitions vary — pick the normative criteria (legal, platform, researcher) and make them explicit in your checklist.
Quick takeaways you can use right away:
- Give clear definitions and context. That single step often improves accuracy more than switching models.
- Break hard tasks into many simple yes/no questions. Simpler pieces are easier for models to answer and easier for humans to audit.
- Ask for reasoning when the decision is complex. Step-by-step responses tend to be higher quality.
- Combine LLM diagnostics with small, interpretable aggregators. Decision trees give transparency and often better robustness.
- Ensemble where possible. Mixing supervised signals with LLM judgments gives more reliable final outputs.
- Always validate across datasets and include human review for sensitive uses.
Bottom line: Treat LLMs as expert assistants that answer many clear, contextual questions — then combine those answers with simple, transparent rules. That balance gives better accuracy, more robustness across datasets, and explanations you can inspect and improve.
Related Papers
- arXiv:2602.06015v1 — Large language models (LLMs) are increasingly being used in a zero-shot fashion to assess mental health conditions, yet we have limited knowledge on what factors affect their accuracy. In this study, …
- arXiv:2602.05874v1 — Hate speech detection is commonly framed as a direct binary classification problem despite being a composite concept defined through multiple interacting factors that vary across legal frameworks, pla…
- arXiv:2602.05392v1 — Evaluating the quality of children's utterances in adult-child dialogue remains challenging due to insufficient context-sensitive metrics. Common proxies such as Mean Length of Utterance (MLU), lexica…
