Greenfield · Intermediate · 1-2 weeks

Building LLM-as-a-Judge for AI Product Evaluation Using Critique Shadowing

A systematic approach to creating LLM-based evaluation systems for AI products by shadowing domain expert critiques. This workflow helps teams avoid common pitfalls like metric sprawl and build evaluation systems that drive real business value through iterative refinement with domain experts.

Tools & Prerequisites

Required Tools

Claude/GPT-4 (AI Assistant)
Spreadsheet Software (Data Analysis)

Optional Tools

Python (Programming Language)
Web Framework (Development Framework)

Step-by-Step Guide

1

Find The Principal Domain Expert

Identify the one or two key individuals whose judgment is crucial for your AI product's success. They should have deep domain expertise or represent your target users. Examples include:

  • A psychologist for a mental health AI assistant
  • A lawyer for legal document analysis AI
  • A customer service director for support chatbots
  • A lead teacher for educational AI tools

Pro Tip

In smaller companies, this might be the CEO or founder. For independent developers, you are often the domain expert yourself, but validate your assumptions with real users. Avoid defaulting to a convenient proxy such as your manager; find the actual expert.

2

Create a Diverse Dataset

Build a dataset that captures the range of problems your AI will encounter. Structure it across dimensions relevant to your use case:

Common Dimensions for B2C Applications:

  • Features: Specific functionalities (e.g., email summarization, meeting scheduler)
  • Scenarios: Situations the AI must handle (e.g., multiple matches found, ambiguous requests)
  • Personas: User profiles with distinct needs (e.g., new user, expert user, non-native speaker)

Generate data through:

  • Sampling real user interactions
  • Creating synthetic user inputs using LLMs
  • Incorporating system information (APIs, databases) for realism

Prompt Template

Generate a user input from someone who is clearly irritated and impatient, using short, terse language to demand information about their order status for order number #1234567890. Include hints of previous negative experiences.
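
To cover every dimension combination systematically, you can enumerate the cross-product of your dimensions and ask an LLM to synthesize one user input per combination. The sketch below is illustrative: the dimension values and the call_llm helper are assumptions to be replaced with your own.

Code Example

# Sketch: enumerate dimension combinations and synthesize one user input per tuple.
# The dimension values and call_llm() are illustrative placeholders.
from itertools import product

features = ["order_tracking", "contact_search"]
scenarios = ["no_matches", "multiple_matches", "ambiguous_request"]
personas = ["new_user", "expert_user", "non_native_speaker"]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with your LLM client of choice")

synthetic_inputs = []
for feature, scenario, persona in product(features, scenarios, personas):
    prompt = (
        f"Generate a realistic user input for the '{feature}' feature, "
        f"in a '{scenario}' scenario, written by a '{persona}' persona. "
        "Return only the user's message."
    )
    synthetic_inputs.append({
        "feature": feature,
        "scenario": scenario,
        "persona": persona,
        "user_input": call_llm(prompt),
    })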

Pro Tip

Generate enough data to cover all dimension combinations. Keep generating until you stop seeing new failure modes. Synthetic data works surprisingly well for user inputs.

3

Direct Domain Expert to Make Pass/Fail Judgments with Critiques

Have the domain expert evaluate each AI interaction with:

  • A binary pass/fail judgment answering: "Did the AI achieve the desired outcome?"
  • A detailed critique explaining their reasoning

Provide the expert with:

  • User input and AI response
  • Relevant context (user metadata, system state)
  • Easy-to-use interface (spreadsheet or simple web app)

Critiques should be detailed enough to use in few-shot prompts later.
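
One minimal way to structure each labeled example, assuming a spreadsheet or CSV export; the field names below are illustrative, not prescribed by the workflow.

Code Example

# Sketch of one expert-labeled example; field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ExpertJudgment:
    user_input: str   # what the user asked
    ai_output: str    # what the AI responded
    context: str      # user metadata / system state shown to the expert
    outcome: str      # "pass" or "fail" -- binary, no 1-5 scales
    critique: str     # detailed reasoning, reusable later as a few-shot example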

Pro Tip

Resist pressure to use complex scoring scales (1-5). Binary judgments force clarity on what matters. Make reviewing data frictionless: present everything on one screen. Start with ~30 examples and expand as needed.

4

Fix Obvious Errors

After initial review, fix any pervasive errors in your AI system before building the judge. This might include:

  • Incorrect API calls
  • Missing error handling
  • Poor prompt engineering
  • Inadequate context retrieval

Iterate between fixing errors and expert review until the system stabilizes.

Pro Tip

Don't skip this step: it's easier to fix errors now than after building the judge. If you have Level 1 evals (unit tests), you shouldn't have many pervasive errors.
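
If you don't yet have Level 1 evals, even a couple of plain assertions help catch pervasive errors; build_api_call below is a hypothetical stand-in for your own request builder.

Code Example

# Sketch of a Level 1 (unit-test style) check for an obvious error.
# build_api_call() is a hypothetical placeholder for your app's request builder.

def build_api_call(user_message: str) -> str:
    raise NotImplementedError("replace with your app's request builder")

def test_order_lookup_builds_valid_api_call():
    endpoint = build_api_call("What's the status of order #1234567890?")
    # A malformed endpoint here is exactly the kind of pervasive error
    # worth fixing before investing in an LLM judge.
    assert endpoint == "/orders/1234567890"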

5

Build Your LLM Judge Iteratively

Create an LLM judge prompt using the expert's examples:

  1. Start with a base prompt explaining the evaluation task
  2. Include few-shot examples from the expert's critiques
  3. Test the judge against new examples
  4. Compare judge outputs with expert judgments
  5. Refine the prompt based on disagreements
  6. Iterate until achieving >90% agreement

Prompt Template

You are a [domain] evaluator with advanced capabilities to judge if [output] is good or not.

Here are guidelines for evaluation:
{{guidelines}}

Example evaluations:
<example-1>
<input>{{user_input}}</input>
<output>{{ai_output}}</output>
<critique>
{
  "critique": "{{expert_critique}}",
  "outcome": "pass|fail"
}
</critique>
</example-1>

For the following interaction, write a detailed critique and provide a pass/fail judgment:
<input>{{new_input}}</input>
<output>{{new_output}}</output>
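
A minimal sketch of steps 4-6 above, assuming a call_llm_judge helper that fills the template and returns the JSON critique shown in the examples; iterate on the prompt until the rate clears roughly 90%.

Code Example

# Sketch: compare judge verdicts with expert labels to track agreement.
# call_llm_judge() is a placeholder that should fill the prompt template above
# and return a dict like {"critique": "...", "outcome": "pass" | "fail"}.

def call_llm_judge(user_input: str, ai_output: str) -> dict:
    raise NotImplementedError("wire up your LLM client and the judge prompt here")

def agreement_rate(expert_labels: list[dict]) -> float:
    """expert_labels items hold 'user_input', 'ai_output', and the expert's 'outcome'."""
    matches = 0
    for example in expert_labels:
        verdict = call_llm_judge(example["user_input"], example["ai_output"])
        matches += verdict["outcome"] == example["outcome"]
    return matches / len(expert_labels)

# Review disagreements with the expert and refine the prompt until
# agreement_rate(...) exceeds ~0.9.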

Pro Tip

Track agreement rates over time. Consider dynamic example selection based on the item being judged. The process often helps domain experts clarify their own criteria.

6

Perform Error Analysis

Apply the judge to a larger dataset and analyze failures:

  1. Calculate error rates by dimensions (feature, scenario, persona)
  2. Examine failed interactions to identify patterns
  3. Classify root causes (e.g., missing user education, poor error messages)
  4. Prioritize fixes based on frequency and impact

Example analysis structure:

  • Error rates table showing failure percentages by dimension
  • Root cause distribution showing most common failure types
  • Specific examples of each failure pattern

Code Example

# Example error analysis structure
error_analysis = {
    "by_dimension": {
        "feature": {"order_tracking": 0.35, "contact_search": 0.22},
        "scenario": {"no_matches": 0.68, "multiple_matches": 0.21},
        "persona": {"new_user": 0.45, "expert_user": 0.20}
    },
    "root_causes": {
        "missing_user_education": 0.40,
        "authentication_issues": 0.30,
        "poor_context_handling": 0.20,
        "inadequate_error_messages": 0.10
    }
}
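
A small follow-up sketch, assuming the error_analysis dict above (values read as failure fractions), that ranks root causes so the most frequent failure types get fixed first.

# Rank root causes by frequency; fix the biggest failure buckets first.
prioritized = sorted(
    error_analysis["root_causes"].items(),
    key=lambda item: item[1],
    reverse=True,
)
for cause, share in prioritized:
    print(f"{cause}: {share:.0%} of failures")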

Pro Tip

You can get valuable insights in just 15 minutes of analysis. Focus on high-frequency, high-impact errors first. Create test cases for each error type you fix.

7

Create Specialized Judges (If Needed)

Based on error analysis, create targeted judges for specific issues:

  • Citation accuracy checker
  • Response completeness validator
  • Tone appropriateness judge
  • Technical accuracy verifier

Only create specialized judges after understanding the main failure modes through the general judge.

Pro Tip

Don't jump to specialized judges too early. Some errors might be better caught with simple code-based assertions rather than LLM judges.
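
For instance, a citation-presence check can be a plain code assertion rather than an LLM judge; the bracketed-number pattern below is an assumption about your citation format.

Code Example

# Sketch: a code-based assertion for citation presence -- no LLM judge needed.
# Assumes citations appear as "[1]", "[2]", ... in the response text.
import re

def has_citations(response: str) -> bool:
    return bool(re.search(r"\[\d+\]", response))

assert has_citations("Refunds are issued within 5-7 business days [1].")
assert not has_citations("Refunds are issued within 5-7 business days.")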

Building LLM-as-a-Judge Using Critique Shadowing

This workflow addresses a critical problem in AI development: teams drowning in metrics that don't reflect real user needs. The solution is Critique Shadowing, a process in which you build an LLM judge by learning from domain expert evaluations.

You can read Hamel Husain's original post here

Common Pitfalls to Avoid

  1. Too Many Metrics: Creating numerous measurements that become unmanageable
  2. Arbitrary Scoring Systems: Using uncalibrated scales (1-5) where score differences are unclear
  3. Ignoring Domain Experts: Not involving people who understand the subject matter deeply
  4. Unvalidated Metrics: Using measurements that don't reflect what matters to users

The Critique Shadowing Process

The process involves finding a principal domain expert, creating a diverse dataset, having the expert make pass/fail judgments with detailed critiques, and iteratively building an LLM judge that aligns with their expertise.

Key Principles

  • Binary Pass/Fail Judgments: Force clarity on what truly matters
  • Detailed Critiques: Capture nuanced reasoning behind judgments
  • Iterative Refinement: Continuously improve alignment with the domain expert
  • Error Analysis: Systematically identify and fix failure patterns

Expected Outcomes

  • Clear, actionable evaluation metrics aligned with business goals
  • Reduced evaluation overhead through automated LLM judges
  • Better understanding of product strengths and weaknesses
  • Improved AI system performance through targeted fixes
