Building LLM-as-a-Judge for AI Product Evaluation Using Critique Shadowing
A systematic approach to creating LLM-based evaluation systems for AI products by shadowing domain expert critiques. This workflow helps teams avoid common pitfalls like metric sprawl and build evaluation systems that drive real business value through iterative refinement with domain experts.

Step-by-Step Guide
Find The Principal Domain Expert
Identify the one or two key individuals whose judgment is crucial to your AI product's success. These experts should have deep domain expertise or represent your target users. Examples include:
- A psychologist for a mental health AI assistant
- A lawyer for legal document analysis AI
- A customer service director for support chatbots
- A lead teacher for educational AI tools
Pro Tip
In smaller companies, this might be the CEO or founder. If you are an independent developer, act as the domain expert yourself, but validate your assumptions with real users. Avoid convenient proxies such as your manager - find the actual expert.
Create a Diverse Dataset
Build a dataset that captures the range of problems your AI will encounter. Structure it across dimensions relevant to your use case:
Common Dimensions for B2C Applications:
- Features: Specific functionalities (e.g., email summarization, meeting scheduler)
- Scenarios: Situations the AI must handle (e.g., multiple matches found, ambiguous requests)
- Personas: User profiles with distinct needs (e.g., new user, expert user, non-native speaker)
Generate data through:
- Sampling real user interactions
- Creating synthetic user inputs using LLMs
- Incorporating system information (APIs, databases) for realism
Prompt Template
Generate a user input from someone who is clearly irritated and impatient, using short, terse language to demand information about their order status for order number #1234567890. Include hints of previous negative experiences.
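Templates like the one above can be filled programmatically to cover every dimension combination. Below is a minimal sketch assuming an OpenAI-style chat client; the dimension values, model name, prompt wording, and output file are illustrative placeholders, not a prescribed setup.
Code Example
# Minimal sketch: generate one synthetic user input per dimension combination.
# Assumes the openai Python client; dimensions and model name are illustrative.
import itertools
import json

from openai import OpenAI

client = OpenAI()

DIMENSIONS = {
    "feature": ["order_tracking", "contact_search"],
    "scenario": ["no_matches", "multiple_matches", "ambiguous_request"],
    "persona": ["new_user", "expert_user", "non_native_speaker"],
}

PROMPT = (
    "Generate a realistic user input for a customer support assistant.\n"
    "Feature under test: {feature}\n"
    "Scenario: {scenario}\n"
    "User persona: {persona}\n"
    "Return only the user's message."
)

synthetic_inputs = []
for feature, scenario, persona in itertools.product(*DIMENSIONS.values()):
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable chat model works here
        messages=[{
            "role": "user",
            "content": PROMPT.format(feature=feature, scenario=scenario, persona=persona),
        }],
    )
    synthetic_inputs.append({
        "feature": feature,
        "scenario": scenario,
        "persona": persona,
        "user_input": response.choices[0].message.content,
    })

with open("synthetic_inputs.json", "w") as f:
    json.dump(synthetic_inputs, f, indent=2)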
Pro Tip
Generate enough data to cover all dimension combinations. Keep generating until you stop seeing new failure modes. Synthetic data works surprisingly well for user inputs.
Direct Domain Expert to Make Pass/Fail Judgments with Critiques
Have the domain expert evaluate each AI interaction with:
- A binary pass/fail judgment answering: "Did the AI achieve the desired outcome?"
- A detailed critique explaining their reasoning
Provide the expert with:
- User input and AI response
- Relevant context (user metadata, system state)
- Easy-to-use interface (spreadsheet or simple web app)
Critiques should be detailed enough to use in few-shot prompts later.
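If you collect labels outside a spreadsheet, one flat record per interaction is enough. Below is an illustrative schema; the field names are assumptions, and a spreadsheet with the same columns works just as well.
Code Example
# Illustrative record schema for collecting expert labels; field names are
# assumptions, not a prescribed format.
import csv

FIELDS = ["trace_id", "user_input", "ai_response", "context", "outcome", "critique"]

example_row = {
    "trace_id": "trace-0042",
    "user_input": "Where is my order #1234567890??",
    "ai_response": "I couldn't find that order. Can you confirm the number?",
    "context": "user_tier=premium; orders_found=0",
    "outcome": "fail",  # binary pass/fail only - no 1-5 scales
    "critique": ("The order exists; the lookup used the wrong account. The "
                 "response should have asked the user to verify their email "
                 "instead of implying the order number is wrong."),
}

with open("expert_labels.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow(example_row)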
Pro Tip
Resist pressure to use complex scoring scales (1-5). Binary judgments force clarity on what matters. Make reviewing data frictionless - present everything on one screen. Start with ~30 examples and expand as needed.
Fix Obvious Errors
After initial review, fix any pervasive errors in your AI system before building the judge. This might include:
- Incorrect API calls
- Missing error handling
- Poor prompt engineering
- Inadequate context retrieval
Iterate between fixing errors and expert review until the system stabilizes.
Pro Tip
Don't skip this step - it's easier to fix errors now than after building the judge. If you have Level 1 evals (unit tests), you shouldn't have many pervasive errors.
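As a concrete illustration, a Level 1 check can be an ordinary unit test over your pipeline's output. The sketch below assumes a hypothetical run_agent() entry point and response shape, stubbed here so the test runs standalone under pytest.
Code Example
# Minimal sketch of a "Level 1" unit-test eval. run_agent() is a placeholder
# for your real pipeline entry point; the response shape is an assumption.
import re


def run_agent(user_input: str) -> dict:
    """Stub standing in for your real agent call; returns the chosen tool call."""
    return {"tool_calls": [{"name": "lookup_order",
                            "arguments": {"order_number": "1234567890"}}]}


def test_order_lookup_makes_well_formed_api_call():
    result = run_agent("Where is order #1234567890?")
    tool_call = result["tool_calls"][0]

    # Pervasive errors like malformed API calls should be caught here,
    # before any LLM judge is involved.
    assert tool_call["name"] == "lookup_order"
    assert re.fullmatch(r"\d{10}", tool_call["arguments"]["order_number"])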
Build Your LLM Judge Iteratively
Create an LLM judge prompt using the expert's examples:
- Start with a base prompt explaining the evaluation task
- Include few-shot examples from the expert's critiques
- Test the judge against new examples
- Compare judge outputs with expert judgments
- Refine the prompt based on disagreements
- Iterate until achieving >90% agreement (see the agreement-tracking sketch after the prompt template)
Prompt Template
You are a [domain] evaluator with advanced capabilities to judge if [output] is good or not.
Here are guidelines for evaluation:
{{guidelines}}
Example evaluations:
<example-1>
<input>{{user_input}}</input>
<output>{{ai_output}}</output>
<critique>
{
  "critique": "{{expert_critique}}",
  "outcome": "pass|fail"
}
</critique>
</example-1>
For the following interaction, write a detailed critique and provide a pass/fail judgment:
<input>{{new_input}}</input>
<output>{{new_output}}</output>
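Agreement with the expert is the number to watch while iterating on the judge prompt. A minimal sketch for tracking it, assuming you have run the judge over examples the expert has already labeled (the sample outcomes below are placeholders):
Code Example
# Minimal sketch: judge/expert agreement rate across prompt iterations.
def agreement_rate(judge_outcomes: list[str], expert_outcomes: list[str]) -> float:
    """Fraction of examples where the LLM judge matches the expert's judgment."""
    matches = sum(j == e for j, e in zip(judge_outcomes, expert_outcomes))
    return matches / len(expert_outcomes)


judge_outcomes = ["pass", "fail", "fail", "pass"]   # from the judge prompt above
expert_outcomes = ["pass", "fail", "pass", "pass"]  # from the expert's reviews

print(f"Agreement: {agreement_rate(judge_outcomes, expert_outcomes):.0%}")
# Keep refining the prompt until this stays above 90%.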
Pro Tip
Track agreement rates over time. Consider dynamic example selection based on the item being judged. The process often helps domain experts clarify their own criteria.
Perform Error Analysis
Apply the judge to a larger dataset and analyze failures:
- Calculate error rates by dimensions (feature, scenario, persona)
- Examine failed interactions to identify patterns
- Classify root causes (e.g., missing user education, poor error messages)
- Prioritize fixes based on frequency and impact
Example analysis structure:
- Error rates table showing failure percentages by dimension
- Root cause distribution showing most common failure types
- Specific examples of each failure pattern
Code Example
# Example error analysis structure
error_analysis = {
    "by_dimension": {
        "feature": {"order_tracking": 0.35, "contact_search": 0.22},
        "scenario": {"no_matches": 0.68, "multiple_matches": 0.21},
        "persona": {"new_user": 0.45, "expert_user": 0.20},
    },
    "root_causes": {
        "missing_user_education": 0.40,
        "authentication_issues": 0.30,
        "poor_context_handling": 0.20,
        "inadequate_error_messages": 0.10,
    },
}
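Such a summary can be computed directly from the judged records. A minimal sketch in plain Python; the field names mirror the illustrative schema used earlier, and the sample records are placeholders.
Code Example
# Minimal sketch: failure rates by dimension from judged records.
from collections import defaultdict

judged = [
    {"feature": "order_tracking", "scenario": "no_matches", "persona": "new_user", "outcome": "fail"},
    {"feature": "order_tracking", "scenario": "multiple_matches", "persona": "expert_user", "outcome": "pass"},
    {"feature": "contact_search", "scenario": "no_matches", "persona": "new_user", "outcome": "fail"},
    {"feature": "contact_search", "scenario": "multiple_matches", "persona": "expert_user", "outcome": "pass"},
]


def failure_rates(records, dimension):
    """Failure rate for each value of the given dimension."""
    totals, failures = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[dimension]] += 1
        failures[r[dimension]] += r["outcome"] == "fail"
    return {value: failures[value] / totals[value] for value in totals}


for dim in ("feature", "scenario", "persona"):
    print(dim, failure_rates(judged, dim))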
Pro Tip
You can get valuable insights in just 15 minutes of analysis. Focus on high-frequency, high-impact errors first. Create test cases for each error type you fix.
Create Specialized Judges (If Needed)
Based on error analysis, create targeted judges for specific issues:
- Citation accuracy checker
- Response completeness validator
- Tone appropriateness judge
- Technical accuracy verifier
Only create specialized judges after understanding the main failure modes through the general judge.
Pro Tip
Don't jump to specialized judges too early. Some errors might be better caught with simple code-based assertions rather than LLM judges.
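As an example of a code-based assertion, a grounded-citation check needs no LLM at all. The sketch below assumes a hypothetical [doc:ID] citation format; adapt the pattern to however your system marks citations.
Code Example
# Minimal sketch of a code-based assertion replacing an LLM judge for a
# mechanical check: every citation must refer to a retrieved document.
import re


def citations_are_grounded(response: str, retrieved_doc_ids: set[str]) -> bool:
    """True if every [doc:ID] citation points at a document that was retrieved."""
    cited = set(re.findall(r"\[doc:(\w+)\]", response))
    return cited.issubset(retrieved_doc_ids)


response = "Your refund window is 30 days [doc:policy_12]."
print(citations_are_grounded(response, {"policy_12", "faq_03"}))  # True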
Building LLM-as-a-Judge Using Critique Shadowing
This workflow addresses a critical problem in AI development: teams drowning in metrics that don't reflect real user needs. The solution is Critique Shadowing - a process where you build an LLM judge by learning from domain expert evaluations.
You can read Hamel's original post here
Common Pitfalls to Avoid
- Too Many Metrics: Creating numerous measurements that become unmanageable
- Arbitrary Scoring Systems: Using uncalibrated scales (1-5) where score differences are unclear
- Ignoring Domain Experts: Not involving people who understand the subject matter deeply
- Unvalidated Metrics: Using measurements that don't reflect what matters to users
The Critique Shadowing Process
The process involves finding a principal domain expert, creating a diverse dataset, having the expert make pass/fail judgments with detailed critiques, and iteratively building an LLM judge that aligns with their expertise.
Key Principles
- Binary Pass/Fail Judgments: Force clarity on what truly matters
- Detailed Critiques: Capture nuanced reasoning behind judgments
- Iterative Refinement: Continuously improve alignment with domain expert
- Error Analysis: Systematically identify and fix failure patterns
Expected Outcomes
- Clear, actionable evaluation metrics aligned with business goals
- Reduced evaluation overhead through automated LLM judges
- Better understanding of product strengths and weaknesses
- Improved AI system performance through targeted fixes