Judging
When you run multiple agents on the same task, judge agents can automatically evaluate the results and help you identify the best solution. Judges analyze code quality, correctness, and completeness—providing objective feedback that saves you time reviewing variants.
What Are Judge Tasks?
A judge task is a special task that evaluates other tasks in a group. Unlike regular tasks that modify your code, judges:
- Run after primary tasks complete
- Have read-only access to primary task results
- Analyze code quality, correctness, and completeness
- Produce evaluation notes and scores
- Do not modify your repositories
Judge tasks appear in the task group alongside other variants.
How Judges Evaluate
When a judge task runs, it:
- Reads all primary task results—patches, summaries, exit codes, logs
- Analyzes the code changes each agent made
- Reviews test results and error messages
- Evaluates each variant on multiple dimensions
- Generates a detailed report with scores and recommendations
Evaluation Dimensions
Judges score variants on:
- Correctness: Does the code solve the problem? Are edge cases handled? Do tests pass?
- Code quality: Is it readable, maintainable, and following good patterns?
- Completeness: Are all requirements addressed? Is anything missing?
- Performance: Is the implementation efficient? (when applicable)
Each dimension receives a score, and judges provide detailed notes explaining their reasoning.
Judges may use or add other dimensions based on the task context.
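As a rough illustration of what one judge's output for one variant might look like, the sketch below models per-dimension scores and notes. The class name, field names, the 0-10 scale, and the unweighted average are all assumptions made for this example; the product's actual report format is not specified here.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class VariantEvaluation:
    """One judge's assessment of one variant (hypothetical structure)."""
    variant: str
    scores: dict[str, float]                 # dimension name -> score (0-10 scale assumed)
    notes: dict[str, str] = field(default_factory=dict)

    def overall(self) -> float:
        """Unweighted mean over whichever dimensions the judge scored."""
        return mean(list(self.scores.values()))


# Illustrative values only; "performance" is omitted because it is scored only when applicable.
evaluation = VariantEvaluation(
    variant="variant-a",
    scores={"correctness": 8, "code_quality": 7, "completeness": 9},
    notes={"correctness": "All tests pass; the empty-input case is not covered."},
)
```

Keeping scores in a dictionary rather than fixed fields leaves room for the extra dimensions a judge may add for a given task.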
Automatic Judging
You can configure task groups to automatically launch judge tasks when primary agents finish.
Configuring Auto-Judge
When creating a task group, select which agents should serve as judges.
Multiple judges provide independent evaluations, reducing bias and increasing confidence in the results.
When Auto-Judge Launches
Judge tasks launch automatically when:
- All primary tasks have completed
- At least two variants have finished successfully
- More than one variant has made file changes
- No follow-up instructions are pending
If conditions aren't met, auto-judge is skipped—but you can always launch judges manually.
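The launch conditions above can be read as a single predicate over the primary tasks in a group. The following is a minimal sketch of that logic, not the product's implementation: `PrimaryTask`, its fields, and `should_auto_judge` are hypothetical names, and "multiple variants" is interpreted as two or more.

```python
from dataclasses import dataclass


@dataclass
class PrimaryTask:
    """Hypothetical summary of one primary task in a group."""
    completed: bool
    succeeded: bool
    changed_files: int
    pending_follow_up: bool = False


def should_auto_judge(tasks: list[PrimaryTask]) -> bool:
    """Return True only when every listed launch condition holds."""
    all_completed = all(t.completed for t in tasks)
    successful = sum(1 for t in tasks if t.succeeded)
    with_changes = sum(1 for t in tasks if t.changed_files > 0)
    no_pending = not any(t.pending_follow_up for t in tasks)
    return all_completed and successful >= 2 and with_changes >= 2 and no_pending
```

For example, a group where one of two variants failed or changed no files would not trigger auto-judge under this reading, which is the case where a manual launch is still available.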
Manual Judge Launch
You can launch judge tasks at any time:
- Open the task group
- Click the Judge icon
- Select which agents to use as judges
- Judge tasks are created and queued
This is useful when auto-judge conditions weren't met, or when you want additional evaluation after making changes.
Judge Consensus
When multiple judges evaluate the same variants:
- Each judge scores independently
- Results can be compared side by side
- Consensus emerges when judges agree on the best variant
- Disagreements highlight areas worth closer review
If two out of three judges recommend the same variant, that's a strong signal. If judges disagree significantly, you may want to review their reasoning before deciding.
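If you export each judge's recommendation, checking for agreement is a simple majority count. The sketch below assumes a mapping from judge name to recommended variant; the function name and the tie handling are illustrative choices, not part of the product.

```python
from collections import Counter
from typing import Optional


def consensus(recommendations: dict[str, str]) -> tuple[Optional[str], float]:
    """Return (variant, share of judges) for the most recommended variant.

    The variant is None when there are no judges or when two variants
    tie for first place, which signals that closer review is needed.
    """
    if not recommendations:
        return None, 0.0
    counts = Counter(recommendations.values()).most_common()
    top_variant, top_votes = counts[0]
    share = top_votes / len(recommendations)
    if len(counts) > 1 and counts[1][1] == top_votes:
        return None, share
    return top_variant, share


winner, share = consensus(
    {"judge-1": "variant-a", "judge-2": "variant-a", "judge-3": "variant-b"}
)
```

Here two of three judges agree, so `winner` is `"variant-a"` with a two-thirds share. A tie returns `None`, the case where reading each judge's reasoning matters most.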
Using Judge Feedback
Judge feedback isn't just for picking a winner—it helps you improve the code.
Common Issues Judges Identify
- Test failures: Some tests aren't passing
- Edge cases: Boundary conditions not handled
- Error handling: Missing validation or exception handling
- Code style: Inconsistent naming or formatting
- Incomplete implementation: Features not fully implemented
Feedback Loops
After reviewing judge feedback:
- Identify specific issues mentioned in the evaluation
- Send follow-up instructions to the winning variant addressing those issues
- The agent resumes and implements improvements
- Optionally re-run judges to verify the improvements
This creates a refinement cycle: judges catch issues, and agents fix them once you send follow-up instructions.
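Follow-up instructions tend to work best when they list the judges' findings explicitly. As a small sketch, the hypothetical helper below turns a list of issues into a numbered instruction you could paste into a follow-up; the wording and function name are assumptions for illustration.

```python
def build_follow_up(issues: list[str]) -> str:
    """Format judge-reported issues as a numbered follow-up instruction."""
    if not issues:
        return ""
    lines = ["Please address the following issues raised by the judges:"]
    lines += [f"{i}. {issue}" for i, issue in enumerate(issues, start=1)]
    return "\n".join(lines)


message = build_follow_up([
    "Two tests are failing",
    "Empty input is not validated",
])
```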
Judges Don't Approve
Important: Judge tasks provide feedback and recommendations only. They do not:
- Automatically approve changes
- Commit or push code
- Mark tasks as winners
You make the final decision on winner selection and approval. Judges inform your decision—they don't make it for you.