Open Source AI Agent Testing Framework with Auto-Generated Benchmarks

Building AI agents is hard. AI agent testing is harder.

I’ve been building BrandCast as a one-person company using Claude Code agents across multiple repositories:

  • brandcast - Product code (backend, frontend, database migrations)
  • brandcast-marketing - Content, SEO, analytics, weekly planning
  • brandcast-biz - Business operations, customer outreach, pricing

Each repo has specialized agents doing specific work: SEO optimization, content publishing, code review, weekly planning, customer discovery, competitive analysis. At last count, 7 production agents running daily.

The problem? I had no systematic way to know if they were getting better or worse.

Traditional unit tests don’t work when your “code” is a prompt and your output is non-deterministic. You can’t assert exact string matches on LLM output. AI agent testing requires a completely different approach.

So I built an AI agent testing framework using AI to evaluate AI. And today, I’m open sourcing it.

Repository: github.com/BrandCast-Signage/agent-benchmark-kit

The Problem with AI Agent Testing

When you’re building agents with Claude or GPT-4, you can’t write traditional tests:

# This doesn't work for LLMs
assert agent.output == "Expected exact string"

LLMs are non-deterministic. The same input can produce different outputs. You need semantic evaluation, not exact matching.
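
To make that concrete, here is a minimal runnable sketch in Python. The two strings stand in for two runs of the same agent on the same input; even a crude keyword check is more robust than exact matching, though the framework goes further with an LLM judge:

# Two runs of the same prompt: same meaning, different wording
out1 = "The post is missing a title field in the frontmatter."
out2 = "Frontmatter has no 'title' field; please add one."

print(out1 == out2)  # False: exact matching is the wrong tool for LLM output

def mentions_issue(output: str, keywords: list[str]) -> bool:
    """Crude semantic check: did the agent mention the issue in any wording?"""
    text = output.lower()
    return any(k in text for k in keywords)

print(mentions_issue(out1, ["title"]), mentions_issue(out2, ["title"]))  # True True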

You need to test things like:

  • Did the agent catch the critical issues?
  • Did it avoid false positives?
  • Is the output actionable?
  • Does it follow best practices?

These require human judgment—or an LLM-as-judge.

What I Built: Agent Benchmark Kit

The framework has three core components that work together:

1. Interactive Test Suite Creator

Manual test creation is a massive barrier to systematic agent QA. Writing test cases, defining ground truth, creating scoring rubrics—it’s time-consuming.

The test-suite-creator agent asks you 5 questions and generates a complete benchmark suite:

  1. What does your agent do?
  2. What validations does it perform?
  3. What are common edge cases?
  4. What would “perfect” output look like?
  5. What would “clearly failing” output look like?

From those answers, it generates:

  • 5 diverse test cases
  • Ground truth expectations (JSON)
  • 100-point scoring rubric
  • README and configuration

Time to first benchmark: under an hour.
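
For reference, a generated suite might be organized something like this. The layout and file names below are my illustration (only ~/.agent-benchmarks/, METRICS.md, and performance-history.json are names used elsewhere in this post), but the pieces map to the list above:

~/.agent-benchmarks/my-agent/
├── README.md                  # what the suite covers and how to run it
├── METRICS.md                 # the 100-point scoring rubric
├── performance-history.json   # scores over time
├── test-01-perfect-case/
│   ├── input.md               # the input your agent is run against
│   └── ground-truth.json      # expected findings for the judge
├── test-02-single-issue/
└── ...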

2. LLM-as-Judge Evaluation

The benchmark-judge agent compares your agent’s output against ground truth using objective criteria:

{
  "must_catch_issues": [
    "Missing required field 'title' in frontmatter",
    "Description is 45 characters (needs 120-160)"
  ],
  "validation_checks": {
    "metadata_complete": {
      "expected": "all required fields present",
      "status": "fail"
    }
  }
}

The judge scores 0-100 based on your custom rubric. In my testing, it achieves 95%+ agreement with manual human scoring.
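
Conceptually, the judging step is a single structured LLM call. Here is a minimal sketch using the Anthropic Python SDK; the prompt wording, response schema, and helper name are my illustration, not the kit's actual implementation:

import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(agent_output: str, ground_truth: dict, rubric: str, model: str) -> dict:
    """Score agent output against ground truth and a rubric, returning parsed JSON."""
    prompt = (
        "You are a strict evaluator. Score the agent output from 0-100 using the rubric.\n\n"
        f"RUBRIC:\n{rubric}\n\n"
        f"GROUND TRUTH:\n{json.dumps(ground_truth, indent=2)}\n\n"
        f"AGENT OUTPUT:\n{agent_output}\n\n"
        'Respond with JSON only: {"score": <int>, "missed_issues": [], "false_positives": []}'
    )
    response = client.messages.create(
        model=model,          # whichever Claude model you use as the judge
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    # A real harness would validate this JSON before trusting it
    return json.loads(response.content[0].text)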

3. Test Rotation & Performance Tracking

As your agent improves and starts scoring 95+ consistently, the orchestrator suggests new, harder tests.

When a test scores 100 three consecutive times, it gets retired (your agent mastered it).

Performance is tracked in performance-history.json:

{
  "2025-11-01": { "average": 88.2 },
  "2025-11-09": { "average": 90.4 }
}
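
Both rules are simple to state in code. Here is a sketch of the rotation logic described above (the function and parameter names are mine, and how many runs count as "consistently" is an assumption):

def should_retire(test_scores: list[int]) -> bool:
    """Retire a test once the agent scores 100 on it three consecutive times."""
    return len(test_scores) >= 3 and all(s == 100 for s in test_scores[-3:])

def needs_harder_tests(suite_averages: list[float], threshold: float = 95.0) -> bool:
    """Suggest new, harder tests once the suite average holds at 95+ (here: 3 runs)."""
    return len(suite_averages) >= 3 and all(a >= threshold for a in suite_averages[-3:])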

Real Results from Production

I use this framework across all my BrandCast repos. Here are actual scores from production agents:

Agent             | Baseline  | Current   | Improvement           | Days
SEO Specialist    | 88/100    | 90/100    | +2.3%                 | 8
Content Publisher | 97.5/100  | 97.5/100  | Excellent baseline    | -
Weekly Planner    | 85/100    | 87/100    | Tracked over 12 weeks |

The SEO agent improvement is small but meaningful. The benchmark revealed it was missing citation validation in blog posts—statistics without source links. I added explicit citation checks to the prompt. Score went from 88 → 90.

Without the benchmark framework, I wouldn't have caught that, and uncited statistics would have shipped to production.

How It Works Across Multiple Repos

The framework installs once in Claude Code and works across all your repositories:

# Install once
/plugin marketplace add https://github.com/BrandCast-Signage/agent-benchmark-kit

# Use in any repo
cd ~/brandcast && /benchmark-agent seo-specialist
cd ~/brandcast-marketing && /benchmark-agent content-publisher
cd ~/brandcast-biz && /benchmark-agent customer-discovery

Test suites live in ~/.agent-benchmarks/ by default, so they’re available regardless of which repo you’re in.

Example: Content Validator Agent

Let’s say you have a blog post validation agent in your marketing repo. You run:

/benchmark-agent --create content-validator

The test-suite-creator asks the 5 questions. You describe your agent. It generates:

Test #01: Perfect Post

  • Complete metadata
  • All statistics cited
  • Proper formatting
  • Purpose: Baseline (agent MUST NOT flag valid content)

Test #02: Missing Metadata

  • Missing required fields
  • Wrong date format
  • Purpose: Tests metadata validation

Test #03: Broken Citations

  • 15+ statistics without sources
  • Vague attributions (“experts say”)
  • Purpose: Content integrity validation

Test #04: Missing Image

  • No hero image specified
  • Purpose: Resource validation

Test #05: Format Errors

  • YAML syntax error
  • No H1 header
  • Very short content
  • Purpose: Comprehensive test (multiple issues)

Each test has ground truth expectations in JSON. The METRICS.md defines a 100-point scoring rubric.
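
For example, the ground truth for Test #02 (Missing Metadata) might look like this, following the same shape as the judge example earlier (the exact schema and check names come from the generator; these are illustrative):

{
  "must_catch_issues": [
    "Missing required field 'title' in frontmatter",
    "Publication date is not in YYYY-MM-DD format"
  ],
  "validation_checks": {
    "metadata_complete": {
      "expected": "all required fields present",
      "status": "fail"
    },
    "date_format_valid": {
      "expected": "ISO 8601 date (YYYY-MM-DD)",
      "status": "fail"
    }
  }
}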

You run benchmarks:

/benchmark-agent content-validator

Output:

🎯 Running benchmarks for: content-validator

Test #01: Perfect Post .............. 100/100 ✅
Test #02: Missing Metadata ........... 92/100 ✅
Test #03: Broken Citations ........... 85/100 ✅
Test #04: Missing Image .............. 88/100 ✅
Test #05: Format Errors .............. 78/100 ⚠️

📊 Overall Score: 88.6/100 (PASS - threshold: 80)

💡 Recommendations:
  - Test #05 scoring below 80 - review YAML syntax detection

You fix the agent. Re-run. Track improvement over time.

Why I Open Sourced It

I built this because I needed it. Running a one-person company with AI agents means I need systematic quality checks and feedback loops, not manual spot-checking.

But the real value comes from the framework, not my specific test cases.

Every team building AI agents faces the same problem: How do you know if your agent is getting better or worse?

Open sourcing this:

  • Helps other teams avoid the mistakes I made
  • Gets feedback to make the framework better
  • Establishes transparency as part of my brand (I’ll publish agent scores publicly)
  • Turns an internal QA tool into a way to demonstrate quality

Key Differences from Other Tools

vs. Manual Testing:

  • Auto-generates test suites (5 questions → complete benchmark)
  • Systematic evaluation (not ad-hoc)
  • Performance tracking over time

vs. PromptFoo / LangSmith:

  • Interactive test creation (eliminates manual test barrier)
  • Production examples included (learn from real usage)
  • Test rotation built-in (keeps tests challenging)
  • Native Claude Code integration (works across repos)

Getting Started

Install

# In Claude Code
/plugin marketplace add https://github.com/BrandCast-Signage/agent-benchmark-kit

This installs the framework once and makes it available in all your repositories.

Create Your First Benchmark

/benchmark-agent --create my-agent

Answer the 5 questions. It generates everything.

Run Benchmarks

/benchmark-agent my-agent

See the scores. Fix issues. Re-run. Track improvement.

What I Learned Building This

1. Test #01 Must Be Perfect

Your first test case should be perfect input with zero issues. This is critical.

Why? It detects false positives. If your agent flags “perfect” content as broken, you have a problem.

I learned this the hard way. My SEO agent kept flagging valid blog posts because I didn’t have a perfect baseline test.
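
In ground truth terms, the baseline test is the one where the "must catch" list is empty, so anything the agent flags is a false positive the judge can penalize. Illustrative, using the same shape as the examples above:

{
  "must_catch_issues": [],
  "validation_checks": {
    "metadata_complete": {
      "expected": "all required fields present",
      "status": "pass"
    },
    "citations_present": {
      "expected": "every statistic has a source link",
      "status": "pass"
    }
  }
}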

2. Ground Truth Must Be Objective

Bad ground truth:

{
  "must_catch_issues": ["Content has quality problems"]
}

Good ground truth:

{
  "must_catch_issues": [
    "Description is 45 characters (needs 120-160)",
    "Missing required field 'title' in frontmatter"
  ]
}

Specific. Measurable. No ambiguity.

3. Start with 5 Tests, Add More Later

Don’t try to cover every edge case on day one. Start with:

  1. Perfect case (baseline)
  2. Single issue (common error)
  3. Quality issue (deeper validation)
  4. Edge case
  5. Multiple issues (comprehensive)

Add more tests as your agent improves and scores 95+.

4. LLM-as-Judge Works (with Good Rubrics)

I was skeptical about LLM-as-judge. Wouldn’t it be too subjective?

Turns out: if you have objective criteria in ground truth, LLM-as-judge is highly consistent. I target 95%+ agreement with manual scoring.

The key is the rubric. Vague criteria → inconsistent scores. Specific criteria → reliable evaluation.
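
As a reference point, a 100-point rubric for a content validator might allocate points like this (my illustration, not a generated METRICS.md):

  • 40 points - Catches every issue in must_catch_issues (partial credit per issue)
  • 25 points - Zero false positives on valid content
  • 20 points - Findings are specific and actionable (file, field, expected value)
  • 15 points - Output follows the agent's required format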

The Bigger Picture

This is one piece of building BrandCast as a one-person company. Other systems I’m working on:

  • Agent prompt versioning - Track changes, measure impact, roll back when needed
  • Cross-repo agent orchestration - Marketing agents trigger product agents trigger biz agents
  • Automated health checks - Daily validation that all systems are working
  • Performance monitoring - Track API costs, latency, success rates across agents

The benchmark kit measures whether the agents are improving over time. The other systems ensure they play nicely together.

I’ll be open sourcing more of these tools as they mature.

What’s Next

I’m using this framework internally for all BrandCast agents across three repositories. Next steps:

  1. Public performance tracking - Publish agent scores on my docs site (full transparency)
  2. More examples - Add benchmarks for my other agent types (customer discovery, competitive analysis, etc.)
  3. CI/CD integration - Run benchmarks automatically on agent prompt changes
  4. Community contributions - Accept sanitized benchmark suites from other solo builders

Try It

If you’re building AI agents and struggling with systematic QA, give this framework a try.

It’s MIT licensed. Free. Open source. Production-tested across multiple repositories in a real company.

Repository: github.com/BrandCast-Signage/agent-benchmark-kit

Feedback and contributions welcome. Open an issue, submit a PR, or just star the repo if you find it useful.


Building BrandCast’s AI agent infrastructure has been a fascinating journey. This is the first of several tools I’m open sourcing. Next up: agent prompt versioning and deployment system.

If you’re working on similar problems—building a company with AI agents—I’d love to hear about it. Reach out on GitHub or LinkedIn.