Project Bench Update: 50/200 Solved — All 100% Correct

A quick update on Project Bench progress.

As of today, Jarvi3 AI has solved 50 of 200 Project Bench tasks. Every solution — all 50 — has been verified as 100% correct.

What is Project Bench?

Project Bench is an internal benchmark of 200 complex, multi-file software engineering tasks. Where SWE-bench Verified tests isolated bug fixes, Project Bench tasks require:

  • Understanding large, interconnected codebases
  • Reasoning across multiple files and modules
  • Making architectural decisions, not just patches
  • Producing solutions that pass comprehensive test suites

These are the kinds of tasks that take experienced engineers hours — not minutes.

Why 100% Correctness Matters

Many AI benchmarks award partial credit. A model that gets 60% of the test cases right can still be credited with "solving" the problem.

We don't accept partial solutions. Jarvi3 either solves a Project Bench task completely correctly, or it doesn't solve it at all. The 50 tasks it has solved: all correct. No partial marks.
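As a rough illustration of the difference between the two scoring rules, here is a minimal sketch. The `TaskResult` type, its fields, and the task IDs are hypothetical names chosen for this example; they are not Project Bench's actual harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one task's test run (illustrative only)."""
    task_id: str
    tests_passed: int
    tests_total: int


def partial_credit(result: TaskResult) -> float:
    """Partial-credit scoring: the fraction of test cases that passed."""
    return result.tests_passed / result.tests_total


def is_solved(result: TaskResult) -> bool:
    """All-or-nothing scoring: solved only if every test case passed."""
    return result.tests_total > 0 and result.tests_passed == result.tests_total


results = [
    TaskResult("task-001", tests_passed=10, tests_total=10),
    TaskResult("task-002", tests_passed=6, tests_total=10),
]

# Under the all-or-nothing rule, only the fully passing task counts.
solved = [r.task_id for r in results if is_solved(r)]
```

Under partial credit, the second task would still contribute a score of 0.6. Under the all-or-nothing rule it contributes nothing.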

This is a deliberate product decision, not a benchmark gaming strategy. In real engineering work, a half-correct answer is often worse than no answer — it misleads, it breaks production, it wastes review time.

The Architecture Behind It

Moving from SWE-bench to Project Bench required two things:

  1. Deeper codebase traversal — Project Bench tasks span multiple files. We extended the code generation lane to maintain context across files and modules, not just within a single function.

  2. SuperMath Brain integration — A significant portion of Project Bench tasks involve algorithms with provable correctness properties. Routing these through the deterministic logical reasoning layer — rather than generation — eliminates a major source of errors.
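The routing idea in the second point can be sketched as follows. This is only an assumption about the general shape of such a router: the `Task` type, the `has_provable_spec` flag, and the lane names are hypothetical, and the post does not describe how Jarvi3 actually classifies tasks.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task descriptor (illustrative only)."""
    task_id: str
    # Assumed flag: True when the task centers on an algorithm whose
    # correctness can be checked deterministically.
    has_provable_spec: bool


def route(task: Task) -> str:
    """Pick a processing lane for a task.

    Tasks with provable correctness properties go to a deterministic
    reasoning lane; everything else goes to the code generation lane.
    """
    if task.has_provable_spec:
        return "deterministic_reasoning"
    return "code_generation"


tasks = [
    Task("sort-invariant", has_provable_spec=True),
    Task("refactor-api", has_provable_spec=False),
]

# Map each task ID to the lane it would be sent to.
lanes = {t.task_id: route(t) for t in tasks}
```

The point of the split is that the deterministic lane never has to guess: a task either meets its specification or it is rejected, which matches the all-or-nothing standard described earlier.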

More technical detail on SuperMath Brain is coming. We are also evaluating open-source options.

Timeline

We're 25% through the benchmark (50 of 200 tasks) with a 0% error rate on the tasks solved so far. We're on track for 200/200 by June 2026.

When that happens, we'll publish a full technical write-up.

Follow along at jvi3.com.
