A quick update on Project Bench progress.
As of today, Jarvi3 AI has solved 50 of 200 Project Bench tasks. Every solution — all 50 — has been verified as 100% correct.
What is Project Bench?
Project Bench is an internal benchmark of 200 complex, multi-file software engineering tasks. Where SWE-bench Verified tests isolated bug fixes, Project Bench tasks require:
- Understanding large, interconnected codebases
- Reasoning across multiple files and modules
- Making architectural decisions, not just patches
- Producing solutions that pass comprehensive test suites
These are the kinds of tasks that take experienced engineers hours — not minutes.
Why 100% Correctness Matters
Many AI benchmarks award partial credit. A model that gets 60% of the test cases right can still be counted as "solving" the problem.
We don't accept partial solutions. Jarvi3 either solves a Project Bench task completely correctly, or it doesn't solve it at all. The 50 tasks it has solved: all correct. No partial marks.
This is a deliberate product decision, not a benchmark gaming strategy. In real engineering work, a half-correct answer is often worse than no answer — it misleads, it breaks production, it wastes review time.
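To make the difference between the two scoring rules concrete, here is a minimal sketch in Python. The `TaskResult` type, the function names, and the sample numbers are illustrative assumptions only. They are not the actual Project Bench harness or its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical record of one task's test run (not the real harness format)."""
    task_id: str
    tests_passed: int
    tests_total: int


def partial_credit(result: TaskResult) -> float:
    """Partial-credit scoring: the fraction of tests that passed."""
    return result.tests_passed / result.tests_total


def strictly_solved(result: TaskResult) -> bool:
    """All-or-nothing scoring: solved only if every test passes."""
    return result.tests_total > 0 and result.tests_passed == result.tests_total


# Made-up sample results, for illustration only.
results = [
    TaskResult("task-a", 10, 10),
    TaskResult("task-b", 6, 10),
    TaskResult("task-c", 0, 10),
]

strict_count = sum(strictly_solved(r) for r in results)
partial_mean = sum(partial_credit(r) for r in results) / len(results)

print(f"strictly solved: {strict_count} of {len(results)}")
print(f"mean partial credit: {partial_mean:.2f}")
```

Under the strict rule, only `task-a` counts. Under partial credit, `task-b` still contributes to the average even though 4 of its 10 tests fail.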
The Architecture Behind It
The improvement from SWE-bench to Project Bench required two things:
- Deeper codebase traversal: Project Bench tasks span multiple files. We extended the code generation lane to maintain context across files and modules, not just within a single function.
- SuperMath Brain integration: A significant portion of Project Bench tasks involve algorithms with provable correctness properties. Routing these through the deterministic logical reasoning layer, rather than generation, eliminates a major source of errors.
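The routing idea in the second point can be sketched roughly as follows. This is a toy illustration under stated assumptions, not Jarvi3's implementation: the `Task` type, the `has_provable_spec` flag, and both solver functions are hypothetical stand-ins, and how real tasks are classified is not described here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    """Hypothetical task descriptor."""
    task_id: str
    # Assumed flag: True when the task has provable correctness properties.
    has_provable_spec: bool


def solve_deterministically(task: Task) -> str:
    """Stand-in for a deterministic logical reasoning layer."""
    return f"deterministic:{task.task_id}"


def solve_by_generation(task: Task) -> str:
    """Stand-in for the code generation lane."""
    return f"generated:{task.task_id}"


def route(task: Task) -> Callable[[Task], str]:
    """Pick a solver: deterministic when a provable spec exists, else generation."""
    if task.has_provable_spec:
        return solve_deterministically
    return solve_by_generation


for task in (Task("interval-merge", True), Task("refactor-auth", False)):
    solver = route(task)
    print(task.task_id, "->", solver(task))
```

The point of the split is that a task with a checkable specification never reaches the generative path, so that path cannot introduce errors into it.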
More technical detail on SuperMath Brain is coming. Open-source possibilities are being evaluated.
Timeline
We're 25% through the benchmark (50 of 200 tasks) with a 0% error rate. We're on track for 200/200 by June 2026.
When that happens, we'll publish a full technical write-up.
Follow along at jvi3.com.