Research that ships.
What we’re testing inside The Foundry, the open problems we think matter, the techniques we’re benchmarking, and what graduates into client systems.
Active research areas
What we’re working on right now.
-
Retrieval
Training-free multi-hop QA
How to score 86%+ F1 on HotpotQA without fine-tuning, so the same architecture works in any domain. Currently our flagship.
-
Agents
Tool-use reliability
Why agents fail when chains exceed 5–7 tool calls, and what makes some agent loops stable past 50.
-
Evaluation
Production-faithful evals
Replacing benchmark scores with eval suites that predict deployment behaviour. Most ML evals don't.
-
UX
Confidence surfacing
When the model is uncertain, how do we show that to the user without breaking flow, especially for high-stakes decisions?
-
Cost
Model routing economics
Tiered routing (small model → big model → human) is obvious. Building the trust signals that make routing safe is not.
-
Compliance
Audit-grade logging
Every action emits a structured log entry that is tamper-evident and replayable. What does this look like at scale?
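One common way to make a structured log tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below is illustrative only and is not a description of The Foundry's implementation; the names `append_entry`, `verify`, `replay`, and the `GENESIS` placeholder are all assumptions made for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical placeholder hash preceding the first entry


def _digest(prev_hash: str, action: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) keeps the hash reproducible.
    payload = json.dumps({"prev": prev_hash, "action": action},
                         sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list, action: dict) -> dict:
    """Append a structured entry whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev_hash, "action": action,
             "hash": _digest(prev_hash, action)}
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Recompute the chain; an edited, reordered, or mid-chain-removed entry breaks it.

    Truncating the tail is NOT detected here; that would need the latest hash
    to be anchored somewhere outside the log.
    """
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["action"]):
            return False
        prev_hash = entry["hash"]
    return True


def replay(log: list):
    """Yield actions in order, refusing to replay a log that fails verification."""
    if not verify(log):
        raise ValueError("log failed integrity check")
    for entry in log:
        yield entry["action"]
```

At scale, the open questions are the ones a sketch like this leaves out: where the head hash is anchored, how the chain is sharded, and how verification cost grows with log length.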
From research to product
How a paper becomes a deployed system.
-
Step 01
Hypothesis
Starts with a real client problem we couldn't solve cleanly with existing tools.
-
Step 02
Bench + benchmark
Build the smallest experiment. Run it against the hardest public benchmark we can find.
-
Step 03
Harden
Eval suite → load test → adversarial test. Most ideas die here. The survivors graduate.
-
Step 04
Deploy
Goes into a client system as a Bespoke Build, instrumented to feed signal back into the next iteration.
See the flagship benchmark.
Our HotpotQA result is our flagship case study: 86.8% F1, training-free, and it runs out of the box on any domain.
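For readers unfamiliar with the metric, here is a simplified sketch of SQuAD-style token-level F1, which is how answer F1 is usually computed for extractive QA. It assumes the figure above refers to answer-level F1, which the page does not specify. The official HotpotQA evaluation script additionally special-cases yes/no answers and reports supporting-fact and joint scores, all of which this sketch omits. The function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list:
    """Lowercase, strip punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between prediction and gold."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A dataset-level score is then the mean of `token_f1` over all questions, so a partially correct answer such as "Paris, France" against the gold answer "Paris" earns partial credit rather than zero.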