The Foundry
Where we build what doesn’t exist yet.
BespokeWorks' research and development arm. We build AI systems from first principles, test them against the hardest public benchmarks, and ship what works directly into client products.
Flagship
Benchmark-Breaking Retrieval AI.
We built a question-answering system that outperforms every published alternative. When given a complex question that requires connecting information across multiple documents, our system finds the right answer more reliably than any other publicly documented approach.
On HotpotQA, a standard academic benchmark for multi-hop reasoning used across research labs and industry, our system achieved an F1 score of 86.8%.
Why this matters: Unlike most high-scoring systems, ours doesn't require training on the benchmark dataset. Fine-tuned models tend to score higher on the specific data they're trained on, but often generalise poorly beyond it. Our system is training-free. It works out of the box on any domain. Finance documents, medical records, legal contracts. Same architecture, same accuracy. That's the difference between a research result and a production system.
When your AI gets the answer wrong, someone has to catch it. Higher accuracy means fewer errors, less human review, and more trust in the system. The gap between 79.5% (the next-best published result, StepChain GraphRAG) and our 86.8% isn't academic. It's the difference between a system that needs constant supervision and one that works.
HotpotQA · F1 Score (0–100%)
Public benchmark · Methodology and results published · Reproducible by any researcher.
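For readers who want to see what the F1 figure measures, here is a minimal sketch of answer-level token F1 in the style commonly used for extractive QA benchmarks. It is an illustration only, not BespokeWorks' evaluation code: the function names (`normalize`, `token_f1`, `mean_f1`) are hypothetical, the normalisation rules are an assumption, and the official HotpotQA evaluation script may differ in details.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumed normalisation; the official scorer may apply different rules.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall for one answer."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_f1(pairs: list[tuple[str, str]]) -> float:
    """Average F1 over (prediction, gold) pairs, reported on a 0-100 scale."""
    scores = [token_f1(pred, gold) for pred, gold in pairs]
    return 100.0 * sum(scores) / len(scores)
```

Under this sketch, a prediction of "Paris, France" against a gold answer of "Paris" earns partial credit (precision 0.5, recall 1.0), which is why F1 is more forgiving than exact match while still penalising padded or incomplete answers.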
From The Foundry
Case studies
Beating Every Published Benchmark for Multi-Hop QA
Our training-free RAG system achieved F1 86.8% on HotpotQA, outperforming StepChain GraphRAG (79.5%) and every other published result.
Read the case study →
Inside Our Free AI Business Analyser
How our free instant business analyser audits any company website in about five minutes and returns a personalised 3-phase AI automation roadmap.
Read the case study →
A Blog Generator That Outscored Opus 4.6 and the Internet's Top Writers
Four consecutive production posts averaged 81.8 / 100 on our open 8-category benchmark. Raw Claude Opus 4.6 scored 65. GPT-4o scored 55.
Read the case study →
Generating Full MRI Scans from Partial Data
Using diffusion models to help clinicians work with complete imaging when only fragments are available.
Case study coming soon
Free Tools
Working tools, released publicly.
Tools we have built in The Foundry and released publicly. No signup, no credit card.
Free AI Business Analyser
Enter any website URL and get a personalised 3-phase AI automation roadmap in about five minutes.
Launch the analyser →
Business Analysis Cost Calculator
Interactive calculator estimating what a full business analysis would cost if you hired a human consultant. Transparent formula, live stage breakdown.
Open the calculator →
Coming soon
More tools in the pipeline.
We publish new free tools as we build them. If there is a specific piece of analysis you wish were automated, let us know.
Tell us what you need →
Pipeline
How research becomes product.
Everything we build in The Foundry eventually ships to clients. The pipeline is straightforward.
01 Research: Build new techniques from first principles
02 Benchmark: Test against public academic datasets
03 Harden: Production-grade reliability and speed
04 Deploy: Integrate into client systems
When we build a chatbot or knowledge system for your business, it's running the same architecture we've stress-tested against the hardest public benchmarks. You get research-grade AI without the research timeline.
Behind the work
The team and the research that drive The Foundry.
About
Meet the team
Who we are, why we started BespokeWorks, and how we work with UK SMEs.
Read more →
Applied research
Research → product
How Foundry research becomes shipped client systems: our methodology, ethics, and pipeline.
Read more →
Careers
Work with us
Open roles, how we hire, and what it's like building inside The Foundry.
Read more →
See what this means for your business.
The same architecture that scored 86.8% on HotpotQA, one of the hardest public benchmarks for multi-hop retrieval, is what runs inside your system. Run the free analyser to see exactly where it applies.