The Foundry · Case Study

Beating every published HotpotQA result for multi-hop QA

A training-free RAG pipeline that beats every published HotpotQA result, works on any domain without retraining, and runs inside every knowledge system we deploy.

01

The problem: AI hallucinations in retrieval

Large Language Models (LLMs) are powerful, but they have a fundamental flaw: they hallucinate. They generate plausible-sounding but factually incorrect information. This is a major problem for businesses that need accurate, verifiable answers from their data.

To combat this, a technique called Retrieval Augmented Generation (RAG) was developed. Instead of relying solely on the LLM’s internal knowledge, RAG systems first retrieve relevant information from a trusted knowledge base, like your company documents, and then provide that information to the LLM as context for generating an answer.

A typical RAG pipeline looks like this:

  1. Retrieve. Given a user’s question, search a database of documents to find relevant passages.
  2. Augment. Combine the original question with the retrieved passages, forming a prompt for the LLM.
  3. Generate. The LLM uses this augmented prompt to generate an answer, grounded in the provided context.
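
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Foundry pipeline: the word-overlap retriever, the prompt wording, the sample passages, and the stub `llm` callable are all assumptions made for the example.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and strip basic punctuation; a toy tokenizer."""
    tokens = (t.strip(".,?!").lower() for t in text.split())
    return [t for t in tokens if t]


def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Step 1: rank passages by word overlap with the question."""
    q_counts = Counter(tokenize(question))

    def overlap(passage: str) -> int:
        return sum((q_counts & Counter(tokenize(passage))).values())

    return sorted(passages, key=overlap, reverse=True)[:k]


def augment(question: str, contexts: list[str]) -> str:
    """Step 2: combine the question and retrieved passages into one prompt."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, start=1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )


def generate(prompt: str, llm) -> str:
    """Step 3: hand the augmented prompt to any callable LLM client."""
    return llm(prompt)


# Hypothetical sample data, for illustration only.
passages = [
    "Ian McKellen played Gandalf in The Lord of the Rings.",
    "Paris is the capital of France.",
    "Bryan Singer directed X-Men.",
]
question = "Who played Gandalf in The Lord of the Rings?"

top = retrieve(question, passages, k=1)
prompt = augment(question, top)
answer = generate(prompt, llm=lambda p: "Ian McKellen")  # stub model
```

A production retriever would use a search index or embeddings rather than raw word overlap, but the retrieve, augment, generate shape stays the same.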

RAG significantly reduces hallucinations and improves factual accuracy. However, not all RAG systems are created equal, and the difference between a mediocre pipeline and a great one is exactly what determines whether you can trust the answer.

02

HotpotQA: the adversarial benchmark

When you ask an AI a question that requires connecting information from multiple sources, most RAG systems fail. They don’t fail obviously: they give you an answer that sounds right but isn’t. The more hops of reasoning a question needs, the more likely this becomes.

HotpotQA is an academic benchmark specifically designed to test this: can your system answer questions that require reading two or more passages and connecting the dots? It’s an adversarial dataset, meaning questions are crafted to trick models that rely on simple keyword matching or single-document retrieval.

For example, a question might be: “Who was the director of the movie starring the actor who played Gandalf in The Lord of the Rings?”

This requires three hops:

  1. Identify “Gandalf” and “The Lord of the Rings” to find the actor (Ian McKellen).
  2. Use “Ian McKellen” to find a movie he starred in (e.g. X-Men).
  3. Find the director of that movie (Bryan Singer).

A simple RAG system might retrieve documents about Gandalf and The Lord of the Rings, but fail to connect that information to another movie and its director.
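
To make the chain concrete, here is a small sketch that treats each hop as a lookup whose output becomes the next hop’s input. The `FACTS` table and relation names are hypothetical, built only from the example above; a real system has to discover each hop from retrieved text.

```python
# Hypothetical fact table built from the example question.
FACTS = {
    ("played_by", "Gandalf"): "Ian McKellen",
    ("starred_in", "Ian McKellen"): "X-Men",
    ("directed_by", "X-Men"): "Bryan Singer",
}


def follow_hops(start: str, relations: list[str], facts: dict) -> list[str]:
    """Follow each relation in turn, feeding one hop's answer into the next."""
    trail = [start]
    for relation in relations:
        trail.append(facts[(relation, trail[-1])])
    return trail


full_chain = follow_hops("Gandalf", ["played_by", "starred_in", "directed_by"], FACTS)
single_hop = follow_hops("Gandalf", ["played_by"], FACTS)
```

A system that stops after the first lookup ends at the actor, which is a related entity but not the director the question asks for.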

Most standard RAG systems score around 72% F1 on HotpotQA. The best published result, StepChain GraphRAG, which uses knowledge graphs and multi-step chain-of-thought reasoning, reaches 79.5%. We wanted to do better.

The questions in HotpotQA were engineered to trick models that rely on simple keyword matching. Most standard RAG pipelines fall right into the trap.

03

The result: outperforming SOTA

Our system scored F1 86.8% on HotpotQA. That’s 7.3 points above the best published result and 14.8 points above a standard RAG pipeline. The bar chart below puts that gap in context, alongside the human-annotator baseline reported by the dataset authors.

HotpotQA · F1 Score (0–100%)

  • Our system (BespokeWorks Foundry): 86.8%
  • StepChain GraphRAG (published best): 79.5%
  • Standard RAG (baseline approach): 72.0%
  • Human baseline (Yang et al., 2018): 91.4%

What does F1 mean?

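F1 is a common metric in question answering. It compares the words in the system’s answer with the words in the reference answer and combines precision (how much of what the system said is correct?) with recall (how much of the correct answer did it capture?). The score is averaged over all questions, so an F1 of 86.8% does not mean exactly 87 right answers out of 100: fully correct answers score 100%, and partially right answers earn partial credit for the portion of the correct answer they capture.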
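
For readers who want to see the metric itself, the sketch below computes token-level F1 for a single question in the style commonly used for extractive QA benchmarks. The normalisation rules here (lowercasing, dropping punctuation and articles) are a simplified assumption, not the official evaluation script.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and articles, split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall for one answer."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)  # share of predicted tokens that are correct
    recall = overlap / len(ref)      # share of reference tokens that were found
    return 2 * precision * recall / (precision + recall)


exact = token_f1("Bryan Singer", "Bryan Singer")
verbose = token_f1("the director Bryan Singer", "Bryan Singer")
wrong = token_f1("Peter Jackson", "Bryan Singer")
```

A benchmark score is the mean of this per-question value, which is why a verbose but correct answer lowers F1 without being wrong.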

Beyond F1, we track two additional quality signals:

  • Correctness 88.1%. Exact-match, factually accurate answers: not just close, but precisely correct.
  • Faithfulness 97.3%. Answers grounded in the retrieved source documents, with near-zero hallucination.

For context: human annotators score F1 91.4% on this benchmark (Yang et al., 2018). Our system gets within 5 points of human accuracy, in seconds, not minutes.

Try this on your data

Curious how the same pipeline performs on your documents?

The Foundry pipeline runs inside our free Business Analyser. Drop in your website and we’ll show you exactly where multi-hop retrieval applies in your operation, with a costed three-phase roadmap. About five minutes. No signup.

The gap between a 72% pipeline and an 86.8% pipeline isn’t academic. It’s the difference between a system that creates work and one that eliminates it.

04

Why training-free matters

Most systems that score well on HotpotQA have been fine-tuned on HotpotQA data. They have been trained on questions drawn from the same dataset, so they have already seen questions very similar to the ones they are tested on.

Our system has never seen HotpotQA data. It uses a general-purpose pipeline that works on any domain.

This is the difference between a student who memorised the exam answers and one who actually understands the subject.

For businesses, this means: the same system that scores 86.8% on academic questions works on your finance documents, your legal contracts, your medical records, without retraining.

Fine-tuned models break when you move them to a new domain. They need new training data, new compute, new evaluation. Our approach doesn’t. Deploy it on Monday, and it works on whatever documents you point it at.

05

How it works

The core insight: the bottleneck in RAG isn’t retrieval, it’s extraction. Our retrieval already captures roughly 97% of relevant passages. The problem is what happens after. The AI finds the right documents but then extracts the wrong answer from them.

We solved this with a small set of techniques working together. We’re intentionally vague on specifics, because the exact recipe is what makes the system worth running, but here’s the shape of it.

01

Multi-prompt extraction with evidence-weighted voting

We ask the same question several different ways, each time with a slightly different emphasis. Each prompt produces a candidate answer. Then we vote. This is not a simple majority vote but an evidence-weighted one that considers how much supporting evidence each candidate has across all retrieved passages.
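
One way to picture evidence weighting is the sketch below. It is not the production scoring rule, which is deliberately not disclosed here. As a loudly labelled assumption, "evidence" is approximated by the number of retrieved passages that mention a candidate, and each vote counts in proportion to that number. The sample passages and candidates are hypothetical.

```python
from collections import Counter, defaultdict


def evidence_weighted_vote(candidates: list[str], passages: list[str]) -> str:
    """Pick the candidate with the highest total of vote-times-support."""
    scores: dict[str, float] = defaultdict(float)
    for cand in candidates:
        # Assumption: support = number of passages mentioning the candidate.
        support = sum(cand.lower() in p.lower() for p in passages)
        scores[cand] += support
    return max(scores, key=scores.get)


# Hypothetical retrieved passages and per-prompt candidate answers.
passages = [
    "Ian McKellen played Gandalf in The Lord of the Rings, directed by Peter Jackson.",
    "Ian McKellen starred in X-Men, a Bryan Singer film.",
    "X-Men was directed by Bryan Singer.",
    "Bryan Singer returned to direct the sequel X2.",
]
candidates = ["Peter Jackson", "Peter Jackson", "Bryan Singer"]

majority = Counter(candidates).most_common(1)[0][0]
weighted = evidence_weighted_vote(candidates, passages)
```

In this toy case a plain majority picks the candidate two prompts agreed on, while the weighted vote prefers the candidate with more support across the passages.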

02

Bridge entity detection

Multi-hop questions have a hidden structure: you find Entity A in Document 1, then use Entity A to find the answer in Document 2. We detect these bridge entities automatically and use them to guide extraction, so the system follows the reasoning chain rather than jumping to surface-level matches.
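
As a rough illustration of the idea, the sketch below flags an entity as a bridge when it is absent from the question but appears in at least two retrieved passages, so it can link one document to another. The heuristic, the `known_entities` list (standing in for a real entity extractor), and the sample data are assumptions, not the detection method used in the pipeline.

```python
def find_bridge_entities(
    question: str, passages: list[str], known_entities: list[str]
) -> list[tuple[str, list[int]]]:
    """Return (entity, passage indices) for entities that link passages."""
    q = question.lower()
    bridges = []
    for entity in known_entities:
        if entity.lower() in q:
            continue  # already given in the question, so not a bridge
        hits = [i for i, p in enumerate(passages) if entity.lower() in p.lower()]
        if len(hits) >= 2:  # appears in two or more passages: a possible link
            bridges.append((entity, hits))
    return bridges


# Hypothetical inputs for illustration.
question = "Who directed the movie starring the actor who played Gandalf?"
passages = [
    "Gandalf was played by Ian McKellen.",
    "Ian McKellen starred in X-Men.",
    "X-Men was directed by Bryan Singer.",
]
entities = ["Gandalf", "Ian McKellen", "X-Men", "Bryan Singer"]

bridges = find_bridge_entities(question, passages, entities)
```

Here the actor and the film come out as bridges, while the final answer does not, because it appears in only one passage.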

03

Adversarial candidate deliberation

When the prompts disagree and produce three or more unique candidates, we run a deliberation step: the LLM explicitly reasons about which candidate actually answers the question that was asked. This catches “wrong hop” errors where the system finds a related entity but not the one the question is about.
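
The trigger condition described above is simple enough to sketch. The threshold of three unique candidates comes from the text; the de-duplication rule and the wording of the deliberation prompt are assumptions made for illustration.

```python
def unique_candidates(candidates: list[str]) -> list[str]:
    """Collapse case and whitespace variants, keeping the first-seen spelling."""
    seen: dict[str, str] = {}
    for cand in candidates:
        key = " ".join(cand.lower().split())
        seen.setdefault(key, cand.strip())
    return list(seen.values())


def needs_deliberation(candidates: list[str], threshold: int = 3) -> bool:
    """Deliberate only when the prompts produced enough distinct answers."""
    return len(unique_candidates(candidates)) >= threshold


def deliberation_prompt(question: str, candidates: list[str]) -> str:
    """Build a hypothetical prompt asking the model to compare candidates."""
    options = "\n".join(f"- {c}" for c in unique_candidates(candidates))
    return (
        f"Question: {question}\n"
        f"Candidate answers:\n{options}\n"
        "For each candidate, state which step of the reasoning chain it belongs to, "
        "then name the one candidate that answers the question as asked."
    )


agree = needs_deliberation(["Bryan Singer", "bryan singer", "X-Men"])
disagree = needs_deliberation(["Ian McKellen", "X-Men", "Bryan Singer", "bryan singer"])
prompt = deliberation_prompt(
    "Who directed the movie?", ["Ian McKellen", "X-Men", "Bryan Singer"]
)
```

Skipping deliberation when only two distinct answers exist keeps the extra LLM call for the cases where the prompts genuinely diverge.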

04

Precision post-processing

Roughly 36% of errors in standard RAG systems come from formatting: the AI knows the right answer but wraps it in unnecessary context. We strip parentheticals, truncate verbose explanations, normalise name variants, and ground extracted spans against the source text. The answer gets shorter and more precise.
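
Three of these clean-up steps can be sketched with regular expressions. The patterns below are simplified assumptions for illustration (they would, for example, mishandle initials such as "J. R. R."), and name-variant normalisation is left out because it depends on a domain-specific alias list.

```python
import re
from typing import Optional


def strip_parentheticals(answer: str) -> str:
    """Remove bracketed asides such as '(the film's director)'."""
    return re.sub(r"\s*\([^)]*\)", "", answer).strip()


def truncate_explanation(answer: str) -> str:
    """Keep only the text before the first trailing clause or sentence break."""
    pattern = r",\s+(?:who|which|because)\b|[.;](?:\s|$)"
    head = re.split(pattern, answer, maxsplit=1)[0]
    return head.strip(" .")


def ground_in_source(answer: str, source: str) -> Optional[str]:
    """Return the span exactly as it appears in the source, or None if absent."""
    if not answer:
        return None
    match = re.search(re.escape(answer), source, flags=re.IGNORECASE)
    return match.group(0) if match else None


def clean_answer(raw: str, source: str) -> Optional[str]:
    """Apply the three steps in order."""
    return ground_in_source(truncate_explanation(strip_parentheticals(raw)), source)


source = "X-Men was directed by Bryan Singer."
cleaned = clean_answer("bryan singer (the film's director), who made it in 2000.", source)
ungrounded = clean_answer("Peter Jackson", source)
```

Returning `None` for an answer that cannot be found in the source gives the caller a clear signal that the span is not grounded, rather than silently passing it through.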

Each technique contributes. But the biggest single improvement came from multi-prompt voting, asking the same question in different ways and letting the evidence decide. It’s a simple idea, and it works because interpretation diversity surfaces answers that any single prompt might miss.

06

What we tested

We didn’t cherry-pick results. We tested on 35+ cases with random seeds to avoid overfitting to a specific sample.

This matters more than most people realise. A 10-case sample has a confidence interval of roughly ±16%: you can get an F1 of 0.92 one day and 0.78 the next with the same code. We only trust results from 35+ case evaluations with randomised selection.
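
The sample-size effect can be checked with the standard normal-approximation formula for a 95% confidence interval, half-width = 1.96 × s / √n. The per-question standard deviation of 0.26 used below is an assumption, chosen because it reproduces the ±16% figure quoted for 10 cases; it is not a measured value from our runs.

```python
import math


def ci_half_width(std_dev: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a mean."""
    return z * std_dev / math.sqrt(n)


ASSUMED_STD = 0.26  # assumption: spread of per-question F1 scores

for n in (10, 35):
    print(f"n={n}: +/-{ci_half_width(ASSUMED_STD, n):.2f}")
```

Under that assumption the interval narrows from about ±0.16 at 10 cases to about ±0.09 at 35, and it keeps shrinking with the square root of the sample size.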

We tested 16 pipeline variations to find what actually works. Most ideas that sound good on paper made things worse in practice:

  • Sentence isolation: extracting individual sentences instead of spans. Sounded precise. Lost context. Regressed.
  • Comparison decomposition: breaking comparison questions into sub-questions. Added complexity without accuracy. Regressed.
  • Contrastive verification: asking the model to verify its answer against alternatives. Over-corrected. Changed right answers to wrong ones.
  • High-temperature prompt diversity: using temperature 0.7–0.9 for voting prompts. Introduced noise instead of diversity. Regressed.

More complexity doesn’t equal more accuracy. The final system uses techniques that each proved their value in isolation on held-out data.

07

What this means for your business

If you’re building a chatbot, a knowledge system, or any AI that answers questions from your documents, the accuracy of the underlying pipeline determines how much human oversight you need.

At 72%, more than one answer in four is unreliable, and because you can’t tell in advance which ones, someone ends up checking most of them. At 86.8% F1 and 88.1% correctness, the error rate drops by more than half, from roughly 28% to roughly 12–13%. That’s not an incremental improvement: it’s the difference between a system that creates work and one that eliminates it.

Our pipeline, the same one that scored 86.8% on the hardest academic benchmark for multi-hop reasoning, is what powers every knowledge system we deploy for clients. It works on finance documents, legal contracts, medical records, technical documentation. No retraining needed.

And here’s what that means in practice: your company’s data will very likely produce even higher scores than the benchmark numbers suggest. HotpotQA deliberately surrounds the supporting passages with distractors: passages that share the same named entities and topics as the correct answer but lead the model in the wrong direction. That kind of concentrated adversarial noise doesn’t exist in your document store. Your data might be extensive, unstructured, or inconsistently formatted, but it isn’t trying to trick the AI.

The result: the gap in accuracy between HotpotQA and real enterprise deployments has consistently favoured the production environment. F1, correctness, and faithfulness all tend to be meaningfully higher when the system runs on your actual data.

The cost-vs-accuracy framing. Human annotators score F1 91.4% on this benchmark, but each question takes a skilled researcher 2–5 minutes, reading 500–1,500 words across multiple passages, identifying which pieces connect, and writing a precise answer. With real business documents, 30-page contracts, dense financial reports, multi-section medical records, the reading time per query is significantly higher. Our system handles the same reasoning in seconds, and gets within 5 points of human accuracy. High-stakes answers still benefit from a human check, but the bulk of review work that doesn’t need one is eliminated.

See it on your data

See what 86.8% means for your business.

The same architecture that beat every published HotpotQA result is what runs inside the systems we deploy. Run the free analyser to see exactly where it applies, or talk to the team about your knowledge retrieval needs.

