Enterprises are spending time and money building out retrieval-augmented generation (RAG) systems. The goal is accurate enterprise AI, but are those systems actually working?
The inability to objectively measure whether RAG systems are actually working is a critical blind spot. One potential answer to that challenge debuts today: the Open RAG Eval open-source framework. The new framework was developed by enterprise RAG platform provider Vectara in collaboration with Professor Jimmy Lin and his research team at the University of Waterloo.
Open RAG Eval transforms the currently subjective ‘this looks better than that’ comparison approach into a rigorous, reproducible evaluation methodology that can measure retrieval accuracy, generation quality and hallucination rates across enterprise RAG deployments.
The framework assesses response quality using two major metric categories: retrieval metrics and generation metrics. It allows organizations to apply this evaluation to any RAG pipeline, whether using Vectara’s platform or custom-built solutions. For technical decision-makers, this means finally having a systematic way to identify exactly which components of their RAG implementations need optimization.
“If you can’t measure it, you can’t improve it,” Jimmy Lin, professor at the University of Waterloo, told VentureBeat in an exclusive interview. “In information retrieval and dense vectors, you could measure lots of things, ndcg [Normalized Discounted Cumulative Gain], precision, recall…but when it came to right answers, we had no way, that’s why we started on this path.”
Why RAG evaluation has become the bottleneck for enterprise AI adoption
Vectara was an early pioneer in the RAG space. The company launched in October 2022, before ChatGPT was a household name. Vectara actually debuted technology it originally referred to as grounded AI back in May 2023, as a way to limit hallucinations, before the RAG acronym was commonly used.
Over the last few months, for many enterprises, RAG implementations have grown increasingly complex and difficult to assess. A key challenge is that organizations are moving beyond simple question-answering to multi-step agentic systems.
“In the agentic world, evaluation is doubly important, because these AI agents tend to be multi-step,” Amr Awadallah, Vectara CEO and cofounder, told VentureBeat. “If you don’t catch hallucination in the first step, then that compounds with the second step, compounds with the third step, and you end up with the wrong action or answer at the end of the pipeline.”
How Open RAG Eval works: Breaking the black box into measurable components
The Open RAG Eval framework approaches evaluation through a nugget-based methodology.
Lin explained that the nugget approach breaks responses down into essential facts, then measures how effectively a system captures those nuggets.
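To make that idea concrete, here is a minimal sketch of nugget-style scoring. It illustrates the concept only and is not Vectara's implementation: real nugget evaluators rely on an LLM judge to decide whether each essential fact is supported, whereas this toy version falls back to substring matching.

```python
# Illustrative only: a toy version of nugget-based scoring, not the
# Open RAG Eval implementation. Given a set of "gold" nuggets (essential
# facts) and a generated answer, score the share of nuggets the answer covers.

def nugget_recall(nuggets: list[str], answer: str) -> float:
    """Return the fraction of essential facts that appear in the answer.

    Real nugget evaluators use an LLM judge to decide whether each nugget
    is supported; simple substring matching stands in for that here.
    """
    if not nuggets:
        return 0.0
    covered = sum(1 for nugget in nuggets if nugget.lower() in answer.lower())
    return covered / len(nuggets)


nuggets = [
    "the policy covers water damage",
    "claims must be filed within 30 days",
]
answer = "The policy covers water damage, but claims must be filed within 30 days."
print(f"Nugget recall: {nugget_recall(nuggets, answer):.2f}")  # 1.00
```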
The framework evaluates RAG systems across four specific metrics:
- Hallucination detection – Measures the degree to which generated content contains fabricated information not supported by source documents.
- Citation – Quantifies how well citations in the response are supported by source documents.
- Auto nugget – Evaluates the presence of essential information nuggets from source documents in generated responses.
- UMBRELA (Unified Method for Benchmarking Retrieval Evaluation with LLM Assessment) – A holistic method for assessing overall retriever performance.
Importantly, the framework evaluates the entire RAG pipeline end-to-end, providing visibility into how embedding models, retrieval systems, chunking strategies, and LLMs interact to produce final outputs.
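As a rough mental model of what “end-to-end” covers, the sketch below bundles those components into a single pipeline configuration that an evaluation run would exercise. The field names and values are illustrative placeholders, not Open RAG Eval’s actual configuration schema.

```python
# Hypothetical description of the knobs an end-to-end RAG evaluation spans.
# Field names and values are placeholders, not Open RAG Eval configuration keys.
from dataclasses import dataclass


@dataclass
class RAGPipelineConfig:
    embedding_model: str   # model used to embed documents and queries
    chunking: str          # e.g. "fixed_token_512" or "semantic"
    retrieval: str         # e.g. "vector" or "hybrid"
    hybrid_lambda: float   # lexical/vector mix weight when retrieval is hybrid
    generator_llm: str     # LLM that drafts the final answer


baseline = RAGPipelineConfig(
    embedding_model="embedder-v1",
    chunking="fixed_token_512",
    retrieval="hybrid",
    hybrid_lambda=0.5,
    generator_llm="llm-a",
)
```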
The technical innovation: Automation through LLMs
What makes Open RAG Eval technically significant is how it uses large language models to automate what was previously a manual, labor-intensive evaluation process.
“The state of the art before we started, was left versus right comparisons,” Lin explained. “So this is, do you like the left one better? Do you like the right one better? Or they’re both good, or they’re both bad? That was sort of one way of doing things.”
Lin noted that the nugget-based evaluation approach itself isn’t new, but its automation through LLMs represents a breakthrough.
The framework uses Python with sophisticated prompt engineering to get LLMs to perform evaluation tasks like identifying nuggets and assessing hallucinations, all wrapped in a structured evaluation pipeline.
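The general pattern looks something like the sketch below: a carefully worded prompt turns the LLM into a judge for each essential fact. This is an illustration of the approach rather than code from the framework, and `call_llm` is a placeholder for whatever LLM client an implementer wires in.

```python
# Illustrative prompt-driven judging, showing the general pattern the article
# describes (an LLM automating nugget and hallucination checks). The
# `call_llm` function is a placeholder for your own LLM client; it is not
# part of Open RAG Eval.

NUGGET_JUDGE_PROMPT = """You are grading a RAG answer.
Source passages:
{passages}

Answer:
{answer}

Essential fact to check: "{nugget}"
Reply with exactly one word: SUPPORTED if the answer states this fact and the
passages back it up, otherwise MISSING."""


def call_llm(prompt: str) -> str:
    """Placeholder: swap in an actual LLM client (hosted API or local model)."""
    raise NotImplementedError


def judge_nuggets(passages: str, answer: str, nuggets: list[str]) -> dict[str, str]:
    """Ask the judge LLM about each essential fact and collect its verdicts."""
    verdicts = {}
    for nugget in nuggets:
        prompt = NUGGET_JUDGE_PROMPT.format(
            passages=passages, answer=answer, nugget=nugget
        )
        verdicts[nugget] = call_llm(prompt).strip().upper()
    return verdicts
```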
Competitive landscape: How Open RAG Eval fits into the evaluation ecosystem
As enterprise use of AI continues to mature, there is a growing number of evaluation frameworks. Just last week, Hugging Face launched Yourbench, which lets companies test models against their own internal data. At the end of January, Galileo launched its Agentic Evaluations technology.
Open RAG Eval is different in that it is strongly focused on the RAG pipeline, not just LLM outputs. The framework also has a strong academic foundation and is built on established information retrieval science rather than ad-hoc methods.
The framework builds on Vectara’s previous contributions to the open-source AI community, including its Hughes Hallucination Evaluation Model (HHEM), which has been downloaded over 3.5 million times on Hugging Face and has become a standard benchmark for hallucination detection.
“We’re not calling it the Vectara eval framework, we’re calling it the Open RAG Eval framework because we really want other companies and other institutions to start helping build this out,” Awadallah emphasized. “We need something like that in the market, for all of us, to make these systems evolve in the right way.”
What Open RAG Eval means in the real world
While Open RAG Eval is still an early-stage effort, Vectara already has multiple users interested in the framework.
Among them is Jeff Hummel, SVP of Product and Technology at real estate firm Anywhere.re. Hummel expects that partnering with Vectara will allow him to streamline his company’s RAG evaluation process.
Hummel noted that scaling his RAG deployment introduced significant challenges around infrastructure complexity, iteration velocity and rising costs.
“Knowing the benchmarks and expectations in terms of performance and accuracy helps our team be predictive in our scaling calculations,” Hummel said. “To be frank, there weren’t a ton of frameworks for setting benchmarks on these attributes; we relied heavily on user feedback, which was sometimes subjective and didn’t always translate to success at scale.”
From measurement to optimization: Practical applications for RAG implementers
For technical decision-makers, Open RAG Eval can help answer crucial questions about RAG deployment and configuration:
- Whether to use fixed token chunking or semantic chunking
- Whether to use hybrid or vector search, and what values to use for lambda in hybrid search
- Which LLM to use and how to optimize RAG prompts
- What thresholds to use for hallucination detection and correction
In practice, organizations can establish baseline scores for their existing RAG systems, make targeted configuration changes, and measure the resulting improvement. This iterative approach replaces guesswork with data-driven optimization.
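A minimal sketch of that measure-change-remeasure loop follows, under the assumption of a generic `evaluate_rag` harness; the function here is a stand-in for running an evaluation suite such as Open RAG Eval, not an actual API call from the framework.

```python
# Minimal sketch of the measure -> change -> re-measure loop described above.
# `evaluate_rag` is a stand-in for running an evaluation harness and returning
# aggregate scores; it is not a real Open RAG Eval API.

def evaluate_rag(config: dict) -> dict[str, float]:
    """Placeholder: run the eval suite for one pipeline configuration."""
    raise NotImplementedError


def compare(baseline: dict, variant: dict) -> None:
    """Print per-metric deltas between a baseline config and a candidate change."""
    base_scores = evaluate_rag(baseline)
    new_scores = evaluate_rag(variant)
    for metric, base in base_scores.items():
        delta = new_scores[metric] - base
        print(f"{metric:20s} {base:.3f} -> {new_scores[metric]:.3f} ({delta:+.3f})")


baseline = {"chunking": "fixed_token_512", "retrieval": "hybrid", "lambda": 0.5}
variant = dict(baseline, chunking="semantic")  # one targeted change at a time
# compare(baseline, variant)  # uncomment once evaluate_rag is wired up
```

Changing one variable at a time, as in the `variant` above, keeps any metric movement attributable to a single configuration choice.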
While this initial release focuses on measurement, the roadmap includes optimization capabilities that could automatically suggest configuration improvements based on evaluation results. Future versions might also incorporate cost metrics to help organizations balance performance against operational expenses.
For enterprises looking to lead in AI adoption, Open RAG Eval means they can implement a scientific approach to evaluation rather than relying on subjective assessments or vendor claims. For those earlier in their AI journey, it provides a structured way to approach evaluation from the beginning, potentially avoiding costly missteps as they build out their RAG infrastructure.