The introduction and evolution of generative AI have been so sudden and intense that it is difficult to fully appreciate just how much this technology has changed our lives.
Zoom out to just three years ago. Yes, AI was becoming more pervasive, at least in theory. More people knew some of the things it could do, though even then there were massive misunderstandings about its capabilities. The technology was somehow given both too little and too much credit for what it could actually achieve. Still, the average person could point to at least one or two areas where AI was at work, performing highly specialized tasks fairly well in highly controlled environments. Anything beyond that was either still in a research lab or simply didn't exist.
Compare that to today. With no skill other than the ability to write a sentence or ask a question, the world is at our fingertips. We can generate images, music, and even movies that are truly unique and amazing, with the capability to disrupt entire industries. We can supercharge our search process, asking a simple question that, if framed right, can generate pages of custom content good enough to pass as the work of a university-trained scholar … or an average third grader, if we specify the point of view. While these capabilities have somehow become commonplace in just a year or two, they were considered absolutely impossible only a few short years ago. The field of generative AI existed, but it had not taken off by any means.
Today, many people have experimented with generative AI tools such as ChatGPT or Midjourney. Others have already incorporated them into their daily lives. The speed at which these tools have evolved is blistering to the point of being almost alarming. And given the advances of the last six months, we are no doubt going to be blown away, over and over, in the next few years.
One specific development within generative AI has been the performance of Retrieval-Augmented Generation (RAG) systems and their ability to think through especially complex queries. The introduction of the FRAMES dataset, explained in detail in a separate article on how the evaluation works, shows both where the state of the art is now and where it is headed. Even since the introduction of FRAMES in late 2024, a number of platforms have already broken records in their ability to reason through difficult, complex queries.
Let's dive into what FRAMES is meant to evaluate and how well different generative AI models are performing on it. We can see how decentralized and open-source platforms (notably Sentient Chat) are not only holding their ground but also giving users a clear glimpse of the astounding reasoning some AI models are capable of achieving.
The FRAMES dataset and its evaluation process focus on 824 "multi-hop" questions designed to require inference, logical connect-the-dots, the retrieval of key information from several different sources, and the ability to piece it all together into an answer. Each question needs between two and 15 documents to answer correctly, and the set purposefully includes constraints, mathematical calculations and deductions, and time-based logic.

In other words, these questions are extremely difficult, and they represent very real research chores a human might undertake on the internet. We deal with these challenges all the time: searching for scattered key pieces of information in a sea of internet sources, piecing together facts from different sites, deriving new information through calculation and deduction, and consolidating it all into a correct answer to the question.
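To make the "multi-hop" idea concrete, here is a minimal sketch of the kind of iterative retrieval loop such questions demand. The tiny in-memory corpus and the keyword-overlap retriever are stand-ins I invented for illustration; a real system would use a web-scale index and a learned retriever.

```python
# Hypothetical sketch of a multi-hop retrieval loop: each sub-question
# (hop) pulls in more evidence, which accumulates until the chain of
# facts needed for the final answer is complete.

CORPUS = {
    "doc_power": "Kanye West song Power samples 21st Century Schizoid Man",
    "doc_crimson": "21st Century Schizoid Man was performed by King Crimson",
    "doc_fripp": "King Crimson was led by Robert Fripp who was born in 1946",
}

def retrieve(query: str) -> list[str]:
    """Toy retriever: return documents sharing any word with the query."""
    words = set(query.lower().split())
    return [text for text in CORPUS.values()
            if words & set(text.lower().split())]

def gather_evidence(sub_questions: list[str]) -> list[str]:
    """Collect evidence hop by hop; a real system would also reason over it."""
    evidence = []
    for hop in sub_questions:
        for doc in retrieve(hop):
            if doc not in evidence:
                evidence.append(doc)
    return evidence

hops = [
    "Which song does Power sample?",
    "Which group performed 21st Century Schizoid Man?",
    "Who led King Crimson and when was he born?",
]
evidence = gather_evidence(hops)
# evidence now holds the three documents needed to chain to the answer
```

The point of the sketch is the loop structure: no single retrieval step surfaces all three documents, which is exactly why single-step systems score so much lower on FRAMES than multi-step ones.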
What researchers found when the dataset was first released and tested is that the top GenAI models were only somewhat accurate (about 40%) when they had to answer in a single step, but could reach 73% accuracy when allowed to collect all the documents necessary to answer the question. Yes, 73% might not seem like a revolution. But once you understand exactly what has to be answered, the number becomes much more impressive.
For example, one particular question reads: "What year was the bandleader of the group who originally performed the song sampled in Kanye West's song Power born?" How would a human go about solving this? They might see that they need to gather several pieces of information, such as the lyrics to the Kanye West song "Power," and then look through those lyrics to identify the point in the song that samples another song. We as humans could probably listen to the song (even if unfamiliar with it) and tell when a different song is sampled.
But think about it: what would a GenAI have to accomplish to detect a song other than the original while "listening" to it? This is where a basic question becomes an excellent test of truly intelligent AI. And even if we managed to find the song, listen to it, and identify the sampled lyrics, that is just Step 1. We still need to find the name of the sampled song, the group that performed it, the leader of that group, and then the year that person was born.
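The steps above form a chain in which each answer becomes the subject of the next lookup. The sketch below models that chain explicitly; the `FACTS` table is an illustrative stand-in for what a model would extract from retrieved sources, populated with the actual chain for this question (the "Power" sample is King Crimson's "21st Century Schizoid Man," whose bandleader Robert Fripp was born in 1946).

```python
# Illustrative hop chain for the FRAMES "Power" question: each lookup's
# answer feeds the next (entity, relation) lookup until the final hop.

FACTS = {
    ("Power", "samples"): "21st Century Schizoid Man",
    ("21st Century Schizoid Man", "performed by"): "King Crimson",
    ("King Crimson", "bandleader"): "Robert Fripp",
    ("Robert Fripp", "born"): "1946",
}

def resolve_chain(start: str, relations: list[str]) -> str:
    """Follow a chain of (entity, relation) lookups, hop by hop."""
    entity = start
    for relation in relations:
        entity = FACTS[(entity, relation)]
    return entity

year = resolve_chain("Power", ["samples", "performed by",
                               "bandleader", "born"])
# year == "1946"
```

Every intermediate entity here must be resolved correctly for the final answer to be right, which is why a single wrong hop collapses the whole chain and why these questions are such a stern test.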
FRAMES shows that answering realistic questions takes an enormous amount of reasoning. Two things stand out here.
First, the ability of decentralized GenAI models not just to compete but potentially to dominate the results is incredible. A growing number of companies are using the decentralized approach to scale their processing abilities while ensuring that a large community owns the software, rather than a centralized black box that will not share its advances. Companies like Perplexity and Sentient are leading this trend, each with formidable models performing above the accuracy records set when FRAMES was first released.
The second element is that a smaller number of these AI models are not only decentralized but also open-source. Sentient Chat, for instance, is both, and early tests show just how complex its reasoning can be, thanks in part to that invaluable open-source access. It answers the FRAMES question above using much the same thought process a human would, with its reasoning details available for review. Perhaps even more interesting, the platform is structured as a collection of models that can be fine-tuned toward a given perspective and level of performance, even though fine-tuning in some GenAI models results in diminished accuracy. In the case of Sentient Chat, many different models have been developed. A recent one called "Dobby 8B," for instance, not only outperforms on the FRAMES benchmark but also exhibits a distinct pro-crypto and pro-freedom attitude, which shapes the model's perspective as it processes pieces of information and develops an answer.
The key to all these astounding innovations is the rapid pace of progress that brought us here. We have to acknowledge that as fast as this technology has evolved, it is only going to evolve faster in the near future. We will be able to watch, especially with decentralized and open-source GenAI models, for that crucial threshold where the system's intelligence starts to exceed more and more of our own, and to consider what that means for the future.