Accurate assessment of Large Language Models is best done with complex tasks involving long input sequences. In tasks such as repository analysis and information retrieval, the input can exceed 200,000 tokens, and LLMs have evolved in response to accommodate context lengths of up to 1 million tokens. While examining the performance of capable LLMs on long-context tasks, researchers noticed a recurring problem: models struggle to process information located in the middle of the input, a phenomenon commonly called the “Lost in the Middle” effect. Earlier research on LLM assessment focused on absolute positional bias, presuming that the relevant information is concentrated at a specific location in the input. Realistically, however, the relevant information is scattered across multiple chunks of text, which motivates the study of relative positional bias, where performance is examined with respect to the relative distance between those chunks. Relative position also introduces a bias in LLMs and degrades their performance. This article explains recent research that systematically investigates positional biases in large language models.
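To make the distinction concrete, here is a minimal Python sketch (not from the paper; the helper names are illustrative) that measures the two quantities for a toy context: the absolute depth of each relevant chunk, and the relative distance between consecutive relevant chunks.

```python
# Minimal sketch distinguishing the two notions of position.
# `context` is a list of text chunks; `relevant_ids` marks the chunks a query needs.

def absolute_positions(context, relevant_ids):
    """Depth of each relevant chunk in the context, normalized to [0, 1]."""
    n = len(context)
    return [i / (n - 1) for i in relevant_ids]

def relative_distances(relevant_ids):
    """Gaps (in chunks) between consecutive relevant chunks."""
    ids = sorted(relevant_ids)
    return [b - a for a, b in zip(ids, ids[1:])]

# Example: 1,000 chunks with relevant information at chunks 100, 500, and 900.
chunks = [f"chunk-{i}" for i in range(1000)]
print(absolute_positions(chunks, [100, 500, 900]))  # how deep each relevant chunk sits
print(relative_distances([100, 500, 900]))          # how far apart the relevant chunks are
```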
Researchers from Tsinghua University and ModelBest Inc. introduced LongPiBench, a comprehensive benchmark designed to isolate and assess positional biases in LLMs. LongPiBench evaluates models with respect to both absolute and relative positions of relevant information, across three tasks of varying difficulty, four context lengths (32k, 64k, 128k, and 256k tokens), and 16 distinct levels of absolute and relative position. The benchmark is constructed in two steps: multiple seed examples are manually annotated, and each is then augmented to vary the positions of the relevant information. The authors assessed multiple LLMs on this dataset, which helped them reveal significant shortcomings of the latest models.
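The augmentation step can be pictured with a short, hypothetical sketch: starting from one annotated seed example, the same relevant items are re-inserted at different positions among distractors, so only their absolute placement and relative spacing change. The function and example data below are illustrative assumptions, not the actual LongPiBench pipeline.

```python
def place_relevant(relevant_items, distractors, positions):
    """Build one context by inserting the relevant items at the requested
    chunk indices and filling the remaining slots with distractors.
    (Illustrative only; the real benchmark construction may differ.)"""
    total = len(relevant_items) + len(distractors)
    context = [None] * total
    for item, pos in zip(relevant_items, positions):
        context[pos] = item
    pool = iter(distractors)
    for i in range(total):
        if context[i] is None:
            context[i] = next(pool)
    return context

# From one annotated seed example, generate variants that differ only in
# where (and how far apart) the relevant entries appear.
seed_relevant = ["row: alice, 42", "row: bob, 17"]
seed_distractors = [f"row: filler-{i}" for i in range(98)]
variants = [
    place_relevant(seed_relevant, seed_distractors, positions)
    for positions in [(0, 1), (0, 50), (0, 99), (49, 50)]
]
```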
LongPiBench was developed by labeling seed examples from three tasks: Table SQL, Timeline Reordering, and Equation Solving. This was followed by augmentation, i.e., rearrangement of the relevant information. For each task, the context was decomposed into elements based on task-specific units: table entries for Table SQL, event entries for Timeline Reordering, and equation lines for Equation Solving. Each element was then annotated for relevance, with queries formed around the relevant items and irrelevant ones added as distractors. The authors also implemented quality-control checks to ensure the integrity of the data.
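A rough data model for one benchmark instance might look like the following; the field names and example values are assumptions for illustration, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Element:
    text: str        # one task-specific unit: a table entry, event entry, or equation line
    relevant: bool   # whether the query depends on this element

@dataclass
class Instance:
    task: str             # "table_sql", "timeline_reordering", or "equation_solving"
    query: str            # question formed around the relevant elements
    elements: list        # ordered context the model sees
    answer: str           # gold answer used for scoring

example = Instance(
    task="table_sql",
    query="SELECT age FROM people WHERE name = 'alice';",
    elements=[
        Element("people: name=bob, age=17", relevant=False),
        Element("people: name=alice, age=42", relevant=True),
        Element("people: name=carol, age=29", relevant=False),
    ],
    answer="42",
)
```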
The research team evaluated 11 well-known LLMs on LongPiBench. They found that newer models are somewhat immune to the “Lost in the Middle” effect, but they still exhibit biases related to the spacing of relevant information. Six of the 11 LLMs were open-source models, and the remaining five were commercial. The Llama-3.1-Instruct series, GPT-4o-mini, Claude-3-Haiku, and Gemini-1.5-Flash were among the models assessed. During preliminary tests, the authors found that Timeline Reordering and Equation Solving were so challenging that even the top-performing models reached at most 20% accuracy, so further analysis was performed on the Table SQL task. With respect to absolute position, commercial and larger open-source models showed strong robustness against the “lost in the middle” effect. With respect to relative position, however, all models exhibited biases: their performance decreased sharply as the relative distance between relevant items varied. This relative positional bias is severe enough to reduce the recall rate by 30%, even in the most straightforward retrieval tasks, which highlights the need to continue mitigating positional biases in long-text models.
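One simple way to surface a relative-position bias during evaluation is to bucket instances by the gap between relevant elements and average recall within each bucket. The sketch below is an illustrative analysis under that assumption, not the authors' exact protocol, and the example data is fabricated purely to show the shape of the output.

```python
from collections import defaultdict

def recall(predicted, gold):
    """Fraction of gold items that appear in the model's prediction."""
    return sum(g in predicted for g in gold) / len(gold)

def recall_by_relative_distance(results):
    """Group results by the gap between relevant elements and average
    recall per gap, exposing any relative-position bias.
    `results` is a list of (gap, predicted_text, gold_items) tuples."""
    buckets = defaultdict(list)
    for gap, predicted, gold in results:
        buckets[gap].append(recall(predicted, gold))
    return {gap: sum(v) / len(v) for gap, v in sorted(buckets.items())}

# Toy example: recall is high when relevant entries are adjacent (gap 1)
# and may drop as they spread apart, which is the bias LongPiBench measures.
toy_results = [
    (1, "alice is 42 and bob is 17", ["42", "17"]),
    (50, "alice is 42", ["42", "17"]),
    (99, "not sure", ["42", "17"]),
]
print(recall_by_relative_distance(toy_results))  # {1: 1.0, 50: 0.5, 99: 0.0}
```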
LongPiBench highlights the importance of relative positional biases in modern LLMs and shows that they remain unresolved. It is essential to analyze this bias across more tasks to understand and address the challenge, because, if left unresolved, it may substantially undermine the effectiveness of long-text language models in practical applications.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Adeeba Alam Ansari is currently pursuing her Dual Degree at the Indian Institute of Technology (IIT) Kharagpur, earning a B.Tech in Industrial Engineering and an M.Tech in Financial Engineering. With a keen interest in machine learning and artificial intelligence, she is an avid reader and an inquisitive individual. Adeeba firmly believes in the power of technology to empower society and promote welfare through innovative solutions driven by empathy and a deep understanding of real-world challenges.