TradePoint.io

This AI Paper from CMU and Apple Unveils WRAP: A Game-Changer for Pre-training Language Models with Synthetic Data

February 5, 2024
in AI & Technology

Large Language Models (LLMs) have attracted enormous attention in the Artificial Intelligence (AI) community in recent months. These models have demonstrated strong capabilities in tasks such as text summarization, question answering, code completion, and content generation.

LLMs are typically trained on web-scraped data of uneven quality. This data is often noisy, unstructured, and poorly phrased. Existing scaling laws indicate that as model size increases, compute and data quantity should increase proportionally, and following them is becoming a challenge.

There are two main limitations. First, pre-training carries a significant computational cost and takes a long time. Second, high-quality data on the Internet is becoming scarce. In recent research, a team from Apple and Carnegie Mellon University addresses these issues by introducing Web Rephrase Augmented Pre-training (WRAP).

WRAP uses an off-the-shelf, instruction-tuned LLM to paraphrase web documents into particular styles, such as the tone of Wikipedia or a question-answer format. The main goal of WRAP is to improve LLM pre-training by training on a combination of real and synthetically rephrased data.
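The rephrase-and-mix idea described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the prompt wording in `STYLE_PROMPTS`, the `generate` callable standing in for an instruction-tuned LLM, and the simple shuffle-based mixing are all assumptions made for the example.

```python
import random
from typing import Callable, List

# Hypothetical style instructions. The paper's actual prompts are not
# reproduced here; these strings are placeholders for illustration.
STYLE_PROMPTS = {
    "wikipedia": "Rephrase the following text in a clear, encyclopedic style:",
    "qa": "Convert the following text into a question-and-answer format:",
}


def build_prompt(document: str, style: str) -> str:
    """Prepend the style instruction to the raw web document."""
    return f"{STYLE_PROMPTS[style]}\n\n{document}"


def rephrase_corpus(
    docs: List[str], generate: Callable[[str], str], style: str
) -> List[str]:
    """Rephrase every document with a caller-supplied LLM function.

    `generate` is an assumed interface: any function mapping a prompt
    string to the model's text output.
    """
    return [generate(build_prompt(doc, style)) for doc in docs]


def mix_real_and_synthetic(
    real: List[str], synthetic: List[str], seed: int = 0
) -> List[str]:
    """Combine real and rephrased documents and shuffle them together."""
    mixed = list(real) + list(synthetic)
    random.Random(seed).shuffle(mixed)
    return mixed


def echo_model(prompt: str) -> str:
    """Stand-in for a real LLM so the sketch runs without a model."""
    return "[rephrased] " + prompt.split("\n\n", 1)[1]


web_docs = ["cheap flights!!! click here", "how 2 train ur model fast"]
rephrased = rephrase_corpus(web_docs, echo_model, style="wikipedia")
training_set = mix_real_and_synthetic(web_docs, rephrased)
```

In practice `echo_model` would be replaced by a call to an actual instruction-tuned model. Keeping the original documents in the mix means the model still sees raw web text alongside the cleaner rephrasings.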

The primary features of WRAP are as follows:

  1. Pre-training Efficiency: Applying WRAP to the noisy C4 dataset speeds up pre-training by roughly three times. This efficiency is critical for reducing the high cost and time commitment usually associated with LLM training.
  2. Enhancement of Model Performance: WRAP yields a better model within the same computational budget. Across different subsets of the Pile, a large-scale dataset used for training and evaluating LLMs, it reduces perplexity by more than 10% on average. It also improves zero-shot question-answering accuracy by over 2% across 13 tasks.
  3. Rephrasing Web Documents: WRAP uses a medium-sized LLM to paraphrase documents from the web into several styles. This differs from generating entirely new data because it improves existing content while preserving the quality and diversity of the original information.
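Perplexity, the language-modeling metric behind the improvement of more than 10% on the Pile, is the exponential of the average negative log-likelihood a model assigns to each token, so lower is better. The sketch below shows how it is computed and what a relative reduction means. The token probabilities are invented for illustration and are not results from the paper.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp(mean negative log-likelihood) over the tokens.

    `token_log_probs` holds the natural-log probability the model
    assigned to each observed token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional drop in perplexity going from baseline to improved."""
    return (baseline - improved) / baseline


# Illustrative numbers only: a baseline model that gives every token
# probability 0.10, and an improved model that gives every token 0.125.
baseline_ppl = perplexity([math.log(0.10)] * 4)
improved_ppl = perplexity([math.log(0.125)] * 4)
reduction = relative_reduction(baseline_ppl, improved_ppl)
```

With these made-up probabilities, perplexity falls from about 10 to about 8, a relative reduction of about 20%. A reduction of more than 10% therefore means the model is measurably less "surprised" by held-out text.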

The synthetic data produced by WRAP has two main benefits. First, it covers a range of styles that reflects the diversity of language seen in downstream applications, which better prepares the LLM for a wider variety of real-world inputs. Second, the rephrased synthetic data is of higher quality than the raw web-scraped data. Its language is more ordered and cohesive, which promotes more efficient model learning.

In conclusion, WRAP is a significant advance in LLM pre-training. By using higher-quality, stylistically diverse synthetic data, WRAP not only speeds up training but also improves the overall performance of LLMs. Given the abundance of low-quality web data and the resource-intensive nature of conventional LLM training, this approach offers a promising way forward.


Check out the Paper. All credit for this research goes to the researchers of this project.


Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.




©2020- TradePoint.io - All rights reserved!

Tradepoint.io, being just a publishing and technology platform, is not a registered broker-dealer or investment adviser. So we do not provide investment advice. Rather, brokerage services are provided to clients of Tradepoint.io by independent SEC-registered broker-dealers and members of FINRA/SIPC. Every form of investing carries some risk and past performance is not a guarantee of future results. “Tradepoint.io“, “Instant Investing” and “My Trading Tools” are registered trademarks of Apperbuild, LLC.

This website is operated by Apperbuild, LLC. We have no link to any brokerage firm and we do not provide investment advice. Every information and resource we provide is solely for the education of our readers. © 2020 Apperbuild, LLC. All rights reserved.
