
A Coding Hands-On on FineWeb for Streaming, Filtering, Deduplication, Tokenization, and Large-Scale Web Corpus Analytics

June 14, 2026

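import matplotlib.pyplot as plt
import pandas as pd
from urllib.parse import urlparse

# Extract the host from each URL. Strip only a leading "www." so hosts that
# merely contain "www." elsewhere are left intact (str.removeprefix needs Python 3.9+).
df["domain"] = df["url"].apply(lambda u: urlparse(u).netloc.removeprefix("www.") if isinstance(u, str) else "?")
top_domains = df["domain"].value_counts().head(15)
print("\n--- Top 15 domains in sample ---")
print(top_domains)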
# 2x2 dashboard: token counts, language scores, compression ratio, top domains.
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Clip long documents at 4000 tokens so the tail does not flatten the histogram.
axes[0, 0].hist(df["token_count"].clip(upper=4000), bins=50, color="#7b2d26")
axes[0, 0].set_title("Token count per document (gpt2)")
axes[0, 0].set_xlabel("tokens"); axes[0, 0].set_ylabel("docs")
# Language-score distribution, with the 0.65 FineWeb cutoff marked.
axes[0, 1].hist(df["language_score"], bins=40, color="#2d5d7b")
axes[0, 1].axvline(0.65, color="red", ls="--", label="FineWeb cutoff 0.65")
axes[0, 1].set_title("fastText English language score")
axes[0, 1].set_xlabel("score"); axes[0, 1].legend()
# Characters per token, clipped at 8.
axes[1, 0].hist(df["chars_per_token"].clip(upper=8), bins=40, color="#3f7b2d")
axes[1, 0].set_title("Characters per token (compression)")
axes[1, 0].set_xlabel("chars / token")
# Reverse the order so the most frequent domain is drawn at the top of the bar chart.
top_domains.iloc[::-1].plot(kind="barh", ax=axes[1, 1], color="#7b5d2d")
axes[1, 1].set_title("Top domains")
plt.tight_layout()
plt.show()
print("\n" + "=" * 70)
print("SUMMARY")
print("=" * 70)
print(f"Docs streamed          : {len(df):,}")
print(f"Total gpt2 tokens       : {df['token_count'].sum():,}")
print(f"Median tokens/doc       : {int(df['token_count'].median())}")
print(f"Unique domains          : {df['domain'].nunique():,}")
print(f"Mean language_score     : {df['language_score'].mean():.3f}")
print(f"Near-duplicate pairs    : {len(dup_pairs)}")
print(f"Docs flagged by filters : {(pd.Series(results) != 'kept').sum()} / {len(results)}")
print("\nNext steps:")
print("  • Swap name='sample-10BT' for a real crawl, e.g. name='CC-MAIN-2024-10'")
print("  • Raise N_DOCS for stronger statistics")
print("  • Use the full datatrove pipeline to reproduce FineWeb end-to-end")
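
The summary above reports a count of near-duplicate pairs, but the code that builds `dup_pairs` is not part of this excerpt. As a rough illustration only, the sketch below shows one common way such pairs can be found: exact Jaccard similarity over word shingles. The helper names (`shingles`, `jaccard`, `near_duplicate_pairs`), the shingle size of 3, and the 0.5 threshold are all assumptions made for this example, not the tutorial's actual settings, and a real pipeline would more likely use MinHash to avoid comparing every pair.

```python
from itertools import combinations


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle made of all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(texts, threshold: float = 0.5, k: int = 3):
    """Return index pairs (i, j), i < j, whose shingle sets meet the threshold.

    This compares all pairs, so it is O(n^2) and only suitable for small samples.
    """
    sets = [shingles(t, k) for t in texts]
    return [
        (i, j)
        for i, j in combinations(range(len(sets)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    ]


# Hypothetical toy documents: the first two differ only in their last word.
docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river shore",
    "a completely different sentence about tokenizers and web crawls",
]
dup_pairs_demo = near_duplicate_pairs(docs, threshold=0.5)
print(dup_pairs_demo)
```

On a sample of a few thousand documents the all-pairs comparison is still workable, but it grows quadratically, which is why approximate methods are usually preferred once `N_DOCS` is raised.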
