Crawlee for Python: Build a Web Crawling Pipeline with Robots Handling, Link Graphs, and RAG Chunk Export

June 21, 2026
in AI & Technology

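# Standard-library and third-party imports used by the code in this section.
# Helpers such as normalize_text, flatten_products, build_link_graph,
# start_local_server, the run_*_crawl coroutines, and the OUTPUT_DIR,
# SCREENSHOT_DIR, and SITE_DIR paths are assumed to be defined in earlier
# parts of the tutorial that are not shown here.
import asyncio
import hashlib
import json
import re

import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd


def make_rag_chunks(rows, max_chars=700):
   """Pack each row's sentences into chunks of at most max_chars characters.

   A single sentence longer than max_chars is kept whole, so that chunk
   can exceed the limit.
   """

   def build_chunk(row, text):
       # chunk_id is a stable 12-character hash of the URL plus the chunk text.
       return {
           "chunk_id": hashlib.sha1(
               ((row.get("url") or "") + text).encode()
           ).hexdigest()[:12],
           "url": row.get("url"),
           "source": row.get("source"),
           "page_type": row.get("page_type"),
           "title": row.get("title") or row.get("name"),
           "text": text,
       }

   chunks = []
   for row in rows:
       # Use the first non-empty text field available on the row.
       text = (
           row.get("text_preview")
           or row.get("rendered_text")
           or row.get("description")
           or ""
       )
       text = normalize_text(text)
       if not text:
           continue
       # Split after sentence-ending punctuation followed by whitespace.
       sentences = re.split(r"(?<=[.!?])\s+", text)
       current = ""
       for sentence in sentences:
           if len(current) + len(sentence) + 1 <= max_chars:
               current = (current + " " + sentence).strip()
           else:
               # Flush the full chunk and start a new one with this sentence.
               if current:
                   chunks.append(build_chunk(row, current))
               current = sentence
       if current:
           chunks.append(build_chunk(row, current))
   return chunks


def analyze_outputs(base_url, bs4_rows, parsel_rows, playwright_rows):
   """Merge crawl results, write all export files, and return a run summary."""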
   all_rows = bs4_rows + parsel_rows + playwright_rows
   products = flatten_products(all_rows)
   crawl_df = pd.DataFrame(all_rows)
   product_df = pd.DataFrame(products)
   if not product_df.empty:
       product_df["price"] = pd.to_numeric(product_df["price"], errors="coerce")
       product_df["stock"] = pd.to_numeric(product_df["stock"], errors="coerce")
       product_df["rating"] = pd.to_numeric(product_df["rating"], errors="coerce")
       product_df["inventory_value"] = product_df["price"] * product_df["stock"]
   graph = build_link_graph(base_url, bs4_rows)
   graph_path = OUTPUT_DIR / "site_link_graph.graphml"
   if graph.number_of_nodes() > 0:
       nx.write_graphml(graph, graph_path)
   chunks = make_rag_chunks(all_rows)
   rag_path = OUTPUT_DIR / "rag_chunks.jsonl"
   with rag_path.open("w", encoding="utf-8") as f:
       for chunk in chunks:
           f.write(json.dumps(chunk, ensure_ascii=False) + "\n")
   crawl_json_path = OUTPUT_DIR / "combined_crawl_results.json"
   crawl_json_path.write_text(
       json.dumps(all_rows, ensure_ascii=False, indent=2),
       encoding="utf-8",
   )
   product_csv_path = OUTPUT_DIR / "normalized_product_catalog.csv"
   if not product_df.empty:
       product_df.to_csv(product_csv_path, index=False)
   price_plot_path = OUTPUT_DIR / "product_price_chart.png"
   if not product_df.empty and product_df["price"].notna().any():
       plot_df = product_df.dropna(subset=["price"]).copy()
       # Cast to str so the label concatenation also works for non-string SKUs.
       plot_df["label"] = plot_df["sku"].fillna("unknown").astype(str) + "\n" + plot_df["source"].fillna("").astype(str)
       ax = plot_df.plot(
           kind="bar",
           x="label",
           y="price",
           legend=False,
           figsize=(11, 5),
           title="Extracted Product Prices by Source",
       )
       ax.set_xlabel("Product / extraction source")
       ax.set_ylabel("Price")
       plt.xticks(rotation=35, ha="right")
       plt.tight_layout()
       plt.savefig(price_plot_path, dpi=160)
       plt.show()
   graph_stats = {
       "nodes": graph.number_of_nodes(),
       "edges": graph.number_of_edges(),
       "weakly_connected_components": (
           nx.number_weakly_connected_components(graph)
           if graph.number_of_nodes()
           else 0
       ),
   }
   if graph.number_of_nodes() > 0:
       in_degrees = dict(graph.in_degree())
       out_degrees = dict(graph.out_degree())
       graph_stats["top_in_degree"] = sorted(
           in_degrees.items(),
           key=lambda x: x[1],
           reverse=True,
       )[:5]
       graph_stats["top_out_degree"] = sorted(
           out_degrees.items(),
           key=lambda x: x[1],
           reverse=True,
       )[:5]
   summary = {
       "base_url": base_url,
       "rows_total": len(all_rows),
       "beautifulsoup_rows": len(bs4_rows),
       "parsel_rows": len(parsel_rows),
       "playwright_rows": len(playwright_rows),
       "products_total": len(product_df),
       "rag_chunks_total": len(chunks),
       "graph": graph_stats,
       "outputs": {
           "beautifulsoup_json": str(OUTPUT_DIR / "beautifulsoup_crawl.json"),
           "beautifulsoup_csv": str(OUTPUT_DIR / "beautifulsoup_crawl.csv"),
           "parsel_json": str(OUTPUT_DIR / "parsel_products.json"),
           "parsel_csv": str(OUTPUT_DIR / "parsel_products.csv"),
           "playwright_json": str(OUTPUT_DIR / "playwright_dynamic.json"),
           "playwright_csv": str(OUTPUT_DIR / "playwright_dynamic.csv"),
           "combined_json": str(crawl_json_path),
           "product_csv": str(product_csv_path) if product_csv_path.exists() else None,
           "rag_jsonl": str(rag_path),
           "graphml": str(graph_path) if graph_path.exists() else None,
           "price_plot": str(price_plot_path) if price_plot_path.exists() else None,
           "screenshots_dir": str(SCREENSHOT_DIR),
       },
   }
   summary_path = OUTPUT_DIR / "run_summary.md"
   summary_path.write_text(
       "# Crawlee Python Advanced Tutorial Run Summary\n\n"
       f"- Local demo site: `{base_url}`\n"
       f"- Total extracted rows: `{summary['rows_total']}`\n"
       f"- BeautifulSoup rows: `{summary['beautifulsoup_rows']}`\n"
       f"- Parsel rows: `{summary['parsel_rows']}`\n"
       f"- Playwright rows: `{summary['playwright_rows']}`\n"
       f"- Normalized products: `{summary['products_total']}`\n"
       f"- RAG chunks: `{summary['rag_chunks_total']}`\n"
       f"- Link graph nodes: `{graph_stats['nodes']}`\n"
       f"- Link graph edges: `{graph_stats['edges']}`\n\n"
       "## Output files\n\n"
       + "\n".join(f"- `{k}`: `{v}`" for k, v in summary["outputs"].items())
       + "\n",
       encoding="utf-8",
   )
   print("\n=== 4) Analysis summary ===")
   print(json.dumps(summary, indent=2, ensure_ascii=False))
   try:
       from IPython.display import display, Markdown, Image as IPImage
       display(Markdown("## Crawlee crawl preview"))
       if not crawl_df.empty:
           preview_cols = [
               col for col in ["source", "page_type", "title", "url"]
               if col in crawl_df.columns
           ]
           display(crawl_df[preview_cols].head(12))
       display(Markdown("## Normalized product catalog"))
       if not product_df.empty:
           display(product_df.head(20))
       if price_plot_path.exists():
           display(Markdown("## Product price chart"))
           display(IPImage(filename=str(price_plot_path)))
       screenshot_path = SCREENSHOT_DIR / "dynamic_catalog_full_page.png"
       if screenshot_path.exists():
           display(Markdown("## Playwright screenshot of JavaScript-rendered page"))
           display(IPImage(filename=str(screenshot_path)))
       display(Markdown(f"## Output directory\n`{OUTPUT_DIR}`"))
   except Exception as exc:
       print("Notebook display skipped:", repr(exc))
   return summary


async def main():
   httpd, base_url = start_local_server(SITE_DIR)
   print(f"\nLocal demo website is running at: {base_url}/index.html")
   try:
       bs4_rows = await run_beautifulsoup_crawl(base_url)
       parsel_rows = await run_parsel_precision_crawl(base_url)
       playwright_rows = await run_playwright_dynamic_crawl(base_url)
       summary = analyze_outputs(base_url, bs4_rows, parsel_rows, playwright_rows)
       return summary
   finally:
       httpd.shutdown()
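       print("\nLocal demo server shut down.")


# asyncio.run() creates and closes its own event loop; get_event_loop() is
# deprecated when no loop is running. Inside a notebook, this assumes
# nest_asyncio.apply() was called earlier so nested loops are allowed.
summary = asyncio.run(main())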
print("\nTutorial complete.")
print(f"All outputs are in: {OUTPUT_DIR}")
print("Key files:")
for file_path in sorted(OUTPUT_DIR.rglob("*")):
   if file_path.is_file():
       print(" -", file_path)
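
The sentence-packing step inside make_rag_chunks can be tried in isolation. The following is a minimal sketch, not part of the tutorial code: pack_sentences and chunk_id are hypothetical helper names, and the sample text and URL are invented for illustration. It uses the same split pattern and the same greedy rule, so a sentence longer than max_chars still ends up as one oversized chunk.

```python
import hashlib
import re


def pack_sentences(text, max_chars=700):
    """Greedily pack whole sentences into chunks of at most max_chars.

    Hypothetical helper mirroring the packing loop in make_rag_chunks.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if len(current) + len(sentence) + 1 <= max_chars:
            current = (current + " " + sentence).strip()
        else:
            # The next sentence does not fit: flush and start a new chunk.
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def chunk_id(url, text):
    """Deterministic 12-character ID derived from the URL and chunk text."""
    return hashlib.sha1((url + text).encode()).hexdigest()[:12]


# Invented sample input; a small max_chars makes the splitting visible.
sample = "Crawlee fetches pages. Parsers extract fields! Are chunks stable? Yes."
pieces = pack_sentences(sample, max_chars=40)
for piece in pieces:
    print(chunk_id("http://127.0.0.1:8000/index.html", piece), piece)
```

Because the ID depends only on the URL and the chunk text, re-running the crawl on unchanged pages yields the same IDs, which makes it easier to deduplicate or upsert chunks downstream.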

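The rag_chunks.jsonl export written by analyze_outputs stores one JSON object per line. A minimal round-trip sketch, assuming records shaped like the ones make_rag_chunks emits, could look like this. The two records and the temporary directory are invented for illustration and stand in for the tutorial's OUTPUT_DIR.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical chunk records; the field names follow make_rag_chunks.
chunks = [
    {
        "chunk_id": "a1b2c3d4e5f6",
        "url": "http://127.0.0.1:8000/a.html",
        "text": "Café menu and prices.",
    },
    {
        "chunk_id": "0f9e8d7c6b5a",
        "url": "http://127.0.0.1:8000/b.html",
        "text": "Shipping policy.",
    },
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "rag_chunks.jsonl"
    # One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
    with path.open("w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps(chunk, ensure_ascii=False) + "\n")
    # Read the file back line by line, skipping any blank lines.
    with path.open(encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), "chunks loaded")
```

Reading line by line means a consumer can stream a large export without loading the whole file into memory.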