TradePoint.io

Building a Stable Fable 5 Traces Workflow in Colab: Parsing Tool Calls, Auditing Data, and Training Baselines

June 28, 2026
in AI & Technology
Reading Time: 6 mins read

rprint(Panel.fit("[bold]Baseline 1: Predict output_type from context using pure Python Naive Bayes[/bold]"))
model_artifacts = {}
classifier_df = df.dropna(subset=["output_type"]).copy()
classifier_df = classifier_df[
   classifier_df["output_type"].astype(str).str.len() > 0
].copy()
# Train only when at least two classes and 30 labeled rows are available.
if classifier_df["output_type"].nunique() >= 2 and len(classifier_df) >= 30:
   X_text = (
       classifier_df["context"]
       .fillna("")
       .astype(str)
       .map(lambda text: text[:12000])
       .tolist()
   )
   y = classifier_df["output_type"].astype(str).tolist()
   train_indices, test_indices = stratified_train_test_indices(y, test_size=0.2, seed=SEED)
   X_train = [X_text[i] for i in train_indices]
   y_train = [y[i] for i in train_indices]
   X_test = [X_text[i] for i in test_indices]
   y_test = [y[i] for i in test_indices]
   output_type_classifier = PureMultinomialNB(
       max_features=20000,
       min_df=2,
       alpha=1.0,
   )
   output_type_classifier.fit(X_train, y_train)
   predictions = output_type_classifier.predict(X_test)
   output_type_metrics, output_report_df = evaluate_predictions(y_test, predictions)
   output_matrix_df = confusion_matrix_df(y_test, predictions)
   output_type_metrics["train_rows"] = len(X_train)
   output_type_metrics["test_rows"] = len(X_test)
   output_type_metrics["vocab_size"] = len(output_type_classifier.vocab)
   rprint("[bold]Output type classifier report:[/bold]")
   display(output_report_df)
   display(output_matrix_df)
   output_report_df.to_csv(OUT_DIR / "output_type_classifier_report.csv", index=False)
   output_matrix_df.to_csv(OUT_DIR / "output_type_confusion_matrix.csv")
   top_token_records = []
   for label in output_type_classifier.labels:
       for token, margin in output_type_classifier.top_tokens_for_class(label, n=25):
           top_token_records.append(
               {
                   "label": label,
                   "token": token,
                   "score_margin": margin,
               }
           )
   pd.DataFrame(top_token_records).to_csv(
       OUT_DIR / "output_type_top_tokens.csv",
       index=False,
   )
   with open(
       OUT_DIR / "output_type_classifier_metrics.json",
       "w",
       encoding="utf-8",
   ) as file:
       json.dump(output_type_metrics, file, ensure_ascii=False, indent=2)
   model_artifacts["output_type_classifier_metrics"] = str(
       OUT_DIR / "output_type_classifier_metrics.json"
   )
   model_artifacts["output_type_classifier_report"] = str(
       OUT_DIR / "output_type_classifier_report.csv"
   )
   model_artifacts["output_type_confusion_matrix"] = str(
       OUT_DIR / "output_type_confusion_matrix.csv"
   )
   model_artifacts["output_type_top_tokens"] = str(
       OUT_DIR / "output_type_top_tokens.csv"
   )
else:
   rprint(
       "[yellow]Skipping output_type classifier because there are too few "
       "classes or rows.[/yellow]"
   )
   output_type_metrics = {}
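
Both baselines in this tutorial call `PureMultinomialNB`, which is defined earlier in the notebook and is not reproduced in this excerpt. As a rough illustration of the technique, here is a minimal multinomial Naive Bayes sketch with Laplace smoothing. The class name `TinyMultinomialNB`, the regex tokenizer, and the definition of the score margin (log-likelihood in the class minus the best log-likelihood in any other class) are assumptions made for this sketch and may differ from the tutorial's implementation.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z_][a-z0-9_]+")  # assumed tokenizer, not the tutorial's


def tokenize(text):
    """Lowercase the text and extract identifier-like tokens (2+ characters)."""
    return TOKEN_RE.findall(str(text).lower())


class TinyMultinomialNB:
    """Illustrative multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, max_features=20000, min_df=2, alpha=1.0):
        self.max_features = max_features
        self.min_df = min_df
        self.alpha = alpha

    def fit(self, texts, labels):
        docs = [tokenize(text) for text in texts]
        # Document frequency: how many documents contain each token.
        doc_freq = Counter(token for doc in docs for token in set(doc))
        kept = [(tok, freq) for tok, freq in doc_freq.items() if freq >= self.min_df]
        kept.sort(key=lambda item: (-item[1], item[0]))
        self.vocab = {tok: i for i, (tok, _) in enumerate(kept[: self.max_features])}
        self.labels = sorted(set(labels))
        class_docs = Counter(labels)
        token_counts = {label: Counter() for label in self.labels}
        for doc, label in zip(docs, labels):
            token_counts[label].update(tok for tok in doc if tok in self.vocab)
        self.log_prior = {
            label: math.log(class_docs[label] / len(labels)) for label in self.labels
        }
        self.log_likelihood = {}
        for label in self.labels:
            total = sum(token_counts[label].values()) + self.alpha * len(self.vocab)
            self.log_likelihood[label] = {
                tok: math.log((token_counts[label][tok] + self.alpha) / total)
                for tok in self.vocab
            }
        return self

    def predict(self, texts):
        predictions = []
        for text in texts:
            counts = Counter(tok for tok in tokenize(text) if tok in self.vocab)
            scores = {
                label: self.log_prior[label]
                + sum(n * self.log_likelihood[label][tok] for tok, n in counts.items())
                for label in self.labels
            }
            predictions.append(max(self.labels, key=lambda label: scores[label]))
        return predictions

    def top_tokens_for_class(self, label, n=25):
        """Rank tokens by margin over the other classes (needs at least 2 classes)."""
        others = [other for other in self.labels if other != label]
        margins = []
        for tok in self.vocab:
            best_other = max(self.log_likelihood[other][tok] for other in others)
            margins.append((tok, self.log_likelihood[label][tok] - best_other))
        margins.sort(key=lambda item: (-item[1], item[0]))
        return margins[:n]
```

With `min_df=2`, tokens that occur in only one training document are dropped before counting, which keeps one-off file names and hashes from dominating the vocabulary.
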
rprint(Panel.fit("[bold]Baseline 2: Predict tool_name from context using pure Python Naive Bayes[/bold]"))
tool_classifier_df = df[
   df["output_type"].eq("tool_use")
   & df["tool_name"].fillna("").astype(str).str.len().gt(0)
].copy()
if len(tool_classifier_df) >= 50 and tool_classifier_df["tool_name"].nunique() >= 2:
   # Keep the 12 most frequent tools as labels; every other tool is bucketed as "__OTHER__".
   top_tools = tool_classifier_df["tool_name"].value_counts().head(12).index.tolist()
   tool_classifier_df["tool_label"] = tool_classifier_df["tool_name"].where(
       tool_classifier_df["tool_name"].isin(top_tools),
       "__OTHER__",
   )
   y_tool = tool_classifier_df["tool_label"].astype(str).tolist()
   X_tool_text = (
       tool_classifier_df["context"]
       .fillna("")
       .astype(str)
       .map(lambda text: text[:12000])
       .tolist()
   )
   if len(set(y_tool)) >= 2:
       train_indices, test_indices = stratified_train_test_indices(y_tool, test_size=0.2, seed=SEED)
       X_train = [X_tool_text[i] for i in train_indices]
       y_train = [y_tool[i] for i in train_indices]
       X_test = [X_tool_text[i] for i in test_indices]
       y_test = [y_tool[i] for i in test_indices]
       tool_classifier = PureMultinomialNB(
           max_features=20000,
           min_df=2,
           alpha=1.0,
       )
       tool_classifier.fit(X_train, y_train)
       tool_predictions = tool_classifier.predict(X_test)
       tool_metrics, tool_report_df = evaluate_predictions(y_test, tool_predictions)
       tool_matrix_df = confusion_matrix_df(y_test, tool_predictions)
       tool_metrics["train_rows"] = len(X_train)
       tool_metrics["test_rows"] = len(X_test)
       tool_metrics["vocab_size"] = len(tool_classifier.vocab)
       rprint("[bold]Tool classifier report:[/bold]")
       display(tool_report_df)
       display(tool_matrix_df)
       tool_report_df.to_csv(OUT_DIR / "tool_name_classifier_report.csv", index=False)
       tool_matrix_df.to_csv(OUT_DIR / "tool_name_confusion_matrix.csv")
       top_tool_token_records = []
       for label in tool_classifier.labels:
           for token, margin in tool_classifier.top_tokens_for_class(label, n=25):
               top_tool_token_records.append(
                   {
                       "label": label,
                       "token": token,
                       "score_margin": margin,
                   }
               )
       pd.DataFrame(top_tool_token_records).to_csv(
           OUT_DIR / "tool_name_top_tokens.csv",
           index=False,
       )
       with open(
           OUT_DIR / "tool_name_classifier_metrics.json",
           "w",
           encoding="utf-8",
       ) as file:
           json.dump(tool_metrics, file, ensure_ascii=False, indent=2)
       model_artifacts["tool_name_classifier_metrics"] = str(
           OUT_DIR / "tool_name_classifier_metrics.json"
       )
       model_artifacts["tool_name_classifier_report"] = str(
           OUT_DIR / "tool_name_classifier_report.csv"
       )
       model_artifacts["tool_name_confusion_matrix"] = str(
           OUT_DIR / "tool_name_confusion_matrix.csv"
       )
       model_artifacts["tool_name_top_tokens"] = str(
           OUT_DIR / "tool_name_top_tokens.csv"
       )
   else:
       rprint("[yellow]Skipping tool classifier because labels collapsed to one class.[/yellow]")
       tool_metrics = {}
else:
   rprint(
       "[yellow]Skipping tool classifier because there are too few tool-use "
       "rows or tool classes.[/yellow]"
   )
   tool_metrics = {}
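
The two baselines split their rows with `stratified_train_test_indices(labels, test_size=0.2, seed=SEED)`, a helper defined earlier in the notebook and not shown here. A stratified split can be sketched as follows. The function name `stratified_split_indices`, the rounding rule, and the choice to keep single-example classes entirely in the training set are assumptions for illustration, not the tutorial's code.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # A class with one example stays in train; otherwise hold out a share
        # of the class while always leaving at least one row for training.
        n_test = int(round(len(indices) * test_size)) if len(indices) > 1 else 0
        n_test = min(n_test, len(indices) - 1)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

Splitting per class keeps the label proportions of the test set close to those of the full data, which matters when a few tools account for most tool-use rows.
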
rprint(Panel.fit("[bold]Building simple keyword search helper[/bold]"))
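def search_rows(keyword, limit=5, search_cols=("context", "cot", "completion", "text_payload")):
   """Return previews of up to `limit` rows whose text columns contain `keyword` (case-insensitive, literal match)."""
   keyword = str(keyword).lower()
   mask = pd.Series(False, index=df.index)
   for column in search_cols:
       if column not in df.columns:
           continue  # tolerate frames that lack an optional text column
       mask = mask | (
           df[column]
           .fillna("")
           .astype(str)
           .str.lower()
           .str.contains(keyword, regex=False)  # literal substring match, no regex escaping needed
       )
   hits = df[mask].head(limit)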
   results = []
   for _, row in hits.iterrows():
       results.append(
           {
               "uid": row.get("uid"),
               "session": row.get("session"),
               "output_type": row.get("output_type"),
               "tool_name": row.get("tool_name"),
               "context_preview": preview_text(row.get("context"), 400),
               "payload_preview": preview_text(row.get("text_payload"), 400),
           }
       )
   return results
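
`search_rows` shortens long fields with `preview_text`, which is defined earlier in the notebook and not included in this excerpt. A plausible stand-in, assuming the helper collapses whitespace and truncates to a character budget, might look like this. The name `preview` and the `...` suffix are choices made for this sketch.

```python
import math


def preview(value, max_chars=400):
    """Collapse runs of whitespace and truncate long text for display."""
    # Treat None and float NaN (pandas' missing value) as empty text.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    text = " ".join(str(value).split())
    if len(text) <= max_chars:
        return text
    return text[: max(max_chars - 3, 0)].rstrip() + "..."
```
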
example_queries = [
   "Bash",
   "Write",
   "browser",
   "test",
   "README",
]
search_demo = {
   query: search_rows(query, limit=2)
   for query in example_queries
}
with open(
   OUT_DIR / "keyword_search_demo.json",
   "w",
   encoding="utf-8",
) as file:
   json.dump(search_demo, file, ensure_ascii=False, indent=2)
rprint("[bold]Example keyword search results:[/bold]")
rprint(safe_json_dumps(search_demo, max_chars=5000))
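
The call to `safe_json_dumps(search_demo, max_chars=5000)` suggests a helper that serializes to JSON and caps the printed length. Its definition is not part of this excerpt, so the following is only a guess at the behavior, using the hypothetical name `dumps_truncated`.

```python
import json


def dumps_truncated(obj, max_chars=5000):
    """Pretty-print obj as JSON, using str() for unknown types, then cap the length."""
    text = json.dumps(obj, ensure_ascii=False, indent=2, default=str)
    if len(text) <= max_chars:
        return text
    # The truncated string is for display only; it is no longer valid JSON.
    return text[:max_chars] + "\n... [truncated]"
```
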
summary = {
   "dataset_id": DATASET_ID,
   "flat_jsonl_filename": FLAT_JSONL_FILENAME,
   "output_directory": str(OUT_DIR),
   "repo_file_summary": file_summary,
   "rows": int(len(df)),
   "columns": list(df.columns),
   "output_type_distribution": (
       df["output_type"]
       .fillna("missing")
       .value_counts()
       .to_dict()
   ),
   "top_tools": (
       df.loc[df["output_type"].eq("tool_use"), "tool_name"]
       .replace("", "unknown")
       .value_counts()
       .head(20)
       .to_dict()
   ),
   "top_source_roots": (
       df["source_root"]
       .fillna("unknown")
       .value_counts()
       .head(20)
       .to_dict()
   ),
   "length_summary": {
       column: {
           "mean": float(df[column].mean()),
           "median": float(df[column].median()),
           "p90": float(df[column].quantile(0.90)),
           "p95": float(df[column].quantile(0.95)),
           "max": int(df[column].max()),
       }
       for column in [
           "context_chars",
           "cot_chars",
           "completion_chars",
           "text_payload_chars",
       ]
   },
   "possible_secret_rows": int(df["possible_secret_anywhere"].sum()),
   "plots": plot_paths,
   "model_artifacts": model_artifacts,
   "safe_exports": {
       "train": str(OUT_DIR / "fable5_no_cot_chat_train.jsonl"),
       "validation": str(OUT_DIR / "fable5_no_cot_chat_validation.jsonl"),
       "test": str(OUT_DIR / "fable5_no_cot_chat_test.jsonl"),
   },
   "analysis_files": {
       "csv": str(OUT_DIR / "fable5_analysis_index.csv"),
       "pickle": str(OUT_DIR / "fable5_analysis_index.pkl"),
       "keyword_search_demo": str(OUT_DIR / "keyword_search_demo.json"),
   },
}
with open(
   OUT_DIR / "analysis_summary.json",
   "w",
   encoding="utf-8",
) as file:
   json.dump(clean_for_json(summary), file, ensure_ascii=False, indent=2, default=str)
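
The summary passes through `clean_for_json` before it is written. That helper is defined earlier in the notebook; the sketch below shows one way such a cleaner could work, under the assumption that its job is to turn NaN values, paths, tuples, and NumPy or pandas scalars into plain JSON types. The name `to_json_safe` is hypothetical.

```python
import math
from pathlib import Path


def to_json_safe(value):
    """Recursively convert a value into something json.dump can serialize."""
    if isinstance(value, dict):
        return {str(key): to_json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [to_json_safe(item) for item in value]
    if isinstance(value, Path):
        return str(value)
    # NumPy and pandas scalars expose .item(), which returns a plain Python value.
    if not isinstance(value, (str, bytes)) and callable(getattr(value, "item", None)):
        try:
            value = value.item()
        except (TypeError, ValueError):
            return str(value)
    if isinstance(value, float) and not math.isfinite(value):
        return None  # NaN and infinities are not valid JSON
    return value
```

By default `json.dump` writes NaN as the bare token `NaN`, which strict JSON parsers reject, so mapping it to `null` keeps `analysis_summary.json` portable.
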
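# chr(96) is the backtick; building the Markdown code fence this way avoids
# writing a literal triple backtick inside the notebook source.
FENCE = chr(96) * 3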
report_md = (
   "# Fable 5 Traces Advanced Tutorial Report\n\n"
   "## Dataset\n\n"
   f"- Dataset: `{DATASET_ID}`\n"
   f"- Flat JSONL: `{FLAT_JSONL_FILENAME}`\n"
   f"- Rows loaded: `{len(df):,}`\n"
   f"- Unique source sessions: `{df['session'].nunique(dropna=True):,}`\n"
   f"- Unique models: `{df['model'].nunique(dropna=True):,}`\n\n"
   "## Important safety note\n\n"
   "This tutorial treats the dataset as agent telemetry. It previews and analyzes commands, "
   "tool calls, file edits, and transcript text, but it never executes commands found inside "
   "the traces.\n\n"
   f"Potential secret-like patterns detected: `{int(df['possible_secret_anywhere'].sum()):,}` rows.\n"
   "Exports redact common API-key/token-like patterns.\n\n"
   "## Output type distribution\n\n"
   f"{FENCE}json\n"
   f"{json.dumps(clean_for_json(summary['output_type_distribution']), indent=2, ensure_ascii=False)}\n"
   f"{FENCE}\n\n"
   "## Top tools\n\n"
   f"{FENCE}json\n"
   f"{json.dumps(clean_for_json(summary['top_tools']), indent=2, ensure_ascii=False)}\n"
   f"{FENCE}\n\n"
   "## Saved files\n\n"
   "- `analysis_summary.json`\n"
   "- `fable5_analysis_index.csv`\n"
   "- `fable5_analysis_index.pkl`\n"
   "- `fable5_no_cot_chat_train.jsonl`\n"
   "- `fable5_no_cot_chat_validation.jsonl`\n"
   "- `fable5_no_cot_chat_test.jsonl`\n"
   "- plot PNG files\n"
   "- baseline classifier metrics, when enough rows/classes are available\n\n"
   "## Recommended next steps\n\n"
   "1. Inspect `fable5_no_cot_chat_train.jsonl` before any fine-tuning.\n"
   "2. Keep the dataset license in mind before model training or redistribution.\n"
   "3. Avoid training directly on raw terminal outputs without additional privacy and safety filtering.\n"
   "4. Start with the no-CoT chat export unless your research explicitly requires reasoning-trace supervision.\n"
)
with open(
   OUT_DIR / "REPORT.md",
   "w",
   encoding="utf-8",
) as file:
   file.write(report_md)
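
The report states that exports redact common API-key and token-like patterns, and the `possible_secret_anywhere` column counts rows that match them. The detection code itself is not shown in this excerpt. As a hedged sketch of the idea, the patterns below are illustrative examples chosen for this note, not the tutorial's actual list, and a real pipeline would need a broader and better-tested set.

```python
import re

# Illustrative patterns only (assumptions for this sketch).
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                 # AWS-style access key ID
    re.compile(r"ghp_[A-Za-z0-9]{36}"),              # GitHub-style personal access token
    re.compile(r"sk-[A-Za-z0-9_-]{20,}"),            # "sk-" prefixed API key
    re.compile(r"bearer\s+[A-Za-z0-9._~+/-]{20,}=*", re.IGNORECASE),  # HTTP bearer token
]


def has_possible_secret(text):
    """Return True if any secret-like pattern appears in the text."""
    return any(pattern.search(str(text)) for pattern in SECRET_PATTERNS)


def redact_secrets(text, placeholder="[REDACTED]"):
    """Replace every secret-like match with a placeholder."""
    text = str(text)
    for pattern in SECRET_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Pattern matching of this kind catches well-known key formats but misses passwords and custom tokens, which is one reason the report recommends inspecting the export by hand before any fine-tuning.
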
rprint(
   Panel.fit(
       f"[bold green]Tutorial complete.[/bold green]\n\n"
       f"Artifacts saved in:\n{OUT_DIR}\n\n"
       f"Key files:\n"
       f"- {OUT_DIR / 'REPORT.md'}\n"
       f"- {OUT_DIR / 'analysis_summary.json'}\n"
       f"- {OUT_DIR / 'fable5_no_cot_chat_train.jsonl'}\n"
       f"- {OUT_DIR / 'fable5_analysis_index.csv'}",
       title="Done",
   )
)
display(
   pd.DataFrame(
       {
           "artifact": [
               "Report",
               "Summary JSON",
               "No-CoT train export",
               "No-CoT validation export",
               "No-CoT test export",
               "Analysis CSV",
               "Analysis pickle",
               "Keyword search demo",
           ],
           "path": [
               str(OUT_DIR / "REPORT.md"),
               str(OUT_DIR / "analysis_summary.json"),
               str(OUT_DIR / "fable5_no_cot_chat_train.jsonl"),
               str(OUT_DIR / "fable5_no_cot_chat_validation.jsonl"),
               str(OUT_DIR / "fable5_no_cot_chat_test.jsonl"),
               str(OUT_DIR / "fable5_analysis_index.csv"),
               str(OUT_DIR / "fable5_analysis_index.pkl"),
               str(OUT_DIR / "keyword_search_demo.json"),
           ],
       }
   )
)
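
The report's first recommended next step is to inspect `fable5_no_cot_chat_train.jsonl` before any fine-tuning. One lightweight way to do that, sketched here with a hypothetical helper named `peek_jsonl`, is to read only the first few records. The record schema is not shown in this excerpt, so the usage example prints field names rather than assuming any.

```python
import json
from itertools import islice


def peek_jsonl(path, n=3):
    """Parse the first n lines of a JSON Lines file without reading the rest."""
    records = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in islice(handle, n):
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

In the notebook this could be used as `for record in peek_jsonl(OUT_DIR / "fable5_no_cot_chat_train.jsonl"): print(sorted(record))` to list each record's keys before reading any content.
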

