
A Coding Implementation for Parsing, Analyzing, Visualizing, and Fine-Tuning on Agent Reasoning Traces Using the lambda/hermes-agent-reasoning-traces Dataset

May 2, 2026
in AI & Technology
Reading Time: 8 mins read

In this tutorial, we explore the lambda/hermes-agent-reasoning-traces dataset to understand how agent-based models think, use tools, and generate responses across multi-turn conversations. We start by loading and inspecting the dataset, examining its structure, categories, and conversational format to get a clear picture of the available information. We then build simple parsers that extract key components such as reasoning traces, tool calls, and tool responses, separating the agent's internal thinking from its external actions. Next, we analyze patterns such as tool-usage frequency, conversation length, and error rates to better understand agent behavior, and we create visualizations that make these trends easier to see. Finally, we prepare the dataset for training by converting it into a model-friendly format suitable for tasks like supervised fine-tuning.

!pip -q install -U datasets pandas matplotlib seaborn transformers accelerate trl


import json, re, random, textwrap
from collections import Counter, defaultdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset, concatenate_datasets


random.seed(0)


CONFIG = "kimi"
ds = load_dataset("lambda/hermes-agent-reasoning-traces", CONFIG, split="train")
print(ds)
print("Config:", CONFIG, "| Fields:", ds.column_names)
print("Categories:", sorted(set(ds["category"])))


COMPARE_BOTH = False
if COMPARE_BOTH:
   ds_kimi = load_dataset("lambda/hermes-agent-reasoning-traces", "kimi", split="train")
   ds_glm  = load_dataset("lambda/hermes-agent-reasoning-traces", "glm-5.1", split="train")
   ds_kimi = ds_kimi.add_column("source", ["kimi"] * len(ds_kimi))
   ds_glm  = ds_glm.add_column("source", ["glm-5.1"] * len(ds_glm))
   ds = concatenate_datasets([ds_kimi, ds_glm]).shuffle(seed=0)
   print("Combined:", ds, "→ counts:", Counter(ds["source"]))


sample = ds[0]
print("\n=== Sample 0 ===")
print("id        :", sample["id"])
print("category  :", sample["category"], "/", sample["subcategory"])
print("task      :", sample["task"])
print("turns     :", len(sample["conversations"]))
print("system[0] :", sample["conversations"][0]["value"][:220], "...\n")

We install all required libraries and import the necessary modules to set up our environment. We then load the lambda/hermes-agent-reasoning-traces dataset and inspect its structure, fields, and categories. We also optionally combine multiple dataset configurations and examine a sample to understand the conversational format.
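Before building parsers, it can help to see how examples are distributed across task types. The sketch below is an optional check that relies only on the category, subcategory, and conversations fields we just printed, plus the pandas and Counter imports from the first cell:

# Optional: quick distribution check over the fields inspected above.
overview = pd.DataFrame({
    "category": ds["category"],
    "subcategory": ds["subcategory"],
    "turns": [len(c) for c in ds["conversations"]],
})
print(overview.groupby("category")["turns"].describe().round(1))
print("Top subcategories:", Counter(ds["subcategory"]).most_common(5))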


THINK_RE     = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
TOOL_RESP_RE = re.compile(r"<tool_response>\s*(.*?)\s*</tool_response>", re.DOTALL)


def parse_assistant(value: str) -> dict:
   thoughts = [t.strip() for t in THINK_RE.findall(value)]
   calls = []
   for raw in TOOL_CALL_RE.findall(value):
       try:
           calls.append(json.loads(raw))
       except json.JSONDecodeError:
            # Keep a placeholder entry so malformed calls are still counted.
            calls.append({"name": "unparsed", "arguments": {}})
   final = TOOL_CALL_RE.sub("", THINK_RE.sub("", value)).strip()
   return {"thoughts": thoughts, "tool_calls": calls, "final": final}


def parse_tool(value: str):
   m = TOOL_RESP_RE.search(value)
   if not m:
       return {"raw": value}
   body = m.group(1)
   try:
       return json.loads(body)
   except json.JSONDecodeError:
       return {"raw": body}


first_gpt = next(t for t in sample["conversations"] if t["from"] == "gpt")
p = parse_assistant(first_gpt["value"])
print("Thought preview :", (p["thoughts"][0][:160] + "...") if p["thoughts"] else "(none)")
print("Tool calls       :", [(c.get("name"), list(c.get("arguments", {}).keys())) for c in p["tool_calls"]])

We define regex-based parsers to extract reasoning traces, tool calls, and tool responses from the dataset. We process assistant messages to separate thoughts, actions, and final outputs in a structured way. We then test the parser on a sample conversation to verify that the extraction works correctly.
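As an extra sanity check, we can run the parser on a hand-built message written in the same tag format the regexes target. The example below is synthetic, not drawn from the dataset:

# Synthetic assistant message in the <think>/<tool_call> tag format.
demo = (
    "<think>I should look up the weather first.</think>\n"
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>\n'
    "It is sunny in Paris."
)
p_demo = parse_assistant(demo)
assert p_demo["thoughts"] == ["I should look up the weather first."]
assert p_demo["tool_calls"][0]["name"] == "get_weather"
assert p_demo["final"] == "It is sunny in Paris."
print("Parser sanity check passed:", p_demo["tool_calls"])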

N = 3000
sub = ds.select(range(min(N, len(ds))))


tool_calls         = Counter()
parallel_widths    = Counter()
thoughts_per_turn  = []
calls_per_traj     = []
errors_per_traj    = []
turns_per_traj     = []
cat_counts         = Counter()


for ex in sub:
   cat_counts[ex["category"]] += 1
   n_calls = n_err = 0
   turns_per_traj.append(len(ex["conversations"]))
   for t in ex["conversations"]:
       if t["from"] == "gpt":
           p = parse_assistant(t["value"])
           thoughts_per_turn.append(len(p["thoughts"]))
           if p["tool_calls"]:
               parallel_widths[len(p["tool_calls"])] += 1
               for c in p["tool_calls"]:
                   tool_calls[c.get("name", "")] += 1
               n_calls += len(p["tool_calls"])
       elif t["from"] == "tool":
           r = parse_tool(t["value"])
           blob = json.dumps(r).lower()
           if "error" in blob or '"exit_code": 1' in blob or "traceback" in blob:
               n_err += 1
   calls_per_traj.append(n_calls)
   errors_per_traj.append(n_err)


print(f"\nScanned {len(sub)} trajectories")
print(f"Avg turns/traj      : {np.mean(turns_per_traj):.1f}")
print(f"Avg tool calls/traj : {np.mean(calls_per_traj):.1f}")
print(f"% with >=1 error    : {100*np.mean([e>0 for e in errors_per_traj]):.1f}%")
print(f"% parallel turns    : {100*sum(v for k,v in parallel_widths.items() if k>1)/max(1,sum(parallel_widths.values())):.1f}%")
print("Top 10 tools        :", tool_calls.most_common(10))


fig, axes = plt.subplots(2, 2, figsize=(13, 9))


top = tool_calls.most_common(15)
axes[0,0].barh([t for t,_ in top][::-1], [c for _,c in top][::-1], color="teal")
axes[0,0].set_title("Top 15 tools by call volume")
axes[0,0].set_xlabel("calls")


ks = sorted(parallel_widths)
axes[0,1].bar([str(k) for k in ks], [parallel_widths[k] for k in ks], color="coral")
axes[0,1].set_title("Tool-calls per assistant turn (parallel width)")
axes[0,1].set_xlabel("# tool calls in one turn"); axes[0,1].set_ylabel("count")
axes[0,1].set_yscale("log")


axes[1,0].hist(turns_per_traj, bins=40, color="steelblue")
axes[1,0].set_title("Conversation length"); axes[1,0].set_xlabel("turns")


cats, vals = zip(*cat_counts.most_common())
axes[1,1].pie(vals, labels=cats, autopct="%1.0f%%", startangle=90)
axes[1,1].set_title("Category distribution")


plt.tight_layout(); plt.show()

We perform dataset-wide analytics to measure tool usage, conversation lengths, and error patterns. We aggregate statistics across multiple samples to understand overall agent behavior. We also create visualizations to highlight trends such as tool frequency, parallel calls, and category distribution.
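To slice the same statistics by task type, a small aggregation over the per-trajectory lists collected above is enough. A minimal sketch reusing sub, calls_per_traj, and errors_per_traj:

# Per-category tool-call and error rates from the lists built above.
cat_stats = defaultdict(lambda: {"traj": 0, "calls": 0, "errors": 0})
for ex, n_calls, n_err in zip(sub, calls_per_traj, errors_per_traj):
    s = cat_stats[ex["category"]]
    s["traj"] += 1
    s["calls"] += n_calls
    s["errors"] += n_err
stats_df = pd.DataFrame(cat_stats).T
stats_df["calls/traj"] = (stats_df["calls"] / stats_df["traj"]).round(2)
stats_df["errors/traj"] = (stats_df["errors"] / stats_df["traj"]).round(2)
print(stats_df.sort_values("errors/traj", ascending=False))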

def render_trace(ex, max_chars=350):
   print(f"\n{'='*72}\nTASK [{ex['category']} / {ex['subcategory']}]: {ex['task']}\n{'='*72}")
   for t in ex["conversations"]:
       role = t["from"]
       if role == "system":
           continue
       if role == "human":
           print(f"\n[USER]\n{textwrap.shorten(t['value'], 600)}")
       elif role == "gpt":
           p = parse_assistant(t["value"])
           for th in p["thoughts"]:
               print(f"\n[THINK]\n{textwrap.shorten(th, max_chars)}")
           for c in p["tool_calls"]:
               args = json.dumps(c.get("arguments", {}))[:200]
               print(f"[CALL] {c.get('name')}({args})")
           if p["final"]:
               print(f"\n[ANSWER]\n{textwrap.shorten(p['final'], max_chars)}")
       elif role == "tool":
           print(f"[TOOL_RESPONSE] {textwrap.shorten(t['value'], 220)}")
   print("="*72)


idx = int(np.argmin(np.abs(np.array(turns_per_traj) - 10)))
render_trace(sub[idx])


def get_tool_schemas(ex):
   try:
       return json.loads(ex["tools"])
   except (KeyError, TypeError, json.JSONDecodeError):
       return []


schemas = get_tool_schemas(sample)
print(f"\nSample 0 has {len(schemas)} tools available")
for s in schemas[:3]:
   fn = s.get("function", {})
   print(" -", fn.get("name"), "—", (fn.get("description") or "")[:80])


ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}


def to_openai_messages(conv):
   return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conv]


example_msgs = to_openai_messages(sample["conversations"])
print("\nFirst 2 OpenAI messages:")
for m in example_msgs[:2]:
   print(" ", m["role"], "→", m["content"][:120].replace("\n", " "), "...")

We build utilities to render full conversation traces in a readable format for deeper inspection. We also extract tool schemas and convert the dataset into OpenAI-style message format for compatibility with training pipelines. This helps us better understand both the structure of tools and how conversations can be standardized.
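To see how the converted messages and the tool schemas fit together, we can assemble them into a chat-completions-style request payload. This only illustrates the shape; the model name is a hypothetical placeholder and nothing is sent to any API:

# Illustrative payload only; "your-model-here" is a placeholder.
request_payload = {
    "model": "your-model-here",
    "messages": to_openai_messages(sample["conversations"]),
    "tools": get_tool_schemas(sample),  # schemas appear to follow the {"type": "function", ...} convention
}
print(json.dumps(request_payload, default=str)[:300], "...")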

from transformers import AutoTokenizer
TOK_ID = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(TOK_ID)


def build_masked(conv, tokenizer, max_len=2048):
   msgs = to_openai_messages(conv)
   for m in msgs:
       if m["role"] == "tool":
           m["role"] = "user"
           m["content"] = "[TOOL OUTPUT]\n" + m["content"]
   input_ids, labels = [], []
   for m in msgs:
       text = tokenizer.apply_chat_template([m], tokenize=False, add_generation_prompt=False)
       ids = tokenizer.encode(text, add_special_tokens=False)
       input_ids.extend(ids)
       labels.extend(ids if m["role"] == "assistant" else [-100] * len(ids))
   return input_ids[:max_len], labels[:max_len]


ids, lbls = build_masked(sample["conversations"], tok)
trainable = sum(1 for x in lbls if x != -100)
print(f"\nTokenized example: {len(ids)} tokens, {trainable} trainable ({100*trainable/len(ids):.1f}%)")


think_lens, call_lens, ans_lens = [], [], []
for ex in sub.select(range(min(500, len(sub)))):
   for t in ex["conversations"]:
       if t["from"] != "gpt": continue
       p = parse_assistant(t["value"])
       for th in p["thoughts"]: think_lens.append(len(th))
       for c in p["tool_calls"]: call_lens.append(len(json.dumps(c)))
       if p["final"]: ans_lens.append(len(p["final"]))


plt.figure(figsize=(10,4))
plt.hist([think_lens, call_lens, ans_lens], bins=40, log=True,
         label=["think", "tool call", "final answer"], stacked=False)
plt.legend(); plt.xlabel("characters"); plt.title("Length distributions (log y)")
plt.tight_layout(); plt.show()


class TraceReplayer:
   def __init__(self, ex):
       self.ex = ex
       self.steps = []
       pending = None
       for t in ex["conversations"]:
           if t["from"] == "gpt":
               if pending: self.steps.append(pending)
               pending = {"think": parse_assistant(t["value"]), "responses": []}
           elif t["from"] == "tool" and pending:
               pending["responses"].append(parse_tool(t["value"]))
       if pending: self.steps.append(pending)
   def __len__(self): return len(self.steps)
   def play(self, i):
       s = self.steps[i]
       print(f"\n── Step {i+1}/{len(self)} ──")
       for th in s["think"]["thoughts"]:
           print(f"💭 {textwrap.shorten(th, 280)}")
       for c in s["think"]["tool_calls"]:
           print(f"⚙️  {c.get('name')}({json.dumps(c.get('arguments', {}))[:140]})")
       for r in s["responses"]:
           print(f"📥 {textwrap.shorten(json.dumps(r), 200)}")
       if s["think"]["final"]:
           print(f"💬 {textwrap.shorten(s['think']['final'], 200)}")


rp = TraceReplayer(sample)
for i in range(min(3, len(rp))):
   rp.play(i)


TRAIN = False
if TRAIN:
   import torch
   from transformers import AutoModelForCausalLM
   from trl import SFTTrainer, SFTConfig


   train_subset = ds.select(range(200))


   def to_text(batch):
       msgs = to_openai_messages(batch["conversations"])
       for m in msgs:
           if m["role"] == "tool":
               m["role"] = "user"; m["content"] = "[TOOL]\n" + m["content"]
       batch["text"] = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
       return batch


   train_subset = train_subset.map(to_text)


   model = AutoModelForCausalLM.from_pretrained(
       TOK_ID,
       torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
       device_map="auto" if torch.cuda.is_available() else None,
   )


   cfg = SFTConfig(
       output_dir="hermes-sft-demo",
       per_device_train_batch_size=1,
       gradient_accumulation_steps=4,
       max_steps=20,
       learning_rate=2e-5,
       logging_steps=2,
       max_seq_length=1024,
       dataset_text_field="text",
       report_to="none",
       fp16=torch.cuda.is_available(),
   )
   SFTTrainer(model=model, args=cfg, train_dataset=train_subset, processing_class=tok).train()
   print("Fine-tune demo finished.")


print("\n✅ Tutorial complete. You now have parsers, analytics, plots, a replayer, "
     "tokenized + label-masked SFT examples, and an optional training hook.")

We tokenize the conversations and apply label masking so only assistant responses contribute to training. We analyze the length distributions of reasoning, tool calls, and answers to gain further insights. We also implement a trace replayer to step through agent behavior and optionally run a small fine-tuning loop.
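A quick way to confirm the label masking behaves as intended is to decode the trainable and masked token spans separately; the trainable span should read as assistant text only. A minimal check using the ids and lbls produced earlier:

# Decode only the tokens that carry labels (assistant turns) vs. the rest.
trainable_ids = [i for i, l in zip(ids, lbls) if l != -100]
masked_ids    = [i for i, l in zip(ids, lbls) if l == -100]
print("Trainable preview:", tok.decode(trainable_ids)[:250], "...")
print("Masked preview   :", tok.decode(masked_ids)[:150], "...")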

In conclusion, we developed a structured workflow to parse, analyze, and work with agent reasoning traces. We broke conversations down into meaningful components, examined how agents reason step by step, and measured how they interact with tools during problem solving. The visualizations and analytics surfaced common patterns and behaviors across the dataset, and we converted the data into a format suitable for training language models, including tokenization and label masking for assistant responses. Altogether, this process provides a strong foundation for studying, evaluating, and improving tool-using AI systems in a practical, scalable way.



