How to Build an Advanced End-to-End Voice AI Agent Using Hugging Face Pipelines?

September 17, 2025
in AI & Technology
Reading Time: 6 mins read

In this tutorial, we build an advanced voice AI agent using Hugging Face’s freely available models, and we keep the entire pipeline simple enough to run smoothly on Google Colab. We combine Whisper for speech recognition, FLAN-T5 for natural language reasoning, and Bark for speech synthesis, all connected through transformers pipelines. By doing this, we avoid heavy dependencies, API keys, or complicated setups, and we focus on showing how we can turn voice input into meaningful conversation and get back natural-sounding voice responses in real time.

!pip -q install "transformers>=4.42.0" accelerate torchaudio sentencepiece gradio soundfile


import os, torch, tempfile, numpy as np
import gradio as gr
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM


DEVICE = 0 if torch.cuda.is_available() else -1


asr = pipeline(
   "automatic-speech-recognition",
   model="openai/whisper-small.en",
   device=DEVICE,
   chunk_length_s=30,
   return_timestamps=False
)


LLM_MODEL = "google/flan-t5-base"
tok = AutoTokenizer.from_pretrained(LLM_MODEL)
llm = AutoModelForSeq2SeqLM.from_pretrained(LLM_MODEL, device_map="auto")


tts = pipeline("text-to-speech", model="suno/bark-small")

We install the necessary libraries and load three Hugging Face pipelines: Whisper for speech-to-text, FLAN-T5 for generating responses, and Bark for text-to-speech. We set the device automatically so that we use the GPU if one is available.

SYSTEM_PROMPT = (
   "You are a helpful, concise voice assistant. "
   "Prefer direct, structured answers. "
   "If the user asks for steps or code, use short bullet points."
)


def format_dialog(history, user_text):
   turns = []
   for u, a in history:
       if u: turns.append(f"User: {u}")
       if a: turns.append(f"Assistant: {a}")
   turns.append(f"User: {user_text}")
   prompt = (
       "Instruction:\n"
       f"{SYSTEM_PROMPT}\n\n"
       "Dialog so far:\n" + "\n".join(turns) + "\n\n"
       "Assistant:"
   )
   return prompt

We define a system prompt that guides our agent to stay concise and structured, and we implement a format_dialog function that takes past conversation history along with the user input and builds a prompt string for the model to generate the assistant’s reply.
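To see exactly what the model receives, we can run the same prompting logic on a one-turn history (the function and prompt are duplicated here so the snippet is self-contained):

```python
# Standalone check of the prompting scheme used by format_dialog above.
SYSTEM_PROMPT = (
    "You are a helpful, concise voice assistant. "
    "Prefer direct, structured answers. "
    "If the user asks for steps or code, use short bullet points."
)

def format_dialog(history, user_text):
    # Flatten (user, assistant) tuples into "User: ..." / "Assistant: ..." turns.
    turns = []
    for u, a in history:
        if u: turns.append(f"User: {u}")
        if a: turns.append(f"Assistant: {a}")
    turns.append(f"User: {user_text}")
    return (
        "Instruction:\n"
        f"{SYSTEM_PROMPT}\n\n"
        "Dialog so far:\n" + "\n".join(turns) + "\n\n"
        "Assistant:"
    )

prompt = format_dialog([("Hi", "Hello! How can I help?")], "What is ASR?")
print(prompt)
```

The prompt always ends with a bare `Assistant:` marker, which cues FLAN-T5 to continue the dialog as the assistant.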

def transcribe(filepath):
   out = asr(filepath)
   text = out["text"].strip()
   return text


def generate_reply(history, user_text, max_new_tokens=256):
   prompt = format_dialog(history, user_text)
   inputs = tok(prompt, return_tensors="pt", truncation=True).to(llm.device)
   with torch.no_grad():
       ids = llm.generate(
           **inputs,
           max_new_tokens=max_new_tokens,
           temperature=0.7,
           do_sample=True,
           top_p=0.9,
           repetition_penalty=1.05,
       )
   reply = tok.decode(ids[0], skip_special_tokens=True).strip()
   return reply


def synthesize_speech(text):
   out = tts(text)
   # Bark returns audio with shape (1, n); flatten to 1-D so Gradio plays it correctly
   audio = np.asarray(out["audio"], dtype=np.float32).squeeze()
   sr = out["sampling_rate"]
   return (sr, audio)

We create three core functions for our voice agent: transcribe converts recorded audio into text using Whisper, generate_reply builds a context-aware response from FLAN-T5, and synthesize_speech turns that response back into spoken audio with Bark.
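Since synthesize_speech hands back a raw (sampling_rate, float32 array) pair, it can be useful to persist replies to disk as well. Here is a hypothetical helper (not part of the tutorial code) that writes such a pair to a 16-bit WAV file, demonstrated with a short sine wave standing in for real Bark output:

```python
import os
import tempfile
import wave

import numpy as np

def save_wav(path, sr, audio):
    # Clip float samples to [-1, 1] and convert to 16-bit PCM.
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)      # mono, matching Bark's output
        f.setsampwidth(2)      # 2 bytes per sample = 16-bit
        f.setframerate(sr)
        f.writeframes(pcm.tobytes())

# Dummy signal in place of synthesize_speech output: 0.1 s of a 440 Hz tone.
sr = 24000  # Bark's native sampling rate
t = np.linspace(0, 0.1, int(sr * 0.1), endpoint=False)
out_path = os.path.join(tempfile.gettempdir(), "bark_reply.wav")
save_wav(out_path, sr, 0.2 * np.sin(2 * np.pi * 440 * t))
```

The same helper would accept the `(sr, wav)` tuple returned by synthesize_speech directly.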

def clear_history():
   return [], []


def voice_to_voice(mic_file, history):
   history = history or []
   if not mic_file:
       return history, None, "Please record something!"
   try:
       user_text = transcribe(mic_file)
   except Exception as e:
       return history, None, f"ASR error: {e}"


   if not user_text:
       return history, None, "Didn't catch that. Try again?"


   try:
       reply = generate_reply(history, user_text)
   except Exception as e:
       return history, None, f"LLM error: {e}"


   try:
       sr, wav = synthesize_speech(reply)
   except Exception as e:
       return history + [(user_text, reply)], None, f"TTS error: {e}"


   return history + [(user_text, reply)], (sr, wav), f"User: {user_text}\nAssistant: {reply}"


def text_to_voice(user_text, history):
   history = history or []
   user_text = (user_text or "").strip()
   if not user_text:
       return history, None, "Type a message first."
   try:
       reply = generate_reply(history, user_text)
       sr, wav = synthesize_speech(reply)
   except Exception as e:
       return history, None, f"Error: {e}"
   return history + [(user_text, reply)], (sr, wav), f"User: {user_text}\nAssistant: {reply}"


def export_chat(history):
   lines = []
   for u, a in history or []:
       lines += [f"User: {u}", f"Assistant: {a}", ""]
   text = "\n".join(lines).strip() or "No conversation yet."
   with tempfile.NamedTemporaryFile(delete=False, suffix=".txt", mode="w") as f:
       f.write(text)
       path = f.name
   return path

We add interactive functions for our agent: clear_history resets the conversation, voice_to_voice handles speech input and returns a spoken reply, text_to_voice processes typed input and speaks back, and export_chat saves the entire dialog into a downloadable text file.
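The export logic is easy to verify on its own; this round-trip (duplicating export_chat so the snippet is self-contained) writes a one-turn history to a temp file and reads it back:

```python
import tempfile

def export_chat(history):
    # Serialize (user, assistant) pairs, one blank line between turns.
    lines = []
    for u, a in history or []:
        lines += [f"User: {u}", f"Assistant: {a}", ""]
    text = "\n".join(lines).strip() or "No conversation yet."
    with tempfile.NamedTemporaryFile(delete=False, suffix=".txt", mode="w") as f:
        f.write(text)
        path = f.name
    return path

path = export_chat([("What is Bark?", "A text-to-speech model from Suno.")])
print(open(path).read())
```

An empty history produces a file containing just "No conversation yet.", so the download button always returns something sensible.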

with gr.Blocks(title="Advanced Voice AI Agent (HF Pipelines)") as demo:
   gr.Markdown(
       "##  Advanced Voice AI Agent (Hugging Face Pipelines Only)\n"
       "- **ASR**: openai/whisper-small.en\n"
       "- **LLM**: google/flan-t5-base\n"
       "- **TTS**: suno/bark-small\n"
       "Speak or type; the agent replies with voice + text."
   )


   with gr.Row():
       with gr.Column(scale=1):
           mic = gr.Audio(sources=["microphone"], type="filepath", label="Record")
           say_btn = gr.Button("🎤 Speak")
           text_in = gr.Textbox(label="Or type instead", placeholder="Ask me anything…")
           text_btn = gr.Button("💬 Send")
           export_btn = gr.Button("⬇ Export Chat (.txt)")
           reset_btn = gr.Button("♻ Reset")
       with gr.Column(scale=1):
           audio_out = gr.Audio(label="Assistant Voice", autoplay=True)
           transcript = gr.Textbox(label="Transcript", lines=6)
           chat = gr.Chatbot(height=360)
   state = gr.State([])


   def update_chat(history):
       return [(u, a) for u, a in (history or [])]


   say_btn.click(voice_to_voice, [mic, state], [state, audio_out, transcript]).then(
       update_chat, inputs=state, outputs=chat
   )
   text_btn.click(text_to_voice, [text_in, state], [state, audio_out, transcript]).then(
       update_chat, inputs=state, outputs=chat
   )
   reset_btn.click(clear_history, None, [chat, state])
   export_btn.click(export_chat, state, gr.File(label="Download chat.txt"))


demo.launch(debug=False)

We build a clean Gradio UI that lets us speak or type and then hear the agent’s response. We wire buttons to our callbacks, maintain chat state, and stream results into a chatbot, transcript, and audio player, all launched in one Colab app.

In conclusion, we see how seamlessly Hugging Face pipelines enable us to create a voice-driven conversational agent that listens, thinks, and responds. We now have a working demo that captures audio, transcribes it, generates intelligent responses, and returns speech output, all inside Colab. With this foundation, we can experiment with larger models, add multilingual support, or even extend the system with custom logic. However we extend it, the core idea remains the same: we can bring together ASR, LLM, and TTS into one smooth workflow for an interactive voice AI experience.



The post How to Build an Advanced End-to-End Voice AI Agent Using Hugging Face Pipelines? appeared first on MarkTechPost.
