GPT-Realtime-2 Explained [What It Changes]
GPT-Realtime-2 is OpenAI's first voice model to apply GPT-5-class reasoning to spoken requests in real time. It can handle complex, multi-step spoken questions without routing them to a text model first. Per 9to5Mac, the published API price is $32 per 1M audio input tokens and $64 per 1M audio output tokens.
TL;DR
On May 7, 2026, OpenAI released three new realtime audio models: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper.
GPT-Realtime-2 brings GPT-5-class reasoning to live voice calls for the first time.
GPT-Realtime-Translate covers 70+ input languages translating into 13 output languages at $0.034/min.
GPT-Realtime-Whisper provides streaming speech-to-text at $0.017/min.
All three were immediately available via the Realtime API on launch day.
The practical impact for small and mid-size businesses: voice agents can now handle harder requests, speak more languages, and transcribe with lower latency — without hybrid text/voice routing.
What OpenAI Released on May 7, 2026
According to TechCrunch, all three models (GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper) were available to developers via the Realtime API as soon as they were announced on May 7, 2026. As of June 2026, all three are available in production.
GPT-Realtime-2 is positioned as the primary upgrade: it applies GPT-5-class reasoning to spoken input directly. According to 9to5Mac, the model brings GPT-5-class reasoning to live voice calls, enabling it to keep the conversation moving while it reasons through a request, calls tools, and handles corrections or interruptions. Our read: earlier realtime voice models — including the GPT-4o-based audio model — struggled with hard requests, because the underlying model could not reason through them in a single audio step. GPT-Realtime-2 removes that bottleneck.
According to 9to5Mac, OpenAI's new voice models can reason, translate, and transcribe as you speak — covering the three distinct voice AI capability layers that businesses care most about in production deployments.
GPT-Realtime-Translate covers live speech translation at speaker pace. According to TechCrunch, it supports more than 70 input languages with output in 13 target languages, billed by the minute.
GPT-Realtime-Whisper provides streaming transcription — speech to text in real time rather than as a post-call batch. According to 9to5Mac, it is priced at $0.017/min, making it a viable component in high-volume call-center and scheduling workflows where live transcript data drives downstream routing.
The Three Models: What Each Does
| Model | Primary Function | Price | Key Capability |
|---|---|---|---|
| GPT-Realtime-2 | Live voice with GPT-5-class reasoning | $32/1M audio input tokens; $64/1M audio output tokens | Handles complex spoken requests without text-pathway routing |
| GPT-Realtime-Translate | Live speech translation | $0.034/min | 70+ input languages → 13 output languages at speaker pace |
| GPT-Realtime-Whisper | Streaming speech-to-text | $0.017/min | Real-time transcription; feeds live transcript data to downstream systems |
Why Now: What Constraint Broke
Three technical shifts converged to make GPT-Realtime-2 viable in mid-2026:
1. Model size at inference speed. GPT-Realtime-2 brings GPT-5-class reasoning into the realtime voice API, per TechCrunch. Our read: models at that reasoning tier have historically carried inference latency that made live, low-latency voice impractical — so closing that gap, not the raw capability, is what changes the deployment calculus for phone-based workflows.
2. The routing workaround disappears. Our read: Earlier production voice agents commonly split traffic across multiple model calls — fast but limited voice models for simple requests, and heavier text models for complex ones, with the overhead of converting between audio and text in both directions. GPT-Realtime-2 applies GPT-5-class reasoning directly to audio input, which in our analysis eliminates the need for that text-pathway detour on complex requests. According to 9to5Mac, the model is built to "keep the conversation moving while it reasons through a request" — that capability profile is what makes single-model handling viable where it previously was not.
3. Translation at speaker pace. Earlier speech translation setups typically chained transcription, translation, and text-to-speech synthesis in sequence, and the accumulated delay made real-time conversation impractical. According to TechCrunch, OpenAI's new realtime voice models can "listen, reason, translate, transcribe, and take action as a conversation unfolds"; GPT-Realtime-Translate applies that to live speech translation, processing the audio stream as it arrives and outputting translated speech in near real time. (9to5Mac)
The Pricing in Practice: What Does It Cost to Run a Voice Agent?
Published pricing allows concrete cost modeling. The figures below are sourced from 9to5Mac and TechCrunch and should be verified against the current OpenAI API pricing page before production deployment, as OpenAI has historically revised pricing after initial announcements.
GPT-Realtime-2 pricing breakdown (per 9to5Mac):
Input: $32 per 1M audio tokens
Output: $64 per 1M audio tokens
Rough approximation: a 5-minute voice call involves roughly 7,500-10,000 audio input tokens and a comparable output volume. At the prices above, that works out to about $0.24-$0.32 for input plus $0.48-$0.64 for output, or roughly $0.72-$0.96 per call in model fees
GPT-Realtime-Translate:
$0.034/minute
A 5-minute translated call costs approximately $0.17 in model fees
GPT-Realtime-Whisper:
$0.017/minute
A 5-minute transcription costs approximately $0.085
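For a business running 500 calls per month at an average of 5 minutes each, a rough estimate of 7,500-10,000 audio tokens in each direction per call puts the GPT-Realtime-2 model cost at roughly $360-$480 per month ($0.72-$0.96 per call at the prices above). Whether that is a secondary concern relative to telephony, engineering, and routing overhead depends on the deployment, so model it alongside those costs rather than assuming it is negligible.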
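The per-call arithmetic can be sketched in a few lines of Python. This is a minimal sketch, not an official calculator: the prices are the launch figures cited above, the 7,500-10,000 token range is this article's rough approximation, and the function names are illustrative.

```python
# Launch prices as cited in this article; verify against OpenAI's pricing page.
INPUT_USD_PER_1M_TOKENS = 32.0
OUTPUT_USD_PER_1M_TOKENS = 64.0
TRANSLATE_USD_PER_MIN = 0.034
WHISPER_USD_PER_MIN = 0.017


def realtime2_call_cost(input_tokens: int, output_tokens: int) -> float:
    """Model fee in USD for one GPT-Realtime-2 call, from audio token counts."""
    return (
        input_tokens * INPUT_USD_PER_1M_TOKENS
        + output_tokens * OUTPUT_USD_PER_1M_TOKENS
    ) / 1_000_000


def per_minute_cost(minutes: float, usd_per_min: float) -> float:
    """Model fee in USD for a model billed by the minute."""
    return minutes * usd_per_min


# Assumption: 7,500-10,000 audio tokens in each direction for a 5-minute call.
low = realtime2_call_cost(7_500, 7_500)
high = realtime2_call_cost(10_000, 10_000)
calls_per_month = 500

print(f"GPT-Realtime-2 per call: ${low:.2f} to ${high:.2f}")
print(f"GPT-Realtime-2 per month: ${low * calls_per_month:.0f} to ${high * calls_per_month:.0f}")
print(f"Translate, 5 min: ${per_minute_cost(5, TRANSLATE_USD_PER_MIN):.3f}")
print(f"Whisper, 5 min: ${per_minute_cost(5, WHISPER_USD_PER_MIN):.3f}")
```

Swapping in token counts measured from real calls will give a tighter estimate than the rough range used here.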
What Changed for Production Voice Agents
Before May 7, 2026, the most capable production voice agent architecture required:
Audio streaming to a speech-to-text model
Text passed to a GPT-4o or GPT-4 text model for reasoning
Text response passed to a text-to-speech model
Synthesized audio streamed back to the caller
That pipeline added 1-3 seconds of latency per exchange and required coordinating three separate model calls. Errors compounded: a transcription error propagated into the reasoning step, and a reasoning error propagated into synthesis.
GPT-Realtime-2 replaces that pipeline with a single model call that accepts audio input and returns audio output. The architecture simplification has downstream effects: fewer failure points, lower latency, and simpler debugging when something goes wrong.
| Architecture Dimension | Pre-May 2026 (Hybrid) | Post-May 2026 (GPT-Realtime-2) |
|---|---|---|
| Model calls per exchange | 3 (STT + LLM + TTS) | 1 |
| Latency per exchange | 1-3+ seconds | Sub-second for standard queries |
| Complex query handling | Route to text model (added latency) | 1 model call (no text-pathway hop) |
| Transcription for downstream | Separate STT step | GPT-Realtime-Whisper parallel stream |
| Multilingual calls | Separate translation pipeline | GPT-Realtime-Translate (70+ input languages, 13 output languages) |
| Error propagation | Compounding across 3 steps | Single-step failure point |
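To make the error-propagation and latency rows concrete, here is a minimal sketch. The per-stage success rates and latencies are made-up illustrative values, not measurements of any OpenAI model, and the calculation assumes stage failures are independent.

```python
from math import prod


def chain_success(stage_success_rates: list[float]) -> float:
    """Probability that every stage succeeds, assuming independent stages."""
    return prod(stage_success_rates)


def chain_latency(stage_latencies_s: list[float]) -> float:
    """Total latency in seconds when stages run one after another."""
    return sum(stage_latencies_s)


# Hypothetical numbers for a three-stage STT -> LLM -> TTS pipeline.
hybrid_success = chain_success([0.95, 0.95, 0.95])
hybrid_latency = chain_latency([0.4, 0.9, 0.5])

# Hypothetical numbers for a single speech-to-speech model call.
single_success = chain_success([0.95])
single_latency = chain_latency([0.8])

print(f"hybrid: {hybrid_success:.3f} success, {hybrid_latency:.1f}s")
print(f"single: {single_success:.3f} success, {single_latency:.1f}s")
```

The point of the sketch is structural: chaining stages multiplies success rates and adds latencies, so removing stages helps on both axes even if each individual stage is no better than before.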
What This Means for Small and Mid-Size Businesses
Scheduling and Intake Voice Agents
The most immediate application for businesses with 5-100 employees is automated scheduling and intake. A voice agent powered by GPT-Realtime-2 can handle a spoken appointment request, check availability against a calendar API, confirm a time, and send a confirmation — all within a single voice call with no human intervention.
The GPT-5-class reasoning layer matters here because appointment scheduling frequently involves exception handling: "I need the same technician as last time" or "I want the earliest slot but not before 10 AM." Previous voice models handled simple scheduling reliably but failed on exceptions. GPT-Realtime-2's reasoning capability extends reliable handling further into the exception cases that previously required a human agent.
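The model still needs a tool on the business side that can honor those constraints once it has extracted them from speech. The following is a minimal sketch of such a tool; `Slot` and `earliest_slot` are hypothetical names for illustration and do not correspond to any particular calendar API.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Optional


@dataclass(frozen=True)
class Slot:
    start: datetime
    technician: str


def earliest_slot(
    slots: list[Slot],
    not_before: Optional[time] = None,
    technician: Optional[str] = None,
) -> Optional[Slot]:
    """Return the earliest slot matching the caller's constraints, or None."""
    candidates = [
        s
        for s in slots
        # "Not before 10 AM" is a time-of-day constraint, so compare times only.
        if (not_before is None or s.start.time() >= not_before)
        and (technician is None or s.technician == technician)
    ]
    return min(candidates, key=lambda s: s.start, default=None)


# Example availability (made-up data).
slots = [
    Slot(datetime(2026, 6, 1, 9, 0), "Ana"),
    Slot(datetime(2026, 6, 1, 10, 30), "Ben"),
    Slot(datetime(2026, 6, 1, 11, 0), "Ana"),
]

# "I want the earliest slot but not before 10 AM."
print(earliest_slot(slots, not_before=time(10, 0)))
# "...and I need the same technician as last time." (assumed to be Ana)
print(earliest_slot(slots, not_before=time(10, 0), technician="Ana"))
```

Returning `None` when nothing matches gives the agent an explicit signal to offer alternatives instead of guessing.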
Teams building on top of US Tech Automations' voice agent infrastructure will find GPT-Realtime-2 maps directly to an existing workflow configuration: the model swap is a configuration change in the agent definition, not a rebuild of the call routing or calendar integration.
Multilingual Customer Service
GPT-Realtime-Translate's 70+ input language support opens a market that was previously inaccessible to small businesses without multilingual staff. A home services company, dental practice, or healthcare-adjacent business in a multilingual market can now route non-English calls to an AI agent that handles the conversation in the caller's language and outputs in the business's language — without maintaining multilingual staff or paying per-call translation service fees.
GPT-Realtime-Translate supports 70+ input languages at $0.034/min, per 9to5Mac — a structural capability change for any business serving diverse language communities.
Live Call Transcription for Compliance and Follow-Up
GPT-Realtime-Whisper's streaming transcription enables use cases that were previously batch-only: real-time compliance monitoring, live call coaching, and immediate post-call summary generation before the agent hangs up. For businesses that need to log call content for compliance (financial services, healthcare, regulated home services), live transcription at $0.017/min makes real-time logging economically viable at scale.
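As an illustration of what real-time compliance monitoring could look like downstream of a streaming transcript, here is a minimal sketch. It assumes transcript text arrives as a sequence of string chunks; the `flag_phrases` function and the watchlist are hypothetical, and nothing here reflects the actual event format of the Realtime API.

```python
from typing import Iterable, Iterator


def flag_phrases(
    chunks: Iterable[str], watchlist: list[str]
) -> Iterator[tuple[int, str]]:
    """Yield (chunk_index, phrase) the first time each watched phrase appears.

    The running transcript is accumulated so that phrases split across
    chunk boundaries are still detected. Watchlist entries must be lowercase.
    """
    transcript = ""
    seen: set[str] = set()
    for index, chunk in enumerate(chunks):
        transcript += chunk.lower()
        for phrase in watchlist:
            if phrase not in seen and phrase in transcript:
                seen.add(phrase)
                yield index, phrase


# Made-up transcript chunks; "guarantee" is split across two chunks.
stream = ["I can guar", "antee a refund ", "on your card number"]
for index, phrase in flag_phrases(stream, ["guarantee", "card number"]):
    print(f"chunk {index}: flagged '{phrase}'")
```

A production version would likely cap the buffer length and use more robust matching than plain substring search.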
Honest Limits: What GPT-Realtime-2 Does Not Do
It does not eliminate latency entirely. Sub-second response on standard queries is achievable; multi-step reasoning on complex spoken queries still takes measurably longer than a simple lookup response. Callers with complex, multi-clause requests may still experience brief pauses.
It does not handle audio quality degradation gracefully. Poor cellular connections, background noise, and accented speech still degrade transcription quality. Businesses deploying in environments with variable audio conditions need noise-cancellation preprocessing in the audio pipeline.
It does not replace specialized voice synthesis. GPT-Realtime-2's output voice is natural but generic. Businesses that need a specific synthetic voice persona (branded voice, specific accent, specific prosody) still need a text-to-speech layer on top of the text output.
Our read: the 13 output languages in GPT-Realtime-Translate are unlikely to be equal in quality. Translation quality across AI systems consistently varies by language pair — high-resource languages (Spanish, French, German, Mandarin) tend to outperform lower-resource ones. No published benchmark comparing the 13 output languages is currently available; testing on actual caller scenarios before production rollout is essential.
Pricing is subject to change. OpenAI has historically revised Realtime API pricing after initial releases. The figures cited here are sourced from the May 7, 2026 announcement; verify current pricing at the OpenAI API pricing page before building cost models.
Signal vs Speculation
What is sourced fact (as of June 2026):
GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper launched May 7, 2026, immediately available via the Realtime API, per TechCrunch.
GPT-Realtime-Translate supports 70+ input languages and outputs in 13 languages at $0.034/min, per 9to5Mac.
GPT-Realtime-Whisper provides streaming speech-to-text at $0.017/min, per 9to5Mac.
GPT-Realtime-2 is priced at $32/1M audio input tokens and $64/1M audio output tokens, per 9to5Mac.
According to 9to5Mac, the models reason, translate, and transcribe as you speak — covering the three core voice AI capability requirements for production deployments.
According to TechCrunch, GPT-Realtime-2 brings GPT-5-class reasoning to live voice, enabling the model to listen, reason, and act as a conversation unfolds.
Our read (forward-looking interpretation):
If GPT-Realtime-2's architecture consolidation holds — one model call replacing a three-step pipeline — the 12-24 month outcome is widespread adoption in scheduling, intake, and customer service voice agents for small and mid-size businesses. The engineering simplification matters as much as the capability improvement: fewer moving parts means faster deployment and easier maintenance.
The translation capability is potentially the larger structural shift. A small home services business in a multilingual urban market currently either hires bilingual staff or loses non-English-speaking callers. GPT-Realtime-Translate at $0.034/minute makes the economics of multilingual voice coverage accessible to businesses that previously could not justify the cost.
The risk is commoditization of the voice layer itself. If GPT-Realtime-2 sets a new floor for voice agent capability, businesses competing on voice experience will need to differentiate on workflow integration, caller data handling, and escalation logic — not on the model itself. The companies that win in voice AI over the next 24 months will be the ones that build the most reliable workflow around the model, not the ones that negotiate the lowest model price.
Voice Agent Use Cases by Business Type
| Business Type | Voice Agent Use Case | Which Model | Model Price | Key Benefit |
|---|---|---|---|---|
| Home services (HVAC, plumbing) | Appointment scheduling + dispatch | GPT-Realtime-2 | $32/1M input tokens; $64/1M output tokens | Complex request handling (preferred tech, time constraints) |
| Dental / medical practice | Patient intake + appointment reminders | GPT-Realtime-2 | $32/1M input tokens; $64/1M output tokens | Natural conversation + HIPAA-compatible routing |
| Med spa / aesthetics | Booking + service inquiry | GPT-Realtime-2 + Whisper | $32/1M input tokens; $64/1M output tokens; plus $0.017/min | Booking + live call summary to CRM |
| Any business in multilingual market | Non-English caller handling | GPT-Realtime-Translate | $0.034/min | 70+ input languages → 13 output languages, no bilingual staff needed |
| Call-center-adjacent | Live compliance logging | GPT-Realtime-Whisper | $0.017/min | Streaming transcript, real-time compliance feed |
Key Takeaways
GPT-Realtime-2 is OpenAI's first voice model with GPT-5-class reasoning, enabling complex spoken requests to be handled in a single audio model call (per 9to5Mac).
GPT-Realtime-Translate covers 70+ input languages at $0.034/min, making multilingual voice coverage economically accessible for small businesses — per 9to5Mac.
GPT-Realtime-Whisper streams transcription at $0.017/min, enabling real-time compliance logging and immediate post-call summaries — per 9to5Mac.
The architecture simplification — one model call vs. three — is as significant as the capability improvement for teams maintaining production voice agents.
Pricing is live and sourced, but should be verified against current OpenAI pricing before committing to cost models.
The honest limits are real: audio quality, output language quality variance, and voice persona flexibility all still require engineering attention.
Frequently Asked Questions
What makes GPT-Realtime-2 different from the earlier GPT-4o audio model?
The primary difference is reasoning capability. In our read, the GPT-4o audio model could handle simple spoken requests, but production deployments commonly routed complex ones to a text model, adding latency and architectural complexity. GPT-Realtime-2 applies GPT-5-class reasoning directly in the audio pipeline, handling complex spoken requests in a single model call.
How much does it cost to run a voice agent on GPT-Realtime-2 for a month?
At $32/1M audio input tokens and $64/1M audio output tokens, the per-call model cost depends on call length and conversation complexity. Using a rough estimate of 7,500-10,000 audio tokens in each direction for a 5-minute customer service call, the model cost is roughly $0.72-$0.96 per call, or about $360-$480 per month at 500 calls. How that compares with total telephony and infrastructure cost depends on the deployment. Verify current pricing at the OpenAI API pricing page before finalizing cost models.
Does GPT-Realtime-Translate work for real-time customer service calls?
Yes, that is its primary designed use case — live speech translation at speaker pace. With 70+ input languages and 13 output languages, it handles the most common translation scenarios for US businesses serving multilingual communities. Quality varies by language pair; test against your actual caller population before production deployment.
Can I use GPT-Realtime-Whisper alone for call transcription without using GPT-Realtime-2 for the conversation?
Yes. The three models are independent and can be used in any combination. A business that wants live call transcription without AI-powered responses can deploy GPT-Realtime-Whisper alone as a streaming transcription layer feeding a live-agent or compliance-logging workflow.
How does US Tech Automations integrate with GPT-Realtime-2?
The GPT-Realtime-2 model handles the spoken conversation. The downstream steps — updating a CRM record, triggering a dispatch notification, sending a confirmation SMS, logging a call summary — require an orchestration layer that routes the model's output to the right business systems. US Tech Automations provides that orchestration layer, which means teams already using the platform's voice agent configuration can swap in GPT-Realtime-2 without rebuilding the downstream workflow.
What happens when a caller's question is too complex for GPT-Realtime-2?
GPT-Realtime-2 handles a significantly larger range of complex spoken queries than its predecessors. For the questions it cannot handle, the graceful failure path is escalation to a human agent — which requires a live-transfer workflow in your telephony infrastructure. That escalation routing is part of the voice agent design work, not something the model handles automatically.
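As a rough illustration of that design work, here is a minimal sketch of an escalation rule evaluated by the orchestration layer. The `CallState` fields and the thresholds are hypothetical choices for this example, not signals or defaults provided by the model or the Realtime API.

```python
from dataclasses import dataclass


@dataclass
class CallState:
    caller_requested_human: bool = False
    failed_tool_calls: int = 0
    clarification_turns: int = 0


def should_escalate(
    state: CallState,
    max_failed_tool_calls: int = 2,
    max_clarification_turns: int = 3,
) -> bool:
    """Decide whether to hand the call to a human agent."""
    return (
        state.caller_requested_human
        or state.failed_tool_calls >= max_failed_tool_calls
        or state.clarification_turns >= max_clarification_turns
    )


# An explicit request for a person always wins.
print(should_escalate(CallState(caller_requested_human=True)))
# Below both thresholds: keep the call with the voice agent.
print(should_escalate(CallState(failed_tool_calls=1, clarification_turns=2)))
```

Keeping the rule outside the model makes the escalation behavior testable and auditable independently of model updates.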
Is GPT-Realtime-2 HIPAA-compliant for healthcare voice workflows?
OpenAI offers HIPAA Business Associate Agreements for enterprise API customers. Whether a specific GPT-Realtime-2 deployment is HIPAA-compliant depends on the full architecture — data handling, storage, logging, access controls — not just the model selection. Consult with your compliance counsel before deploying in a healthcare context.
Conclusion
GPT-Realtime-2's May 7, 2026 release crosses two thresholds that matter for small and mid-size businesses: it brings GPT-5-class reasoning to live voice calls for the first time, per 9to5Mac, and — in our read — it enables complex spoken requests to be handled in a single audio model call without a text-pathway detour. The companion models — GPT-Realtime-Translate and GPT-Realtime-Whisper — fill the translation and transcription gaps that previously required separate integrations.
The businesses that move fastest on this are not the ones with the largest AI budgets. They are the ones that already have a production voice workflow and can execute a model swap cleanly. Building a net-new voice agent from scratch is a 6-12 week project; upgrading an existing agent to GPT-Realtime-2 can be a configuration change measured in days.
For teams ready to build or upgrade a production voice agent — and connect the conversation output to scheduling, CRM, dispatch, and confirmation systems — explore how US Tech Automations' agentic workflow platform handles the orchestration between the voice model and the downstream business systems where the operational value is realized.