OpenAI launched the new gpt-realtime model on August 28 and moved the Realtime API out of beta. I've been fucking around with it in production since Thursday, and here's what you need to know before you blow your budget on this thing.
What's different about the new model:
- Noticeably smarter (way better scores on Big Bench Audio)
- Actually follows instructions now (finally reads disclaimers word-for-word)
- Function calling works way more reliably
- 20% price drop ($32/1M input tokens vs $40/1M previously)
- Two new exclusive voices: Cedar and Marin
- Native support for images, SIP phone calls, and MCP servers
Production deployment reality (the parts nobody talks about)
The new pricing will still hurt your budget
Even with the 20% price reduction, you're looking at:
- $32 per million audio input tokens (roughly $0.032 per minute of user speech)
- $64 per million audio output tokens (roughly $0.064 per minute of AI speech)
- Cached input tokens: $0.40 per million (use this aggressively)
I burned through a shitload of money on the first day because the new model talks way more than expected: longer responses mean more output tokens, and output tokens cost double the input rate. The cost monitoring features are better now, but set billing alerts before you test, or learn the hard way like I did.
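If you want a harder guardrail than billing alerts: every response.done server event includes a usage block with token counts. Here's a minimal sketch that tracks spend per session, assuming the ws connection set up in the next section. The rates reuse the audio prices above and treat every token as audio (a slight overestimate), and the $1 cutoff is my own arbitrary budget:
// Rough per-session cost tracking from response.done usage stats
const RATES = {
  input: 32 / 1e6,    // $ per audio input token
  cached: 0.40 / 1e6, // $ per cached input token
  output: 64 / 1e6    // $ per audio output token
};
let sessionCost = 0;

ws.addEventListener("message", (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type !== "response.done" || !msg.response?.usage) return;
  const u = msg.response.usage;
  const cached = u.input_token_details?.cached_tokens ?? 0;
  sessionCost += (u.input_tokens - cached) * RATES.input
               + cached * RATES.cached
               + u.output_tokens * RATES.output;
  if (sessionCost > 1.0) { // hypothetical $1/session budget
    console.warn(`Session at $${sessionCost.toFixed(2)} - time to wrap up`);
  }
});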
WebSocket connection management is still a nightmare
The new model doesn't fix the fundamental WebSocket reliability issues:
- Connections still die every 3-7 minutes under load
- iOS Safari still kills connections when users switch apps
- Chrome still throttles background WebSocket connections
- Regional latency is all over the place (decent in the US, shitty everywhere else)
I spent most of the weekend debugging connection issues that turned out to be the exact same problems we had with the old model.
Connection code that actually works:
// Updated connection for gpt-realtime model.
// Note: the custom headers below only work server-side with the Node "ws"
// package - browser WebSockets can't set headers, so mint an ephemeral
// client token for browser use instead.
import WebSocket from "ws";

let ws;
let reconnectAttempts = 0;
const MAX_RECONNECTS = 10;

function initializeWebSocket() {
  ws = new WebSocket("wss://api.openai.com/v1/realtime?model=gpt-realtime", {
    headers: {
      "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1"
    }
  });

  ws.onopen = () => {
    reconnectAttempts = 0; // reset the counter on every successful connect
  };

  // Aggressive reconnection - the WebSocket WILL die, this is mandatory
  ws.onclose = () => {
    if (reconnectAttempts < MAX_RECONNECTS) {
      setTimeout(() => {
        reconnectAttempts++;
        initializeWebSocket(); // This is ugly but it works
      }, Math.pow(2, reconnectAttempts) * 1000); // Exponential backoff because fuck it
    }
  };
}

initializeWebSocket();
Function calling performance improvements (but still has gotchas)
The 33% accuracy improvement is real - function calls trigger more reliably and with better arguments. But timing is still weird:
// New asynchronous function calling - the model emits the call as an output
// event, keeps talking while your code runs, and expects the result back
ws.onmessage = async (event) => {
  const msg = JSON.parse(event.data);
  if (msg.type === "response.function_call_arguments.done") {
    // Your implementation of the registered "search_knowledge_base" tool
    const result = await searchKnowledgeBase(JSON.parse(msg.arguments));
    // Return the result tied to the original call_id...
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: msg.call_id,
        output: JSON.stringify(result)
      }
    }));
    // ...then request a new response so the model can use it
    ws.send(JSON.stringify({ type: "response.create" }));
  }
};
Production gotcha: Functions taking >2 seconds still break conversation flow. The model pauses, users think something's broken, then suddenly it continues talking. Build fast functions, or fake the response while processing in the background.
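A pattern that works for me when the real work is slow: return a provisional function output fast so the conversation keeps moving, then inject the real result once it lands. This is a sketch - slowLookup and sendFunctionOutput (a wrapper around the function_call_output send above) are hypothetical helpers:
// Acknowledge within ~1s, finish the real work in the background
async function handleSlowFunction(callId, args) {
  const pending = slowLookup(args); // hypothetical slow backend call
  sendFunctionOutput(callId, { status: "pending", message: "Checking now." });
  ws.send(JSON.stringify({ type: "response.create" })); // model narrates the wait

  const result = await pending;
  // Feed the real result back in as context and prompt a follow-up
  ws.send(JSON.stringify({
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "system",
      content: [{ type: "input_text", text: `Lookup finished: ${JSON.stringify(result)}` }]
    }
  }));
  ws.send(JSON.stringify({ type: "response.create" }));
}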
Image input support (beta but already in production)
This is the killer feature nobody expected. Users can send screenshots, photos, anything visual:
// Send an image alongside the voice conversation. The GA docs show
// input_image as a base64 data URL in image_url (earlier drafts used a
// nested image object, so verify against the current reference)
ws.send(JSON.stringify({
  type: "conversation.item.create",
  item: {
    type: "message",
    role: "user",
    content: [
      {
        type: "input_audio",
        audio: base64AudioChunk
      },
      {
        type: "input_image",
        image_url: `data:image/jpeg;base64,${base64ImageData}`
      }
    ]
  }
}));
Real production use case: customers on a support call can now say "I'm looking at error X on my screen" and upload a screenshot. The AI actually understands both the speech and the visual context.
Limitation: Images count toward your token limit. A single screenshot can cost 500+ tokens. Monitor this or your costs will explode.
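To keep the token burn under control, shrink screenshots client-side before sending them. A browser sketch - the 1024px cap and 0.7 JPEG quality are my own defaults, not API requirements:
// Downscale an image before base64-encoding it for input_image
async function downscaleImage(file, maxDim = 1024) {
  const bitmap = await createImageBitmap(file);
  const scale = Math.min(1, maxDim / Math.max(bitmap.width, bitmap.height));
  const canvas = document.createElement("canvas");
  canvas.width = Math.round(bitmap.width * scale);
  canvas.height = Math.round(bitmap.height * scale);
  canvas.getContext("2d").drawImage(bitmap, 0, 0, canvas.width, canvas.height);
  // JPEG at 0.7 quality keeps screenshots legible at a fraction of the bytes
  return canvas.toDataURL("image/jpeg", 0.7).split(",")[1]; // strip the data: prefix
}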
SIP phone integration (actually useful for once)
The new SIP support lets you connect directly to phone systems:
// SIP endpoint configuration (field names are illustrative - match them to
// whatever your telephony provider or gateway actually expects)
const sipConfig = {
  sip_endpoint: "sip:your-endpoint@provider.com",
  audio_format: "pcm16",
  sample_rate: 8000 // Phone quality (8 kHz), not the API's native 24 kHz
};
Production reality: This works great for call centers but requires serious telephony infrastructure. Don't try to build this yourself - use Twilio or Vonage as intermediaries.
I spent 3 days trying to connect directly to our PBX system. Gave up and routed through Twilio in 30 minutes. Don't be an idiot like me.
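For the record, the Twilio route is mostly just TwiML pointing a media stream at your own relay server, which then speaks the Realtime protocol. A sketch with Twilio's Node helper in an Express-style handler - the wss:// URL is a placeholder for your relay:
// Answer an inbound call and stream its audio to your relay
import twilio from "twilio";

function handleIncomingCall(req, res) {
  const response = new twilio.twiml.VoiceResponse();
  const connect = response.connect();
  connect.stream({ url: "wss://your-relay.example.com/media" }); // placeholder
  res.type("text/xml").send(response.toString());
}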
Memory management and performance optimization
Audio buffer cleanup (critical for long sessions)
The new model processes audio faster but still leaks memory if you don't clean up properly:
I spent way too long chasing memory leaks that turned out to be audio buffers that were never released.
// Clean up audio buffers or your RAM will explode
function cleanupAudioResources() {
  if (audioContext) {
    audioContext.close(); // async, but fire-and-forget is fine here
    audioContext = null;
  }
  if (mediaRecorder && mediaRecorder.stream) {
    mediaRecorder.stream.getTracks().forEach(track => track.stop());
  }
  // Drop the references so the GC can actually reclaim the buffers -
  // yes this is ugly
  audioBufferArray = null;
  outputAudioQueue = [];
}
// Run this on every conversation end or connection reset. Use
// addEventListener so you don't clobber the reconnecting onclose
// handler from earlier
ws.addEventListener("close", cleanupAudioResources);
Context management for long conversations
The new intelligent token limits are a lifesaver for cost control:
// Intelligent context truncation
const sessionConfig = {
  max_response_output_tokens: 4096,
  temperature: 0.8,
  // New: multi-turn truncation
  truncation_strategy: {
    type: "last_turns",
    last_turns: 10 // Keep the last 10 conversation turns
  }
};
Production tip: Aggressive context truncation can cut long-session costs by 40-60%. The model handles context loss better now, but it still gets confused and repetitive once truncation has dropped 20+ turns.
Browser compatibility hell (still exists but improved)
iOS Safari audio issues (slightly better)
iOS Safari is still the worst:
- Audio permission delays reduced from 30+ seconds to ~10 seconds
- Background app switching still kills connections
- Web Audio API resampling still sounds like garbage on some devices
Current iOS workaround that actually works:
// iOS-specific audio handling - because Safari hates developers
if (/iPad|iPhone|iPod/.test(navigator.userAgent)) {
  // Force a user interaction before starting the audio context or it'll never work
  document.addEventListener('touchstart', async () => {
    // Create the context lazily inside the gesture (old Safari needs the prefix)
    audioContext = audioContext || new (window.AudioContext || window.webkitAudioContext)();
    if (audioContext.state === 'suspended') {
      await audioContext.resume();
    }
  }, {once: true});

  // Longer timeout for iOS audio permission (learned this the hard way)
  setTimeout(() => {
    if (!audioPermissionGranted) {
      showFallbackTextInput(); // Always have a backup
    }
  }, 15000); // 15 second timeout because iOS is slow as shit
}
Chrome desktop and mobile differences
Chrome Desktop (pretty solid):
- WebSocket connections stable
- Audio permissions work reliably
- Background tab throttling manageable
Chrome Mobile (still problematic):
- Aggressive background WebSocket killing
- Audio context suspension in background
- Memory pressure kills connections
WebSocket connections WILL die constantly on mobile. Your app needs to handle this or users will be talking to a dead connection.
const isMobile = /Android|webOS|iPhone|iPad|iPod|BlackBerry|IEMobile|Opera Mini/i.test(navigator.userAgent);
const reconnectInterval = isMobile ? 30000 : 60000; // 30s mobile, 60s desktop - mobile sucks
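Here's how I actually use that interval: a watchdog that assumes the socket is dead once it goes quiet and forces a reconnect. The lastMessageAt bookkeeping is my own addition, not something the API gives you:
// Watchdog: if nothing arrives within the window, kill the socket and
// let the reconnecting onclose handler from earlier take over
let lastMessageAt = Date.now();
ws.addEventListener("message", () => { lastMessageAt = Date.now(); });

setInterval(() => {
  const quietFor = Date.now() - lastMessageAt;
  if (quietFor > reconnectInterval && ws.readyState === WebSocket.OPEN) {
    ws.close();
  }
}, reconnectInterval / 2);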