It's an HTTP API that talks to GPT. Send JSON, get JSON back. It started in June 2020 as a basic text-completion interface; now it handles images, audio, and whatever else they've shoved into it.
The architecture is simple: Your app → HTTP POST → OpenAI servers → Neural networks → JSON response → Your app explodes from rate limits.
How it Actually Works
You make HTTP POST requests, they run your shit through neural networks, you get JSON back. Authentication uses API keys that you'll leak to GitHub within a week - guaranteed. Their scanners will find it faster than you can say "oops".
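For the record, a minimal call looks something like this - a sketch using the official Python SDK (v1+), which reads OPENAI_API_KEY from the environment so the key stays out of your code and out of your GitHub history:

```python
# Minimal chat completion with the official Python SDK (openai>=1.0).
# The client picks up OPENAI_API_KEY from the environment - never hardcode it.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello in JSON."}],
)
print(response.choices[0].message.content)
```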
The rate limits will fuck up your demos. You get X requests per minute based on your account tier, and these limits hit exactly when your CEO is watching. Always implement exponential backoff or prepare for 429 errors at the worst possible moments.
Typical 429 error: {"error": {"message": "Rate limit reached for requests", "type": "requests", "param": null, "code": "rate_limit_exceeded"}} - this will haunt your dreams.
Rate limiting works like a token bucket: you get X requests per minute, and anything over that gets rejected with a 429. The bucket refills over time, but during traffic spikes or demos you'll hit the ceiling.
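A rough backoff sketch, assuming the v1 Python SDK's RateLimitError - tune the numbers for your tier:

```python
# Retry with exponential backoff plus jitter - a sketch, not gospel.
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI()

def chat_with_backoff(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o-mini", messages=messages
            )
        except RateLimitError:
            # Wait 1s, 2s, 4s, ... plus jitter so retries don't stampede.
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError(f"Still rate limited after {max_retries} retries")
```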
Pin your fucking SDK versions in requirements.txt or package.json. SDK updates break shit randomly, and you don't want to find out about breaking changes when your app's on fire at 2am. Check the changelog and GitHub issues before upgrading, or prepare for surprise downtime.
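Something like this - the version numbers here are made up for illustration, pin whatever you've actually tested:

```
# requirements.txt - exact pins, not >= ranges (versions are examples)
openai==1.35.0
redis==5.0.4
```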
Models That'll Drain Your Budget
GPT-4o costs $5.00 input, $15.00 output per million tokens. Sounds cheap until you realize a typical conversation burns through thousands of tokens. Great for code generation and multimodal tasks if you can afford it.
Cost reality check: 10K tokens each way = $0.05 input + $0.15 output = $0.20 per conversation. Scale that by thousands of users and watch your OpenAI bill cry.
o3 is their "smart" model at $2.00 input, $8.00 output per million tokens since OpenAI cut prices 80% in June 2025. Use this for complex reasoning tasks where you need the model to actually think. Still expensive as hell - just less budget-destroying than before.
GPT-4o Mini at $0.15 input, $0.60 output per million tokens is your cost-conscious option. Fast and cheap, perfect for simple tasks where you don't need the full brain power.
Reality check: Your bill will be 3x higher than your estimates. Tokens disappear faster than you think, especially with conversational interfaces where context gets expensive. Even after the price cuts, one mistake with o3 can still cost you hundreds.
Cost breakdown example: that same 10K-token conversation costs $0.20 on GPT-4o but only $0.02 input + $0.08 output = $0.10 on o3 (way better since the price cuts). Scale to 1,000 conversations daily and you're looking at $200/day for GPT-4o or $100/day for o3.
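If you want to stop guessing, here's a back-of-envelope calculator built from the prices quoted above - a sketch, since prices change whenever OpenAI feels like it:

```python
# Rough per-conversation cost. Prices in $ per million tokens, from the
# numbers above - update them when OpenAI inevitably changes them.
PRICES = {
    "gpt-4o":      {"input": 5.00, "output": 15.00},
    "o3":          {"input": 2.00, "output": 8.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def conversation_cost(model, input_tokens, output_tokens):
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 10K tokens each way, 1,000 conversations a day:
print(conversation_cost("gpt-4o", 10_000, 10_000) * 1_000)  # 200.0 ($/day)
print(conversation_cost("o3", 10_000, 10_000) * 1_000)      # 100.0 ($/day)
```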
The Multimodal Mess
Multimodal flow: Text + Images + Audio → Single API call → Combined neural processing → JSON response with interpreted context. One model handles everything, which is convenient until debugging multimodal interactions becomes a nightmare.
DALL-E generates images from text. Works well but costs $0.04 to $0.17 per image depending on resolution. Don't let users generate unlimited images unless you enjoy surprise bills. Check out the DALL-E guide for implementation details.
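The call itself is trivial - a sketch with the v1 Python SDK (dall-e-3 only accepts n=1, so batch-happy users can't multiply your bill in one request):

```python
# Generate one image - gate this behind your own rate limits.
from openai import OpenAI

client = OpenAI()
result = client.images.generate(
    model="dall-e-3",
    prompt="a developer crying over a 429 error",
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # the URL expires - download it if you want to keep it
```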
Whisper transcribes audio at $0.006 per minute (about $0.36 per hour of audio). Quality is solid and it supports tons of languages. The file size limit is 25MB, so you'll need to chunk longer recordings.
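One call for anything under the limit - a sketch; chunking longer files is your problem:

```python
# Transcribe a file under the 25MB cap - chunk anything bigger yourself.
from openai import OpenAI

client = OpenAI()
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcript.text)
```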
GPT-4o handles text + images + audio in one request. Useful for building multimodal chatbots that can see and hear, assuming you can handle the complexity of multimodal debugging.
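A text-plus-image request looks roughly like this (sticking to images here - audio input has its own formats and flags):

```python
# One request mixing text and an image URL - the model interprets both together.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```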
Production Reality Check
Streaming responses make your UI feel responsive while the model generates text. Set stream: true in your requests and handle server-sent events. Your first implementation will probably have race conditions.
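The basic loop, as a sketch - the race conditions show up when you wire this into a UI, not here:

```python
# Stream tokens as they arrive instead of waiting for the full response.
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain token buckets."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks (role, finish) carry no text
        print(delta, end="", flush=True)
```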
Caching is mandatory unless you hate money. Hash prompts, store responses in Redis, implement TTL based on your use case. A decent cache will cut your API costs by 60-80%.
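A minimal version, assuming a local Redis and the redis-py client - hash the full request, not just the prompt, if your parameters vary:

```python
# Cache responses in Redis, keyed by a hash of model + prompt, with a TTL.
import hashlib

import redis
from openai import OpenAI

client = OpenAI()
cache = redis.Redis()  # assumes Redis on localhost:6379

def cached_completion(model, prompt, ttl=3600):
    key = "oai:" + hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
    hit = cache.get(key)
    if hit:
        return hit.decode()
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    text = response.choices[0].message.content
    cache.setex(key, ttl, text)
    return text
```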
Error handling needs to cover rate limits (429), content policy violations (400), authentication failures (401), and the occasional 500 when their servers shit the bed. Always log the full error response - their error messages are sometimes helpful.
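Roughly the exception map with the v1 Python SDK (names are from the openai package; log more context than this in production):

```python
# Map the common failure modes to SDK exceptions and log the details.
import logging

from openai import (
    OpenAI,
    APIError,
    AuthenticationError,
    BadRequestError,
    RateLimitError,
)

client = OpenAI()
log = logging.getLogger("openai-calls")

def safe_completion(messages):
    try:
        return client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    except RateLimitError:          # 429 - back off and retry
        log.warning("rate limited")
    except BadRequestError as e:    # 400 - includes content policy rejections
        log.error("bad request: %s", e)
    except AuthenticationError:     # 401 - bad or revoked API key
        log.critical("auth failed, check OPENAI_API_KEY")
    except APIError as e:           # 500s and everything else on their end
        log.error("server error: %s", e)
    return None
```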
Embedding vectors are 1536 floats for text-embedding-3-small or 3072 for the large version. Don't store these in PostgreSQL with pgvector unless you enjoy 30-second similarity searches. Use Pinecone or prepare to debug slow vector queries for weeks.
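Getting a vector is one call - a sketch:

```python
# Fetch an embedding - 1536 floats for the small model, 3072 for the large.
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input="rate limits ruined my demo",
)
vector = resp.data[0].embedding
print(len(vector))  # 1536
```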
Vector search flow: Text → API call → 3072 floats → Vector database → Similarity search → Results that may or may not be relevant.
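The "similarity search" step is plain cosine similarity underneath - here's the toy version; a vector database does the same math with indexes on top:

```python
# Toy cosine similarity - what the vector database does, minus the indexing.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm  # 1.0 = same direction, 0.0 = unrelated

# Rank stored vectors against a query, highest similarity first:
# results = sorted(stored, key=lambda v: cosine_similarity(query, v), reverse=True)
```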