AWS splits their AI shit into four buckets: ready-made APIs (easy but limited), SageMaker (powerful but complex as hell), custom hardware (fast but expensive), and Bedrock (managed foundation models - pricey but reliable). Here's when to use what, based on real production experience.
Ready-Made APIs (Work Until They Don't)
The AI Services like Rekognition and Textract are basically plug-and-play APIs: you send an image, you get back JSON with face detection or text extraction. They work great until you need something custom, and then you're shit out of luck.
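For scale, the happy path really is this short. A minimal boto3 sketch - the bucket and object names are placeholders, and `detect_text` is just one of several detection calls:

```python
import boto3

# Minimal Rekognition call: point it at an image in S3, get JSON back.
# "my-product-photos"/"widget.jpg" are hypothetical names.
rekognition = boto3.client("rekognition", region_name="us-east-1")

response = rekognition.detect_text(
    Image={"S3Object": {"Bucket": "my-product-photos", "Name": "widget.jpg"}}
)
for detection in response["TextDetections"]:
    print(detection["Type"], detection["DetectedText"], detection["Confidence"])
```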
Real talk: Rekognition shits the bed on low-quality images. Learned this when vendor product photos looked like they were taken with a Nokia from 2005. Spent two weeks bouncing between `InvalidImageFormatException` and `ImageTooLargeException` before realizing we needed to resize everything to under 5MB and convert it to RGB first. The AWS docs conveniently forget to mention that this breaks differently in us-east-1 vs us-west-2 - same exact image, different error codes, no explanation why. Also found out Docker BuildKit 0.12+ breaks image processing in SageMaker containers: spent 4 hours debugging `RuntimeError: CUDA out of memory` before realizing Docker was the problem, not CUDA.
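The fix, roughly - a sketch of the idea rather than our exact production code: force everything to RGB, then re-encode (and shrink if needed) until the JPEG bytes fit under Rekognition's 5MB limit for inline image bytes.

```python
import io

from PIL import Image

MAX_BYTES = 5 * 1024 * 1024  # Rekognition's limit for inline image bytes

def prepare_for_rekognition(path: str) -> bytes:
    img = Image.open(path).convert("RGB")   # RGBA/CMYK/palette -> plain RGB
    quality = 90
    while True:
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality)
        if buf.tell() <= MAX_BYTES:
            return buf.getvalue()
        if quality > 50:
            quality -= 10                    # try cheaper compression first
        else:
            img = img.resize((img.width // 2, img.height // 2))  # then shrink
```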
Comprehend handles sentiment analysis and entity detection decently, but it's trained on clean text. Feed it social media posts or customer reviews with typos and slang, and accuracy drops to maybe 60%. Had to build our own preprocessing pipeline to clean text before sending it to AWS.
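The pipeline boiled down to stripping the junk Comprehend was never trained on. A rough sketch of that idea (the real thing handled more cases):

```python
import re

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

def clean_text(text: str) -> str:
    text = re.sub(r"https?://\S+", "", text)    # URLs
    text = re.sub(r"[@#]\w+", "", text)         # mentions and hashtags
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # "soooooo" -> "soo"
    return re.sub(r"\s+", " ", text).strip()

result = comprehend.detect_sentiment(
    Text=clean_text("soooooo bad!!! @support #fail https://example.com"),
    LanguageCode="en",
)
print(result["Sentiment"], result["SentimentScore"])
```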
SageMaker: The Kitchen Sink Approach
SageMaker is AWS's answer to "what if we made one service that does everything?" It's powerful but complicated as fuck: custom model training, deployment, data management, hosted notebooks - and because it tries to do everything, the learning curve is brutal.
War story: our first SageMaker deployment took three months, because VPC configuration is designed by sadists. Notebooks kept throwing `ClientError: RequestTimeTooSkewed` even though the timestamps looked correct - turns out our system clock was 2 minutes off, and AWS decided that was unacceptable. Training jobs died with `UnknownError` (helpful, AWS). IAM permissions failed with `AccessDenied` for policies that looked identical to working examples from AWS's own GitHub. The SageMaker docs are spread across maybe 50 pages that contradict each other - section 4 says use one API, section 12 says that API is deprecated.
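One thing that actually helps: when you see `RequestTimeTooSkewed`, measure your skew before touching IAM. A quick hypothetical check - compare local UTC against the Date header AWS sends back (even on error responses):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
import urllib.error
import urllib.request

def aws_clock_skew(endpoint: str = "https://sts.amazonaws.com") -> float:
    """Return local-minus-AWS clock skew in seconds."""
    try:
        resp = urllib.request.urlopen(endpoint)
    except urllib.error.HTTPError as err:
        resp = err  # error responses still carry a Date header
    server_time = parsedate_to_datetime(resp.headers["Date"])
    return (datetime.now(timezone.utc) - server_time).total_seconds()

skew = aws_clock_skew()
if abs(skew) > 60:
    print(f"Clock off by {skew:.0f}s - fix NTP before blaming your IAM policies")
```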
The new SageMaker Studio interface is better, but it's still slow as molasses. Jupyter notebooks crash if you have more than 50 tabs open (don't ask why I know this - debugging a data pipeline at 4am makes you do stupid things). Also, the instance pricing will shock you: an `ml.p4d.24xlarge` costs $37.69/hour at current list price, not the $32 they quote in half their examples.
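Don't trust prices quoted in example code; ask the Price List API. A sketch - the attribute names here (`instanceName`, `regionCode`) are my assumption for the AmazonSageMaker service code, so confirm them with `describe_services` first:

```python
import json

import boto3

# The Pricing API only lives in us-east-1 and ap-south-1.
pricing = boto3.client("pricing", region_name="us-east-1")

resp = pricing.get_products(
    ServiceCode="AmazonSageMaker",
    Filters=[
        # Assumed attribute names - verify with:
        # pricing.describe_services(ServiceCode="AmazonSageMaker")
        {"Type": "TERM_MATCH", "Field": "instanceName", "Value": "ml.p4d.24xlarge"},
        {"Type": "TERM_MATCH", "Field": "regionCode", "Value": "us-east-1"},
    ],
    MaxResults=10,
)
for entry in resp["PriceList"]:  # each entry is a JSON string, not a dict
    product = json.loads(entry)
    term = next(iter(product["terms"]["OnDemand"].values()))
    dim = next(iter(term["priceDimensions"].values()))
    print(dim["description"], dim["pricePerUnit"].get("USD"))
```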
Custom Hardware: Fast but Your Wallet Will Cry
AWS's Trainium2 and Inferentia2 chips are legitimately fast. Training large models on Trainium2 is about 30% cheaper than equivalent NVIDIA hardware, assuming you can get your model to work with their custom PyTorch modifications.
The catch: you rewrite your training code to work with their bastardized PyTorch fork. The Neuron SDK examples work fine until you try `torch.einsum` and get `RuntimeError: Graph partitioning failed` at 3am before a deadline, running on Red Bull and regret. Their compiler throws errors like "Unsupported operation detected" without telling you which operation or why, like a teenager giving you attitude.
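The workaround that usually unblocks you: decompose the einsum into primitives the compiler does handle. A toy illustration of the idea (this particular contraction is made up, not from our model):

```python
import torch

# 'bld,dh->blh' is a batched projection; the einsum one-liner would be:
#     out = torch.einsum("bld,dh->blh", x, w)
# The same thing as a plain matmul, which compilers partition more happily:
def projection(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, d_model), w: (d_model, d_hidden)
    return torch.matmul(x, w)  # broadcasts over batch dims -> (b, l, h)

x, w = torch.randn(2, 16, 64), torch.randn(64, 32)
assert torch.allclose(projection(x, w), torch.einsum("bld,dh->blh", x, w), atol=1e-5)
```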
Bedrock: Expensive but Actually Works
Amazon Bedrock lets you call models like Claude, Llama, and others through one AWS API instead of juggling a separate SDK per vendor. The pricing is brutal - Claude 4 costs about 3x what you'd pay Anthropic directly - but it works reliably.
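The API itself is pleasantly boring. A minimal sketch using the Converse API - the model ID and region are placeholders, use whatever your account actually has access enabled for:

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
    messages=[{"role": "user",
               "content": [{"text": "Write a haiku about rate limits."}]}],
    inferenceConfig={"maxTokens": 200, "temperature": 0.5},
)
print(response["output"]["message"]["content"][0]["text"])
```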
Production reality: Bedrock rate limits are designed to humiliate you during demos. Watched a senior engineer break down when our investor demo threw `ThrottlingException: Rate exceeded` after working perfectly all week. The default quota is 200 requests/minute - sounds like a lot until you need it for an actual demo. Set up CloudWatch alarms and a retry wrapper (sketch below), or prepare to explain why your million-dollar AI can't write a haiku while investors are staring at you and you're sweating through your shirt like an idiot.
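The retry wrapper is nothing fancy - exponential backoff with jitter, retrying only on throttling:

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def converse_with_retry(model_id, messages, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return bedrock.converse(modelId=model_id, messages=messages)
        except ClientError as err:
            if err.response["Error"]["Code"] != "ThrottlingException":
                raise  # only retry throttling; surface real errors immediately
            # 1s, 2s, 4s... plus jitter so parallel clients don't re-collide
            time.sleep(2 ** attempt + random.random())
    raise RuntimeError(f"Still throttled after {max_attempts} attempts")
```

If you'd rather not hand-roll it, botocore's built-in retry config (`Config(retries={"mode": "adaptive", "max_attempts": 10})`) does roughly the same thing.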
The new AgentCore framework is still in preview, which in AWS-speak means "works great until it doesn't, and when it breaks you're fucked because there's no real support yet."