gRPC Production Error Resolution - AI Technical Reference
Critical Production Failures and Solutions
Connection Refused Errors
Primary Cause: Network configuration, not application logic (99% of cases)
Severity: High - Service completely unavailable
Debug Priority: Check these in order:
- Pod status: kubectl get pods -l app=service-name
- Service endpoints: kubectl get endpoints service-name
- Internal connectivity: kubectl exec -it pod -- grpcurl -plaintext service:9090 list
Common Root Causes:
- Service selector mismatch with pod labels
- gRPC port not exposed in Service manifest
- Pod health check failures preventing readiness
- Network policy blocking traffic
Working Service Configuration:
apiVersion: v1
kind: Service
metadata:
  name: grpc-service
spec:
  ports:
  - port: 9090
    targetPort: 9090
    protocol: TCP
  selector:
    app: grpc-service  # Must match pod labels exactly
DEADLINE_EXCEEDED Errors
Primary Cause: Client timeout shorter than server processing time
Severity: Medium - Requests failing but service functional
Detection: Server completes requests, but clients time out before the response arrives
Language-Specific Timeout Fixes:
// Go - 30 second timeout
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
# Python - 30 second timeout
response = stub.YourMethod(request, timeout=30)
// Node.js - 30 second timeout
client.yourMethod(request, {deadline: Date.now() + 30000}, callback);
Performance Threshold: If "fast" calls need 30+ second timeouts, investigate underlying performance issues
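When the client gives up first, the server usually keeps burning CPU on a response nobody will read. A hedged server-side sketch in Go: a unary interceptor that checks how much of the deadline remains and fails fast when there is not enough time to do useful work (the 500ms floor and the interceptor name are assumptions, not a recommendation):
import (
    "context"
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// deadlineGuard rejects requests whose remaining deadline is too short to finish.
func deadlineGuard(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
    if deadline, ok := ctx.Deadline(); ok && time.Until(deadline) < 500*time.Millisecond {
        return nil, status.Errorf(codes.DeadlineExceeded, "not enough time left to handle %s", info.FullMethod)
    }
    return handler(ctx, req)
}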
HTTP/2 Connection Reset (RST_STREAM)
Primary Cause: Load balancer incompatible with HTTP/2 or connection drops
Frequency: Random occurrences during network instability
Impact: Request failures requiring retry logic
Retry Configuration (Go):
conn, err := grpc.Dial("server:9090",
    grpc.WithTransportCredentials(insecure.NewCredentials()),
    grpc.WithDefaultServiceConfig(`{
        "methodConfig": [{
            "name": [{}],
            "retryPolicy": {
                "MaxAttempts": 4,
                "InitialBackoff": "0.1s",
                "MaxBackoff": "1s",
                "BackoffMultiplier": 2.0,
                "RetryableStatusCodes": [ "UNAVAILABLE" ]
            }
        }]
    }`)) // exponential backoff: roughly 0.1s, 0.2s, 0.4s between attempts
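Connection drops can also be caught earlier with client-side keepalives, which probe the connection during idle periods instead of waiting for the next RPC to fail. A hedged sketch (the intervals are assumptions; ping too aggressively and the server's keepalive enforcement policy may close the connection):
import (
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    "google.golang.org/grpc/keepalive"
)

conn, err := grpc.Dial("server:9090",
    grpc.WithTransportCredentials(insecure.NewCredentials()),
    grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time:                30 * time.Second, // ping if the connection is idle this long
        Timeout:             10 * time.Second, // drop the connection if the ping gets no reply
        PermitWithoutStream: true,             // keep probing even with no active RPCs
    }))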
Load Balancing Failures
HTTP/2 Connection Persistence Problem
Issue: All requests route to single backend server
Root Cause: Load balancers treat HTTP/2 connections like HTTP/1.1
Impact: Server overload with 90% of backends idle
Detection Time: 12+ hours if monitoring isn't gRPC-aware
NGINX Configuration Fix:
upstream grpc_backend {
    server app1:9090;
    server app2:9090;
    server app3:9090;
    keepalive 32;  # Critical for HTTP/2
}

server {
    listen 9090 http2;  # Enable HTTP/2
    location / {
        grpc_pass grpc://grpc_backend;
        grpc_set_header Host $host;
    }
}
Kubernetes Service Discovery Issues
Problem: Client-side load balancing broken by default Kubernetes services
Solution: Headless services + gRPC client-side balancing
Cost: 3+ days debugging time if root cause unknown
Working Kubernetes Configuration:
apiVersion: v1
kind: Service
metadata:
  name: grpc-service
spec:
  clusterIP: None  # Headless service - required
  selector:
    app: grpc-server
  ports:
  - port: 9090
    targetPort: 9090
Client Configuration:
// "serviceName" must match the name registered with SetServingStatus (see Health Check Implementation below).
conn, err := grpc.Dial("dns:///grpc-service.default.svc.cluster.local:9090", // dns:/// scheme so the resolver returns every pod IP
    grpc.WithDefaultServiceConfig(`{
        "loadBalancingConfig": [{"round_robin": {}}],
        "healthCheckConfig": {
            "serviceName": "YourService"
        }
    }`),
    grpc.WithTransportCredentials(insecure.NewCredentials()))
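One grpc-go detail worth a hedged reminder: the healthCheckConfig above is silently ignored unless the client-side health checking code is linked into the binary via a blank import.
import (
    _ "google.golang.org/grpc/health" // registers client-side health checking; without it healthCheckConfig does nothing
)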
Protocol Buffer Version Management
Breaking Changes Impact
Risk: Method not found errors across microservices
Time to Resolution: 1-6 hours rollback + weeks of cleanup
Prevention: Explicit service versioning
Safe Versioning Pattern:
service UserServiceV1 {
  rpc GetUser(GetUserRequest) returns (GetUserResponse);
}

service UserServiceV2 {
  rpc GetUser(GetUserRequestV2) returns (GetUserResponseV2);
  rpc GetUserV1(GetUserRequest) returns (GetUserResponse);  // Backward compatibility
}
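During a migration, both versions can be served from the same process so old clients keep working while new clients move over. A hedged Go sketch, assuming protoc-generated registration functions and a hypothetical generated package pb (the handler structs are placeholders for your own implementations):
import (
    "log"
    "net"

    "google.golang.org/grpc"

    pb "example.com/yourapp/gen/user" // hypothetical generated package
)

type userServiceV1 struct {
    pb.UnimplementedUserServiceV1Server
}

type userServiceV2 struct {
    pb.UnimplementedUserServiceV2Server
}

func main() {
    srv := grpc.NewServer()

    // Old clients keep calling V1 while new clients migrate to V2.
    pb.RegisterUserServiceV1Server(srv, &userServiceV1{})
    pb.RegisterUserServiceV2Server(srv, &userServiceV2{})

    lis, err := net.Listen("tcp", ":9090")
    if err != nil {
        log.Fatalf("listen: %v", err)
    }
    log.Fatal(srv.Serve(lis))
}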
Advanced Debugging Strategies
Environment-Specific Testing
Problem: grpcurl works locally but production fails
Root Cause: Network policies, service mesh, authentication, SSL certificates
Solution: Test from inside production environment
# Correct debugging approach
kubectl exec -it client-pod -- grpcurl -plaintext \
  -d '{"id": 123}' \
  service.namespace.svc.cluster.local:9090 \
  UserService/GetUser
Debug Logging Configuration
# Go gRPC debug logging
export GRPC_GO_LOG_VERBOSITY_LEVEL=99
export GRPC_GO_LOG_SEVERITY_LEVEL=info
export GODEBUG=http2debug=1
# Python gRPC debug logging
export GRPC_VERBOSITY=debug
export GRPC_TRACE=all
Silent Failure Detection
Issue: 5% request loss without errors in logs
Root Cause: Client timeouts shorter than server processing time; clients retry after timing out, so the server processes the same request more than once
Detection: Server request count > client success count
Deduplication Solution:
func (s *server) ProcessRequest(ctx context.Context, req *Request) (*Response, error) {
    requestID := req.GetRequestId()

    // Check if already processed
    if result, exists := s.cache.Get(requestID); exists {
        return result, nil
    }

    // Process and cache result
    result, err := s.doActualWork(ctx, req)
    if err == nil {
        s.cache.Set(requestID, result, 5*time.Minute)
    }
    return result, err
}
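The s.cache used above is left abstract. A minimal in-process sketch matching the Get/Set calls in the handler (a hypothetical helper; in a multi-replica deployment a shared store such as Redis is usually needed so all replicas deduplicate against the same keys):
import (
    "sync"
    "time"
)

type cacheEntry struct {
    value     *Response
    expiresAt time.Time
}

// ttlCache is a tiny mutex-guarded map with per-entry expiry.
type ttlCache struct {
    mu      sync.Mutex
    entries map[string]cacheEntry
}

func newTTLCache() *ttlCache {
    return &ttlCache{entries: make(map[string]cacheEntry)}
}

func (c *ttlCache) Get(key string) (*Response, bool) {
    c.mu.Lock()
    defer c.mu.Unlock()
    e, ok := c.entries[key]
    if !ok || time.Now().After(e.expiresAt) {
        delete(c.entries, key) // expired or missing
        return nil, false
    }
    return e.value, true
}

func (c *ttlCache) Set(key string, value *Response, ttl time.Duration) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.entries[key] = cacheEntry{value: value, expiresAt: time.Now().Add(ttl)}
}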
Performance Troubleshooting
Latency Issues
Primary Causes (in order of frequency):
- Network latency between services
- Connection creation per request (incorrect pattern)
- Load balancer terminating connections instead of proxying
Connection Reuse Pattern:
// Incorrect - creates a new connection per call
func makeCall() {
    conn, _ := grpc.Dial("server:9090",
        grpc.WithTransportCredentials(insecure.NewCredentials()))
    defer conn.Close()
    // ... make call
}

// Correct - create the connection once and reuse it
var globalConn *grpc.ClientConn

func init() {
    globalConn, _ = grpc.Dial("server:9090",
        grpc.WithTransportCredentials(insecure.NewCredentials()))
}
Health Check Implementation
import "google.golang.org/grpc/health"
import "google.golang.org/grpc/health/grpc_health_v1"
// Server setup
healthServer := health.NewServer()
grpc_health_v1.RegisterHealthServer(server, healthServer)
healthServer.SetServingStatus("YourService", grpc_health_v1.HealthCheckResponse_SERVING)
Kubernetes Health Check Configuration:
livenessProbe:
  exec:
    command: ["/bin/grpc_health_probe", "-addr=:9090"]
  initialDelaySeconds: 5
  periodSeconds: 10
Memory Usage Optimization
Common Issues:
- Each client connection consumes memory
- Unbounded message sizes
- Goroutine leaks in connection handling
Memory Controls:
grpc.NewServer(
    grpc.MaxRecvMsgSize(1024*1024), // 1MB max incoming
    grpc.MaxSendMsgSize(1024*1024), // 1MB max outgoing
)
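Message size limits cap per-request memory, but idle client connections also pile up. A hedged sketch that additionally bounds connection lifetime with keepalive server parameters (the durations are assumptions to tune for your workload; MaxConnectionAge also forces clients to reconnect, which helps load rebalancing):
import (
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/keepalive"
)

server := grpc.NewServer(
    grpc.MaxRecvMsgSize(1024*1024),
    grpc.MaxSendMsgSize(1024*1024),
    grpc.KeepaliveParams(keepalive.ServerParameters{
        MaxConnectionIdle:     5 * time.Minute,  // close connections with no activity
        MaxConnectionAge:      30 * time.Minute, // recycle long-lived connections
        MaxConnectionAgeGrace: 1 * time.Minute,  // allow in-flight RPCs to finish first
    }),
)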
Monitoring and Observability
Critical Metrics
Essential gRPC Status Codes to Monitor:
- OK: Success rate baseline
- UNAVAILABLE: Infrastructure failures requiring immediate attention
- DEADLINE_EXCEEDED: Timeout issues indicating performance problems
- RESOURCE_EXHAUSTED: Capacity planning alerts
Prometheus Instrumentation:
import "github.com/grpc-ecosystem/go-grpc-prometheus"
grpcMetrics := grpc_prometheus.NewServerMetrics()
server := grpc.NewServer(
grpc.UnaryInterceptor(grpcMetrics.UnaryServerInterceptor()),
grpc.StreamInterceptor(grpcMetrics.StreamServerInterceptor()),
)
grpcMetrics.InitializeMetrics(server)
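The metrics still need to be registered with a Prometheus registry and exposed over HTTP before anything shows up in dashboards. A hedged sketch continuing from the snippet above (the :9092 metrics port is an assumption):
import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Histograms are off by default; enable them for the latency alert below.
grpcMetrics.EnableHandlingTimeHistogram()

// Register the gRPC server metrics and expose /metrics for scraping.
prometheus.MustRegister(grpcMetrics)
go func() {
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9092", nil))
}()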
Alert Configuration
High-Priority Alerts:
# Error rate above 5%
- alert: gRPCHighErrorRate
  expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) / sum(rate(grpc_server_handled_total[5m])) > 0.05

# 95th percentile latency above 1 second
- alert: gRPCHighLatency
  expr: histogram_quantile(0.95, rate(grpc_server_handling_seconds_bucket[5m])) > 1.0
Status Codes for Logging Only (Not Alerting):
- CANCELLED: Client-side cancellation
- DEADLINE_EXCEEDED: Usually client timeout configuration
- INVALID_ARGUMENT: Client data validation errors
- PERMISSION_DENIED: Expected authentication failures
Request Logging Implementation
import (
    "context"
    "time"

    log "github.com/sirupsen/logrus" // structured logger implied by log.WithFields
    "google.golang.org/grpc"
    "google.golang.org/grpc/status"
    "google.golang.org/protobuf/proto"
)

func loggingInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
    start := time.Now()
    resp, err := handler(ctx, req)
    duration := time.Since(start)
    code := status.Code(err)
    log.WithFields(log.Fields{
        "grpc.method":       info.FullMethod,
        "grpc.code":         code,
        "grpc.duration":     duration,
        "grpc.request_size": proto.Size(req.(proto.Message)),
    }).Info("gRPC request")
    return resp, err
}
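Interceptors only take effect if they are installed when the server is constructed. A hedged sketch chaining this logging interceptor with the Prometheus interceptor from the monitoring section above (grpc.ChainUnaryInterceptor ships with current grpc-go releases):
server := grpc.NewServer(
    grpc.ChainUnaryInterceptor(
        grpcMetrics.UnaryServerInterceptor(), // metrics first
        loggingInterceptor,                   // then structured request logging
    ),
)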
Infrastructure Cost Implications
Persistent Connection Requirements
Cost Impact: 2-4x higher baseline infrastructure costs for low-traffic services
Root Cause: Cannot scale to zero instances due to persistent client connections
Real Example: REST service scaling to zero: $20/month → gRPC service minimum 2 replicas: $150/month
Architecture Decision Criteria:
- High-traffic services: gRPC performance benefits justify costs
- Low-traffic services: Consider REST APIs for cost optimization
- Internal services: Factor 24/7 minimum instance costs into budgeting
Service Mesh Benefits
Automatic Observability: Istio/Linkerd provide gRPC metrics without application changes
Metrics Available:
istio_requests_total{grpc_response_status="0"}
istio_request_duration_milliseconds
response_total{classification="success"}
Essential Tools and Resources
Debugging Tools
- grpcurl: Command-line gRPC testing (brew install grpcurl)
- grpcui: Web interface for gRPC services
- grpc_health_probe: Kubernetes health checking binary
Implementation Difficulty Ranking
- Easy: Basic client-server setup, simple method calls
- Medium: Load balancing, health checks, basic monitoring
- Hard: Service mesh integration, distributed tracing, performance optimization
- Expert: Custom load balancing, protocol-level debugging, streaming optimization
Common Pitfall Prevention
- Never assume HTTP monitoring tools work with gRPC
- Always implement graceful shutdown for production services (see the sketch after this list)
- Use headless services in Kubernetes for proper load balancing
- Implement request deduplication for critical operations
- Plan for higher baseline infrastructure costs
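For the graceful shutdown bullet above, a minimal Go sketch assuming a *grpc.Server named srv: stop accepting new RPCs on SIGTERM, let in-flight RPCs drain, and force-stop after a timeout (the 30-second budget is an assumption; keep it below the pod's terminationGracePeriodSeconds).
import (
    "os"
    "os/signal"
    "syscall"
    "time"

    "google.golang.org/grpc"
)

func shutdownOnSignal(srv *grpc.Server) {
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
    <-sigCh

    // GracefulStop stops accepting new RPCs and waits for in-flight ones to finish.
    done := make(chan struct{})
    go func() {
        srv.GracefulStop()
        close(done)
    }()

    select {
    case <-done:
    case <-time.After(30 * time.Second): // force stop if draining takes too long
        srv.Stop()
    }
}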
Time Investment Expectations
- Initial Setup: 1-2 days for basic service
- Production Readiness: 1-2 weeks including monitoring, load balancing, health checks
- Debugging Expertise: 3-6 months of production experience
- Advanced Features: 6+ months for streaming, custom interceptors, performance optimization