Deployment Strategies: Blue-Green, Canary, and Rolling Updates
Deploying code to production is the most dangerous thing most teams do on a regular basis. Every deployment is a controlled introduction of change into a running system, and every change is a potential source of failure. The question is not "will a deployment ever break production?" -- it is "when it does, how fast can we detect and recover?"
Deployment strategies are risk management techniques. They control how much of your traffic is exposed to new code, how quickly problems surface, and how fast you can roll back. The right strategy depends on your infrastructure, your tolerance for complexity, and how bad a bad deploy actually is for your business.
The Simplest Deploy: Replace and Pray
The most basic deployment strategy is "stop the old version, start the new version." This is what you get when you ssh into a server, pull the latest code, and restart the process. Or when you push to a PaaS like Heroku or Railway.
# The "replace and pray" deploy
ssh prod-server
cd /opt/myapp
git pull origin main
systemctl restart myapp
This works fine for low-traffic applications, internal tools, and side projects. The downside is obvious: there is a window where your application is completely down (during restart), and if the new version is broken, 100% of your users are affected immediately.
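Even within this model, a post-restart health check shrinks the window between a broken deploy and someone noticing it. A minimal sketch, assuming the app exposes a /health endpoint on port 8080 (as in the later examples in this article) and that the previous commit is a safe fallback:
# Replace and pray, with a seatbelt: roll back to the previous commit if health checks fail
cd /opt/myapp
PREV=$(git rev-parse HEAD)   # remember what we are about to replace
git pull origin main
systemctl restart myapp
for i in $(seq 1 15); do
    if curl -sf http://localhost:8080/health; then
        echo "Deploy looks healthy"
        exit 0
    fi
    sleep 2
done
echo "Health check never passed, rolling back to $PREV"
git checkout "$PREV"
systemctl restart myapp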
Every other deployment strategy is a variation on the theme of "how do we avoid exposing all users to potentially broken code at once?"
Blue-Green Deployments
Blue-green deployment maintains two identical production environments. One (blue) serves live traffic. The other (green) is idle or serving the previous version. To deploy, you push the new code to the idle environment, verify it works, and then switch traffic.
How It Works
- Blue is running version 1, serving all traffic.
- Deploy version 2 to green. Run smoke tests against green.
- Switch the load balancer (or DNS) to point traffic to green.
- Green is now live with version 2. Blue is idle.
- If something goes wrong, switch traffic back to blue. Instant rollback.
Implementation with Nginx
For a simple setup, blue-green can be two sets of application processes behind Nginx:
# /etc/nginx/conf.d/myapp.conf
upstream blue {
    server 127.0.0.1:3000;
    server 127.0.0.1:3001;
}
upstream green {
    server 127.0.0.1:4000;
    server 127.0.0.1:4001;
}
# Switch by changing the servers in this block
upstream active {
    server 127.0.0.1:3000;  # blue
    server 127.0.0.1:3001;
}
server {
    listen 80;
    server_name myapp.com;
    location / {
        proxy_pass http://active;
    }
}
A deploy script updates the Nginx config and reloads:
#!/bin/bash
set -euo pipefail
# Determine which environment is live by inspecting the "active" upstream block
if sed -n '/upstream active/,/}/p' /etc/nginx/conf.d/myapp.conf | grep -q "3000"; then
    CURRENT="blue"
else
    CURRENT="green"
fi
if [ "$CURRENT" = "blue" ]; then
    TARGET="green"
    TARGET_PORTS="4000 4001"
else
    TARGET="blue"
    TARGET_PORTS="3000 3001"
fi
echo "Deploying to $TARGET..."
# Deploy the new version to the idle environment
for port in $TARGET_PORTS; do
    deploy_to_port "$port"
done
# Run smoke tests against the target before switching any traffic
for port in $TARGET_PORTS; do
    curl -sf "http://localhost:$port/health" || { echo "Health check failed on port $port"; exit 1; }
done
# Switch traffic
sed -i "s/upstream active {.*/upstream active {/" /etc/nginx/conf.d/myapp.conf
# (simplified -- a real implementation would rewrite the server lines inside the "active" block)
nginx -s reload
echo "Traffic switched to $TARGET"
Implementation with AWS
On AWS, blue-green deployments commonly use:
- Elastic Load Balancer + Target Groups: Two target groups (blue and green), switch the listener rule to point at the new target group.
- Route 53 weighted routing: Shift DNS weight from one environment to the other (see the sketch below).
- ECS with CodeDeploy: AWS CodeDeploy has built-in blue-green support for ECS services.
# AWS CLI: switch ALB listener to green target group
aws elbv2 modify-listener \
    --listener-arn arn:aws:elasticloadbalancing:us-east-1:123456:listener/app/myapp/abc123 \
    --default-actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:us-east-1:123456:targetgroup/green/def456
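For the Route 53 option, the switch is a weighted record update rather than a listener change; because clients cache DNS, the shift is gradual rather than instant. A rough sketch, with a hypothetical hosted zone ID and record names:
# Shift all DNS weight from blue to green (zone ID and hostnames are placeholders)
aws route53 change-resource-record-sets \
    --hosted-zone-id Z0123456789EXAMPLE \
    --change-batch '{
        "Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": {"Name": "app.myapp.com", "Type": "CNAME",
                "SetIdentifier": "blue", "Weight": 0, "TTL": 60,
                "ResourceRecords": [{"Value": "blue.myapp.com"}]}},
            {"Action": "UPSERT", "ResourceRecordSet": {"Name": "app.myapp.com", "Type": "CNAME",
                "SetIdentifier": "green", "Weight": 100, "TTL": 60,
                "ResourceRecords": [{"Value": "green.myapp.com"}]}}
        ]
    }'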
Implementation with Kubernetes
In Kubernetes, blue-green is achieved by running two deployments and switching the Service selector:
# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: myapp
        image: myregistry/myapp:v1.2.3
---
# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
      - name: myapp
        image: myregistry/myapp:v1.3.0
---
# service.yaml -- switch by changing the version selector
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # Change to "green" to switch
  ports:
  - port: 80
    targetPort: 8080
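Before changing the selector, green can be smoke-tested directly without sending it any user traffic, for example via a port-forward (assuming the app serves a /health endpoint on port 8080):
# Smoke-test green before it receives traffic
kubectl port-forward deployment/myapp-green 8080:8080 &
PF_PID=$!
sleep 2
curl -sf http://localhost:8080/health || { echo "Green failed the smoke test"; kill $PF_PID; exit 1; }
kill $PF_PID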
# Switch traffic to green
kubectl patch service myapp -p '{"spec":{"selector":{"version":"green"}}}'
# Rollback to blue
kubectl patch service myapp -p '{"spec":{"selector":{"version":"blue"}}}'
Trade-offs
Pros: Instant rollback (just switch traffic back). Zero downtime. The idle environment serves as a staging/validation environment. Simple mental model.
Cons: Requires double the infrastructure (two full environments running simultaneously). Database migrations are tricky -- both versions need to work with the same database schema. Not gradual -- you switch 100% of traffic at once, so you cannot detect issues that only manifest under partial load.
Canary Deployments
Canary deployments route a small percentage of traffic to the new version while the majority continues hitting the old version. You monitor error rates, latency, and business metrics on the canary. If everything looks good, you gradually increase the canary's traffic share until it reaches 100%.
How It Works
- Deploy the new version alongside the current version.
- Route 1-5% of traffic to the new version (the canary).
- Monitor metrics: error rates, latency, business KPIs.
- If metrics are healthy, increase to 10%, 25%, 50%, 100%.
- If metrics degrade, route all traffic back to the old version.
Implementation with Kubernetes and Istio
Istio's traffic management makes canary deployments declarative:
# virtual-service.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - myapp.example.com
  http:
  - route:
    - destination:
        host: myapp
        subset: stable
      weight: 95
    - destination:
        host: myapp
        subset: canary
      weight: 5
---
# destination-rule.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp
  subsets:
  - name: stable
    labels:
      version: v1.2.3
  - name: canary
    labels:
      version: v1.3.0
Progressively increase the canary weight:
# Bump canary to 25%
kubectl patch virtualservice myapp --type merge -p '
spec:
  http:
  - route:
    - destination:
        host: myapp
        subset: stable
      weight: 75
    - destination:
        host: myapp
        subset: canary
      weight: 25'
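Promoting by hand means repeating that patch at every step and checking dashboards in between. A rough sketch of that loop, assuming Prometheus scrapes Istio's istio_requests_total metric and is reachable at http://prometheus:9090 (names, the version label, and the fixed wait are illustrative):
# Illustrative manual promotion loop: bump the canary weight, wait, then check its 5xx rate
for weight in 10 25 50 100; do
    stable=$((100 - weight))
    kubectl patch virtualservice myapp --type merge -p "
spec:
  http:
  - route:
    - destination: {host: myapp, subset: stable}
      weight: $stable
    - destination: {host: myapp, subset: canary}
      weight: $weight"
    sleep 300   # let traffic accumulate at this weight
    errors=$(curl -s "http://prometheus:9090/api/v1/query" \
        --data-urlencode 'query=sum(rate(istio_requests_total{destination_version="v1.3.0",response_code=~"5.."}[5m]))' \
        | jq -r '.data.result[0].value[1] // "0"')
    echo "Canary at ${weight}%: 5xx rate ${errors}/s"
    # A real script would compare this against a threshold and roll the weight back on a breach
done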
Automated Canary Analysis
Manual canary promotion is tedious and error-prone. Tools like Flagger automate the process:
# flagger canary resource
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m
Flagger watches the metrics, automatically promotes the canary if they are healthy, and automatically rolls back if they degrade. The stepWeight of 10 means it goes 10% -> 20% -> 30% -> 40% -> 50% -> promotion, with a 1-minute analysis window at each step.
Trade-offs
Pros: Limits blast radius -- only a fraction of users see potentially broken code. Provides real production validation before full rollout. Allows metric-driven promotion decisions.
Cons: More complex than blue-green. Requires good observability to detect issues at low traffic percentages. Two versions run simultaneously, so you need backward-compatible database schemas and API contracts. Stateful services (WebSockets, sticky sessions) add complexity.
Rolling Updates
Rolling updates replace instances of the old version with the new version one at a time (or in small batches). At any point during the rollout, some instances run the old version and some run the new version.
How It Works
- You have 10 instances running version 1.
- Take down instance 1, deploy version 2 to it, bring it back up.
- Repeat for instances 2, 3, 4... through 10.
- If a health check fails, stop the rollout and roll back.
Implementation in Kubernetes
Rolling updates are the default strategy for Kubernetes Deployments:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2         # Allow 2 extra pods during the update
      maxUnavailable: 1   # Allow 1 pod to be unavailable
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myregistry/myapp:v1.3.0
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
When you update the image tag, Kubernetes creates new pods with the new image, waits for them to pass readiness probes, and then terminates old pods. The maxSurge and maxUnavailable parameters control the pace.
# Trigger a rolling update
kubectl set image deployment/myapp myapp=myregistry/myapp:v1.3.0
# Watch the rollout
kubectl rollout status deployment/myapp
# Roll back if something goes wrong
kubectl rollout undo deployment/myapp
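kubectl also records a revision history, which helps when the revision you want is not the immediately previous one, or when you want to pause a rollout mid-flight to investigate:
# Inspect past revisions and roll back to a specific one
kubectl rollout history deployment/myapp
kubectl rollout undo deployment/myapp --to-revision=2
# Pause and resume a rollout in progress
kubectl rollout pause deployment/myapp
kubectl rollout resume deployment/myapp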
Implementation with systemd (Simple Setup)
For non-Kubernetes setups with multiple instances behind a load balancer:
#!/bin/bash
set -euo pipefail
SERVERS="web-1 web-2 web-3 web-4"
for server in $SERVERS; do
    echo "Deploying to $server..."
    # Remove from the load balancer and let in-flight connections drain
    remove_from_lb "$server"
    sleep 5
    # Deploy
    ssh "$server" "cd /opt/myapp && git pull && systemctl restart myapp"
    # Wait for the health check to pass; abort the rollout if it never does
    healthy=false
    for i in $(seq 1 30); do
        if ssh "$server" "curl -sf http://localhost:8080/health"; then
            healthy=true
            break
        fi
        sleep 2
    done
    if [ "$healthy" != "true" ]; then
        echo "$server failed its health check; stopping the rollout"
        exit 1
    fi
    # Health check passed, add back to the load balancer
    add_to_lb "$server"
    echo "$server deployed successfully"
done
Trade-offs
Pros: No extra infrastructure needed (unlike blue-green). Built into Kubernetes. Gradual rollout limits blast radius. Simple to understand and implement.
Cons: Rollback is slower than a traffic switch -- undoing the rollout (or redeploying the previous version) walks through the replicas again, batch by batch. During the update, both versions coexist, so APIs and database schemas must be backward compatible. No fine-grained traffic control (you cannot send exactly 5% of traffic to the new version). Harder to reason about when debugging issues mid-rollout.
Feature Flag-Driven Deploys
Feature flags separate deployment from release. You deploy code that includes both the old and new behavior, controlled by a flag. The deploy itself changes nothing user-visible. You then toggle the flag to release the new behavior, independently of the deployment pipeline.
// The deploy ships both code paths
if (featureFlags.isEnabled("new-checkout", { userId: user.id })) {
    return renderNewCheckout(cart);
} else {
    return renderOldCheckout(cart);
}
This approach works with any deployment strategy. You can do a simple rolling update of the code (low risk, since the flag is off), and then progressively enable the flag for increasing percentages of users.
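The progressive part is usually handled by the flag service, but the underlying idea is deterministic bucketing: hash each user into a bucket from 0-99 and enable the flag when the bucket falls below the rollout percentage, so the same user gets the same answer on every request. A toy sketch with a hypothetical user ID:
# Deterministic percentage bucketing (illustrative only; real flag SDKs do this internally)
ROLLOUT_PERCENT=25
USER_ID="user-4821"
BUCKET=$(( $(printf '%s' "$USER_ID" | cksum | cut -d' ' -f1) % 100 ))
if [ "$BUCKET" -lt "$ROLLOUT_PERCENT" ]; then
    echo "new-checkout enabled for $USER_ID"
else
    echo "new-checkout disabled for $USER_ID"
fi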
When to Use Feature Flags vs. Infrastructure-Level Strategies
Feature flags are better for:
- Releasing new user-facing features gradually
- A/B testing
- Long-lived operational kill switches
- When the change is in application logic, not infrastructure
Infrastructure-level strategies (canary, blue-green) are better for:
- Infrastructure changes (new runtime, dependency upgrades)
- Performance-sensitive changes where you need to compare latency at the load balancer level
- Changes that affect the entire request path, not just one feature
- When you want automated rollback based on system metrics
In practice, mature teams use both. Feature flags for business logic, canary deployments for infrastructure changes.
Monitoring During Deploys
No deployment strategy works without observability. You need to know, quickly, whether the new version is healthy. At minimum, monitor:
- Error rate: Are 5xx responses increasing? Are unhandled exceptions spiking?
- Latency: Are p50, p95, and p99 response times increasing?
- Saturation: Is CPU, memory, or connection pool usage increasing?
- Business metrics: Are signups, purchases, or other key actions declining?
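With Prometheus, the first two translate into straightforward queries; the metric names below assume a typical HTTP server histogram and are illustrative:
# Error rate: share of 5xx responses over the last 5 minutes
curl -sG "http://prometheus:9090/api/v1/query" \
    --data-urlencode 'query=sum(rate(http_requests_total{job="myapp",status=~"5.."}[5m])) / sum(rate(http_requests_total{job="myapp"}[5m]))'
# p99 latency over the same window
curl -sG "http://prometheus:9090/api/v1/query" \
    --data-urlencode 'query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="myapp"}[5m])) by (le))'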
Set up deployment markers in your monitoring tools so you can visually correlate metrics changes with deployments:
# Datadog deployment event
curl -X POST "https://api.datadoghq.com/api/v1/events" \
    -H "DD-API-KEY: $DD_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
        "title": "Deployment: myapp v1.3.0",
        "text": "Deployed by CI pipeline #1234",
        "tags": ["service:myapp", "version:v1.3.0"],
        "alert_type": "info"
    }'
# Grafana annotation
curl -X POST "http://grafana:3000/api/annotations" \
    -H "Authorization: Bearer $GRAFANA_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
        "text": "Deploy myapp v1.3.0",
        "tags": ["deployment", "myapp"]
    }'
Choosing a Strategy
Replace and restart: Side projects, internal tools, anything where a few seconds of downtime is acceptable. Do not over-engineer.
Blue-green: When you need zero-downtime deploys with instant rollback, and you can afford double the infrastructure. Good for monoliths and applications where the deploy unit is large.
Rolling update: The default for Kubernetes workloads. Good when you have multiple replicas and want gradual rollout without extra infrastructure. Rollback is slower than blue-green.
Canary: When you need fine-grained traffic control and metric-driven promotion. Best for high-traffic services where you want to detect issues at low exposure. Requires good observability.
Feature flags: When you want to decouple deployment from release. Best used alongside one of the above strategies, not as a replacement.
The common thread: every strategy except "replace and pray" requires that your application can run two versions simultaneously. That means backward-compatible database migrations, API versioning, and stateless application servers. If your application cannot handle two versions coexisting, fix that first -- it is a prerequisite for any safe deployment strategy.
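The usual pattern for backward-compatible migrations is expand/contract: add the new schema alongside the old, deploy code that can read and write both, and only remove the old schema once no running version depends on it. A hypothetical sketch (table and column names are made up):
# Expand: add the new column as nullable so the old version keeps working untouched
psql "$DATABASE_URL" -c "ALTER TABLE orders ADD COLUMN shipping_address_id bigint;"
# Deploy the new version, which writes both columns; backfill existing rows in the background
psql "$DATABASE_URL" -c "UPDATE orders SET shipping_address_id = legacy_address_id WHERE shipping_address_id IS NULL;"
# Contract: only once no running version reads the old column, drop it
psql "$DATABASE_URL" -c "ALTER TABLE orders DROP COLUMN legacy_address_id;"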
Rollback Procedures
Every deployment plan needs a rollback plan. Document it, test it, and make sure it can be executed under pressure (at 3 AM, by the person on call who did not write the code).
For blue-green: switch traffic back to the old environment. Test this switch regularly.
For canary: set the canary weight to 0 and scale down the canary deployment.
For rolling update: kubectl rollout undo or redeploy the previous version.
For feature flags: turn off the flag. This is the fastest rollback mechanism -- no deployment, no infrastructure changes, just a flag flip.
The best deployment strategy is the one your team actually understands and can execute confidently. A simple rolling update that everyone knows how to roll back is better than a sophisticated canary setup that nobody can debug at 3 AM.