When Kubernetes scales down your deployment or you push a new release, your running containers receive SIGTERM. Then, after a grace period, SIGKILL. The difference between graceful and chaotic shutdown is what happens in those seconds between the two signals.

A request half-processed, a database transaction uncommitted, a file partially written — these are the artifacts of ungraceful shutdown. They create inconsistent state, failed requests, and debugging nightmares.

The Signal Sequence

SIGTERM ---------------- grace period ---------------- SIGKILL
   |         (default: 30s in Kubernetes)                 |
   stop accepting new work,                        process killed
   finish in-flight work,                          immediately
   clean up, exit cleanly

Basic Signal Handling

import signal
import sys

shutdown_requested = False

def handle_sigterm(signum, frame):
    global shutdown_requested
    print("SIGTERM received, initiating graceful shutdown...")
    shutdown_requested = True

signal.signal(signal.SIGTERM, handle_sigterm)
signal.signal(signal.SIGINT, handle_sigterm)  # Ctrl+C

# Main loop checks shutdown flag
while not shutdown_requested:
    process_next_item()

# Cleanup after loop exits
cleanup()
sys.exit(0)

Web Servers: Stop Accepting, Finish Processing

Most web frameworks have built-in graceful shutdown. The pattern:

  1. Stop accepting new connections
  2. Wait for active requests to complete
  3. Close the listener
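
The three steps can be sketched without any framework, using Python's standard-library ThreadingHTTPServer (the handler and loopback address are illustrative):

```python
import threading
import urllib.request
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the demo quiet

# Serve in a background thread so the main thread can coordinate shutdown
server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
port = server.server_address[1]
thread = threading.Thread(target=server.serve_forever)
thread.start()

body = urllib.request.urlopen(f"http://127.0.0.1:{port}/").read()

server.shutdown()      # steps 1-2: stop accepting; serve_forever returns
thread.join()
server.server_close()  # step 3: close the listener (joins in-flight handler threads)
print("server stopped cleanly")
```

In a real service the `server.shutdown()` call would live inside a SIGTERM handler rather than run unconditionally.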

Flask with Gunicorn:

gunicorn --graceful-timeout 30 --timeout 60 app:app

Gunicorn handles SIGTERM by stopping new connections and giving workers up to --graceful-timeout seconds to finish in-flight requests.

FastAPI with Uvicorn:

import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "app:app",
        host="0.0.0.0",
        port=8000,
        timeout_graceful_shutdown=30
    )

Node.js/Express:

const server = app.listen(3000);

process.on('SIGTERM', () => {
  console.log('SIGTERM received, closing HTTP server...');
  server.close(() => {
    console.log('HTTP server closed');
    // Close database connections, etc.
    db.end(() => {
      process.exit(0);
    });
  });
});

Background Workers: Finish Current Job

For queue workers, the pattern is similar but focused on job completion:

import signal

class Worker:
    def __init__(self, job_queue):
        self.queue = job_queue
        self.should_stop = False
        signal.signal(signal.SIGTERM, self.handle_shutdown)
    
    def handle_shutdown(self, signum, frame):
        print("Shutdown requested, finishing current job...")
        self.should_stop = True
    
    def run(self):
        while not self.should_stop:
            job = self.queue.get(timeout=1)  # timeout so the loop re-checks should_stop
            if job:
                self.process(job)  # Runs to completion
                self.queue.ack(job)
        
        print("Clean shutdown complete")

The key: check should_stop between jobs, not during. Never abandon a job mid-processing.

Database Connections: Close Cleanly

Connection pools should be drained, not abandoned:

import atexit
import sys

def cleanup_db():
    print("Closing database connections...")
    connection_pool.closeall()
    print("Database connections closed")

atexit.register(cleanup_db)

# Or explicitly in signal handler
def handle_sigterm(signum, frame):
    cleanup_db()
    sys.exit(0)

Abandoned connections linger on the database server until timeout. Clean closure frees resources immediately.

Kubernetes Configuration

apiVersion: v1
kind: Pod
spec:
  terminationGracePeriodSeconds: 60  # Time before SIGKILL
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]  # Wait for LB to drain

The preStop hook runs before SIGTERM. Use it to:

  • Wait for load balancer to stop sending traffic
  • Deregister from service discovery
  • Send notifications

Why the sleep? Load balancers don’t instantly stop routing traffic when a pod starts terminating. A brief sleep ensures in-flight requests have somewhere to go.

The Readiness-Shutdown Coordination

When shutdown begins, your readiness probe should start failing:

shutdown_in_progress = False

@app.route('/health/ready')
def readiness():
    if shutdown_in_progress:
        return {'status': 'shutting_down'}, 503
    return {'status': 'ready'}, 200

def handle_sigterm(signum, frame):
    global shutdown_in_progress
    shutdown_in_progress = True
    # Continue processing existing requests
    # Load balancer will stop sending new ones

This creates a clean drain:

  1. SIGTERM received
  2. Readiness fails
  3. Load balancer removes instance from rotation
  4. Existing requests complete
  5. Process exits

Long-Running Operations

Some operations can’t complete in 30 seconds. Options:

Checkpointing:

def process_large_batch(items):
    checkpoint = load_checkpoint()
    for i, item in enumerate(items[checkpoint:], start=checkpoint):
        if shutdown_requested:
            save_checkpoint(i)
            return  # Resume later
        process(item)
    clear_checkpoint()
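
The load_checkpoint / save_checkpoint helpers are left undefined above; a minimal file-based sketch, where the path and JSON format are assumptions (a real worker might store the index in the job record instead):

```python
import json
import os
import tempfile

# Hypothetical location; production code would use a durable path or the job store
CHECKPOINT_PATH = os.path.join(tempfile.gettempdir(), "batch.checkpoint")

def load_checkpoint():
    # Index saved by a previous, interrupted run; 0 means start from the top
    try:
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)["index"]
    except FileNotFoundError:
        return 0

def save_checkpoint(index):
    # Write to a temp file and rename so a crash mid-write can't corrupt it
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"index": index}, f)
    os.replace(tmp, CHECKPOINT_PATH)

def clear_checkpoint():
    try:
        os.remove(CHECKPOINT_PATH)
    except FileNotFoundError:
        pass
```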

Handoff to queue:

def handle_sigterm(signum, frame):
    if current_job:
        # Re-queue for another worker to pick up
        queue.nack(current_job)
    sys.exit(0)

Request more time:

# Kubernetes: extend grace period for batch jobs
terminationGracePeriodSeconds: 300  # 5 minutes

Testing Graceful Shutdown

Don’t deploy untested shutdown logic:

# Send SIGTERM and observe
kill -TERM $(pidof myapp)

# Docker: sends SIGTERM, then SIGKILL after 30 seconds; watch logs for the sequence
docker stop --time 30 mycontainer

# Kubernetes: watch pod termination (follow logs from a second terminal)
kubectl delete pod mypod --grace-period=60
kubectl logs -f mypod

Verify:

  • No requests return 5xx during shutdown
  • All database transactions commit or rollback
  • Queue jobs complete or return to queue
  • Log messages confirm clean shutdown
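
Shutdown behavior can also be exercised from plain Python: this sketch spawns a toy app with the flag-based SIGTERM handler as a child process, sends it SIGTERM, and checks for a clean exit (POSIX only):

```python
import signal
import subprocess
import sys
import textwrap

# A toy app using the article's flag-based SIGTERM handler, run as a child process
app = textwrap.dedent("""
    import signal, time
    stop = False
    def handle(signum, frame):
        global stop
        stop = True
    signal.signal(signal.SIGTERM, handle)
    print("ready", flush=True)
    while not stop:
        time.sleep(0.05)
    print("clean shutdown complete", flush=True)
""")

proc = subprocess.Popen([sys.executable, "-c", app],
                        stdout=subprocess.PIPE, text=True)
assert proc.stdout.readline().strip() == "ready"

proc.send_signal(signal.SIGTERM)         # what Kubernetes / docker stop sends
out, _ = proc.communicate(timeout=10)

assert "clean shutdown complete" in out  # cleanup ran
assert proc.returncode == 0              # exited itself, not killed by the signal
print("graceful shutdown verified")
```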

Common Mistakes

Ignoring signals entirely:

# Bad: SIGTERM kills process immediately with no cleanup
if __name__ == "__main__":
    app.run()

Infinite cleanup:

# Bad: cleanup can hang forever
def cleanup():
    while pending_items:  # What if this never empties?
        process(pending_items.pop())

Always have timeouts in cleanup code.
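
A bounded version of the same loop, sketched with an assumed flush callable and time budget:

```python
import time

def cleanup_with_deadline(pending_items, flush, budget=10.0):
    # budget is an assumed allowance; keep it well under the termination grace period
    deadline = time.monotonic() + budget
    while pending_items and time.monotonic() < deadline:
        flush(pending_items.pop())
    if pending_items:
        # Out of time: log and leave the rest for crash recovery or re-queueing
        print(f"cleanup deadline hit, {len(pending_items)} items left")
```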

Assuming instant propagation:

# Bad: assuming load balancer instantly knows we're gone
def handle_sigterm(signum, frame):
    sys.exit(0)  # Requests in flight get connection reset

Give the infrastructure time to catch up.


Graceful shutdown is the difference between “the deploy went fine” and “we had a spike of 500s during the deploy.” It’s a few dozen lines of signal handling that prevent hours of debugging and customer complaints.

Handle SIGTERM. Stop accepting. Finish processing. Clean up. Exit. Your future self will thank you.