When Kubernetes scales down your deployment or you push a new release, your running containers receive SIGTERM. Then, after a grace period, SIGKILL. The difference between graceful and chaotic shutdown is what happens in those seconds between the two signals.

A request half-processed, a database transaction uncommitted, a file partially written — these are the artifacts of ungraceful shutdown. They create inconsistent state, failed requests, and debugging nightmares.

The Signal Sequence

SIGTERM ---------------- grace period ---------------- SIGKILL
   |         (default: 30s in Kubernetes)                 |
   stop accepting new work,                        process killed
   finish in-flight work,                          immediately
   clean up, exit cleanly

Basic Signal Handling

import signal
import sys

shutdown_requested = False

def handle_sigterm(signum, frame):
    global shutdown_requested
    print("SIGTERM received, initiating graceful shutdown...")
    shutdown_requested = True

signal.signal(signal.SIGTERM, handle_sigterm)
signal.signal(signal.SIGINT, handle_sigterm)  # Ctrl+C

# Main loop checks shutdown flag
while not shutdown_requested:
    process_next_item()

# Cleanup after loop exits
cleanup()
sys.exit(0)

Web Servers: Stop Accepting, Finish Processing

Most web frameworks have built-in graceful shutdown. The pattern:

  1. Stop accepting new connections
  2. Wait for active requests to complete
  3. Close the listener
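
The three steps can be sketched without any framework, using Python's standard-library ThreadingHTTPServer (the handler and loopback address are illustrative):

```python
import threading
import urllib.request
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the demo quiet

# Serve in a background thread so the main thread can coordinate shutdown
server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
port = server.server_address[1]
thread = threading.Thread(target=server.serve_forever)
thread.start()

body = urllib.request.urlopen(f"http://127.0.0.1:{port}/").read()

server.shutdown()      # steps 1-2: stop accepting; serve_forever returns
thread.join()
server.server_close()  # step 3: close the listener (joins in-flight handler threads)
print("server stopped cleanly")
```

In a real service the `server.shutdown()` call would live inside a SIGTERM handler rather than run unconditionally.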

Flask with Gunicorn:

gunicorn --graceful-timeout 30 --timeout 60 app:app

Gunicorn handles SIGTERM by stopping new connections and giving workers up to --graceful-timeout seconds to finish in-flight requests.

FastAPI with Uvicorn:

import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "app:app",
        host="0.0.0.0",
        port=8000,
        timeout_graceful_shutdown=30
    )

Node.js/Express:

const server = app.listen(3000);

process.on('SIGTERM', () => {
  console.log('SIGTERM received, closing HTTP server...');
  server.close(() => {
    console.log('HTTP server closed');
    // Close database connections, etc.
    db.end(() => {
      process.exit(0);
    });
  });
});

Background Workers: Finish Current Job

For queue workers, the pattern is similar but focused on job completion:

import signal

class Worker:
    def __init__(self, job_queue):
        self.queue = job_queue
        self.should_stop = False
        signal.signal(signal.SIGTERM, self.handle_shutdown)
    
    def handle_shutdown(self, signum, frame):
        print("Shutdown requested, finishing current job...")
        self.should_stop = True
    
    def run(self):
        while not self.should_stop:
            job = self.queue.get(timeout=1)  # timeout so the loop re-checks should_stop
            if job:
                self.process(job)  # Runs to completion
                self.queue.ack(job)
        
        print("Clean shutdown complete")

The key: check should_stop between jobs, not during. Never abandon a job mid-processing.

Database Connections: Close Cleanly

Connection pools should be drained, not abandoned:

import atexit
import sys

def cleanup_db():
    print("Closing database connections...")
    connection_pool.closeall()
    print("Database connections closed")

atexit.register(cleanup_db)

# Or explicitly in signal handler
def handle_sigterm(signum, frame):
    cleanup_db()
    sys.exit(0)

Abandoned connections linger on the database server until timeout. Clean closure frees resources immediately.

Kubernetes Configuration

apiVersion: v1
kind: Pod
spec:
  terminationGracePeriodSeconds: 60  # Time before SIGKILL
  containers:
  - name: app
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]  # Wait for LB to drain

The preStop hook runs before SIGTERM. Use it to:

  • Wait for load balancer to stop sending traffic
  • Deregister from service discovery
  • Send notifications

Why the sleep? Load balancers don’t instantly stop routing traffic when a pod starts terminating. A brief sleep ensures in-flight requests have somewhere to go.

The Readiness-Shutdown Coordination

When shutdown begins, your readiness probe should start failing:

shutdown_in_progress = False

@app.route('/health/ready')
def readiness():
    if shutdown_in_progress:
        return {'status': 'shutting_down'}, 503
    return {'status': 'ready'}, 200

def handle_sigterm(signum, frame):
    global shutdown_in_progress
    shutdown_in_progress = True
    # Continue processing existing requests
    # Load balancer will stop sending new ones

This creates a clean drain:

  1. SIGTERM received
  2. Readiness fails
  3. Load balancer removes instance from rotation
  4. Existing requests complete
  5. Process exits

Long-Running Operations

Some operations can’t complete in 30 seconds. Options:

Checkpointing:

def process_large_batch(items):
    checkpoint = load_checkpoint()
    for i, item in enumerate(items[checkpoint:], start=checkpoint):
        if shutdown_requested:
            save_checkpoint(i)
            return  # Resume later
        process(item)
    clear_checkpoint()
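
The load_checkpoint / save_checkpoint helpers are left undefined above; a minimal file-based sketch, where the path and JSON format are assumptions (a real worker might store the index in the job record instead):

```python
import json
import os
import tempfile

# Hypothetical location; production code would use a durable path or the job store
CHECKPOINT_PATH = os.path.join(tempfile.gettempdir(), "batch.checkpoint")

def load_checkpoint():
    # Index saved by a previous, interrupted run; 0 means start from the top
    try:
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)["index"]
    except FileNotFoundError:
        return 0

def save_checkpoint(index):
    # Write to a temp file and rename so a crash mid-write can't corrupt it
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"index": index}, f)
    os.replace(tmp, CHECKPOINT_PATH)

def clear_checkpoint():
    try:
        os.remove(CHECKPOINT_PATH)
    except FileNotFoundError:
        pass
```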

Handoff to queue:

def handle_sigterm(signum, frame):
    if current_job:
        # Re-queue for another worker to pick up
        queue.nack(current_job)
    sys.exit(0)

Request more time:

# Kubernetes: extend grace period for batch jobs
terminationGracePeriodSeconds: 300  # 5 minutes

Testing Graceful Shutdown

Don’t deploy untested shutdown logic:

# Send SIGTERM and observe
kill -TERM $(pidof myapp)

# Docker: sends SIGTERM, then SIGKILL after 30 seconds; watch logs for the sequence
docker stop --time 30 mycontainer

# Kubernetes: watch pod termination (follow logs from a second terminal)
kubectl delete pod mypod --grace-period=60
kubectl logs -f mypod

Verify:

  • No requests return 5xx during shutdown
  • All database transactions commit or rollback
  • Queue jobs complete or return to queue
  • Log messages confirm clean shutdown
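
Shutdown behavior can also be exercised from plain Python: this sketch spawns a toy app with the flag-based SIGTERM handler as a child process, sends it SIGTERM, and checks for a clean exit (POSIX only):

```python
import signal
import subprocess
import sys
import textwrap

# A toy app using the article's flag-based SIGTERM handler, run as a child process
app = textwrap.dedent("""
    import signal, time
    stop = False
    def handle(signum, frame):
        global stop
        stop = True
    signal.signal(signal.SIGTERM, handle)
    print("ready", flush=True)
    while not stop:
        time.sleep(0.05)
    print("clean shutdown complete", flush=True)
""")

proc = subprocess.Popen([sys.executable, "-c", app],
                        stdout=subprocess.PIPE, text=True)
assert proc.stdout.readline().strip() == "ready"

proc.send_signal(signal.SIGTERM)         # what Kubernetes / docker stop sends
out, _ = proc.communicate(timeout=10)

assert "clean shutdown complete" in out  # cleanup ran
assert proc.returncode == 0              # exited itself, not killed by the signal
print("graceful shutdown verified")
```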

Common Mistakes

Ignoring signals entirely:

# Bad: SIGTERM kills process immediately with no cleanup
if __name__ == "__main__":
    app.run()

Infinite cleanup:

# Bad: cleanup can hang forever
def cleanup():
    while pending_items:  # What if this never empties?
        process(pending_items.pop())

Always have timeouts in cleanup code.
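
A bounded version of the same loop, sketched with an assumed flush callable and time budget:

```python
import time

def cleanup_with_deadline(pending_items, flush, budget=10.0):
    # budget is an assumed allowance; keep it well under the termination grace period
    deadline = time.monotonic() + budget
    while pending_items and time.monotonic() < deadline:
        flush(pending_items.pop())
    if pending_items:
        # Out of time: log and leave the rest for crash recovery or re-queueing
        print(f"cleanup deadline hit, {len(pending_items)} items left")
```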

Assuming instant propagation:

# Bad: assuming load balancer instantly knows we're gone
def handle_sigterm(signum, frame):
    sys.exit(0)  # Requests in flight get connection reset

Give the infrastructure time to catch up.


Graceful shutdown is the difference between “the deploy went fine” and “we had a spike of 500s during the deploy.” It’s a few dozen lines of signal handling that prevent hours of debugging and customer complaints.

Handle SIGTERM. Stop accepting. Finish processing. Clean up. Exit. Your future self will thank you.