Hardcoded IPs are a maintenance nightmare. Here’s how to let services find each other dynamically.

The Problem

# Bad: Hardcoded
api_url = "http://192.168.1.50:8080"

# What happens when:
# - IP changes?
# - Service moves to new host?
# - You add a second instance?

Service discovery solves this: services register themselves, and clients look them up by name.

DNS-Based Discovery

The simplest approach: use DNS.

Internal DNS

# /etc/hosts or internal DNS server
192.168.1.50  api.internal
192.168.1.51  database.internal
# Code uses names
api_url = "http://api.internal:8080"

Pros: Simple, works everywhere.
Cons: Manual updates, no health checking, caching issues.

DNS with Round-Robin

api.internal.  60  IN  A  192.168.1.50
api.internal.  60  IN  A  192.168.1.51
api.internal.  60  IN  A  192.168.1.52

DNS returns all IPs, client picks one. Low TTL (60s) allows faster updates.
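A client can implement that selection itself: resolve every A record, then pick one at random. A minimal sketch in Python (`api.internal` is a placeholder for your own record):

```python
import random
import socket

def resolve_all(name, port):
    # getaddrinfo returns one entry per address DNS handed back
    infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

def pick_instance(name, port):
    return random.choice(resolve_all(name, port))

# pick_instance("api.internal", 8080) would return one of the three IPs above
```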

Kubernetes Service Discovery

Built-in and automatic.

ClusterIP Service

apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: production
spec:
  selector:
    app: api
  ports:
    - port: 80
      targetPort: 8080

Services are discoverable via DNS:

api.production.svc.cluster.local

From Within Pods

import requests

# Same namespace - just use service name
response = requests.get("http://api/users")

# Different namespace - use FQDN
response = requests.get("http://api.other-namespace.svc.cluster.local/users")

Headless Services

For direct pod access (databases, stateful workloads):

apiVersion: v1
kind: Service
metadata:
  name: database
spec:
  clusterIP: None  # Headless
  selector:
    app: postgres
  ports:
    - port: 5432

DNS returns individual pod IPs instead of a virtual IP.
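From inside the cluster this is still plain DNS — a hypothetical sketch that resolves a headless service name to its pod IPs (`pod_addresses` and its arguments are illustrative, and the lookup only succeeds against cluster DNS):

```python
import socket

def service_fqdn(service, namespace):
    # Kubernetes DNS naming convention for services
    return f"{service}.{namespace}.svc.cluster.local"

def pod_addresses(service, namespace, port):
    # Against a headless service this yields one IP per ready pod
    # instead of a single virtual ClusterIP
    infos = socket.getaddrinfo(service_fqdn(service, namespace), port,
                               proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})
```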

Consul

HashiCorp’s service mesh and discovery tool.

Register a Service

{
  "service": {
    "name": "api",
    "port": 8080,
    "check": {
      "http": "http://localhost:8080/health",
      "interval": "10s"
    }
  }
}
curl -X PUT -d @service.json http://localhost:8500/v1/agent/service/register

Query Services

# DNS interface
dig @127.0.0.1 -p 8600 api.service.consul

# HTTP API
curl http://localhost:8500/v1/health/service/api?passing=true

Consul Template

Auto-update config files when services change:

upstream api {
{{range service "api"}}
  server {{.Address}}:{{.Port}};
{{end}}
}
consul-template -template "nginx.ctmpl:nginx.conf:nginx -s reload"

Client-Side Discovery

Client queries registry, picks an instance.

import random

import consul
import requests

c = consul.Consul()

def get_api_url():
    _, services = c.health.service('api', passing=True)
    if not services:
        raise Exception("No healthy api instances")
    
    # Simple random selection
    service = random.choice(services)
    return f"http://{service['Service']['Address']}:{service['Service']['Port']}"

response = requests.get(f"{get_api_url()}/users")

Pros: Client has full control over load balancing.
Cons: Every client needs discovery logic.

Server-Side Discovery

Load balancer queries registry, routes traffic.

Client → Load Balancer (consults Registry) → Service Instance
# nginx with consul-template
upstream api {
    server 192.168.1.50:8080;  # Auto-updated
    server 192.168.1.51:8080;
}

server {
    location /api/ {
        proxy_pass http://api;
    }
}

Pros: Clients stay simple.
Cons: Extra hop, load balancer becomes critical.

Health Checking

Discovery without health checks serves dead instances.

Passive Health Checks

Track failures from real traffic:

import random
from collections import defaultdict

import requests

class ServiceClient:
    def __init__(self):
        self.failures = defaultdict(int)

    def call(self, service_name):
        instances = discover(service_name)
        healthy = [i for i in instances if self.failures[i] < 3]

        # Fall back to all instances if everything looks unhealthy
        instance = random.choice(healthy or instances)
        try:
            response = requests.get(instance, timeout=5)
            self.failures[instance] = 0  # success resets the counter
            return response
        except requests.RequestException:
            self.failures[instance] += 1
            raise

Active Health Checks

Proactively test instances:

# Consul health check
check:
  http: "http://localhost:8080/health"
  interval: "10s"
  timeout: "2s"
  deregister_critical_service_after: "1m"

Unhealthy instances are removed from discovery results.
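The `/health` endpoint the check polls doesn't need a framework — a stdlib sketch, with the path and port matching the Consul config above:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

healthy = True  # flip to False when a dependency breaks or during shutdown

def health_status():
    # 200 keeps the instance in discovery; 503 marks it critical
    return (200, b"ok") if healthy else (503, b"unhealthy")

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_response(404)
            self.end_headers()
            return
        code, body = health_status()
        self.send_response(code)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To serve: HTTPServer(("", 8080), HealthHandler).serve_forever()
```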

Common Patterns

Retry with Different Instance

import random

import requests
from requests.exceptions import RequestException

def call_with_retry(service_name, path, max_retries=3):
    instances = list(discover(service_name))
    random.shuffle(instances)

    for instance in instances[:max_retries]:
        try:
            return requests.get(f"{instance}{path}", timeout=5)
        except RequestException:
            continue

    raise Exception("All instances failed")

Circuit Breaker

import requests
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=30)
def call_api(path):
    url = get_api_url()  # From discovery
    return requests.get(url + path)

Caching Discovery Results

from functools import lru_cache
from time import time

@lru_cache(maxsize=100)
def discover_cached(service_name, ttl_bucket):
    # ttl_bucket changes every `ttl` seconds, so stale entries miss the cache
    return discover(service_name)

def get_instances(service_name, ttl=60):
    bucket = int(time() / ttl)
    return discover_cached(service_name, bucket)

The Discovery Checklist

  • Services register on startup
  • Services deregister on shutdown
  • Health checks configured
  • Clients handle missing instances
  • Retry logic with different instances
  • Reasonable caching (not too long)
  • Monitoring for registration failures
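The first two items can be a few lines against Consul's agent HTTP API — a hypothetical sketch (service name, ID, and port are placeholders, and a local agent on :8500 is assumed):

```python
import requests

CONSUL = "http://localhost:8500"  # local Consul agent (assumption)
SERVICE_ID = "api-1"              # must be unique per instance

def registration_payload(service_id, name, port):
    # Body for PUT /v1/agent/service/register
    return {
        "ID": service_id,
        "Name": name,
        "Port": port,
        "Check": {"HTTP": f"http://localhost:{port}/health", "Interval": "10s"},
    }

def register():
    body = registration_payload(SERVICE_ID, "api", 8080)
    requests.put(f"{CONSUL}/v1/agent/service/register", json=body).raise_for_status()

def deregister():
    # Best-effort cleanup so the registry stops serving this instance
    requests.put(f"{CONSUL}/v1/agent/service/deregister/{SERVICE_ID}")

# On startup: register(); on shutdown: deregister() (e.g. via atexit or a signal handler)
```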

When to Use What

Scenario            Solution
Kubernetes          Built-in Services
Simple/static       DNS
Dynamic/multi-DC    Consul
AWS                 ECS Service Discovery, Cloud Map
Need service mesh   Consul Connect, Istio

Start simple. DNS works for most cases. Add complexity when you actually need dynamic discovery, health checking, or cross-datacenter routing.


The best service discovery is the one your developers don’t notice — services just find each other, and failures route around automatically.