In static infrastructure, services live at known addresses. Database at 10.0.1.5, cache at 10.0.1.6. Simple, predictable, fragile.

In dynamic infrastructure — containers, auto-scaling, cloud — services appear and disappear constantly. IP addresses change. Instances multiply and vanish. Hardcoded addresses become a liability.

Service discovery solves this: how do services find each other when everything is moving?

The Problem

# Hardcoded - works until it doesn't
DATABASE_URL = "postgres://10.0.1.5:5432/mydb"

# What happens when:
# - Database moves to a new server?
# - You add read replicas?
# - The IP changes after maintenance?

DNS-Based Discovery

The simplest approach: use DNS names instead of IPs.

DATABASE_URL = "postgres://db.internal:5432/mydb"

Update DNS when the database moves. All clients automatically resolve the new address.

Internal DNS with short TTLs:

db.internal.    30    IN    A    10.0.1.5
cache.internal. 30    IN    A    10.0.1.6

30-second TTL means clients re-resolve frequently, picking up changes quickly.
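
Re-resolving is just an ordinary lookup before each new connection. A minimal Python sketch (the `resolve` helper is hypothetical; `db.internal` is the internal name from above):

```python
import socket

def resolve(hostname, port):
    """Resolve a service name to (ip, port) pairs.

    Each call re-queries DNS (subject to resolver caching), so a
    short TTL means clients pick up address changes quickly.
    """
    infos = socket.getaddrinfo(hostname, port,
                               socket.AF_INET, socket.SOCK_STREAM)
    return [info[4] for info in infos]

# Re-resolve before every new connection rather than caching the IP:
# addresses = resolve("db.internal", 5432)
```

The key habit is resolving at connection time, not at process startup; a client that resolves once and holds the IP forever defeats the short TTL.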

Pros:

  • Universal — everything understands DNS
  • No client changes needed
  • Simple to understand

Cons:

  • TTL caching causes propagation delays
  • No health checking built in
  • Limited metadata (just IP addresses)

Client-Side Discovery

Clients query a service registry directly:

[Diagram: the client queries the registry, then opens a direct connection to one of several Service A instances]

# Client queries registry, picks an instance
instances = registry.get_instances("payment-service")
instance = load_balancer.choose(instances)
response = http.get(f"http://{instance.address}/charge")

Pros:

  • Client controls load balancing strategy
  • No proxy hop — direct connection
  • Rich metadata available (version, zone, capacity)

Cons:

  • Client complexity — needs registry library
  • Every language/framework needs implementation
  • Client must handle instance failures
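
That last point deserves a sketch. Assuming a hypothetical `registry` interface and an injected `http_get` callable, a client can shuffle the instance list and fail over on connection errors:

```python
import random

def call_with_failover(registry, http_get, service, path):
    """Client-side discovery sketch: query the registry, pick a
    random instance, and fail over to the remaining ones on error.

    `registry` and `http_get` are assumed interfaces, not a real library.
    """
    instances = list(registry.get_instances(service))
    random.shuffle(instances)          # trivial load-balancing strategy
    last_error = None
    for instance in instances:
        try:
            return http_get(f"http://{instance}{path}")
        except ConnectionError as err:
            last_error = err           # this instance is down; try the next
    raise last_error or RuntimeError(f"no instances for {service}")
```

Real client libraries (Ribbon, gRPC resolvers) add health-aware weighting and backoff, but the shape is the same: the client owns instance selection and failure handling.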

Server-Side Discovery

Clients talk to a load balancer that handles discovery:

[Diagram: client → load balancer (which queries the registry) → Service A instances]

# Client just calls a stable endpoint
response = http.get("http://payment-service.internal/charge")
# Load balancer handles finding healthy instances

Pros:

  • Simple clients — just HTTP calls
  • Language-agnostic
  • Centralized load balancing logic

Cons:

  • Extra network hop
  • Load balancer becomes critical path
  • Less client control

Service Registries

Consul

service {
  name = "payment-service"
  port = 8080
  
  check {
    http     = "http://localhost:8080/health"
    interval = "10s"
    timeout  = "2s"
  }
}

Query via DNS or HTTP API:

# DNS
dig payment-service.service.consul

# HTTP
curl http://consul:8500/v1/health/service/payment-service

etcd

import etcd3

etcd = etcd3.client()

# Register
etcd.put('/services/payment/instance-1', '10.0.1.5:8080')

# Discover: get_prefix yields (value, metadata) pairs, values as bytes
instances = [value.decode() for value, _ in etcd.get_prefix('/services/payment/')]

Kubernetes

Built-in service discovery via Services and DNS:

apiVersion: v1
kind: Service
metadata:
  name: payment-service
spec:
  selector:
    app: payment
  ports:
    - port: 80
      targetPort: 8080

# Any pod can reach the service
response = http.get("http://payment-service/charge")
# Kubernetes handles the rest

Registration Patterns

Self-Registration

Service registers itself on startup:

class PaymentService:
    def startup(self):
        registry.register(
            name="payment-service",
            address=get_my_ip(),
            port=8080,
            health_check="/health"
        )
    
    def shutdown(self):
        registry.deregister("payment-service", get_my_ip())

Pros: Service controls its own registration.
Cons: Every service needs registration logic.

Third-Party Registration

External process watches for new services and registers them:

# Registrar watches Docker events
for event in docker.events():
    if event.type == 'container' and event.action == 'start':
        container = docker.inspect(event.id)
        registry.register(
            name=container.labels['service'],
            address=container.ip,
            port=container.labels['port']
        )

Pros: Services don’t need registry awareness.
Cons: Extra component to manage.

Kubernetes uses this pattern — kubelet registers pods automatically.

Health Checking

Registration without health checking is dangerous. Dead instances stay in the registry.

Active checks — registry pings services:

# Registry periodically checks each instance
for instance in registry.instances("payment-service"):
    if not http.get(f"http://{instance}/health").ok:
        registry.mark_unhealthy(instance)

Passive checks — services send heartbeats:

# Service sends heartbeat every 10 seconds
while running:
    registry.heartbeat("payment-service", my_address)
    time.sleep(10)
# Miss 3 heartbeats → marked unhealthy
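
On the registry side, passive checking reduces to tracking the last heartbeat per instance and expiring stale ones. A sketch (the `HeartbeatRegistry` class, the 10s interval, and the 3-miss threshold are illustrative assumptions):

```python
import time

class HeartbeatRegistry:
    """Passive-check sketch: an instance is healthy while its last
    heartbeat is recent enough. Interval and miss count are assumptions."""

    INTERVAL = 10       # seconds between expected heartbeats
    MAX_MISSES = 3      # missed heartbeats before marking unhealthy

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.last_seen = {}   # (service, address) -> last heartbeat time

    def heartbeat(self, service, address):
        self.last_seen[(service, address)] = self.clock()

    def healthy_instances(self, service):
        grace = self.INTERVAL * self.MAX_MISSES   # 30s of silence allowed
        now = self.clock()
        return [addr for (svc, addr), seen in self.last_seen.items()
                if svc == service and now - seen < grace]
```

Injecting the clock keeps the expiry logic testable; in production the same idea usually rides on a TTL in the store itself (e.g. etcd leases, Consul TTL checks).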

Hybrid — both active and passive:

# Consul supports both
check {
  http     = "http://localhost:8080/health"
  interval = "10s"
}
check {
  ttl = "30s"  # Must call /v1/agent/check/pass/:check_id
}

Handling Failures

Service discovery must handle:

Registry unavailable:

def get_instances(service_name):
    try:
        return registry.query(service_name)
    except RegistryUnavailable:
        return cached_instances[service_name]  # Stale but better than nothing

All instances unhealthy:

instances = registry.get_healthy_instances("payment")
if not instances:
    # Fall back to any instance? Return error? Use circuit breaker?
    raise ServiceUnavailable("No healthy payment instances")

Split brain: Multiple registries with inconsistent views. Use consensus-based registries (Consul, etcd) or accept eventual consistency.

Service Mesh

Modern approach: sidecar proxy handles all discovery and routing.

[Diagram: inside the pod, the service talks to its sidecar proxy (e.g. Envoy), which talks to other services]

The service just calls localhost. The sidecar handles discovery, load balancing, retries, mTLS, observability.

Pros: Zero application changes, consistent behavior.
Cons: Complexity, resource overhead, another thing to operate.

Implementations: Istio, Linkerd, Consul Connect


Service discovery is infrastructure plumbing — invisible when working, catastrophic when broken. Start with DNS for simple cases. Add a registry when you need health checking and metadata. Consider a service mesh when you need advanced traffic management.

The goal: services find each other reliably, without hardcoded addresses, without human intervention. Everything else is implementation details.