In static infrastructure, services live at known addresses. Database at 10.0.1.5, cache at 10.0.1.6. Simple, predictable, fragile.

In dynamic infrastructure — containers, auto-scaling, cloud — services appear and disappear constantly. IP addresses change. Instances multiply and vanish. Hardcoded addresses become a liability.

Service discovery solves this: how do services find each other when everything is moving?

The Problem

# Hardcoded - works until it doesn't
DATABASE_URL = "postgres://10.0.1.5:5432/mydb"

# What happens when:
# - Database moves to a new server?
# - You add read replicas?
# - The IP changes after maintenance?

DNS-Based Discovery

The simplest approach: use DNS names instead of IPs.

DATABASE_URL = "postgres://db.internal:5432/mydb"

Update DNS when the database moves. All clients automatically resolve the new address.

Internal DNS with short TTLs:

db.internal.    30    IN    A    10.0.1.5
cache.internal. 30    IN    A    10.0.1.6

30-second TTL means clients re-resolve frequently, picking up changes quickly.
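
Re-resolving is just an ordinary lookup before each new connection. A minimal Python sketch (the `resolve` helper is hypothetical; `db.internal` is the internal name from above):

```python
import socket

def resolve(hostname, port):
    """Resolve a service name to (ip, port) pairs.

    Each call re-queries DNS (subject to resolver caching), so a
    short TTL means clients pick up address changes quickly.
    """
    infos = socket.getaddrinfo(hostname, port,
                               socket.AF_INET, socket.SOCK_STREAM)
    return [info[4] for info in infos]

# Re-resolve before every new connection rather than caching the IP:
# addresses = resolve("db.internal", 5432)
```

The key habit is resolving at connection time, not at process startup; a client that resolves once and holds the IP forever defeats the short TTL.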

Pros:

  • Universal — everything understands DNS
  • No client changes needed
  • Simple to understand

Cons:

  • TTL caching causes propagation delays
  • No health checking built in
  • Limited metadata (just IP addresses)

Client-Side Discovery

Clients query a service registry directly:

[Diagram: the client queries the registry, then opens a direct connection to one of several Service A instances]

# Client queries registry, picks an instance
instances = registry.get_instances("payment-service")
instance = load_balancer.choose(instances)
response = http.get(f"http://{instance.address}/charge")

Pros:

  • Client controls load balancing strategy
  • No proxy hop — direct connection
  • Rich metadata available (version, zone, capacity)

Cons:

  • Client complexity — needs registry library
  • Every language/framework needs implementation
  • Client must handle instance failures
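
That last point deserves a sketch. Assuming a hypothetical `registry` interface and an injected `http_get` callable, a client can shuffle the instance list and fail over on connection errors:

```python
import random

def call_with_failover(registry, http_get, service, path):
    """Client-side discovery sketch: query the registry, pick a
    random instance, and fail over to the remaining ones on error.

    `registry` and `http_get` are assumed interfaces, not a real library.
    """
    instances = list(registry.get_instances(service))
    random.shuffle(instances)          # trivial load-balancing strategy
    last_error = None
    for instance in instances:
        try:
            return http_get(f"http://{instance}{path}")
        except ConnectionError as err:
            last_error = err           # this instance is down; try the next
    raise last_error or RuntimeError(f"no instances for {service}")
```

Real client libraries (Ribbon, gRPC resolvers) add health-aware weighting and backoff, but the shape is the same: the client owns instance selection and failure handling.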

Server-Side Discovery

Clients talk to a load balancer that handles discovery:

[Diagram: client → load balancer (which queries the registry) → Service A instances]

# Client just calls a stable endpoint
response = http.get("http://payment-service.internal/charge")
# Load balancer handles finding healthy instances

Pros:

  • Simple clients — just HTTP calls
  • Language-agnostic
  • Centralized load balancing logic

Cons:

  • Extra network hop
  • Load balancer becomes critical path
  • Less client control

Service Registries

Consul

service {
  name = "payment-service"
  port = 8080
  
  check {
    http     = "http://localhost:8080/health"
    interval = "10s"
    timeout  = "2s"
  }
}

Query via DNS or HTTP API:

# DNS
dig payment-service.service.consul

# HTTP
curl http://consul:8500/v1/health/service/payment-service

etcd

import etcd3

etcd = etcd3.client()

# Register
etcd.put('/services/payment/instance-1', '10.0.1.5:8080')

# Discover: get_prefix yields (value, metadata) pairs, values as bytes
instances = [value.decode() for value, _ in etcd.get_prefix('/services/payment/')]

Kubernetes

Built-in service discovery via Services and DNS:

apiVersion: v1
kind: Service
metadata:
  name: payment-service
spec:
  selector:
    app: payment
  ports:
    - port: 80
      targetPort: 8080

# Any pod can reach the service
response = http.get("http://payment-service/charge")
# Kubernetes handles the rest

Registration Patterns

Self-Registration

Service registers itself on startup:

class PaymentService:
    def startup(self):
        registry.register(
            name="payment-service",
            address=get_my_ip(),
            port=8080,
            health_check="/health"
        )
    
    def shutdown(self):
        registry.deregister("payment-service", get_my_ip())

Pros: Service controls its own registration.
Cons: Every service needs registration logic.

Third-Party Registration

External process watches for new services and registers them:

# Registrar watches Docker events
for event in docker.events():
    if event.type == 'container' and event.action == 'start':
        container = docker.inspect(event.id)
        registry.register(
            name=container.labels['service'],
            address=container.ip,
            port=container.labels['port']
        )

Pros: Services don’t need registry awareness.
Cons: Extra component to manage.

Kubernetes uses this pattern — kubelet registers pods automatically.

Health Checking

Registration without health checking is dangerous. Dead instances stay in the registry.

Active checks — registry pings services:

# Registry periodically checks each instance
for instance in registry.instances("payment-service"):
    if not http.get(f"http://{instance}/health").ok:
        registry.mark_unhealthy(instance)

Passive checks — services send heartbeats:

# Service sends heartbeat every 10 seconds
while running:
    registry.heartbeat("payment-service", my_address)
    time.sleep(10)
# Miss 3 heartbeats → marked unhealthy
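
On the registry side, passive checking reduces to tracking the last heartbeat per instance and expiring stale ones. A sketch (the `HeartbeatRegistry` class, the 10s interval, and the 3-miss threshold are illustrative assumptions):

```python
import time

class HeartbeatRegistry:
    """Passive-check sketch: an instance is healthy while its last
    heartbeat is recent enough. Interval and miss count are assumptions."""

    INTERVAL = 10       # seconds between expected heartbeats
    MAX_MISSES = 3      # missed heartbeats before marking unhealthy

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.last_seen = {}   # (service, address) -> last heartbeat time

    def heartbeat(self, service, address):
        self.last_seen[(service, address)] = self.clock()

    def healthy_instances(self, service):
        grace = self.INTERVAL * self.MAX_MISSES   # 30s of silence allowed
        now = self.clock()
        return [addr for (svc, addr), seen in self.last_seen.items()
                if svc == service and now - seen < grace]
```

Injecting the clock keeps the expiry logic testable; in production the same idea usually rides on a TTL in the store itself (e.g. etcd leases, Consul TTL checks).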

Hybrid — both active and passive:

# Consul supports both
check {
  http     = "http://localhost:8080/health"
  interval = "10s"
}
check {
  ttl = "30s"  # Must call /v1/agent/check/pass/:check_id
}

Handling Failures

Service discovery must handle:

Registry unavailable:

def get_instances(service_name):
    try:
        return registry.query(service_name)
    except RegistryUnavailable:
        return cached_instances[service_name]  # Stale but better than nothing

All instances unhealthy:

instances = registry.get_healthy_instances("payment")
if not instances:
    # Fall back to any instance? Return error? Use circuit breaker?
    raise ServiceUnavailable("No healthy payment instances")

Split brain: Multiple registries with inconsistent views. Use consensus-based registries (Consul, etcd) or accept eventual consistency.

Service Mesh

Modern approach: sidecar proxy handles all discovery and routing.

[Diagram: inside the pod, the service talks to its sidecar proxy (e.g. Envoy), which talks to other services]

The service just calls localhost. The sidecar handles discovery, load balancing, retries, mTLS, observability.

Pros: Zero application changes, consistent behavior.
Cons: Complexity, resource overhead, another thing to operate.

Implementations: Istio, Linkerd, Consul Connect


Service discovery is infrastructure plumbing — invisible when working, catastrophic when broken. Start with DNS for simple cases. Add a registry when you need health checking and metadata. Consider a service mesh when you need advanced traffic management.

The goal: services find each other reliably, without hardcoded addresses, without human intervention. Everything else is implementation details.