Most monitoring dashboards are useless. Walls of graphs nobody looks at until something breaks — then nobody knows which graph matters. Here’s how to build dashboards that actually help.

The Dashboard Hierarchy

Level 1: Executive. "Is everything OK?" (1 dashboard)
Level 2: Service health. "Which service broke?" (1 per service)
Level 3: Deep dive. "Why did it break?" (per-component panels)

Start at level 1, drill down when needed.

The Four Golden Signals

Google’s SRE book nailed it. Monitor these:

1. Latency

# Request duration percentiles
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

Dashboard panel: Line graph showing p50, p95, p99 over time
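These queries estimate percentiles by linear interpolation inside the histogram bucket that contains the target rank. A rough Python sketch of that estimation (simplified: plain cumulative counts, no rate window; not the real Prometheus implementation):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by bound,
    ending with (inf, total). Linear interpolation inside the target
    bucket, mirroring Prometheus's approach (sketch only).
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the +Inf bucket
            width = bound - prev_bound
            in_bucket = count - prev_count
            return prev_bound + width * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count

# 30 requests finished in <=0.1s, 80 in <=0.5s, 100 total
buckets = [(0.1, 30), (0.5, 80), (1.0, 100), (float("inf"), 100)]
p50 = histogram_quantile(0.50, buckets)  # interpolated inside (0.1, 0.5]
p95 = histogram_quantile(0.95, buckets)
```

This is also why percentile accuracy depends on bucket layout: the estimate can be off by up to a bucket's width.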

2. Traffic

# Requests per second
sum(rate(http_requests_total[5m]))

# By endpoint
sum(rate(http_requests_total[5m])) by (endpoint)

Dashboard panel: Stacked area chart by endpoint

3. Errors

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# By error type
sum(rate(http_requests_total{status=~"5.."}[5m])) by (status)

Dashboard panel: Single stat (big number) + time series
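The percentage is simply the 5xx rate divided by the total rate. The arithmetic, sketched on two counter scrapes with made-up values:

```python
def counter_rate(prev, curr, seconds):
    """Per-second rate between two scrapes of a monotonically increasing counter."""
    return (curr - prev) / seconds

# Hypothetical scrapes taken 60s apart
err_per_s = counter_rate(5, 10, 60)          # 5xx counter: 5 -> 10
total_per_s = counter_rate(1000, 2000, 60)   # all requests: 1000 -> 2000
error_pct = err_per_s / total_per_s * 100    # 5 / 1000 * 100 = 0.5%
```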

4. Saturation

# CPU usage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk usage
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)

Dashboard panel: Gauge showing current %, alert thresholds marked
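All three saturation expressions share the same shape: used over total, as a percentage. The memory query in plain arithmetic, with illustrative numbers:

```python
def memory_usage_pct(mem_available_bytes, mem_total_bytes):
    """Same formula as the PromQL above: share of memory not available."""
    return (1 - mem_available_bytes / mem_total_bytes) * 100

GiB = 1024 ** 3
usage = memory_usage_pct(4 * GiB, 16 * GiB)  # 4 GiB free of 16 GiB -> 75.0
```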

Executive Dashboard

One screen, whole system health:

[Executive dashboard mock-up: SYSTEM HEALTH header with uptime 99.9%, overall error rate, latency, and traffic around 1.2k/s, plus per-service rows marked Healthy or Degraded, with an "elevated latency, investigate" note on the degraded service]

Grafana JSON

{
  "panels": [
    {
      "title": "System Uptime",
      "type": "stat",
      "targets": [{
        "expr": "avg(up) * 100",
        "legendFormat": "Uptime %"
      }],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"color": "red", "value": 0},
              {"color": "yellow", "value": 99},
              {"color": "green", "value": 99.9}
            ]
          },
          "unit": "percent"
        }
      }
    },
    {
      "title": "Error Rate",
      "type": "stat",
      "targets": [{
        "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
      }],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"color": "green", "value": 0},
              {"color": "yellow", "value": 1},
              {"color": "red", "value": 5}
            ]
          },
          "unit": "percent"
        }
      }
    }
  ]
}
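The threshold steps are evaluated lowest to highest: the active color is the last step whose value the metric has reached. A small sketch of that selection logic (assumes steps are sorted ascending):

```python
def threshold_color(value, steps):
    """Return the color of the highest threshold step that 'value' has reached."""
    color = steps[0]["color"]
    for step in steps:
        if value >= step["value"]:
            color = step["color"]
    return color

# Same steps as the Error Rate panel above
error_rate_steps = [
    {"color": "green", "value": 0},
    {"color": "yellow", "value": 1},
    {"color": "red", "value": 5},
]
```

So an error rate of 2% renders yellow, and anything from 5% up renders red.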

Service Dashboard Template

Every service gets one:

[Service dashboard mock-up: request rate, error rate, and p50/p95/p99 latency (p95 around 120ms) broken out by /api/ endpoint, plus CPU, memory, and disk saturation gauges]

Database Dashboard

# Connections
pg_stat_activity_count

# Connection pool usage
pg_stat_activity_count / pg_settings_max_connections * 100

# Query duration
histogram_quantile(0.95, rate(pg_stat_statements_seconds_bucket[5m]))

# Cache hit ratio
pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read) * 100

# Replication lag
pg_replication_lag_seconds

# Dead tuples (needs vacuum)
pg_stat_user_tables_n_dead_tup
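Two of these are best watched as ratios rather than raw counts. The arithmetic behind the pool-usage and cache-hit panels, with illustrative values:

```python
def pool_usage_pct(active, max_connections):
    """Connection pool usage, mirroring pg_stat_activity_count / max_connections."""
    return active / max_connections * 100

def cache_hit_pct(blks_hit, blks_read):
    """Buffer cache hit ratio; OLTP databases usually want this above ~99%."""
    return blks_hit / (blks_hit + blks_read) * 100

pool = pool_usage_pct(45, 100)          # 45.0
cache = cache_hit_pct(990_000, 10_000)  # 99.0
```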

Key Panels

[Key panels mock-up: connections 45/100, QPS, slow queries (top queries with avg latency and call counts), cache hit ratio, replication lag, locks, and dead tuples with a "vacuum!" warning]

Infrastructure Dashboard

# Node health
up{job="node"}

# Load average
node_load1 / count(node_cpu_seconds_total{mode="idle"}) without (cpu, mode)

# Network I/O
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

# Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

K8s Dashboard

# Pod restarts (bad sign)
increase(kube_pod_container_status_restarts_total[1h])

# Pod status
kube_pod_status_phase

# Resource requests vs usage
container_memory_working_set_bytes / kube_pod_container_resource_requests{resource="memory"}

# Pending pods (scheduling issues)
kube_pod_status_phase{phase="Pending"}
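The requests-vs-usage ratio is the one to watch: a container whose working set sits near or above its memory request is under-provisioned. A sketch of the same check (the 0.9 threshold is an assumption, tune to taste):

```python
def near_memory_request(working_set_bytes, request_bytes, threshold=0.9):
    """Flag containers whose memory usage is close to their request,
    mirroring the PromQL ratio above. 0.9 is an illustrative threshold."""
    return working_set_bytes / request_bytes >= threshold

Mi = 1024 ** 2
flagged = near_memory_request(480 * Mi, 512 * Mi)  # ratio 0.9375 -> True
```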

Alert-Dashboard Alignment

Every alert should have a dashboard panel:

# Alert rule
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) 
    / sum(rate(http_requests_total[5m])) > 0.05
  annotations:
    dashboard: "https://grafana.example.com/d/api-service?var-timerange=1h"
    panel: "Error Rate"

When an alert fires → click the link → see the problem visualized.

Dashboard Variables

Make dashboards reusable:

{
  "templating": {
    "list": [
      {
        "name": "service",
        "type": "query",
        "query": "label_values(up, service)",
        "multi": false
      },
      {
        "name": "instance",
        "type": "query", 
        "query": "label_values(up{service=\"$service\"}, instance)",
        "multi": true
      },
      {
        "name": "interval",
        "type": "interval",
        "options": [
          {"text": "1m", "value": "1m"},
          {"text": "5m", "value": "5m"},
          {"text": "1h", "value": "1h"}
        ]
      }
    ]
  }
}

Use in queries:

rate(http_requests_total{service="$service", instance=~"$instance"}[$interval])
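Grafana's $variable syntax happens to line up with Python's string.Template, which makes the expansion easy to sketch (the selected values below are hypothetical):

```python
from string import Template

query = Template(
    'rate(http_requests_total{service="$service", instance=~"$instance"}[$interval])'
)
# Hypothetical selections from the dashboard dropdowns
expanded = query.substitute(service="checkout", instance="10.0.0.1:9100", interval="5m")
```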

Dashboard as Code

Grafana + Terraform

resource "grafana_dashboard" "api_service" {
  config_json = file("dashboards/api-service.json")
  folder      = grafana_folder.services.id
}

resource "grafana_folder" "services" {
  title = "Service Dashboards"
}

Grafana + Jsonnet

local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local prometheus = grafana.prometheus;

dashboard.new(
  'API Service',
  tags=['api', 'production'],
)
.addPanel(
  grafana.graphPanel.new(
    'Request Rate',
    datasource='Prometheus',
  ).addTarget(
    prometheus.target(
      'sum(rate(http_requests_total[5m]))',
      legendFormat='Requests/s',
    )
  ),
  gridPos={x: 0, y: 0, w: 12, h: 8},
)
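If Jsonnet is more tooling than you want, the same dashboard-as-code idea works with plain dictionaries in any language. A minimal Python sketch that emits the stat-panel JSON used earlier (a subset of the schema, not a complete dashboard definition):

```python
import json

def stat_panel(title, expr, unit="percent"):
    """Build a minimal Grafana stat-panel dict (only the fields used here)."""
    return {
        "title": title,
        "type": "stat",
        "targets": [{"expr": expr}],
        "fieldConfig": {"defaults": {"unit": unit}},
    }

dashboard = {
    "title": "API Service",
    "panels": [
        stat_panel("System Uptime", "avg(up) * 100"),
        stat_panel(
            "Error Rate",
            'sum(rate(http_requests_total{status=~"5.."}[5m]))'
            " / sum(rate(http_requests_total[5m])) * 100",
        ),
    ],
}

print(json.dumps(dashboard, indent=2))
```

Check the generated JSON into version control just like the Terraform variant.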

Anti-Patterns

Too Many Graphs

Fifty panels per dashboard means nobody knows where to look. Every panel should answer one specific question.

No Context

Graphs without thresholds give no idea whether a spike is good or bad. Mark red/yellow/green zones so good and bad are visible at a glance.

Stale Dashboards

Dashboards for services that no longer exist, panels querying metrics renamed two years ago. Review dashboards quarterly and prune the dead ones.

The Checklist

  • Executive overview exists (1 screen)
  • Each service has a dashboard
  • Four golden signals covered
  • Thresholds visualized
  • Variables for filtering
  • Alerts link to dashboards
  • Dashboards in version control
  • Quarterly review scheduled

Start Here

  1. Today: Create executive overview with 4 key metrics
  2. This week: Add service dashboard for critical service
  3. This month: Link all alerts to relevant dashboards
  4. This quarter: Implement dashboard as code

The best dashboard is the one that tells you what’s wrong before users notice.


Dashboards are for humans. Design for the 3am on-call engineer who just woke up.