Most monitoring dashboards are useless. Walls of graphs nobody looks at until something breaks — then nobody knows which graph matters. Here’s how to build dashboards that actually help.

The Dashboard Hierarchy

Level 1: Executive. "Is everything OK?" (1 dashboard)
Level 2: Service health. "Which service broke?" (1 per service)
Level 3: Deep dive. "Why did it break?" (per-component panels)

Start at level 1, drill down when needed.

The Four Golden Signals

Google’s SRE book nailed it. Monitor these:

1. Latency

# Request duration percentiles
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

Dashboard panel: Line graph showing p50, p95, p99 over time
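These queries estimate percentiles by linear interpolation inside the histogram bucket that contains the target rank. A rough Python sketch of that estimation (simplified: plain cumulative counts, no rate window; not the real Prometheus implementation):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs sorted by bound,
    ending with (inf, total). Linear interpolation inside the target
    bucket, mirroring Prometheus's approach (sketch only).
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # quantile falls in the +Inf bucket
            width = bound - prev_bound
            in_bucket = count - prev_count
            return prev_bound + width * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count

# 30 requests finished in <=0.1s, 80 in <=0.5s, 100 total
buckets = [(0.1, 30), (0.5, 80), (1.0, 100), (float("inf"), 100)]
p50 = histogram_quantile(0.50, buckets)  # interpolated inside (0.1, 0.5]
p95 = histogram_quantile(0.95, buckets)
```

This is also why percentile accuracy depends on bucket layout: the estimate can be off by up to a bucket's width.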

2. Traffic

# Requests per second
sum(rate(http_requests_total[5m]))

# By endpoint
sum(rate(http_requests_total[5m])) by (endpoint)

Dashboard panel: Stacked area chart by endpoint

3. Errors

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# By error type
sum(rate(http_requests_total{status=~"5.."}[5m])) by (status)

Dashboard panel: Single stat (big number) + time series
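The percentage is simply the 5xx rate divided by the total rate. The arithmetic, sketched on two counter scrapes with made-up values:

```python
def counter_rate(prev, curr, seconds):
    """Per-second rate between two scrapes of a monotonically increasing counter."""
    return (curr - prev) / seconds

# Hypothetical scrapes taken 60s apart
err_per_s = counter_rate(5, 10, 60)          # 5xx counter: 5 -> 10
total_per_s = counter_rate(1000, 2000, 60)   # all requests: 1000 -> 2000
error_pct = err_per_s / total_per_s * 100    # 5 / 1000 * 100 = 0.5%
```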

4. Saturation

# CPU usage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk usage
100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100)

Dashboard panel: Gauge showing current %, alert thresholds marked
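All three saturation expressions share the same shape: used over total, as a percentage. The memory query in plain arithmetic, with illustrative numbers:

```python
def memory_usage_pct(mem_available_bytes, mem_total_bytes):
    """Same formula as the PromQL above: share of memory not available."""
    return (1 - mem_available_bytes / mem_total_bytes) * 100

GiB = 1024 ** 3
usage = memory_usage_pct(4 * GiB, 16 * GiB)  # 4 GiB free of 16 GiB -> 75.0
```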

Executive Dashboard

One screen, whole system health:

[Executive dashboard mock-up: SYSTEM HEALTH header with uptime 99.9%, overall error rate, latency, and traffic around 1.2k/s, plus per-service rows marked Healthy or Degraded, with an "elevated latency, investigate" note on the degraded service]

Grafana JSON

{
  "panels": [
    {
      "title": "System Uptime",
      "type": "stat",
      "targets": [{
        "expr": "avg(up) * 100",
        "legendFormat": "Uptime %"
      }],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"color": "red", "value": 0},
              {"color": "yellow", "value": 99},
              {"color": "green", "value": 99.9}
            ]
          },
          "unit": "percent"
        }
      }
    },
    {
      "title": "Error Rate",
      "type": "stat",
      "targets": [{
        "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
      }],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"color": "green", "value": 0},
              {"color": "yellow", "value": 1},
              {"color": "red", "value": 5}
            ]
          },
          "unit": "percent"
        }
      }
    }
  ]
}
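The threshold steps are evaluated lowest to highest: the active color is the last step whose value the metric has reached. A small sketch of that selection logic (assumes steps are sorted ascending):

```python
def threshold_color(value, steps):
    """Return the color of the highest threshold step that 'value' has reached."""
    color = steps[0]["color"]
    for step in steps:
        if value >= step["value"]:
            color = step["color"]
    return color

# Same steps as the Error Rate panel above
error_rate_steps = [
    {"color": "green", "value": 0},
    {"color": "yellow", "value": 1},
    {"color": "red", "value": 5},
]
```

So an error rate of 2% renders yellow, and anything from 5% up renders red.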

Service Dashboard Template

Every service gets one:

[Service dashboard mock-up: request rate, error rate, and p50/p95/p99 latency (p95 around 120ms) broken out by /api/ endpoint, plus CPU, memory, and disk saturation gauges]

Database Dashboard

# Connections
pg_stat_activity_count

# Connection pool usage
pg_stat_activity_count / pg_settings_max_connections * 100

# Query duration
histogram_quantile(0.95, rate(pg_stat_statements_seconds_bucket[5m]))

# Cache hit ratio
pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read) * 100

# Replication lag
pg_replication_lag_seconds

# Dead tuples (needs vacuum)
pg_stat_user_tables_n_dead_tup
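Two of these are best watched as ratios rather than raw counts. The arithmetic behind the pool-usage and cache-hit panels, with illustrative values:

```python
def pool_usage_pct(active, max_connections):
    """Connection pool usage, mirroring pg_stat_activity_count / max_connections."""
    return active / max_connections * 100

def cache_hit_pct(blks_hit, blks_read):
    """Buffer cache hit ratio; OLTP databases usually want this above ~99%."""
    return blks_hit / (blks_hit + blks_read) * 100

pool = pool_usage_pct(45, 100)          # 45.0
cache = cache_hit_pct(990_000, 10_000)  # 99.0
```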

Key Panels

[Key panels mock-up: connections 45/100, QPS, slow queries (top queries with avg latency and call counts), cache hit ratio, replication lag, locks, and dead tuples with a "vacuum!" warning]

Infrastructure Dashboard

# Node health
up{job="node"}

# Load average
node_load1 / count(node_cpu_seconds_total{mode="idle"}) without (cpu, mode)

# Network I/O
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

# Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])

K8s Dashboard

# Pod restarts (bad sign)
increase(kube_pod_container_status_restarts_total[1h])

# Pod status
kube_pod_status_phase

# Resource requests vs usage
container_memory_working_set_bytes / kube_pod_container_resource_requests{resource="memory"}

# Pending pods (scheduling issues)
kube_pod_status_phase{phase="Pending"}
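The requests-vs-usage ratio is the one to watch: a container whose working set sits near or above its memory request is under-provisioned. A sketch of the same check (the 0.9 threshold is an assumption, tune to taste):

```python
def near_memory_request(working_set_bytes, request_bytes, threshold=0.9):
    """Flag containers whose memory usage is close to their request,
    mirroring the PromQL ratio above. 0.9 is an illustrative threshold."""
    return working_set_bytes / request_bytes >= threshold

Mi = 1024 ** 2
flagged = near_memory_request(480 * Mi, 512 * Mi)  # ratio 0.9375 -> True
```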

Alert-Dashboard Alignment

Every alert should have a dashboard panel:

# Alert rule
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) 
    / sum(rate(http_requests_total[5m])) > 0.05
  annotations:
    dashboard: "https://grafana.example.com/d/api-service?var-timerange=1h"
    panel: "Error Rate"

When an alert fires → click the link → see the problem visualized.

Dashboard Variables

Make dashboards reusable:

{
  "templating": {
    "list": [
      {
        "name": "service",
        "type": "query",
        "query": "label_values(up, service)",
        "multi": false
      },
      {
        "name": "instance",
        "type": "query", 
        "query": "label_values(up{service=\"$service\"}, instance)",
        "multi": true
      },
      {
        "name": "interval",
        "type": "interval",
        "options": [
          {"text": "1m", "value": "1m"},
          {"text": "5m", "value": "5m"},
          {"text": "1h", "value": "1h"}
        ]
      }
    ]
  }
}

Use in queries:

rate(http_requests_total{service="$service", instance=~"$instance"}[$interval])
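Grafana's $variable syntax happens to line up with Python's string.Template, which makes the expansion easy to sketch (the selected values below are hypothetical):

```python
from string import Template

query = Template(
    'rate(http_requests_total{service="$service", instance=~"$instance"}[$interval])'
)
# Hypothetical selections from the dashboard dropdowns
expanded = query.substitute(service="checkout", instance="10.0.0.1:9100", interval="5m")
```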

Dashboard as Code

Grafana + Terraform

resource "grafana_dashboard" "api_service" {
  config_json = file("dashboards/api-service.json")
  folder      = grafana_folder.services.id
}

resource "grafana_folder" "services" {
  title = "Service Dashboards"
}

Grafana + Jsonnet

local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local prometheus = grafana.prometheus;

dashboard.new(
  'API Service',
  tags=['api', 'production'],
)
.addPanel(
  grafana.graphPanel.new(
    'Request Rate',
    datasource='Prometheus',
  ).addTarget(
    prometheus.target(
      'sum(rate(http_requests_total[5m]))',
      legendFormat='Requests/s',
    )
  ),
  gridPos={x: 0, y: 0, w: 12, h: 8},
)
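If Jsonnet is more tooling than you want, the same dashboard-as-code idea works with plain dictionaries in any language. A minimal Python sketch that emits the stat-panel JSON used earlier (a subset of the schema, not a complete dashboard definition):

```python
import json

def stat_panel(title, expr, unit="percent"):
    """Build a minimal Grafana stat-panel dict (only the fields used here)."""
    return {
        "title": title,
        "type": "stat",
        "targets": [{"expr": expr}],
        "fieldConfig": {"defaults": {"unit": unit}},
    }

dashboard = {
    "title": "API Service",
    "panels": [
        stat_panel("System Uptime", "avg(up) * 100"),
        stat_panel(
            "Error Rate",
            'sum(rate(http_requests_total{status=~"5.."}[5m]))'
            " / sum(rate(http_requests_total[5m])) * 100",
        ),
    ],
}

print(json.dumps(dashboard, indent=2))
```

Check the generated JSON into version control just like the Terraform variant.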

Anti-Patterns

Too Many Graphs

Fifty panels per dashboard means nobody knows where to look. Every panel should answer one specific question.

No Context

Graphs without thresholds give no idea whether a spike is good or bad. Mark red/yellow/green zones so good and bad are visible at a glance.

Stale Dashboards

Dashboards for services that no longer exist, panels querying metrics renamed two years ago. Review dashboards quarterly and prune the dead ones.

The Checklist

  • Executive overview exists (1 screen)
  • Each service has a dashboard
  • Four golden signals covered
  • Thresholds visualized
  • Variables for filtering
  • Alerts link to dashboards
  • Dashboards in version control
  • Quarterly review scheduled

Start Here

  1. Today: Create executive overview with 4 key metrics
  2. This week: Add service dashboard for critical service
  3. This month: Link all alerts to relevant dashboards
  4. This quarter: Implement dashboard as code

The best dashboard is the one that tells you what’s wrong before users notice.


Dashboards are for humans. Design for the 3am on-call engineer who just woke up.