Every CI/CD tutorial shows you “hello world” pipelines. Then you hit production and realize none of that scales. Here are the patterns that actually work.

The Fundamental Truth

CI/CD pipelines are software. They need:

  • Version control (they’re in your repo, good start)
  • Testing (who tests the tests?)
  • Refactoring (your 500-line YAML file is technical debt)
  • Observability (why did that deploy take 45 minutes?)

Treat them with the same rigor as your application code.
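One way to apply that rigor is to lint the pipelines themselves. A sketch using actionlint (a real workflow linter; the trigger and install method here are assumptions — check actionlint's README for the current install script):

```yaml
# Hypothetical workflow that lints all other workflows on every PR
name: Lint Workflows
on:
  pull_request:
    paths:
      - '.github/workflows/**'
jobs:
  actionlint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # actionlint catches typos in job references, expressions, and shell snippets
      - run: |
          bash <(curl -sSf https://raw.githubusercontent.com/rhysd/actionlint/main/scripts/download-actionlint.bash)
          ./actionlint
```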

Pipeline Architecture

The Diamond Pattern

Most pipelines should look like a diamond:

          +-- [Lint] --+
[Build] --|            |-- [Deploy]
          +-- [Test] --+

Wide in the middle (parallelism), narrow at the ends (coordination points).

GitHub Actions example:

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ steps.build.outputs.tag }}
    steps:
      - uses: actions/checkout@v4
      - id: build
        run: |
          TAG="${GITHUB_SHA::8}"
          docker build -t myapp:$TAG .
          echo "tag=$TAG" >> $GITHUB_OUTPUT

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run lint

  test:
    runs-on: ubuntu-latest
    needs: build
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test

  deploy:
    runs-on: ubuntu-latest
    needs: [build, lint, test]
    steps:
      - run: echo "Deploying ${{ needs.build.outputs.image-tag }}"

Lint and test run in parallel. Deploy waits for everything.

Fail Fast, Fail Cheap

Order jobs by:

  1. How fast they run
  2. How often they fail
  3. How expensive they are
jobs:
  # Fast, fails often, cheap
  lint:
    runs-on: ubuntu-latest
    steps:
      - run: npm run lint  # 10 seconds

  # Medium speed, sometimes fails
  unit-test:
    needs: lint
    runs-on: ubuntu-latest
    steps:
      - run: npm test  # 2 minutes

  # Slow, rarely fails, expensive
  integration-test:
    needs: unit-test
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
    steps:
      - run: npm run test:integration  # 10 minutes

  # Slowest, expensive compute
  e2e-test:
    needs: integration-test
    runs-on: ubuntu-latest
    steps:
      - run: npm run test:e2e  # 20 minutes

If lint fails, you’ve wasted 10 seconds, not 30 minutes.

Caching Done Right

Caching can cut build times by 80%. Done wrong, it causes bizarre failures.

The Cache Key Formula

- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      npm-${{ runner.os }}-

Key structure: {type}-{os}-{hash}

  • type: What you’re caching (npm, pip, gradle)
  • os: Operating system (caches aren’t cross-platform)
  • hash: Content hash of lock file

Restore keys: Fallback for partial matches. Use sparingly — stale caches cause weird bugs.

What to Cache

Yes:

  • Package manager caches (~/.npm, ~/.cache/pip)
  • Build tool caches (~/.gradle, ~/.m2)
  • Compiled dependencies

No:

  • Your application build output (use artifacts)
  • Anything that changes every commit
  • Large binary blobs
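For build output, the artifact actions are the right tool: they're scoped to a single workflow run, so there's no stale-cache risk. A sketch of handing a build between jobs (the `dist` name and deploy step are illustrative):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run build
      # Upload the build output for downstream jobs in this run
      - uses: actions/upload-artifact@v4
        with:
          name: dist
          path: dist/

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      # Pull the exact bytes the build job produced -- no rebuild, no cache lookup
      - uses: actions/download-artifact@v4
        with:
          name: dist
          path: dist/
      - run: echo "deploy dist/ here"
```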

Cache Invalidation

The two hard things in computer science apply here. When in doubt:

key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}-${{ github.run_id }}

Adding run_id means fresh cache every run. Use when debugging cache issues.

Secrets Management

Never Hardcode. Ever.

# WRONG
env:
  API_KEY: sk-1234567890

# RIGHT
env:
  API_KEY: ${{ secrets.API_KEY }}

Limit Secret Scope

jobs:
  build:
    # No secrets needed here
    steps:
      - run: npm run build

  deploy:
    # Only this job needs deploy credentials
    environment: production
    steps:
      - run: deploy.sh
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

Use OIDC When Possible

Instead of storing AWS credentials:

permissions:
  id-token: write
  contents: read

steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789:role/github-actions
      aws-region: us-east-1

No secrets stored. AWS trusts GitHub’s identity token.

Matrix Builds

Test across versions without copy-paste:

jobs:
  test:
    strategy:
      matrix:
        node: [18, 20, 22]
        os: [ubuntu-latest, macos-latest]
      fail-fast: false
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm test

fail-fast: false means all combinations run even if one fails. You want to see the full picture.

Smart Matrix Exclusions

strategy:
  matrix:
    node: [18, 20, 22]
    os: [ubuntu-latest, macos-latest, windows-latest]
    exclude:
      - os: windows-latest
        node: 18  # We don't support Node 18 on Windows
    include:
      - os: ubuntu-latest
        node: 22
        experimental: true  # Extra flags for specific combos

Reusable Workflows

Stop copy-pasting between repos.

Shared workflow (in a central repo):

# .github/workflows/node-ci.yml
name: Node CI

on:
  workflow_call:
    inputs:
      node-version:
        type: string
        default: '20'
    secrets:
      NPM_TOKEN:
        required: false

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
      - run: npm ci
      - run: npm test

Usage in other repos:

jobs:
  ci:
    uses: myorg/shared-workflows/.github/workflows/node-ci.yml@main
    with:
      node-version: '22'
    secrets:
      NPM_TOKEN: ${{ secrets.NPM_TOKEN }}

One fix in the shared workflow fixes all repos.

Deployment Strategies

Environment Protection

jobs:
  deploy-staging:
    environment: staging
    steps:
      - run: deploy.sh staging

  deploy-production:
    needs: deploy-staging
    environment:
      name: production
      url: https://myapp.com
    steps:
      - run: deploy.sh production

Configure environments in repo settings:

  • Required reviewers
  • Wait timers
  • Branch restrictions

Rollback Pattern

deploy:
  steps:
    - name: Deploy
      id: deploy
      run: |
        OLD_VERSION=$(get-current-version)
        echo "old-version=$OLD_VERSION" >> $GITHUB_OUTPUT
        deploy-new-version

    - name: Health Check
      id: health
      continue-on-error: true
      run: |
        sleep 30
        curl --fail https://myapp.com/health

    - name: Rollback on Failure
      if: steps.health.outcome == 'failure'
      run: |
        deploy-version ${{ steps.deploy.outputs.old-version }}
        exit 1  # Fail the workflow

Automatic rollback when health checks fail.

Observability

Timing Matters

- name: Build
  run: |
    START=$(date +%s)
    npm run build
    END=$(date +%s)
    echo "::notice::Build took $((END-START)) seconds"

Track durations. Catch regressions.
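One cheap way to make durations visible is the job summary. `$GITHUB_STEP_SUMMARY` is a real Actions feature (anything appended to it renders as Markdown on the run page); the table layout here is just a sketch:

```yaml
- name: Build
  run: |
    START=$(date +%s)
    npm run build
    END=$(date +%s)
    # Summaries render as Markdown on the run page, so trends are easy to eyeball
    echo "| Step | Duration |" >> "$GITHUB_STEP_SUMMARY"
    echo "|------|----------|" >> "$GITHUB_STEP_SUMMARY"
    echo "| build | $((END-START))s |" >> "$GITHUB_STEP_SUMMARY"
```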

Structured Logging

- name: Deploy
  run: |
    echo "::group::Deploying to production"
    deploy.sh 2>&1
    echo "::endgroup::"

Groups collapse in the UI. Easier to scan.

Annotations

- name: Lint
  run: |
    # "--" passes the flag to the lint script, not to npm itself
    npm run lint -- --format json > lint-results.json
    # Convert ESLint-style JSON (files containing messages) to annotations
    jq -r '.[] | .filePath as $f | .messages[] | "::warning file=\($f),line=\(.line)::\(.message)"' lint-results.json

Warnings appear inline on the PR diff.

Anti-Patterns

1. Mega-workflows. 500 lines of YAML is unmaintainable. Split into reusable workflows.

2. Running everything on every push. Use path filters:

on:
  push:
    paths:
      - 'src/**'
      - 'package.json'

3. No timeouts

jobs:
  build:
    timeout-minutes: 15  # Kill if stuck

4. Ignoring flaky tests. Track and fix them. continue-on-error is a code smell.

5. Manual version bumps. Automate with semantic-release or similar.
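A minimal semantic-release sketch, assuming the default plugins and the built-in `GITHUB_TOKEN`; your branch names and Node version may differ:

```yaml
name: Release
on:
  push:
    branches: [main]
jobs:
  release:
    runs-on: ubuntu-latest
    permissions:
      contents: write   # needed to push tags and create GitHub releases
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      # semantic-release reads commit messages to decide the next version
      - run: npx semantic-release
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```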

Start Here

  1. Today: Add timeout-minutes to all jobs
  2. This week: Implement proper caching
  3. This month: Extract reusable workflows
  4. This quarter: Add deployment environments with protection rules

Your pipeline is part of your product. Build it like one.


The best CI/CD pipeline is the one nobody thinks about — because it just works, every time, predictably.