Every CI/CD tutorial shows you “hello world” pipelines. Then you hit production and realize none of that scales. Here are the patterns that actually work.
## The Fundamental Truth

CI/CD pipelines are software. They need:

- Version control (they're in your repo, good start)
- Testing (who tests the tests?)
- Refactoring (your 500-line YAML file is technical debt)
- Observability (why did that deploy take 45 minutes?)

Treat them with the same rigor as your application code.
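Testing the pipeline itself is possible. One low-effort option is linting workflow files in a dedicated job; as a sketch, using actionlint via its documented download script (one tool among several):

```yaml
workflow-lint:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Lint workflow files
      run: |
        # Download actionlint, then check everything under .github/workflows/
        bash <(curl -s https://raw.githubusercontent.com/rhysd/actionlint/main/scripts/download-actionlint.bash)
        ./actionlint
```

This catches typos in `needs:`, invalid expressions, and shellcheck issues in `run:` blocks before they fail a real run.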
## Pipeline Architecture

### The Diamond Pattern

Most pipelines should look like a diamond:
```
        [Build]
       /       \
  [Lint]      [Test]
       \       /
       [Deploy]
```

Wide in the middle (parallelism), narrow at the ends (coordination points).
GitHub Actions example:
```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ steps.build.outputs.tag }}
    steps:
      - uses: actions/checkout@v4
      - id: build
        run: |
          TAG="${GITHUB_SHA::8}"
          docker build -t myapp:$TAG .
          echo "tag=$TAG" >> $GITHUB_OUTPUT

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run lint

  test:
    runs-on: ubuntu-latest
    needs: build
    steps:
      - uses: actions/checkout@v4
      - run: npm test

  deploy:
    runs-on: ubuntu-latest
    needs: [build, lint, test]
    steps:
      - run: echo "Deploying ${{ needs.build.outputs.image-tag }}"
```
Lint and test run in parallel. Deploy waits for everything.
### Fail Fast, Fail Cheap

Order jobs by:

- How fast they run
- How often they fail
- How expensive they are
```yaml
jobs:
  # Fast, fails often, cheap
  lint:
    runs-on: ubuntu-latest
    steps:
      - run: npm run lint # 10 seconds

  # Medium speed, sometimes fails
  unit-test:
    needs: lint
    runs-on: ubuntu-latest
    steps:
      - run: npm test # 2 minutes

  # Slow, rarely fails, expensive
  integration-test:
    needs: unit-test
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
    steps:
      - run: npm run test:integration # 10 minutes

  # Slowest, expensive compute
  e2e-test:
    needs: integration-test
    runs-on: ubuntu-latest
    steps:
      - run: npm run test:e2e # 20 minutes
```
If lint fails, you’ve wasted 10 seconds, not 30 minutes.
## Caching Done Right

Caching can cut build times by 80%. Done wrong, it causes bizarre failures.
```yaml
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      npm-${{ runner.os }}-
```
Key structure: `{type}-{os}-{hash}`

- `type`: What you're caching (npm, pip, gradle)
- `os`: Operating system (caches aren't cross-platform)
- `hash`: Content hash of the lock file

Restore keys: Fallback for partial matches. Use sparingly — stale caches cause weird bugs.
### What to Cache

Yes:

- Package manager caches (`~/.npm`, `~/.cache/pip`)
- Build tool caches (`~/.gradle`, `~/.m2`)
- Compiled dependencies

No:

- Your application build output (use artifacts)
- Anything that changes every commit
- Large binary blobs

### Cache Invalidation

The two hard things in computer science apply here. When in doubt:
```yaml
key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}-${{ github.run_id }}
```
Adding `run_id` means a fresh cache every run. Use when debugging cache issues.
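The "use artifacts for build output" advice above can be sketched as two jobs handing off a build (job and artifact names are illustrative):

```yaml
build:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - run: npm run build
    - uses: actions/upload-artifact@v4
      with:
        name: dist
        path: dist/

deploy:
  needs: build
  runs-on: ubuntu-latest
  steps:
    - uses: actions/download-artifact@v4
      with:
        name: dist
        path: dist/
```

Unlike caches, artifacts are scoped to a single workflow run, so each commit deploys exactly what it built.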
## Secrets Management

### Never Hardcode. Ever.
```yaml
# WRONG
env:
  API_KEY: sk-1234567890

# RIGHT
env:
  API_KEY: ${{ secrets.API_KEY }}
```
### Limit Secret Scope
```yaml
jobs:
  build:
    # No secrets needed here
    steps:
      - run: npm run build

  deploy:
    # Only this job needs deploy credentials
    environment: production
    steps:
      - run: deploy.sh
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```
### Use OIDC When Possible

Instead of storing AWS credentials:
```yaml
permissions:
  id-token: write
  contents: read
steps:
  - uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789:role/github-actions
      aws-region: us-east-1
```
No secrets stored. AWS trusts GitHub’s identity token.
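On the AWS side, the role's trust policy has to trust GitHub's OIDC provider. A sketch (account ID and `repo:myorg/myapp` are placeholders; in practice narrow the `sub` condition to specific branches or environments):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:myorg/myapp:*"
        }
      }
    }
  ]
}
```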
## Matrix Builds

Test across versions without copy-paste:
```yaml
jobs:
  test:
    strategy:
      matrix:
        node: [18, 20, 22]
        os: [ubuntu-latest, macos-latest]
      fail-fast: false
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: npm test
```
`fail-fast: false` means all combinations run even if one fails. You want to see the full picture.
### Smart Matrix Exclusions
```yaml
strategy:
  matrix:
    node: [18, 20, 22]
    os: [ubuntu-latest, macos-latest, windows-latest]
    exclude:
      - os: windows-latest
        node: 18 # We don't support Node 18 on Windows
    include:
      - os: ubuntu-latest
        node: 22
        experimental: true # Extra flags for specific combos
```
## Reusable Workflows

Stop copy-pasting between repos.
Shared workflow (in a central repo):
```yaml
# .github/workflows/node-ci.yml
name: Node CI
on:
  workflow_call:
    inputs:
      node-version:
        type: string
        default: '20'
    secrets:
      NPM_TOKEN:
        required: false

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ inputs.node-version }}
      - run: npm ci
      - run: npm test
```
Usage in other repos:
```yaml
jobs:
  ci:
    uses: myorg/shared-workflows/.github/workflows/node-ci.yml@main
    with:
      node-version: '22'
    secrets:
      NPM_TOKEN: ${{ secrets.NPM_TOKEN }}
```
One fix in the shared workflow fixes all repos.
## Deployment Strategies

### Environment Protection
```yaml
jobs:
  deploy-staging:
    environment: staging
    steps:
      - run: deploy.sh staging

  deploy-production:
    needs: deploy-staging
    environment:
      name: production
      url: https://myapp.com
    steps:
      - run: deploy.sh production
```
Configure environments in repo settings:

- Required reviewers
- Wait timers
- Branch restrictions

### Rollback Pattern
```yaml
deploy:
  steps:
    - name: Deploy
      id: deploy
      run: |
        OLD_VERSION=$(get-current-version)
        echo "old-version=$OLD_VERSION" >> $GITHUB_OUTPUT
        deploy-new-version
    - name: Health Check
      id: health
      continue-on-error: true
      run: |
        sleep 30
        curl --fail https://myapp.com/health
    - name: Rollback on Failure
      if: steps.health.outcome == 'failure'
      run: |
        deploy-version ${{ steps.deploy.outputs.old-version }}
        exit 1 # Fail the workflow
```
Automatic rollback when health checks fail.
## Observability

### Timing Matters
```yaml
- name: Build
  run: |
    START=$(date +%s)
    npm run build
    END=$(date +%s)
    echo "::notice::Build took $((END-START)) seconds"
```
Track durations. Catch regressions.
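One way to catch regressions automatically is to compare the duration against a budget and warn when it's exceeded. A sketch (the 300-second budget is an arbitrary example):

```yaml
- name: Build with duration budget
  run: |
    START=$(date +%s)
    npm run build
    ELAPSED=$(( $(date +%s) - START ))
    echo "::notice::Build took ${ELAPSED} seconds"
    if [ "$ELAPSED" -gt 300 ]; then
      echo "::warning::Build exceeded the 300s budget"
    fi
```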
### Structured Logging
```yaml
- name: Deploy
  run: |
    echo "::group::Deploying to production"
    deploy.sh 2>&1
    echo "::endgroup::"
```
Groups collapse in the UI. Easier to scan.
### Annotations
```yaml
- name: Lint
  run: |
    # Note the extra -- so npm passes --format to the lint script
    npm run lint -- --format json > lint-results.json
    # Convert to annotations
    jq -r '.[] | "::warning file=\(.filePath),line=\(.line)::\(.message)"' lint-results.json
```
Warnings appear inline on the PR diff.
## Anti-Patterns

### 1. Mega-workflows
500 lines of YAML is unmaintainable. Split into reusable workflows.
### 2. Running everything on every push
Use path filters:
```yaml
on:
  push:
    paths:
      - 'src/**'
      - 'package.json'
```
### 3. No timeouts
```yaml
jobs:
  build:
    timeout-minutes: 15 # Kill if stuck
```
### 4. Ignoring flaky tests
Track and fix them. `continue-on-error` is a code smell.
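A minimal way to confirm a suspected flake before filing it: rerun the test in a loop and report the failure rate. A sketch in shell, where the command and run count are whatever you pass in:

```shell
#!/bin/sh
# Flakiness probe: run a test command N times, report how many runs fail.
probe_flakiness() {
  cmd="$1"
  runs="$2"
  fails=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Discard output; we only care about the exit status
    sh -c "$cmd" >/dev/null 2>&1 || fails=$((fails + 1))
    i=$((i + 1))
  done
  echo "$fails/$runs runs failed"
}

probe_flakiness "true" 5   # prints "0/5 runs failed"
```

Anything other than a 0% or 100% failure rate means the test is flaky, not broken.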
### 5. Manual version bumps
Automate with semantic-release or similar.
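As a sketch, a release job driven by semantic-release might look like this (assuming the repo is already configured for it; job and dependency names are illustrative):

```yaml
release:
  runs-on: ubuntu-latest
  needs: [build, test]
  permissions:
    contents: write # Push tags and release notes
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-node@v4
    - run: npm ci
    - run: npx semantic-release
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        NPM_TOKEN: ${{ secrets.NPM_TOKEN }}
```

semantic-release derives the version bump from commit messages, so the release number is never edited by hand.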
## Start Here

- **Today**: Add `timeout-minutes` to all jobs
- **This week**: Implement proper caching
- **This month**: Extract reusable workflows
- **This quarter**: Add deployment environments with protection rules

Your pipeline is part of your product. Build it like one.
The best CI/CD pipeline is the one nobody thinks about — because it just works, every time, predictably.