CI/CD for Monorepo: Speed, Caching, Selective Tests and Supply-Chain Security
I used to dread monorepo pipelines until I measured where the time actually went. “Pipeline takes 47 minutes, but when I run it with [skip ci], nobody notices.” A colleague told me this 4 years ago. A week later, we deployed broken code to production - precisely because of that skipped CI.
Since then, I’ve optimized CI/CD for teams from 5 to 50 developers. I got the longest pipeline from 52 minutes down to 8. This article is a complete blueprint of what actually works.
My experience covers GitHub Actions, GitLab CI, and Jenkins, on a monorepo with 15+ services and 200+ tests. I have implemented and optimized every example in this article myself.
What Goes Wrong When a Monorepo Grows
Typical symptoms:
- Pipeline takes 40+ minutes - developers lose flow
- Unnecessary builds - README change triggers all tests
- “Works on my machine” - but CI fails
- Release chaos - nobody knows what goes to production
- Security theater - scans run but nobody reads results
Goals We Must Achieve
| Metric | Bad State | Target State |
|---|---|---|
| PR pipeline | 40+ min | < 10 min |
| Main pipeline | 60+ min | < 20 min |
| False positives | Daily | Weekly |
| Security findings | Ignored | Triaged within 24h |
| Rollback time | Hours | Minutes |
Pipeline Architecture
Basic Structure
stages:
  - detect # What changed?
  - build # Build only changed
  - test # Test only affected
  - security # SAST, DAST, dependencies
  - publish # Artifacts, images
  - deploy # Staging, production
Rules for Different Triggers
| Trigger | What Runs | Why |
|---|---|---|
| PR | Affected services + lint | Fast feedback |
| Main | All affected + integration | Gatekeeping before release |
| Tag | Full + security + publish | Release readiness |
| Nightly | Everything + slow tests | Complete regression |
Change Detection: Path Filters
The simplest way to speed up a pipeline: don’t do what isn’t needed.
GitHub Actions
on:
  pull_request:
    paths:
      - 'services/order-service/**'
      - 'libs/common/**'
      - 'proto/**'
jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      order-service: ${{ steps.filter.outputs.order-service }}
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            order-service:
              - 'services/order-service/**'
              - 'libs/common/**'
  build-order-service:
    needs: detect-changes
    if: needs.detect-changes.outputs.order-service == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm run build --workspace=order-service
GitLab CI
order-service:build:
  script:
    - npm run build --workspace=order-service
  rules:
    - changes:
        - services/order-service/**/*
        - libs/common/**/*
        - proto/**/*
Services Map
For more complex dependencies, create an explicit map:
# .ci/services-map.yml
services:
  order-service:
    path: services/order-service
    depends_on:
      - libs/common
      - libs/database
      - proto/order.proto
    tests:
      - tests/order-service
      - tests/integration/order
  payment-service:
    path: services/payment-service
    depends_on:
      - libs/common
      - libs/payment-sdk
      - proto/payment.proto
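To make the map actionable, a script has to turn a list of changed files into a list of affected services. The following is a minimal sketch of that lookup, not the script used in the pipelines described here. It assumes the map has already been loaded into a dict (shown inline so the sketch needs no YAML parser), and the names `SERVICES_MAP`, `is_under`, and `affected_services` are hypothetical.

```python
from pathlib import PurePosixPath

# Assumption: mirrors .ci/services-map.yml as a plain dict (no YAML parser needed).
SERVICES_MAP = {
    "order-service": {
        "path": "services/order-service",
        "depends_on": ["libs/common", "libs/database", "proto/order.proto"],
    },
    "payment-service": {
        "path": "services/payment-service",
        "depends_on": ["libs/common", "libs/payment-sdk", "proto/payment.proto"],
    },
}


def is_under(changed_file: str, root: str) -> bool:
    """True if changed_file is root itself or lives somewhere below it."""
    file_path, root_path = PurePosixPath(changed_file), PurePosixPath(root)
    return file_path == root_path or root_path in file_path.parents


def affected_services(changed_files: list[str]) -> list[str]:
    """Return services whose own path or any dependency contains a changed file."""
    affected = []
    for name, spec in SERVICES_MAP.items():
        watched = [spec["path"], *spec["depends_on"]]
        if any(is_under(f, root) for f in changed_files for root in watched):
            affected.append(name)
    return sorted(affected)
```

Matching on path components rather than string prefixes avoids a false hit when one directory name is a prefix of another, such as a hypothetical `services/order-service-v2`.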
Cache Strategy
Cache is the biggest lever for speed. But it has its pitfalls.
Types of Cache
| Type | What We Cache | When to Invalidate |
|---|---|---|
| Dependency | node_modules, .m2, pip | lockfile change |
| Build | compiled artifacts | source change |
| Docker layers | base images | Dockerfile change |
| Test | test fixtures | test data change |
GitHub Actions Example
- name: Cache dependencies
  uses: actions/cache@v4
  with:
    path: |
      ~/.npm
      node_modules
    key: deps-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      deps-
- name: Cache build
  uses: actions/cache@v4
  with:
    path: dist
    key: build-${{ github.sha }}
    restore-keys: |
      build-${{ github.event.pull_request.base.sha }}
      build-
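The dependency key above works because it is derived from lockfile content: the key changes exactly when a lockfile changes. The sketch below illustrates that idea in Python. It is not GitHub's actual `hashFiles` algorithm, and `cache_key` is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def cache_key(prefix: str, repo_root: Path, pattern: str = "**/package-lock.json") -> str:
    """Build a key that changes only when a matching lockfile changes."""
    digest = hashlib.sha256()
    # Sort for a stable order; glob order is not guaranteed.
    for lockfile in sorted(repo_root.glob(pattern)):
        # Include the relative path so moving a lockfile also changes the key.
        digest.update(lockfile.relative_to(repo_root).as_posix().encode("utf-8"))
        digest.update(lockfile.read_bytes())
    return f"{prefix}-{digest.hexdigest()[:16]}"
```

Editing a README leaves the key untouched, so the cache is still hit. Editing any `package-lock.json` produces a new key and forces a fresh install.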
Remote Cache for Build Tools
For larger projects, a local cache isn’t enough. Use a remote cache:
Gradle:
// settings.gradle.kts
buildCache {
    remote<HttpBuildCache> {
        url = uri("https://cache.mycompany.com/cache/")
        isPush = System.getenv("CI") != null
    }
}
Bazel:
# .bazelrc
build --remote_cache=grpcs://cache.mycompany.com
build --remote_upload_local_results=true
Turborepo:
{
  "remoteCache": {
    "teamId": "my-team",
    "signature": true
  }
}
Parallelization Without Cost Explosion
More parallel jobs mean faster pipelines, but also higher cost. How do you find the balance?
Test Sharding
test:
  strategy:
    matrix:
      shard: [1, 2, 3, 4]
  steps:
    - run: npm test -- --shard=${{ matrix.shard }}/4
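If your test runner has no built-in sharding, the split can be done in a small script. The sketch below shows one possible approach, a deterministic hash-based assignment. It is not how Jest's `--shard` flag works internally, and `shard_for` and `select_shard` are hypothetical names.

```python
import zlib


def shard_for(test_file: str, total_shards: int) -> int:
    """Map a test file to a 1-based shard index, stable across runs and machines."""
    # crc32 is deterministic, unlike Python's built-in hash() for strings.
    return zlib.crc32(test_file.encode("utf-8")) % total_shards + 1


def select_shard(test_files: list[str], shard: int, total_shards: int) -> list[str]:
    """Return the test files that belong to the given 1-based shard."""
    if not 1 <= shard <= total_shards:
        raise ValueError(f"shard must be in 1..{total_shards}, got {shard}")
    return [f for f in sorted(test_files) if shard_for(f, total_shards) == shard]
```

Each file lands in exactly one shard, so the shards together cover the whole suite without overlap. Hashing balances by file count, not by runtime. If a few slow tests dominate, assigning by recorded duration gives a more even split.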
Fail Fast
strategy:
  fail-fast: true # Cancels the remaining matrix jobs on first failure (the default)
  matrix:
    service: [order, payment, inventory]
Limit Concurrency
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true # Cancels older runs of the same branch
Selective Testing
Not all tests need to run every time.
Minimal Model
Change in service A → Run:
- Unit tests A
- Integration tests A
- Contract tests A ↔ dependencies
Medium Model (Dependency Graph)
# Script that analyzes imports/dependencies
- name: Detect affected tests
  run: |
    ./scripts/affected-tests.sh > affected.txt
- name: Run affected tests
  run: |
    # -r: skip the run when the list is empty instead of running every test
    xargs -r npm test -- < affected.txt
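The core of such a script is a walk over the reversed dependency graph: start at the changed modules and collect everything that depends on them, directly or transitively. The sketch below shows that walk, assuming the import graph has already been extracted into a dict. The functions and the graph shape are hypothetical, not the contents of `affected-tests.sh`.

```python
from collections import defaultdict, deque


def affected_modules(imports: dict[str, list[str]], changed: set[str]) -> set[str]:
    """Return the changed modules plus everything that depends on them, transitively."""
    # Invert the graph: dependency -> modules that import it.
    dependents = defaultdict(set)
    for module, deps in imports.items():
        for dep in deps:
            dependents[dep].add(module)

    seen = set(changed)
    queue = deque(changed)
    while queue:  # breadth-first walk up the dependents
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def affected_tests(
    imports: dict[str, list[str]],
    changed: set[str],
    tests_by_module: dict[str, list[str]],
) -> list[str]:
    """Collect the test paths registered for every affected module."""
    modules = affected_modules(imports, changed)
    return sorted({t for m in modules for t in tests_by_module.get(m, [])})
```

A change in a leaf library fans out to every service above it, while a change in a single service selects only that service's tests.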
Test Impact Analysis
For enterprise projects, dedicated tools exist:
- Launchable - ML-based test selection
- Gradle Test Retry - smart retry for flaky tests
- Jest --changedSince - built-in for JS
Quality Gates
What to block, what to just report?
| Check | PR Block | Main Block | Comment |
|---|---|---|---|
| Lint | Yes | Yes | Fast, unambiguous |
| Unit tests | Yes | Yes | Basic functionality |
| Integration | No (report) | Yes | Can be flaky |
| Coverage drop | No (report) | Yes | Trend matters more |
| Security HIGH | Yes | Yes | Critical |
| Security MED | No (report) | No | Triage |
Practical Implementation
- name: Check coverage
  run: |
    # With Jest's default text reporter, field 10 of the "All files" row is % Lines
    COVERAGE=$(npm test -- --coverage | grep "All files" | awk '{print $10}')
    THRESHOLD=80
    if (( $(echo "$COVERAGE < $THRESHOLD" | bc -l) )); then
      echo "::warning::Coverage $COVERAGE% is below $THRESHOLD%"
      # Don't block, just warn
    fi
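The gate table in this section can also be encoded as data, so that one script decides what blocks and what is only reported. This is a minimal sketch. The check names and the `evaluate` function are hypothetical, and the block/report values are copied from the table.

```python
# check name -> (blocks on PR, blocks on main), copied from the quality gate table.
GATES = {
    "lint": (True, True),
    "unit": (True, True),
    "integration": (False, True),
    "coverage_drop": (False, True),
    "security_high": (True, True),
    "security_med": (False, False),
}


def evaluate(failed_checks: list[str], trigger: str) -> tuple[list[str], list[str]]:
    """Split failed checks into (blocking, report_only) for trigger 'pr' or 'main'."""
    if trigger not in ("pr", "main"):
        raise ValueError(f"unknown trigger: {trigger}")
    column = 0 if trigger == "pr" else 1
    blocking, report_only = [], []
    for check in failed_checks:
        target = blocking if GATES[check][column] else report_only
        target.append(check)
    return blocking, report_only
```

The pipeline fails only when the first list is non-empty. The second list goes into a PR comment or a job summary.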
Security in Pipeline (Without Theater)
Security scanning is useful only if you do something with the results.
When to Run What
| Type | When | Block? |
|---|---|---|
| SAST (Semgrep, CodeQL) | Every PR | HIGH = yes |
| Dependency scan | Every PR + nightly | Critical = yes |
| DAST | Nightly | No, triage |
| Container scan | Before push to registry | HIGH = yes |
SBOM (Software Bill of Materials)
An SBOM is a list of all components in your software. It is a prerequisite for supply-chain security.
- name: Generate SBOM
  uses: anchore/sbom-action@v0
  with:
    format: spdx-json
    output-file: sbom.spdx.json
- name: Upload SBOM
  uses: actions/upload-artifact@v4
  with:
    name: sbom
    path: sbom.spdx.json
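An SBOM only pays off if something reads it. As a sketch of what that can look like, the following lists components from an SPDX 2.x JSON document and checks whether a given package is present, for example when a new CVE is announced. It assumes the standard `packages`, `name`, and `versionInfo` fields. The function names are hypothetical.

```python
import json


def list_components(sbom_json: str) -> list[tuple[str, str]]:
    """Return sorted (name, version) pairs from an SPDX 2.x JSON document."""
    document = json.loads(sbom_json)
    return sorted(
        (pkg["name"], pkg.get("versionInfo", "unknown"))
        for pkg in document.get("packages", [])
    )


def find_component(components: list[tuple[str, str]], name: str) -> list[str]:
    """Return every version of the named package found in the SBOM."""
    return [version for pkg_name, version in components if pkg_name == name]
```

In a pipeline this would read `sbom.spdx.json` from the uploaded artifact. Here the input is a string so the sketch stays self-contained.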
SLSA (Supply-chain Levels for Software Artifacts)
SLSA defines security levels for the build process:
- Level 1: Documented build
- Level 2: Hosted build service
- Level 3: Hardened build (what you want to achieve)
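provenance: # job-level call: this is a reusable workflow, not a step
  permissions:
    id-token: write # sign the provenance
    contents: write # upload release assets
    actions: read # read workflow metadata
  uses: slsa-framework/slsa-github-generator/.github/workflows/builder_go_slsa3.yml@v1.10.0 # example tag, pin a full release tag
  with:
    go-version: "1.21"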
Artifact Signing
- name: Sign container image
  run: |
    # --yes skips the interactive confirmation prompt (cosign 2.x)
    cosign sign --yes --key env://COSIGN_PRIVATE_KEY \
      ${{ env.REGISTRY }}/${{ env.IMAGE }}:${{ github.sha }}
- name: Verify signature
  run: |
    cosign verify --key env://COSIGN_PUBLIC_KEY \
      ${{ env.REGISTRY }}/${{ env.IMAGE }}:${{ github.sha }}
Release Strategy
Semantic Versioning
- name: Determine version
  id: version
  uses: paulhatch/semantic-version@v5
  with:
    major_pattern: "BREAKING CHANGE:"
    minor_pattern: "feat:"
- name: Create release
  env:
    GH_TOKEN: ${{ github.token }} # the gh CLI needs a token in CI
  run: |
    gh release create v${{ steps.version.outputs.version }} \
      --generate-notes
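The bump rule behind those two patterns is simple enough to show directly. The sketch below is a simplified illustration, not the action's implementation: it assumes plain substring matching on commit messages and a `MAJOR.MINOR.PATCH` input, and `next_version` is a hypothetical name.

```python
def next_version(current: str, commit_messages: list[str]) -> str:
    """Bump a MAJOR.MINOR.PATCH version based on commit message markers."""
    major, minor, patch = (int(part) for part in current.split("."))
    # A breaking change wins over everything else.
    if any("BREAKING CHANGE:" in message for message in commit_messages):
        return f"{major + 1}.0.0"
    # Any feature commit bumps the minor version and resets patch.
    if any("feat:" in message for message in commit_messages):
        return f"{major}.{minor + 1}.0"
    # Everything else is treated as a patch release.
    return f"{major}.{minor}.{patch + 1}"
```

The order of the checks matters: a release that contains both a feature and a breaking change must become a major release.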
Changelog Automation
- name: Generate changelog
  uses: orhun/git-cliff-action@v2
  with:
    config: cliff.toml
    args: --latest
  env:
    OUTPUT: CHANGELOG.md
Pipeline Anti-patterns
What to avoid:
- “Build everything always” - 90% of time wasted
- Flaky tests without quarantine - erode CI trust
- Secrets in logs - set +x isn’t enough, use masking
- Mega-jobs - one 30 min job vs. 10 jobs of 3 min
- No job artifacts - debugging impossible
Reference Implementation
Minimal Pipeline (Starter)
# .github/workflows/ci.yml
name: CI
on: [push, pull_request]
jobs:
  detect:
    runs-on: ubuntu-latest
    outputs:
      services: ${{ steps.detect.outputs.services }}
    steps:
      - uses: actions/checkout@v4
      - id: detect
        # Must write a JSON array to $GITHUB_OUTPUT, e.g. services=["order-service"]
        run: ./scripts/detect-changes.sh
  build-test:
    needs: detect
    if: needs.detect.outputs.services != '[]' # an empty matrix fails the run
    runs-on: ubuntu-latest
    strategy:
      matrix:
        service: ${{ fromJson(needs.detect.outputs.services) }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/cache@v4
        with:
          path: ~/.npm # npm ci deletes node_modules, so cache npm's download cache
          key: deps-${{ hashFiles('package-lock.json') }}
      - run: npm ci
      - run: npm run build --workspace=${{ matrix.service }}
      - run: npm test --workspace=${{ matrix.service }}
Metrics and Observability
What to measure:
| Metric | Source | Target |
|---|---|---|
| Lead time | CI timestamps | < 15 min |
| Pipeline duration | CI metrics | Decreasing trend |
| Flakiness rate | Test reruns | < 1% |
| Change failure rate | Rollback count | < 5% |
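Two of these metrics are plain ratios that can be computed from data the CI system already has. The sketch below shows one possible definition of each. The `RunRecord` shape and the function names are hypothetical, and a test counts as flaky here when it both failed and passed within the same pipeline run.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    name: str
    attempts: list[bool]  # outcome of each attempt in order, True = passed


def flakiness_rate(records: list[RunRecord]) -> float:
    """Share of tests that both failed and passed within one pipeline run."""
    if not records:
        return 0.0
    flaky = sum(1 for r in records if True in r.attempts and False in r.attempts)
    return flaky / len(records)


def change_failure_rate(deployments: int, rollbacks: int) -> float:
    """Share of deployments that had to be rolled back."""
    if deployments == 0:
        return 0.0
    return rollbacks / deployments
```

A test that fails on every attempt is broken, not flaky, so it does not count toward the flakiness rate.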
Practical Dashboard
- name: Report metrics
  run: |
    curl -X POST https://metrics.mycompany.com/ci \
      -d "pipeline_duration=${{ steps.timer.outputs.duration }}" \
      -d "tests_passed=${{ steps.tests.outputs.passed }}" \
      -d "tests_failed=${{ steps.tests.outputs.failed }}"
Conclusion: 10 Things to Improve by Tomorrow
- Add path filters - fastest speedup
- Enable dependency cache
- Set fail-fast: true
- Add concurrency with cancel-in-progress
- Split mega-jobs into smaller ones
- Enable SAST at least for HIGH findings
- Generate SBOM
- Set up secrets masking
- Add timing metrics
- Document pipeline in README
Your next step: Measure your current pipeline time. Implement path filters. Measure again. You’ll see the difference within an hour.
Frequently Asked Questions (FAQ)
Is monorepo the right choice for our team?
A monorepo works best when you have shared libraries between services or need atomic changes across multiple components. If your services are completely independent, a polyrepo might be simpler.
How much does remote build cache cost?
It depends on the provider. Self-hosted solutions (e.g., Gradle Enterprise, now called Develocity) cost around $50-100k USD/year. Cloud solutions (Turborepo Cloud, Nx Cloud) have free tiers and you pay for usage.
How to handle flaky tests?
Implement quarantine - automatically move flaky tests to a “quarantine” suite that runs nightly but doesn’t block PRs. Set up alerts for new flaky tests and fix them within 48 hours.
What is SLSA and do I need it?
SLSA is a framework for supply-chain security. Level 3 is recommended for production software. Yes, you need it - supply-chain attacks are becoming increasingly common.
Related Articles
- Architecture as Code: ADR, C4 Diagrams and CI Quality Gates - How to document architectural decisions and automate their validation
- Zero-Downtime PostgreSQL Migrations: Expand/Contract, Backfill and Rollback Strategies - Safe database migrations you can integrate into your CI/CD