The scale problem is not what you think
Most AppSec teams start by solving the tool problem: which scanner should we use? At scale, the tool is the easy part. The hard problems are:
- Inventory: You cannot scan repos you don't know exist. Large organisations routinely have 30–40% of their repositories undiscovered by the security team.
- Result volume: 10,000 repos × 50 findings per repo = 500,000 findings. Without deduplication, prioritisation, and routing, this is noise, not signal.
- Ownership: Who fixes a finding in a repo that was last touched three years ago? Findings without owners are findings that stay open forever.
- CI cost: Running a 5-minute SAST scan on every PR across 10,000 active repos adds up to significant compute cost and latency.
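To make the CI cost concrete, a back-of-the-envelope calculation (the PR rate is a hypothetical assumption, not a measured figure):

```python
# Rough CI scan cost estimate; all inputs are illustrative assumptions
repos = 10_000
prs_per_repo_per_day = 5       # hypothetical average for active repos
scan_minutes_per_pr = 5

compute_hours_per_day = repos * prs_per_repo_per_day * scan_minutes_per_pr / 60
print(round(compute_hours_per_day))  # ≈ 4167 compute-hours of scanning per day
```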
The unmaintained repo problem: In large organisations, 60–80% of repositories have had no commits in the past 6 months. These repos still contain vulnerabilities, and those vulnerabilities still matter, especially if the code is deployed somewhere. Scanning schedules need to account for repos that have no CI activity.
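One way to account for dormant repos is to derive a scan cadence from the last-commit date rather than from CI activity. A minimal sketch, with illustrative tier thresholds:

```python
from datetime import datetime, timedelta, timezone

def scan_interval(last_commit: datetime) -> timedelta:
    """Map repository activity to a scan cadence (thresholds are illustrative)."""
    age = datetime.now(timezone.utc) - last_commit
    if age < timedelta(days=30):
        return timedelta(days=1)    # active: daily, in addition to per-PR scans
    if age < timedelta(days=180):
        return timedelta(days=7)    # cooling off: weekly
    return timedelta(days=30)       # dormant: monthly full scan

# A repo untouched for a year still gets scanned, just less often
print(scan_interval(datetime.now(timezone.utc) - timedelta(days=365)))
```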
Step 1: Build a complete repository inventory
Before scanning anything, build a complete, authoritative list of all repositories. This sounds obvious but is rarely done. Sources to enumerate:
- GitHub/GitLab organisations – use the API to list all repos, including archived and private
- On-premise Git servers (Bitbucket, Gitea, self-hosted GitLab)
- CI system job definitions – repos that have pipeline configs but may not be in your primary SCM
- Package registries – internal npm/PyPI/Maven packages often point back to source repos
```bash
# Enumerate all repos (paginated) using GitHub CLI
gh repo list MY_ORG \
  --limit 10000 \
  --json name,url,isArchived,updatedAt \
  --jq '.[] | [.name, .url, .isArchived, .updatedAt] | @csv' \
  > repo_inventory.csv

# Count by language
gh repo list MY_ORG --limit 10000 --json primaryLanguage \
  --jq 'group_by(.primaryLanguage.name) | map({lang: .[0].primaryLanguage.name, count: length})'
```
Enrich the inventory with metadata: primary language, last commit date, deployment status (is this code running in production?), and team ownership. This becomes the basis for prioritisation and routing.
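A minimal enrichment pass over the exported inventory might look like the sketch below. The `owners` mapping and the staleness threshold are assumptions for illustration; in practice the ownership data would come from a service catalogue or CODEOWNERS.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=180)  # illustrative staleness threshold

def enrich(row: dict, owners: dict) -> dict:
    """Attach ownership and staleness flags to one inventory row."""
    updated = datetime.fromisoformat(row["updatedAt"].replace("Z", "+00:00"))
    row["owner"] = owners.get(row["name"], "UNOWNED")  # UNOWNED rows route to AppSec
    row["dormant"] = datetime.now(timezone.utc) - updated > STALE_AFTER
    return row

# Hypothetical rows, shaped like the gh inventory export above
owners = {"payments-api": "team-payments"}
inventory = [
    {"name": "payments-api", "updatedAt": "2025-06-01T12:00:00Z"},
    {"name": "old-batch-job", "updatedAt": "2019-01-15T08:30:00Z"},
]
enriched = [enrich(row, owners) for row in inventory]
```

Rows flagged `UNOWNED` or `dormant` feed directly into the routing and scheduling decisions described below.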
Scan architecture: two complementary patterns
Pattern 1: CI-embedded scanning (for active repos)
For repos with active development, embed scanning directly in the CI pipeline. This catches new vulnerabilities as they are introduced. The challenge at scale is standardising the pipeline config across thousands of repos without manual effort.
Solution: use a centralised, versioned pipeline template that teams include in their CI config. Both GitHub and GitLab support reusable workflow components.
```yaml
jobs:
  security-scan:
    uses: my-org/.github/.github/workflows/security-scan.yml@main
    secrets: inherit
```
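The template being referenced lives in the org-level `.github` repository. A minimal sketch of what that central workflow might declare (the step contents here are illustrative, not a prescribed configuration):

```yaml
# .github/workflows/security-scan.yml in the my-org/.github repo
on:
  workflow_call:   # makes this workflow reusable from other repos

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run SAST scan
        run: semgrep scan --config=auto --error .
```

Because every repo pins the same `@main` reference, updating the central file rolls out scanner changes to all teams at once.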
Pattern 2: Scheduled bulk scanning (for dormant repos)
For repos with no CI activity, run scheduled scans from a central security scanning service. This requires a scanning worker that can clone repositories, run analysis, and push results to a central findings store.
```python
import concurrent.futures
import json
import subprocess
import tempfile

def scan_repo(repo_url: str) -> dict:
    # Clone into a per-repo temp dir so parallel workers do not collide
    with tempfile.TemporaryDirectory() as workdir:
        # Shallow clone (depth=1) to minimise bandwidth
        subprocess.run(["git", "clone", "--depth=1", repo_url, workdir], check=True)
        result = subprocess.run(
            ["semgrep", "scan", "--config=auto", "--json", workdir],
            capture_output=True, text=True,
        )
    return json.loads(result.stdout)

# Parallelise across 20 workers
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    futures = {pool.submit(scan_repo, url): url for url in repo_urls}
    for future in concurrent.futures.as_completed(futures):
        findings = future.result()
        push_to_findings_store(futures[future], findings)
```
Incremental and diff-based scanning
Full-repo scans on every commit do not scale. The solution is diff-aware scanning: only analyse the files changed in a pull request, not the entire codebase.
Both Semgrep and AquilaX support diff-aware mode. In GitHub Actions, this looks like:
```yaml
- name: Run diff-aware scan
  env:
    SEMGREP_BASELINE_REF: origin/main
  run: |
    semgrep scan \
      --config=auto \
      --baseline-commit=$(git merge-base HEAD origin/main) \
      --json \
      --output=findings.json .
```
Full scans on a schedule: Run diff-aware scans on every PR, but schedule full-repo scans weekly to catch vulnerabilities that exist in unchanged code. This separates "new vulnerabilities introduced by this PR" from "existing vulnerabilities in the codebase".
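That separation can be enforced in triage by comparing fingerprints: anything present in the PR scan but absent from the latest full-scan baseline was introduced by the PR. A sketch, assuming each finding carries a stable `fingerprint` field:

```python
def new_findings(pr_scan: list[dict], baseline: list[dict]) -> list[dict]:
    """Findings present in the PR scan but absent from the main-branch baseline."""
    baseline_fps = {f["fingerprint"] for f in baseline}
    return [f for f in pr_scan if f["fingerprint"] not in baseline_fps]

baseline = [{"fingerprint": "sqli:app.py:42"}]
pr_scan = [
    {"fingerprint": "sqli:app.py:42"},   # pre-existing: do not block the PR
    {"fingerprint": "xss:views.py:17"},  # introduced here: block or report
]
print(new_findings(pr_scan, baseline))  # [{'fingerprint': 'xss:views.py:17'}]
```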
Result routing and triage at scale
500,000 findings need to go somewhere useful. A central findings store with routing logic:
- Deduplicate by fingerprint. The same vulnerability pattern in the same file/line should appear as one finding, even across multiple scans. Most scanners produce stable fingerprints for this.
- Route by CODEOWNERS. GitHub's `CODEOWNERS` file maps file paths to team owners. Use this to automatically assign findings to the right engineering team.
- Prioritise by exposure. A critical finding in a repo that serves production traffic at 10M RPM is more urgent than the same finding in an internal tool used by 5 people. Enrich findings with deployment context.
- Create tickets automatically. For SLA-governed findings, auto-create Jira/Linear tickets with full context so teams do not need to pull findings from a separate dashboard.
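When a scanner does not emit stable fingerprints, one common fallback, sketched here, is to hash the rule ID, file path, and whitespace-normalised code snippet, deliberately excluding the line number, which shifts as the file is edited:

```python
import hashlib

def fingerprint(rule_id: str, path: str, snippet: str) -> str:
    """Stable identity for a finding across rescans; whitespace-normalised."""
    normalised = " ".join(snippet.split())
    raw = f"{rule_id}|{path}|{normalised}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

# The same finding reported twice (different indentation) collapses to one key
a = fingerprint("python.sqli", "app/db.py", "cursor.execute(q)")
b = fingerprint("python.sqli", "app/db.py", "  cursor.execute(q)  ")
assert a == b
```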
Policy as code: enforce at the platform level
Manual review of security findings does not scale. Policy as code allows you to define rules programmatically: which findings block deployment, which create tickets, which are informational.
```rego
package security.gate

# Block deployment if any critical severity finding exists
deny[msg] {
    finding := input.findings[_]
    finding.severity == "CRITICAL"
    finding.suppressed == false
    msg := sprintf("Critical finding in %v: %v", [finding.file, finding.rule_id])
}

# Warn but allow for high severity
warn[msg] {
    finding := input.findings[_]
    finding.severity == "HIGH"
    msg := sprintf("High severity: %v", [finding.rule_id])
}
```
Metrics that actually measure security posture
Vanity metrics (total scans run, total findings found) do not tell you if security is improving. Track instead:
- Mean time to remediation (MTTR) by severity – how long does it take from finding to fix?
- Coverage rate – what percentage of active repos have been scanned in the past 7 days?
- New critical findings per week – is the introduction rate going up or down?
- Suppression rate – are findings being suppressed at an increasing rate? This can indicate alert fatigue, not improving security.
- Repos without owners – findings in ownerless repos have no one to fix them.
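MTTR by severity falls directly out of the findings store if each finding records when it was detected and when it was resolved. A sketch over assumed record fields:

```python
from collections import defaultdict
from datetime import datetime

def mttr_by_severity(findings: list[dict]) -> dict[str, float]:
    """Mean days from detection to fix, per severity, over resolved findings only."""
    buckets = defaultdict(list)
    for f in findings:
        if f.get("resolved_at"):
            opened = datetime.fromisoformat(f["created_at"])
            closed = datetime.fromisoformat(f["resolved_at"])
            buckets[f["severity"]].append((closed - opened).days)
    return {sev: sum(days) / len(days) for sev, days in buckets.items()}

findings = [  # hypothetical findings-store records
    {"severity": "CRITICAL", "created_at": "2025-01-01", "resolved_at": "2025-01-04"},
    {"severity": "CRITICAL", "created_at": "2025-01-01", "resolved_at": "2025-01-06"},
    {"severity": "HIGH", "created_at": "2025-01-01", "resolved_at": None},  # still open
]
print(mttr_by_severity(findings))  # {'CRITICAL': 4.0}
```

Open findings are excluded from the mean; tracking their age separately avoids hiding long-lived unresolved criticals inside a healthy-looking MTTR.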
Scale security scanning across your entire organisation
AquilaX is built for engineering organisations with hundreds to thousands of repositories: central policy management, automated routing, and a findings API that integrates with your existing toolchain.
Talk to the enterprise team →