
Trust and Evidence

This page is the shortest path to the evidence behind drift.

It is designed for teams that need to answer three practical questions before rollout: what can we trust today, what should we verify locally, and where is the raw material?

Public claims safe to repeat

  • drift uses 15 scoring signals in the composite score (TVS currently at weight 0.0 pending re-validation)
  • the conservative public benchmark claim is 77% strict precision / 95% lenient on the historical v0.5 six-signal baseline
  • the current study corpus covers 15 real-world repositories
  • all analysis is deterministic and does not use an LLM in the detector pipeline
  • package metadata currently uses the PyPI Beta classifier, while rollout guidance remains conservative because maturity still varies across the broader product surface

Security boundary evidence

For repository-analyzer deployments, the most relevant security question is how drift behaves on untrusted file trees and git metadata.

Current security controls are documented in SECURITY.md, including path normalization, symlink skipping, a 5 MB per-file guardrail, hardened git subprocess usage, and the guarantee that parsers never execute the code they analyze.
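The file-tree controls above can be illustrated with a minimal sketch. This is not drift's actual implementation; the function name and structure are illustrative, and only the 5 MB limit comes from the documented controls:

```python
import os

MAX_FILE_BYTES = 5 * 1024 * 1024  # documented 5 MB per-file guardrail

def safe_files(root: str):
    """Yield normalized file paths under `root`, skipping symlinks and oversized files."""
    root = os.path.realpath(root)
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        # prune symlinked directories so the walk never follows them
        dirnames[:] = [d for d in dirnames
                       if not os.path.islink(os.path.join(dirpath, d))]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # symlink skipping
            norm = os.path.realpath(path)
            if not norm.startswith(root + os.sep):
                continue  # path normalization: refuse escapes from the tree
            if os.path.getsize(norm) > MAX_FILE_BYTES:
                continue  # per-file size guardrail
            yield norm
```

The regression tests listed below cover the real behavior; a sketch like this is only useful for reasoning about what the controls are supposed to rule out.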

Supporting regression tests include:

  • tests/test_git_history_safety.py
  • tests/test_file_discovery.py
  • tests/test_cache_resilience.py

Benchmark evidence

  • Conservative public benchmark claim: 77% strict precision / 95% lenient on a score-weighted sample of 286 findings across 5 repositories in the historical v0.5 six-signal baseline (non-circular heuristic classification; 51 Disputed findings where only score evidence was available — independent multi-rater validation pending)
  • precision has not been revalidated for the current 15-signal model; treat the v0.5 number as a historical reference point, not as a blanket claim for the current composite model
  • v0.7.1 (2026-03-27): 88% detection recall on a controlled mutation benchmark of 17 injected patterns across 10 signal types (synthetic repo with git history; 2 patterns undetected: 1 return-pattern fragmentation variant, 1 system-misalignment below threshold)
  • v0.5 baseline (historical): 86% detection rate on a 14-pattern benchmark
  • self-analysis of drift reports a score of 0.442 (MEDIUM)

These numbers are summarized from Benchmark Study and should be interpreted together with the limitations below.
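One plausible reading of the strict/lenient split, assuming Disputed findings count against strict precision but get the benefit of the doubt under lenient (the authoritative definitions live in the Benchmark Study; the counts below are illustrative, not the study's actual label distribution):

```python
def precision(tp: int, fp: int, disputed: int) -> tuple[float, float]:
    """Strict treats Disputed as failures; lenient treats them as acceptable."""
    total = tp + fp + disputed
    strict = tp / total
    lenient = (tp + disputed) / total
    return strict, lenient

# Illustrative counts only — chosen to show how 51 Disputed findings
# out of 286 can open a wide gap between the two numbers.
strict, lenient = precision(tp=220, fp=15, disputed=51)
print(round(strict, 2), round(lenient, 2))  # 0.77 0.95
```

The point is mechanical: the more Disputed findings a sample contains, the further strict and lenient precision diverge, which is why the Disputed count is reported alongside both numbers.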

What the evidence means

The strongest current claim is not that every finding is equally reliable.

The strongest current claim is that drift provides a deterministic and inspectable process, with benchmarked evidence, for surfacing structural drift patterns that teams can calibrate against their own repository.

Known limitations to keep visible

  • classification uses non-circular structural heuristics; 51 of 286 findings are Disputed (score-only evidence, no structural confirmation)
  • the labeled precision sample is score-weighted and over-represents higher-confidence findings
  • TVS shows 0% strict precision because titles lack structural keywords — this reflects a classification-method limitation, not necessarily signal quality
  • AVS sample size (n=20) remains below the n≥30 threshold for reliable per-signal precision
  • recall was measured on a synthetic mutation benchmark, not on every naturally evolving repository shape
  • DIA has known precision limitations and received a conservative weight (0.04) when promoted to scoring in v0.7.0
  • temporal signals depend on repository history quality and clone depth
  • the composite score is orientation, not a verdict

A conservative rollout sequence that keeps these limitations visible:

  1. start with report-only usage
  2. inspect findings per signal
  3. gate only on high-confidence cases
  4. tune based on recurring evidence in your own repo
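The sequence above can be sketched as a CI gate. The finding schema and the confidence field here are assumptions for illustration, not drift's actual output format or thresholds:

```python
# Hypothetical findings, shaped loosely like a report-only run's output.
findings = [
    {"signal": "DIA", "score": 0.31, "confidence": "low"},
    {"signal": "AVS", "score": 0.72, "confidence": "high"},
    {"signal": "TVS", "score": 0.55, "confidence": "disputed"},
]

def gate(findings, min_score=0.6):
    """Step 3: block only on high-confidence findings above an agreed threshold."""
    return [f for f in findings
            if f["confidence"] == "high" and f["score"] >= min_score]

blocking = gate(findings)
# Steps 1-2: report everything for inspection; only `blocking` affects exit status.
exit_code = 1 if blocking else 0
```

Step 4 then amounts to adjusting `min_score` and the confidence filter as recurring evidence accumulates in your own repository.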

How to talk about release maturity

Use a split message instead of a single blanket adjective:

  • package classifier: Beta
  • core Python analysis: stable
  • CI and SARIF workflow: stable
  • TypeScript support: experimental
  • embeddings-based parts: optional / experimental
  • benchmark methodology: evolving

This keeps external communication honest without turning the Beta classifier into a blanket production guarantee for every surface.

See Stability and Release Status.

Where the detailed material lives

Repository artifacts

The raw benchmark and audit artifacts live in the repository alongside the docs. That matters because teams can inspect the material instead of relying on a black-box headline metric.

Useful starting points:

  • benchmark_results/all_results.json — aggregate scores, file counts, and finding totals for every corpus repository in a single file
  • benchmark_results/ground_truth_analysis.json — the labeled precision-analysis matrix (TP / FP / Disputed per signal)
  • benchmark_results/archive/validation/ground_truth_labels.json — underlying per-finding ground-truth labels used by the precision analysis
  • benchmark_results/mutation_benchmark.json — controlled recall benchmark (17 injected mutation patterns, detection rate per signal)
  • benchmark_results/archive/validation/holdout_validation.json — leave-one-out cross-validation snapshots kept apart from the main analysis narrative
  • benchmark_results/archive/single_repo/fastapi.json, benchmark_results/archive/single_repo/pydantic.json, and benchmark_results/archive/single_repo/django.json — full per-repository analysis output used by the case studies
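A tolerant first pass over all_results.json might look like the sketch below. The keys used here are assumptions about the schema, not documented guarantees, so treat this as a template for your own inspection script:

```python
import json

def summarize(results: dict) -> str:
    """Summarize an aggregate results dict; key names are schema assumptions."""
    repos = results.get("repositories", [])
    total = sum(r.get("finding_count", 0) for r in repos)
    return f"{len(repos)} repos, {total} findings"

# In a real checkout:
# with open("benchmark_results/all_results.json") as f:
#     print(summarize(json.load(f)))
```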

These artifacts are most useful when read together with Benchmarking and Trust and Benchmark Study, because the methodology and the limits determine how the numbers should be interpreted.

What to verify locally before making policy decisions

  • whether your strongest findings line up with places the team already finds expensive to change
  • whether generated or exceptional directories should be excluded
  • whether boundary violations reflect real architecture intent or acceptable convenience imports
  • whether temporal signals have enough git history to be meaningful in your clone
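The last check can be done mechanically with plain git. The 50-commit floor below is an arbitrary illustration, not drift's threshold:

```python
import subprocess

def commit_count(repo_path: str = ".") -> int:
    """Count reachable commits; a shallow clone reports only what was fetched."""
    out = subprocess.run(
        ["git", "rev-list", "--count", "HEAD"],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

def history_is_meaningful(count: int, min_commits: int = 50) -> bool:
    """Illustrative floor: temporal signals need enough history to show a trend."""
    return count >= min_commits
```

If your CI uses shallow clones, the count this reports will be the fetch depth rather than the real history, which is exactly the situation that degrades temporal signals.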

Next pages