# Real-World Testing Playbook
This playbook defines a repeatable way to evaluate DevCD under real local conditions on Windows. It is designed for a half-day run and answers two questions:
- Does DevCD function correctly?
- How good is DevCD in practical agent workflows?
Use this together with the score template in `examples/reality-testing/scorecard-template.md`.
## Scope
- Environment: local Windows developer machine
- Timebox: about 4 hours
- Focus: real workflows, not synthetic unit-level behavior
- Exclusions: hosted/cloud deployment and long soak tests
## Entry Criteria
Before the session starts, confirm:
- Python and DevCD CLI are installed.
- Workspace is on a known branch and in a known git state.
- You can run local commands in PowerShell.
Recommended baseline commands:
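The exact baseline commands are not pinned down here; a minimal sketch that checks the entry criteria above might look like this (the `devcd --version` flag is an assumption about the CLI, not confirmed surface):

```shell
# Confirm the interpreter and CLI are on PATH (--version flags assumed)
python --version
devcd --version

# Confirm the workspace branch and git state are known
git status --short --branch
git log --oneline -1
```

Record the output in the command log so the run starts from a documented state.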
## Test Phases

### Phase A: Baseline and Warm-Start
Goal: prove the first-run and handoff path works quickly.
Commands:
```shell
devcd welcome
devcd onboard --preview
devcd onboard --yes
devcd agentic action-packet
devcd context passport
```
Expected signals:
- Clear next-step guidance in `welcome` and onboarding output.
- Action Packet contains a usable next action.
- Passport reflects current continuity without needing a recap.
Stop conditions:
- Onboard cannot prepare local workspace.
- Action Packet is not usable after onboarding.
### Phase B: Live Runtime and Control Plane
Goal: verify daemon mode, ingestion, and context transparency.
Commands:
```shell
devcd run
devcd event ide test_passed --payload '{"suite":"reality","case":"live-loop"}'
devcd context control
devcd context budget
```
Expected signals:
- Daemon starts on loopback.
- Event is accepted and reflected in context surfaces.
- Control/budget output is understandable and actionable.
Stop conditions:
- Event ingestion succeeds but context surfaces do not update.
- Control plane does not explain visible/withheld boundaries.
### Phase C: Recovery and Resilience
Goal: prove degraded-mode behavior and clean recovery.
Commands:
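The exact command list is not pinned down here; a plausible sequence, reusing `devcd run` and the `doctor --fix` repair path named in the expected signals (the `stop` subcommand is an assumption), is:

```shell
# Stop the daemon to force an explicit degraded state ("stop" is an assumed subcommand)
devcd stop

# Observe degraded behavior, then apply only safe local repairs
devcd doctor --fix

# Restart the daemon and confirm a full pass returns
devcd run
```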
Expected signals:
- Degraded state is explicit when daemon is down.
- `doctor --fix` applies only safe local repairs.
- Full pass returns after daemon restart.
Stop conditions:
- Recovery path is unclear or non-reproducible.
### Phase D: Policy and Security Boundaries
Goal: validate local-first and deny-by-default behavior.
Commands:
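No command list is pinned here; a hypothetical probe of the deny-by-default boundaries might look like the following (the `runner start` and `mcp check` subcommands are assumptions, not confirmed CLI surface):

```shell
# Expect a denial unless the runner is explicitly configured (subcommand assumed)
devcd runner start

# Expect read-only MCP behavior with passing shape checks (subcommand assumed)
devcd mcp check
```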
Expected signals:
- Runner start is denied by default unless explicitly configured.
- MCP integration remains read-only and shape checks pass.
Critical failure conditions:
- Unexpected remote export behavior.
- Unexpected write-capable MCP surface.
- Sensitive/raw payload leakage into user-facing context outputs.
### Phase E: Simulated Agent Handoff
Goal: measure practical handoff quality, not only technical correctness.
Procedure:
- End one working session with `devcd handoff` or `devcd capture` metadata.
- Start a fresh session using only `devcd agentic action-packet` and `devcd context passport`.
- Measure time to first useful action and the number of recap-style questions.
Pass signal:
- Fresh session starts productively from packet/passport without full recap.
## Quality Score (0-100)
Score DevCD with weighted dimensions:
- 30% Utility: fresh session reaches first useful action quickly.
- 25% Reliability: scenario pass rate without manual workarounds.
- 20% Recovery: degraded mode and recovery are reproducible.
- 15% Policy fidelity: deny/withheld/read-only behavior is correct.
- 10% Integration readiness: MCP/OpenClaw smoke path is stable.
Formula (each dimension scored 0-100):

score = 0.30 × utility + 0.25 × reliability + 0.20 × recovery + 0.15 × policy + 0.10 × integration
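As a worked example, the weighted dimensions combine like this (the dimension scores are hypothetical):

```shell
# Weighted quality score for hypothetical per-dimension scores (each 0-100)
awk 'BEGIN {
  utility = 90; reliability = 80; recovery = 75; policy = 85; integration = 70
  score = 0.30*utility + 0.25*reliability + 0.20*recovery + 0.15*policy + 0.10*integration
  print score
}'
```

With these inputs the weighted total is 81.75, which lands in the "strong pre-alpha" band below.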
Interpretation:
- 90-100: production-like confidence for local workflows
- 75-89: strong pre-alpha behavior with focused polish needs
- 60-74: usable but requires stabilization before broad adoption
- below 60: address blockers before wider rollout
## Minimal Evidence Pack
Collect these artifacts per run:
- Command log (executed commands + key outputs)
- Completed scorecard
- Top 3 strengths
- Top 3 improvements
- Go/No-Go recommendation for broader team usage
## Automated Outcome Evaluation
The manual playbook above gives qualitative signal across the full Phase A-E lifecycle. The automated outcome eval loop covers four specific regression-prone failure modes that cannot be detected by unit tests alone.
Run this before any release and after changes to the agentic context, compliance, or handoff paths.
### Four Failure Modes Under Automated Test
| Mode | What it measures | Pytest target |
|---|---|---|
| Turn-0 recap risk | Does the Action Packet carry enough context that an agent can start without asking "What is your goal?" | `test_action_packet_turn0_risk_*` in `tests/test_agentic_context.py` |
| Incorrect resource choice | Does the concise variant carry the required handoff fields? Does the escalation to detailed trigger only when needed? | `test_action_packet_json_includes_turn0_risk_low_after_handoff` in `tests/test_cli.py` |
| Staleness drift | Is a goal older than 24 h flagged with `staleness_flag=true` and a warning in the compliance report? | `test_action_packet_staleness_flag_*` and `test_compliance_json_shows_staleness_warning_for_old_goal` |
| False completion resistance | Does the completion gate warn when an agent claims done without having read the Action Packet in the current session? | `test_completion_check_warns_when_handoff_exists_but_packet_never_read` |
### Automated Eval Commands
Run the targeted outcome eval suite:
```shell
python -m pytest tests/test_agentic_context.py -q -k "turn0_risk or staleness"
python -m pytest tests/test_cli.py -q -k "eval_signal or consumption_gap or staleness"
```
Run all agentic outcome tests together:
```shell
python -m pytest tests/test_agentic_context.py tests/test_cli.py tests/test_internal_agentic_benchmark.py -q -k "agentic or completion_check or compliance or staleness or turn0 or eval_signal or consumption"
```
### Outcome Signals in JSON Output
The following signals are now present in `devcd agentic action-packet --json`, `devcd agentic compliance --json`, and `devcd agentic completion-check --json`:
Action Packet (`devcd agentic action-packet --json`):
- `turn0_risk`: `"low"` when goal and next action are both present, `"medium"` when only the goal is available, `"high"` when neither is set.
- `goal_age_seconds`: seconds since the most recent goal capture event in the ledger, or `null` when no goal capture exists.
- `staleness_flag`: `true` when `goal_age_seconds >= 86400` (24 h).
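To spot-check these fields during a run, something like the following works (assumes `jq` is installed locally):

```shell
# Pull just the outcome signals from the Action Packet JSON
devcd agentic action-packet --json \
  | jq '{turn0_risk, goal_age_seconds, staleness_flag}'
```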
Compliance / Completion Gate (`devcd agentic compliance --json`):
- `metrics.action_packet_reads`: count of `hook:action-packet.before` events, i.e. how many times an agent read the Action Packet in this session.
- `completion_gate.signals.turn0_risk`, `staleness_flag`, `goal_age_seconds`, `packet_consumed_this_session`: see above.
- `completion_gate.warnings` may include:
  - `consumption_gap`: completion claimed but no Action Packet read recorded.
  - `staleness`: goal older than 24 h, with the age in hours.
## Verification Gates
Run these before closing the session:
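The gate list is not pinned down here; at minimum, rerunning the combined outcome eval suite from the Automated Eval Commands section is a reasonable closing gate:

```shell
python -m pytest tests/test_agentic_context.py tests/test_cli.py tests/test_internal_agentic_benchmark.py -q -k "agentic or completion_check or compliance or staleness or turn0 or eval_signal or consumption"
```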
Use `make distribution` when evaluating release readiness in the same cycle.