# Real-World Testing Playbook
This playbook defines a repeatable way to evaluate DevCD under real local conditions on Windows. It is designed for a half-day run and answers two questions:
- Does DevCD function correctly?
- How good is DevCD in practical agent workflows?
Use this together with the score template in `examples/reality-testing/scorecard-template.md`.
## Scope
- Environment: local Windows developer machine
- Timebox: about 4 hours
- Focus: real workflows, not synthetic unit-level behavior
- Exclusions: hosted/cloud deployment and long soak tests
## Entry Criteria
Before the session starts, confirm:
- Python and DevCD CLI are installed.
- Workspace is on a known branch and in a known git state.
- You can run local commands in PowerShell.
Recommended baseline commands:
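The exact baseline commands are not pinned down here; a minimal sketch that checks the entry criteria above might look like this (the `devcd --version` flag is an assumption about the CLI, not confirmed surface):

```shell
# Confirm the interpreter and CLI are on PATH (--version flags assumed)
python --version
devcd --version

# Confirm the workspace branch and git state are known
git status --short --branch
git log --oneline -1
```

Record the output in the command log so the run starts from a documented state.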
## Test Phases

### Phase A: Baseline and Warm-Start
Goal: prove the first-run and handoff path works quickly.
Commands:
```shell
devcd welcome
devcd onboard --preview
devcd onboard --yes
devcd agentic action-packet
devcd context passport
```
Expected signals:
- Clear next-step guidance in `welcome` and onboarding output.
- Action Packet contains a usable next action.
- Passport reflects current continuity without needing a recap.
Stop conditions:
- Onboard cannot prepare local workspace.
- Action Packet is not usable after onboarding.
### Phase B: Live Runtime and Control Plane
Goal: verify daemon mode, ingestion, and context transparency.
Commands:
```shell
devcd run
devcd event ide test_passed --payload '{"suite":"reality","case":"live-loop"}'
devcd context control
devcd context budget
```
Expected signals:
- Daemon starts on loopback.
- Event is accepted and reflected in context surfaces.
- Control/budget output is understandable and actionable.
Stop conditions:
- Event ingestion succeeds but context surfaces do not update.
- Control plane does not explain visible/withheld boundaries.
### Phase C: Recovery and Resilience
Goal: prove degraded-mode behavior and clean recovery.
Commands:
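The exact command list is not pinned down here; a plausible sequence, reusing `devcd run` and the `doctor --fix` repair path named in the expected signals (the `stop` subcommand is an assumption), is:

```shell
# Stop the daemon to force an explicit degraded state ("stop" is an assumed subcommand)
devcd stop

# Observe degraded behavior, then apply only safe local repairs
devcd doctor --fix

# Restart the daemon and confirm a full pass returns
devcd run
```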
Expected signals:
- Degraded state is explicit when daemon is down.
- `doctor --fix` applies only safe local repairs.
- Full pass returns after daemon restart.
Stop conditions:
- Recovery path is unclear or non-reproducible.
### Phase D: Policy and Security Boundaries
Goal: validate local-first and deny-by-default behavior.
Commands:
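No command list is pinned here; a hypothetical probe of the deny-by-default boundaries might look like the following (the `runner start` and `mcp check` subcommands are assumptions, not confirmed CLI surface):

```shell
# Expect a denial unless the runner is explicitly configured (subcommand assumed)
devcd runner start

# Expect read-only MCP behavior with passing shape checks (subcommand assumed)
devcd mcp check
```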
Expected signals:
- Runner start is denied by default unless explicitly configured.
- MCP integration remains read-only and shape checks pass.
Critical failure conditions:
- Unexpected remote export behavior.
- Unexpected write-capable MCP surface.
- Sensitive/raw payload leakage into user-facing context outputs.
### Phase E: Simulated Agent Handoff
Goal: measure practical handoff quality, not only technical correctness.
Procedure:
- End one working session with `devcd handoff` or `devcd capture` metadata.
- Start a fresh session using only `devcd agentic action-packet` and `devcd context passport`.
- Measure time to first useful action and the number of recap-style questions.
Pass signal:
- Fresh session starts productively from packet/passport without full recap.
## Quality Score (0-100)
Score DevCD with weighted dimensions:
- 30% Utility: fresh session reaches first useful action quickly.
- 25% Reliability: scenario pass rate without manual workarounds.
- 20% Recovery: degraded mode and recovery are reproducible.
- 15% Policy fidelity: deny/withheld/read-only behavior is correct.
- 10% Integration readiness: MCP/OpenClaw smoke path is stable.
Formula (each dimension scored 0-100):

score = 0.30 × utility + 0.25 × reliability + 0.20 × recovery + 0.15 × policy + 0.10 × integration
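As a worked example, the weighted dimensions combine like this (the dimension scores are hypothetical):

```shell
# Weighted quality score for hypothetical per-dimension scores (each 0-100)
awk 'BEGIN {
  utility = 90; reliability = 80; recovery = 75; policy = 85; integration = 70
  score = 0.30*utility + 0.25*reliability + 0.20*recovery + 0.15*policy + 0.10*integration
  print score
}'
```

With these inputs the weighted total is 81.75, which lands in the "strong pre-alpha" band below.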
Interpretation:
- 90-100: production-like confidence for local workflows
- 75-89: strong pre-alpha behavior with focused polish needs
- 60-74: usable but requires stabilization before broad adoption
- below 60: address blockers before wider rollout
## Minimal Evidence Pack
Collect these artifacts per run:
- Command log (executed commands + key outputs)
- Completed scorecard
- Top 3 strengths
- Top 3 improvements
- Go/No-Go recommendation for broader team usage
## Automated Outcome Evaluation
The manual playbook above gives qualitative signal across the full Phase A-E lifecycle. The automated outcome eval loop covers four specific regression-prone failure modes that cannot be detected by unit tests alone.
Run this before any release and after changes to the agentic context, compliance, or handoff paths.
### Four Failure Modes Under Automated Test
| Mode | What it measures | Pytest target |
|---|---|---|
| Turn-0 recap risk | Does the Action Packet carry enough context that an agent can start without asking "What is your goal?" | `test_action_packet_turn0_risk_*` in `tests/test_agentic_context.py` |
| Incorrect resource choice | Does the concise variant carry the required handoff fields? Does the escalation to detailed trigger only when needed? | `test_action_packet_json_includes_turn0_risk_low_after_handoff` in `tests/test_cli.py` |
| Staleness drift | Is a goal older than 24 h flagged with `staleness_flag=true` and a warning in the compliance report? | `test_action_packet_staleness_flag_*` and `test_compliance_json_shows_staleness_warning_for_old_goal` |
| False completion resistance | Does the completion gate warn when an agent claims done without having read the Action Packet in the current session? | `test_completion_check_warns_when_handoff_exists_but_packet_never_read` |
### Automated Eval Commands
Run the targeted outcome eval suite:
```shell
python -m pytest tests/test_agentic_context.py -q -k "turn0_risk or staleness"
python -m pytest tests/test_cli.py -q -k "eval_signal or consumption_gap or staleness"
```
Run all agentic outcome tests together:
```shell
python -m pytest tests/test_agentic_context.py tests/test_cli.py tests/test_internal_agentic_benchmark.py -q -k "agentic or completion_check or compliance or staleness or turn0 or eval_signal or consumption"
```
### Outcome Signals in JSON Output
The following signals are now present in `devcd agentic action-packet --json`, `devcd agentic compliance --json`, and `devcd agentic completion-check --json`:
Action Packet (`devcd agentic action-packet --json`):
- `turn0_risk`: `"low"` when goal and next action are both present, `"medium"` when only the goal is available, `"high"` when neither is set.
- `goal_age_seconds`: seconds since the most recent goal capture event in the ledger, or `null` when no goal capture exists.
- `staleness_flag`: `true` when `goal_age_seconds >= 86400` (24 h).
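To spot-check these fields during a run, something like the following works (assumes `jq` is installed locally):

```shell
# Pull just the outcome signals from the Action Packet JSON
devcd agentic action-packet --json \
  | jq '{turn0_risk, goal_age_seconds, staleness_flag}'
```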
Compliance / Completion Gate (`devcd agentic compliance --json`):
- `metrics.action_packet_reads`: count of `hook:action-packet.before` events, i.e. how many times an agent read the Action Packet in this session.
- `completion_gate.signals.turn0_risk`, `staleness_flag`, `goal_age_seconds`, `packet_consumed_this_session`: see above.
- `completion_gate.warnings` may include:
  - `consumption_gap`: completion claimed but no Action Packet read recorded.
  - `staleness`: goal older than 24 h, with the age in hours.
## Verification Gates
Run these before closing the session:
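The gate list is not pinned down here; at minimum, rerunning the combined outcome eval suite from the Automated Eval Commands section is a reasonable closing gate:

```shell
python -m pytest tests/test_agentic_context.py tests/test_cli.py tests/test_internal_agentic_benchmark.py -q -k "agentic or completion_check or compliance or staleness or turn0 or eval_signal or consumption"
```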
Use `make distribution` when evaluating release readiness in the same cycle.