A/B Testing Framework
1. Objective
The objective of this workflow is to scientifically prove causality. Does Model B actually drive more revenue than Model A, or was it just a sunny Saturday? A/B testing (Split Testing) is the gold standard for validating improvements in production. This workflow ensures tests are statistically valid, powered correctly, and analyzed without bias.
2. Context & Scope
In Scope
This workflow covers Hypothesis formulation, Power Analysis (Sample Size calculation), Randomization strategy (hashing), Experiment Launch, and Statistical Analysis (T-tests/Bayesian).
Assumption: You have an experimentation platform or feature flag system (e.g., LaunchDarkly, Optimizely, or home-grown) to handle the traffic splitting.
Out of Scope
- Offline Evaluation: Calculating AUC/RMSE on historical data is "Train Model". A/B testing is Online Evaluation.
- Canary Release: Releasing to 1% to check for bugs is Deployment, not Experimentation (though they overlap).
3. When to Use / When Not to Use
✅ Use This Workflow When
- Rolling out a new Recommendation Engine.
- Changing the UI layout significantly.
- Validating if a "Personalized" email subject line works better than generic.
❌ Do NOT Use This Workflow When
- You have too little traffic (Need 1000s of samples for significance).
- The change is a bug fix (Just fix it).
- You cannot randomize users (e.g., TV ads).
4. Inputs (Required/Optional)
Required Inputs
| Input | Description | Format | Example |
|---|---|---|---|
| HYPOTHESIS | What we expect to happen. | Text | "New model increases Click Rate by 5%." |
| METRIC | The success signal. | String | CTR, Conversion Rate |
| VARIANTS | Configuration of groups. | JSON | A: Control, B: New Model |
Optional Inputs
| Input | Description | Default | Condition |
|---|---|---|---|
| ALPHA | False Positive acceptable rate. | 0.05 | 5% standard. |
5. Outputs (Artifacts)
| Artifact | Format | Destination | Quality Criteria |
|---|---|---|---|
| Power Analysis | Calculation | Doc | Defines required N samples. |
| Final Report | Notebook/PDF | Wiki | States "Significant" or "Inconclusive". |
6. Operating Modes
⚡ Fast Mode
Timebox: 1 week. Scope: Simple "Do No Harm". Details: Run a 50/50 split for 3 days to ensure the new model doesn't crash conversion rates; low statistical power is accepted.
🎯 Standard Mode (Default)
Timebox: 2 weeks. Scope: Rigorous Inference. Details: Calculate sample size targeting 80% power. Run for 2 full business cycles (14 days) to eliminate day-of-week seasonality. Analyze with a T-test/Z-test.
🔬 Deep Mode
Timebox: 1 month. Scope: Heterogeneous Treatment Effects. Details: Analyze not just "Did it win?" but "Who did it win for?". Break down results by user segment. Use CUPED (variance reduction) to reach significance faster.
7. Constraints & Guardrails
Technical Constraints
- SRM (Sample Ratio Mismatch): If you aimed for 50/50 but got 52/48 (a statistically significant deviation at typical experiment sizes; verify with a chi-square test on the counts), your randomization is broken. STOP. Do not analyze. Fix the bug.
- Interference: If User A interacts with User B (e.g., a marketplace), treating A might affect B. Requires Switchback testing or Cluster randomization.
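The SRM guardrail above can be automated with a chi-square goodness-of-fit test. A minimal sketch (function name and the 0.001 alpha are illustrative; the threshold matches the Validation table in this document):

```python
from scipy.stats import chisquare

def srm_check(n_a: int, n_b: int, expected_ratio: float = 0.5,
              alpha: float = 0.001) -> bool:
    """True if the observed A/B counts are consistent with the intended split."""
    total = n_a + n_b
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    _, p_value = chisquare([n_a, n_b], f_exp=expected)
    # A tiny p-value means the split itself is broken: stop, do not analyze.
    return bool(p_value > alpha)

print(srm_check(50_000, 50_100))  # True  -- normal random wobble
print(srm_check(50_000, 52_000))  # False -- 52/48 at this scale is a bug
```

Note that 52/48 "looks close" but is a near-certain randomization bug at 100k users, which is why eyeballing the ratio is not enough.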
Security & Privacy
CAUTION
User Consent: Ensure your Terms of Service allow for experimentation. Some regulated industries (Health/Finance) have strict rules against treating users differently for critical services (Fairness).
Compliance
- Opt-Out: Respect user "Do Not Track" or experiment opt-out settings where legislated (GDPR/CCPA).
8. Procedure
Phase 1: Design & Power
Objective: Plan the math.
Define the METRIC. Ensure it is sensitive (moves easily) and proximal (happens soon after treatment). Estimate the Baseline (p) and Minimum Detectable Effect (MDE): "Current Conversion is 10%. We want to detect a lift to 10.5% (5% relative)." Run a Power Analysis to calculate how many users (N) are needed: N = power_calc(baseline=0.10, mde=0.005, alpha=0.05, power=0.80). If N exceeds available traffic, you cannot run a valid test. Abort or increase the MDE.
Verify: A documented Plan stating "We need 50k users per variant over 14 days."
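The power_calc call above is pseudocode; one concrete way to implement it is the standard normal-approximation formula for comparing two proportions, a minimal sketch of which is:

```python
from scipy.stats import norm

def sample_size_two_proportions(p_base: float, mde_abs: float,
                                alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-variant N for a two-sided two-proportion z-test (normal approximation)."""
    p_new = p_base + mde_abs
    p_bar = (p_base + p_new) / 2
    z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # ~0.84 for 80% power
    n = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
         + z_beta * (p_base * (1 - p_base) + p_new * (1 - p_new)) ** 0.5) ** 2
    return int(n / mde_abs ** 2) + 1

# Detecting a 10.0% -> 10.5% lift needs roughly 58k users per variant:
print(sample_size_two_proportions(0.10, 0.005))
```

This illustrates why small MDEs are expensive: halving the MDE roughly quadruples the required N.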
Phase 2: Launch & Monitor
Objective: Gather data.
Configure the Splitter. Use a deterministic hash: if hash(user_id + "salt") % 100 < 50: Group A. Launch the test. Day 1 Check: Monitor SRM. Is the split close to the intended 50/50? Are errors spiking in Treatment? Do Not Peek: do not stop the test early just because it looks "Significant" on Day 3; repeated early checking inflates the False Positive rate (the Peeking Problem). Run for the pre-calculated duration.
Verify: Experiment is running and the SRM check passes (chi-square p > 0.001 on the split counts).
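The deterministic splitter described above can be sketched as follows (the salt string is illustrative; note that Python's built-in hash() is randomized per process, so a stable hash such as MD5 must be used for bucketing):

```python
import hashlib

def assign_variant(user_id: str, salt: str = "exp-2026-reco",
                   split_pct: int = 50) -> str:
    """Deterministically bucket a user; the same user always gets the same variant."""
    # Never use built-in hash() here: it is salted per process, so assignments
    # would change between restarts and across servers.
    digest = hashlib.md5(f"{user_id}:{salt}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "A" if bucket < split_pct else "B"

# Stable assignment, roughly 50/50 over many users:
groups = [assign_variant(f"user_{i}") for i in range(10_000)]
print(groups.count("A"))  # close to 5,000
```

Changing the salt re-randomizes everyone, which is how you prevent carry-over between consecutive experiments on the same population.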
Phase 3: Analysis & Decision
Objective: Read the tea leaves.
Aggregate the data. Calculate Mean and Variance for A and B. Calculate the p-value.
- If p < 0.05: Statistically Significant. Reject the Null.
- If p >= 0.05: Inconclusive. The effect cannot be distinguished from noise.
In either case, check Guardrail Metrics (e.g., Latency, Cancellation Rate). If the Primary Metric is up but Latency is up 200ms, it may be a bad trade.
Decision:
- Ship: Success + No Guardrail damage.
- Rollback: Negative result, or Inconclusive (no proven benefit, so avoid the maintenance cost of keeping the change).
Verify: A generated PDF report with Confidence Intervals (e.g., "Lift is +3.2% [1.1%, 5.3%]")
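The analysis step in this phase can be sketched as a two-proportion z-test that also produces the confidence interval the report requires (the conversion counts below are illustrative):

```python
import math
from scipy.stats import norm

def ab_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int, alpha: float = 0.05):
    """Two-sided two-proportion z-test plus a CI on the absolute lift (B - A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled SE for the hypothesis test
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_pool
    p_value = 2 * (1 - norm.cdf(abs(z)))
    # Unpooled SE for the confidence interval on the lift
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    half = norm.ppf(1 - alpha / 2) * se
    return p_value, (p_b - p_a - half, p_b - p_a + half)

p, ci = ab_ztest(conv_a=5_000, n_a=50_000, conv_b=5_400, n_b=50_000)
print(f"p={p:.2e}, absolute lift CI=[{ci[0]:.4f}, {ci[1]:.4f}]")
```

Report the interval, not just the p-value: "Lift is +0.8pp [+0.4pp, +1.2pp]" tells the decision-maker both direction and magnitude.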
9. Technical Considerations
CUPED: Use "Controlled-experiment Using Pre-Experiment Data" (CUPED) to reduce variance. By regressing against the user's pre-experiment behavior, you can detect smaller lifts with fewer samples.
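A minimal CUPED sketch, assuming each user has a pre-experiment covariate (e.g., prior spend) correlated with the in-experiment metric; the synthetic data below only demonstrates the variance reduction:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """Adjust in-experiment metric y using pre-experiment covariate x_pre.
    theta is the regression slope of y on x_pre; the adjusted metric keeps the
    same mean but its variance shrinks by roughly corr(x_pre, y)**2."""
    theta = np.cov(y, x_pre)[0, 1] / np.var(x_pre)
    return y - theta * (x_pre - x_pre.mean())

rng = np.random.default_rng(0)
x = rng.normal(10, 3, 10_000)              # pre-experiment behavior
y = 0.8 * x + rng.normal(0, 1, 10_000)     # correlated in-experiment metric
y_adj = cuped_adjust(y, x)
print(np.var(y_adj) / np.var(y))           # well below 1: smaller N needed
```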
Novelty Effect: Users might click a new blue button just because it is new (and conversely, under the Primacy effect, habituated users may initially resist any change). Run the test long enough for these transient effects to wear off.
Multiple Hypothesis Testing: If you look at 100 metrics at alpha 0.05, about 5 will appear significant by chance alone. Use a Bonferroni correction when checking many metrics.
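The Bonferroni correction mentioned above is a one-liner: divide alpha by the number of metrics tested. A minimal sketch:

```python
def bonferroni_significant(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Which p-values survive a Bonferroni correction for len(p_values) tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Four metrics -> per-metric threshold 0.05 / 4 = 0.0125:
print(bonferroni_significant([0.001, 0.02, 0.04, 0.30]))
# [True, False, False, False]
```

Bonferroni is conservative; it controls false positives at the cost of power, which is acceptable for secondary metrics but argues for a single pre-registered primary metric.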
10. Quality Gates (Definition of Done)
Checklist
- [ ] Power Analysis performed pre-test.
- [ ] SRM check passed.
- [ ] Test ran for full duration.
- [ ] P-value calculated correctly.
Validation
| Criterion | Method | Threshold |
|---|---|---|
| Randomization | Chi-Square Test | p-value > 0.001 (Null is "Split is Even") |
| Data Quality | Outlier check | Remove bots > 99th percentile |
11. Failure Modes & Recovery
| Failure Mode | Symptoms | Recovery Action |
|---|---|---|
| SRM Fail | 60% users in A, 40% in B. | Randomization bug. Invalidate test. Fix hashing logic. |
| Underpowered | "Inconclusive" but clearly positive trend. | You didn't run long enough for that MDE. Extend test (caution) or accept risk. |
| Polluted Data | Bots skew metrics. | Apply heuristic filters (e.g., exclude UserAgents = 'GoogleBot'). |
12. Copy-Paste Prompt
TIP
One-Click Agent Invocation Copy the prompt below, replace placeholders, and paste into your agent.
Role: Act as a Senior AI & ML Engineer / Experimentation Specialist.
Task: Execute the A/B Testing Framework workflow.
## Objective & Scope
- **Goal**: Scientifically prove causality of model or product changes via controlled experiments.
- **Scope**: Hypothesis formulation, Power Analysis, Randomization, Launch, and Statistical Analysis.
## Inputs
- [ ] HYPOTHESIS: What we expect to happen.
- [ ] METRIC: The success signal.
- [ ] VARIANTS: Configuration of groups.
- [ ] ALPHA: False Positive acceptable rate (Optional).
## Output Artifacts
- [ ] Power Analysis (Calculation)
- [ ] Final Report (Notebook/PDF)
## Execution Steps
1. **Design & Power**
- Define metric and run Power Analysis to calculate required sample size (N).
2. **Launch & Monitor**
- Configure splitter, launch test, and strictly monitor for Sample Ratio Mismatch (SRM).
3. **Analysis & Decision**
- Compute p-values/Confidence Intervals, check guardrails, and decide Ship/Rollback.
## Quality Gates
- [ ] Power Analysis performed pre-test.
- [ ] SRM check passed.
- [ ] Test ran for full duration.
- [ ] P-value calculated correctly.
## Failure Handling
- If blocked, output a "Clarification Brief" detailing missing info or blockers.
## Constraints
- **Security**: Ensure User Consent and Fairness in regulated industries.
- **Privacy**: Respect Opt-Out/Do Not Track settings.
- **Technical**: Avoid Peeking and Interference; STOP if SRM fails.
## Command
Now execute this workflow step-by-step.
Appendix: Change Log
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2026-01-14 | AI Engineering Team | Initial release |