
A/B Testing Framework


1. Objective

The objective of this workflow is to scientifically prove causality. Does Model B actually drive more revenue than Model A, or was it just a sunny Saturday? A/B testing (Split Testing) is the gold standard for validating improvements in production. This workflow ensures tests are statistically valid, powered correctly, and analyzed without bias.


2. Context & Scope

In Scope

This workflow covers Hypothesis formulation, Power Analysis (Sample Size calculation), Randomization strategy (hashing), Experiment Launch, and Statistical Analysis (T-tests/Bayesian).

Assumption: You have an experimentation platform or feature flag system (e.g., LaunchDarkly, Optimizely, or home-grown) to handle the traffic splitting.

Out of Scope

  • Offline Evaluation: Calculating AUC/RMSE on historical data is "Train Model". A/B testing is Online Evaluation.
  • Canary Release: Releasing to 1% to check for bugs is Deployment, not Experimentation (though they overlap).

3. When to Use / When Not to Use

Use This Workflow When

  • Rolling out a new Recommendation Engine.
  • Changing the UI layout significantly.
  • Validating if a "Personalized" email subject line works better than generic.

Do NOT Use This Workflow When

  • You have too little traffic (thousands of samples per variant are needed for significance).
  • The change is a bug fix (Just fix it).
  • You cannot randomize users (e.g., TV ads).

4. Inputs (Required/Optional)

Required Inputs

| Input | Description | Format | Example |
| --- | --- | --- | --- |
| HYPOTHESIS | What we expect to happen. | Text | "New model increases Click Rate by 5%." |
| METRIC | The success signal. | String | CTR, Conversion Rate |
| VARIANTS | Configuration of groups. | JSON | A: Control, B: New Model |

Optional Inputs

| Input | Description | Default | Condition |
| --- | --- | --- | --- |
| ALPHA | Acceptable false-positive rate. | 0.05 | 5% standard. |

5. Outputs (Artifacts)

| Artifact | Format | Destination | Quality Criteria |
| --- | --- | --- | --- |
| Power Analysis | Calculation | Doc | Defines required N samples. |
| Final Report | Notebook/PDF | Wiki | States "Significant" or "Inconclusive". |

6. Operating Modes

Fast Mode

Timebox: 1 week
Scope: Simple "Do No Harm".
Details: Running a 50/50 split for 3 days to ensure the new model doesn't crash conversion rates. Low statistical power accepted.

🎯 Standard Mode (Default)

Timebox: 2 weeks
Scope: Rigorous Inference.
Details: Calculating sample size targeting 80% power. Running for 2 full business cycles (14 days) to eliminate day-of-week seasonality. Analysis using T-test/Z-test.

🔬 Deep Mode

Timebox: 1 month
Scope: Heterogeneous Treatment Effects.
Details: Analyzing not just "Did it win?" but "Who did it win for?". Breaking down results by user segment. Using CUPED (variance reduction) to reach significance faster.


7. Constraints & Guardrails

Technical Constraints

  • SRM (Sample Ratio Mismatch): If you aimed for 50/50 but got 52/48, your randomization is broken. STOP. Do not analyze. Fix the bug.
  • Interference: If User A interacts with User B (e.g., a marketplace), treating A might affect B. Requires Switchback testing or Cluster randomization.
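The SRM guardrail can be automated with a chi-square goodness-of-fit check. A minimal sketch using only the Python standard library; `srm_check` is an illustrative helper name, not a library function:

```python
from math import sqrt
from statistics import NormalDist

def srm_check(n_a: int, n_b: int, expected_ratio: float = 0.5) -> float:
    """Chi-square p-value for a Sample Ratio Mismatch check.

    Null hypothesis: the observed split matches the expected ratio.
    A tiny p-value (e.g. < 0.001) means randomization is likely broken.
    """
    total = n_a + n_b
    exp_a = total * expected_ratio
    exp_b = total * (1 - expected_ratio)
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # With 1 degree of freedom, chi2 = Z^2, so the survival function is
    # P(chi2 > x) = 2 * (1 - Phi(sqrt(x))).
    return 2 * (1 - NormalDist().cdf(sqrt(chi2)))

# A 52/48 split over 100k users is wildly improbable under a true 50/50 split:
p = srm_check(52_000, 48_000)
```

At large sample sizes even a 52/48 split yields a vanishing p-value, which is why the rule is STOP, not "analyze anyway".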

Security & Privacy

CAUTION

User Consent: Ensure your Terms of Service allow for experimentation. Some regulated industries (Health/Finance) have strict rules against treating users differently for critical services (Fairness).

Compliance

  • Opt-Out: Respect user "Do Not Track" or experiment opt-out settings where legislated (GDPR/CCPA).

8. Procedure

Phase 1: Design & Power

Objective: Plan the math.

Define the METRIC. Ensure it is sensitive (moves easily) and proximal (happens soon). Estimate the Baseline (p) and the Minimum Detectable Effect (MDE): "Current conversion is 10%. We want to detect a lift to 10.5% (5% relative)." Run a Power Analysis to calculate how many users (N) are needed: N = power_calc(baseline=0.10, mde=0.005, alpha=0.05, power=0.80). If N exceeds the available traffic, you cannot run a valid test. Abort or increase the MDE.

Verify: A documented Plan stating "We need 50k users per variant over 14 days."
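A `power_calc` like the one in the plan can be implemented with the standard normal-approximation formula for a two-proportion test. A self-contained sketch (the function name mirrors the pseudocode in the phase above; only the standard library is used):

```python
from math import ceil, sqrt
from statistics import NormalDist

def power_calc(baseline: float, mde: float, alpha: float = 0.05,
               power: float = 0.80) -> int:
    """Required sample size per variant for a two-proportion z-test.

    `mde` is the absolute lift to detect (e.g. 0.005 for 10% -> 10.5%).
    """
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    num = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / mde ** 2)

n = power_calc(baseline=0.10, mde=0.005)  # roughly 58k users per variant
```

Note how quickly N shrinks as the MDE grows: halving the detectable lift roughly quadruples the required sample.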

Phase 2: Launch & Monitor

Objective: Gather data.

Configure the Splitter. Use a deterministic hash: if hash(user_id + "salt") % 100 < 50: Group A. Launch the test. Day 1 Check: Monitor SRM. Is the split actually 50/50? Are errors spiking in Treatment? Do Not Peek: Do not stop the test early just because it looks "Significant" on Day 3; this inflates the False Positive rate (the Peeking Problem). Run for the pre-calculated duration.
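The deterministic splitter can be sketched with a cryptographic hash so that assignment is stable per user and decorrelated across experiments via a per-experiment salt (the salt string below is a made-up example):

```python
import hashlib

def assign_variant(user_id: str, salt: str = "exp_2026_reco_v2") -> str:
    """Deterministically assign a user to variant A or B.

    The same user_id always lands in the same bucket; changing the
    per-experiment salt reshuffles users for the next experiment.
    """
    digest = hashlib.sha256(f"{user_id}:{salt}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # uniform bucket in [0, 100)
    return "A" if bucket < 50 else "B"
```

Avoid Python's built-in `hash()` here: it is salted per process, so assignments would not be stable across servers or restarts.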

Verify: Experiment is running, SRM is < 0.1% delta.

Phase 3: Analysis & Decision

Objective: Read the tea leaves.

Aggregate the data. Calculate the Mean and Variance for A and B. Calculate the p-value.

  • If p < 0.05: Statistically Significant. Reject the Null.
  • If p >= 0.05: Inconclusive. The effect cannot be distinguished from noise.

Then check Guardrail Metrics (e.g., Latency, Cancellation Rate). If the Primary Metric is up but Latency is up 200ms, it might be a bad trade.

Decision:

  • Ship: Success + No Guardrail damage.
  • Rollback: Inconclusive (avoid the maintenance cost of a change that adds nothing) or Negative.

Verify: A generated PDF report with Confidence Intervals (e.g., "Lift is +3.2% [1.1%, 5.3%]").
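The analysis step above can be sketched as a two-proportion z-test that returns both the p-value and a confidence interval on the lift. A standard-library sketch; `two_proportion_ztest` is an illustrative helper, not a named library function:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """p-value and confidence interval for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled standard error under the null (no difference) for the p-value.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_pool if se_pool else 0.0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval on the lift.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lift = p_b - p_a
    return p_value, (lift - z_crit * se, lift + z_crit * se)

# 10.0% vs 10.8% conversion over 50k users each:
p, (low, high) = two_proportion_ztest(conv_a=5_000, n_a=50_000,
                                      conv_b=5_400, n_b=50_000)
```

Reporting the interval alongside the p-value is what allows a statement like "Lift is +0.8pp [+0.4pp, +1.2pp]" in the final report.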


9. Technical Considerations

CUPED: Use "Controlled-experiment Using Pre-Experiment Data" (CUPED) to reduce variance. By regressing against the user's pre-experiment behavior, you can detect smaller lifts with fewer samples.

Novelty Effect: Users might click a new blue button just because it is new (novelty), or initially resist any change (primacy). Run the test long enough for these effects to wear off.

Multiple Hypothesis Testing: If you look at 100 metrics, about 5 will appear significant by chance (at alpha 0.05). Use a Bonferroni correction when checking many metrics.
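The Bonferroni correction is simple enough to sketch directly: each of the m metrics is tested at alpha / m, keeping the family-wise false-positive rate at alpha:

```python
def bonferroni(p_values, alpha=0.05):
    """Flag which p-values survive a Bonferroni correction.

    Each metric is tested at alpha / m, so the chance of ANY false
    positive across all m metrics stays at most alpha.
    """
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# With 10 metrics the per-metric threshold drops to 0.005,
# so p = 0.03 is no longer significant:
flags = bonferroni([0.001, 0.03, 0.2] + [0.5] * 7)
```

Bonferroni is conservative; it trades power for a hard guarantee, which is usually the right trade for a ship/rollback decision.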


10. Quality Gates (Definition of Done)

Checklist

  • [ ] Power Analysis performed pre-test.
  • [ ] SRM check passed.
  • [ ] Test ran for full duration.
  • [ ] P-value calculated correctly.

Validation

| Criterion | Method | Threshold |
| --- | --- | --- |
| Randomization | Chi-Square Test | p-value > 0.001 (Null is "Split is even") |
| Data Quality | Outlier check | Remove bots; cap metric values above the 99th percentile |

11. Failure Modes & Recovery

| Failure Mode | Symptoms | Recovery Action |
| --- | --- | --- |
| SRM Fail | 60% of users in A, 40% in B. | Randomization bug. Invalidate the test. Fix the hashing logic. |
| Underpowered | "Inconclusive" but a clearly positive trend. | The test didn't run long enough for that MDE. Extend the test (with caution) or accept the risk. |
| Polluted Data | Bots skew metrics. | Apply heuristic filters (e.g., exclude User-Agents like 'GoogleBot'). |

12. Copy-Paste Prompt

TIP

One-Click Agent Invocation Copy the prompt below, replace placeholders, and paste into your agent.

text
Role: Act as a Senior AI & ML Engineering Specialist.
Task: Execute the A/B Testing Framework workflow.

## Objective & Scope
- **Goal**: Scientifically prove causality of model or product changes via controlled experiments.
- **Scope**: Hypothesis formulation, Power Analysis, Randomization, Launch, and Statistical Analysis.

## Inputs
- [ ] HYPOTHESIS: What we expect to happen.
- [ ] METRIC: The success signal.
- [ ] VARIANTS: Configuration of groups.
- [ ] ALPHA: False Positive acceptable rate (Optional).

## Output Artifacts
- [ ] Power Analysis (Calculation)
- [ ] Final Report (Notebook/PDF)

## Execution Steps
1. **Design & Power**
   - Define metric and run Power Analysis to calculate required sample size (N).
2. **Launch & Monitor**
   - Configure splitter, launch test, and strictly monitor for Sample Ratio Mismatch (SRM).
3. **Analysis & Decision**
   - Compute p-values/Confidence Intervals, check guardrails, and decide Ship/Rollback.

## Quality Gates
- [ ] Power Analysis performed pre-test.
- [ ] SRM check passed.
- [ ] Test ran for full duration.
- [ ] P-value calculated correctly.

## Failure Handling
- If blocked, output a "Clarification Brief" detailing missing info or blockers.

## Constraints
- **Security**: Ensure User Consent and Fairness in regulated industries.
- **Privacy**: Respect Opt-Out/Do Not Track settings.
- **Technical**: Avoid Peeking and Interference; STOP if SRM fails.

## Command
Now execute this workflow step-by-step.

Appendix: Change Log

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0.0 | 2026-01-14 | AI Engineering Team | Initial release |
