Feature Selection Techniques
1. Objective
The objective of this workflow is to remove the noise and keep the signal. Using 500 features (when only 10 matter) leads to the "Curse of Dimensionality", slow training, and overfitting. This workflow systematically reduces the feature space using statistical and model-based methods, resulting in simpler, faster, and often more accurate models.
2. Context & Scope
In Scope
This workflow covers Filter methods (Correlation, Variance), Wrapper methods (RFE, Forward Selection), and Embedded methods (Lasso, Tree Importance). It also addresses Collinearity (removing redundant features).
Assumption: Data is already cleaned and encoded (numerical). Feature selection fails on raw strings.
Out of Scope
- Dimensionality Reduction: Techniques like PCA transform features into new latent variables. This workflow focuses on Selection (keeping original interpretable features).
- Feature Generation: Creating new features is "Feature Engineering".
3. When to Use / When Not to Use
✅ Use This Workflow When
- You have wide data (Columns > Rows).
- The model trains too slowly.
- You suspect overfitting (High Train Score, Low Test Score).
❌ Do NOT Use This Workflow When
- You have very few features (< 10). Keep them all unless proven harmful.
- You are using Deep Learning (which often performs its own feature extraction implicitly).
4. Inputs (Required/Optional)
Required Inputs
| Input | Description | Format | Example |
|---|---|---|---|
| DATASET | The features (X) and target (y). | Dataframe | df_train |
| PROBLEM_TYPE | Classification or Regression. | String | Classification |
Optional Inputs
| Input | Description | Default | Condition |
|---|---|---|---|
| MAX_FEATURES | Target count. | None | If strict limit needed. |
5. Outputs (Artifacts)
| Artifact | Format | Destination | Quality Criteria |
|---|---|---|---|
| Selected Columns | List[String] | JSON file | Subset of original columns. |
| Ranking Report | Plot/Table | Notebook | Explains why features were dropped. |
6. Operating Modes
⚡ Fast Mode
Timebox: 30 minutes. Scope: Correlation Filter. Details: Drop constant features (Variance = 0) and highly correlated features (> 0.95); also drop features with low correlation to the target.
🎯 Standard Mode (Default)
Timebox: 2 hours. Scope: Model-Based (RFE). Details: Use Recursive Feature Elimination (RFE) with a RandomForest or linear model to iteratively prune the least important features.
🔬 Deep Mode
Timebox: 1 day. Scope: Permutation Importance. Details: Use Permutation Importance (shuffle a column and check whether model error increases); it is model-agnostic and more reliable than impurity-based importance.
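Deep Mode can be sketched with scikit-learn's `permutation_importance`. This is a minimal example on synthetic data (the dataset and hyperparameters are illustrative, not prescribed by this workflow); note that importance is scored on the held-out set, not the training set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 10 features, only 3 carry signal.
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each column n_repeats times and measure the drop in test score.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
ranked = result.importances_mean.argsort()[::-1]
print("Top features:", ranked[:3])
```

Features whose shuffling barely moves the score are candidates for removal, regardless of what the model's own impurity-based importances claim.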
7. Constraints & Guardrails
Technical Constraints
- Leakage: Feature Selection MUST happen only on the Training Set. If you select features based on the whole dataset (Train+Test), you leak information.
- High-Cardinality Categoricals: Tree-based (impurity) importance is biased towards high-cardinality features (e.g., User ID). Use Permutation Importance to mitigate this.
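The leakage guardrail is easiest to enforce by putting the selector inside a scikit-learn `Pipeline`, so cross-validation re-fits selection on each training fold. A minimal sketch (synthetic data; `k=10` is an arbitrary illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)

# The selector is fitted on each fold's training split only,
# so the held-out fold never influences which features survive.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())
```

Running `SelectKBest` on the full dataset before splitting would leak test-set information into the selection step and inflate the score.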
Security & Privacy
CAUTION
Bias in Selection: Dropping features because they are "uncorrelated" might hide interaction effects. Be careful when removing sensitive attributes (Gender/Race); check fairness/bias implications first.
Compliance
- Explainability: Fewer features = More explainable. This is often a regulatory requirement (Right to Explanation).
8. Procedure
Phase 1: Filter Methods (Univariate)
Objective: Cheap sanity check.
Variance Threshold: Remove columns where std == 0 (constant); they add no value.
Correlation Matrix:
- Identify pairs with Correlation > 0.95 (Multicollinearity).
- Drop one of the pair (usually the one with lower correlation to the target).
- Do NOT trust Pearson correlation for non-linear relationships.
Verify: Dataset reduced by ~10-20% (garbage removal).
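Phase 1 can be sketched in a few lines of pandas. This is an illustrative toy DataFrame (the column names `a`, `const`, `b` are made up); the upper-triangle trick ensures each correlated pair is considered only once:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(size=100),
    "const": 1.0,                                          # zero variance
})
df["b"] = df["a"] * 0.99 + rng.normal(scale=0.01, size=100)  # near-duplicate of "a"

# 1) Variance threshold: drop constant columns.
df = df.loc[:, df.std() > 0]

# 2) Correlation filter: keep only the upper triangle so each
#    pair appears once, then drop one member of each pair > 0.95.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)
print(list(df.columns))
```

In practice you would prefer to drop the member of each pair with the weaker correlation to the target, as the bullet above says, rather than simply the second column encountered.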
Phase 2: Wrapper/Embedded Methods
Objective: Interaction check.
Lasso (L1 Regularization): For linear models, fit Lasso; it drives the coefficients of weak features to exactly 0. Drop them.
Tree Importance: For non-linear models, fit a Random Forest and read feature_importances_.
RFE (Recursive Feature Elimination):
- Train model on all features.
- Rank importance.
- Drop the bottom feature (or bottom 10%).
- Repeat until MAX_FEATURES is reached or performance drops.
This captures interactions that univariate filters miss.
Verify: A ranked list of features.
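The RFE loop above is exactly what scikit-learn's `RFE` implements. A minimal sketch on synthetic data (`n_features_to_select=10` stands in for `MAX_FEATURES`; `step=0.1` drops 10% per iteration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                           random_state=0)

# Train, rank, drop the bottom 10%, repeat until 10 features remain.
rfe = RFE(estimator=RandomForestClassifier(n_estimators=50, random_state=0),
          n_features_to_select=10, step=0.1)
rfe.fit(X, y)

print("Kept:", rfe.support_.sum(), "features")
print("Ranking (1 = selected):", rfe.ranking_[:5])
```

`rfe.ranking_` is the ranked list this phase asks you to verify: selected features get rank 1, and larger numbers mark features eliminated in earlier rounds.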
Phase 3: Validation
Objective: Prove it helps.
Train a model on all features and measure AUC/MSE. Train a model on the selected features and measure the same metric. Success Criteria:
- Performance did not drop significantly (e.g. < 1%).
- Performance might ideally increase (less noise).
- Training speed increased.
Verify: "Model with 20 features performs effectively the same as the model with 100."
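The baseline-vs-candidate comparison can be sketched like this (synthetic data; `k=20` is an arbitrary illustrative target count). Selection sits inside a `Pipeline` so the comparison itself is leakage-free:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           random_state=0)

# Baseline: all 100 features.
baseline = cross_val_score(LogisticRegression(max_iter=2000), X, y,
                           cv=5, scoring="roc_auc").mean()

# Candidate: selection + model, re-fitted per fold.
candidate_pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=2000)),
])
candidate = cross_val_score(candidate_pipe, X, y, cv=5,
                            scoring="roc_auc").mean()

print(f"baseline={baseline:.3f} candidate={candidate:.3f} "
      f"delta={candidate - baseline:+.3f}")
```

If the delta is within your tolerance (e.g. < 1% absolute AUC), the smaller feature set wins on speed and interpretability.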
9. Technical Considerations
Boruta Algorithm: A robust "all-relevant" feature selection wrapper built around Random Forest. It compares real features against randomized "shadow" copies to determine statistical significance. Highly recommended for tabular data.
Non-Linearity: Correlation (Pearson) only finds lines. Information Gain (Mutual Information) finds any relationship. Use Mutual Information for non-linear selection.
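A quick illustration of that point, using a synthetic quadratic relationship: Pearson correlation reports roughly zero, while mutual information clearly detects the dependence.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)
y = x ** 2 + rng.normal(scale=0.05, size=2000)  # non-linear, symmetric

# Pearson only measures linear association; a symmetric parabola cancels out.
pearson = np.corrcoef(x, y)[0, 1]

# Mutual information measures any statistical dependence.
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"pearson={pearson:.3f} mi={mi:.3f}")
```

A correlation filter would wrongly discard `x` here; a mutual-information filter would keep it.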
Time-Series: Random cross-validation for feature selection breaks in time-series (Lookahead bias). Use temporal splits.
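For the time-series case, scikit-learn's `TimeSeriesSplit` gives the temporal splits mentioned above; every training window ends before its test window begins, which is the property that prevents lookahead bias:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered observations

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    # Training data always precedes test data: no lookahead.
    assert train_idx.max() < test_idx.min()
    print(f"train [..{train_idx[-1]}] -> test [{test_idx[0]}..{test_idx[-1]}]")
```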
10. Quality Gates (Definition of Done)
Checklist
- [ ] Leakage prevented (Train only).
- [ ] Multicollinearity handled.
- [ ] Validation comparison run.
Validation
| Criterion | Method | Threshold |
|---|---|---|
| Dimensionality | Compare feature counts before/after | Feature count reduced |
| Integrity | CV Score | No significant degradation |
11. Failure Modes & Recovery
| Failure Mode | Symptoms | Recovery Action |
|---|---|---|
| Over-Pruning | Test score crashes. | You removed useful signal. Relax the selection threshold or switch methods. |
| Leakage | Test score suspiciously high. | Check if you ran selection before splitting Train/Test. Move it inside the pipeline. |
| Null Importance | All features have 0 importance. | Target might be random noise, or the model is completely wrong for the data. |
12. Copy-Paste Prompt
TIP
One-Click Agent Invocation Copy the prompt below, replace placeholders, and paste into your agent.
text
Role: Act as a Senior Data Scientist.
Task: Execute the Feature Selection Techniques workflow.
## Objective & Scope
- **Goal**: Reduce dataset dimensionality to improve model performance and interpretability.
- **Scope**: Filter methods (Correlation), Wrapper methods (RFE), and Verification.
## Inputs
- [ ] DATASET: Pandas DataFrame or File Path.
- [ ] PROBLEM_TYPE: "Classification" or "Regression".
- [ ] TARGET_VARIABLE: Name of the y column.
## Output Artifacts
- [ ] Selected Feature List (JSON)
- [ ] Validation Report (Performance Delta)
## Execution Steps
1. **Filter**
- Drop constant features. Calculate Correlation Matrix and drop highly correlated pairs (>0.95).
2. **Wrapper**
- Run Recursive Feature Elimination (RFE) or Permutation Importance using a Random Forest.
3. **Verify**
- Train baseline (All Features) vs Candidate (Selected Features). Compare AUC/RMSE.
## Quality Gates
- [ ] Multicollinearity resolved.
- [ ] Feature count reduced.
- [ ] Model performance maintained (or improved).
- [ ] No data leakage (Fit on Train only).
## Failure Handling
- If blocked, output a "Clarification Brief" detailing missing info or blockers.
## Constraints
- **Technical**: Handle High Cardinality features carefully (use Permutation Importance).
- **Security**: Don't bias selection by dropping sensitive protected attributes without checks.
## Command
Now execute this workflow step-by-step.

Appendix: Change Log
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2026-01-14 | AI Engineering Team | Initial release |