
Feature Selection Techniques


1. Objective

The objective of this workflow is to remove the noise and keep the signal. Using 500 features (when only 10 matter) leads to the "Curse of Dimensionality", slow training, and overfitting. This workflow systematically reduces the feature space using statistical and model-based methods, resulting in simpler, faster, and often more accurate models.


2. Context & Scope

In Scope

This workflow covers Filter methods (Correlation, Variance), Wrapper methods (RFE, Forward Selection), and Embedded methods (Lasso, Tree Importance). It also addresses Collinearity (removing redundant features).

Assumption: Data is already cleaned and encoded (numerical). Feature selection fails on raw strings.

Out of Scope

  • Dimensionality Reduction: Techniques like PCA transform features into new latent variables. This workflow focuses on Selection (keeping original interpretable features).
  • Feature Generation: Creating new features is "Feature Engineering".

3. When to Use / When Not to Use

Use This Workflow When

  • You have wide data (Columns > Rows).
  • The model trains too slowly.
  • You suspect overfitting (High Train Score, Low Test Score).

Do NOT Use This Workflow When

  • You have very few features (< 10). Keep them all unless proven harmful.
  • You are using Deep Learning (which often performs its own feature extraction implicitly).

4. Inputs (Required/Optional)

Required Inputs

| Input | Description | Format | Example |
| --- | --- | --- | --- |
| DATASET | The features (X) and target (y). | DataFrame | df_train |
| PROBLEM_TYPE | Classification or Regression. | String | Classification |

Optional Inputs

| Input | Description | Default | Condition |
| --- | --- | --- | --- |
| MAX_FEATURES | Target feature count. | None | If a strict limit is needed. |

5. Outputs (Artifacts)

| Artifact | Format | Destination | Quality Criteria |
| --- | --- | --- | --- |
| Selected Columns | List[String] | JSON file | Subset of original columns. |
| Ranking Report | Plot/Table | Notebook | Explains why features were dropped. |

6. Operating Modes

⚡ Fast Mode

Timebox: 30 minutes. Scope: Correlation filter. Details: Drop constant features (variance = 0), highly correlated features (> 0.95), and features with negligible correlation to the target.

🎯 Standard Mode (Default)

Timebox: 2 hours. Scope: Model-based (RFE). Details: Use Recursive Feature Elimination (RFE) with a RandomForest or linear model to iteratively prune the least important features.

🔬 Deep Mode

Timebox: 1 day. Scope: Permutation Importance. Details: Use "Permutation Importance" (shuffle a column and see whether model error increases), which is model-agnostic and more reliable than impurity-based importance.
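The Deep Mode idea can be sketched with scikit-learn's `permutation_importance` on synthetic data (the dataset and model here are illustrative, not part of this workflow's inputs):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=15, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each column on HELD-OUT data; a genuinely useful feature
# hurts the score when shuffled, a noise feature does not.
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
print("Most important feature index:", ranking[0])
```

Because the shuffling is evaluated on held-out data, this avoids the high-cardinality bias of impurity-based importances.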


7. Constraints & Guardrails

Technical Constraints

  • Leakage: Feature Selection MUST happen only on the Training Set. If you select features based on the whole dataset (Train+Test), you leak information.
  • Categorical High Cardinality: Tree-based importance is biased towards high-cardinality features (e.g., User ID). Use Permutation Importance to mitigate this.
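The leakage guardrail is easiest to enforce by putting selection inside a `Pipeline`, so it is re-fit on each training fold. A minimal sketch, assuming synthetic data (`SelectKBest` here stands in for whatever selector you choose):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)

# WRONG: fitting a selector on all of X before splitting leaks test info.
# RIGHT: inside the pipeline, SelectKBest is re-fit on each training fold.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Leak-free CV accuracy: {scores.mean():.3f}")
```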

Security & Privacy

CAUTION

Bias in Selection Dropping features because they are "uncorrelated" might hide interaction effects. Be careful removing sensitive attributes (Gender/Race) without checking for Fairness/Bias mitigation first.

Compliance

  • Explainability: Fewer features = More explainable. This is often a regulatory requirement (Right to Explanation).

8. Procedure

Phase 1: Filter Methods (Univariate)

Objective: Cheap sanity check.

Variance Threshold: remove columns where std == 0 (constant); they add no value.

Correlation Matrix:

  • Identify pairs with correlation > 0.95 (multicollinearity).
  • Drop one feature of each pair (usually the one less correlated with the target).
  • Do NOT trust Pearson correlation for non-linear relationships.

Verify: Dataset reduced by ~10-20% (garbage removal).
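Phase 1 can be sketched in a few lines of pandas; the column names (`a`, `const`, `a_copy`, `b`) are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(size=200),
    "const": 1.0,  # zero-variance column
})
df["a_copy"] = df["a"] * 2 + rng.normal(scale=0.01, size=200)  # ~collinear with "a"
df["b"] = rng.normal(size=200)

# 1) Variance threshold: drop constant columns (std == 0).
df = df[df.columns[df.std() > 0]]

# 2) Correlation filter: drop one of each pair with |corr| > 0.95.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
df = df.drop(columns=to_drop)
print(sorted(df.columns))  # "const" and "a_copy" are gone
```

Masking the lower triangle ensures each correlated pair is counted once, so only one member of the pair is dropped.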

Phase 2: Wrapper/Embedded Methods

Objective: Interaction check.

Lasso (L1 Regularization): for linear models, fit a Lasso. It drives the coefficients of weak features to exactly 0; drop them.

Tree Importance: for non-linear models, fit a Random Forest and read feature_importances_.

RFE (Recursive Feature Elimination):

  1. Train model on all features.
  2. Rank importance.
  3. Drop the bottom 1 (or 10%).
  4. Repeat until MAX_FEATURES reached or performance drops. This captures interactions that univariate filters miss.

Verify: A ranked list of features.
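Both the embedded (Lasso) and wrapper (RFE) routes can be sketched on synthetic regression data; the hyperparameters here (alpha, tree count, target of 5 features) are illustrative, not prescriptive:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=0.1, random_state=0)

# Embedded: L1 regularization drives weak coefficients to exactly zero.
lasso = Lasso(alpha=1.0).fit(X, y)
kept_by_lasso = np.flatnonzero(lasso.coef_)

# Wrapper: RFE drops the single least important feature per round.
rfe = RFE(RandomForestRegressor(n_estimators=50, random_state=0),
          n_features_to_select=5, step=1).fit(X, y)
kept_by_rfe = np.flatnonzero(rfe.support_)
print("Lasso kept:", kept_by_lasso, "RFE kept:", kept_by_rfe)
```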

Phase 3: Validation

Objective: Prove it helps.

Train a model on All Features and measure AUC/MSE. Then train a model on Selected Features and measure the same metric.

Success Criteria:

  • Performance did not drop significantly (e.g. < 1%).
  • Ideally, performance even increases (less noise).
  • Training speed increased.

Verify: "A model with 20 features performs effectively the same as a model with 100."
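The validation comparison can be sketched as a baseline-vs-candidate AUC check; note that the top-20 selection below is fit on the train split only, per the leakage guardrail (data and cutoffs are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def auc(cols):
    """Fit on the chosen columns of train, score AUC on test."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_tr[:, cols], y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te[:, cols])[:, 1])

baseline = auc(np.arange(X.shape[1]))
# Rank features by impurity importance on the TRAIN split only.
imp = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
top20 = np.argsort(imp.feature_importances_)[-20:]
candidate = auc(top20)
print(f"AUC with 100 features: {baseline:.3f} vs top 20: {candidate:.3f}")
```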


9. Technical Considerations

Boruta Algorithm: a robust "all-relevant" feature selection wrapper built around Random Forest. It compares real features against shuffled "shadow" copies to determine statistical significance. Highly recommended for tabular data.

Non-Linearity: Correlation (Pearson) only finds lines. Information Gain (Mutual Information) finds any relationship. Use Mutual Information for non-linear selection.
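The difference is easy to demonstrate: in the synthetic example below, `y` depends on `x_quad` only through a square, so Pearson correlation is near zero while mutual information is not (variable names are illustrative):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x_lin = rng.uniform(-1, 1, 1000)    # irrelevant feature
x_quad = rng.uniform(-1, 1, 1000)   # relevant, but non-linearly
y = x_quad ** 2 + 0.05 * rng.normal(size=1000)

X = np.column_stack([x_lin, x_quad])
# Pearson sees "no line" in the symmetric quadratic relationship.
pearson = abs(np.corrcoef(x_quad, y)[0, 1])
# Mutual information detects any statistical dependence.
mi = mutual_info_regression(X, y, random_state=0)
print(f"|Pearson(x_quad, y)| = {pearson:.2f}, MI scores = {mi.round(2)}")
```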

Time-Series: Random cross-validation for feature selection breaks in time-series (Lookahead bias). Use temporal splits.
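For time-series, `TimeSeriesSplit` gives the temporal splits this implies; every training index precedes every test index, so no future information leaks into selection (a minimal sketch):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered observations
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # Training data always ends before test data begins: no lookahead.
    assert train_idx.max() < test_idx.min()
print("Temporal splits preserve order.")
```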


10. Quality Gates (Definition of Done)

Checklist

  • [ ] Leakage prevented (Train only).
  • [ ] Multicollinearity handled.
  • [ ] Validation comparison run.

Validation

| Criterion | Method | Threshold |
| --- | --- | --- |
| Dimensionality | Reduction | Dataset size reduced |
| Integrity | CV Score | No significant degradation |

11. Failure Modes & Recovery

| Failure Mode | Symptoms | Recovery Action |
| --- | --- | --- |
| Over-Pruning | Test score crashes. | You removed useful signal. Relax the selection threshold or switch methods. |
| Leakage | Test score suspiciously high. | Check if you ran selection before splitting Train/Test. Move it inside the pipeline. |
| Null Importance | All features have 0 importance. | Target might be random noise, or the model is completely wrong for the data. |

12. Copy-Paste Prompt

TIP

One-Click Agent Invocation Copy the prompt below, replace placeholders, and paste into your agent.

```text
Role: Act as a Senior Data Scientist.
Task: Execute the Feature Selection Techniques workflow.

## Objective & Scope
- **Goal**: Reduce dataset dimensionality to improve model performance and interpretability.
- **Scope**: Filter methods (Correlation), Wrapper methods (RFE), and Verification.

## Inputs
- [ ] DATASET: Pandas DataFrame or File Path.
- [ ] PROBLEM_TYPE: "Classification" or "Regression".
- [ ] TARGET_VARIABLE: Name of the y column.

## Output Artifacts
- [ ] Selected Feature List (JSON)
- [ ] Validation Report (Performance Delta)

## Execution Steps
1. **Filter**
   - Drop constant features. Calculate Correlation Matrix and drop highly correlated pairs (>0.95).
2. **Wrapper**
   - Run Recursive Feature Elimination (RFE) or Permutation Importance using a Random Forest.
3. **Verify**
   - Train baseline (All Features) vs Candidate (Selected Features). Compare AUC/RMSE.

## Quality Gates
- [ ] Multicollinearity resolved.
- [ ] Feature count reduced.
- [ ] Model performance maintained (or improved).
- [ ] No data leakage (Fit on Train only).

## Failure Handling
- If blocked, output a "Clarification Brief" detailing missing info or blockers.

## Constraints
- **Technical**: Handle High Cardinality features carefully (use Permutation Importance).
- **Security**: Don't bias selection by dropping sensitive protected attributes without checks.

## Command
Now execute this workflow step-by-step.
```

Appendix: Change Log

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0.0 | 2026-01-14 | AI Engineering Team | Initial release |
