
Feature Selection Techniques


1. Objective

The objective of this workflow is to remove the noise and keep the signal. Using 500 features (when only 10 matter) leads to the "Curse of Dimensionality", slow training, and overfitting. This workflow systematically reduces the feature space using statistical and model-based methods, resulting in simpler, faster, and often more accurate models.


2. Context & Scope

In Scope

This workflow covers Filter methods (Correlation, Variance), Wrapper methods (RFE, Forward Selection), and Embedded methods (Lasso, Tree Importance). It also addresses Collinearity (removing redundant features).

Assumption: Data is already cleaned and encoded (numerical). Feature selection fails on raw strings.

Out of Scope

  • Dimensionality Reduction: Techniques like PCA transform features into new latent variables. This workflow focuses on Selection (keeping original interpretable features).
  • Feature Generation: Creating new features is "Feature Engineering".

3. When to Use / When Not to Use

Use This Workflow When

  • You have wide data (Columns > Rows).
  • The model trains too slowly.
  • You suspect overfitting (High Train Score, Low Test Score).

Do NOT Use This Workflow When

  • You have very few features (< 10). Keep them all unless proven harmful.
  • You are using Deep Learning (which often performs its own feature extraction implicitly).

4. Inputs (Required/Optional)

Required Inputs

| Input | Description | Format | Example |
| --- | --- | --- | --- |
| DATASET | The features (X) and target (y). | DataFrame | df_train |
| PROBLEM_TYPE | Classification or Regression. | String | Classification |

Optional Inputs

| Input | Description | Default | Condition |
| --- | --- | --- | --- |
| MAX_FEATURES | Target feature count. | None | If a strict limit is needed. |

5. Outputs (Artifacts)

| Artifact | Format | Destination | Quality Criteria |
| --- | --- | --- | --- |
| Selected Columns | List[String] | JSON file | Subset of original columns. |
| Ranking Report | Plot/Table | Notebook | Explains why features were dropped. |

6. Operating Modes

⚡ Fast Mode

Timebox: 30 minutes. Scope: Correlation filter. Details: Drop constant features (variance = 0), highly correlated features (> 0.95), and features with negligible correlation to the target.

🎯 Standard Mode (Default)

Timebox: 2 hours. Scope: Model-based (RFE). Details: Use Recursive Feature Elimination (RFE) with a RandomForest or linear model to iteratively prune the least important features.

🔬 Deep Mode

Timebox: 1 day. Scope: Permutation Importance. Details: Use "Permutation Importance" (shuffle a column and see whether model error increases), which is model-agnostic and more reliable than impurity-based importance.
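The Deep Mode idea can be sketched with scikit-learn's `permutation_importance` on synthetic data (the dataset and model here are illustrative, not part of this workflow's inputs):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=15, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each column on HELD-OUT data; a genuinely useful feature
# hurts the score when shuffled, a noise feature does not.
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
print("Most important feature index:", ranking[0])
```

Because the shuffling is evaluated on held-out data, this avoids the high-cardinality bias of impurity-based importances.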


7. Constraints & Guardrails

Technical Constraints

  • Leakage: Feature Selection MUST happen only on the Training Set. If you select features based on the whole dataset (Train+Test), you leak information.
  • Categorical High Cardinality: Tree-based importance is biased towards high-cardinality features (e.g., User ID). Use Permutation Importance to mitigate this.
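The leakage guardrail is easiest to enforce by putting selection inside a `Pipeline`, so it is re-fit on each training fold. A minimal sketch, assuming synthetic data (`SelectKBest` here stands in for whatever selector you choose):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)

# WRONG: fitting a selector on all of X before splitting leaks test info.
# RIGHT: inside the pipeline, SelectKBest is re-fit on each training fold.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Leak-free CV accuracy: {scores.mean():.3f}")
```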

Security & Privacy

CAUTION

Bias in Selection Dropping features because they are "uncorrelated" might hide interaction effects. Be careful removing sensitive attributes (Gender/Race) without checking for Fairness/Bias mitigation first.

Compliance

  • Explainability: Fewer features = More explainable. This is often a regulatory requirement (Right to Explanation).

8. Procedure

Phase 1: Filter Methods (Univariate)

Objective: Cheap sanity check.

Variance Threshold: remove columns where std == 0 (constant); they add no value.

Correlation Matrix:

  • Identify pairs with correlation > 0.95 (multicollinearity).
  • Drop one feature of each pair (usually the one less correlated with the target).
  • Do NOT trust Pearson correlation for non-linear relationships.

Verify: Dataset reduced by ~10-20% (garbage removal).
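Phase 1 can be sketched in a few lines of pandas; the column names (`a`, `const`, `a_copy`, `b`) are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(size=200),
    "const": 1.0,  # zero-variance column
})
df["a_copy"] = df["a"] * 2 + rng.normal(scale=0.01, size=200)  # ~collinear with "a"
df["b"] = rng.normal(size=200)

# 1) Variance threshold: drop constant columns (std == 0).
df = df[df.columns[df.std() > 0]]

# 2) Correlation filter: drop one of each pair with |corr| > 0.95.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
df = df.drop(columns=to_drop)
print(sorted(df.columns))  # "const" and "a_copy" are gone
```

Masking the lower triangle ensures each correlated pair is counted once, so only one member of the pair is dropped.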

Phase 2: Wrapper/Embedded Methods

Objective: Interaction check.

Lasso (L1 Regularization): for linear models, fit a Lasso. It drives the coefficients of weak features to exactly 0; drop them.

Tree Importance: for non-linear models, fit a Random Forest and read feature_importances_.

RFE (Recursive Feature Elimination):

  1. Train model on all features.
  2. Rank importance.
  3. Drop the bottom 1 (or 10%).
  4. Repeat until MAX_FEATURES reached or performance drops. This captures interactions that univariate filters miss.

Verify: A ranked list of features.
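Both the embedded (Lasso) and wrapper (RFE) routes can be sketched on synthetic regression data; the hyperparameters here (alpha, tree count, target of 5 features) are illustrative, not prescriptive:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=0.1, random_state=0)

# Embedded: L1 regularization drives weak coefficients to exactly zero.
lasso = Lasso(alpha=1.0).fit(X, y)
kept_by_lasso = np.flatnonzero(lasso.coef_)

# Wrapper: RFE drops the single least important feature per round.
rfe = RFE(RandomForestRegressor(n_estimators=50, random_state=0),
          n_features_to_select=5, step=1).fit(X, y)
kept_by_rfe = np.flatnonzero(rfe.support_)
print("Lasso kept:", kept_by_lasso, "RFE kept:", kept_by_rfe)
```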

Phase 3: Validation

Objective: Prove it helps.

Train a model on All Features and measure AUC/MSE. Then train a model on Selected Features and measure the same metric.

Success Criteria:

  • Performance did not drop significantly (e.g. < 1%).
  • Ideally, performance even increases (less noise).
  • Training speed increased.

Verify: "A model with 20 features performs effectively the same as a model with 100."
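The validation comparison can be sketched as a baseline-vs-candidate AUC check; note that the top-20 selection below is fit on the train split only, per the leakage guardrail (data and cutoffs are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def auc(cols):
    """Fit on the chosen columns of train, score AUC on test."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_tr[:, cols], y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te[:, cols])[:, 1])

baseline = auc(np.arange(X.shape[1]))
# Rank features by impurity importance on the TRAIN split only.
imp = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
top20 = np.argsort(imp.feature_importances_)[-20:]
candidate = auc(top20)
print(f"AUC with 100 features: {baseline:.3f} vs top 20: {candidate:.3f}")
```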


9. Technical Considerations

Boruta Algorithm: a robust "all-relevant" feature selection wrapper built around Random Forest. It compares real features against shuffled "shadow" copies to determine statistical significance. Highly recommended for tabular data.

Non-Linearity: Correlation (Pearson) only finds lines. Information Gain (Mutual Information) finds any relationship. Use Mutual Information for non-linear selection.
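The difference is easy to demonstrate: in the synthetic example below, `y` depends on `x_quad` only through a square, so Pearson correlation is near zero while mutual information is not (variable names are illustrative):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x_lin = rng.uniform(-1, 1, 1000)    # irrelevant feature
x_quad = rng.uniform(-1, 1, 1000)   # relevant, but non-linearly
y = x_quad ** 2 + 0.05 * rng.normal(size=1000)

X = np.column_stack([x_lin, x_quad])
# Pearson sees "no line" in the symmetric quadratic relationship.
pearson = abs(np.corrcoef(x_quad, y)[0, 1])
# Mutual information detects any statistical dependence.
mi = mutual_info_regression(X, y, random_state=0)
print(f"|Pearson(x_quad, y)| = {pearson:.2f}, MI scores = {mi.round(2)}")
```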

Time-Series: Random cross-validation for feature selection breaks in time-series (Lookahead bias). Use temporal splits.
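For time-series, `TimeSeriesSplit` gives the temporal splits this implies; every training index precedes every test index, so no future information leaks into selection (a minimal sketch):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered observations
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # Training data always ends before test data begins: no lookahead.
    assert train_idx.max() < test_idx.min()
print("Temporal splits preserve order.")
```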


10. Quality Gates (Definition of Done)

Checklist

  • [ ] Leakage prevented (Train only).
  • [ ] Multicollinearity handled.
  • [ ] Validation comparison run.

Validation

| Criterion | Method | Threshold |
| --- | --- | --- |
| Dimensionality | Reduction | Dataset size reduced |
| Integrity | CV Score | No significant degradation |

11. Failure Modes & Recovery

| Failure Mode | Symptoms | Recovery Action |
| --- | --- | --- |
| Over-Pruning | Test score crashes. | You removed useful signal. Relax the selection threshold or switch methods. |
| Leakage | Test score suspiciously high. | Check if you ran selection before splitting Train/Test. Move it inside the pipeline. |
| Null Importance | All features have 0 importance. | Target might be random noise, or the model is completely wrong for the data. |

12. Copy-Paste Prompt

TIP

One-Click Agent Invocation Copy the prompt below, replace placeholders, and paste into your agent.

```text
Role: Act as a Senior Data Scientist.
Task: Execute the Feature Selection Techniques workflow.

## Objective & Scope
- **Goal**: Reduce dataset dimensionality to improve model performance and interpretability.
- **Scope**: Filter methods (Correlation), Wrapper methods (RFE), and Verification.

## Inputs
- [ ] DATASET: Pandas DataFrame or File Path.
- [ ] PROBLEM_TYPE: "Classification" or "Regression".
- [ ] TARGET_VARIABLE: Name of the y column.

## Output Artifacts
- [ ] Selected Feature List (JSON)
- [ ] Validation Report (Performance Delta)

## Execution Steps
1. **Filter**
   - Drop constant features. Calculate Correlation Matrix and drop highly correlated pairs (>0.95).
2. **Wrapper**
   - Run Recursive Feature Elimination (RFE) or Permutation Importance using a Random Forest.
3. **Verify**
   - Train baseline (All Features) vs Candidate (Selected Features). Compare AUC/RMSE.

## Quality Gates
- [ ] Multicollinearity resolved.
- [ ] Feature count reduced.
- [ ] Model performance maintained (or improved).
- [ ] No data leakage (Fit on Train only).

## Failure Handling
- If blocked, output a "Clarification Brief" detailing missing info or blockers.

## Constraints
- **Technical**: Handle High Cardinality features carefully (use Permutation Importance).
- **Security**: Don't bias selection by dropping sensitive protected attributes without checks.

## Command
Now execute this workflow step-by-step.
```

Appendix: Change Log

| Version | Date | Author | Changes |
| --- | --- | --- | --- |
| 1.0.0 | 2026-01-14 | AI Engineering Team | Initial release |
