
Clean Text Data


1. Objective

The objective of this workflow is to transform raw, noisy text (human language) into a structured, numerical-ready format. "Garbage In, Garbage Out" is doubly true for NLP. If you feed HTML tags, emojis, and stop words into a TF-IDF or embedding model indiscriminately, your results will be noise. This workflow ensures standardization.


2. Context & Scope

In Scope

This workflow covers Encoding normalization (Unicode), Regex cleaning (Removing URLs/HTML), Tokenization, Stopword removal, and Lemmatization/Stemming.

Assumption: You are using standard NLP libraries (NLTK, spaCy, HuggingFace).

Out of Scope

  • Embedding Generation: Converting text to vectors is "Create Text Embeddings". This workflow stops at the "Text/Token" stage.
  • OCR: Extracting text from images is a Computer Vision task.

3. When to Use / When Not to Use

Use This Workflow When

  • Building a Topic Model or Sentiment Analyzer.
  • Fine-tuning a BERT model (requires specific tokenizer).
  • Cleaning scraped web data.

Do NOT Use This Workflow When

  • Using an LLM (GPT-4) that expects raw natural language (Prompt Engineering). Over-cleaning (removing Stopwords) hurts LLM performance as it breaks grammar.
  • Your data is code (Requires AST parsing, not English tokenization).

4. Inputs (Required/Optional)

Required Inputs

| Input | Description | Format | Example |
|---|---|---|---|
| RAW_TEXT | Input strings. | List/Column | ["User uploaded <br> img...", "Hello World!"] |
| LANGUAGE | Source language. | Code | en_core_web_sm (spaCy) |

Optional Inputs

| Input | Description | Default | Condition |
|---|---|---|---|
| LOWERCASE | Convert to lowercase? | True | Set False for NER (names matter). |

5. Outputs (Artifacts)

| Artifact | Format | Destination | Quality Criteria |
|---|---|---|---|
| Cleaned Text | List[String] | Dataframe | No artifacts (HTML, control chars). |
| Cleaning Func | Python Func | Repo | Reproducible pipeline. |

6. Operating Modes

Fast Mode

Timebox: 15 minutes
Scope: Regex only.
Details: Use Python string methods and regex to strip whitespace, lowercase, and remove special characters. Fast, but low linguistic awareness.
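A minimal Fast-Mode sketch using only the standard library (the patterns are illustrative and should be tuned per corpus):

```python
import re

def fast_clean(text: str) -> str:
    """Fast-Mode cleanup: regex and string methods only, no linguistic awareness."""
    text = re.sub(r"<[^>]+>", " ", text)              # strip HTML-like tags
    text = re.sub(r"https?://\S+", " ", text)         # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # lowercase, keep alphanumerics
    return re.sub(r"\s+", " ", text).strip()          # collapse whitespace

print(fast_clean("Visit <b>https://example.com</b> NOW!!!"))  # -> "visit now"
```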

🎯 Standard Mode (Default)

Timebox: 1 hour
Scope: Linguistic pipeline (spaCy).
Details: Load a spaCy pipeline. Perform tokenization, lemmatization (running -> run), and stopword removal. Keeps semantics intact.

🔬 Deep Mode

Timebox: 4 hours
Scope: Domain specific.
Details: Build custom stopword lists (e.g., removing "Inc", "Corp" for finance). Handle custom entities (Twitter handles such as @user). Fix spelling errors.
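A Deep-Mode sketch of domain stopwords plus handle masking (the DOMAIN_STOPWORDS set is a hypothetical finance example, not a recommended list):

```python
import re

# Hypothetical domain stopwords for a finance corpus -- tune per domain.
DOMAIN_STOPWORDS = {"inc", "corp", "ltd", "llc"}

def deep_clean(text: str) -> list[str]:
    """Lowercase, mask Twitter-style handles, tokenize, drop domain stopwords."""
    text = re.sub(r"@\w+", "[[USER]]", text.lower())   # mask @handles
    tokens = re.findall(r"\[\[USER\]\]|\w+", text)     # keep mask token intact
    return [t for t in tokens if t not in DOMAIN_STOPWORDS]
```

For example, deep_clean("@TraderJoe says Acme Corp will rally") keeps the masked handle but drops "corp".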


7. Constraints & Guardrails

Technical Constraints

  • Destructive: Stopword removal is destructive. You cannot reconstruct the sentence "To be or not to be" if you remove "to", "be", "or", "not". Save raw data.
  • Performance: Complex regex on Gigabytes of text is slow. Use vectorised string operations (Pandas/Polars) or parallel processing (Spark).

Security & Privacy

CAUTION

PII Scrubbing: Text data often contains emails and phone numbers. The cleaning stage is the BEST place to run PII redaction (regex-replace each email with [[EMAIL]]). Typically do this before any other processing.
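A minimal redaction sketch; these regexes catch the easy cases only, and production PII detection (names, addresses) needs a dedicated tool:

```python
import re

# Illustrative patterns -- intentionally loose, tune for your data.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace emails and phone-like sequences with placeholder tokens."""
    text = EMAIL_RE.sub("[[EMAIL]]", text)
    return PHONE_RE.sub("[[PHONE]]", text)

print(redact_pii("Mail jane.doe@corp.com or call +1 (555) 123-4567."))
```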

Compliance

  • Bias: Be careful when removing "rare words". You might accidentally purge minority names or dialects, biasing the model.

8. Procedure

Phase 1: Structural Cleaning (Noise)

Objective: Remove machine artifacts.

Input RAW_TEXT, then:

  • Fix encoding: normalize Unicode with unicodedata.normalize('NFKC', text); when decoding raw bytes, use raw_bytes.decode('utf-8', errors='ignore') to drop undecodable bytes. (Note: text.encode('utf-8', 'ignore').decode('utf-8') is effectively a no-op on a valid Python string; true mojibake needs a dedicated fixer such as ftfy.)
  • Remove HTML: use BeautifulSoup, or the regex <.*?>, to strip tags.
  • Standardize whitespace: replace tabs/newlines with a single space.
  • Protect PII: regex-replace emails/phones with placeholder tokens if required.

Verify: Text is readable. No <div class...> remains.
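Phase 1 can be sketched with the standard library alone (a minimal sketch; BeautifulSoup is more robust for real-world HTML):

```python
import html
import re
import unicodedata

def structural_clean(raw: str) -> str:
    """Phase 1: remove machine artifacts, keep the human language intact."""
    text = html.unescape(raw)                    # &amp; -> &, &nbsp; -> non-breaking space
    text = re.sub(r"<[^>]+>", " ", text)         # strip tags (naive; fine for simple markup)
    text = unicodedata.normalize("NFKC", text)   # normalize Unicode forms
    text = "".join(ch for ch in text             # drop control characters
                   if unicodedata.category(ch) != "Cc" or ch in "\n\t")
    return re.sub(r"\s+", " ", text).strip()     # tabs/newlines -> single space

print(structural_clean('<div class="post">User&nbsp;uploaded <br> img&hellip;</div>'))
```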

Phase 2: Tokenization & Normalization

Objective: Split and Standardize.

Load the NLP Tool (e.g., nlp = spacy.load("en_core_web_sm")). Process the stream (nlp.pipe(texts) for speed). Iterate tokens:

  • Case: .lower() (unless collecting Named Entities).
  • Lemmatize: Convert am, are, is -> be. dogs -> dog. (Better than Stemming).
  • Filter: Remove token.is_stop, token.is_punct, token.is_digit (based on use case).

Verify: "The quick brown foxes..." -> ["quick", "brown", "fox"].
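The token loop above can be illustrated without spaCy as a library-free stand-in (the tiny LEMMAS dict and STOPWORDS set are placeholders for spaCy's token.lemma_ and token.is_stop):

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "over"}   # tiny illustrative subset
LEMMAS = {"foxes": "fox", "jumps": "jump", "dogs": "dog"}  # stands in for a real lemmatizer

def normalize(text: str) -> list[str]:
    """Tokenize -> lowercase -> lemmatize (lookup) -> stopword filter."""
    tokens = re.findall(r"[a-z]+", text.lower())
    lemmas = [LEMMAS.get(t, t) for t in tokens]
    return [t for t in lemmas if t not in STOPWORDS]

print(normalize("The quick brown foxes..."))  # -> ['quick', 'brown', 'fox']
```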

Phase 3: Validation

Objective: Check distribution.

Calculate vocabulary size. Check the "Top N Words": if the top word is still the or html, your cleaning failed. Check for "Empty Rows": did cleaning reduce a string to ""? Handle these (drop, or replace with [[EMPTY]]).

Verify: Vocabulary is cleaner. Noise words removed.
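The Phase 3 checks can be scripted directly with collections.Counter:

```python
from collections import Counter

def validate(token_lists, top_n=5):
    """Phase 3 sanity checks: vocabulary distribution and empty rows."""
    vocab = Counter(t for toks in token_lists for t in toks)
    empties = sum(1 for toks in token_lists if not toks)
    print("Vocabulary size:", len(vocab))
    print("Top words:", vocab.most_common(top_n))  # 'the' or 'html' here => cleaning failed
    print("Empty rows:", empties)                  # drop, or replace with [[EMPTY]]
    return vocab, empties

validate([["quick", "brown", "fox"], [], ["fox", "run"]])
```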


9. Technical Considerations

Stemming vs Lemmatization: Stemming (PorterStemmer) chops suffixes blindly (universal -> univers). Lemmatization uses a vocabulary and morphological analysis to return real dictionary forms (universities -> university). Use lemmatization for quality, stemming for speed/search.

Custom Tokens: If your text has specialized terms like #C++, standard tokenizers split this into #, C, +, +. Add special rules to the tokenizer to preserve domain terms.

Emoji: Don't just delete emojis; they carry sentiment. If needed, convert 😊 to a text token (e.g., :smile:) with an emoji-to-name mapper such as the emoji package's demojize(), or keep it as a character.
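A tiny stand-in for emoji-to-text conversion (the two-entry EMOJI_MAP is illustrative; a real mapper such as the emoji package covers the full set):

```python
# Tiny illustrative mapping -- a full mapper would cover thousands of codepoints.
EMOJI_MAP = {"😊": " :smile: ", "😡": " :angry: "}

def demojize_lite(text: str) -> str:
    """Replace known emojis with sentiment-bearing text tokens."""
    for char, name in EMOJI_MAP.items():
        text = text.replace(char, name)
    return " ".join(text.split())  # tidy up the padding spaces

print(demojize_lite("Great service 😊"))  # -> "Great service :smile:"
```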


10. Quality Gates (Definition of Done)

Checklist

  • [ ] HTML/URLs removed.
  • [ ] Text normalized (lower/lemma).
  • [ ] PII redacted (if applicable).
  • [ ] Empty strings handled.

Validation

| Criterion | Method | Threshold |
|---|---|---|
| Reduction | Char count | Reduced by 10-40% (typical) |
| Purity | Regex search | No HTML tags found |
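Both validation criteria can be checked in a few lines; the 10-40% band is this document's heuristic, not a hard rule:

```python
import re

def passes_gates(raw: str, cleaned: str) -> dict:
    """Check the two validation criteria: size reduction and HTML purity."""
    reduction = 1 - len(cleaned) / len(raw)
    return {
        "reduction": reduction,
        "reduction_ok": 0.10 <= reduction <= 0.40,          # typical band, tune per corpus
        "purity_ok": re.search(r"<[^>]+>", cleaned) is None,  # no HTML tags survived
    }

print(passes_gates("<b>Hello World!</b>", "hello world!"))
```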

11. Failure Modes & Recovery

| Failure Mode | Symptoms | Recovery Action |
|---|---|---|
| Over-Cleaning | "To be or not to be" becomes an empty string. | Reduce stopword-list aggressiveness. Keep verbs. |
| Encoding Crash | UnicodeDecodeError. | Force decode with errors='replace'. |
| Slow Processing | ~100 docs/sec. | Use nlp.pipe(texts, n_process=-1) to parallelize in spaCy. |

12. Copy-Paste Prompt

TIP

One-Click Agent Invocation Copy the prompt below, replace placeholders, and paste into your agent.

text
Role: Act as a Senior NLP Engineer.
Task: Execute the Clean Text Data workflow.

## Objective & Scope
- **Goal**: Transform raw, noisy text into structured, numerical-ready tokens.
- **Scope**: Encoding, Regex cleaning, Tokenization, Normalization (Stem/Lemma).

## Inputs
- [ ] RAW_TEXT: Dataset/Column to process.
- [ ] LANGUAGE: Language code (e.g., en_core_web_sm).
- [ ] LOWERCASE: Boolean toggle for case normalization.

## Output Artifacts
- [ ] Cleaned Tokens (List[String])
- [ ] Preprocessing Pipeline (Python Function)

## Execution Steps
1. **Structure**
   - Remove artifacts (HTML, control chars) and fix encoding issues. Redact PII.
2. **NLP**
   - Apply spaCy/NLTK pipeline: Tokenize -> Lowercase -> Lemmatize -> Stopword Check.
3. **Verify**
   - Check vocabulary distribution and print "Before vs After" samples.

## Quality Gates
- [ ] HTML/URLs removed.
- [ ] Text normalized (lower/lemma).
- [ ] PII redacted.
- [ ] Empty strings handled.

## Failure Handling
- If blocked, output a "Clarification Brief" detailing missing info or blockers.

## Constraints
- **Privacy**: Scrub PII (Emails/Phones).
- **Technical**: Preserve domain terms (e.g., C++); Handle Emoji.

## Command
Now execute this workflow step-by-step.

Appendix: Change Log

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2026-01-14 | AI Engineering Team | Initial release |
