
Clean Text Data


1. Objective

The objective of this workflow is to transform raw, noisy text (human language) into a structured, numerical-ready format. "Garbage In, Garbage Out" is doubly true for NLP. If you feed HTML tags, emojis, and stop words into a TF-IDF or embedding model indiscriminately, your results will be noise. This workflow ensures standardization.


2. Context & Scope

In Scope

This workflow covers Encoding normalization (Unicode), Regex cleaning (Removing URLs/HTML), Tokenization, Stopword removal, and Lemmatization/Stemming.

Assumption: You are using standard NLP libraries (NLTK, spaCy, HuggingFace).

Out of Scope

  • Embedding Generation: Converting text to vectors is "Create Text Embeddings". This workflow stops at the "Text/Token" stage.
  • OCR: Extracting text from images is a Computer Vision task.

3. When to Use / When Not to Use

Use This Workflow When

  • Building a Topic Model or Sentiment Analyzer.
  • Fine-tuning a BERT model (requires specific tokenizer).
  • Cleaning scraped web data.

Do NOT Use This Workflow When

  • Using an LLM (GPT-4) that expects raw natural language (Prompt Engineering). Over-cleaning (removing Stopwords) hurts LLM performance as it breaks grammar.
  • Your data is code (Requires AST parsing, not English tokenization).

4. Inputs (Required/Optional)

Required Inputs

| Input | Description | Format | Example |
|---|---|---|---|
| RAW_TEXT | Input strings. | List/Column | ["User uploaded <br> img...", "Hello World!"] |
| LANGUAGE | Source language. | Code | en_core_web_sm (spaCy) |

Optional Inputs

| Input | Description | Default | Condition |
|---|---|---|---|
| LOWERCASE | Convert to lowercase? | True | Set False for NER (names matter). |

5. Outputs (Artifacts)

| Artifact | Format | Destination | Quality Criteria |
|---|---|---|---|
| Cleaned Text | List[String] | Dataframe | No artifacts (HTML, control chars). |
| Cleaning Func | Python Func | Repo | Reproducible pipeline. |

6. Operating Modes

Fast Mode

Timebox: 15 minutes
Scope: Regex only.
Details: Use Python string methods and regex to strip whitespace, lowercase, and remove special characters. Fast, but low linguistic awareness.
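A minimal Fast-Mode sketch using only the standard library (the patterns are illustrative and should be tuned per corpus):

```python
import re

def fast_clean(text: str) -> str:
    """Fast-Mode cleanup: regex and string methods only, no linguistic awareness."""
    text = re.sub(r"<[^>]+>", " ", text)              # strip HTML-like tags
    text = re.sub(r"https?://\S+", " ", text)         # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # lowercase, keep alphanumerics
    return re.sub(r"\s+", " ", text).strip()          # collapse whitespace

print(fast_clean("Visit <b>https://example.com</b> NOW!!!"))  # -> "visit now"
```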

🎯 Standard Mode (Default)

Timebox: 1 hour
Scope: Linguistic pipeline (spaCy).
Details: Load a spaCy pipeline. Perform tokenization, lemmatization (running -> run), and stopword removal. Keeps semantics intact.

🔬 Deep Mode

Timebox: 4 hours
Scope: Domain specific.
Details: Build custom stopword lists (e.g., removing "Inc", "Corp" for finance). Handle custom entities (Twitter handles such as @user). Fix spelling errors.
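A Deep-Mode sketch of domain stopwords plus handle masking (the DOMAIN_STOPWORDS set is a hypothetical finance example, not a recommended list):

```python
import re

# Hypothetical domain stopwords for a finance corpus -- tune per domain.
DOMAIN_STOPWORDS = {"inc", "corp", "ltd", "llc"}

def deep_clean(text: str) -> list[str]:
    """Lowercase, mask Twitter-style handles, tokenize, drop domain stopwords."""
    text = re.sub(r"@\w+", "[[USER]]", text.lower())   # mask @handles
    tokens = re.findall(r"\[\[USER\]\]|\w+", text)     # keep mask token intact
    return [t for t in tokens if t not in DOMAIN_STOPWORDS]
```

For example, deep_clean("@TraderJoe says Acme Corp will rally") keeps the masked handle but drops "corp".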


7. Constraints & Guardrails

Technical Constraints

  • Destructive: Stopword removal is destructive. You cannot reconstruct the sentence "To be or not to be" if you remove "to", "be", "or", "not". Save raw data.
  • Performance: Complex regex on Gigabytes of text is slow. Use vectorised string operations (Pandas/Polars) or parallel processing (Spark).

Security & Privacy

CAUTION

PII Scrubbing: Text data often contains emails and phone numbers. The cleaning stage is the BEST place to run PII redaction (regex-replace each email with [[EMAIL]]). Typically do this before any other processing.
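A minimal redaction sketch; these regexes catch the easy cases only, and production PII detection (names, addresses) needs a dedicated tool:

```python
import re

# Illustrative patterns -- intentionally loose, tune for your data.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace emails and phone-like sequences with placeholder tokens."""
    text = EMAIL_RE.sub("[[EMAIL]]", text)
    return PHONE_RE.sub("[[PHONE]]", text)

print(redact_pii("Mail jane.doe@corp.com or call +1 (555) 123-4567."))
```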

Compliance

  • Bias: Be careful when removing "rare words". You might accidentally purge minority names or dialects, biasing the model.

8. Procedure

Phase 1: Structural Cleaning (Noise)

Objective: Remove machine artifacts.

Input RAW_TEXT, then:

  • Fix encoding: normalize Unicode with unicodedata.normalize('NFKC', text); when decoding raw bytes, use raw_bytes.decode('utf-8', errors='ignore') to drop undecodable bytes. (Note: text.encode('utf-8', 'ignore').decode('utf-8') is effectively a no-op on a valid Python string; true mojibake needs a dedicated fixer such as ftfy.)
  • Remove HTML: use BeautifulSoup, or the regex <.*?>, to strip tags.
  • Standardize whitespace: replace tabs/newlines with a single space.
  • Protect PII: regex-replace emails/phones with placeholder tokens if required.

Verify: Text is readable. No <div class...> remains.
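Phase 1 can be sketched with the standard library alone (a minimal sketch; BeautifulSoup is more robust for real-world HTML):

```python
import html
import re
import unicodedata

def structural_clean(raw: str) -> str:
    """Phase 1: remove machine artifacts, keep the human language intact."""
    text = html.unescape(raw)                    # &amp; -> &, &nbsp; -> non-breaking space
    text = re.sub(r"<[^>]+>", " ", text)         # strip tags (naive; fine for simple markup)
    text = unicodedata.normalize("NFKC", text)   # normalize Unicode forms
    text = "".join(ch for ch in text             # drop control characters
                   if unicodedata.category(ch) != "Cc" or ch in "\n\t")
    return re.sub(r"\s+", " ", text).strip()     # tabs/newlines -> single space

print(structural_clean('<div class="post">User&nbsp;uploaded <br> img&hellip;</div>'))
```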

Phase 2: Tokenization & Normalization

Objective: Split and Standardize.

Load the NLP Tool (e.g., nlp = spacy.load("en_core_web_sm")). Process the stream (nlp.pipe(texts) for speed). Iterate tokens:

  • Case: .lower() (unless collecting Named Entities).
  • Lemmatize: Convert am, are, is -> be. dogs -> dog. (Better than Stemming).
  • Filter: Remove token.is_stop, token.is_punct, token.is_digit (based on use case).

Verify: "The quick brown foxes..." -> ["quick", "brown", "fox"].
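The token loop above can be illustrated without spaCy as a library-free stand-in (the tiny LEMMAS dict and STOPWORDS set are placeholders for spaCy's token.lemma_ and token.is_stop):

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "over"}   # tiny illustrative subset
LEMMAS = {"foxes": "fox", "jumps": "jump", "dogs": "dog"}  # stands in for a real lemmatizer

def normalize(text: str) -> list[str]:
    """Tokenize -> lowercase -> lemmatize (lookup) -> stopword filter."""
    tokens = re.findall(r"[a-z]+", text.lower())
    lemmas = [LEMMAS.get(t, t) for t in tokens]
    return [t for t in lemmas if t not in STOPWORDS]

print(normalize("The quick brown foxes..."))  # -> ['quick', 'brown', 'fox']
```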

Phase 3: Validation

Objective: Check distribution.

Calculate vocabulary size. Check the "Top N Words": if the top word is still the or html, your cleaning failed. Check for "Empty Rows": did cleaning reduce a string to ""? Handle these (drop, or replace with [[EMPTY]]).

Verify: Vocabulary is cleaner. Noise words removed.
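The Phase 3 checks can be scripted directly with collections.Counter:

```python
from collections import Counter

def validate(token_lists, top_n=5):
    """Phase 3 sanity checks: vocabulary distribution and empty rows."""
    vocab = Counter(t for toks in token_lists for t in toks)
    empties = sum(1 for toks in token_lists if not toks)
    print("Vocabulary size:", len(vocab))
    print("Top words:", vocab.most_common(top_n))  # 'the' or 'html' here => cleaning failed
    print("Empty rows:", empties)                  # drop, or replace with [[EMPTY]]
    return vocab, empties

validate([["quick", "brown", "fox"], [], ["fox", "run"]])
```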


9. Technical Considerations

Stemming vs Lemmatization: Stemming (PorterStemmer) chops suffixes blindly (universal -> univers). Lemmatization uses a vocabulary and morphological analysis to return real dictionary forms (universities -> university). Use lemmatization for quality, stemming for speed/search.

Custom Tokens: If your text has specialized terms like #C++, standard tokenizers split this into #, C, +, +. Add special rules to the tokenizer to preserve domain terms.

Emoji: Don't just delete emojis; they carry sentiment. If needed, convert 😊 to a text token (e.g., :smile:) with an emoji-to-name mapper such as the emoji package's demojize(), or keep it as a character.
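A tiny stand-in for emoji-to-text conversion (the two-entry EMOJI_MAP is illustrative; a real mapper such as the emoji package covers the full set):

```python
# Tiny illustrative mapping -- a full mapper would cover thousands of codepoints.
EMOJI_MAP = {"😊": " :smile: ", "😡": " :angry: "}

def demojize_lite(text: str) -> str:
    """Replace known emojis with sentiment-bearing text tokens."""
    for char, name in EMOJI_MAP.items():
        text = text.replace(char, name)
    return " ".join(text.split())  # tidy up the padding spaces

print(demojize_lite("Great service 😊"))  # -> "Great service :smile:"
```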


10. Quality Gates (Definition of Done)

Checklist

  • [ ] HTML/URLs removed.
  • [ ] Text normalized (lower/lemma).
  • [ ] PII redacted (if applicable).
  • [ ] Empty strings handled.

Validation

| Criterion | Method | Threshold |
|---|---|---|
| Reduction | Char count | Reduced by 10-40% (typical) |
| Purity | Regex search | No HTML tags found |
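Both validation criteria can be checked in a few lines; the 10-40% band is this document's heuristic, not a hard rule:

```python
import re

def passes_gates(raw: str, cleaned: str) -> dict:
    """Check the two validation criteria: size reduction and HTML purity."""
    reduction = 1 - len(cleaned) / len(raw)
    return {
        "reduction": reduction,
        "reduction_ok": 0.10 <= reduction <= 0.40,          # typical band, tune per corpus
        "purity_ok": re.search(r"<[^>]+>", cleaned) is None,  # no HTML tags survived
    }

print(passes_gates("<b>Hello World!</b>", "hello world!"))
```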

11. Failure Modes & Recovery

| Failure Mode | Symptoms | Recovery Action |
|---|---|---|
| Over-Cleaning | "To be or not to be" becomes an empty string. | Reduce stopword-list aggressiveness. Keep verbs. |
| Encoding Crash | UnicodeDecodeError. | Force decode with errors='replace'. |
| Slow Processing | ~100 docs/sec. | Use nlp.pipe(texts, n_process=-1) to parallelize in spaCy. |

12. Copy-Paste Prompt

TIP

One-Click Agent Invocation Copy the prompt below, replace placeholders, and paste into your agent.

text
Role: Act as a Senior NLP Engineer.
Task: Execute the Clean Text Data workflow.

## Objective & Scope
- **Goal**: Transform raw, noisy text into structured, numerical-ready tokens.
- **Scope**: Encoding, Regex cleaning, Tokenization, Normalization (Stem/Lemma).

## Inputs
- [ ] RAW_TEXT: Dataset/Column to process.
- [ ] LANGUAGE: Language code (e.g., en_core_web_sm).
- [ ] LOWERCASE: Boolean toggle for case normalization.

## Output Artifacts
- [ ] Cleaned Tokens (List[String])
- [ ] Preprocessing Pipeline (Python Function)

## Execution Steps
1. **Structure**
   - Remove artifacts (HTML, control chars) and fix encoding issues. Redact PII.
2. **NLP**
   - Apply spaCy/NLTK pipeline: Tokenize -> Lowercase -> Lemmatize -> Stopword Check.
3. **Verify**
   - Check vocabulary distribution and print "Before vs After" samples.

## Quality Gates
- [ ] HTML/URLs removed.
- [ ] Text normalized (lower/lemma).
- [ ] PII redacted.
- [ ] Empty strings handled.

## Failure Handling
- If blocked, output a "Clarification Brief" detailing missing info or blockers.

## Constraints
- **Privacy**: Scrub PII (Emails/Phones).
- **Technical**: Preserve domain terms (e.g., C++); Handle Emoji.

## Command
Now execute this workflow step-by-step.

Appendix: Change Log

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2026-01-14 | AI Engineering Team | Initial release |
