Clean Text Data
1. Objective
The objective of this workflow is to transform raw, noisy text (human language) into a structured, numerical-ready format. "Garbage In, Garbage Out" is doubly true for NLP. If you feed HTML tags, emojis, and stop words into a TF-IDF or Embedding model without intent, your results will be noise. This workflow ensures standardization.
2. Context & Scope
In Scope
This workflow covers Encoding normalization (Unicode), Regex cleaning (Removing URLs/HTML), Tokenization, Stopword removal, and Lemmatization/Stemming.
Assumption: You are using standard NLP libraries (NLTK, spaCy, HuggingFace).
Out of Scope
- Embedding Generation: Converting text to vectors is "Create Text Embeddings". This workflow stops at the "Text/Token" stage.
- OCR: Extracting text from images is a Computer Vision task.
3. When to Use / When Not to Use
✅ Use This Workflow When
- Building a Topic Model or Sentiment Analyzer.
- Fine-tuning a BERT model (requires specific tokenizer).
- Cleaning scraped web data.
❌ Do NOT Use This Workflow When
- Using an LLM (GPT-4) that expects raw natural language (Prompt Engineering). Over-cleaning (removing Stopwords) hurts LLM performance as it breaks grammar.
- Your data is code (Requires AST parsing, not English tokenization).
4. Inputs (Required/Optional)
Required Inputs
| Input | Description | Format | Example |
|---|---|---|---|
| RAW_TEXT | Input strings. | List/Column | ["User uploaded <br> img...", "Hello World!"] |
| LANGUAGE | Source language. | Code | en (spaCy model: en_core_web_sm) |
Optional Inputs
| Input | Description | Default | Condition |
|---|---|---|---|
| LOWERCASE | Convert to lower? | True | False for NER (Names matter). |
5. Outputs (Artifacts)
| Artifact | Format | Destination | Quality Criteria |
|---|---|---|---|
| Cleaned Text | List[String] | Dataframe | No artifacts (HTML, control chars). |
| Cleaning Func | Python Func | Repo | Reproducible pipeline. |
6. Operating Modes
⚡ Fast Mode
Timebox: 15 minutes. Scope: Regex only.
Details: Use Python string methods and regex to strip whitespace, lowercase, and remove special characters. Fast, but low linguistic awareness.
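A minimal sketch of Fast Mode: one regex-only function, no NLP library. The patterns are illustrative, not exhaustive (real-world HTML and URLs are messier).

```python
import re

def fast_clean(text: str) -> str:
    """Regex-only cleaning: no linguistic awareness, just noise removal."""
    text = text.lower()                          # case-fold
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"<[^>]+>", " ", text)         # strip HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text)     # remove special characters
    return re.sub(r"\s+", " ", text).strip()     # collapse whitespace

print(fast_clean("Visit <b>OUR</b> site: https://x.com!!"))
```

Note the order matters: URLs must be removed before the special-character pass, or `https://x.com` degrades into the tokens `https x com`.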
🎯 Standard Mode (Default)
Timebox: 1 hour. Scope: Linguistic pipeline (spaCy).
Details: Load a spaCy pipeline. Perform tokenization, lemmatization (running -> run), and stopword removal. Keeps semantics intact.
🔬 Deep Mode
Timebox: 4 hours. Scope: Domain-specific.
Details: Build custom stopword lists (e.g., removing "Inc", "Corp" for finance). Handle custom entities (Twitter handles like @user). Fix spelling errors.
7. Constraints & Guardrails
Technical Constraints
- Destructive: Stopword removal is destructive. You cannot reconstruct the sentence "To be or not to be" if you remove "to", "be", "or", "not". Save raw data.
- Performance: Complex regex over gigabytes of text is slow. Use vectorized string operations (Pandas/Polars) or parallel processing (Spark).
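To illustrate the performance point, here is a hedged sketch of vectorized string cleaning in Pandas: `.str` methods run one pass over the whole column instead of a Python-level loop per row. The sample strings and patterns are invented for the example.

```python
import pandas as pd

s = pd.Series(["Hello <b>World</b>\t", "  see https://x.io  "])

# Vectorized: chained .str operations over the whole column, no explicit loop.
clean = (
    s.str.replace(r"<[^>]+>", " ", regex=True)      # strip HTML tags
     .str.replace(r"https?://\S+", " ", regex=True)  # drop URLs
     .str.replace(r"\s+", " ", regex=True)           # collapse whitespace
     .str.strip()
)
print(clean.tolist())
```

For data that no longer fits in memory, the same chain translates almost verbatim to Polars expressions or a Spark UDF-free column pipeline.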
Security & Privacy
CAUTION
PII Scrubbing: Text data often contains emails and phone numbers. The cleaning step is the best place to run PII redaction (regex-replace email -> [[EMAIL]]). Do this before any other processing, so downstream artifacts never contain raw PII.
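A minimal redaction sketch for the caution above. The two patterns are deliberately simple illustrations; production PII detection needs vetted patterns or a dedicated library.

```python
import re

# Illustrative patterns only -- real-world email/phone formats are broader.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace emails and phone-like numbers with placeholder tokens."""
    text = EMAIL.sub("[[EMAIL]]", text)
    return PHONE.sub("[[PHONE]]", text)

print(redact_pii("Mail jane@corp.com or call +1 555-123-4567."))
```

Using distinct tokens like `[[EMAIL]]` (rather than deleting the match) preserves sentence structure and lets you audit how much PII was found.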
Compliance
- Bias: Be careful when removing "rare words". You might accidentally purge minority names or dialects, biasing the model.
8. Procedure
Phase 1: Structural Cleaning (Noise)
Objective: Remove machine artifacts.
- Input RAW_TEXT.
- Fix encoding: text.encode('utf-8', 'ignore').decode('utf-8') (drops bytes that would otherwise surface as mojibake).
- Remove HTML: use BeautifulSoup or the regex <.*?> to strip tags.
- Standardize whitespace: replace tabs/newlines with a single space.
- Protect PII: regex-replace emails/phones with tokens if required.
Verify: Text is readable. No <div class...> remains.
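The Phase 1 steps can be sketched as one function, starting from raw bytes so the encoding fix is explicit. A sketch under simple assumptions: regex-based tag stripping (BeautifulSoup is more robust on malformed HTML).

```python
import html
import re

def clean_structure(raw: bytes) -> str:
    """Phase 1: fix encoding, strip markup, standardize whitespace."""
    text = raw.decode("utf-8", errors="ignore")   # drop undecodable bytes
    text = html.unescape(text)                    # &amp; -> &
    text = re.sub(r"<[^>]+>", " ", text)          # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)     # drop URLs
    text = re.sub(r"\s+", " ", text)              # tabs/newlines -> one space
    return text.strip()

print(clean_structure(b"User uploaded <br> img\xff\n\nsee https://cdn.io/x"))
```

Note this stage deliberately keeps case and punctuation: those are linguistic signals handled in Phase 2, not machine artifacts.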
Phase 2: Tokenization & Normalization
Objective: Split and Standardize.
Load the NLP tool (e.g., nlp = spacy.load("en_core_web_sm")). Process the stream (nlp.pipe(texts) for speed). Iterate tokens:
- Case: .lower() (unless collecting Named Entities).
- Lemmatize: am, are, is -> be; dogs -> dog. (Better than Stemming.)
- Filter: remove token.is_stop, token.is_punct, token.is_digit (based on use case).
Verify: "The quick brown foxes..." -> ["quick", "brown", "fox"].
Phase 3: Validation
Objective: Check distribution.
Calculate vocabulary size. Check the top-N words: if the top word is still "the" or "html", your cleaning failed. Check for empty rows: did cleaning reduce a string to ""? Handle these (drop them, or replace with [[EMPTY]]).
Verify: Vocabulary is cleaner. Noise words removed.
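The three validation checks above fit in a few lines with a stdlib `Counter`. The `docs` sample is invented to show each check firing.

```python
from collections import Counter

docs = [["quick", "brown", "fox"], ["lazy", "dog"], []]  # toy cleaned corpus

vocab = Counter(tok for doc in docs for tok in doc)
print(len(vocab))                 # vocabulary size
print(vocab.most_common(2))       # top-N words: noise check
empty = sum(1 for doc in docs if not doc)
print(empty)                      # rows cleaned down to nothing
```

If `most_common` still surfaces "the", "html", or "div", loop back to Phase 1 or 2 before generating any features.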
9. Technical Considerations
Stemming vs Lemmatization: Stemming (PorterStemmer) chops suffixes blindly (universal, university, and universe all collapse to univers), merging unrelated words. Lemmatization uses a vocabulary and morphological analysis (was -> be, mice -> mouse) and keeps real words distinct. Use lemmatization for quality, stemming for speed/search.
Custom Tokens: If your text has specialized terms like C++ or #hashtag, standard tokenizers may split them into C, +, + or #, hashtag. Add special rules to the tokenizer to preserve domain terms.
Emoji: Don't just delete emojis; they carry sentiment. Convert them to text aliases (e.g., 😊 -> :smiling_face_with_smiling_eyes: via the emoji library's demojize) if needed, or keep them as characters.
10. Quality Gates (Definition of Done)
Checklist
- [ ] HTML/URLs removed.
- [ ] Text normalized (lower/lemma).
- [ ] PII redacted (if applicable).
- [ ] Empty strings handled.
Validation
| Criterion | Method | Threshold |
|---|---|---|
| Reduction | Char Count | Reduced by 10-40% (typical) |
| Purity | Regex Search | No HTML tags found |
11. Failure Modes & Recovery
| Failure Mode | Symptoms | Recovery Action |
|---|---|---|
| Over-Cleaning | "To be or not to be" becomes empty string. | Reduce Stopword list aggressiveness. Keep verbs and negations. |
| Encoding Crash | UnicodeDecodeError. | Force decode with errors='replace'. |
| Slow Processing | Throughput stuck around 100 docs/sec. | Use nlp.pipe(texts, n_process=-1) to parallelize in spaCy. |
12. Copy-Paste Prompt
TIP
One-Click Agent Invocation Copy the prompt below, replace placeholders, and paste into your agent.
text
Role: Act as a Senior NLP Engineer.
Task: Execute the Clean Text Data workflow.
## Objective & Scope
- **Goal**: Transform raw, noisy text into structured, numerical-ready tokens.
- **Scope**: Encoding, Regex cleaning, Tokenization, Normalization (Stem/Lemma).
## Inputs
- [ ] RAW_TEXT: Dataset/Column to process.
- [ ] LANGUAGE: Language code (e.g., en_core_web_sm).
- [ ] LOWERCASE: Boolean toggle for case normalization.
## Output Artifacts
- [ ] Cleaned Tokens (List[String])
- [ ] Preprocessing Pipeline (Python Function)
## Execution Steps
1. **Structure**
- Remove artifacts (HTML, control chars) and fix encoding issues. Redact PII.
2. **NLP**
- Apply Spacy/NLTK pipeline: Tokenize -> Lowercase -> Lemmatize -> Stopword Check.
3. **Verify**
- Check vocabulary distribution and print "Before vs After" samples.
## Quality Gates
- [ ] HTML/URLs removed.
- [ ] Text normalized (lower/lemma).
- [ ] PII redacted.
- [ ] Empty strings handled.
## Failure Handling
- If blocked, output a "Clarification Brief" detailing missing info or blockers.
## Constraints
- **Privacy**: Scrub PII (Emails/Phones).
- **Technical**: Preserve domain terms (e.g., C++); Handle Emoji.
## Command
Now execute this workflow step-by-step.
Appendix: Change Log
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0.0 | 2026-01-14 | AI Engineering Team | Initial release |