Text Normalization
The Wolfram Language provides powerful knowledge-based tools for normalizing text in preparation for text analysis, visualization, etc.
Character-Level Normalization
ToLowerCase, ToUpperCase — convert all characters to lowercase or uppercase
IgnoreCase — option to ignore case of letters
RemoveDiacritics — remove diacritics such as accents, umlauts, etc.
CharacterNormalize — reduce or decompose characters to normal forms (e.g. ¼ → 1⁄4, ï → i followed by a combining diaeresis)
Transliterate — transliterate to ASCII or other writing scripts
PrintableASCIIQ — test if a string contains only printable ASCII characters
CharacterEncoding — specify the character encoding to assume
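As an illustration of how the character-level functions above can be chained, here is a minimal sketch. The sample string is hypothetical and chosen only to show case folding and diacritic removal.

```wolfram
(* Hypothetical sample string; not from the original page *)
raw = "Crème Brûlée";

(* Fold case first, then strip accents *)
lower = ToLowerCase[raw]
(* "crème brûlée" *)

plain = RemoveDiacritics[lower]
(* "creme brulee" *)

(* Check whether the result is restricted to printable ASCII *)
PrintableASCIIQ[plain]
(* True *)

PrintableASCIIQ[raw]
(* False *)

(* IgnoreCase is an option to string functions, not a function itself *)
StringMatchQ["WOLFRAM", "wolfram", IgnoreCase -> True]
(* True *)
```

Folding case before removing diacritics is an arbitrary choice here; the two steps are independent and can be applied in either order.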
Structural String Normalization
StringSplit — split a string at newlines or other delimiters
StringDelete — delete substrings or patterns
StringReplace — replace substrings or patterns
StringDrop ▪ StringTake ▪ StringCases
StringTrim — trim whitespace or other patterns from strings
StringPadLeft, StringPadRight — pad to fixed width
StringExtract — extract specified parts of strings
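The structural functions above are typically combined to clean delimited text. The following is a small sketch on a made-up record; the input string and the field width of 6 are assumptions for illustration.

```wolfram
(* Hypothetical sample record; not from the original page *)
raw = "  alpha, beta ,gamma  ";

(* Split at commas, then trim surrounding whitespace from each field *)
fields = StringTrim /@ StringSplit[raw, ","]
(* {"alpha", "beta", "gamma"} *)

(* Pad each field on the right with spaces to a fixed width of 6 *)
StringPadRight[#, 6] & /@ fields
(* {"alpha ", "beta  ", "gamma "} *)

(* Delete every run of digits *)
StringDelete["a1b22c", DigitCharacter ..]
(* "abc" *)

(* Replace a literal substring *)
StringReplace["colour", "our" -> "or"]
(* "color" *)
```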
Text-Level Normalization
TextSentences — extract a list of sentences
TextWords — extract a list of words
DeleteStopwords — delete standard stopwords ("the", "and", etc.)
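A short sketch of the text-level functions above, applied to a hypothetical two-sentence string. The output of DeleteStopwords is not shown because it depends on the built-in stopword list.

```wolfram
(* Hypothetical sample text; not from the original page *)
text = "The cat sat on the mat. The dog barked.";

(* Segment into sentences *)
TextSentences[text]
(* {"The cat sat on the mat.", "The dog barked."} *)

(* Segment into words; punctuation is dropped *)
words = TextWords[text]
(* {"The", "cat", "sat", "on", "the", "mat", "The", "dog", "barked"} *)

(* Remove common function words such as "the" and "on" *)
DeleteStopwords[words]
```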
Content Extraction
TextCases — extract symbolically specified elements
Containing ▪ Alternatives ▪ Entity
Morphological & Linguistic Normalization
WordStem — reduce a word to its stem
DictionaryLookup — look up a word in dictionaries
Interpreter — convert to many forms from natural language
SpellingCorrectionList — list of spelling suggestions for misspelled words
DictionaryWordQ — test if a word is a correctly spelled dictionary word
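The following sketch shows how stemming and dictionary checks from the list above might be used on individual words. The input words are arbitrary examples, and the suggestions returned by SpellingCorrectionList are left unshown since they depend on the installed dictionary.

```wolfram
(* Reduce inflected forms to a common stem *)
WordStem["running"]
(* "run" *)

(* Test whether a string is a correctly spelled dictionary word *)
DictionaryWordQ["hello"]
(* True *)

DictionaryWordQ["helllo"]
(* False *)

(* Ask for candidate corrections for the misspelled form *)
SpellingCorrectionList["helllo"]
```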
Language Translation
LanguageIdentify — identify what language a text is in
WordTranslation — give translations for a word
TextTranslation — translate text using an integrated external service
Word List Normalization
AlphabeticSort — sort strings into alphabetic order
WordCounts ▪ LetterCounts ▪ CharacterCounts
WordFrequency — frequency of words or n-grams in text
WordFrequencyData — data on overall word frequencies in typical text
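A minimal sketch of sorting and counting with the functions above. The inputs are hypothetical; the counting functions return associations, so individual counts are looked up by key rather than relying on the order of the result.

```wolfram
(* Sort in dictionary order rather than by character code *)
AlphabeticSort[{"banana", "Apple", "cherry"}]
(* {"Apple", "banana", "cherry"} *)

(* Count word occurrences; the result is an association keyed by word *)
wc = WordCounts["to be or not to be"];
wc["to"]
(* 2 *)
wc["or"]
(* 1 *)

(* Count letter occurrences in the same way *)
lc = LetterCounts["banana"];
lc["a"]
(* 3 *)
```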
LLM-Based Normalization
LLMResourceFunction — apply operations from the Wolfram Prompt Repository
LLMExampleFunction ▪ LLMFunction ▪ LLMTool ▪ ...
Normalization of External Data
Import — import data from files or the web
"Text", "PDF", "TeX", "HTML" — pick out plain text, table data, etc.
ImportString — convert a string with a particular external format
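As a sketch of normalizing external data that is already held in a string, ImportString can parse it according to a named format. The CSV and HTML snippets below are made up for illustration, and the HTML result is left unshown because the exact whitespace of the extracted text may vary.

```wolfram
(* Hypothetical CSV content held in a string *)
csv = "name,qty\nwidget,3";

(* Parse as CSV; numeric fields are converted to numbers *)
ImportString[csv, "CSV"]
(* {{"name", "qty"}, {"widget", 3}} *)

(* Pick out the plain text of a hypothetical HTML fragment *)
ImportString["<p>Hello <b>world</b></p>", {"HTML", "Plaintext"}]
```

The same format names can be given to Import when the data lives in a file or at a URL rather than in a string.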