Text Normalization

The Wolfram Language provides powerful knowledge-based tools for normalizing text in preparation for text analysis, visualization, etc.

Character-Level Normalization

ToLowerCase, ToUpperCase convert all characters to lowercase or uppercase

IgnoreCase option to ignore case of letters

RemoveDiacritics remove diacritics such as accents, umlauts, etc.

CharacterNormalize reduce or decompose characters to normal forms (e.g. ¼ → 1⁄4, ï → i + combining ¨)

Transliterate transliterate to ASCII or other writing scripts

PrintableASCIIQ test if a string contains only printable ASCII characters

CharacterEncoding specify the character encoding to assume
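
As a rough illustration of how the character-level functions above can be chained, the following sketch folds case and accents before comparing strings. The sample string and the variable names (s, folded, decomposed) are made up for this example.

```wolfram
(* Sample input; the string and all variable names are illustrative assumptions *)
s = "Crème Brûlée À La Carte";

(* Strip accents first, then fold case *)
folded = ToLowerCase[RemoveDiacritics[s]];
(* "creme brulee a la carte" *)

(* After folding, only printable ASCII characters remain *)
asciiOnly = PrintableASCIIQ[folded];
(* True *)

(* NFD splits a precomposed letter into base letter + combining mark *)
decomposed = CharacterNormalize["ï", "NFD"];
StringLength[decomposed]
(* 2 *)

(* IgnoreCase is an option of string-matching functions such as StringMatchQ *)
sameWord = StringMatchQ["HELLO", "hello", IgnoreCase -> True];
(* True *)

(* Transliterate with one argument targets ASCII-compatible Latin script *)
latin = Transliterate["Москва"];
```

Removing diacritics before lowercasing is a choice made for this sketch; for this input the two steps commute.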

Structural String Normalization

StringSplit split a string at newlines or other delimiters

StringDelete delete substrings or patterns

StringReplace replace substrings or patterns

StringDrop  ▪  StringTake  ▪  StringCases

StringTrim trim whitespace or other patterns from strings

StringPadLeft, StringPadRight pad to fixed width

StringExtract extract specified parts of strings
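
A minimal sketch of structural cleanup using the functions above, assuming a small block of "key: value" lines with irregular spacing. The input text and the variable names are hypothetical.

```wolfram
(* Hypothetical raw input with stray whitespace and a blank line *)
raw = "  id: 42 \n name:   Ada  \n\n role: engineer  ";

(* Split at newlines, trim each line, drop the empty ones *)
lines = DeleteCases[StringTrim /@ StringSplit[raw, "\n"], ""];

(* Collapse internal runs of whitespace to a single space *)
clean = StringReplace[lines, Whitespace -> " "];
(* {"id: 42", "name: Ada", "role: engineer"} *)

(* Pull out the part before and after the ": " separator *)
keys = StringExtract[#, ": " -> 1] & /@ clean;
values = StringExtract[#, ": " -> 2] & /@ clean;
(* values is {"42", "Ada", "engineer"} *)

(* Pad keys to a fixed width of 6 characters for aligned display *)
padded = StringPadRight[#, 6] & /@ keys;

(* Delete all digit characters from a string *)
noDigits = StringDelete["abc123def", DigitCharacter];
(* "abcdef" *)
```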

Text-Level Normalization

TextSentences extract a list of sentences

TextWords extract a list of words

DeleteStopwords delete standard stopwords ("the", "and", etc.)
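
The text-level functions above are often applied in sequence. This sketch, with an invented sample text, segments it into sentences and words and then drops stopwords; exact word segmentation may vary between versions, so no full output is shown.

```wolfram
(* Invented sample text for illustration *)
text = "The cat sat on the mat. It was a sunny day, and the birds sang.";

(* Sentence segmentation: this text contains two sentences *)
sentences = TextSentences[text];

(* Word segmentation; punctuation is not returned as words *)
words = TextWords[text];

(* Lowercase first so stopword matching does not depend on capitalization *)
content = DeleteStopwords[ToLowerCase[words]];
```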

Content Extraction

TextCases extract symbolically specified elements

Containing  ▪  Alternatives  ▪  Entity
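
A hedged sketch of content extraction with TextCases. The sample sentence is invented, the content types "Country" and "City" are assumed to be available in the installed version, and the first evaluation may need to download model data, so no outputs are shown.

```wolfram
(* Invented sample text *)
text = "Marie moved from Paris to Canada in 2019. She now works in Toronto.";

(* Extract substrings identified as countries *)
countries = TextCases[text, "Country"];

(* Alternatives (|) asks for several content types at once *)
places = TextCases[text, "City" | "Country"];

(* Request the interpretation (typically an Entity) rather than the raw string *)
entities = TextCases[text, "Country" -> "Interpretation"];

(* Containing selects outer units that contain an inner type:
   here, whole sentences that mention a country *)
mentions = TextCases[text, Containing["Sentence", "Country"]];
```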

Morphological & Linguistic Normalization

WordStem reduce a word to its stem

DictionaryLookup look up a word in dictionaries

Interpreter convert to many forms from natural language

SpellingCorrectionList list of spelling suggestions for misspelled words

DictionaryWordQ test if a word is a correctly spelled dictionary word
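
One way the functions above might be combined is to check spelling before stemming. The word list below is an assumption made for this sketch, and spelling suggestions depend on the installed dictionaries, so they are not listed.

```wolfram
(* Reduce inflected forms to a common stem *)
stems = WordStem /@ {"running", "connected", "cats"};
(* {"run", "connect", "cat"} *)

(* Test spelling against the dictionary *)
good = DictionaryWordQ["receive"];
bad = DictionaryWordQ["recieve"];

(* Ask for suggestions only when the word is not in the dictionary *)
suggestions = If[bad, {}, SpellingCorrectionList["recieve"]];

(* Look up dictionary words matching a string pattern *)
matches = DictionaryLookup["recei" ~~ ___];

(* Interpret a number written out in words *)
n = Interpreter["SemanticNumber"]["forty-two"];
```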

Language Translation

LanguageIdentify identify what language a text is in

WordTranslation give translations for a word

TextTranslation translate text using an integrated external service
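
A brief sketch of the translation functions above. The sample strings are invented; TextTranslation calls an external service that may require credentials or service credits, and WordTranslation may download data on first use, so results are not shown here.

```wolfram
(* Identify the language of a text; the result is a language entity *)
lang = LanguageIdentify["Bonjour tout le monde"];

(* List possible translations of a single word into French *)
wordOptions = WordTranslation["hello", "French"];

(* Translate running text from English into French via the external service *)
translated = TextTranslation["Good morning, everyone.", "English" -> "French"];
```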

Word List Normalization

AlphabeticSort sort strings into alphabetic order

WordCounts  ▪  LetterCounts  ▪  CharacterCounts

WordFrequency frequency of words or -grams in text

WordFrequencyData data on overall word frequencies in typical text
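
The counting and sorting functions above can be illustrated on a short phrase. The phrase and variable names are chosen for this sketch only; WordFrequencyData is left commented out because it retrieves data from an external source.

```wolfram
phrase = "to be or not to be";

(* Association of word -> count *)
counts = WordCounts[phrase];
counts["to"]
(* 2 *)

(* Association of letter -> count for a single word *)
letters = LetterCounts["banana"];
letters["a"]
(* 3 *)

(* Alphabetic order that does not put capitals ahead of all lowercase letters *)
sorted = AlphabeticSort[{"zebra", "Apple", "mango"}];
(* {"Apple", "mango", "zebra"} *)

(* Fraction of the words in the phrase that are "be" *)
freq = WordFrequency[phrase, "be"];

(* Overall frequency in typical English text; needs an internet connection *)
(* typical = WordFrequencyData["be"]; *)
```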

LLM-Based Normalization

LLMResourceFunction apply operations from the Wolfram Prompt Repository

LLMExampleFunction  ▪  LLMFunction  ▪  LLMTool  ▪  ...
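
A tentative sketch of LLM-based normalization. It assumes an LLM service has already been configured for the session; the prompt wording, the example pairs, and the variable names are all invented for illustration, and LLM output is not deterministic, so no results are shown.

```wolfram
(* Template-based function: `` marks the slot for the argument *)
expand = LLMFunction[
   "Rewrite the following text with all abbreviations expanded. Return only the rewritten text: ``"];
expanded = expand["Dr. Smith lives on Elm St. in NYC."];

(* Example-based function: the input -> output pairs define the task *)
normalizeDate = LLMExampleFunction[{
    "3rd of March, 2021" -> "2021-03-03",
    "Dec 25 1999" -> "1999-12-25"}];
isoDate = normalizeDate["July 4th, 1976"];
```

Because model output can vary between calls, results of this kind are usually validated (for example, with StringMatchQ against the expected pattern) before being used downstream.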

Normalization of External Data

Import import data from files or the web

"Text", "PDF", "TeX", "HTML" pick out plain text, table data, etc.

ImportString convert a string with a particular external format
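
As a small illustration of normalizing external data, this sketch uses ImportString on inline strings so that it does not depend on any file. The HTML and CSV snippets are invented, and the commented Import call uses a hypothetical file name.

```wolfram
(* Invented HTML fragment *)
html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";

(* Strip the markup and keep the plain text *)
plain = ImportString[html, {"HTML", "Plaintext"}];

(* Parse CSV text into rows; numeric fields become numbers *)
rows = ImportString["a,b\n1,2", "CSV"];
(* {{"a", "b"}, {1, 2}} *)

(* The same elements apply to files and URLs; "report.pdf" is a hypothetical path *)
(* pdfText = Import["report.pdf", "Plaintext"]; *)
```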