Wolfram Language & System 10.4 (2016) | Legacy Documentation

This is documentation for an earlier version of the Wolfram Language. The current documentation covers Version 11.2.

Text Normalization

The Wolfram Language provides powerful knowledge-based tools for normalizing text in preparation for text analysis, visualization, etc.


Character-Level Normalization

ToLowerCase, ToUpperCase convert all characters to lower, upper case

IgnoreCase option to ignore case of letters

RemoveDiacritics remove diacritics such as accents, umlauts, etc.

Transliterate transliterate to ASCII or other writing scripts

PrintableASCIIQ test if a string contains only printable ASCII characters

CharacterEncoding specify the character encoding to assume
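
As a minimal sketch of how the character-level functions above might be combined, the following inputs use made-up sample strings; the outputs noted in comments are what these calls are expected to return.

```wolfram
(* Case conversion *)
ToLowerCase["Hello World"]     (* "hello world" *)
ToUpperCase["Hello World"]     (* "HELLO WORLD" *)

(* IgnoreCase is an option to string functions such as StringMatchQ *)
StringMatchQ["ABC", "abc", IgnoreCase -> True]     (* True *)

(* Strip accents, then check that only printable ASCII remains *)
PrintableASCIIQ["café"]                      (* False *)
PrintableASCIIQ[RemoveDiacritics["café"]]    (* True *)

(* Transliterate a non-Latin string to Latin characters *)
Transliterate["Москва"]
```

Applying RemoveDiacritics or Transliterate before PrintableASCIIQ is one way to check whether a string has been fully reduced to ASCII.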

Structural String Normalization

StringSplit split a string at newlines or other delimiters

StringDelete delete substrings or patterns

StringReplace replace substrings or patterns

StringDrop  ▪  StringTake  ▪  StringCases

StringTrim trim whitespace or other patterns from strings

StringPadLeft, StringPadRight pad to fixed width

StringExtract extract specified parts of strings
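
The structural functions above can be sketched on short sample strings as follows. The strings are invented for illustration, and the comments give the expected results.

```wolfram
(* Split at a delimiter, and at newlines *)
StringSplit["a,b,c", ","]        (* {"a", "b", "c"} *)
StringSplit["a\nb\nc", "\n"]     (* {"a", "b", "c"} *)

(* Delete or replace substrings and patterns *)
StringDelete["a1b2", DigitCharacter]     (* "ab" *)
StringReplace["a-b-c", "-" -> "_"]       (* "a_b_c" *)

(* Take, drop, and find substrings *)
StringTake["abcdef", 3]                (* "abc" *)
StringDrop["abcdef", 2]                (* "cdef" *)
StringCases["a1b22", DigitCharacter..] (* {"1", "22"} *)

(* Trim whitespace and pad to a fixed width *)
StringTrim["  hi  "]              (* "hi" *)
StringPadLeft["7", 3, "0"]        (* "007" *)
StringPadRight["ab", 4, "."]      (* "ab.." *)

(* Extract the second whitespace-separated block *)
StringExtract["a bb ccc", 2]      (* "bb" *)
```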

Text-Level Normalization

TextSentences extract a list of sentences

TextWords extract a list of words

DeleteStopwords delete standard stopwords ("the", "and", etc.)
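
A minimal sketch of the text-level functions, using invented sample text. The comments show the results these calls are expected to give.

```wolfram
text = "Hello there. How are you?";

(* Segment into sentences and into words *)
TextSentences[text]     (* {"Hello there.", "How are you?"} *)
TextWords[text]         (* {"Hello", "there", "How", "are", "you"} *)

(* Remove common function words from a string *)
DeleteStopwords["The quick brown fox jumps over the lazy dog"]
(* "quick brown fox jumps lazy dog" *)
```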

Content Extraction

TextCases extract symbolically specified elements

Containing  ▪  Alternatives  ▪  Entity

Morphological & Linguistic Normalization

LanguageIdentify identify what language a text is in

WordTranslation give translations for a word

WordStem reduce a word to its stem

DictionaryLookup look up a word in dictionaries

Interpreter convert to many forms from natural language

SpellingCorrectionList list of spelling suggestions for misspelled words

DictionaryWordQ test if a word is a correctly spelled dictionary word
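
The calls below sketch typical uses of the linguistic functions listed above. The inputs are invented examples, and no outputs are shown because the results depend on the language data available to the system, some of which may need to be downloaded on first use.

```wolfram
(* Identify the language of a piece of text *)
LanguageIdentify["Bonjour, comment allez-vous?"]

(* Reduce a word to its stem *)
WordStem["running"]

(* Check spelling, and list suggestions for a misspelled word *)
DictionaryWordQ["hello"]
SpellingCorrectionList["wether"]

(* Translate a word; the target language is given by name *)
WordTranslation["hello", "French"]

(* Interpret a natural-language string as a typed object *)
Interpreter["Date"]["January 5, 2016"]
```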

Word List Normalization

AlphabeticSort sort strings into alphabetic order

WordCounts  ▪  LetterCounts  ▪  CharacterCounts

WordFrequency frequency of words or n-grams in text

WordFrequencyData data on overall word frequencies in typical text
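
As a rough illustration of the word-list functions, the sketch below works on a short made-up string. Counts are looked up by key so that the results do not depend on how ties are ordered in the returned associations.

```wolfram
text = "to be or not to be";

(* Counts are returned as associations keyed by word or letter *)
WordCounts[text]["be"]          (* 2 *)
LetterCounts["banana"]["a"]     (* 3 *)

(* Fraction of the words in the text that are "be" *)
WordFrequency[text, "be"]

(* Alphabetic ordering of a list of strings *)
AlphabeticSort[{"banana", "apple", "Cherry"}]
(* {"apple", "banana", "Cherry"} *)

(* Overall frequency of a word in typical text; requires data access *)
WordFrequencyData["language"]
```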

Normalization of External Data

Import import data from files or the web

"Text", "PDF", "TeX", "HTML" pick out plaintext, table data, etc.

ImportString convert a string with a particular external format
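
A short sketch of ImportString on in-memory data, which avoids any dependence on files or network access. The sample strings are invented; Import accepts the same format and element specifications when given a file path or URL.

```wolfram
(* Parse CSV text into a list of rows *)
ImportString["a,b\n1,2", "CSV"]     (* {{"a", "b"}, {1, 2}} *)

(* Read plain text as a list of lines *)
ImportString["line one\nline two", {"Text", "Lines"}]
(* {"line one", "line two"} *)

(* Pick out the plaintext content of an HTML fragment *)
ImportString["<p>Hello <b>world</b></p>", {"HTML", "Plaintext"}]
```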