Processing Textual Data

Mathematica has flexible capabilities for processing large volumes of textual data. Most often, data represented as a string is converted to lists or other expressions, which can then be manipulated with Mathematica's symbolic language.
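As a minimal sketch of that string-to-list workflow, the following splits a string into words and tallies them. The sample text and the variable names (text, words, counts) are hypothetical and chosen only for illustration.

```mathematica
(* Hypothetical sample text; not taken from any real data source *)
text = "the cat sat on the mat and the dog sat too";

(* Convert the string to a list of words; with one argument,
   StringSplit splits at whitespace *)
words = StringSplit[text];

(* Tally gives {word, count} pairs; sort them by descending count *)
counts = SortBy[Tally[words], -Last[#] &]
```

Once the text is a list, general list functions such as Select, Map, and Sort apply to it directly.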


Import: import data from files or the web

"Text", "PDF", "TeX", "HTML": pick out plain text, table data, etc.

FindList: search files for records containing particular strings

StringSplit: split a string into words, sentences, etc.

StringCount: count occurrences of words etc.

StringCases: find instances of a string pattern

StringExpression: match symbolic string patterns

Sort: sort into alphabetical order

Tally: tally numbers of identical strings

Nearest: find the closest-matching string from a list

FindClusters: find clusters in string data

EditDistance: edit or Levenshtein distance

SequenceAlignment: find matching sequences in strings

Hash: find a hash code using a variety of schemes

DictionaryLookup: look up words in English and other dictionaries

WordData: find semantic, grammatical, morphological, etc. properties of words
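
A few of the functions listed above can be combined as in the sketch below. The sample sentence, the word list, and the variable names are hypothetical; the date pattern is just one possible StringExpression and is not meant as a robust date parser.

```mathematica
(* Hypothetical sample sentence *)
sentence = "Order 66 shipped on 2021-03-04; order 67 ships on 2021-03-09.";

(* StringCases with a symbolic string pattern (a StringExpression built with ~~):
   runs of digits separated by hyphens *)
dates = StringCases[sentence,
   DigitCharacter .. ~~ "-" ~~ DigitCharacter .. ~~ "-" ~~ DigitCharacter ..]

(* StringCount, here ignoring case *)
StringCount[sentence, "order", IgnoreCase -> True]

(* Edit (Levenshtein) distance between two strings *)
EditDistance["kitten", "sitting"]

(* Closest-matching string from a hypothetical list of candidates *)
Nearest[{"apple", "maple", "orange"}, "appel"]
```

For string data, Nearest relies on a string distance such as the edit distance, so it can serve as a simple starting point for spelling correction against a word list.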