Processing Textual Data

The Wolfram Language has uniquely flexible capabilities for processing large volumes of textual data. Most often, data represented as a string is converted to lists or other constructs, which can then be manipulated using the Wolfram Language's powerful symbolic programming constructs.
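As a minimal sketch of that workflow, the following splits a string into words and then treats the result as an ordinary list. The sample sentence and the variable names text and words are illustrative assumptions, not taken from this page.

```wolfram
(* Illustrative sample text; any string would do *)
text = "the quick brown fox jumps over the lazy dog the end";

(* Convert the string to a list of words, splitting on whitespace *)
words = StringSplit[text];

(* Once the data is a list, general list functions apply *)
Counts[words]                    (* association: word -> number of occurrences *)
Sort[DeleteDuplicates[words]]    (* distinct words in alphabetical order *)

(* Some questions can also be answered directly on the string *)
StringCount[text, "the"]         (* number of occurrences of the substring "the" *)
```

Splitting first and counting afterwards keeps the string-specific step small; everything after StringSplit is ordinary list manipulation.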

Reference

Import import data from files or the web

"Text", "PDF", "TeX", "HTML" pick out plain text, table data, etc.

FindList search files for records containing particular strings

StringSplit split a string into words, sentences, etc.

StringCount count occurrences of words etc.

StringCases find instances of a string pattern

StringExpression match symbolic string patterns

Sort sort into alphabetical order

Counts give counts of how many times strings occur

Classify classify strings based on training data or built-in classifiers

Nearest find the closest-matching string from a list

FindClusters find clusters in string data

EditDistance edit or Levenshtein distance

SequenceAlignment find matching sequences in strings

Hash find a hash code using a variety of schemes

DictionaryLookup look up words in English and other dictionaries

WordData find semantic, grammatical, morphological, etc. properties of words

Interpreter attempt to interpret strings in a wide variety of types

SemanticInterpretation  ▪  SemanticImportString

TextRecognize do OCR on text in an image
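
A few of the functions listed above might be combined as in the following sketch. The input strings and the candidate word list are made-up examples chosen only for illustration.

```wolfram
(* StringExpression (written with ~~) builds a symbolic string pattern:
   here, one or more digit characters followed by "px" *)
StringCases["width 120px, height 80px", DigitCharacter .. ~~ "px"]
(* {"120px", "80px"} *)

(* EditDistance counts the insertions, deletions, and substitutions
   needed to turn one string into another *)
EditDistance["kitten", "sitting"]
(* 3 *)

(* Nearest picks the closest-matching string from a list;
   the candidate list here is hypothetical *)
Nearest[{"apple", "maple", "orange"}, "appel"]
```

For strings, Nearest relies on an edit-distance measure by default, so it can serve as a simple spelling-correction step once a list of valid words is available.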