Mathematica provides flexible capabilities for processing large volumes of textual data. Most often, data represented as a string is converted to lists or other constructs, which can then be manipulated using Mathematica's symbolic language constructs.
Import — import data from files or the web
"Text", "PDF", "TeX", "HTML" — pick out plain text, table data, etc.
FindList — search files for records containing particular strings
StringSplit — split a string into words, sentences, etc.
StringCount — count occurrences of words, etc.
StringCases — find instances of a string pattern
StringExpression — match symbolic string patterns
Sort — sort into alphabetical order
Nearest — find the closest-matching string from a list
FindClusters — find clusters in string data
EditDistance — edit or Levenshtein distance
Hash — find a hash code using a variety of schemes
DictionaryLookup — look up words in an English dictionary
WordData — find semantic, grammatical, morphological, and other properties of words
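The following is a minimal sketch of how several of the functions listed above might be combined. The sample sentence and the variable names text and words are illustrative assumptions, not part of the original listing. Output is shown only where it follows directly from the input.

```mathematica
(* Illustrative input string; an assumption made for this sketch *)
text = "the quick brown fox jumps over the lazy dog and the fox runs away";

(* Convert the string to a list of words by splitting on whitespace *)
words = StringSplit[text];

(* Count occurrences of a substring; "fox" appears twice in text *)
StringCount[text, "fox"]   (* 2 *)

(* Find whole words beginning with "t" using a symbolic string pattern *)
StringCases[text, WordBoundary ~~ "t" ~~ LetterCharacter .. ~~ WordBoundary]

(* Sort the distinct words into alphabetical order *)
Sort[DeleteDuplicates[words]]

(* Edit (Levenshtein) distance between two strings *)
EditDistance["kitten", "sitting"]   (* 3 *)

(* Closest-matching string from a list; for strings, Nearest uses edit distance by default *)
Nearest[{"quick", "brown", "lazy"}, "quack"]
```

Once the string has been split into a list, general list functions such as Sort and DeleteDuplicates apply directly, which is the string-to-list workflow described in the introduction.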