Text Manipulation
The Wolfram Language has uniquely flexible capabilities for processing textual data. It can operate at the level of strings and characters or at the level of words and sentences. It can also operate semantically, through its extensive built-in natural language understanding capabilities as well as its ability to use LLM functionality, including through the Wolfram Prompt Repository.
Text Acquisition
Import — import text from files or the web
"Text", "PDF", "TeX", "HTML" — pick out plaintext, table data, etc.
NotebookImport — import text from a notebook
FindList — search files for records containing particular strings
TextString — convert arbitrary expressions to text
TextRecognize — extract text from images using OCR
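As a minimal sketch of the acquisition functions above: ImportString is not listed in this guide, but it accepts the same format and element names as Import, so it stands in for a file or URL here. The HTML snippet is made-up sample data, and "file.html" is a hypothetical path.

```wolfram
(* Made-up sample data standing in for a file or web page *)
html = "<html><body><h1>Report</h1><p>Sales rose in May.</p></body></html>";

(* Extract plaintext from HTML; Import["file.html", "Plaintext"] would do
   the same for a file on disk ("file.html" is a hypothetical path) *)
plain = ImportString[html, {"HTML", "Plaintext"}]

(* Convert an arbitrary expression to a human-readable string *)
TextString[{1, 2, 3}]
```

TextRecognize follows the same calling style but takes an image rather than a string, so it is left out of this sketch.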
Text Normalization »
ToLowerCase ▪ ToUpperCase ▪ RemoveDiacritics ▪ CharacterEncoding ▪ ...
DeleteStopwords — delete standard stopwords ("the", "and", etc.) from a string
StringSplit — split a string at newlines or other delimiters
StringReplace ▪ StringDelete ▪ StringTrim ▪ ...
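The normalization functions above compose naturally. The following is a sketch on a made-up sample string, showing one plausible order of operations (trim, strip diacritics, lowercase, split, remove stopwords); other orders may suit other data.

```wolfram
(* Made-up sample text with stray whitespace, mixed case, and a diacritic *)
raw = "  Visit the CAFÉ near the Bistro\nand the bakery.  ";

(* Remove leading and trailing whitespace *)
trimmed = StringTrim[raw];

(* Strip diacritics, then lowercase *)
normalized = ToLowerCase[RemoveDiacritics[trimmed]];

(* Split at the newline *)
lines = StringSplit[normalized, "\n"]
(* {"visit the cafe near the bistro", "and the bakery."} *)

(* Remove stopwords from each line; Map is used because DeleteStopwords
   treats a list argument as a list of words, not a list of texts *)
DeleteStopwords /@ lines

(* Delete all punctuation characters *)
StringDelete[normalized, PunctuationCharacter]
```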
Structural Text Manipulation
TextCases — extract symbolically specified elements
TextSentences — extract a list of sentences
TextWords — extract a list of words
SequenceAlignment — find matching sequences in text
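A short sketch of the structural functions above, applied to a made-up two-sentence text. TextCases may need to download language resources the first time it runs, so its result is not shown.

```wolfram
(* Made-up sample text *)
text = "Alice went to Paris. She met Bob there.";

(* Split into sentences *)
sentences = TextSentences[text]
(* {"Alice went to Paris.", "She met Bob there."} *)

(* Split into words *)
words = TextWords[text]

(* Extract elements by symbolic type, here nouns *)
TextCases[text, "Noun"]

(* Align two strings: the result interleaves shared substrings with
   pairs of differing substrings *)
SequenceAlignment["kitten", "sitting"]
```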
Searching & Pattern Matching »
StringExpression — general string pattern
StringMatchQ ▪ StringCases ▪ StringCount ▪ ...
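To illustrate how a StringExpression feeds the matching functions above, here is a sketch that builds a date-like pattern with the ~~ operator and reuses it. The sample string and the name datePattern are made up for this example.

```wolfram
(* Made-up sample text *)
s = "Order 66 shipped on 2024-05-17; order 7 pending.";

(* A string pattern for dates of the form dddd-dd-dd *)
datePattern = Repeated[DigitCharacter, {4}] ~~ "-" ~~
   Repeated[DigitCharacter, {2}] ~~ "-" ~~ Repeated[DigitCharacter, {2}];

(* Extract every substring matching the pattern *)
StringCases[s, datePattern]
(* {"2024-05-17"} *)

(* Test whether an entire string matches the pattern *)
StringMatchQ["2024-05-17", datePattern]
(* True *)

(* Count occurrences of a substring, ignoring case *)
StringCount[s, "order", IgnoreCase -> True]
(* 2 *)

(* Extract all maximal runs of digits *)
StringCases[s, DigitCharacter ..]
(* {"66", "2024", "05", "17", "7"} *)
```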
LLM-Based Text Manipulation »
LLMResourceFunction — apply operations from the Wolfram Prompt Repository
LLMFunction — apply operations specified by natural language descriptions
LLMExampleFunction — apply operations based on examples
LLMSynthesize ▪ LLMPrompt ▪ LLMTool ▪ ...
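The following is only a sketch of how the LLM functions above are typically called. It assumes an LLM service has already been configured (for example with an API key), and the prompt wording, example pairs, and the names summarize and toPlural are made up for illustration. Outputs depend on the model, so none are shown.

```wolfram
(* Assumption: an LLM service is configured; results vary by model *)

(* Define an operation from a natural language description;
   the `` slot is filled by the argument *)
summarize = LLMFunction["Summarize the following text in one sentence: ``"];
summarize["The Wolfram Language can process text at the level of characters, words, and meaning."]

(* Define an operation from input -> output examples *)
toPlural = LLMExampleFunction[{"cat" -> "cats", "mouse" -> "mice", "child" -> "children"}];
toPlural["goose"]

(* Generate free-form text directly from a prompt *)
LLMSynthesize["Write a one-line tagline for a text-processing library."]
```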
Text Analysis »
WordCounts — count occurrences of words and n-grams
LetterCounts ▪ CharacterCounts ▪ WordCount
Classify — classify strings based on training data or built-in classifiers
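A small sketch of the counting functions above on a made-up sentence. With a second argument n, WordCounts counts runs of n consecutive words, and the keys of the result are lists of words.

```wolfram
(* Made-up sample text *)
text = "the cat saw the dog and the dog saw the cat";

(* Total number of words *)
WordCount[text]
(* 11 *)

(* Association of word -> count *)
counts = WordCounts[text];
counts["the"]
(* 4 *)

(* Counts of 2-grams; keys are lists of two words *)
bigrams = WordCounts[text, 2];
bigrams[{"the", "cat"}]
(* 2 *)

(* Association of letter -> count *)
LetterCounts[text]
```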
Natural Language Processing
LanguageIdentify — determine the language of a text
DictionaryLookup — look up words in English and other dictionaries
WordData — find semantic, grammatical, morphological, etc. properties of words
TextStructure — parse text into its grammatical structure
TextContents — generate a dataset of identified elements in text
SpellingCorrectionList — list of spelling suggestions for misspelled words
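A sketch of typical calls to the natural language processing functions above. The inputs are made up, several of these functions rely on data that may be downloaded on first use, and results can differ between versions, so outputs are not shown.

```wolfram
(* Identify the language of a text; the result is a language entity *)
LanguageIdentify["Bonjour tout le monde"]

(* Find dictionary words matching a string pattern *)
DictionaryLookup["wolf" ~~ ___]

(* Look up a property of a word, here its possible parts of speech *)
WordData["bright", "PartsOfSpeech"]

(* Parse a sentence into its grammatical structure *)
TextStructure["The cat sat on the mat."]

(* Suggest corrections for a misspelled word *)
SpellingCorrectionList["recieve"]
```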
Natural Language Understanding »
Interpreter — attempt to interpret strings of a wide variety of types
SemanticInterpretation ▪ SemanticImportString ▪ AmbiguityFunction ▪ ...
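Interpreter takes a type specification and returns a function that can be applied to strings. The sketch below uses made-up inputs and shows that input which cannot be interpreted yields a Failure object rather than an error.

```wolfram
(* Build an interpreter for integers and apply it *)
toInteger = Interpreter["Integer"];
toInteger["42"]
(* 42 *)

(* Uninterpretable input gives a Failure object, detectable with FailureQ *)
FailureQ[toInteger["abc"]]
(* True *)

(* Interpret a date written in ordinary prose form *)
Interpreter["Date"]["May 17, 2024"]
```

Types that refer to real-world entities, along with SemanticInterpretation, may need network access and are omitted here.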
Text Generation »
StringTemplate ▪ StringRiffle ▪ TextString ▪ LLMSynthesize ▪ ...
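A brief sketch of template-based text generation with made-up names and values. StringTemplate fills named slots from an association, and StringRiffle joins list elements with a separator.

```wolfram
(* A template with two named slots *)
greeting = StringTemplate["Hello, `name`! You have `n` new messages."];

(* Fill the slots from an association *)
greeting[<|"name" -> "Ada", "n" -> 3|>]
(* "Hello, Ada! You have 3 new messages." *)

(* Join strings with an explicit separator *)
StringRiffle[{"a", "b", "c"}, ", "]
(* "a, b, c" *)

(* With no separator given, elements are joined with spaces *)
StringRiffle[{"a", "b", "c"}]
(* "a b c" *)
```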