Text Analysis
The Wolfram Language includes increasingly sophisticated tools for analyzing and visualizing text, both structurally and semantically.
Sources of Text
Import ▪ ExampleData ▪ WikipediaData
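A minimal sketch of getting text from each of these sources. The variable names are illustrative, the file name "notes.txt" is a hypothetical placeholder, and WikipediaData needs network access.

```wolfram
(* Built-in sample text: Alice in Wonderland as a single string *)
alice = ExampleData[{"Text", "AliceInWonderland"}];

(* Plain text of a Wikipedia article (requires a network connection) *)
moon = WikipediaData["Moon"];

(* Plain text from a local file; "notes.txt" is a hypothetical path *)
notes = Import["notes.txt", "Text"];

(* Quick sanity check on what was loaded *)
StringLength[alice]
```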
WordCount — total number of words in a text
WordCounts — count of words or n-grams
WordFrequency — frequency of words or n-grams
LetterCounts ▪ CharacterCounts
Sort — sort into alphabetical order
Classify — classify strings based on training data or built-in classifiers
Nearest — find the closest-matching string from a list
FindClusters — find clusters in string data
ClusteringTree ▪ ClusteringComponents ▪ ClusterClassify
Dendrogram — hierarchical plot of similarities
EditDistance — edit or Levenshtein distance
LanguageIdentify — identify what language a text is in
DictionaryLookup ▪ WordData ▪ WordStem ▪ PartOfSpeech ▪ Transliterate
WordFrequencyData — data on word frequencies in typical current and historical text
SemanticImport — import text with semantic understanding
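The counting, distance, and clustering functions above can be tried on a short string. This is a sketch on made-up sample data; the variable names text and words are illustrative, and outputs other than the edit distance are left unstated because they depend on the input and, for LanguageIdentify, on the built-in classifier.

```wolfram
text = "the quick brown fox jumps over the lazy dog and the fox runs away";

WordCount[text]              (* total number of words *)
WordCounts[text]             (* association of word -> count *)
WordCounts[text, 2]          (* counts of 2-grams, i.e. pairs of consecutive words *)
WordFrequency[text, "the"]   (* fraction of the words that are "the" *)
LetterCounts[text]           (* association of letter -> count *)

(* Edit (Levenshtein) distance: three single-character edits *)
EditDistance["kitten", "sitting"]   (* 3 *)

(* String-based nearest match and clustering on a small word list *)
words = {"cat", "cart", "card", "dog", "dot", "dig"};
Nearest[words, "cot"]
FindClusters[words]
Dendrogram[words]

(* Language identification and word stemming *)
LanguageIdentify["Il fait beau aujourd'hui."]
WordStem["running"]
```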
LLM-Based Analysis »
LLMFunction — apply LLM-based operations specified by natural language to text
LLMResourceFunction — apply LLM-based operations from the Wolfram Prompt Repository
LLMExampleFunction ▪ LLMPrompt ▪ LLMSynthesize ▪ LLMTool
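A hedged sketch of LLM-based analysis, assuming an LLM service has already been configured for the session. The prompt wording and the name summarize are illustrative, and results vary between calls and models.

```wolfram
(* A reusable function defined by a natural-language prompt; `` marks the argument slot *)
summarize = LLMFunction["Summarize the following text in one sentence: ``"];

summarize["The Wolfram Language includes tools for analyzing and visualizing text, both structurally and semantically."]

(* One-off text generation from a prompt *)
LLMSynthesize["Give three synonyms for the word 'analyze'."]
```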
Text Visualization
Style — style text with color, font, or size
WordCloud — generate a word cloud from word frequencies or weights
Snippet — extract a snippet of text
StringPartition — partition a string into equal-size blocks
InsertLinebreaks — break a string onto multiple lines
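A short sketch of the visualization and layout functions above, using the built-in Alice in Wonderland sample as assumed input. The widths and sizes are arbitrary choices for illustration.

```wolfram
text = ExampleData[{"Text", "AliceInWonderland"}];

(* Word cloud after removing common stopwords such as "the" and "and" *)
WordCloud[DeleteStopwords[text]]

(* First two lines of the text *)
Snippet[text, 2]

(* Split a string into blocks of 3 characters *)
StringPartition["abcdefghij", 3]

(* Wrap a long string so that lines are at most 40 characters wide *)
InsertLinebreaks[StringTake[text, 300], 40]

(* Style a piece of text with color, weight, and size *)
Style["important", Red, Bold, 18]
```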
Text Parsing
TextStructure — parse text into its grammatical structure
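As an illustration, TextStructure can be applied to a single sentence. The sentence is made up, the "ConstituentGraphs" form is one of several output forms, and the first use may download language resources.

```wolfram
(* Nested grammatical structure of a sentence *)
TextStructure["The cat sat on the mat."]

(* The same parse drawn as a constituent graph *)
TextStructure["The cat sat on the mat.", "ConstituentGraphs"]
```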
Text Comparison »
SequenceAlignment ▪ Diff ▪ Diff3 ▪ LongestCommonSubsequence ▪ DistanceMatrix ▪ ...
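A small sketch of string comparison with some of these functions, on made-up inputs. Passing EditDistance as the DistanceFunction is an explicit choice made here for clarity.

```wolfram
(* Aligned common and differing segments of two strings *)
SequenceAlignment["kitten", "sitting"]

(* Longest contiguous run of characters shared by both strings *)
LongestCommonSubsequence["analysis", "analytic"]

(* Pairwise edit distances between strings *)
DistanceMatrix[{"cat", "cart", "dog"}, DistanceFunction -> EditDistance]
```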
Content Analysis
TextContents — generate a dataset of identified elements in text
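For example, TextContents can be applied directly to a sentence. The sample sentence is made up, and the identified elements depend on the built-in models, which may be downloaded on first use.

```wolfram
(* Dataset of the entities and other elements recognized in the text *)
TextContents["Marie Curie was born in Warsaw in 1867 and later moved to Paris."]
```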
Content Extraction
TextCases — extract symbolically specified elements
Containing ▪ Alternatives ▪ Entity
TextPosition — positions of symbolically specified elements
FindTextualAnswer — attempt to find answers to questions from text
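A sketch of content extraction on a made-up sentence. The content types "City", "Person", and "Noun" are used as examples, and the results depend on the built-in models, so no outputs are shown.

```wolfram
text = "Marie Curie was born in Warsaw in 1867 and later moved to Paris.";

(* Substrings recognized as cities *)
TextCases[text, "City"]

(* Either of two content types, combined with Alternatives *)
TextCases[text, "City" | "Person"]

(* Grammatical units can be extracted the same way *)
TextCases[text, "Noun"]

(* Character positions of the recognized cities *)
TextPosition[text, "City"]

(* Attempt to answer a question from the text *)
FindTextualAnswer[text, "Where was Marie Curie born?"]
```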
Text Normalization »
TextWords ▪ TextSentences ▪ DeleteStopwords ▪ RemoveDiacritics ▪ ...
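A minimal normalization sketch on a made-up two-sentence string; the variable name text is illustrative.

```wolfram
text = "The café was busy. We ordered coffee and a croissant.";

TextSentences[text]                (* list of sentences *)
TextWords[text]                    (* list of words, without punctuation *)
DeleteStopwords[TextWords[text]]   (* drop common words such as "the" and "and" *)
RemoveDiacritics["café"]           (* "cafe" *)
```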