WordFrequencyData

WordFrequencyData[word]

gives the frequency of word in typical published English text.

WordFrequencyData[{word1,word2,…}]

gives an association of frequencies of the wordi.

WordFrequencyData[word,"TimeSeries"]

gives a time series for the frequency of word in typical published English text.

WordFrequencyData[word,"TimeSeries",datespec]

gives a time series for dates specified by datespec.

WordFrequencyData[word,"prop"]

gives property prop of the word frequency.

Details and Options

  • WordFrequencyData[word1|word2|…] gives the total frequency of all the wordi combined.
  • WordFrequencyData[word,"Total",datespec] gives the total frequency of word for the dates specified by datespec.
  • By default, WordFrequencyData uses the Google Books English n-gram public dataset.
  • Possible options include:
      IgnoreCase    False        whether to ignore case in word
      Language      "English"    what language of source corpus to use
  • In WordFrequencyData[word,"prop"], possible properties include:
      "Total"                    give total frequencies over a date range
      "TimeSeries"               give a time series of frequencies
      "CaseVariants"             give results for all variants of upper and lower case
      "PartsOfSpeechVariants"    give results for all variants of parts of speech
      {prop1,prop2,…}            give results for combinations of properties
  • Possible date specifications include:
      All                  use all available dates for the specified source corpus
      DateObject[…]        use DateObject
      year                 use specific year
      {yearmin,yearmax}    use year range between yearmin and yearmax
      {{d1,d2,…}}          use explicit dates {d1,d2,…}

Examples

Basic Examples  (4)

Get the frequency of the word "dog" in typical published English text:

Get the typical frequencies of several words:

Compute the ratio between the frequencies of the words "war" and "peace" in published text:

Plot the historical time series for the frequency of the word "economy":
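
The four basic examples above can be sketched as follows. This is a minimal illustration built only from the usage forms listed at the top of the page; the word pair "cat" and "dog" in the second call is an arbitrary choice, and the numeric results depend on the corpus, so none are shown.

```wolfram
(* Frequency of a single word in the default English corpus *)
WordFrequencyData["dog"]

(* Frequencies of several words, returned as an association keyed by word *)
WordFrequencyData[{"cat", "dog"}]

(* Ratio between the frequencies of two words *)
WordFrequencyData["war"]/WordFrequencyData["peace"]

(* Historical time series, plotted directly *)
DateListPlot[WordFrequencyData["economy", "TimeSeries"]]
```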

Scope  (4)

Get the overall frequency of "atlas":

Find the frequency of multiple words at once:

WordFrequencyData accepts a TextElement with a specific "GrammaticalUnit" as input:

Plot the historical time series for the frequency of the word "computer" since 1900:

Generalizations & Extensions  (1)

When Alternatives is used as an input, the result is the total frequency for any of the alternatives:

Alternatives may be used in combination with other properties, such as "TimeSeries":
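
A brief sketch of both uses of Alternatives described above. The words "car" and "automobile" are illustrative choices, not taken from the original page.

```wolfram
(* Total frequency of either word, returned as a single value *)
WordFrequencyData["car" | "automobile"]

(* The same alternatives combined with the "TimeSeries" property *)
DateListPlot[WordFrequencyData["car" | "automobile", "TimeSeries"]]
```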

Options  (6)

IgnoreCase  (1)

IgnoreCase specifies whether lowercase and uppercase variants of a word are counted together. The default value is False:

The value obtained with IgnoreCase->True is usually greater than the default:
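
The comparison can be sketched like this. The word "apple" and the variable names caseSensitive and anyCase are assumptions made for illustration.

```wolfram
(* Default behavior: only the exact casing given is counted *)
caseSensitive = WordFrequencyData["apple"];

(* All casings of the word are counted together *)
anyCase = WordFrequencyData["apple", IgnoreCase -> True];

{caseSensitive, anyCase}
```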

Language  (5)

Find the frequency of a common Spanish word in a Spanish-language text corpus:

Spanish words might also appear in the corpora of other languages, but with a much lower frequency:

A common word in French returns a high frequency value:

Popularity of the word "peace" in Spanish:

The word "Sputnik" in Russian:

Get a time series of the word "Haus" in German between 1900 and now and plot the result:
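
A short sketch of the Language option, assuming the Spanish word "casa" as the common word; the end year 2000 in the last call is an arbitrary cutoff chosen for illustration rather than "now".

```wolfram
(* A common Spanish word looked up in the Spanish corpus *)
WordFrequencyData["casa", Language -> "Spanish"]

(* The same word looked up in the default English corpus *)
WordFrequencyData["casa", Language -> "English"]

(* Time series for a German word over a year range *)
DateListPlot[
 WordFrequencyData["Haus", "TimeSeries", {1900, 2000}, Language -> "German"]]
```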

Properties & Relations  (14)

"CaseVariants"  (3)

A word can have many lower- and uppercase variants:

Getting the frequency of the word with IgnoreCase->True should be equivalent to getting the Total for the previous list:

Get the most popular case variation of "DOS":

When asking for multiple words, the association will contain all variants of each word:
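
The relations above can be sketched as follows. The sketch assumes that "CaseVariants" returns an association from each variant to its frequency, and the name variants is introduced here for illustration. The input is given in lower case.

```wolfram
(* All case variants of the word with their frequencies *)
variants = WordFrequencyData["dos", "CaseVariants"];

(* Summing the variants should match the case-insensitive frequency *)
Total[variants]
WordFrequencyData["dos", IgnoreCase -> True]

(* Most frequent case variant *)
TakeLargest[variants, 1]

(* With several words, each word gets its own set of variants *)
WordFrequencyData[{"dos", "unix"}, "CaseVariants"]
```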

"PartsOfSpeechVariants"  (4)

Calculate the frequency of a word in a specific year for all part-of-speech variants:

Show different usages of the word "nuke" in 1944:

Some words may return many part-of-speech variants:

Combine this property with "CaseVariants":

Combine it with both "CaseVariants" and "TimeSeries":
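
A hedged sketch of these calls, using the property name "PartsOfSpeechVariants" as given in the Details section and the {prop1,prop2,…} form for combinations. The word "tackle" is borrowed from the Neat Examples; the exact shape of the returned associations is not shown because it depends on the data.

```wolfram
(* Part-of-speech variants of a word in a single year *)
WordFrequencyData["nuke", "PartsOfSpeechVariants", 1944]

(* Part-of-speech variants combined with case variants *)
WordFrequencyData["tackle", {"PartsOfSpeechVariants", "CaseVariants"}]

(* The same combination, returned as time series *)
WordFrequencyData["tackle",
 {"PartsOfSpeechVariants", "CaseVariants", "TimeSeries"}]
```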

"TimeSeries"  (2)

Get the frequency of the word "war" throughout the twentieth century:

This can be plotted directly using DateListPlot:

Compare the usage of "peace" and "war" over time:

And compare their usage in another language too:

Plot the ratio of the words "war" and "peace" for both languages:
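
The English-language part of this comparison might look like the sketch below. The names war, peace and ratio are introduced here for illustration, and TimeSeriesThread is used to form the ratio point by point, which is one possible approach rather than necessarily the one used on the original page.

```wolfram
(* Time series for each word over the twentieth century *)
war = WordFrequencyData["war", "TimeSeries", {1900, 2000}];
peace = WordFrequencyData["peace", "TimeSeries", {1900, 2000}];

(* Compare the two series on one plot *)
DateListPlot[{war, peace}, PlotLegends -> {"war", "peace"}]

(* Pointwise ratio of the two series *)
ratio = TimeSeriesThread[First[#]/Last[#] &, {war, peace}];
DateListPlot[ratio]
```

Adding Language -> "Spanish" (or another supported language) to the WordFrequencyData calls, with the translated words, gives the corresponding comparison for another corpus.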

"Total"  (5)

"Total" is the default property:

For a simple date range:

The usage of DateObject objects in the date specification is allowed:

The "Total" can be computed over a specific list of years:

Infinity can be used to specify an unbounded range:
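
The date specifications for "Total" listed in this subsection can be sketched as follows. The word "atlas" is taken from the Scope section; the specific years are arbitrary illustrative values.

```wolfram
(* "Total" is the default property, so these two calls are equivalent *)
WordFrequencyData["atlas"]
WordFrequencyData["atlas", "Total"]

(* Simple year range *)
WordFrequencyData["atlas", "Total", {1950, 2000}]

(* Date given as a DateObject *)
WordFrequencyData["atlas", "Total", DateObject[{1969}]]

(* Explicit list of years: note the double braces *)
WordFrequencyData["atlas", "Total", {{1950, 1960, 1970}}]

(* Range that is unbounded on the upper end *)
WordFrequencyData["atlas", "Total", {1990, Infinity}]
```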

Possible Issues  (1)

ToLowerCase might fail with non-Latin alphabets, so to use the IgnoreCase option or the "CaseVariants" argument, the input should be in lower case:

Neat Examples  (11)

Popularity of the word "dog" and its translations in different languages:

The words "gold" versus "oil" over time:

Frequency of terms for telephone and television over time:

Joining synonyms:

Common diseases:

Sorting day names by popularity:
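
One way to sketch this, assuming the list form returns an association from word to frequency as described at the top of the page. The name days is introduced for illustration.

```wolfram
days = {"Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
   "Saturday", "Sunday"};

(* Sort the association by frequency, most frequent first *)
Sort[WordFrequencyData[days], Greater]
```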

Some words have lost their old orthography:

The word "democracy" became more frequent during the twentieth century:

"Apple" with initial uppercase A became popular after 1980:

The relative frequency of part-of-speech variants may change over time. "Tackle" as a verb and as a noun is a good example:

Regularization of irregular verbs may explain the changes in the part of speech and orthography of some words, such as "burnt" versus "burned":

Evolution of "ustedes" versus "vosotros" in Spanish:

Introduced in 2016 (10.4)