SpeechRecognize

SpeechRecognize[audio]

recognizes speech in audio and returns it as a string.

SpeechRecognize[audio,level]

returns a list of strings at the specified structural level.

SpeechRecognize[audio,level,prop]

returns prop for text at the given level.

Details and Options

  • Speech recognition aims to convert a spoken audio signal to text. It is also known as speech-to-text and is typically used in voice-enabled human-machine interactions and digital personal assistants.
  • SpeechRecognize[audio] returns all recognized speech in audio as a single string.
  • Structural elements specified in level include:
    Automatic     speech found in the whole audio signal (default)
    "Segment"     a list of transcription segments
    "Sentence"    a list of sentences
    "Word"        a list of words
  • The property prop can be one of the following:
    "Audio"            trimmed audio containing the recognized text
    "Confidence"       strength of the recognized text
    "Interval"         interval containing the text
    "SubtitleRules"    a list of time intervals and texts
    "Text"             recognized text (default)
    {prop1,prop2,…}    a list of properties
  • The following options can be given:
    Language             Automatic             the language to recognize
    Masking              All                   interval of interest
    Method               Automatic             the method to use
    PerformanceGoal      $PerformanceGoal      aspects of performance to try to optimize
    ProgressReporting    $ProgressReporting    whether to report the progress of the computation
    TargetDevice         "CPU"                 the device on which to perform recognition
  • Use Language->lang1->lang2 to recognize speech assumed to be in language lang1 and return translated text in language lang2.
  • By default, speech in the whole signal is recognized. Use Masking->{int1,int2,…} to limit the recognition to the intervals int1, int2, ….
  • Possible settings for Method are:
    Automatic          automatic method
    "GoogleSpeech"     uses Google speech-to-text
    "NeuralNetwork"    uses built-in neural networks
    "OpenAI"           uses OpenAI speech-to-text
  • By default, if a method returns non-speech tokens (e.g. [applause]), they are returned in the result. Use Method->{method,"NonSpeechReplacement"->replacements} to specify different replacements. Use "NonSpeechReplacement"->"" to remove them.
  • SpeechRecognize works for English speech as well as various other languages, such as Chinese, Dutch, French, Japanese and Spanish.
  • SpeechRecognize uses machine learning. Its methods, training sets and biases included therein may change and yield varied results in different versions of the Wolfram Language.
  • SpeechRecognize may download resources that will be stored in your local object store at $LocalBase, and can be listed using LocalObjects[] and removed using ResourceRemove.
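
The option forms described above can be sketched as follows. This is a minimal illustration, not the reference example: the helper names recognizeTranslated and recognizeSpeechOnly are hypothetical, the test signal is generated with SpeechSynthesize, and which language pairs can be translated presumably depends on the method in use.

```wolfram
(* Hypothetical helper: recognize speech assumed to be in language `from`
   and return the text translated into language `to`. *)
recognizeTranslated[audio_, from_String, to_String] :=
  SpeechRecognize[audio, Language -> (from -> to)]

(* Hypothetical helper: remove non-speech tokens such as [applause]
   by replacing them with the empty string. *)
recognizeSpeechOnly[audio_] :=
  SpeechRecognize[audio,
    Method -> {"NeuralNetwork", "NonSpeechReplacement" -> ""}]

(* Example call on a synthesized test signal (assumption: a system voice is available). *)
audio = SpeechSynthesize["Thank you all for coming today."];
recognizeSpeechOnly[audio]
```

A call such as recognizeTranslated[clip, "French", "English"] would follow the Language->lang1->lang2 form, where clip stands for any audio object containing French speech.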

Examples

Basic Examples  (2)

Recognize speech in an audio signal:
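
A minimal sketch of this call, assuming a test signal generated with SpeechSynthesize in place of a real recording (the sentence and the variable name audio are illustrative):

```wolfram
(* Create a short spoken test signal; any Audio object containing speech would do. *)
audio = SpeechSynthesize["The quick brown fox jumps over the lazy dog."];

(* Return all recognized speech as a single string. *)
SpeechRecognize[audio]
```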

Recognize speech from a recording:

Scope  (4)

Basic Uses  (2)

Recognize speech in a short audio track:

Recognize speech in an audio track of a video file:

Recognize speech in a non-English language:

Classify the language from the recognized text:
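
One way this could look, assuming a synthesized English test signal and using LanguageIdentify on the transcription (the variable names are illustrative):

```wolfram
(* Synthesized stand-in for a recording. *)
audio = SpeechSynthesize["Speech recognition converts spoken audio into text."];

(* Transcribe, then identify the language of the resulting string. *)
text = SpeechRecognize[audio];
LanguageIdentify[text]
```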

Classify the language from the original audio:

Level Specification  (1)

By default, all recognized text is returned as one string:

Extract a list of recognized sentences:

Extract a list of words:

Extract a list of segments, typically used for splitting text for subtitles:
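
The three level specifications above can be sketched on one signal. This assumes a synthesized two-sentence test signal; how the text is split into segments depends on the recognition method.

```wolfram
(* Two short sentences, so that the levels give visibly different results. *)
audio = SpeechSynthesize["This is the first sentence. Here is a second one."];

SpeechRecognize[audio, "Sentence"]  (* list of sentences *)
SpeechRecognize[audio, "Word"]      (* list of words *)
SpeechRecognize[audio, "Segment"]   (* list of transcription segments *)
```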

Properties  (1)

By default, recognized speech is returned as a string or as lists of strings:

Return the speech interval, corresponding chunk of the audio and recognition strength:
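
A sketch of a property request, assuming a synthesized test signal; the property names are the ones listed under Details and Options:

```wolfram
audio = SpeechSynthesize["This is the first sentence. Here is a second one."];

(* For each sentence, return its time interval, the trimmed audio and the confidence. *)
SpeechRecognize[audio, "Sentence", {"Interval", "Audio", "Confidence"}]

(* A single property can also be requested, here the interval of each word. *)
SpeechRecognize[audio, "Word", "Interval"]
```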

Options  (3)

Masking  (1)

Use the Masking option to recognize parts of a signal:
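
A minimal sketch, assuming a synthesized test signal and assuming that an interval is given as a {start, end} pair in seconds:

```wolfram
audio = SpeechSynthesize["One two three four five six seven eight nine ten."];

(* Recognize only the first two seconds (interval units are an assumption). *)
SpeechRecognize[audio, Masking -> {{0, 2}}]
```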

Method  (1)

By default, a local model is used for speech recognition:

Use OpenAI speech recognition:

Use Google speech-to-text for recognition:
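
The method settings can be compared on the same signal. This sketch assumes a synthesized test signal; the two external services presumably require a network connection and valid credentials for the respective provider.

```wolfram
audio = SpeechSynthesize["Testing different recognition methods."];

(* Built-in neural networks, evaluated locally. *)
SpeechRecognize[audio, Method -> "NeuralNetwork"]

(* External services (assumption: an account with each provider is configured). *)
SpeechRecognize[audio, Method -> "OpenAI"]
SpeechRecognize[audio, Method -> "GoogleSpeech"]
```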

PerformanceGoal  (1)

By default, a medium-speed model with moderate quality is used:

Get a result quickly:

Get a higher-quality result:

Get a result that balances speed and quality:
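
A sketch comparing the "Speed" and "Quality" settings, assuming a synthesized test signal. AbsoluteTiming is used only to make the trade-off visible; the first call with each setting may also include the time needed to download a model.

```wolfram
audio = SpeechSynthesize["Performance goals trade accuracy for speed."];

(* Each call returns {elapsed seconds, recognized text}. *)
AbsoluteTiming[SpeechRecognize[audio, PerformanceGoal -> "Speed"]]
AbsoluteTiming[SpeechRecognize[audio, PerformanceGoal -> "Quality"]]
```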

Applications  (4)

Use AudioIntervals to select which parts of the signal to recognize:
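
A minimal sketch, assuming a synthesized test signal and the default behavior of AudioIntervals, which finds the audible intervals of the signal:

```wolfram
audio = SpeechSynthesize["Only the audible parts need to be transcribed."];

(* Find the audible intervals, then restrict recognition to them. *)
ints = AudioIntervals[audio];
SpeechRecognize[audio, Masking -> ints]
```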

Interpret a spoken city:

Show the recognized city on the map:
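
The two steps above could be sketched as follows, assuming a synthesized one-word signal and an available network connection for Interpreter and GeoGraphics. The trimming step is an assumption: recognized text may carry surrounding whitespace or punctuation that would interfere with interpretation.

```wolfram
audio = SpeechSynthesize["Paris"];

(* Strip whitespace and punctuation from both ends of the recognized text. *)
name = StringTrim[SpeechRecognize[audio],
   (WhitespaceCharacter | PunctuationCharacter) ..];

(* Interpret the string as a city entity; this gives a Failure object
   if the text cannot be interpreted. *)
city = Interpreter["City"][name]

(* Mark the interpreted city on a map. *)
GeoGraphics[GeoMarker[city]]
```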

Find the answer from a spoken question in a text:
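
A sketch of this idea, assuming an illustrative reference text and a synthesized spoken question; FindTextualAnswer looks for the part of the text that best answers the recognized question.

```wolfram
(* Illustrative text to search; any string would do. *)
text = "The Eiffel Tower is located in Paris. It was completed in 1889.";

(* Transcribe a spoken question (synthesized here as a stand-in for a recording). *)
question = SpeechRecognize[SpeechSynthesize["When was the Eiffel Tower completed?"]];

FindTextualAnswer[text, question]
```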

Build an automatic assistant based on Wolfram|Alpha:
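
One possible shape for such an assistant, as a sketch only: the function name assistant is hypothetical, a network connection to Wolfram|Alpha is assumed, and the query may not always produce a spoken result.

```wolfram
(* Hypothetical helper: transcribe a spoken query, send it to Wolfram|Alpha
   and speak the answer when one is returned as a string. *)
assistant[audio_] :=
  Module[{query, answer},
    query = SpeechRecognize[audio];
    answer = WolframAlpha[query, "SpokenResult"];
    If[StringQ[answer], SpeechSynthesize[answer], answer]
  ]

(* Example call with a synthesized question. *)
assistant[SpeechSynthesize["What is the distance from the Earth to the Moon?"]]
```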

Wolfram Research (2019), SpeechRecognize, Wolfram Language function, https://reference.wolfram.com/language/ref/SpeechRecognize.html (updated 2024).
