SpeechRecognize

Updated in 14.1 [Experimental]

SpeechRecognize[audio]

recognizes speech in audio and returns it as a string.

SpeechRecognize[audio,level]

returns a list of strings at the specified structural level.

SpeechRecognize[audio,level,prop]

returns prop for text at the given level.

Details and Options

Speech recognition aims to convert a spoken audio signal to text. It is also known as speech-to-text and is typically used in voice-enabled human-machine interactions and digital personal assistants.
SpeechRecognize[audio] returns all recognized speech in audio as a single string.

Structural elements specified in level include:

	Automatic	speech found in the whole audio signal (default)
	"Segment"	a list of transcription segments
	"Sentence"	a list of sentences
	"Word"	a list of words

The property prop can be one of the following:

	"Audio"	trimmed audio containing the recognized text
	"Confidence"	strength of the recognized text
	"Interval"	interval containing the text
	"SubtitleRules"	a list of time intervals and texts
	"Text"	recognized text (default)
	{prop₁,prop₂,…}	a list of properties
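
As a minimal sketch of how level and prop combine, the following uses a sample recording from ExampleData. The choice of recording is an assumption made for illustration; any Audio object containing speech should behave similarly.

```wolfram
(* sample recording chosen for illustration; any Audio object with speech can be used *)
audio = ExampleData[{"Audio", "MaleVoice"}];

(* one string per recognized sentence *)
SpeechRecognize[audio, "Sentence"]

(* one string per recognized word *)
SpeechRecognize[audio, "Word"]

(* for each word: the text, its time interval and the recognition confidence *)
SpeechRecognize[audio, "Word", {"Text", "Interval", "Confidence"}]
```

Requesting a list of properties returns one group of values per element at the chosen level, so the last call yields one entry per word.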

The following options can be given:

Language	Automatic	the language to recognize
Masking	All	interval of interest
Method	Automatic	the method to use
PerformanceGoal	$PerformanceGoal	aspects of performance to try to optimize
ProgressReporting	$ProgressReporting	whether to report the progress of the computation
TargetDevice	"CPU"	the device on which to perform recognition

Use Languagelang₁lang₂ to recognize speech assumed to be in language lang₁ and return translated text in language lang₂.
By default, speech in the whole signal is recognized. Use Masking->{int₁,int₂,…} to limit the recognition to intervals int_i.
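
The Language and Masking settings above might be used as in the following sketch. The file name "speech-fr.wav" is hypothetical and stands for any recording of French speech, and the intervals are assumed to be {start, end} pairs in seconds.

```wolfram
(* hypothetical file containing French speech; replace with your own recording *)
audioFr = Import["speech-fr.wav"];

(* transcribe, treating the speech as French *)
SpeechRecognize[audioFr, Language -> "French"]

(* treat the speech as French and return English text *)
SpeechRecognize[audioFr, Language -> "French" -> "English"]

(* restrict recognition to two intervals, assumed to be {start, end} in seconds *)
SpeechRecognize[audioFr, Masking -> {{0, 2}, {5, 8}}]
```
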
Possible settings for Method are:

	Automatic	automatic method
	"GoogleSpeech"	uses Google speech-to-text
	"NeuralNetwork"	uses built-in neural networks
	"OpenAI"	uses OpenAI speech-to-text

By default, if a method returns non-speech tokens (e.g. [applause]), they are returned in the result. Use Method->{method,"NonSpeechReplacement"->replacements} to specify different replacements. Use "NonSpeechReplacement"->"" to remove them.
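
A short sketch of the Method settings listed above. The sample recording is an assumption for illustration, and the "OpenAI" call assumes that an account and API key for that service have already been configured.

```wolfram
(* sample recording chosen for illustration *)
audio = ExampleData[{"Audio", "MaleVoice"}];

(* built-in neural network, evaluated locally *)
SpeechRecognize[audio, Method -> "NeuralNetwork"]

(* external service; assumes credentials for OpenAI are already set up *)
SpeechRecognize[audio, Method -> "OpenAI"]

(* drop non-speech tokens such as [applause] from the result *)
SpeechRecognize[audio, Method -> {"NeuralNetwork", "NonSpeechReplacement" -> ""}]
```
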
SpeechRecognize works for English speech as well as various other languages, such as Chinese, Dutch, French, Japanese and Spanish.
SpeechRecognize uses machine learning. Its methods, training sets and biases included therein may change and yield varied results in different versions of the Wolfram Language.
SpeechRecognize may download resources that will be stored in your local object store at $LocalBase, and can be listed using LocalObjects[] and removed using ResourceRemove.

Examples

Basic Examples (2)

Recognize speech in an audio signal:
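
A minimal sketch of this call, assuming a sample recording from ExampleData stands in for the audio signal:

```wolfram
(* sample recording chosen for illustration; any Audio object with speech can be used *)
audio = ExampleData[{"Audio", "MaleVoice"}];

(* returns all recognized speech as a single string *)
SpeechRecognize[audio]
```

The first evaluation may take longer than later ones if resources need to be downloaded.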

Recognize speech from a recording:

Scope (4)

Basic Uses (2)

Recognize speech in a short audio track:

Recognize speech in an audio track of a video file:

Recognize speech in a non-English language:

Classify the language from the recognized text:

Classify the language from the original audio:

Level Specification (1)

By default, all recognized text is returned as one string:

Extract a list of recognized sentences:

Extract a list of words:

Extract a list of segments, typically used for splitting text for subtitles:

Properties (1)

By default, recognized speech is returned as a string or as lists of strings:

Return the speech interval, corresponding chunk of the audio and recognition strength:

Options (3)

Masking (1)

Use the Masking option to recognize parts of a signal:

Method (1)

By default, a local model is used for speech recognition:

Use OpenAI speech recognition:

Use GoogleSpeech speech recognition:

PerformanceGoal (1)

By default, a medium-speed model with moderate quality is used:

Get the result fast:

Get the higher-quality result:

A balanced speed and quality result:
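
The settings in this subsection can be compared side by side. The sketch below assumes the standard PerformanceGoal values "Speed" and "Quality", with Automatic standing for the default; it is not stated here which setting produces the balanced result.

```wolfram
(* sample recording chosen for illustration *)
audio = ExampleData[{"Audio", "MaleVoice"}];

(* for each setting, return {elapsed seconds, recognized text} *)
Table[
  goal -> AbsoluteTiming[SpeechRecognize[audio, PerformanceGoal -> goal]],
  {goal, {"Speed", "Quality", Automatic}}
]
```

Timings from a first evaluation may include the download of resources, so repeating the comparison gives a fairer picture.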

Applications (4)

Use AudioIntervals to select which parts of the signal to recognize:
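
One possible form of this workflow is sketched below. It assumes that AudioIntervals, called with no criterion, returns the audible intervals of the signal, and that this list can be passed directly to Masking; the sample recording is chosen for illustration.

```wolfram
(* sample recording chosen for illustration *)
audio = ExampleData[{"Audio", "MaleVoice"}];

(* assumed: with no criterion, AudioIntervals gives the audible intervals *)
intervals = AudioIntervals[audio];

(* recognize speech only inside those intervals *)
SpeechRecognize[audio, Masking -> intervals]
```
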

Interpret a spoken city:

Show the recognized city on the map:
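
A sketch of both steps, interpreting the city and then mapping it. The spoken input is synthesized with SpeechSynthesize so that no recording is needed, and "Boston" is an arbitrary choice for illustration.

```wolfram
(* synthesize a spoken city name; "Boston" is an arbitrary example *)
spoken = SpeechSynthesize["Boston"];

(* transcribe the audio, then interpret the text as a city entity *)
city = Interpreter["City"][SpeechRecognize[spoken]]

(* mark the interpreted city on a map *)
GeoGraphics[GeoMarker[city]]
```

If the recognized text cannot be interpreted as a city, Interpreter returns a Failure object instead of an entity, so the map step should be applied only after checking the result.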

Find the answer from a spoken question in a text:
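
A minimal sketch, assuming a short reference text written here for illustration and a question synthesized with SpeechSynthesize in place of a real recording:

```wolfram
(* short reference text, written purely for illustration *)
text = "The Eiffel Tower is in Paris. It was completed in 1889.";

(* synthesize the question so that no recording is needed *)
question = SpeechSynthesize["When was the Eiffel Tower completed?"];

(* transcribe the question and search the text for its answer *)
FindTextualAnswer[text, SpeechRecognize[question]]
```
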

Build an automatic assistant based on Wolfram|Alpha:
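
One way such an assistant could be sketched is shown below. The helper name assistant is hypothetical, the query is synthesized for illustration, and the call requires network access to Wolfram|Alpha.

```wolfram
(* hypothetical helper: transcribe a spoken query and send it to Wolfram|Alpha *)
assistant[audio_] := WolframAlpha[SpeechRecognize[audio], "Result"]

(* try it on a synthesized question *)
assistant[SpeechSynthesize["What is the population of France?"]]
```

The quality of the answer depends on how accurately the query is transcribed, so short and clearly spoken questions are the safest starting point.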
