"GoogleSpeech" (Service Connection)

Use Google Text-to-Speech and Speech-to-Text APIs with the Wolfram Language.

Connecting & Authenticating

ServiceConnect["GoogleSpeech"] creates a connection to the Google Speech-to-Text and Text-to-Speech APIs. If a previously saved connection can be found, it will be used; otherwise, a new authentication request will be launched.

Use of this connection requires internet access and a Google API account.

Requests

ServiceExecute["GoogleSpeech","request",params] sends a request to either the Google Speech-to-Text or the Google Text-to-Speech API, using parameters params. The possible requests are listed below.
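As a minimal sketch of the general call shape, the request name is given as a string and the parameters as a list of rules. The input text here is only an illustration, and the call assumes a working connection and valid credentials:

```wolfram
(* service name, request name, then a list of parameter rules *)
ServiceExecute["GoogleSpeech", "Synthesize",
  {"Input" -> "Hello from the Wolfram Language."}]
```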

Synthesize Audio from Text

Request:

"ListVoices" — returns a list of available voice styles

Parameters:

Language	All	restrict the query to voices able to synthesize a given language

Request:

"Synthesize" — returns speech synthesized from text

Parameters:

"Input"	(required)	text to synthesize
"Voice"	Automatic	name of the synthesis voice
Language	Automatic	language of the synthesis voice
"Pitch"	Automatic	semitone deviation from the native voice pitch
"Rate"	Automatic	factor by which to change the native voice speed
AudioEncoding	Automatic	output audio encoding
GeneratedAssetLocation	$GeneratedAssetLocation	storage location of the synthesized audio
GeneratedAssetFormat	Automatic	output format of the synthesized audio
"EffectsProfileID"	Automatic	post-processing effect name applied to speech

Recognize Text from Audio

Request:

"Recognize" — returns text transcribed from audio

Parameters:

"Input"	(required)	audio to transcribe
Language	"English"	language(s) of the contained speech
"ChannelRecognition"	False	whether to transcribe each channel separately
MaxItems	1	maximum number of hypotheses to return
"ProfanityFilter"	False	whether to attempt to replace profanities
"SpeechContexts"	{}	phrase hints to assist transcription
"WordTimeOffsets"	True	return word time offsets with the result
"WordConfidence"	False	return word confidence values with the result
"Punctuation"	True	include punctuation in the transcription
"SpokenPunctuation"	False	replace spoken punctuation with the corresponding ASCII characters
"SpokenEmojis"	False	replace spoken emojis with the corresponding Unicode characters
"SpeakerDiarization"	False	tag distinct speakers in the result
"Model"	Automatic	specify a model to use for the request
MetaInformation	None	metadata describing the input audio

Parameter Details

Possible values for "Voice" can be retrieved using the "ListVoices" request.

Possible values for "Rate" are real numbers representing a factor (1 is the natural rate).

Possible values for "Pitch" are real numbers or quantities representing semitones (0 is the natural pitch).

"SpeakerDiarization" accepts the number of speakers to detect, given as {max} or {min,max}.

Possible settings for "SpeechContexts" include:

	str->w	give weight w to the string str
	{str₁->w₁,str₂->w₂,…}	give weight wᵢ to the string strᵢ
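A hedged sketch of passing phrase hints: the phrases and weights below are illustrative only, and the range of meaningful weights is an assumption that should be checked against the Google API documentation. The input audio is produced locally with SpeechSynthesize so that the example is self-contained:

```wolfram
(* local test audio; any Audio object containing speech should work *)
audio = SpeechSynthesize["Open the Wolfram notebook."];

(* hypothetical phrase hints with illustrative weights *)
ServiceExecute["GoogleSpeech", "Recognize",
  {"Input" -> audio,
   "SpeechContexts" -> {"Wolfram" -> 10, "notebook" -> 5}}]
```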

Examples of possible settings for "EffectsProfileID" include:

	"large-automotive-class-device"	optimized for car speakers
	"small-bluetooth-speaker-class-device"	optimized for small home speakers

Examples of possible settings for "Model" include:

	"latest_long"	optimized for long-form content
	"latest_short"	optimized for short-form content
	"command_and_search"	optimized for short queries
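For instance, a short utterance might be transcribed with the short-form model. This is a sketch that uses one of the model names listed above; the test audio is generated locally and is only an illustration:

```wolfram
(* short local test utterance *)
audio = SpeechSynthesize["Turn on the lights."];

(* request the model described above as optimized for short-form content *)
ServiceExecute["GoogleSpeech", "Recognize",
  {"Input" -> audio, "Model" -> "latest_short"}]
```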

Examples


Basic Examples (1)

Connect to Google speech service:
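A minimal sketch of this step. The variable name googleSpeech is an arbitrary choice, and the call will launch an authentication request if no saved connection is found:

```wolfram
(* returns a ServiceObject that can be reused in later requests *)
googleSpeech = ServiceConnect["GoogleSpeech"]
```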

Perform text-to-speech:
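A minimal sketch, assuming valid credentials. The input sentence is illustrative, and the connection is created inline so that the snippet stands on its own:

```wolfram
googleSpeech = ServiceConnect["GoogleSpeech"];

(* "Input" is the only required parameter of "Synthesize" *)
ServiceExecute[googleSpeech, "Synthesize",
  {"Input" -> "Hello, world."}]
```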

Perform speech-to-text:
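A minimal sketch, assuming that "Input" accepts an Audio object. The test audio is generated locally with the built-in SpeechSynthesize so that no external file is needed:

```wolfram
(* local test audio containing speech *)
audio = SpeechSynthesize["The quick brown fox jumps over the lazy dog."];

(* "Input" is the only required parameter of "Recognize" *)
ServiceExecute["GoogleSpeech", "Recognize", {"Input" -> audio}]
```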

Scope (2)

Speech Synthesis (1)

Synthesize audio from text:

Synthesize text in a different language. Setting "Language" to Automatic will infer the language from the input text, or a particular language can be specified. The service will attempt to select a voice style with the requested language:

Use an explicit language:
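A sketch of this case. The language is given as the string "French" by analogy with the "English" default of the "Recognize" request; the exact set of accepted language specifications is an assumption:

```wolfram
(* Language is given as a symbol, as in the parameter table above *)
ServiceExecute["GoogleSpeech", "Synthesize",
  {"Input" -> "Bonjour tout le monde.", Language -> "French"}]
```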

List available voice styles:
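A sketch of both forms of the request. The second call restricts the result to one language; the "French" value is illustrative, and the structure of the returned list is not shown because it depends on the service:

```wolfram
(* all voices; Language defaults to All *)
ServiceExecute["GoogleSpeech", "ListVoices"]

(* only voices able to synthesize the given language *)
ServiceExecute["GoogleSpeech", "ListVoices", {Language -> "French"}]
```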

Synthesize speech using a particular voice:
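A hedged sketch: the voice name below follows Google's naming scheme and is an assumption. Use a name actually returned by the "ListVoices" request for your account:

```wolfram
(* hypothetical voice name; replace with a value from "ListVoices" *)
ServiceExecute["GoogleSpeech", "Synthesize",
  {"Input" -> "This sentence uses a specific voice.",
   "Voice" -> "en-US-Wavenet-D"}]
```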

Make the speech faster and lower in pitch:
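A sketch using illustrative values: a "Rate" above 1 speeds up the speech relative to the natural rate, and a negative "Pitch" lowers it by that many semitones. The specific numbers are arbitrary choices:

```wolfram
ServiceExecute["GoogleSpeech", "Synthesize",
  {"Input" -> "This is spoken faster and at a lower pitch.",
   "Rate" -> 1.5,   (* 1 is the natural rate *)
   "Pitch" -> -3}]  (* 0 is the natural pitch *)
```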

Speech Recognition (1)

Transcribe text from audio containing speech:

By default, everything from the API response is returned, including information about recognized words:

Return multiple guesses of the transcription:
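A sketch that raises MaxItems above its default of 1. The value 3 is arbitrary, the service may return fewer hypotheses than requested, and the test audio is generated locally:

```wolfram
audio = SpeechSynthesize["I scream for ice cream."];

(* ask for up to three transcription hypotheses *)
ServiceExecute["GoogleSpeech", "Recognize",
  {"Input" -> audio, MaxItems -> 3}]
```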

Separate different speakers from a recording:

Specify the minimum and maximum number of speakers:
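A sketch of both diarization settings described above. The file name "conversation.wav" is hypothetical and stands for any recording with several speakers; the speaker counts are illustrative:

```wolfram
(* hypothetical multi-speaker recording *)
audio = Import["conversation.wav"];

(* at most two speakers *)
ServiceExecute["GoogleSpeech", "Recognize",
  {"Input" -> audio, "SpeakerDiarization" -> {2}}]

(* between two and four speakers *)
ServiceExecute["GoogleSpeech", "Recognize",
  {"Input" -> audio, "SpeakerDiarization" -> {2, 4}}]
```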

Display labeled words in a Dataset. The API currently returns speaker labels in the second result:

