"GoogleSpeech" (Service Connection)

Use Google Text-to-Speech and Speech-to-Text APIs with the Wolfram Language.

Connecting & Authenticating

ServiceConnect["GoogleSpeech"] creates a connection to the Google Speech-to-Text and Text-to-Speech APIs. If a previously saved connection can be found, it will be used; otherwise, a new authentication request will be launched.
Use of this connection requires internet access and a Google API account.


Requests

ServiceExecute["GoogleSpeech","request",params] sends a request to either the Google Speech-to-Text or the Google Text-to-Speech API, using parameters params. The following requests are available.

Synthesize Audio from Text

"ListVoices" returns a list of available voice styles

    Language                All                      restrict the query to voices able to synthesize a given language

"Synthesize" returns speech synthesized from text

    "Input"                 (required)               text to synthesize
    "Voice"                 Automatic                name of the synthesis voice
    Language                Automatic                language of the synthesis voice
    "Pitch"                 Automatic                semitone deviation from the native voice pitch
    "Rate"                  Automatic                factor by which to change the native voice speed
    AudioEncoding           Automatic                output audio encoding
    GeneratedAssetLocation  $GeneratedAssetLocation  storage location of the synthesized audio
    GeneratedAssetFormat    Automatic                output format of the synthesized audio
    "EffectsProfileID"      Automatic                post-processing effect name applied to speech

Recognize Text from Audio

"Recognize" returns text transcribed from audio

    "Input"               (required)  audio to transcribe
    Language              "English"   language(s) of the contained speech
    "ChannelRecognition"  False       whether to transcribe each channel separately
    MaxItems              1           maximum number of hypotheses to return
    "ProfanityFilter"     False       whether to attempt to replace profanities
    "SpeechContexts"      {}          phrase hints to assist transcription
    "WordTimeOffsets"     True        return word time offsets with the result
    "WordConfidence"      False       return word confidence values with the result
    "Punctuation"         True        include punctuation in the transcription
    "SpokenPunctuation"   False       replace spoken punctuation with ASCII character
    "SpokenEmojis"        False       replace spoken emojis with Unicode character
    "SpeakerDiarization"  False       tag distinct speakers in the result
    "Model"               Automatic   specify a model to use for the request
    MetaInformation       None        metadata describing the input audio

Parameter Details

Possible values for "Voice" can be retrieved using the "ListVoices" request.
Possible values for "Rate" are real numbers representing a factor (1 is the natural rate).
Possible values for "Pitch" are real numbers or quantities representing semitones (0 is the natural pitch).
"SpeakerDiarization" accepts the speaker count to detect as {max} or {min,max}.
Possible settings for "SpeechContexts" include:

    str->w                 give weight w to the string str
    {str1->w1,str2->w2,…}  give weight wi to the string stri

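As an illustration of the "SpeechContexts" setting, the following is a minimal sketch that passes weighted phrase hints to a "Recognize" request. The hint strings and the weight values are illustrative assumptions (the valid weight range is not stated here), and the sample recording from ExampleData is assumed to be available and accepted as "Input".

```wolfram
(* Sample speech recording; any Audio object containing speech could be used instead. *)
audio = ExampleData[{"Audio", "Apollo11SmallStep"}];

(* Each rule str -> w gives weight w to the phrase str (weights chosen arbitrarily). *)
ServiceExecute["GoogleSpeech", "Recognize",
 {"Input" -> audio,
  "SpeechContexts" -> {"small step" -> 10, "giant leap" -> 10}}]
```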
Examples of possible settings for "EffectsProfileID" include:

    "large-automotive-class-device"         optimized for car speakers
    "small-bluetooth-speaker-class-device"  optimized for small home speakers

Examples of possible settings for "Model" include:

    "latest_long"         optimized for long-form content
    "latest_short"        optimized for short-form content
    "command_and_search"  optimized for short queries

Examples


    Basic Examples  (1)

    Connect to Google speech service:
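A minimal sketch of this step. The variable name googleSpeech is an assumption used only for illustration; running it needs internet access and a Google API account, and may launch an authentication request if no saved connection is found.

```wolfram
(* Create, or reuse, a connection to the service. *)
googleSpeech = ServiceConnect["GoogleSpeech"]
```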

    Perform text-to-speech:
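One way this might look, assuming a connection stored in a variable named googleSpeech (a name chosen here for illustration) and an arbitrary input sentence. The form of the returned result is not asserted here.

```wolfram
googleSpeech = ServiceConnect["GoogleSpeech"];

(* "Input" is the only required parameter of the "Synthesize" request. *)
speech = ServiceExecute[googleSpeech, "Synthesize",
  {"Input" -> "Hello from the Wolfram Language."}]
```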

    Perform speech-to-text:
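A hedged sketch of a transcription request. It assumes that "Input" accepts an Audio object and uses a sample recording from ExampleData as a stand-in for real input; the variable names are illustrative.

```wolfram
googleSpeech = ServiceConnect["GoogleSpeech"];

(* Sample speech recording; replace with any audio containing speech. *)
audio = ExampleData[{"Audio", "Apollo11SmallStep"}];

(* "Input" is the only required parameter of the "Recognize" request. *)
result = ServiceExecute[googleSpeech, "Recognize", {"Input" -> audio}]
```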

    Scope  (2)

    Speech Synthesis  (1)

    Synthesize audio from text:

    Synthesize text in a different language. With Language set to Automatic (the default), the language is inferred from the input text; a particular language can also be specified. In either case, the service attempts to select a voice style for the requested language:
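For instance, a request that leaves Language at its default might be sketched as follows. The French sentence is an arbitrary example, and whether the language is detected correctly depends on the service.

```wolfram
(* No Language given, so it defaults to Automatic and is inferred from the text. *)
ServiceExecute["GoogleSpeech", "Synthesize",
 {"Input" -> "Bonjour tout le monde."}]
```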

    Use an explicit language:
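A sketch with the language given explicitly. It assumes that languages are named by strings such as "German", in the same way that the "Recognize" request defaults to "English"; the input text is arbitrary.

```wolfram
(* Language is a symbol, not a string key. *)
ServiceExecute["GoogleSpeech", "Synthesize",
 {"Input" -> "Guten Morgen.", Language -> "German"}]
```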

    List available voice styles:
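The "ListVoices" request could be called along these lines, first without parameters and then restricted to one language. The variable names and the choice of "French" are illustrative assumptions, and the structure of the returned list is not asserted here.

```wolfram
(* All available voice styles (Language defaults to All). *)
allVoices = ServiceExecute["GoogleSpeech", "ListVoices"]

(* Only voices able to synthesize a given language. *)
frenchVoices = ServiceExecute["GoogleSpeech", "ListVoices", {Language -> "French"}]
```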

    Synthesize speech using a particular voice:
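A sketch of selecting a voice by name. The name "en-US-Wavenet-D" is only an assumed example; a name actually returned by the "ListVoices" request should be substituted.

```wolfram
(* Replace the voice name with one obtained from "ListVoices". *)
ServiceExecute["GoogleSpeech", "Synthesize",
 {"Input" -> "This sentence uses a specific voice.",
  "Voice" -> "en-US-Wavenet-D"}]
```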

    Make the speech faster and lower in pitch:
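Following the parameter details for "Rate" and "Pitch", such a request might be sketched as below. The particular values are arbitrary: a rate above 1 speeds the voice up, and a negative pitch lowers it by that many semitones.

```wolfram
ServiceExecute["GoogleSpeech", "Synthesize",
 {"Input" -> "This sentence is spoken faster and at a lower pitch.",
  "Rate" -> 1.5,   (* 1 is the natural rate *)
  "Pitch" -> -3}]  (* semitones; 0 is the natural pitch *)
```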

    Speech Recognition  (1)

    Transcribe text from audio containing speech:

    By default, everything from the API response is returned, including information about recognized words:

    Return multiple guesses of the transcription:
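A sketch using MaxItems, which sets the maximum number of hypotheses to return. The value 3 and the sample recording are illustrative assumptions, and the service may return fewer hypotheses than requested.

```wolfram
audio = ExampleData[{"Audio", "Apollo11SmallStep"}];

(* MaxItems is a symbol; its default value is 1. *)
ServiceExecute["GoogleSpeech", "Recognize",
 {"Input" -> audio, MaxItems -> 3}]
```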

    Separate different speakers from a recording:

    Specify the minimum and maximum number of speakers:
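A sketch using the {min,max} form of "SpeakerDiarization". The file name "interview.wav" is a hypothetical placeholder for a recording with several speakers, and the speaker counts are arbitrary.

```wolfram
(* Hypothetical file; substitute a recording that contains several speakers. *)
recording = Import["interview.wav"];

(* Ask the service to detect between 2 and 4 distinct speakers. *)
ServiceExecute["GoogleSpeech", "Recognize",
 {"Input" -> recording, "SpeakerDiarization" -> {2, 4}}]
```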

    Display labeled words in a Dataset. The API currently returns speaker labels in the second result: