Using Google Speech API from Python

Deprecated: use gtts instead

Google Chrome provides speech recognition for filling in web forms. It’s a great capability, and I’m assuming you want it for your own app. Well, we can take advantage of the Chromium Project, the open source project behind the Chrome browser, to discover its inner workings.

Google’s Speech Engine works through an HTTPS server. There is no official API, but you can connect to that server using the unofficial API for Speech API v1 or Speech API v2 (which has tentatively correct documentation), published on GitHub.

Should I use the Google Speech API?

Probably not. There is a plethora of other services. I think one should assume that if Google didn’t provide proper documentation for the Speech API, they don’t intend for you to use it. Moreover, Google limits you to 50 requests a day, and they don’t sell the service. No matter how much money you have, 50 is the limit. (You might want to take a look at Nuance’s Dragon for developers.)

Well, when should I use Google’s Speech API?

To test it. In general:

“Many of the Google APIs used by Chromium code are specific to Google Chrome and not intended for use in derived products”

Step 1: First, you’ll need an API key.

The Chromium Projects site explains in detail how to obtain a general API key. I’ll narrow it down to what you need to use the Speech API (quick guide by ORION):

  1. Go to https://cloud.google.com/console and create your own project.
  2. Join this group: https://groups.google.com/a/chromium.org/forum/?fromgroups#!forum/chromium-dev.
  3. In your project go to APIs & auth > APIs , and activate Speech API (only 50 requests for each key).
  4. Go to Credentials and make your client.
  5. Generate a Browser key.

Step 2: Planning it out.

You now have a key, and you can send it along with your requests to the server for authentication.

So how does it work? Well, it’s pretty simple. All you need to do is send your file (using an HTTP request) to Google’s server, supplying your key and the recognition language.


GOOGLE_SPEECH_URL_V2 = "https://www.google.com/speech-api/v2/recognize?output=json&lang=%s&key=%s" % ("en-US", "YOUR_KEY")

and in response – Google will give you the transcript. Like magic!
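Here is a minimal sketch of that exchange in Python 3, assuming the v2 endpoint and the output, lang and key parameters shown above. The helper names build_url, build_request and parse_transcript are made up for illustration, and nothing is actually sent over the network here.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://www.google.com/speech-api/v2/recognize"


def build_url(lang, key):
    """Return the recognition URL with the query string properly escaped."""
    query = urllib.parse.urlencode({"output": "json", "lang": lang, "key": key})
    return BASE_URL + "?" + query


def build_request(flac_bytes, lang, key, rate=16000):
    """Prepare (but do not send) a POST request carrying FLAC audio."""
    headers = {"Content-Type": "audio/x-flac; rate=%d" % rate}
    return urllib.request.Request(build_url(lang, key), data=flac_bytes, headers=headers)


def parse_transcript(body):
    """Return the first transcript in a newline-delimited JSON body, or ''.

    Assumption: each line is a JSON object with a 'result' list, which is the
    shape the parsing code later in this post relies on.
    """
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        results = json.loads(line).get("result", [])
        if results:
            return results[0]["alternative"][0]["transcript"]
    return ""
```

To actually send the request you would pass the object returned by build_request to urllib.request.urlopen and decode the bytes it returns before handing them to parse_transcript.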

Step 3: The Trick

Trick 1: Uh-huh. But then, nothing in life is simple. Google’s Speech Recognition engine only works with mono (single-channel) FLAC files, and works best with a sample rate of 16000 Hz. So we’ll just convert, no prob! Download FLAC from the official website to get the command line tool for converting to FLAC.
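Before converting, it can help to confirm that a WAV file really is mono at 16000 Hz, since a lossless encoder like flac keeps whatever channel count and sample rate the input has. A small Python 3 sketch using the standard wave module; the function name is_mono_16k is made up for illustration.

```python
import wave


def is_mono_16k(path, rate=16000):
    """Return True if the WAV file at `path` has one channel sampled at `rate` Hz."""
    with wave.open(path, "rb") as w:
        return w.getnchannels() == 1 and w.getframerate() == rate
```

If the check fails, the file needs downmixing or resampling in an audio tool of your choice before it goes through flac.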

Trick 2: At the same time, we might want to consider what happens if we have a long sound file. You obviously can’t send it over the web all at once. Google won’t accept it. We’ll have to split it up. I chose to split the WAV file first and only then convert each piece to FLAC.
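The simplest way to picture the splitting is as a list of frame ranges. Here is a rough Python 3 sketch that cuts at fixed intervals rather than at silences, so it can cut through a word, unlike the silence-based split used later in this post. The name chunk_ranges and the 10-second default are arbitrary choices for illustration, not limits documented by Google.

```python
def chunk_ranges(n_frames, frame_rate, max_secs=10):
    """Return (start, end) frame pairs covering n_frames in pieces of at most max_secs."""
    step = int(frame_rate * max_secs)
    if step <= 0:
        raise ValueError("frame_rate and max_secs must be positive")
    return [(start, min(start + step, n_frames)) for start in range(0, n_frames, step)]
```

Each pair can then be used with the wave module: setpos(start) followed by readframes(end - start) gives the audio for that chunk.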

Step 4: Let’s do this!

Right, we got everything set up and we got ourselves a plan!

  1. Split the large wav file to smaller wav files
  2. Convert each wav file to flac
  3. Send each one to the aforementioned GOOGLE_SPEECH_URL_V2, and concatenate the short transcripts into one long transcript.

It’s so much easier to understand with code:


# Written for Python 3
import json
import subprocess
import urllib.request
import wave
from time import sleep

# Path to the flac command line tool; -f overwrites an existing output file
FLAC_CONV = '"C:\\Program Files\\FLAC\\flac" -f'
# Replace YOUR_KEY with your own API key
GOOGLE_SPEECH_URL_V2 = "https://www.google.com/speech-api/v2/recognize?output=json&lang=he&key=YOUR_KEY"

def google_stt_long_file(filename):
    short_files = split_wav_by_silence(filename)  # Split the long file
    complete_transcript = ""
    for short_file in short_files:
        flac_file = convert_to_flac(short_file)
        sleep(.5)  # subprocess.call already waits for flac; this is just extra caution
        transcript = stt_google(flac_file)
        complete_transcript += " " + transcript  # Add this chunk's transcript
        sleep(.5)  # Make sure we're not overloading the server, it tends to get angry.
    return complete_transcript.strip()

def stt_google(filename):
    with open(filename, 'rb') as f:
        flac_cont = f.read()

    # Headers. A common Chromium (Linux) User-Agent
    hrs = {"User-Agent": "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.63 Safari/535.7",
           'Content-type': 'audio/x-flac; rate=16000'}

    req = urllib.request.Request(GOOGLE_SPEECH_URL_V2, data=flac_cont, headers=hrs)
    print("Sending request to Google STT")
    p = urllib.request.urlopen(req)
    response = p.read().decode('utf-8')  # The body arrives as bytes; decode before splitting
    response = response.split('\n', 1)[1]  # Keep everything after the first line
    print(response)
    # Try to get something out of the complicated json response:
    res = json.loads(response)['result'][0]['alternative'][0]['transcript']

    return res

def convert_to_flac(filename):
    print("Converting", filename)
    if not filename.endswith('.flac'):  # Check it's not already a flac file
        print("Converting to flac")
        command = FLAC_CONV + ' ' + filename  # Prepare the command that uses the command line tool
        subprocess.call(command, shell=True)  # Run the command and wait for it to finish
        filename = filename.rsplit('.', 1)[0] + '.flac'  # Get the new file's name
    return filename

def split_wav_by_silence(filename, min_length_secs=5):
    w = wave.open(filename, 'r')
    nframes = w.getnframes()
    framerate = w.getframerate()
    count = 0
    indices = []
    start_frame = 0
    end_frame = 0
    for i in range(nframes):
        # Read 1 frame; this also advances the read position
        frame = w.readframes(1)

        # The frame counts as quiet if no byte in it is greater than 1
        quiet = all(byte <= 1 for byte in frame)

        if quiet:
            count += 1
        else:
            count = 0

        last = (i == nframes - 1)
        if count > 1 or last:  # Detected a silent part
            end_frame = w.tell()
            start_second = start_frame / framerate
            end_second = end_frame / framerate
            if end_second - start_second > min_length_secs:
                indices.append({'start': start_frame, 'end': end_frame})
                start_frame = end_frame
            elif last:  # If it's the last frame, we need to add that last part.
                if indices:
                    indices[-1]['end'] = end_frame
                else:  # The whole file is shorter than min_length_secs
                    indices.append({'start': start_frame, 'end': end_frame})

    files = []
    for count, location in enumerate(indices):
        start = location['start']
        end = location['end']
        print(str(start) + ' to ' + str(end))
        w.setpos(start)  # Set position on the original wav file
        chunk_data = w.readframes(end - start)  # And read to where we need

        chunk_name = 'file_' + str(count) + '.wav'
        chunk_audio = wave.open(chunk_name, 'w')
        chunk_audio.setnchannels(w.getnchannels())
        chunk_audio.setsampwidth(w.getsampwidth())
        chunk_audio.setframerate(framerate)
        chunk_audio.writeframes(chunk_data)
        chunk_audio.close()
        files.append(chunk_name)

    w.close()
    return files
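
One caveat: the silence test above looks at raw bytes, which is crude for 16-bit audio. A quiet negative sample such as -1 is stored as the bytes 0xFF 0xFF and so counts as loud. Below is a hedged Python 3 sketch of an amplitude-based check, assuming 16-bit little-endian signed PCM (the usual WAV layout). The name is_quiet_frame and the threshold value are illustrative and not part of the script above.

```python
import struct


def is_quiet_frame(frame, threshold=500):
    """Return True if every 16-bit sample in `frame` lies within +/- threshold.

    Assumes `frame` holds little-endian signed 16-bit samples, one per channel.
    """
    n_samples = len(frame) // 2
    samples = struct.unpack("<%dh" % n_samples, frame[:n_samples * 2])
    return all(abs(s) <= threshold for s in samples)
```

It could stand in for the byte loop inside split_wav_by_silence when the file's sample width is 2; the right threshold depends on how noisy the recording is.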

3 thoughts on “Using Google Speech API from Python”

  1. Hi,

    I have tried this code but it’s giving following error. I am new to python. Please sort this error if possible. Thanks.

    in stt_google
    response = response.split('\n', 1)[1]
    TypeError: Type str doesn’t support the buffer API

    • Hello I think that you’re searching for this script? Right ? I’m new to Python too…

      from gtts import gTTS
      import os
      blabla = ("Hello ")
      tts = gTTS(text=blabla, lang='en')
      tts.save("test.mp3")
      os.system("test.mp3")
