Recently I came across a situation where I needed to collect a large volume of data from Google Drive. This is a quick write-up of how I used Python to download docs from Google Drive.
Enabling API Access
Authentication is often the hardest part of using an API, and this one was no exception. You can follow the guide here for the full directions, but I’ve added some additional details below.
Step 1: Setting up a Project
First, create a new project in the Google Cloud console by clicking the project selector, shown as “Google docs downloader” in the following screenshot.
You’ll get a popup like:
Select “New project”, enter a name for your project and click “create”.
Click back into the project selection screen and select your new project.
Step 2: Enabling access to your API
Once you’re in your new project, select “APIs and services” from the menu on the left of the screen.
Select the “ENABLE APIS AND SERVICES” button.
Search for “drive” in the search box to get a response like:
Select Google Drive API, then click “Enable”.
Select the “Create credentials” button.
Select the Google Drive API from the dropdown list:
Complete the OAuth screen with the name of your App.
Select the scopes you want:
Once you’ve picked your scopes click continue, then “Save and continue”.
Select “Desktop app” as your Application type.
Then click “create” and finally “download” to download your secret.
Rename the downloaded file as “credentials.json”.
Using the API
First, make sure that the “credentials.json” file is in your project folder. You’ll also need to install the appropriate libraries using:
pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib
Importing dependencies
from __future__ import print_function
import os.path
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
import io
from io import StringIO
from googleapiclient.http import MediaIoBaseDownload
The first step is to get all of the libraries imported. In this instance, we’re predominantly importing Google API libraries. We also import the io module to help us write our document to a file.
Entering your scope
When requesting an authorisation token, you need to state what level of access (or scope) you want to use the API with. In our case, we only need to read from the Drive API, so we’ve selected the read-only scope.
# If modifying these scopes, delete the file token.json.
# Scopes https://developers.google.com/drive/api/v2/about-auth
SCOPES = ['https://www.googleapis.com/auth/drive.readonly']
Identifying the document to download
In this example, we’re downloading a single document based on its document ID. If you navigate to a document in Google Drive, you’re likely to get a URL like “https://docs.google.com/document/d/1H2VA9MvMdfg6Dwfghb_ecEbxlgCm0DTdK_KPxigx8Ag”. The last section of that URL, after “/d/”, is the document ID that we’re using.
DOCUMENT_ID = '1H2VA9MvMdfg6Dwfghb_ecEbxlgCm0DTdK_KPxigx8Ag'
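If you’d rather paste the whole URL than copy the ID out by hand, something like the following could pull it out for you. This is only a sketch: the function name extract_document_id is my own, and it assumes the ID is the run of letters, digits, underscores and hyphens that follows “/d/” in the URL.

```python
import re

# Assumed URL shape: https://docs.google.com/document/d/<ID>[/edit...]
DOC_ID_PATTERN = re.compile(r"/d/([A-Za-z0-9_-]+)")


def extract_document_id(url):
    """Return the document ID found after '/d/' in a Google Docs URL."""
    match = DOC_ID_PATTERN.search(url)
    if match is None:
        raise ValueError("No document ID found in URL: %r" % url)
    return match.group(1)


doc_url = "https://docs.google.com/document/d/1H2VA9MvMdfg6Dwfghb_ecEbxlgCm0DTdK_KPxigx8Ag/edit"
DOCUMENT_ID = extract_document_id(doc_url)
print(DOCUMENT_ID)
```

Raising an error when nothing matches means a mistyped URL fails here, before any API call is made.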
Getting a Token
This section is common to most code that accesses Google APIs. If you have your “credentials.json” file in the same folder as the application, the code below will verify the credentials and generate a new “token.json” file which will allow you to access the API. The token’s access is limited by the scope you set above. If you have already run this code and a token file exists, that token will be reused.
creds = None
# The file token.json stores the user's access and refresh tokens, and is
# created automatically when the authorization flow completes for the first
# time.
if os.path.exists('token.json'):
    creds = Credentials.from_authorized_user_file('token.json', SCOPES)
# If there are no (valid) credentials available, let the user log in.
if not creds or not creds.valid:
    if creds and creds.expired and creds.refresh_token:
        creds.refresh(Request())
    else:
        flow = InstalledAppFlow.from_client_secrets_file(
            'credentials.json', SCOPES)
        creds = flow.run_local_server(port=0)
    # Save the credentials for the next run
    with open('token.json', 'w') as token:
        token.write(creds.to_json())
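One thing that can trip you up: a saved token keeps the scopes it was issued with, so if you change your scopes later the old token file needs deleting. A small check like the one below could flag that for you. It’s a sketch under an assumption: that the token file is JSON with a top-level “scopes” entry, which is how the file written by creds.to_json() looked for me. The helper name token_covers_scopes is made up for this example.

```python
import json
import os


def token_covers_scopes(token_path, required_scopes):
    """Return True if the saved token file lists every required scope."""
    if not os.path.exists(token_path):
        return False
    try:
        with open(token_path, "r", encoding="utf-8") as handle:
            data = json.load(handle)
    except json.JSONDecodeError:
        # An unreadable token file is treated the same as a missing one.
        return False
    if not isinstance(data, dict):
        return False
    # Assumption: scopes are stored under a top-level "scopes" key,
    # either as a list or as a single space-separated string.
    saved = data.get("scopes") or []
    if isinstance(saved, str):
        saved = saved.split()
    return set(required_scopes).issubset(saved)
```

If this returns False for a token file that exists, deleting that file and re-running the login flow should give you a token with the scopes you asked for.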
Creating a download service and downloading metadata
To download docs from Google Drive, we first need to create a Drive service using our authentication token.
downloadService = build('drive', 'v3', credentials=creds)
results = downloadService.files().get(fileId=DOCUMENT_ID, fields="id, name,mimeType,createdTime").execute()
docMimeType = results['mimeType']
Once the service is created, this code queries it and stores the response in “results”. It does this by calling the “get” method on the “files” resource. The get method takes arguments to:
a) Identify a specific document (e.g. the fileId)
b) Request any fields you want to be returned by this query.
This initial call retrieves metadata only; we’ll handle the download later. Note that we’re also recording the document’s MIME type so we can export the document in an appropriate format later.
Downloading the document
The first thing this code does is look up which export format is appropriate for the document’s MIME type. Google-native documents have to be exported to a concrete format (here, .docx), and asking for a format that doesn’t suit the document type makes the export fail.
mimeTypeMatchup = {
    "application/vnd.google-apps.document": {
        "exportType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "docExt": "docx"
    }
}
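The lookup table above only covers Google Docs, so any other file type would raise a KeyError. As a sketch of how it might be extended, the version below adds Sheets and Slides and fails with a clearer message for anything unmapped. The names EXPORT_FORMATS and lookup_export_format are mine, and the extra entries are my own additions rather than part of the original script.

```python
# Google-native MIME type -> export MIME type and file extension.
EXPORT_FORMATS = {
    "application/vnd.google-apps.document": {
        "exportType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "docExt": "docx",
    },
    "application/vnd.google-apps.spreadsheet": {
        "exportType": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "docExt": "xlsx",
    },
    "application/vnd.google-apps.presentation": {
        "exportType": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
        "docExt": "pptx",
    },
}


def lookup_export_format(doc_mime_type):
    """Return (export MIME type, file extension) for a Google-native file."""
    entry = EXPORT_FORMATS.get(doc_mime_type)
    if entry is None:
        raise ValueError("No export format configured for %r" % doc_mime_type)
    return entry["exportType"], entry["docExt"]


export_type, doc_ext = lookup_export_format("application/vnd.google-apps.document")
print(doc_ext)
```

Files that aren’t Google-native (PDFs, images and so on) don’t belong in this table at all; they are downloaded rather than exported, which this write-up doesn’t cover.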
Then we get the document’s name and file extension so that when we download it we can give it the right name.
exportMimeType = mimeTypeMatchup[docMimeType]['exportType']
docExt = mimeTypeMatchup[docMimeType]['docExt']
docName = results['name']
request = downloadService.files().export_media(fileId=DOCUMENT_ID, mimeType=exportMimeType)
# Export formats : https://developers.google.com/drive/api/v3/ref-export-formats
fh = io.FileIO(docName+"."+docExt, mode='w')
Next, we set up a request to export the document based on the fileId and the MIME type needed for the export. Finally, we create a FileIO handler in write mode to create the file.
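One caveat with building the file name straight from the document’s name: Drive lets you use characters in a title, such as “/” or “:”, that your file system may not accept. A small clean-up step along these lines could sit in front of the FileIO call. This is a sketch; safe_filename is a name I’ve made up, and the list of replaced characters is my assumption of a reasonable set (path separators, the characters Windows rejects, and control characters).

```python
import re

# Characters replaced: \ / : * ? " < > | and ASCII control characters.
UNSAFE_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(name, extension, replacement="_"):
    """Build a file name from a document title, replacing unsafe characters."""
    cleaned = UNSAFE_CHARS.sub(replacement, name)
    # Leading or trailing spaces and dots cause trouble on some systems.
    cleaned = cleaned.strip(" .")
    if not cleaned:
        cleaned = "untitled"
    return "%s.%s" % (cleaned, extension)


print(safe_filename("Q1/Q2 report: draft", "docx"))
```

With that in place, the handler could be opened as io.FileIO(safe_filename(docName, docExt), mode='w') instead of joining the two strings directly.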
The final download
Once we’ve set up our IO handler and built a request to download the file, we just need to loop through each chunk of data until the download has completed.
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))
And there we have it. By following these steps you’ll be able to download docs from Google Drive using the API.