How to Use a Dataset from Google Drive on Your Local System
Datasets are often stored in Google Drive because it’s convenient for sharing and collaboration. However, when training machine learning models or processing large amounts of data, you’ll usually want to work with the dataset directly on your local machine.
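In this guide, we’ll walk through three ways to use a dataset stored in Google Drive: downloading it manually, syncing it with Google Drive for Desktop, and reading files on demand through the Google Drive API.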
Why Move a Dataset to Your Local System?
Working with datasets locally offers several benefits:
- Faster file access and loading times
- Reduced dependency on internet connectivity
- Better compatibility with training pipelines
- Easier debugging and experimentation
- Improved performance for large image datasets
This is especially useful when working with image datasets containing annotation files such as .txt labels for YOLO object detection models.
Method 1: Download the Dataset Manually
The simplest approach is to download the dataset directly from Google Drive.
Step 1: Open Google Drive
Navigate to your dataset folder in Google Drive.
Step 2: Download the Folder
- Right-click the dataset folder.
- Select Download.
- Google Drive will compress the folder into a ZIP archive.
- Save the ZIP file to your local machine.
Step 3: Extract the Dataset
After downloading, extract the archive:
unzip dataset.zip
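If the unzip command is not available on your system, the same step can be done from Python. The following is a minimal sketch using the standard-library zipfile module; extract_dataset is a hypothetical helper name, and dataset.zip is simply the archive name used in this guide.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, dest_dir="."):
    """Extract a ZIP archive into dest_dir and return the destination path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return dest


# "dataset.zip" is the archive name assumed in this guide; adjust it to your download.
if Path("dataset.zip").exists():
    extract_dataset("dataset.zip")
```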
A typical dataset structure may look like:
dataset/
├── images/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
└── labels/
    ├── image1.txt
    ├── image2.txt
    └── ...
Method 2: Sync Google Drive with Your Computer
If your dataset changes frequently, manually downloading it each time can become inconvenient.
Google Drive for Desktop allows you to sync files directly to your machine.
Benefits
- Automatic synchronization
- No repeated downloads
- Files appear like local folders
- Easy integration with scripts and training pipelines
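Once synced, you can access your dataset using a normal file path. Note that when Google Drive for Desktop is set to stream files, it fetches them on demand; choose the mirror option, or mark the dataset folder as available offline, if you need a full local copy for training.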
Example
dataset_path = "G:/My Drive/datasets/object-detection"
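The G: drive letter above is specific to one Windows setup, and the mount location differs across machines and operating systems. One way to avoid hardcoding it is sketched below. DATASET_ROOT is a hypothetical environment variable chosen for this example (Google Drive does not set it), and get_dataset_path is an illustrative helper.

```python
import os
from pathlib import Path

# Fallback taken from the example path in this guide.
DEFAULT_ROOT = "G:/My Drive/datasets/object-detection"


def get_dataset_path():
    """Return the dataset folder, preferring the DATASET_ROOT variable if set."""
    # DATASET_ROOT is a hypothetical variable name used only for illustration.
    return Path(os.environ.get("DATASET_ROOT", DEFAULT_ROOT))


dataset_path = get_dataset_path()
if not dataset_path.is_dir():
    print(f"Dataset folder not found: {dataset_path}")
```

Each collaborator can then point DATASET_ROOT at their own synced folder without editing the training script.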
Loading the Dataset in Python
After downloading or syncing the dataset, you can access it directly using Python.
List Images
from pathlib import Path
images = list(Path("dataset/images").glob("*.jpg"))
print(f"Found {len(images)} images")
List Annotation Files
from pathlib import Path
labels = list(Path("dataset/labels").glob("*.txt"))
print(f"Found {len(labels)} label files")
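Counting files does not show whether every image has a matching label. The sketch below compares file stems across the two folders, assuming the images/ and labels/ layout and the .jpg and .txt extensions shown earlier; find_unpaired is a hypothetical helper name.

```python
from pathlib import Path


def find_unpaired(images_dir, labels_dir):
    """Return (image stems without a label, label stems without an image)."""
    image_stems = {p.stem for p in Path(images_dir).glob("*.jpg")}
    label_stems = {p.stem for p in Path(labels_dir).glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)


# Only run the check if the example folder from this guide is present.
if Path("dataset").is_dir():
    missing_labels, missing_images = find_unpaired("dataset/images", "dataset/labels")
    print(f"Images without labels: {missing_labels}")
    print(f"Labels without images: {missing_images}")
```

Some YOLO tooling treats an image without a label file as a background image, so an image without a label is not always an error. A label without an image usually is.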
Method 3: Access a Dataset from Google Drive Without Downloading It
In some cases, you may not want to download an entire dataset to your local machine. For example:
- The dataset is very large.
- Storage space is limited.
- The dataset is frequently updated.
- You only need a subset of files at a time.
Using the Google Drive API, you can list files and access them on demand without maintaining a local copy.
Step 1: Install Required Libraries
pip install google-api-python-client google-auth-httplib2 google-auth-oauthlib
Step 2: Authenticate with Google Drive
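from google.oauth2 import service_account
from googleapiclient.discovery import build
# Read-only access is enough for listing and reading dataset files.
SCOPES = ['https://www.googleapis.com/auth/drive.readonly']
# Path to the service account key file downloaded from Google Cloud.
SERVICE_ACCOUNT_FILE = 'credentials.json'
# The dataset folder must be shared with the service account's email address,
# otherwise the service account will not see any of its files.
creds = service_account.Credentials.from_service_account_file(
    SERVICE_ACCOUNT_FILE,
    scopes=SCOPES
)
service = build('drive', 'v3', credentials=creds)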
Step 3: List Files in a Dataset Folder
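folder_id = "YOUR_FOLDER_ID"
# Ask for the ID and name of each file whose parent is the dataset folder.
# A single call returns one page of results (100 files by default).
results = service.files().list(
    q=f"'{folder_id}' in parents",
    fields="files(id, name)"
).execute()
files = results.get("files", [])
for file in files:
    print(file["name"], file["id"])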
Example output:
image1.jpg 1AbCdEfGhIj
image1.txt 2XyZaBcDeFg
image2.jpg 3MnOpQrStUv
image2.txt 4QrStUvWxYz
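A single files().list() call returns only one page of results, so a folder with many images needs pagination. The sketch below follows nextPageToken until the listing is complete and also skips trashed files. list_folder_files is a hypothetical helper name, and service is assumed to be the Drive v3 client built in Step 2.

```python
def list_folder_files(service, folder_id, page_size=1000):
    """Return every non-trashed file in a Drive folder, following page tokens."""
    files = []
    page_token = None
    while True:
        response = service.files().list(
            q=f"'{folder_id}' in parents and trashed = false",
            fields="nextPageToken, files(id, name)",
            pageSize=page_size,
            pageToken=page_token,
        ).execute()
        files.extend(response.get("files", []))
        # The token is absent on the last page.
        page_token = response.get("nextPageToken")
        if page_token is None:
            break
    return files
```

Calling files = list_folder_files(service, folder_id) then gives the same list of dictionaries as the single call, but covering the whole folder.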
Step 4: Access Files When Needed
Instead of downloading all files, keep track of file IDs and request them only when your application needs them.
for file in files:
    # Placeholder: fetch the file by its ID here, only when it is needed.
    print(f"Processing {file['name']}")
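The loop above only prints names. One possible way to actually request a file by its ID is sketched here, using files().get_media(), whose execute() returns the file content as bytes. The fetch_file helper and the drive_cache folder are assumptions made for this example: the small on-disk cache avoids re-downloading a file that is read repeatedly, and it can be dropped if you prefer to keep the bytes in memory.

```python
from pathlib import Path


def fetch_file(service, file_id, name, cache_dir="drive_cache"):
    """Return a local path for a Drive file, downloading it only on first use."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Name the cached copy by file ID so duplicate Drive names cannot collide.
    local_path = cache / f"{file_id}{Path(name).suffix}"
    if not local_path.exists():
        # For get_media requests, execute() returns the raw file content as bytes.
        content = service.files().get_media(fileId=file_id).execute()
        local_path.write_bytes(content)
    return local_path
```

This reads each file into memory in one request, which suits images and label files. For very large files, the client library's MediaIoBaseDownload class downloads in chunks instead.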
Step 5: Use Google Drive as Your Dataset Source
You can maintain a mapping from file names to file IDs, so that an image and its label file can be looked up by name:
dataset = {}
for file in files:
    # Key each Drive file ID by its file name.
    dataset[file["name"]] = file["id"]
print(dataset)
Example:
{
    "image1.jpg": "1AbCdEfGhIj",
    "image1.txt": "2XyZaBcDeFg",
    "image2.jpg": "3MnOpQrStUv",
    "image2.txt": "4QrStUvWxYz"
}
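A name-to-ID mapping does not by itself connect an image to its label. The sketch below groups entries by file stem, assuming the naming convention used in this guide (image1.jpg pairs with image1.txt). pair_images_with_labels is a hypothetical helper, not part of the Drive API.

```python
from pathlib import PurePosixPath


def pair_images_with_labels(name_to_id):
    """Group Drive file IDs by stem into image_id / label_id pairs."""
    pairs = {}
    for name, file_id in name_to_id.items():
        path = PurePosixPath(name)
        suffix = path.suffix.lower()
        if suffix == ".jpg":
            key = "image_id"
        elif suffix == ".txt":
            key = "label_id"
        else:
            continue  # ignore files that are neither images nor labels
        # A missing counterpart stays None so gaps are easy to spot.
        pairs.setdefault(path.stem, {"image_id": None, "label_id": None})[key] = file_id
    return pairs
```

With the dataset dictionary built above, pair_images_with_labels(dataset)["image1"] holds both IDs for the first sample, so a loader can request the image and its annotation together.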
When to Use This Approach
This method is useful when:
- Working with large datasets.
- Accessing shared datasets maintained by a team.
- Building cloud-native machine learning workflows.
- Avoiding duplicate local storage.
Advantages
- No local dataset copy required.
- Always accesses the latest version of the dataset.
- Saves disk space.
- Works well for large collections of images and annotations.
Limitations
- Requires an internet connection.
- Access speed depends on network performance.
- Not ideal for high-speed model training where thousands of files must be read repeatedly.
For most training workloads, downloading or syncing the dataset locally is faster. However, for dataset management, exploration, and occasional access, using Google Drive directly can be a convenient alternative.
Common Issues
Incorrect Paths
Verify that your training script points to the correct dataset location.
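A quick check before training can catch a wrong path early. This is a minimal sketch that assumes the images/ and labels/ layout shown earlier in this guide; check_dataset_dir is a hypothetical helper name.

```python
from pathlib import Path


def check_dataset_dir(root):
    """Return a list of problems; an empty list means the layout looks right."""
    root = Path(root)
    if not root.is_dir():
        return [f"Dataset folder does not exist: {root}"]
    problems = []
    for sub in ("images", "labels"):
        if not (root / sub).is_dir():
            problems.append(f"Missing subfolder: {root / sub}")
    return problems


# "dataset" is the example folder name from this guide.
for problem in check_dataset_dir("dataset"):
    print(problem)
```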
Large Dataset Downloads
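Google Drive may take several minutes to compress a large folder before the download starts, and it may split a very large folder into multiple ZIP archives that need to be extracted into the same directory.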
For datasets larger than several gigabytes, syncing with Google Drive for Desktop is often more efficient.
Conclusion
Using datasets stored in Google Drive on a local machine is straightforward. For one-time use, downloading the dataset manually is the easiest option. For ongoing projects, syncing with Google Drive for Desktop or accessing files on demand through the Google Drive API provides a more scalable solution.
Whether you’re training a YOLO model, building a computer vision application, or conducting data analysis, keeping your dataset accessible locally can significantly improve development speed and workflow efficiency.