Blog

Top 5 CLI Tools Data Scientists Use to Bulk Download Files From Google Drive and S3 Buckets When Preparing ML Datasets

Preparing machine learning datasets often involves collecting files from different sources, with cloud storage platforms like Google Drive and Amazon S3 being among the most commonly used. While web interfaces offer ease of use, many data scientists prefer the precision and automation of command-line tools to bulk download datasets. These CLI tools streamline large-scale file transfers, automate dataset synchronization, and allow seamless scaling — crucial elements in any data-centric pipeline.

TL;DR: When working with large datasets located on Google Drive or Amazon S3, CLI tools like rclone, awscli, gdown, and others are go-to solutions for data scientists. They offer speed, automation, and more control compared to GUI approaches. This article outlines the top 5 CLI tools to efficiently bulk download files from these cloud platforms. Whether you’re setting up a one-time download or building a repeatable pipeline, these tools can save hours of manual effort.

1. Rclone

Overview: Rclone is a powerful, open-source command-line tool designed for managing files across cloud storage providers. It’s best known for its flexibility and compatibility with over 40 cloud storage services, including both Google Drive and Amazon S3.

Why Data Scientists Use It:

  • Works across both Google Drive and S3 from one interface
  • Supports syncing, mounting, and copying files
  • Highly scriptable for use in data pipelines

Example Command:

rclone copy gdrive:dataset-folder /local/destination --progress

With Rclone, data scientists can even mount a cloud storage directory as if it were part of the local file system — a very helpful feature when experimenting with large datasets across different machines or cloud-based notebooks.
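
For instance, a configured remote can be mounted read-only in the background. A minimal sketch, assuming a remote named `gdrive` was already created with `rclone config` (the remote name, folder, and mount point are placeholders):

```shell
# Sketch: mount a Google Drive remote as a local directory.
# Assumes a remote called "gdrive" exists; names below are placeholders.
MOUNT_POINT="$HOME/gdrive-datasets"
MOUNT_CMD="rclone mount gdrive:dataset-folder $MOUNT_POINT --read-only --daemon"

echo "$MOUNT_CMD"
# Only attempt the mount when rclone is actually installed:
if command -v rclone >/dev/null 2>&1; then
  mkdir -p "$MOUNT_POINT"
  eval "$MOUNT_CMD" || true
fi
```

The `--daemon` flag backgrounds the mount, and `--read-only` is a sensible default for training data you don't intend to modify.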

2. AWS CLI

Overview: The AWS Command Line Interface (AWS CLI) is Amazon’s official tool for interacting with AWS services. It covers virtually every AWS product, but it’s particularly useful for managing data in S3 — Amazon’s object storage service.

Why Data Scientists Use It:

  • Native support for large-scale S3 data operations
  • Built-in configuration for access/secret keys and session tokens
  • Supports multipart transfers and automatic retries

Example Command:

aws s3 sync s3://my-dataset-bucket /local/path/ --no-sign-request

Data scientists often appreciate the ability to sync entire directories with the sync command, ensuring that only new or changed files are downloaded, which saves considerable bandwidth and time.
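
The `sync` command also composes with the AWS CLI's filter flags. A sketch that pulls only the CSV files from a public bucket (the bucket name and destination are placeholders; `--exclude '*'` followed by `--include '*.csv'` restricts the transfer to CSVs, since later filters take precedence):

```shell
# Sketch: incremental sync of only the CSVs from a public bucket.
# Bucket and destination are placeholders.
SRC="s3://my-dataset-bucket/raw"
DEST="./data/raw"
SYNC_CMD="aws s3 sync $SRC $DEST --no-sign-request --exclude '*' --include '*.csv'"

echo "$SYNC_CMD"
mkdir -p "$DEST"
# Only run the sync when the AWS CLI is actually installed:
if command -v aws >/dev/null 2>&1; then
  eval "$SYNC_CMD" || true
fi
```

Because `sync` only transfers new or changed objects, re-running this is cheap, which makes it a good fit for repeatable dataset-refresh scripts.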

3. gdown

Overview: gdown is a lightweight, Python-based CLI tool built specifically for downloading files from Google Drive.

Why Data Scientists Use It:

  • Simple installation via pip install gdown
  • Handles Google Drive’s confirmation pages (such as the large-file virus-scan warning) automatically
  • Ideal for downloading shared public links or files via ID

Example Command:

gdown "https://drive.google.com/uc?id=FILE_ID"

For small to medium-sized datasets hosted on personal Google Drive or shared by collaborators, gdown is a minimalist yet reliable choice. It requires no setup, and integrates easily into Python workflows and shell scripts.
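
Because gdown is just another shell command, it drops neatly into loops. A minimal sketch that fetches several shared files by ID (the IDs and destination folder below are placeholders):

```shell
# Sketch: download several publicly shared Drive files by ID.
# FILE_ID_1 etc. are placeholders for real Drive file IDs.
FILE_IDS="FILE_ID_1 FILE_ID_2 FILE_ID_3"
DEST="./downloads"

mkdir -p "$DEST"
for id in $FILE_IDS; do
  echo "fetching $id"
  # Only attempt the download when gdown is actually installed:
  command -v gdown >/dev/null 2>&1 && \
    gdown "https://drive.google.com/uc?id=$id" -O "$DEST/$id" || true
done
```

gdown also accepts a bare file ID as its argument, and recent versions can fetch entire shared folders with the --folder flag.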

4. s5cmd

Overview: s5cmd is a very fast command-line tool for S3 and local filesystems. It’s designed to outperform the AWS CLI in speed, especially for batch operations, by running transfers heavily in parallel.

Why Data Scientists Use It:

  • Extremely fast, designed for parallel operation
  • Great for high-performance environments
  • Supports wildcards and bulk actions

Example Command:

s5cmd cp 's3://bucket-name/*.csv' ./local_folder/

When working with tens of thousands of files or very large datasets from S3 buckets, s5cmd offers a dramatic improvement in speed over the AWS CLI. It’s especially popular in situations where I/O throughput is a bottleneck.
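
For very large batches, s5cmd's `run` subcommand reads one command per line from a file and executes them in parallel. A sketch (the bucket name and paths are placeholders):

```shell
# Sketch: queue many copy operations and hand them to `s5cmd run`,
# which executes the listed commands in parallel. Paths are placeholders.
CMDFILE="$(mktemp)"
cat > "$CMDFILE" <<'EOF'
cp s3://my-dataset-bucket/images/* ./images/
cp s3://my-dataset-bucket/labels/* ./labels/
EOF

QUEUED=$(wc -l < "$CMDFILE")
echo "commands queued: $QUEUED"
# Only execute the batch when s5cmd is actually installed:
command -v s5cmd >/dev/null 2>&1 && s5cmd run "$CMDFILE" || true
rm -f "$CMDFILE"
```

Generating the command file from a manifest (for example, a list of shard names) is a common pattern for dataset pipelines.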

5. gsutil (Google Cloud SDK) and gcsfuse

Overview: Although primarily used with Google Cloud Storage (GCS), the Google Cloud SDK’s gsutil tool enables data scientists to handle downloads and uploads with precision. gcsfuse can mount GCS buckets to the local file system for easy access.

Why Data Scientists Use It:

  • Highly suitable for GCP-based machine learning workflows
  • Supports mirroring with gsutil rsync
  • Good alternative when data is stored in Google Cloud instead of Drive

Example Command (gsutil):

gsutil -m cp -r gs://my-gcs-bucket/datasets /local/path

For large Google-hosted datasets in public buckets, gsutil’s -m flag enables parallel (multi-threaded/multi-process) transfers, and its resumable downloads make long transfers resilient to network interruptions.
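
An incremental mirror works much like `aws s3 sync`, via `gsutil rsync`. A sketch (the bucket name is a placeholder; -m parallelizes and -r recurses into subdirectories):

```shell
# Sketch: mirror a GCS prefix locally with parallel, incremental rsync.
# Bucket name and destination are placeholders.
SRC="gs://my-gcs-bucket/datasets"
DEST="./datasets"
RSYNC_CMD="gsutil -m rsync -r $SRC $DEST"

echo "$RSYNC_CMD"
mkdir -p "$DEST"
# Only run the mirror when gsutil is actually installed:
if command -v gsutil >/dev/null 2>&1; then
  eval "$RSYNC_CMD" || true
fi
```

Like `aws s3 sync`, `gsutil rsync` only transfers new or changed objects, so re-running it is cheap.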

Choosing the Right Tool

The ideal CLI tool depends on your specific storage provider and project requirements. Here’s a quick comparison to help make the decision easier:

| Tool    | Supports Drive    | Supports S3 | Speed  | Ease of Use |
|---------|-------------------|-------------|--------|-------------|
| Rclone  | ✔️                | ✔️          | Medium | Medium      |
| AWS CLI | ✖️                | ✔️          | Medium | Medium      |
| gdown   | ✔️                | ✖️          | Low    | High        |
| s5cmd   | ✖️                | ✔️          | High   | Medium      |
| gsutil  | ✖️ (GCS instead)  | ✔️ (with AWS credentials) | High | Low |

FAQs

  • Q: Which tool should I use for downloading multiple Google Drive files?
    A: Use rclone for folder-based downloads or gdown for individual file downloads by ID.
  • Q: I want the fastest way to sync my S3 bucket to local storage. What tool is best?
    A: S5cmd is generally faster than awscli, especially for bulk operations.
  • Q: Can I automate these tools in a scheduled job or within Python?
    A: Yes, most tools can be invoked via shell scripts or subprocesses in Python.
  • Q: Are there cross-platform CLI tools for both Google Drive and S3?
    A: Rclone is the most versatile option supporting both platforms.
  • Q: How can I handle authentication when using these tools on cloud VMs?
    A: Prefer the platform’s native identity where possible: attach an IAM role to your EC2 instance so the AWS CLI and s5cmd pick up credentials automatically, or rely on the VM’s service account for gsutil on GCP. For rclone, copy its configuration file to the VM or run its interactive setup once.
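
As the FAQ notes, all of these tools automate well. A minimal sketch of a wrapper script that could be scheduled with cron (the bucket, paths, and the cron line in the comment are placeholders):

```shell
# Sketch: a nightly dataset-refresh wrapper, suitable for a cron entry like
#   0 2 * * * /path/to/nightly_sync.sh >> "$HOME/dataset_sync.log" 2>&1
# Bucket and destination below are placeholders.
BUCKET="s3://my-dataset-bucket"
DEST="$HOME/datasets"
STAMP="$(date +%Y-%m-%dT%H:%M:%S)"

echo "[$STAMP] starting sync of $BUCKET"
mkdir -p "$DEST"
if command -v aws >/dev/null 2>&1; then
  # Incremental: only new or changed objects are transferred.
  aws s3 sync "$BUCKET" "$DEST" --no-sign-request || echo "[$STAMP] sync failed"
else
  echo "[$STAMP] aws CLI not installed; skipping"
fi
```

The same wrapper shape works for rclone, s5cmd, or gsutil — swap the sync line and keep the logging and directory setup.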