docs/en/tutorials/juicefs_on_colab.md
Colaboratory, or "Colab" for short, is a product by Google Research. Colab enables users to write and execute arbitrary Python code through the browser. It is particularly well suited for machine learning, data analysis, and educational purposes.
Colab supports Google Drive for uploading files to or downloading files from Colab instances. However, in some cases, Google Drive is not convenient to use with Colab. This is where JuiceFS can be a valuable tool, enabling easy file synchronization between Colab instances or between a Colab instance and a local or on-premises machine.
A demo Colab notebook using JuiceFS is available here.
This document outlines the necessary steps for using JuiceFS in the Colab environment. We use Google Cloud SQL as the JuiceFS metadata engine and Google Cloud Storage (GCS) as the JuiceFS object storage.
For other types of metadata engines or object storages, see How to Set Up a Metadata Engine and How to Set Up Object Storage.
Many of the steps mentioned here will be quite similar to the Getting Started document, which you can also use for reference.
1. Format a juicefs file system from any machine or instance with access to Google Cloud resources.
2. Mount the juicefs file system in a Colab Notebook.

This demo uses Google Cloud Platform's Cloud SQL and Google Cloud Storage (GCS) to build a high-performance JuiceFS file system. You need a Google Cloud Platform account to follow this demo document.
If you have another cloud vendor's resources (such as AWS RDS and S3), you can still use this guide as a reference, together with the other reference documents provided by JuiceFS, to achieve a similar solution.
For the best JuiceFS performance, you might also want the Colab instance to be in the same zone as, or close to the region where, Cloud SQL and GCS are deployed. The tutorial works for a randomly hosted Colab instance, but you might notice slower performance due to the latency between the Colab instance and the Cloud SQL/GCS regions. To start Colab instances in a specific region, see instructions for starting a GCE VM on Colab via GCP Marketplace.
Before diving into the detailed steps, ensure you have the following resources ready:
- A Google Cloud project. This demo uses juicefs-learning.
- A Cloud SQL (Postgres) instance. This demo uses the juicefs-learning:europe-west1:juicefs-sql-example-1 instance as the metadata service.
- A GCS bucket. This demo uses gs://juicefs-bucket-example-1 as the bucket to store file chunks.

This step needs to be done only once, and you can run it on any machine or instance with good connectivity and access to your Google Cloud resources.
Use gcloud auth application-default login to prepare a local credential, or use GOOGLE_APPLICATION_CREDENTIALS to set up a JSON key file.
Use cloud_sql_proxy to open a port (in this case, 5432) locally to expose your cloud Postgres service to your local machine:
gcloud auth application-default login
# Or set up the json key file via GOOGLE_APPLICATION_CREDENTIALS=/path/to/key
cloud_sql_proxy -instances=juicefs-learning:europe-west1:juicefs-sql-example-1=tcp:0.0.0.0:5432
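Before starting the proxy, it can help to confirm which of the two credential sources mentioned above is actually in place. The following is a minimal Python sketch, not part of JuiceFS or gcloud. The function name find_gcp_credentials is hypothetical, and the sketch assumes the default gcloud configuration directory on Linux or macOS (it does not account for a custom CLOUDSDK_CONFIG location, such as the one Colab uses).

```python
from pathlib import Path
from typing import Mapping, Optional


def find_gcp_credentials(env: Mapping[str, str], home: Path) -> Optional[Path]:
    """Return the credential file that would likely be picked up, or None.

    Hypothetical helper for illustration only. It checks the
    GOOGLE_APPLICATION_CREDENTIALS variable first, then the file written by
    `gcloud auth application-default login` in the default config directory.
    """
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        # An explicit key file takes priority; report None if it is missing.
        path = Path(explicit)
        return path if path.is_file() else None

    # Assumed default location on Linux/macOS when CLOUDSDK_CONFIG is unset.
    default = home / ".config" / "gcloud" / "application_default_credentials.json"
    return default if default.is_file() else None
```

Calling find_gcp_credentials(os.environ, Path.home()) (after importing os) returns the path of the file found, or None if neither source is set up yet.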
Use the following command to create a new file system named myvolume using the juicefs format command. Later, you can mount this file system on any other machines or instances where you have access to your cloud resources.
You can download juicefs here.
juicefs format \
--storage gs \
--bucket gs://juicefs-bucket-example-1 \
"postgres://postgres:mushroom1@localhost:5432/juicefs?sslmode=disable" \
myvolume
Note that this step is only required once on any machine you prefer to work on.
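If you prefer to assemble the format command from variables instead of typing it by hand, the pieces can be composed as in the Python sketch below. The helper names build_meta_url and format_command are hypothetical and only mirror the command shown above. Percent-encoding the user name and password is an assumption based on the usual URL convention, and it matters only when they contain special characters.

```python
import shlex
from urllib.parse import quote


def build_meta_url(user, password, host, port, database, sslmode="disable"):
    """Build a PostgreSQL metadata URL, percent-encoding the credentials."""
    return (
        f"postgres://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}?sslmode={sslmode}"
    )


def format_command(meta_url, volume, bucket, storage="gs"):
    """Return the argument list for `juicefs format`, as used in this tutorial."""
    return ["juicefs", "format", "--storage", storage, "--bucket", bucket, meta_url, volume]


# Values taken from this tutorial; replace them with your own resources.
meta_url = build_meta_url("postgres", "mushroom1", "localhost", 5432, "juicefs")
command = format_command(meta_url, "myvolume", "gs://juicefs-bucket-example-1")

# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(command))
```

The argument list can also be passed directly to subprocess.run, which avoids shell quoting issues with the ? and & characters that metadata URLs may contain.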
Once you have completed Step 1, you have a JuiceFS file system (named myvolume in this case) defined and ready to use.
Now, let's open a Colab page and execute the following commands to mount our file system into a folder named mnt.
First, download the juicefs binary, then obtain GCP credentials and open the Cloud SQL proxy as you did in Step 1.
Note that the following commands run in the Colab environment, so each one starts with a ! mark to run it as a shell command.
Download juicefs to the Colab runtime instance:
! curl -sSL https://d.juicefs.com/install | sh -
Set up Google Cloud credentials:
! gcloud auth application-default login
Open cloud_sql_proxy:
! wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 -O cloud_sql_proxy
! chmod +x cloud_sql_proxy
! GOOGLE_APPLICATION_CREDENTIALS=/content/.config/application_default_credentials.json nohup ./cloud_sql_proxy -instances=juicefs-learning:europe-west1:juicefs-sql-example-1=tcp:0.0.0.0:5432 >> cloud_sql_proxy.log 2>&1 &
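Because the proxy is started in the background, the next cell can run before the port is open. One way to guard against that, sketched here in Python under the assumption that the proxy listens on localhost:5432 as above, is to poll the port until it accepts a TCP connection. The function name wait_for_port is hypothetical. A successful connection only shows that something is listening on the port, not that database authentication will succeed.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or the timeout elapses.

    Returns True once a connection is accepted, False if the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Open and immediately close a connection as a readiness probe.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

In a Colab cell, calling wait_for_port("localhost", 5432) before mounting returns True as soon as the proxy is listening. If it returns False, the proxy's log output is the first place to look.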
Mount the myvolume JuiceFS file system onto the mnt folder:
! GOOGLE_APPLICATION_CREDENTIALS=/content/.config/application_default_credentials.json nohup juicefs mount "postgres://postgres:mushroom1@localhost:5432/juicefs?sslmode=disable" mnt > juicefs.log 2>&1 &
Now you should be able to use the mnt folder as if it were a local file system folder to write and read folders and files in it.
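A quick way to confirm that the folder behaves as expected is to write a small file and read it back. The Python sketch below does this with the standard library only. The function name roundtrip_check and the probe file name are hypothetical, and the check works on any writable directory, so it does not by itself prove that the directory is backed by JuiceFS.

```python
import uuid
from pathlib import Path


def roundtrip_check(mount_dir):
    """Write a small file under mount_dir, read it back, then remove it.

    Returns True if the bytes read match the bytes written.
    """
    probe = Path(mount_dir) / f".probe-{uuid.uuid4().hex}"
    payload = b"hello from juicefs"
    probe.write_bytes(payload)
    try:
        return probe.read_bytes() == payload
    finally:
        # Clean up the probe file even if the read fails.
        probe.unlink()
```

In a Colab cell, roundtrip_check("mnt") should return True after a successful mount. Pairing it with os.path.ismount("mnt") helps distinguish a mounted file system from an ordinary local folder that happens to have the same name.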
With data stored in the JuiceFS file system in Step 2, you can repeat all the operations mentioned in Step 2 at any time on any other machines to access the previously stored data or to store more data into it.
Congratulations! You have now learned how to use JuiceFS with Google Colab to conveniently share and store data files in a distributed fashion.
Feel free to explore a demo Colab notebook using JuiceFS here.
Happy coding :)