Vertex AI Workbench Comparison

Vertex AI
Vertex AI Workbench and comparison with similar solutions, like Colab Enterprise
Author

Rafa Sanchez

Published

September 7, 2023

This post describes Vertex AI Workbench. First of all, some terminology:

Cloud-based notebook offering

  • Vertex AI Workbench: the Vertex AI notebook offering, composed of Workbench Instances and the older Managed and User-managed notebooks.
  • Colab Enterprise: managed and collaborative service that combines the ease-of-use of Google’s Colab notebooks with enterprise-level security and compliance support capabilities of Google Cloud. Colab Enterprise also powers a notebook experience for BigQuery Studio.
  • Cloud Datalab: old, deprecated Google Cloud managed notebook product (non-Jupyter), enterprise-grade, built on top of a virtual machine.
  • Colaboratory: based on Jupyter, it’s Google’s offering targeted at research, a product meant to help teach and disseminate ML education. It provides a free-of-charge environment, including free GPUs and TPUs, and is integrated with Google Drive (you can “mount” Drive on Colaboratory).
  • Colaboratory Pro: Google offering announced in February 2020; it’s a paid product (USD 9.99/month) which provides faster GPUs (T4, P100), high-memory VMs and longer runtimes (previously, runtimes were reset every 12 hours).
  • Apache Zeppelin: web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
  • AWS EMR notebook: part of the EMR (Elastic MapReduce) platform, it’s a managed Jupyter notebook integrated with Spark workloads and other services like Amazon S3 (see the 2018 announcement). It runs Jupyter Notebook version 5.7.0 and Python 3.6.5. EMR Notebooks is pre-configured with the following kernels and library packages installed: PySpark, PySpark3, Python3, Spark and SparkR.
  • AWS Sagemaker Studio notebook: different from the EMR notebook, and recently announced (2019 announcement). It’s integrated with ML tools, but still in early stages.
  • Azure ML notebooks: free hosted service to develop and run Jupyter notebooks in the cloud with no installation. Docs here.

Colab vs Colab Pro vs Vertex AI Workbench Instances

|                           | Colab      | Colab Pro                  | Workbench Instances                      |
|---------------------------|------------|----------------------------|------------------------------------------|
| GPUs                      | Mostly K80 | K80, P100, T4              | K80, P100, T4, V100, A100                |
| CPUs                      | 2 vCPU     | 2 vCPU                     | Up to 96 vCPU                            |
| Session length            | 12 hours   | Unlimited                  | Unlimited                                |
| Storage                   | 5 GB       | Up to 1 TB                 | Unlimited                                |
| Guaranteed resources      | No         | No                         | Yes                                      |
| Enterprise support        | No         | No                         | Yes                                      |
| Remote SSH access         | No         | No                         | Yes                                      |
| Scheduler                 | No         | No                         | Yes                                      |
| Security                  | IAM        | IAM                        | IAM, VPC-SC, CMEK, AxT, VPC peering, DRZ |
| Price                     | Free       | USD 8 to 49 per user/month | Pay-per-use (GCP)                        |
| Dataproc/Dataflow engines | No         | No                         | Yes                                      |
| Region                    | All        | 7 countries only           | All                                      |

Colab Enterprise vs Vertex AI Workbench Instances

|             | Colab Enterprise                            | Workbench Instances                         |
|-------------|---------------------------------------------|---------------------------------------------|
| Environment | Zero config, serverless, collaborative      | JupyterLab (laptop replacement)             |
| Add-ons     | Assistive                                   | JupyterLab extensions                       |
| Projects    | Reading/running notebooks                   | Projects with multiple notebooks, files     |
| Integration | BigQuery, Spark (Dataproc serverless Spark) | BigQuery, Spark (Dataproc serverless Spark) |
| Others      | BigQuery Studio                             | Idle shutdown                               |

Tips

  • If you use a TPU with a Vertex AI Workbench, both the VM (notebook) and the TPU must be in the same region. In Europe, the only zone that has both Cloud TPU and notebooks is europe-west1-b.
  • As usual with notebooks, cells must be idempotent, so make sure to initialize the variables at the beginning of each cell.
  • There is no direct integration with Google Cloud Source Repositories (CSR). If you need to clone from CSR, first authenticate with your account and then clone the repository:
gcloud config set account <YOUR_ACCOUNT>
gcloud auth login
gcloud config set project <YOUR_PROJECT> # get token from browser
gcloud source repos clone <YOUR_REPO_IN_CSR> --project=<YOUR_PROJECT>

After cloning, you can see the git submenus are enabled in the UI, since JupyterLab detects it is a git directory.
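The earlier tip about idempotent cells can be sketched with a minimal, hypothetical cell: because the state is re-initialized at the top, running the cell any number of times produces the same result.

```python
# Idempotent cell: `totals` is re-initialized on every run,
# so executing the cell twice gives the same output as once.
totals = []
for x in range(3):
    totals.append(x * 2)
print(totals)  # [0, 2, 4] on every run
```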

Keyboard shortcuts

  • Autocomplete works inside cells. If you import cifar10 and then type cif and press Tab, it autocompletes.
  • Notebook command mode: a adds a cell above; b adds a cell below; x removes the current cell
  • CTRL+SHIFT+-: split current cell into two at the cursor; SHIFT+M: merge selected cells (the opposite)
  • Terminal commands can be run from any cell by prefixing them with !. Examples:
!ls
!pwd
!nvidia-smi
  • To reset all runtimes: Menu Runtime->reset all runtimes
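The ! shell escape above is an IPython feature, not plain Python. As a minimal sketch, the same effect can be obtained outside a notebook with the standard subprocess module:

```python
import subprocess

# Run a shell command and capture its output as text,
# similar to `!ls` in a notebook cell.
result = subprocess.run(["ls", "-1"], capture_output=True, text=True)
print(result.stdout)
```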

Installing TensorFlow on an R instance on AI Platform notebooks

Create new R instance on AI Platform Notebooks (R3.6)

Create new R notebook and install TensorFlow:

First, install the tensorflow R package:

install.packages("tensorflow")
Installing package into ‘/home/jupyter/.R/library’
(as ‘lib’ is unspecified)

Install TensorFlow and create an r-reticulate environment, in case it does not exist:

library(tensorflow)
install_tensorflow(method = 'conda', envname = 'r-reticulate')

Optionally, for an older version:

install_tensorflow(version="1.4", method = 'conda', envname = 'r-reticulate')

Finally, check the TensorFlow installation:

use_condaenv("r-reticulate")
tf$constant("Hello TensorFlow")
tf.Tensor(b'Hello TensorFlow', shape=(), dtype=string)

Additionally, you may want to install the cloudml package for training and prediction on AI Platform:

install.packages("cloudml")
library(cloudml)
gcloud_install()

Check package versions

Use these commands:

conda info
conda info --envs
conda activate r-reticulate
conda list

Submitting Dataproc jobs from R

R is one of the best languages for data scientists: it’s simple to use and has lots of capabilities. But its biggest limitation is the amount of data it can handle, often restricted to the memory of a single node.

This limitation can be solved with cloud tools such as Dataproc (Spark), Vertex AI for ML and many others, as well as with on-premises Spark.

SparkR provides an interface from R to Spark. It supports data cleaning operations and other tools, but is limited in the ML algorithms it offers.

Example 1: using SparkR with Dataproc:

r_file <- "/home/rafaelsf80/r-dataproc-tensorflow/rstudio-server/shakespeare.R"
CLUSTER <- "caip-telefonica-demo"
REGION <- "europe-west1"
command = paste0("gcloud dataproc jobs submit spark-r ", r_file,
                 " --cluster=", CLUSTER,
                 " --region=", REGION
)
print(command)
system(command, intern = TRUE)

Source SparkR code:

# Load SparkR library into your R session
library(SparkR)

# Initialize SparkSession
sparkR.session(sparkPackages = "org.apache.spark:spark-avro_2.12:3.0.1", appName = "SparkR-Shakespeare-wordcount-example")

file <- "https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt"

sentences<-scan(file,"character",sep="\n");

#remove periods and commas
sentences<-gsub("\\.","",sentences)
sentences<-gsub("\\,","",sentences)

#split sentences into a list of words delimited by spaces
words<-strsplit(sentences," ")

#determine frequencies and show final output
words.freq<-table(unlist(words));
cbind.data.frame(names(words.freq),as.integer(words.freq))

sparkR.session.stop()

Example 2: Spark job in Dataproc from R (note source code is SparkPi):

command = paste0("gcloud dataproc jobs submit spark ", 
                  " --cluster=", CLUSTER,
                  " --region=", REGION,
                  " --class org.apache.spark.examples.SparkPi", 
                  " --jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000");
print(command)
 
system(command, intern = TRUE)

Running Data Studio server on Dataproc

Refer to installation instructions here

Connection is through an SSH SOCKS tunnel
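As a sketch following the standard Dataproc SSH tunnel pattern (CLUSTER and ZONE are placeholders for your cluster name and zone), the tunnel can be opened against the cluster’s master node:

```shell
# Open a SOCKS proxy on local port 1080 through the cluster's master node
# (the master hostname is the cluster name with an "-m" suffix).
gcloud compute ssh CLUSTER-m \
    --zone=ZONE \
    -- -D 1080 -N
```

Then point a browser at the proxy, for example Chrome launched with --proxy-server="socks5://localhost:1080".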

R in Dataproc

Dataproc supports R in two ways:

  • sparklyr package, developed by the RStudio team
  • SparkR package, developed by the Spark team, built in and integrated with the gcloud CLI tool

A comparison between SparkR and sparklyr is available here