Vertex AI Workbench Comparison

Vertex AI
Vertex AI Workbench and comparison with similar solutions, like Colab Enterprise
Author

Rafa Sanchez

Published

September 7, 2023

This post describes Vertex AI Workbench. First of all, some terminology:

Cloud-based notebook offering

  • Vertex AI Workbench: the Vertex AI notebook offering, composed of Workbench Instances and the older Managed and User-managed notebooks.
  • Colab Enterprise: managed and collaborative service that combines the ease-of-use of Google’s Colab notebooks with enterprise-level security and compliance support capabilities of Google Cloud. Colab Enterprise also powers a notebook experience for BigQuery Studio.
  • Cloud Datalab: old, deprecated Google Cloud managed notebook product (non-Jupyter), enterprise-grade, built on top of a virtual machine.
  • Colaboratory: based on Jupyter, it’s Google’s offering targeted at research, a product meant to help teach and disseminate ML education. It provides a free-of-charge environment, including free GPUs and TPUs, and is integrated with Google Drive (you can “mount” Drive on Colaboratory).
  • Colaboratory Pro: Google offering announced in February 2020; it’s a paid product (USD 9.99/month) which provides faster GPUs (T4, P100), high-memory VMs and longer runtimes (previously, runtimes were reset every 12 hours).
  • Apache Zeppelin: web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
  • AWS EMR notebook: part of the EMR (Elastic MapReduce) platform, it’s a managed Jupyter notebook integrated with Spark workloads and other services like Amazon S3 (see the 2018 announcement). It runs Jupyter Notebook version 5.7.0 and Python 3.6.5. EMR Notebooks is pre-configured with the following kernels and library packages installed: PySpark, PySpark3, Python3, Spark and SparkR.
  • AWS Sagemaker Studio notebook: different from the EMR notebook, and recently announced (2019 announcement). It’s integrated with ML tools, but still in early stages.
  • Azure ML notebooks: free hosted service to develop and run Jupyter notebooks in the cloud with no installation. Docs here.

Colab vs Colab Pro vs Vertex AI Workbench Instances

|                           | Colab      | Colab Pro                  | Workbench Instances                      |
|---------------------------|------------|----------------------------|------------------------------------------|
| GPUs                      | Mostly K80 | K80, P100, T4              | K80, P100, T4, V100, A100                |
| CPUs                      | 2 vCPU     | 2 vCPU                     | Up to 96 vCPU                            |
| Session length            | 12 hours   | Unlimited                  | Unlimited                                |
| Storage                   | 5 GB       | Up to 1 TB                 | Unlimited                                |
| Guaranteed resources      | No         | No                         | Yes                                      |
| Enterprise support        | No         | No                         | Yes                                      |
| Remote SSH access         | No         | No                         | Yes                                      |
| Scheduler                 | No         | No                         | Yes                                      |
| Security                  | IAM        | IAM                        | IAM, VPC-SC, CMEK, AxT, VPC peering, DRZ |
| Price                     | Free       | USD 8 to 49 per user/month | Pay-per-use (GCP)                        |
| Dataproc/Dataflow engines | No         | No                         | Yes                                      |
| Region                    | All        | 7 countries only           | All                                      |

Colab Enterprise vs Vertex AI Workbench Instances

|             | Colab Enterprise                            | Workbench Instances                         |
|-------------|---------------------------------------------|---------------------------------------------|
| Environment | Zero config, serverless, collaborative      | JupyterLab (laptop replacement)             |
| Add-ons     | Assistive                                   | JupyterLab extensions                       |
| Projects    | Reading/running notebooks                   | Projects with multiple notebooks, files     |
| Integration | BigQuery, Spark (Dataproc serverless Spark) | BigQuery, Spark (Dataproc serverless Spark) |
| Others      | BigQuery Studio                             | Idle shutdown                               |

Tips

  • If you use a TPU with a Vertex AI Workbench, both the VM (notebook) and the TPU must be in the same region. In Europe, the only zone that has both Cloud TPU and notebooks is europe-west1-b.
  • As usual with notebooks, cells must be idempotent, so make sure to initialize the variables at the beginning of each cell.
  • There is no direct integration with Google Cloud Source Repositories (CSR). If you need to clone from CSR, first authenticate with your account and then clone the repository:
gcloud config set account <YOUR_ACCOUNT>
gcloud auth login
gcloud config set project <YOUR_PROJECT> # get token from browser
gcloud source repos clone <YOUR_REPO_IN_CSR> --project=<YOUR_PROJECT>

After cloning, you can see the git submenus are enabled in the UI, since JupyterLab detects it is a git directory.
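The earlier tip about idempotent cells can be sketched with a minimal, hypothetical cell: because the state is re-initialized at the top, running the cell any number of times produces the same result.

```python
# Idempotent cell: `totals` is re-initialized on every run,
# so executing the cell twice gives the same output as once.
totals = []
for x in range(3):
    totals.append(x * 2)
print(totals)  # [0, 2, 4] on every run
```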

Keyboard shortcuts

  • Autocomplete works inside cells. If you import cifar10 and then type cif and press Tab, it autocompletes.
  • Notebook command mode: a adds a cell above; b adds a cell below; x removes the current cell
  • CTRL+SHIFT+-: split current cell into two at the cursor; SHIFT+M: merge selected cells (the opposite)
  • Terminal commands can be run from any cell by prefixing them with !. Examples:
!ls
!pwd
!nvidia-smi
  • To reset all runtimes: Menu Runtime->reset all runtimes
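The ! shell escape above is an IPython feature, not plain Python. As a minimal sketch, the same effect can be obtained outside a notebook with the standard subprocess module:

```python
import subprocess

# Run a shell command and capture its output as text,
# similar to `!ls` in a notebook cell.
result = subprocess.run(["ls", "-1"], capture_output=True, text=True)
print(result.stdout)
```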

Installing TensorFlow on an R instance on AI Platform notebooks

Create new R instance on AI Platform Notebooks (R3.6)

Create new R notebook and install TensorFlow:

First, install the tensorflow R package:

install.packages("tensorflow")
Installing package into ‘/home/jupyter/.R/library’
(as ‘lib’ is unspecified)

Install TensorFlow and create an r-reticulate environment, in case it does not exist:

library(tensorflow)
install_tensorflow(method = 'conda', envname = 'r-reticulate')

Optionally, for an older version:

install_tensorflow(version="1.4", method = 'conda', envname = 'r-reticulate')

Finally, check the TensorFlow installation:

use_condaenv("r-reticulate")
tf$constant("Hello TensorFlow")
tf.Tensor(b'Hello TensorFlow', shape=(), dtype=string)

Additionally, you may want to install the cloudml package for training and prediction on AI Platform:

install.packages("cloudml")
library(cloudml)
gcloud_install()

Check package versions

Use these commands:

conda info
conda info --envs
conda activate r-reticulate
conda list

Submitting Dataproc jobs from R

R is one of the best languages for data scientists: it’s simple to use and has lots of capabilities. But its biggest limitation is the amount of data it can handle, often restricted to the memory of a single node.

This limitation can be solved with cloud tools such as Dataproc (Spark), Vertex AI for ML and many others, as well as with on-premises Spark.

SparkR provides an interface from R to Spark. It supports data cleaning operations and other tools, but is limited in the ML algorithms it offers.

Example 1: using SparkR with Dataproc:

r_file <- "/home/rafaelsf80/r-dataproc-tensorflow/rstudio-server/shakespeare.R"
CLUSTER <- "caip-telefonica-demo"
REGION <- "europe-west1"
command = paste0("gcloud dataproc jobs submit spark-r ", r_file,
                 " --cluster=", CLUSTER,
                 " --region=", REGION
)
print(command)
system(command, intern = TRUE)

Source SparkR code:

# Load SparkR library into your R session
library(SparkR)

# Initialize SparkSession
sparkR.session(sparkPackages = "org.apache.spark:spark-avro_2.12:3.0.1", appName = "SparkR-Shakespeare-wordcount-example")

file <- "https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt"

sentences<-scan(file,"character",sep="\n");

#remove periods and commas
sentences<-gsub("\\.","",sentences)
sentences<-gsub("\\,","",sentences)

#split sentences into a list of words delimited by spaces
words<-strsplit(sentences," ")

#determine frequencies and show final output
words.freq<-table(unlist(words));
cbind.data.frame(names(words.freq),as.integer(words.freq))

sparkR.session.stop()

Example 2: Spark job in Dataproc from R (note source code is SparkPi):

command = paste0("gcloud dataproc jobs submit spark ", 
                  " --cluster=", CLUSTER,
                  " --region=", REGION,
                  " --class org.apache.spark.examples.SparkPi", 
                  " --jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000");
print(command)
 
system(command, intern = TRUE)

Running Data Studio server on Dataproc

Refer to installation instructions here

Connection is through an SSH SOCKS tunnel
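As a sketch following the standard Dataproc SSH tunnel pattern (CLUSTER and ZONE are placeholders for your cluster name and zone), the tunnel can be opened against the cluster’s master node:

```shell
# Open a SOCKS proxy on local port 1080 through the cluster's master node
# (the master hostname is the cluster name with an "-m" suffix).
gcloud compute ssh CLUSTER-m \
    --zone=ZONE \
    -- -D 1080 -N
```

Then point a browser at the proxy, for example Chrome launched with --proxy-server="socks5://localhost:1080".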

R in Dataproc

Dataproc supports R in two ways:

  • sparklyr package, developed by the RStudio team
  • SparkR package, developed by the Spark team, built in and integrated with the gcloud CLI tool

A comparison between SparkR and sparklyr is available here