Vertex AI Workbench Comparison
This post describes Vertex AI Workbench. First of all, some terminology:
- ipython: an interactive computational environment in which you can combine code, rich text, mathematics and more. IPython continues to exist today as a Python shell and kernel; the notebook interface of IPython is now part of Jupyter.
- Jupyter: (name composed from JUlia, PYThon and R) the next-generation UI of IPython notebooks. Jupyter notebooks were formerly known as IPython notebooks (.ipynb).
Cloud-based notebook offering
- Vertex AI Workbench: Vertex AI's notebook offering, composed of Vertex AI Workbench Instances and the older Managed and User-managed notebooks.
- Colab Enterprise: a managed, collaborative service that combines the ease of use of Google's Colab notebooks with the enterprise-level security and compliance capabilities of Google Cloud. Colab Enterprise also powers the notebook experience in BigQuery Studio.
- Cloud Datalab: the old, deprecated Google Cloud managed notebook product, enterprise-grade and built on top of a virtual machine.
- Colaboratory: based on Jupyter, it's Google's offering targeted at research, built to help disseminate ML education. It provides a free-of-charge environment, including free GPUs and TPUs, and is integrated with Google Drive (you can "mount" Drive on Colaboratory).
- Colaboratory Pro: a Google offering announced in February 2020, it's a paid product (USD 9.99/month) that provides faster GPUs (T4, P100), high-memory VMs and longer runtimes (previously, runtimes were reset every 12 hours).
- Apache Zeppelin: a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
- AWS EMR notebooks: part of the EMR (Elastic MapReduce) platform, a managed Jupyter notebook integrated with Spark workloads and other services like Amazon S3 (see the 2018 announcement). It runs Jupyter Notebook 5.7.0 and Python 3.6.5, and comes pre-configured with the following kernels and library packages: PySpark, PySpark3, Python3, Spark and SparkR.
- AWS SageMaker Studio notebooks: different from EMR notebooks and more recently announced (2019 announcement). Integrated with ML tools, but still in early stages.
- Azure ML notebooks: a free hosted service to develop and run Jupyter notebooks in the cloud with no installation. Docs here.
Colab vs Colab pro vs Vertex AI Workbench Instances
| | Colab | Colab Pro | Workbench Instances |
|---|---|---|---|
| GPUs | Mostly K80 | K80, P100, T4 | K80, P100, T4, V100, A100 |
| CPUs | 2 vCPU | 2 vCPU | Up to 96 vCPU |
| Session length | 12 hours | Unlimited | Unlimited |
| Storage | 5 GB | Up to 1 TB | Unlimited |
| Guaranteed resources | NO | NO | YES |
| Enterprise support | NO | NO | YES |
| Remote SSH access | NO | NO | YES |
| Scheduler | NO | NO | YES |
| Security | IAM | IAM | IAM, VPC-SC, CMEK, AxT, VPC-peering, DRZ |
| Price | Free | USD 8 to 49 per user per month | Pay-per-use (GCP) |
| Dataproc/Dataflow engines | NO | NO | YES |
| Region | All | 7 countries only | All |
Colab Enterprise vs Vertex AI Workbench Instances
| | Colab Enterprise | Workbench Instances |
|---|---|---|
| Environment | Zero config, serverless, collaborative | JupyterLab (laptop replacement) |
| Add-ons | Assistive | JupyterLab extensions |
| Projects | Reading/running notebooks | Projects with multiple notebooks, files |
| Integration | BigQuery, Spark (Dataproc serverless Spark) | BigQuery, Spark (Dataproc serverless Spark) |
| Others | BigQuery Studio | Idle shutdown |
Tips
- If you use a TPU with a Vertex AI Workbench notebook, both the VM (notebook) and the TPU must be in the same region. In Europe, the only zone that has both Cloud TPU and notebooks is europe-west1-b.
- As usual with notebooks, cells must be idempotent, so make sure to initialize variables at the beginning of each cell.
- There is no direct integration with Google Cloud Source Repositories (CSR). If you need to clone from CSR, first authenticate with your account and then clone the repository:

```shell
gcloud config set account <YOUR_ACCOUNT>
gcloud auth login
gcloud config set project <YOUR_PROJECT> # get token from browser
gcloud source repos clone <YOUR_REPO_IN_CSR> --project=<YOUR_PROJECT>
```
After cloning, you will see that the git submenus are enabled in the UI, since it detects a git directory.
Keyboard shortcuts
- Autocomplete inside cells works: if you import `cifar10` and then write `cif` and press Tab, autocomplete kicks in.
- Notebook edit mode: `a` adds a cell above; `b` adds a cell below; `x` removes the cell.
- `CTRL+SHIFT+M`: split the current cell into two; `SHIFT+M`: merge cells (the opposite).
- Terminal commands can be run from any cell by prefixing them with `!`. Examples:

```shell
!ls
!nvidia-smi
```

- To reset all runtimes: menu `Runtime` -> `Reset all runtimes`.
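The idempotency tip above can be sketched in Python (the variable names are illustrative, not from any original notebook):

```python
# Idempotent cell: re-initialize any state the cell mutates at its top,
# so re-running the cell gives the same result as running it once.
results = []                 # reset on every run; never rely on a previous run
for i in range(3):
    results.append(i * i)

print(results)               # -> [0, 1, 4], however many times the cell is re-run
```

If `results` were instead initialized in an earlier cell, re-running this cell would keep appending and the output would depend on how many times it had run before.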
Installing TensorFlow on an R instance on AI Platform notebooks
Create a new R instance on AI Platform Notebooks (R 3.6).
Create a new R notebook and install TensorFlow.
First, install the TensorFlow R package:

```r
install.packages("tensorflow")
# Installing package into ‘/home/jupyter/.R/library’
# (as ‘lib’ is unspecified)
```
Install TensorFlow and create an r-reticulate environment, in case it does not exist:

```r
library(tensorflow)
install_tensorflow(method = 'conda', envname = 'r-reticulate')
```
Optionally, for an older version:

```r
install_tensorflow(version = "1.4", method = 'conda', envname = 'r-reticulate')
```
Finally, check the TensorFlow installation:

```r
use_condaenv("r-reticulate")
tf$constant("Hello TensorFlow")
# tf.Tensor(b'Hello TensorFlow', shape=(), dtype=string)
```
Additionally, you may want to install the cloudml package for training and prediction on AI Platform:

```r
install.packages("cloudml")
library(cloudml)
gcloud_install()
```
Check package versions
Use these commands:

```shell
conda info
conda info --envs
conda activate r-reticulate
conda list
```
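Besides conda, you can also check installed package versions from inside a notebook cell; a minimal Python sketch using only the standard library (`importlib.metadata`, Python 3.8+), added here as an illustration rather than part of the original setup:

```python
# Query installed package versions without shelling out to conda.
from importlib.metadata import version, PackageNotFoundError

def pkg_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None

print(pkg_version("pip"))              # e.g. '23.0.1', depending on the environment
print(pkg_version("no-such-package"))  # None
```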
Submitting Dataproc jobs from R
R is one of the best languages for data scientists: it's simple to use and has plenty of capabilities. But its biggest limitation is the amount of data it can handle, which is often restricted to the memory of a single node.
This limitation can be solved with cloud tools such as Dataproc (Spark), Vertex AI for ML and many others, as well as with on-premises Spark.
SparkR provides an interface from R to Spark. It supports data cleaning operations and other tools, but lacks many ML algorithms.
Example 1: using SparkR with Dataproc:

```r
r_file <- "/home/rafaelsf80/r-dataproc-tensorflow/rstudio-server/shakespeare.R"
CLUSTER <- "caip-telefonica-demo"
REGION <- "europe-west1"

command <- paste0("gcloud dataproc jobs submit spark-r ", r_file,
                  " --cluster=", CLUSTER,
                  " --region=", REGION)
print(command)
system(command, intern = TRUE)
```

Source SparkR code:
```r
# Load the SparkR library into your R session
library(SparkR)

# Initialize the SparkSession
sparkR.session(sparkPackages = "org.apache.spark:spark-avro_2.12:3.0.1",
               appName = "SparkR-ML-Natality-regression-example")

file <- "https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt"
sentences <- scan(file, "character", sep = "\n")

# Remove periods and commas
sentences <- gsub("\\.", "", sentences)
sentences <- gsub("\\,", "", sentences)

# Split sentences into a list of words delimited by spaces
words <- strsplit(sentences, " ")

# Determine frequencies and show the final output
words.freq <- table(unlist(words))
cbind.data.frame(names(words.freq), as.integer(words.freq))

sparkR.session.stop()
```

Example 2: Spark job in Dataproc from R (note the source code is SparkPi):
```r
command <- paste0("gcloud dataproc jobs submit spark ",
                  " --cluster=", CLUSTER,
                  " --region=", REGION,
                  " --class org.apache.spark.examples.SparkPi",
                  " --jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000")
print(command)
system(command, intern = TRUE)
```

Running Data Studio server on Dataproc
Refer to the installation instructions here.
Connections are made through an SSH SOCKS tunnel.
R in Dataproc
Dataproc supports R in two ways:
- Sparklyr package, developed by the RStudio team
- SparkR package, developed by the Spark team, built in and integrated with the gcloud CLI tool
A comparison between SparkR and sparklyr is available here.