A review of inference servers for LLMs
Tiangolo: the combination of Gunicorn and Uvicorn
Gunicorn (Green Unicorn) is an application server that supports the WSGI standard. This means Gunicorn can serve applications written in frameworks such as Flask or Django. It works by creating and maintaining a number of worker processes that serve requests from clients. Gunicorn by itself is not compatible with FastAPI, because FastAPI uses the newer ASGI standard.
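The prefork worker model Gunicorn uses can be sketched with the standard library alone: a master process spawns N worker processes and requests are distributed among them. This is an illustrative analogy (the paths and worker count are made up), not Gunicorn's actual implementation:

```python
import os
from multiprocessing import Pool


def handle_request(path):
    # Each worker runs in its own OS process, so a slow or CPU-bound
    # request in one worker does not block the others.
    return f"worker {os.getpid()} handled {path}"


if __name__ == "__main__":
    # Two workers, analogous to `gunicorn -w 2`; hypothetical request paths.
    with Pool(processes=2) as pool:
        for line in pool.map(handle_request, ["/health", "/predict", "/predict"]):
            print(line)
```

Because the workers are separate processes, this model gives parallelism across CPU cores, at the cost of per-process memory overhead.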
Uvicorn is essentially a low-level Python HTTP server. It is a very fast ASGI server, built on uvloop and httptools. Being an ASGI server, it is compatible with FastAPI. However, its capabilities as a process (worker) manager are very limited.
Uvicorn provides a Gunicorn-compatible worker class, so the two can be combined: Gunicorn acts as the process manager, listening on the IP and port, while several Uvicorn workers handle requests, taking advantage of both concurrency and parallelism.
The pipeline is as follows: Gunicorn receives the HTTP request and hands it to one of the worker processes running the Uvicorn worker class. The worker does some processing (parsing the request into ASGI form) and forwards the message (/health, /predict, …) to the FastAPI application.
Gunicorn –> Uvicorn worker –> FastAPI
NVIDIA Triton Inference server
NVIDIA Triton Inference Server is an open-source inference server that allows deploying multiple models on the same VM and creating ensembles (pipelines/chaining of models). It also provides the fast inference times required for production use cases. Its main advantages include:
- Autoscaling
- Co-hosting of multi-framework
- Multi-GPU multi-node inference support (for LLMs)
- Low latency
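To sketch how Triton is pointed at models: each model lives in a model repository directory alongside a small config.pbtxt file. The model name, backend, and tensor shapes below are hypothetical placeholders:

```
# model_repository/my_model/config.pbtxt  (hypothetical model)
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```

Triton then serves every model in the repository over HTTP and gRPC, e.g. at endpoints such as /v2/models/my_model/infer.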
This notebook shows how to integrate the NVIDIA Triton Server with Vertex AI to serve multiple individual models, or an ensemble of models, on a single endpoint.
This link shows deployment of Triton on GKE.
TorchServe
Pending