A review of inference servers for LLMs
Tiangolo: the combination of Gunicorn and Uvicorn
Gunicorn (Green Unicorn) is an application server that supports the WSGI standard. This means Gunicorn can serve applications written in frameworks such as Flask or Django. It works by creating and maintaining a number of worker processes that serve requests from clients. Gunicorn by itself is not compatible with FastAPI, because FastAPI uses the newer ASGI standard.
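The prefork worker model Gunicorn uses can be sketched with the standard library alone: a master process spawns N worker processes and requests are distributed among them. This is an illustrative analogy (the paths and worker count are made up), not Gunicorn's actual implementation:

```python
import os
from multiprocessing import Pool


def handle_request(path):
    # Each worker runs in its own OS process, so a slow or CPU-bound
    # request in one worker does not block the others.
    return f"worker {os.getpid()} handled {path}"


if __name__ == "__main__":
    # Two workers, analogous to `gunicorn -w 2`; hypothetical request paths.
    with Pool(processes=2) as pool:
        for line in pool.map(handle_request, ["/health", "/predict", "/predict"]):
            print(line)
```

Because the workers are separate processes, this model gives parallelism across CPU cores, at the cost of per-process memory overhead.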
Uvicorn is essentially a low-level Python HTTP server. It is a very fast ASGI server, built on uvloop and httptools. Being an ASGI server, it is compatible with FastAPI. However, its capabilities as a process (worker) manager are very limited.
Uvicorn provides a Gunicorn-compatible worker class, so the two can be combined: Gunicorn acts as the process manager, listening on the IP and port, while several Uvicorn workers handle requests, taking advantage of both concurrency and parallelism.
The pipeline is as follows: Gunicorn receives the HTTP request and hands it to one of the worker processes running the Uvicorn worker class. The worker does some processing (parsing the request into ASGI form) and forwards the message (/health, /predict, …) to the FastAPI application.
Gunicorn –> Uvicorn worker –> FastAPI
NVIDIA Triton Inference server
NVIDIA Triton Inference Server is an open-source inference server that allows deploying multiple models on the same VM and creating ensembles (pipelines/chaining of models). It also provides the fast inference times required for production use cases. Its main advantages include:
- Autoscaling
- Co-hosting of multi-framework
- Multi-GPU multi-node inference support (for LLMs)
- Low latency
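To sketch how Triton is pointed at models: each model lives in a model repository directory alongside a small config.pbtxt file. The model name, backend, and tensor shapes below are hypothetical placeholders:

```
# model_repository/my_model/config.pbtxt  (hypothetical model)
name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```

Triton then serves every model in the repository over HTTP and gRPC, e.g. at endpoints such as /v2/models/my_model/infer.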
This notebook shows how to integrate the NVIDIA Triton Server with Vertex AI to serve multiple individual models, or an ensemble of models, on a single endpoint.
This link shows deployment of Triton on GKE.
TorchServe
Pending