arXiv 2109.01611

Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning

By Seungbeom Choi, Sunho Lee, et al.

Published 2021-09-01

As machine learning techniques are applied to a widening range of applications, high-throughput machine learning (ML) inference servers have become critical for online service applications. Such ML inference servers pose two challenges: first, they must provide a bounded latency for each request to support a consistent service-level objective (SLO), and second, they can serve multiple heterogeneous ML models in a syst…