arXiv 2109.01611
Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning
By Seungbeom Choi, Sunho Lee, et al.
Published 2021-09-01
As machine learning techniques are applied to a widening range of applications, high-throughput machine learning (ML) inference servers have become critical for online service applications. Such ML inference servers pose two challenges: first, they must provide a bounded latency for each request to support a consistent service-level objective (SLO), and second, they can serve multiple heterogeneous ML models in a syst…