arXiv 2109.01611

Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning

By Seungbeom Choi, Sunho Lee, et al.

Published 2021-09-01

As machine learning techniques are applied to a widening range of applications, high-throughput machine learning (ML) inference servers have become critical for online service applications. Such ML inference servers pose two challenges: first, they must provide a bounded latency for each request to support a consistent service-level objective (SLO), and second, they can serve multiple heterogeneous ML models in a syst…
