arXiv 2508.03153

Estimating Worst-Case Frontier Risks of Open-Weight LLMs

By Eric Wallace, Olivia Watkins, et al.

Published 2025-08-05

Wiki summary


In this paper, we study the worst-case frontier risks of releasing gpt-oss. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as capable as possible in two domains: biology and cybersecurity. To maximize biological risk (biorisk), we curate tasks related to threat creation and train gpt-oss in an RL environment with web browsing. To maximize cybers…
