arXiv 2407.14933
Consent in Crisis: The Rapid Decline of the AI Data Commons
By Shayne Longpre, Robert Mahari, et al.
Published 2024-07-20
General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how codified data use pre…