arXiv 2407.14933

Consent in Crisis: The Rapid Decline of the AI Data Commons

By Shayne Longpre, Robert Mahari, et al.

Published 2024-07-20

General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data, assembled into corpora such as C4, RefinedWeb, and Dolma. To our knowledge, we conduct the first large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora. Our audit of 14,000 web domains provides an expansive view of crawlable web data and how codified data use pre…
