arXiv:2310.16787
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
By Shayne Longpre, Robert Mahari, et al.
Published 2023-10-25
The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We develop tools and standards to…