arXiv 2310.16787
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
By Shayne Longpre, Robert Mahari, et al.
Published 2023-10-25
The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We develop tools and standards to…