arXiv 2310.16787

The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI

By Shayne Longpre, Robert Mahari, et al.

Published 2023-10-25

The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We develop tools and standards to…
