arXiv 2412.17847

Bridging the Data Provenance Gap Across Text, Speech and Video

By Shayne Longpre, Nikhil Singh, et al.

Published 2024-12-19

Progress in AI is driven largely by the scale and quality of training data. Despite this, there is a deficit of empirical analysis examining the attributes of well-established datasets beyond text. In this work, we conduct the largest and first-of-its-kind longitudinal audit across modalities (popular text, speech, and video datasets), from their detailed sourcing trends and use restrictions to their geographical and…
