arXiv 2412.17847
Bridging the Data Provenance Gap Across Text, Speech and Video
By Shayne Longpre, Nikhil Singh, et al.
Published 2024-12-19
Progress in AI is driven largely by the scale and quality of training data. Despite this, there is a deficit of empirical analysis examining the attributes of well-established datasets beyond text. In this work, we conduct the largest and first-of-its-kind longitudinal audit across modalities (popular text, speech, and video datasets), from their detailed sourcing trends and use restrictions to their geographical and…