Paper title
The Spotify Podcast Dataset
Paper authors
Paper abstract
Podcasts are a relatively new form of audio media. Episodes appear on a regular cadence and come in many different formats and levels of formality. They can be formal news journalism or conversational chat, fiction or non-fiction. They are rapidly growing in popularity, yet they have been relatively little studied. As an audio format, podcasts are more varied in style and production type than, say, broadcast news, and they span many more genres than are typically studied in video research. The medium is therefore a rich domain with many research avenues for the IR and NLP communities. We present the Spotify Podcast Dataset, a set of approximately 100K podcast episodes consisting of raw audio files along with accompanying ASR transcripts. This represents over 47,000 hours of transcribed audio and is an order of magnitude larger than previous speech-to-text corpora.