The Spotify Podcast Dataset

Paper Title

The Spotify Podcast Dataset

Authors

Ann Clifton, Aasish Pappu, Sravana Reddy, Yongze Yu, Jussi Karlgren, Ben Carterette, Rosie Jones

Abstract

Podcasts are a relatively new form of audio media. Episodes appear on a regular cadence, and come in many different formats and levels of formality. They can be formal news journalism or conversational chat; fiction or non-fiction. They are rapidly growing in popularity and yet have been relatively little studied. As an audio format, podcasts are more varied in style and production types than, say, broadcast news, and contain many more genres than typically studied in video research. The medium is therefore a rich domain with many research avenues for the IR and NLP communities. We present the Spotify Podcast Dataset, a set of approximately 100K podcast episodes comprised of raw audio files along with accompanying ASR transcripts. This represents over 47,000 hours of transcribed audio, and is an order of magnitude larger than previous speech-to-text corpora.
