Paper Title
PoseScript: Linking 3D Human Poses and Natural Language
Paper Authors
Paper Abstract
Natural language plays a critical role in many computer vision applications, such as image captioning, visual question answering, and cross-modal retrieval, by providing fine-grained semantic information. Unfortunately, while human pose is key to human understanding, current 3D human pose datasets lack detailed language descriptions. To address this issue, we introduce the PoseScript dataset. This dataset pairs more than six thousand 3D human poses from AMASS with rich human-annotated descriptions of the body parts and their spatial relationships. Additionally, to increase the size of the dataset to a scale compatible with data-hungry learning algorithms, we propose an elaborate captioning process that generates automatic synthetic descriptions in natural language from given 3D keypoints. This process extracts low-level pose information, known as "posecodes", using a set of simple but generic rules on the 3D keypoints. These posecodes are then combined into higher-level textual descriptions using syntactic rules. With automatic annotations, the amount of available data significantly scales up (100k), making it possible to effectively pretrain deep models for finetuning on human captions. To showcase the potential of annotated poses, we present three multi-modal learning tasks that utilize the PoseScript dataset. First, we develop a pipeline that maps 3D poses and textual descriptions into a joint embedding space, allowing for cross-modal retrieval of relevant poses from large-scale datasets. Second, we establish a baseline for a text-conditioned model that generates 3D poses. Third, we present a learned process for generating pose descriptions. These applications demonstrate the versatility and usefulness of annotated poses in various tasks and pave the way for future research in the field.
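The captioning process summarized above first turns 3D keypoints into categorical "posecodes" before assembling them into sentences with syntactic rules. The snippet below is a minimal sketch of what extracting a single such posecode could look like; the joint indices, angle thresholds, and category names are hypothetical illustrations and are not taken from the PoseScript implementation.

```python
import numpy as np

# Illustrative sketch of the "posecode" idea: categorize one low-level pose
# attribute (how bent the left knee is) from 3D keypoints with a simple
# geometric rule. Joint indices, thresholds, and phrasing are assumptions.
LEFT_HIP, LEFT_KNEE, LEFT_ANKLE = 1, 4, 7  # hypothetical joint indices

def joint_angle(a, b, c):
    """Return the angle (degrees) at joint b formed by segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def left_knee_posecode(keypoints):
    """Map the left-knee angle of an (N, 3) keypoint array to a coarse category."""
    angle = joint_angle(keypoints[LEFT_HIP], keypoints[LEFT_KNEE], keypoints[LEFT_ANKLE])
    if angle < 80:
        return "the left knee is completely bent"
    elif angle < 140:
        return "the left knee is partially bent"
    return "the left leg is straight"

# Example call with a random pose, just to show the expected input shape.
print(left_knee_posecode(np.random.randn(22, 3)))
```

In the actual pipeline, many such categorical attributes (joint angles, relative positions of body parts, and so on) would be selected, aggregated, and then verbalized through syntactic rules to form the final synthetic description.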