Title

Learning Structured Representations of Visual Scenes

Author

Chiou, Meng-Jiun

Abstract

As the intermediate-level representations bridging the two levels, structured representations of visual scenes, such as visual relationships between pairwise objects, have been shown not only to benefit compositional models in learning to reason along with the structures, but also to provide higher interpretability for model decisions. Nevertheless, these representations have received much less attention than traditional recognition tasks, leaving numerous open challenges unsolved. In this thesis, we study how machines can describe the content of an individual image or video with visual relationships as the structured representation. Specifically, we explore how structured representations of visual scenes can be effectively constructed and learned in both the static-image and video settings, with improvements resulting from external knowledge incorporation, bias-reducing mechanisms, and enhanced representation models. At the end of this thesis, we also discuss some open challenges and limitations to shed light on future directions of structured representation learning for visual scenes.
