Paper Title
Attr2Style: A Transfer Learning Approach for Inferring Fashion Styles via Apparel Attributes
Authors
Abstract
Popular fashion e-commerce platforms mostly provide details about low-level attributes of an apparel item (e.g., neck type, dress length, collar type) on their product detail pages. However, customers usually prefer to buy apparel based on its style information or, simply put, the occasion (e.g., party/sports/casual wear). The application of a supervised image-captioning model to generate style-based image captions is limited because obtaining ground-truth annotations in the form of style-based captions is difficult: annotating style-based captions requires a certain amount of fashion domain expertise, and it also adds to the cost and manual effort. On the contrary, low-level attribute-based annotations are much more easily available. To address this issue, we propose a transfer-learning-based image-captioning model that is trained on a source dataset with sufficient attribute-based ground-truth captions and is then used to predict style-based captions on a target dataset, which has only a limited number of images with style-based ground-truth captions. The main motivation for our approach comes from the fact that there are most often correlations between the low-level attributes and the higher-level styles of an apparel item. We leverage this fact and train our model in an encoder-decoder framework with an attention mechanism. In particular, the encoder of the model is first trained on the source dataset to obtain latent representations capturing the low-level attributes. The trained model is then fine-tuned to generate style-based captions for the target dataset. To highlight the effectiveness of our method, we demonstrate qualitatively and quantitatively that the captions generated by our approach are close to the actual style information of the evaluated apparel. A proof of concept of our model is under pilot at Myntra, where it is exposed to some internal users for feedback.
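The abstract does not give implementation details, so the following is only a minimal sketch of the two-stage setup it describes (pretrain an attention-based encoder-decoder captioner on attribute captions, then fine-tune on a small set of style captions). It assumes a PyTorch-style CNN encoder and an LSTM decoder with additive attention; the backbone, dimensions, and all names (Encoder, AttnDecoder, train_step) are hypothetical, not the authors' code.

```python
# Illustrative sketch of the two-stage transfer learning in the abstract.
# NOT the authors' implementation; architecture and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """CNN encoder; outputs a grid of region features for attention."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in for a ResNet trunk
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(7),            # 7x7 = 49 attention regions
            nn.Conv2d(64, feat_dim, 1),
        )

    def forward(self, images):                  # (B, 3, H, W)
        f = self.backbone(images)               # (B, feat_dim, 7, 7)
        return f.flatten(2).transpose(1, 2)     # (B, 49, feat_dim)

class AttnDecoder(nn.Module):
    """LSTM decoder with additive attention over encoder regions."""
    def __init__(self, vocab_size, feat_dim=512, emb=256, hid=512):
        super().__init__()
        self.hid = hid
        self.embed = nn.Embedding(vocab_size, emb)
        self.attn = nn.Linear(feat_dim + hid, 1)
        self.cell = nn.LSTMCell(emb + feat_dim, hid)
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, feats, captions):         # feats: (B, 49, feat_dim)
        B, T = captions.shape
        h = feats.new_zeros(B, self.hid)
        c = feats.new_zeros(B, self.hid)
        logits = []
        for t in range(T):
            h_rep = h.unsqueeze(1).expand(-1, feats.size(1), -1)
            w = self.attn(torch.cat([feats, h_rep], -1)).softmax(1)
            ctx = (w * feats).sum(1)             # attended image context
            step_in = torch.cat([self.embed(captions[:, t]), ctx], -1)
            h, c = self.cell(step_in, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, 1)            # (B, T, vocab)

def train_step(enc, dec, images, captions, opt):
    """One teacher-forced step; captions include <bos> ... <eos>."""
    logits = dec(enc(images), captions[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           captions[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Stage 1: train encoder + decoder on the large source set of
# (image, attribute-based caption) pairs so the encoder's latent features
# capture low-level attributes (neck type, dress length, collar, ...).
# Stage 2: keep the pretrained encoder (optionally frozen or with a lower
# learning rate) and fine-tune on the limited target set of
# (image, style-based caption) pairs, exploiting the correlation between
# low-level attributes and higher-level styles.
```

The point of the sketch is the reuse of the encoder across stages: because attribute-predictive features correlate with style, the fine-tuning stage can get by with far fewer style-annotated images than training from scratch would require.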