Learning deep semantic embeddings for cross-modal retrieval

Cuicui Kang, Shengcai Liao, Zhen Li, Zigang Cao, Gang Xiong

Research output: Contribution to journal › Conference article › peer-review

2 Citations (Scopus)

Abstract

Deep learning methods have been actively researched for cross-modal retrieval, with the softmax cross-entropy loss commonly applied for supervised learning. However, the softmax cross-entropy loss is known to produce large intra-class variances, which makes it poorly suited for cross-modal matching. In this paper, a deep architecture called Deep Semantic Embedding (DSE) is proposed, which is trained end to end for image-text cross-modal retrieval. With images and texts mapped into a shared feature embedding space, class labels are used to guide the embedding learning, so that the embedding space carries a semantic meaning common to both images and texts. In this way, the difference between modalities is eliminated. Under this framework, the center loss is introduced alongside the commonly used softmax cross-entropy loss to achieve both inter-class separation and intra-class compactness. In addition, a distance-based softmax cross-entropy loss is proposed to jointly consider the softmax cross-entropy and center losses in fully gradient-based learning. Experiments on three popular image-text cross-modal retrieval databases show that the proposed algorithms achieve the best overall performance.
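To make the loss combination described in the abstract concrete, below is a minimal illustrative PyTorch sketch, not the authors' released code: the class names, the loss weight (0.01), and the formulation of the distance-based softmax as cross-entropy over negative squared distances to learnable class centers are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Center loss: pulls each embedding toward its class center, encouraging
    intra-class compactness. Centers are learnable parameters updated by
    gradient descent here (an assumption; other update rules exist)."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):
        # Mean squared Euclidean distance to each sample's own class center.
        return ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()

class DistanceSoftmaxLoss(nn.Module):
    """Illustrative distance-based softmax cross-entropy: class scores are
    negative squared distances to class centers, so minimizing the
    cross-entropy jointly encourages small intra-class distances and
    inter-class separation in one gradient-based objective."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):
        # Negative squared distances act as logits for cross-entropy.
        dists = torch.cdist(feats, self.centers) ** 2  # (batch, num_classes)
        return F.cross_entropy(-dists, labels)

if __name__ == "__main__":
    batch, feat_dim, num_classes = 8, 128, 10
    img_emb = torch.randn(batch, feat_dim)  # stand-in for an image encoder output
    txt_emb = torch.randn(batch, feat_dim)  # stand-in for a text encoder output
    labels = torch.randint(0, num_classes, (batch,))

    center_loss = CenterLoss(num_classes, feat_dim)
    dist_softmax = DistanceSoftmaxLoss(num_classes, feat_dim)

    # Supervising both modalities with the same label-driven losses ties
    # them to a single semantic embedding space.
    loss = (dist_softmax(img_emb, labels) + dist_softmax(txt_emb, labels)
            + 0.01 * (center_loss(img_emb, labels) + center_loss(txt_emb, labels)))
    loss.backward()
```

Because both encoders are supervised against the same class centers, retrieval across modalities can then be done by nearest-neighbor search in the shared embedding space.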

Original language: English
Pages (from-to): 471-486
Number of pages: 16
Journal: Journal of Machine Learning Research
Volume: 77
Publication status: Published - 2017
Externally published: Yes
Event: 9th Asian Conference on Machine Learning, ACML 2017 - Seoul, Korea, Republic of
Duration: Nov 15, 2017 to Nov 17, 2017

Keywords

  • Cross-Modal Retrieval
  • Deep Learning
  • Semantic Embedding Learning

ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence
