OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan^1, Xiaodan Liang^1

Hao Wang is with the School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China, and Pengcheng Lab, Shenzhen 518000 (e-mail: wangh739@mail2.sysu.edu.cn, wanghao9610@gmail.com). Pengzhen Ren is with the School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China. Xiao Dong is with the School of Artificial Intelligence, Zhuhai Campus, Sun Yat-Sen University, Zhuhai, P.R. China, 519082. Zequn Jie, Chengjian Feng, Lin Ma, and Yinlong Qian are with Meituan Inc, China. Xiangyuan Lan, Dongmei Jiang, and Yaowei Wang are with Pengcheng Lab, Shenzhen 518000, China. Yaowei Wang is also with the School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China. Xiaodan Liang is with the School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China, and Pengcheng Lab, Shenzhen 518000 (e-mail: liangxd9@mail.sysu.edu.cn).

^1 Xiangyuan Lan and Xiaodan Liang are the corresponding authors.

Abstract

Open-vocabulary detection is a challenging task due to the requirement of detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training on diverse large-scale datasets. However, these approaches still face two primary challenges: (i) how to universally integrate diverse data sources for end-to-end training, and (ii) how to effectively leverage the language-aware capability for region-level cross-modality understanding. To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which pre-trains on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline to enable end-to-end training and eliminate noise from pseudo-label generation by unifying different data sources into detection-centric data. In addition, we propose a Language-Aware Selective Fusion (LASF) module to enable the language-aware ability of the model through a language-aware query selection and fusion process. We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmark datasets, achieving state-of-the-art results with an AP of 50.6% on the COCO dataset and 40.0% on the LVIS dataset in a zero-shot manner, demonstrating its strong generalization ability. Furthermore, the fine-tuned OV-DINO on COCO achieves 58.4% AP, outperforming many existing methods with the same backbone. The code for OV-DINO will be available at https://github.com/wanghao9610/OV-DINO.

Index Terms:

Object detection, open-vocabulary, detection transformer.

1 Introduction

Figure 1: Comparison of OV-DINO with Previous Methods. (a) Previous methods (e.g., GLIP [1], GLIPv2 [2], G-DINO [3]) are not primarily detection-centric. They first pre-train on large-scale Detection and Grounding data, then generate pseudo labels on Image-Text data, potentially introducing noise (red circle). (b) OV-DINO is a detection-centric method that integrates various data sources into a unified detection data format through a Unified Data Integration pipeline. It undergoes end-to-end pre-training via region-text alignment within a unified detection framework.

Traditional object detection methods, such as Fast R-CNN [4], Faster R-CNN [5], Mask R-CNN [6], DETR [7], and DINO [8], are typically trained on datasets with closed-set categories. This limits their ability to detect objects outside of the predefined categories, which is a significant constraint for real-world applications. To address this limitation, a new task known as Open-Vocabulary Detection (OVD) has been proposed, attracting significant attention from both academic and industrial communities. Open-vocabulary detection requires the ability to detect any object using class names, including objects that have never been encountered during training. The development of OVD can be traced back to the introduction of Zero-Shot Detection (ZSD) by Bansal et al. [9], where models are trained on a limited set of categories and evaluated on novel categories. Building upon ZSD, Zareian et al. [10] further expanded the concept to OVD by leveraging a visual semantic space derived from image-text data, thereby enhancing the capability of category generalization.

Recent studies [11, 12, 13] have catalyzed the development of open-world vision methodologies [14, 15, 16, 3, 1] that detect objects outside pre-defined categories. These methods typically pre-train on large-scale grounding datasets and then generate pseudo-labels for image-text data. This introduces two distinct challenges: (i) data noise due to the pseudo-labels introduced from diverse data sources, and (ii) a semantic gap between image regions and textual descriptions. Figure 1 illustrates this phenomenon. The first challenge is attributed to the limited vocabulary of the training data: models trained on such data generalize poorly, leading to inaccurate predictions when pseudo-labeling image-text data. Methods such as GLIP [1], GLIPv2 [2], and G-DINO [3] approach detection as a grounding task, pre-training on large-scale detection and grounding datasets and then generating pseudo-labels on image-text data. This paradigm of pre-training followed by pseudo-labeling introduces unavoidable noise. A detection-centric data unification, which integrates the various pre-training data sources into a detection-centric data format, can solve the problem. This integration is essential to prevent unnecessary noise during pseudo-label generation and to provide the model with more accurate supervision while allowing for flexible expansion of data sources. The second challenge is attributed to semantic confusion in region-level modality fusion and alignment. To fuse the region information with the text input, GLIP [1] introduces complex deep fusion in the encoder stage, and G-DINO [3] proposes a lightweight feature enhancer based on bidirectional cross-attention and extends fusion to the decoder.
In contrast, DetCLIP [17] discards most cross-modality fusion blocks except for the region-word alignment, which simplifies the model architecture but overlooks crucial category information fusion. OVD models make predictions based on both image and language text input, with the language text input serving as a query for the model and influencing the classification labels. Fusing the text embedding with the region embedding allows the representation of the text embedding to be dynamically adjusted according to the object information in the image, enabling better alignment between text and region embeddings. Therefore, it is crucial to accurately fuse and align modality information for OVD models to effectively capture image information and integrate it with the language text input.

TABLE I: Comparison of OVD methods. We compare OV-DINO with previous OVD methods in terms of method type, modality fusion, and pseudo-label generation. OV-DINO is a unified detection-centric method with LASF, eliminating the need for pseudo-label generation.

Method	Type	Modality Fusion	Pseudo-Label
GLIP[1]	Grounding	DeepFusion	Y
GLIPv2[2]	Grounding	DeepFusion	Y
G-DINO[3]	Grounding	CrossAttnFusion	Y
DetCLIP[17]	Detection	–	Y
YOLO-World[18]	Detection	RepVL-PAN	Y
OV-DINO (Ours)	Detection	LASF	N

To address both key challenges, we introduce a novel method called OV-DINO for open-vocabulary detection. For the first challenge, we propose the Unified Data Integration (UniDI) pipeline to integrate diverse data sources into a unified detection-centric data format and pre-train on large-scale datasets end-to-end, eliminating the requirement of pseudo-label generation on image-text data. UniDI treats all text inputs, including detection nouns, grounding phrases, and image captions, as categories for detection-centric unification. For the second challenge, we propose a Language-Aware Selective Fusion (LASF) module for region-level cross-modality fusion and alignment. The LASF module facilitates selective and dynamic fusion of text-related object embedding, as illustrated in Figure 2. The language-aware selection and fusion processes enable the model to make more accurate predictions based on the language text input. Moreover, we propose a simple adaptation of the supervised training procedure used in DINO [8] to standardize the training process for open-vocabulary detection, requiring only minimal modifications to the existing framework. To verify its effectiveness, we conduct extensive experiments on the popular open-vocabulary detection datasets COCO [19] and LVIS [20] under zero-shot and fine-tuning settings. The results demonstrate that OV-DINO achieves state-of-the-art performance on both datasets and settings. To highlight the characteristics of our model, we compare OV-DINO with recent methods in terms of method type, modality fusion, and pseudo-label generation in Table I.

In summary, our main contributions are outlined as follows:

• We present OV-DINO, a novel unified open-vocabulary detection approach that offers superior performance and effectiveness for practical real-world application.

• We propose a Unified Data Integration pipeline that integrates diverse data sources for end-to-end pre-training, and a Language-Aware Selective Fusion module to improve the vision-language understanding of the model.

• The proposed OV-DINO shows significant performance improvement on COCO and LVIS benchmarks compared to previous methods, achieving relative improvements of +2.5% AP on COCO and +13.6% AP on LVIS compared to G-DINO in zero-shot evaluation. The pre-trained model and code will be open-sourced to support open-end vision development.

2 Related Work

Vision-Language Pre-Training. Conventional supervised vision methods [21, 22, 23, 24, 25] often rely on manual human annotation, thereby constraining the model’s capacity for generalization. Additionally, it is challenging to define a comprehensive list of categories and collect sufficient sample data for rare categories [14, 26, 27]. The expensive labeling cost limits the wide application of vision models in open-world scenarios. To overcome the data annotation limitation, Vision-Language Pre-training has been proposed as a natural extension of the successful pre-train-then-fine-tune scheme in natural language processing (NLP) [28, 29] and computer vision [30]. Dual-stream approaches such as CLIP [11] and ALIGN [12] have shown great zero-shot classification ability by pre-training on large-scale image-text pair data (e.g., CC12M [31], YFCC100M [32], Laion5B [33]) with cross-modal contrastive learning. Single-stream approaches [34, 35, 36] directly model the relation of vision and text embedding by two separate transformer-based encoders, which perform well in tasks like image-text [37, 38, 39] and VQA [40, 41, 42]. Recently, VLMo [43], BLIP [44] and BLIPv2 [45] further explore a hybrid architecture incorporating both single-stream and two-stream architectures to facilitate a more cohesive way of vision-language understanding and generation. However, these models primarily focus on learning whole-image visual representations and cannot be directly applied to more complex core computer vision tasks such as segmentation and detection, which necessitate fine-grained semantic understanding.

Open-Vocabulary Detection. Traditional object detection methods [4, 5, 6] have been successful in supervised scenarios, but face challenges in adapting to open-world scenarios with a large number of classes. It is challenging to explore approaches to acquire more semantic concepts for tasks related to Open-Vocabulary Detection (OVD). Recent approaches such as RegionCLIP [46], Baron [46], and ViLD [47] have concentrated on extracting intricate semantic correspondences and information to improve the inclusiveness of new categories. However, these approaches are based on the pre-trained CLIP model, which restricts their capacity for generalization. Furthermore, recent methods like GLIP [1], G-DINO [3], and GLIPv2 [2] aim to integrate multiple data sources to enrich the model’s concept library. These approaches consider object detection as a grounding task and generate pseudo labels for image-text data. However, the grounding-oriented unification imposes limitations on the input length of text, and the pseudo-label generation introduces noise to the model. Meanwhile, DetCLIP [17] proposes a dictionary-enriched visual-concept paralleled pre-training scheme to pre-train a model in a parallel way. DetCLIPv2 [48] further endeavors to unify all data sources in a scalable pre-training approach by utilizing different losses for various data sources, sacrificing the efficiency of architecture. Therefore, this paper proposes a unified framework to integrate all data types into the object detection data format. The proposed approach aims to provide more accurate supervisory information to the model while overcoming text length limitations and the necessity for pseudo-label generation. This unified framework is designed to enhance the generalization of the model and improve the performance of open-vocabulary detection.

Modality Information Fusion and Alignment. A Vision-Language model (VLM) involves two distinct modalities, vision and language, so it is crucial to effectively fuse and align the modality information. Among image-level VLMs, CLIP [11] and ALIGN [12] directly align the vision and language modalities with a contrastive loss [49], and FILIP [13] further aligns the modality information at a fine-grained scale. To effectively align and fuse the cross-modal information, ALBEF [50] proposes to align before fusing: it utilizes a multi-modal encoder to fuse the image features and text features through cross-modal attention and aligns the modalities with an intermediate image-text contrastive loss. Flamingo [51] bridges the vision-only model and language-only model via the GATED XATTN-DENSE layers, achieving strong results on numerous benchmarks. However, image-level modality fusion and alignment are insufficient for fine-grained vision-language understanding. Among region-level VLMs, RegionCLIP [46] directly aligns the region representation with the region description via region-text pre-training, and VLDet [52] treats region-text alignment as a bipartite matching problem. DetCLIP [17] and DetCLIPv2 [48] further extend the region-text alignment scheme via large-scale pre-training, achieving outstanding open-vocabulary detection performance. However, these approaches primarily concentrate on aligning modality information while ignoring the region-text modality fusion. To fuse the language information with the region representation, GLIP [1] initially integrates cross-modal information in the encoder stage using a cross-attention module, then performs alignment using the region-word alignment loss. G-DINO [3] further integrates modalities in the decoder stage. Although previous methods already consider fusion and alignment for cross-modal information interaction, they do not effectively balance the relationship between fusion and alignment.
This paper aims to balance the fusion and alignment of modality information to enhance the model’s ability to capture precise image details guided by language input.

3 Method

This paper aims to develop a unified pre-training framework that integrates different data sources into a standardized format suitable for open-vocabulary detection tasks. To accomplish this objective, we propose a novel model called OV-DINO, which leverages diverse data sources to improve the performance of open-vocabulary detectors within a unified pre-training framework (Section 3.1). To facilitate unified pre-training, we develop a Unified Data Integration (UniDI) pipeline that applies across the various data sources during embedding extraction (Section 3.2). To align and fuse the fine-grained semantics between text embedding and region-specific visual embedding, we introduce a Language-Aware Selective Fusion (LASF) module to dynamically select and fuse the region-level vision-language information (Section 3.3). To enable detection-centric pre-training, we also develop a simple pre-training framework that features a straightforward design and shares similar training objectives with the closed-set detector DINO [8] (Section 3.4).

3.1 Overview

The overall framework of OV-DINO is depicted in Figure 3, which includes a text encoder, an image encoder, and a detector head with some transformer encoder and decoder layers. Given an image accompanied by a prompt, the detection category nouns or grounding phrases are prompted to captions using specific templates to create a unified representation for the general text embedding. Subsequently, vanilla image and text embeddings are extracted using dedicated image and text encoders. After the embedding extraction, the vanilla image embedding along with the positional embedding are input into the transformer encoder layers to generate refined image embedding. To improve the relevance between the image embedding and the text embedding, a language-aware query selection module is employed to select the object embedding associated with the text embedding. The selected object embedding serves as dynamic context embedding and is merged with static learnable content queries in the decoder using a language-aware query fusion module. The output queries from the final decoder layers are then used for classification projection and box regression to predict the corresponding classification scores and regress the object boxes. The model is pre-trained to align the region-specific image embedding with the related text embedding through the region-text alignment. It is optimized using the classification (alignment) loss and the box regression loss.

3.2 Unified Data Integration

The pre-training stage of OV-DINO involves three primary data sources: detection, grounding, and image-text data, each with its own annotation format. For example, detection data is annotated with class labels and box coordinates, grounding data includes captions annotated with positive token indices and box coordinates, and image-text data consists solely of a text description of the image. Typically, different types of data require distinct processing methods, such as designing diverse loss functions for detection data sources and generating pseudo-labels for image-text data. This increases the complexity of model optimization, preventing the model from reaching its optimal performance. To address this problem, OV-DINO converts all data formats into a unified detection-centric data format during data preparation, thereby enabling the integration of different types of data and harmonizing data from diverse sources for end-to-end training. Integrating detection and grounding data is relatively straightforward, as grounding data can be considered a specific type of detection data, with each image having multiple grounding phrases. The challenge lies in seamlessly transforming large-scale image-text data into the detection data format. Drawing inspiration from Detic [53], we argue that the caption of an image can be treated as a unique category for the image. Additionally, an image-sized bounding box can serve as the annotation box for the image. This approach, called Caption Box, enables the merging of these three types of data into a detection-centric data format.

To handle various data sources, we establish a standardized triplet format ( $x$ , $\{b_{i}\}_{i=1}^{n}$ , $\{y_{i}\}_{i=1}^{n}$ ), where $x\in\mathbb{R}^{3\times H\times W}$ represents the image input, $\{b_{i}\in\mathbb{R}^{4}\}_{i=1}^{n}$ represents the bounding boxes, and $\{y_{i}\in\mathbb{R}^{C}\}_{i=1}^{n}$ represents the language text inputs. Here, $H$ stands for the image height, $W$ for the image width, and $n$ for the number of object instances. The bounding boxes $b_{i}$ are the annotated boxes for detection and grounding data, while for image-text data they represent the image-sized box. The language text input $y_{i}$ varies depending on the data type: for detection data, it consists of pre-defined category names; for grounding data, it represents the grounding nouns or phrases of entities; and for image-text data, it is the entire caption. To ensure a consistent representation in language text embedding, we employ simple templates to prompt detection and grounding data (e.g., a photo of {category}.), while leaving the text input for image-text data unchanged since it already serves as a caption. This approach is referred to as Unified Prompt, which enables all text inputs to be represented as captions.

With the unified data integration pipeline (Caption Box and Unified Prompt), we can pre-train the model by combining training data from different data sources (detection, grounding, and image-text data), thereby reducing the complexities of model optimization.
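The conversion described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' data loader: the `Sample` record, the three `from_*` helpers, the `(x1, y1, x2, y2)` pixel box convention, and the toy annotations are all assumptions made for the example. Only the prompt template and the image-sized Caption Box come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) in pixels

PROMPT = "a photo of {}."  # template quoted in the text above


@dataclass
class Sample:
    """Detection-centric record: boxes[i] is annotated with texts[i]."""
    image_size: Tuple[int, int]  # (height, width); stands in for the image x
    boxes: List[Box]
    texts: List[str]


def from_detection(size, boxes, class_names) -> Sample:
    # Unified Prompt: category names are wrapped into caption-like text.
    return Sample(size, list(boxes), [PROMPT.format(n) for n in class_names])


def from_grounding(size, boxes, phrases) -> Sample:
    # Grounding phrases are treated as categories and prompted the same way.
    return Sample(size, list(boxes), [PROMPT.format(p) for p in phrases])


def from_image_text(size, caption) -> Sample:
    # Caption Box: the whole caption is one category whose box is the image.
    height, width = size
    return Sample(size, [(0.0, 0.0, float(width), float(height))], [caption])


batch = [
    from_detection((480, 640), [(10, 20, 200, 300)], ["dog"]),
    from_grounding((480, 640), [(50, 60, 120, 200)], ["a red umbrella"]),
    from_image_text((480, 640), "Two people walking on a beach."),
]
for sample in batch:
    print(sample.texts, sample.boxes)
```

After conversion, all three sources share one record type, so a single detection-style training step can consume them without source-specific branches.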

3.3 Language-Aware Selective Fusion

The open-vocabulary detection model is designed to identify objects within an image based on the provided text inputs, where the text inputs can serve as a query for the model. To ensure that the query in the decoder aligns with the region context, we propose a Language-Aware Selective Fusion module. This module consists of two key components: language-aware query selection and language-aware query fusion. The detailed architecture of the Language-Aware Query Fusion can be seen in Figure 4 (a).

The language-aware query selection component selects the object embedding by assessing the similarity between the image embedding and the text embedding. It computes the similarity of the multi-scale image embedding $E_{enc}$ and the text embedding $E_{t}$ and then utilizes $RankTop$ to choose the most relevant proposal embedding $E_{sp}$ and object embedding $E_{so}$ . The selected proposal embedding is utilized to initialize the reference anchors, and the selected object embedding $E_{so}$ is forwarded for subsequent query fusion. The language-aware query fusion component gradually fuses language-aware object embedding while preserving the original semantics of the content queries. This component as a crucial part of the decoder layers is iterated M times. Each decoder layer consists of several sub-layers including self-attention, cross-attention, gated-cross-attention, gated-feed-forward, and feed-forward layers. Initially, it takes the multi-scale image embedding $E_{enc}$ , the selected object embedding $E_{so}$ , and the learnable content query $Q_{lc}$ as input, and then dynamically updates the content query $Q_{lc}$ .

The language-aware query selection can be formulated as follows:

$\displaystyle E_{so},E_{sp}$	$\displaystyle=RankTop(E_{enc}\otimes E_{t}^{T}),$	(1)

where $E_{t}^{T}$ denotes the transpose of $E_{t}$ , $\otimes$ denotes the Kronecker product[54], and $RankTop$ is a parameter-less operation that arranges the elements in descending order and then selects the top $Q$ elements.
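As a rough illustration of Equation (1), the selection step can be written for a single image in plain Python. The sketch follows Algorithm 1 in scoring each encoder position by its best-matching text embedding and in computing the similarity as an ordinary matrix product. The function name `rank_top`, the toy embeddings, and the toy proposal boxes are assumptions for the example.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def rank_top(e_enc: List[Vector], e_t: List[Vector], q: int) -> List[int]:
    """Return indices of the q encoder positions most similar to any text."""
    # Similarity matrix of shape [P, C]: one row per position, one column per text.
    sim = [[dot(p, t) for t in e_t] for p in e_enc]
    # Score each position by its best-matching text embedding.
    best = [max(row) for row in sim]
    # Sort positions by descending score and keep the first q.
    order = sorted(range(len(best)), key=lambda i: best[i], reverse=True)
    return order[:q]


e_enc = [[1.0, 0.0], [0.0, 1.0], [0.25, 0.25], [-1.0, 0.0]]  # P = 4, D = 2
e_t = [[1.0, 0.0], [0.0, 2.0]]                                # C = 2, D = 2
boxes = [(0, 0, 1, 1), (1, 1, 2, 2), (2, 2, 3, 3), (3, 3, 4, 4)]  # toy proposals

idx = rank_top(e_enc, e_t, q=2)
e_so = [e_enc[i] for i in idx]   # selected object embedding
e_sp = [boxes[i] for i in idx]   # selected proposals for the reference anchors
print(idx, e_so, e_sp)
```

The same indices pick both outputs, which keeps each selected object embedding paired with the proposal that initializes its reference anchor.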

The language-aware query fusion can be formulated as follows:

$\displaystyle Q^{i}_{lc_{0}}$	$\displaystyle=\Phi_{Attn}(qkv=Q^{i-1}_{lc}),$	(2)
$\displaystyle Q^{i}_{lc_{1}}$	$\displaystyle=\Phi_{Attn}(q=Q^{i}_{lc_{0}},kv=E_{enc}),$	(3)
$\displaystyle Q^{i}_{lc_{2}}$	$\displaystyle=Q^{i}_{lc_{1}}+\tanh(\alpha_{a})\ast\Phi_{Attn}(q=Q^{i}_{lc_{1}},kv=E_{so}),$	(4)
$\displaystyle Q^{i}_{lc_{3}}$	$\displaystyle=Q^{i}_{lc_{2}}+\tanh(\alpha_{b})\ast\Phi_{FFW}(Q^{i}_{lc_{2}}),$	(5)
$\displaystyle Q^{i}_{lc}$	$\displaystyle=\Phi_{FFW}(Q^{i}_{lc_{3}}),$	(6)

where the upper index $i$ represents the module index, $\Phi_{Attn}$ represents the attention layer, $\Phi_{FFW}$ represents the feed-forward layer, and $\alpha_{a}$ and $\alpha_{b}$ are learnable parameters initialized to zero. Because $\tanh(0)=0$, both gated terms vanish at initialization, so training starts out consistent with the original decoder framework and gradually incorporates language-aware context into the content query.
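The effect of the zero-initialized gates in Equations (4) and (5) can be checked with a small numeric sketch. The helper `gated_residual` and the toy vectors are assumptions for illustration. The `update` list merely stands in for the output of the gated cross-attention or gated feed-forward layer.

```python
import math


def gated_residual(q, update, alpha):
    """Return q + tanh(alpha) * update, element-wise."""
    gate = math.tanh(alpha)
    return [x + gate * u for x, u in zip(q, update)]


q = [0.2, -1.0, 3.0]        # content query after cross-attention (toy values)
update = [5.0, 5.0, 5.0]    # stand-in for the gated sub-layer output

at_init = gated_residual(q, update, alpha=0.0)   # gate closed: tanh(0) = 0
later = gated_residual(q, update, alpha=0.5)     # gate partly open

print(at_init)
print(later)
```

With `alpha=0.0` the query passes through unchanged, which is why the decoder initially behaves like one without the extra sub-layers. As the gate parameter moves away from zero, the language-aware update is mixed in.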

To facilitate comprehension, we present the pseudocode for the Language-Aware Selective Fusion (LASF) in Algorithm 1. To enhance representation, we illustrate three different variants of LASF in Figure 4: Later-LASF, Middle-LASF, and Early-LASF, depending on the insert position. Additionally, the Cross Attention Fusion (CAF) proposed in G-DINO [3] is also considered.

Algorithm 1 Pseudocode of LASF in a PyTorch-like style.

def laqs(embed_enc, embed_t):
    """
    embed_enc: encoded embedding, shape: [B, P, D].
    embed_t: text embedding, shape: [B, C, D].
    """
    # region-text similarity, shape: [B, P, C]
    enc_cls = embed_enc @ embed_t.transpose(-1, -2)
    # box proposal for every position, shape: [B, P, 4]
    enc_coord = BoxMLP(embed_enc)
    # indices of the Q positions with the highest best-text score
    topk_idx = TopK(enc_cls.max(-1)[0], Q, dim=1)
    # embed_so: [B, Q, D], gathered from the encoded embedding
    embed_so = Gather(embed_enc, dim=1, index=topk_idx)
    # embed_sp: [B, Q, 4], gathered from the box proposals
    embed_sp = Gather(enc_coord, dim=1, index=topk_idx)
    return embed_so, embed_sp

def laqf(q_lc, embed_enc, embed_so):
    """
    q_lc: learnable content query, shape: [B, Q, D].
    embed_enc: encoded embedding, shape: [B, P, D].
    embed_so: selected object embedding, shape: [B, Q, D].
    """
    # self-attention
    q_lc = Attn(qkv=q_lc)
    # cross-attention
    q_lc = Attn(q=q_lc, kv=embed_enc)
    # gated-cross-attention (a: learnable gate, initialized to zero)
    q_lc = q_lc + Tanh(a) * Attn(q=q_lc, kv=embed_so)
    # gated-ffw (b: learnable gate, initialized to zero)
    q_lc = q_lc + Tanh(b) * FFW(q_lc)
    # ffw
    q_lc = FFW(q_lc)
    return q_lc

def lasf(embed_enc, embed_t, q_lc):
    """
    embed_enc: encoded embedding, shape: [B, P, D].
    embed_t: text embedding, shape: [B, C, D].
    q_lc: learnable content query, shape: [B, Q, D].
    NOTE: B is the batch size, P is the patch number, D is the dimension number, C is the caption number, and Q is the query number.
    """
    # 1. Language-aware query selection.
    embed_so, embed_sp = laqs(embed_enc, embed_t)
    # embed_sp initializes the reference points,
    # omitted here for conciseness.
    # 2. Language-aware query fusion.
    # Decoder layers with laqf, iterated M times.
    for _ in range(M):
        q_lc = laqf(q_lc, embed_enc, embed_so)
    q_sf = q_lc
    return q_sf

TopK: topk selection; Gather: gathers values along index specified by dim; Attn: attention layer; FFW: feed-forward layer; Tanh: tanh activation function.

3.4 Detection-Centric Pre-Training

The unified data integration introduces a detection-centric data format that transforms different types of data into a format suitable for detection. This approach combines a range of data sources such as detection data, grounding data, and image-text data, allowing for the pre-training of a model within a unified framework with a focus on detection.

Model Forward. OV-DINO takes the triplet-wise data ( $x$ , $\{b_{i}\}_{i=1}^{n}$ , $\{y_{i}\}_{i=1}^{n}$ ) as input. The image encoder $\Phi_{I}$ is an image backbone that extracts the image embedding $E_{i}\in\mathbb{R}^{P\times D}$ from the input image $x\in\mathbb{R}^{3\times H\times W}$ , where $P$ represents the spatial size of the flattened image embedding and $D$ represents the embedding dimension. The text encoder $\Phi_{T}$ takes the language text $\{y_{i}\in\mathbb{R}^{C}\}_{i=1}^{n}$ as input and produces the text embedding $E_{t}\in\mathbb{R}^{C\times D}$ . Built on DINO [8], the OV-DINO detector head comprises a transformer encoder, a language-aware query selection module, and a transformer decoder with a language-aware query fusion module. The transformer encoder $\Phi_{Enc}$ takes the image embedding $E_{i}$ as input and outputs the refined multi-scale image embedding $E_{enc}$ . The language-aware query selection module selects the image embedding most relevant to the text embedding $E_{t}$ as the object embedding $E_{so}\in\mathbb{R}^{Q\times D}$ . The transformer decoder takes the selected object embedding $E_{so}$ and a learnable content query $Q_{lc}\in\mathbb{R}^{Q\times D}$ as inputs and interacts with the refined image embedding $E_{enc}$ , which allows the query classification to follow the language text content in the selective fusion module. After the decoder, a classification projection layer $F_{c}$ projects the query embedding to the classification query logits $O\in\mathbb{R}^{Q\times D}$ , and a regression layer $F_{r}$ predicts the bounding box coordinates $B\in\mathbb{R}^{Q\times 4}$ . Here, $Q$ and $C$ denote the number of queries and prompted captions, respectively. The classification alignment score matrix $S\in\mathbb{R}^{Q\times C}$ is obtained by calculating the similarity of $O$ and $E_{t}^{T}$ . The overall process of model forward can be formulated as follows:

$\displaystyle E_{\mathrm{i}}$	$\displaystyle=\Phi_{\mathrm{I}}(x),\;E_{\mathrm{t}}=\Phi_{\mathrm{T}}(y_{i}),\;E_{\mathrm{enc}}=\Phi_{\mathrm{Enc}}(E_{\mathrm{i}}),$	(7)
$\displaystyle E_{\mathrm{so}}$	$\displaystyle=\Phi_{\mathrm{QS}}(E_{\mathrm{enc}},E_{\mathrm{t}}),\;Q_{\mathrm{sf}}=\Phi_{\mathrm{QF}}(E_{\mathrm{enc}},E_{\mathrm{so}},Q_{\mathrm{lc}}),$	(8)
$\displaystyle O$	$\displaystyle=F_{c}(Q_{\mathrm{sf}}),\;B=F_{r}(Q_{\mathrm{sf}}),\;S=O\otimes E_{\mathrm{t}}^{T},$	(9)

where $E_{t}^{T}$ denotes the transpose of $E_{t}$ , $\otimes$ denotes the Kronecker product [54], and $E_{\mathrm{sp}}$ is omitted for conciseness.

Model Optimization. The classification ground-truth $\mathrm{GT_{cls}}\in\{0,1\}^{Q\times C}$ is a matrix that indicates the matched relationship between predicted regions and prompted texts. The bounding box ground-truth $\mathrm{GT_{box}}\in\mathbb{R}^{Q\times 4}$ is a matrix that contains the corresponding box coordinates. Both are constructed using the bipartite matching algorithm described in [8, 7]. The classification loss $\mathcal{L}_{cls}$ is calculated using the predicted alignment score $S$ and the classification ground-truth $\mathrm{GT_{cls}}$ . The regression loss $\mathcal{L}_{reg}$ is calculated using the regressed bounding box $B$ and the bounding box ground-truth $\mathrm{GT_{box}}$ . The regression loss encompasses both the box loss $\mathcal{L}_{box}$ and the generalized intersection over union (GIoU) loss $\mathcal{L}_{giou}$ . In addition to the classification and regression losses, a denoising loss $\mathcal{L}_{dn}$ [55] is introduced to stabilize the training process and improve the robustness of the model. To maintain the simplicity of the detection-centric framework, the optimization objective of the pre-training stage is kept consistent with DINO [8]. The whole optimization objective $\mathcal{L}$ is a weighted combination of these loss components and can be written as:

$\displaystyle\mathcal{L}$	$\displaystyle=\alpha\mathcal{L}_{cls}+\beta\mathcal{L}_{box}+\gamma\mathcal{L}_{giou}+\mathcal{L}_{dn}.$	(10)

Here, $\alpha$ , $\beta$ and $\gamma$ represent the weight factors of $\mathcal{L}_{cls}$ , $\mathcal{L}_{box}$ and $\mathcal{L}_{giou}$ , respectively. $\mathcal{L}_{cls}$ is implemented by a sigmoid focal loss[56]. $\mathcal{L}_{box}$ is implemented by an L1 loss. $\mathcal{L}_{giou}$ is implemented by a GIoU loss[57]. $\mathcal{L}_{dn}$ represents the sum of the denoising losses [55] of the box and label.
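To make Equation (10) concrete, the following sketch evaluates the weighted objective for a single matched query using the loss weights listed in Table III. It is an illustration under assumptions rather than the training code: boxes are taken as non-degenerate `(x1, y1, x2, y2)` tuples, the focal-loss hyper-parameters `focal_alpha=0.25` and `focal_gamma=2.0` are commonly used defaults that the text does not state, and the denoising term is passed in as a precomputed number.

```python
import math

ALPHA, BETA, GAMMA = 2.0, 5.0, 2.0  # loss weights listed in Table III


def focal_loss(logit, target, focal_alpha=0.25, focal_gamma=2.0):
    """Sigmoid focal loss for one score (hyper-parameters are assumed defaults)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = p if target == 1 else 1.0 - p
    a_t = focal_alpha if target == 1 else 1.0 - focal_alpha
    return -a_t * (1.0 - p_t) ** focal_gamma * math.log(p_t)


def l1_loss(box, gt):
    return sum(abs(a - b) for a, b in zip(box, gt))


def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def giou_loss(box, gt):
    """1 - GIoU for non-degenerate boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    union = area(box) + area(gt) - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    hull = ((max(box[2], gt[2]) - min(box[0], gt[0]))
            * (max(box[3], gt[3]) - min(box[1], gt[1])))
    giou = inter / union - (hull - union) / hull
    return 1.0 - giou


def total_loss(logit, target, box, gt, l_dn=0.0):
    # Weighted sum of Equation (10); l_dn is supplied by the caller.
    return (ALPHA * focal_loss(logit, target)
            + BETA * l1_loss(box, gt)
            + GAMMA * giou_loss(box, gt)
            + l_dn)


perfect = giou_loss((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0))
disjoint = giou_loss((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))
print(perfect, disjoint)
print(total_loss(2.0, 1, (0.1, 0.1, 0.9, 0.9), (0.0, 0.0, 1.0, 1.0)))
```

The GIoU term stays informative even when the predicted and ground-truth boxes do not overlap, whereas a plain IoU term would be flat there.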

4 Experiments

TABLE II: Pre-Training Data. The dataset specifications used for pre-training OV-DINO. # Texts denotes the number of categories for the detection dataset, the number of phrases for the grounding data, and the number of captions for the image-text dataset, respectively. # Images denotes the number of images. # Anno. denotes the number of instance annotations. CC1M^‡ refers to our filtered 1M subset without any instance annotations.

Dataset	Type	# Texts	# Images	# Anno.
O365[58]	Detection	365	609K	9621K
GQA[59]	Grounding	387K	621K	3681K
Flickr30k[60]	Grounding	94K	149K	641K
CC1M^‡[31]	Image-Text	1M	1M	–

In this section, we demonstrate the effectiveness of the proposed OV-DINO through extensive experiments on two widely used open-vocabulary detection benchmarks: COCO [19] and LVIS [20]. We provide an overview of the pre-training datasets and the evaluation metrics in Section 4.1 and describe the implementation details in Section 4.2. We pre-train OV-DINO on large-scale diverse datasets and perform a zero-shot evaluation on the COCO and LVIS benchmarks. We then fine-tune the pre-trained model on the COCO dataset and evaluate its closed-set detection performance, as discussed in Section 4.3. To demonstrate the effectiveness of our model design, we conduct ablations in Section 4.4. Finally, we present qualitative results in comparison with other methods in Section 4.5.

4.1 Pre-Training Data and Evaluation Metric

TABLE III: Hyper-Parameters in Pre-Training and Fine-Tuning of OV-DINO. We emphasize the essential hyper-parameters for pre-training, while only addressing the distinct items of fine-tuning that differ from pre-training.

Item	Value
Pre-Training Config
batch size	128
training epochs	24
optimizer	AdamW[61]
weight decay	1e-4
optimizer momentum	$\beta_{1}=0.9,\beta_{2}=0.999$
warmup iter	1000
lr of image encoder	2e-4
lr of text encoder	2e-5
learning rate schedule	multi-step decay
clip max norm	0.1
input resolution	[800, 1333]
hidden dim (D)	256
# of encoder layers (N)	6
# of decoder layers (M)	6
# of heads	8
# of queries (Q)	900
# of prompted text (C)	150
cost of class	1
cost of bbox	5
cost of giou	2
loss of class ( $\alpha$ )	2
loss of bbox ( $\beta$ )	5
loss of giou ( $\gamma$ )	2
Fine-Tuning Config
batch size	32
lr of image encoder	1e-5
lr of text encoder	1e-6
# of prompted text (C)	80

Pre-Training Data. In our experiments, we use several datasets following [1, 3, 59]: the Objects365 detection dataset [58], the GoldG grounding dataset [59], and the Conceptual Captions image-text dataset [31], as detailed in Table II. Our model is trained on the detection and grounding datasets following the methodology of GLIP [1]. However, the image-text dataset contains a significant number of low-quality image-text pairs, as illustrated in Figure 5: the caption of the left sample describes the image content well, whereas the caption of the right sample does not align with the image content. To mitigate this noise, we employ CLIP-Large [11] to select 1 million image-text pairs from the original CC3M dataset. The filtering process computes the image-text similarity of all 3 million pairs, ranks the pairs by this similarity, and retains the top 1 million.
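The ranking step can be sketched as follows. This is an illustrative outline only: the embeddings are toy two-dimensional stand-ins for CLIP image and text features, cosine similarity is assumed as the similarity measure, and the helper names `cosine_similarity` and `filter_top_k` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def filter_top_k(pairs, similarities, k):
    """Return the k pairs with the highest similarity, best first."""
    if len(pairs) != len(similarities):
        raise ValueError("pairs and similarities must have the same length")
    order = sorted(range(len(pairs)), key=lambda i: similarities[i], reverse=True)
    return [pairs[i] for i in order[:k]]


# Toy stand-ins for image and text embeddings (assumed, not real CLIP outputs).
samples = [
    {"id": "pair_a", "image_emb": [1.0, 0.0], "text_emb": [0.9, 0.1]},
    {"id": "pair_b", "image_emb": [1.0, 0.0], "text_emb": [0.0, 1.0]},
    {"id": "pair_c", "image_emb": [0.6, 0.8], "text_emb": [0.5, 0.5]},
]
scores = [cosine_similarity(s["image_emb"], s["text_emb"]) for s in samples]
kept = filter_top_k([s["id"] for s in samples], scores, k=2)
print(kept)
```

With 3 million pairs and k set to 1 million, the same procedure would yield a filtered subset of the kind described above; in practice the similarities would be computed in batches on a GPU.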

Evaluation Metric. After pre-training, we evaluate the performance of the proposed OV-DINO under a zero-shot setting on the COCO [19] and LVIS [20] benchmarks. In addition, we fine-tune the pre-trained model on the COCO dataset to explore the effectiveness of continual fine-tuning. Following previous methods [1, 3], we use the standard Average Precision (AP) metric to evaluate performance on COCO, and the Fixed AP [62] metric on LVIS for fair comparison.

4.2 Implementation Details

TABLE IV: Zero-shot Domain Transfer Evaluation on LVIS MiniVal and Val 1.0 Datasets (%). AP_r, AP_c, and AP_f indicate the AP of rare, common, and frequent categories, respectively. Gray numbers denote that the model is trained on the LVIS dataset using either supervised or few-shot settings. CC3M^† denotes the pseudo-labeled CC3M in [18]. CC1M^‡ denotes a filtered subset from the CC3M dataset in our setting.

Model	Image Encoder	Params	Pre-Training Data	MiniVal AP	MiniVal AP_r	MiniVal AP_c	MiniVal AP_f	Val AP	Val AP_r	Val AP_c	Val AP_f
DETR[7]	RN101	–	LVIS	17.8	3.2	12.9	24.8	–	–	–	–
MDETR[59]	RN101	169M	GoldG, LVIS	24.2	20.9	24.9	24.3	–	–	–	–
MaskRCNN[6]	RN101	–	LVIS	33.3	26.3	34.0	33.9	–	–	–	–
GLIP-T(A)[1]	Swin-T	232M	O365	18.5	14.2	13.9	23.4	12.3	6.0	8.0	19.4
GLIP-T(B)[1]	Swin-T	232M	O365	17.8	13.5	12.8	22.2	11.3	4.2	7.6	18.6
GLIP-T(C)[1]	Swin-T	232M	O365, GoldG	24.9	17.7	19.5	31.0	16.5	7.5	11.6	26.1
GLIP-T[1]	Swin-T	232M	O365, GoldG, Cap4M	26.0	20.8	21.4	31.0	17.2	10.1	12.5	25.5
G-DINO-T²[3]	Swin-T	172M	O365, GoldG	25.6	14.4	19.6	32.2	–	–	–	–
G-DINO-T³[3]	Swin-T	172M	O365, GoldG, Cap4M	27.4	20.8	21.4	31.0	–	–	–	–
DetCLIP-T(A)[17]	Swin-T	155M	O365	28.8	26.0	28.0	30.0	22.1	18.4	20.1	19.4
DetCLIP-T(B)[17]	Swin-T	155M	O365, GoldG	34.4	26.9	33.9	36.3	27.2	21.9	25.5	31.5
DetCLIP-T[17]	Swin-T	155M	O365, GoldG, YFCC1M	35.9	33.2	35.7	36.4	28.4	25.0	27.0	31.6
YOLO-World-S[18]	YOLOv8-S	77M	O365, GoldG	26.2	19.1	23.6	29.8	24.2	16.4	21.7	27.8
YOLO-World-M[18]	YOLOv8-M	92M	O365, GoldG	31.0	23.8	29.2	33.9	–	–	–	–
YOLO-World-L[18]	YOLOv8-L	110M	O365, GoldG, CC3M^†	35.4	27.6	34.1	38.0	–	–	–	–
OV-DINO¹(Ours)	Swin-T	166M	O365	24.4	15.5	20.2	29.7	18.7	9.3	14.5	27.4
OV-DINO²(Ours)	Swin-T	166M	O365, GoldG	39.4	31.5	38.9	41.3	32.2	26.2	30.1	37.3
OV-DINO³(Ours)	Swin-T	166M	O365, GoldG, CC1M^‡	40.0	34.6	39.5	41.5	32.9	29.1	30.4	37.4

Model Architecture. Constrained by the high cost of model training, we pre-train the model only with Swin-T [21] as the image encoder, which has shown superior performance compared to other methods. To ensure a fair comparison, we use BERT-base from HuggingFace [63] as the text encoder, consistent with GLIP [1] and G-DINO [3]. To incorporate category names from detection data and noun phrases from grounding data during pre-training with image-text data, we adopt a unified data integration pipeline that prompts all category names or noun phrases with specific templates from CLIP [11], such as “a photo of {category}.” Following DINO [8], we extract multi-scale features at 4 scales ranging from 8x to 64x. We set the maximum number of prompted texts to 150, comprising the positive categories or phrases present in the image and negative texts randomly selected from all other data sources. For text embedding extraction, we use the max-length padding mode and apply mean pooling to aggregate the text embedding along the length dimension. A linear projection layer projects the image embedding into the same embedding space as the text embedding. By default, we set the number of queries to 900 and use six transformer layers in both the encoder and the decoder.
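The construction of the prompted text list can be sketched as follows. This is a simplified outline under stated assumptions: `build_prompted_texts` is a hypothetical helper, the candidate pool `vocabulary` stands in for the texts of all data sources, and only category names and noun phrases (not full captions) are handled.

```python
import random

TEMPLATE = "a photo of {}."  # CLIP-style template quoted in the text


def build_prompted_texts(positives, vocabulary, max_texts=150, rng=None):
    """Return (prompts, num_positives) with at most max_texts prompts.

    Positives come first; the remaining slots are filled with negatives
    sampled without replacement from the rest of the vocabulary.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed only to keep the sketch repeatable
    positives = list(dict.fromkeys(positives))[:max_texts]  # de-duplicate, keep order
    positive_set = set(positives)
    candidates = [t for t in dict.fromkeys(vocabulary) if t not in positive_set]
    num_negatives = min(max_texts - len(positives), len(candidates))
    negatives = rng.sample(candidates, num_negatives)
    prompts = [TEMPLATE.format(t) for t in positives + negatives]
    return prompts, len(positives)


vocabulary = ["person", "dog", "bicycle", "traffic light", "bench", "kite"]
prompts, num_pos = build_prompted_texts(["dog", "bench"], vocabulary, max_texts=4)
print("positives:", prompts[:num_pos])
print("negatives:", prompts[num_pos:])
```

Keeping the positives at the front of the list makes it straightforward to build the classification target: the first `num_pos` columns of the $Q\times C$ target can be non-zero, while the sampled negatives are always zero.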

Model Training. To maintain simplicity in the model, we adhere to a training procedure similar to the original DINO setting [8]. We adopt the AdamW [61] optimizer with a weight decay of 1e-4. The total batch size is 128, with a base learning rate of 2e-4 for all model parameters except the text encoder, whose learning rate is 0.1 times the base learning rate (i.e., 2e-5). During the fine-tuning stage on COCO, the base learning rate is adjusted to 1e-5 (1e-6 for the text encoder), and the batch size and the number of prompted texts are changed as listed in Table III; the remaining hyper-parameters are the same as in the pre-training stage. Both pre-training and fine-tuning run for 24 epochs (2x schedule) with a step learning rate schedule, in which the learning rate is reduced to 0.1 and 0.01 of the base learning rate at the 16th and 22nd epochs, respectively. The weights of the classification loss, box loss, and GIoU loss are 2.0, 5.0, and 2.0, respectively. The weights of the matching cost components are identical to the loss weights except for the classification cost, which has a weight of 1.0. The hyper-parameters used in the pre-training and fine-tuning stages of OV-DINO are detailed in Table III.
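The step schedule can be illustrated with a short sketch that uses the pre-training learning rates listed in Table III (2e-4 for the base rate and 2e-5 for the text encoder). The function name, the group names, and the convention that `epoch` counts completed epochs starting from 0 are assumptions; the warmup iterations listed in Table III are not modeled.

```python
def lr_at_epoch(base_lr, epoch, milestones=(16, 22), decay=0.1):
    """Multi-step decay: multiply by `decay` once per milestone already reached."""
    num_drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** num_drops


# Pre-training learning rates from Table III; group names are illustrative.
PARAM_GROUP_LR = {"base": 2e-4, "text_encoder": 2e-5}

for epoch in (0, 15, 16, 21, 22, 23):
    lrs = {name: lr_at_epoch(lr, epoch) for name, lr in PARAM_GROUP_LR.items()}
    print(epoch, lrs)
```

Under this convention the base rate stays at 2e-4 for the first 16 epochs, drops to 2e-5 for the next six, and finishes at 2e-6, with the text encoder following the same shape at one tenth of the scale.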

4.3 Main Results

TABLE V: Zero-shot Domain Transfer and Fine-tuning Evaluation on COCO (%). OV-DINO outperforms prior methods in zero-shot evaluation. When further fully fine-tuned on COCO, OV-DINO surpasses the previous state-of-the-art (SoTA) performance under the same setting. Gray numbers denote that the method is trained on the COCO dataset under supervised or few-shot settings.

Model	Image Encoder	Pre-Training Data	Data Size	Epochs	Zero-Shot AP (COCO 2017 Val)	Fine-Tuning AP (COCO 2017 Val)
Faster RCNN[5]	RN50-FPN	COCO	118K	36	–	40.3
Faster RCNN[5]	RN101-FPN	COCO	118K	36	–	41.8
DyHead-T[64]	Swin-T	COCO	118K	24	–	49.7
DINO-T[8]	Swin-T	COCO	118K	24	–	51.3
GLIP-T(A)[1]	Swin-T	O365	0.66M	30	42.9	52.9
GLIP-T(B)[1]	Swin-T	O365	0.66M	30	44.9	53.8
GLIP-T(C)[1]	Swin-T	O365, GoldG	1.43M	30	46.7	55.1
GLIP-T[1]	Swin-T	O365, GoldG, Cap4M	5.43M	30	46.3	54.9
G-DINO-T¹[3]	Swin-T	O365	0.61M	50	46.7	56.9
G-DINO-T²[3]	Swin-T	O365, GoldG	1.38M	50	48.1	57.1
G-DINO-T³[3]	Swin-T	O365, GoldG, Cap4M	5.38M	50	48.4	57.2
YOLO-World-S[18]	YOLOv8-S	O365, GoldG	0.61M	100	37.6	45.9
YOLO-World-M[18]	YOLOv8-M	O365, GoldG	0.61M	100	42.8	51.2
YOLO-World-L[18]	YOLOv8-L	O365, GoldG, CC3M^†	1.63M	100	45.1	53.3
OV-DINO¹(Ours)	Swin-T	O365	0.60M	24	49.5	57.5
OV-DINO²(Ours)	Swin-T	O365, GoldG	1.38M	24	50.6	58.4
OV-DINO³(Ours)	Swin-T	O365, GoldG, CC1M^‡	2.38M	24	50.2	58.2

TABLE VI: Ablations on Unified Data Integration and Language-Aware Selective Fusion. We evaluate the zero-shot performance of the proposed methods on LVIS MiniVal. UniDI, UniPro, CapBox, and LASF denote the Unified Data Integration, Unified Prompt, Caption Box, and Language-Aware Selective Fusion, respectively.

#	Pre-Training Data	UniDI: UniPro	UniDI: CapBox	LASF	AP	AP_r	AP_c	AP_f
0	O365-100K	✗	✗	✗	18.3	10.1	14.8	22.8
1	O365-100K	✓	✗	✗	18.9	12.8	15.2	23.4
2	O365-100K	✗	✗	✓	19.2	10.5	16.5	23.1
3	O365-100K	✓	✗	✓	19.5	12.8	16.6	23.4
4	O365-100K, CC-100K	✗	✓	✓	20.6	13.1	17.9	24.4
5	O365-100K, CC-100K	✓	✓	✓	22.0	14.0	20.0	25.2

LVIS Benchmark. In Table IV, we compare the proposed OV-DINO with recent state-of-the-art methods on the LVIS benchmark. The LVIS dataset is designed for long-tail object recognition and contains over 1000 categories for evaluation. We evaluate OV-DINO on LVIS MiniVal and LVIS Val under the zero-shot setting. When pre-trained on O365 and GoldG, OV-DINO reaches 39.4 AP on LVIS MiniVal and 32.2 AP on LVIS Val, surpassing the previous methods in Table IV that use the same data. Notably, OV-DINO has fewer parameters than GLIP-T and G-DINO-T (166M versus 232M and 172M) and is pre-trained for only 24 epochs, a shorter schedule than those listed for GLIP, G-DINO, and YOLO-World in Table V. Moreover, when the image-text dataset is added, OV-DINO attains the highest AP among the compared Swin-T models, 40.0 AP on MiniVal and 32.9 AP on Val, with the largest gain on rare categories (AP_r rises from 31.5 to 34.6 on MiniVal). These results indicate that OV-DINO is effective at detecting a large and diverse set of object categories.

COCO Benchmark. In Table V, we compare the proposed OV-DINO with recent state-of-the-art methods on the COCO benchmark in both zero-shot and fine-tuning settings. In the zero-shot setting, our models are pre-trained on various large-scale datasets and directly evaluated on the COCO dataset. First, we pre-train the model on the Objects365 (O365) dataset [58] and evaluate it in a zero-shot manner, where OV-DINO reaches 49.5 AP and outperforms all previous models in Table V. OV-DINO achieves its best zero-shot result, 50.6 AP, when O365 is combined with the GoldG [59] data. We further fine-tune the pre-trained model on the COCO dataset, reaching 58.4 AP on COCO 2017 val with only Swin-T [21] as the image encoder, the highest fine-tuning result in Table V. Interestingly, adding image-text data slightly lowers COCO performance (50.2 versus 50.6 AP zero-shot, and 58.2 versus 58.4 AP after fine-tuning), potentially because the COCO dataset contains only a limited set of category names. Nevertheless, we find that image-text data is essential for discovering more diverse categories, as demonstrated in the LVIS experiments.

4.4 Ablation Study

We conduct extensive ablation studies to analyze the effectiveness of the proposed OV-DINO. To reduce the cost of training with the full data, we randomly sample 100,000 images from the original O365v1 [58] dataset and 100,000 images from the filtered CC3M [31] subset for all ablation studies. We set the batch size to 32 and the training schedule to 12 epochs. Unless otherwise specified, we pre-train OV-DINO on the sampled O365-100K and CC-100K datasets and evaluate zero-shot performance on the LVIS MiniVal dataset.

UniDI and LASF. In Table VI, we conduct an ablation study on the proposed Unified Data Integration (UniDI) and the Language-Aware Selective Fusion (LASF). The UniDI harmonizes different data sources through Unified Prompt and Caption Box, while the LASF selects and fuses the cross-modality information dynamically. The proposed UniDI and LASF consistently improve performance on LVIS MiniVal.

Variants of LASF. In Table VII, we compare variants of the proposed LASF with the Cross Attention Fusion (CAF) used in G-DINO [3]. Figure 4 illustrates three variants of LASF that differ in insertion location: Later-LASF, Middle-LASF, and Early-LASF; the architecture of CAF is also provided for comparison. All models in this ablation are pre-trained with Swin-T as the image encoder on the sampled O365-100K subset. Later-LASF achieves 19.2 AP on LVIS MiniVal, compared with 18.9 AP for CAF and 18.3 AP for the baseline, which suggests that the LASF module captures language-aware context more effectively than CAF. Early-LASF (18.8 AP) and Middle-LASF (18.5 AP) improve over the baseline but not over CAF, so we adopt Later-LASF, which shows the best zero-shot transfer ability, as our default architecture.

Text Embedding Pooling. In Table VIII, we evaluate the impact of different text embedding pooling methods, namely mean-pooling and max-pooling of the text embedding. We pre-train the models on O365-100K and CC-100K with these two pooling methods. On O365-100K alone, the two methods perform comparably (18.9 AP for mean versus 19.0 AP for max), whereas on the combined datasets mean pooling performs better (22.0 versus 21.4 AP). Mean pooling captures a comprehensive representation of a prompted caption, making it suitable for UniDI.
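The two pooling options can be sketched in plain Python as follows. The text does not state whether padded positions are excluded from the pooling, so the masking used here is an assumption, and the function names are illustrative.

```python
def _valid_tokens(token_embeddings, attention_mask):
    """Keep only the token vectors whose mask entry is non-zero."""
    valid = [e for e, m in zip(token_embeddings, attention_mask) if m]
    if not valid:
        raise ValueError("attention_mask selects no tokens")
    return valid


def mean_pool(token_embeddings, attention_mask):
    """Average the unmasked token vectors along the length dimension."""
    valid = _valid_tokens(token_embeddings, attention_mask)
    dim = len(valid[0])
    return [sum(e[d] for e in valid) / len(valid) for d in range(dim)]


def max_pool(token_embeddings, attention_mask):
    """Element-wise maximum over the unmasked token vectors."""
    valid = _valid_tokens(token_embeddings, attention_mask)
    dim = len(valid[0])
    return [max(e[d] for e in valid) for d in range(dim)]


# Four positions after max-length padding; the last two are padding.
tokens = [[1.0, -2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
mask = [1, 1, 0, 0]
print(mean_pool(tokens, mask))
print(max_pool(tokens, mask))
```

Mean pooling lets every word of a caption contribute to the pooled vector, whereas max pooling keeps only the largest activation per dimension, which is one plausible reading of why mean pooling suits longer captions better.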

TABLE VII: Ablations on Variants of Language-Aware Selective Fusion and Cross Attention Fusion. We ablate the variants of LASF and CAF through the zero-shot LVIS MiniVal evaluation. All models are pre-trained on the O365-100K dataset.

#	Model	AP	AP_r	AP_c	AP_f
0	Baseline	18.3	10.1	14.8	22.8
1	Baseline + CAF	18.9	10.4	16.0	22.9
2	Baseline + Early-LASF	18.8	9.5	16.1	22.9
3	Baseline + Middle-LASF	18.5	9.4	15.5	22.8
4	Baseline + Later-LASF	19.2	10.5	16.5	23.1

TABLE VIII: Ablations on Text Embedding Pooling. We ablate the different text embedding pooling methods on O365-100K and CC-100K datasets, then evaluate zero-shot performance on LVIS MiniVal. The pooling methods considered include mean and max, where mean represents mean-pooling of text embedding, and max represents max-pooling of text embedding.

#	Pre-Training	EmbedPool: mean	EmbedPool: max	AP	AP_r	AP_c	AP_f
0	O365	✗	✓	19.0	11.8	15.7	23.3
1	O365	✓	✗	18.9	10.7	15.1	23.7
2	O365, CC	✗	✓	21.4	13.5	18.3	25.5
3	O365, CC	✓	✗	22.0	14.0	20.0	25.2

Data Source of Caption Box. In Table IX, we compare the performance of different data sources for Caption Box. We conducted the comparison by selecting the bottom and top 100K samples based on the image-text similarity of CLIP, as well as a random 100K sample. The results show that the rank_top data source yields the best performance, while the rank_bottom performs the worst. This highlights the inevitable noise in the image-text dataset and emphasizes the necessity of our filtering operation.

4.5 Qualitative Results

Visualization on COCO. We present visualization results derived from the pre-trained OV-DINO. Figure 6 shows zero-shot inference results on the COCO dataset, where only box predictions with a confidence score above 0.5 are displayed, together with the predictions of GLIP [1] and G-DINO [3] for comparison. The first column depicts the image with ground truth, the second and third columns show the predictions of GLIP-T(B) and G-DINO-T³, and the last column shows the predictions of OV-DINO². In these examples, OV-DINO produces more precise predictions with higher confidence scores and detects small objects well. These findings illustrate the zero-shot transfer capability of OV-DINO in detecting objects based on the language text input.

Visualization on LVIS. We also present visualization results derived from the pre-trained OV-DINO³. Figure 7 illustrates the visualization results of zero-shot inference on the LVIS dataset. The LVIS dataset is a long-tail dataset with over 1200 categories, which can lead to numerous predictions in an image. For a clear visualization, we only display the box predictions with scores higher than 0.5. OV-DINO demonstrates exceptional performance in detecting a diverse range of categories, resulting in highly accurate predictions.
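The score thresholding used for these visualizations can be sketched as follows. The sketch assumes that each query is assigned its highest-scoring prompted text after a sigmoid over the alignment logits; this assignment rule, the helper names, and the toy inputs are illustrative assumptions rather than the released inference code.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def select_detections(logits, boxes, texts, threshold=0.5):
    """Keep (text, score, box) for queries whose best score exceeds the threshold.

    logits: Q rows of C alignment logits; boxes: Q boxes; texts: C prompted texts.
    """
    detections = []
    for row, box in zip(logits, boxes):
        scores = [sigmoid(x) for x in row]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] > threshold:
            detections.append((texts[best], scores[best], box))
    return detections


texts = ["cat", "dog"]
logits = [[2.0, -1.0], [-3.0, -2.0]]  # two queries, two prompted texts
boxes = [(10, 20, 110, 220), (0, 0, 5, 5)]
for label, score, box in select_detections(logits, boxes, texts):
    print(label, round(score, 3), box)
```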

TABLE IX: Ablations on the Data Source of Caption Box. We ablate the different data sources of the image-text dataset and evaluate the zero-shot performance on LVIS MiniVal. Three data sources are considered: random_select randomly selects 100K samples, while rank_bottom and rank_top retain the bottom 100K and the top 100K samples, respectively, of the image-text pairs sorted by similarity in descending order.

#	Data Source	AP	AP_r	AP_c	AP_f
0	rank_bottom	19.6	9.5	16.7	24.0
1	random_select	20.8	11.6	18.1	24.8
2	rank_top	22.0	14.0	20.0	25.2

5 Discussions

Conclusions. In this paper, we present OV-DINO, a robust unified open-vocabulary detector that aims to improve the performance of open-vocabulary detection. We propose a unified data integration pipeline to efficiently integrate various data sources, enabling end-to-end training with a unified detection framework for consistency and coherence. Additionally, we introduce a language-aware selective fusion module to selectively fuse cross-modality information, thereby improving the overall performance of OV-DINO through dynamic fusion of multi-modal data. Experimental results demonstrate that OV-DINO outperforms previous state-of-the-art methods when evaluated on the challenging COCO and LVIS benchmarks.

Limitations. Despite the strong performance of OV-DINO as a unified open-vocabulary detection method, some challenges and limitations remain. One limitation is that OV-DINO has not yet been scaled up with a larger encoder and more extensive datasets. Scaling up is a promising direction for improving the performance and applicability of open-vocabulary detection models. However, the pre-training stage requires substantial computational resources, which may present a barrier to scalability. Therefore, it is essential to optimize the training process to facilitate the advancement of open-vocabulary tasks.

Broader Impact. In our research, we explore the detection-centric pre-training for open-vocabulary detection (OVD), which differs from the traditional approach of custom-designing for various data sources. Additionally, we introduce the concept of language-aware cross-modality fusion and alignment, marking a departure from the conventional method of simple region-concept alignment. Consequently, our research provides an innovative perspective for OVD. We expect that OV-DINO will encourage further exploration of ways to effectively leverage language-aware cross-modality information for open-vocabulary vision tasks.

References

[1] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al., “Grounded language-image pre-training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10 965–10 975.
[2] H. Zhang, P. Zhang, X. Hu, Y.-C. Chen, L. H. Li, X. Dai, L. Wang, L. Yuan, J.-N. Hwang, and J. Gao, “Glipv2: Unifying localization and vision-language understanding,” in Advances in Neural Information Processing Systems, 2022.
[3] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” arXiv preprint arXiv:2303.05499, 2023.
[4] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
[5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, 2015.
[6] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
[7] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. Springer, 2020, pp. 213–229.
[8] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, L. M. Ni, and H.-Y. Shum, “Dino: Detr with improved denoising anchor boxes for end-to-end object detection,” arXiv preprint arXiv:2203.03605, 2022.
[9] A. Bansal, K. Sikka, G. Sharma, R. Chellappa, and A. Divakaran, “Zero-shot object detection,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 384–400.
[10] A. Zareian, K. D. Rosa, D. H. Hu, and S.-F. Chang, “Open-vocabulary object detection using captions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 393–14 402.
[11] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
[12] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” in International conference on machine learning. PMLR, 2021, pp. 4904–4916.
[13] L. Yao, R. Huang, L. Hou, G. Lu, M. Niu, H. Xu, X. Liang, Z. Li, X. Jiang, and C. Xu, “Filip: fine-grained interactive language-image pre-training,” arXiv preprint arXiv:2111.07783, 2021.
[14] Y. Long, Y. Wen, J. Han, H. Xu, P. Ren, W. Zhang, S. Zhao, and X. Liang, “Capdet: Unifying dense captioning and open-world detection pretraining,” 2023.
[15] Y. Xu, M. Zhang, C. Fu, P. Chen, X. Yang, K. Li, and C. Xu, “Multi-modal queried object detection in the wild,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[16] C. Feng, Y. Zhong, Z. Jie, X. Chu, H. Ren, X. Wei, W. Xie, and L. Ma, “Promptdet: Towards open-vocabulary detection using uncurated images,” in European Conference on Computer Vision. Springer, 2022, pp. 701–717.
[17] L. Yao, J. Han, Y. Wen, X. Liang, D. Xu, W. Zhang, Z. Li, C. Xu, and H. Xu, “Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection,” arXiv preprint arXiv:2209.09407, 2022.
[18] T. Cheng, L. Song, Y. Ge, W. Liu, X. Wang, and Y. Shan, “Yolo-world: Real-time open-vocabulary object detection,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2024.
[19] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755.
[20] A. Gupta, P. Dollár, and R. Girshick, “Lvis: A dataset for large vocabulary instance segmentation,” Computer Vision and Pattern Recognition, 2019.
[21] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 012–10 022.
[22] P. Ren, C. Li, G. Wang, Y. Xiao, Q. Du, X. Liang, and X. Chang, “Beyond fixation: Dynamic window visual transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11 987–11 997.
[23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[24] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3146–3154.
[25] H. Wang, W. Wang, and J. Liu, “Temporal memory attention for video semantic segmentation,” in 2021 IEEE International Conference on Image Processing (ICIP). IEEE, 2021, pp. 2254–2258.
[26] P. Ren, C. Li, H. Xu, Y. Zhu, G. Wang, J. Liu, X. Chang, and X. Liang, “Viewco: Discovering text-supervised segmentation masks via multi-view semantic consistency,” arXiv preprint arXiv:2302.10307, 2023.
[27] S. Wu, W. Zhang, S. Jin, W. Liu, and C. C. Loy, “Aligning bag of regions for open-vocabulary object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 254–15 264.
[28] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[29] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020.
[30] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[31] P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2556–2565.
[32] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li, “Yfcc100m: The new data in multimedia research,” Communications of the ACM, vol. 59, no. 2, pp. 64–73, 2016.
[33] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al., “Laion-5b: An open large-scale dataset for training next generation image-text models,” Advances in Neural Information Processing Systems, vol. 35, pp. 25 278–25 294, 2022.
[34] W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer without convolution or region supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 5583–5594.
[35] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, “Visualbert: A simple and performant baseline for vision and language,” arXiv preprint arXiv:1908.03557, 2019.
[36] J. Lu, D. Batra, D. Parikh, and S. Lee, “Vilbert: Pretraining task-agnostic vision linguistic representations for vision-and-language tasks,” Advances in neural information processing systems, vol. 32, 2019.
[37] L. Guo, J. Liu, X. Zhu, P. Yao, S. Lu, and H. Lu, “Normalized and geometry-aware self-attention network for image captioning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10 327–10 336.
[38] T. Yao, Y. Pan, Y. Li, and T. Mei, “Exploring visual relationship for image captioning,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 684–699.
[39] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International conference on machine learning. PMLR, 2015, pp. 2048–2057.
[40] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “Vqa: Visual question answering,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2425–2433.
[41] P. Gao, Z. Jiang, H. You, P. Lu, S. C. Hoi, X. Wang, and H. Li, “Dynamic fusion with intra-and inter-modality attention flow for visual question answering,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6639–6648.
[42] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei et al., “Oscar: Object-semantics aligned pre-training for vision-language tasks,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16. Springer, 2020, pp. 121–137.
[43] H. Bao, W. Wang, L. Dong, Q. Liu, O. K. Mohammed, K. Aggarwal, S. Som, and F. Wei, “Vlmo: Unified vision-language pre-training with mixture-of-modality-experts,” arXiv preprint arXiv:2111.02358, 2021.
[44] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International Conference on Machine Learning. PMLR, 2022, pp. 12 888–12 900.
[45] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in International conference on machine learning. PMLR, 2023, pp. 19 730–19 742.
[46] Y. Zhong, J. Yang, P. Zhang, C. Li, N. Codella, L. H. Li, L. Zhou, X. Dai, L. Yuan, Y. Li et al., “Regionclip: Region-based language-image pretraining,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 793–16 803.
[47] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui, “Open-vocabulary object detection via vision and language knowledge distillation,” arXiv preprint arXiv:2104.13921, 2021.
[48] L. Yao, J. Han, X. Liang, D. Xu, W. Zhang, Z. Li, and H. Xu, “Detclipv2: Scalable open-vocabulary object detection pre-training via word-region alignment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23 497–23 506.
[49] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), vol. 2. IEEE, 2006, pp. 1735–1742.
[50] J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,” Advances in neural information processing systems, vol. 34, pp. 9694–9705, 2021.
[51] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 23 716–23 736, 2022.
[52] C. Lin, P. Sun, Y. Jiang, P. Luo, L. Qu, G. Haffari, Z. Yuan, and J. Cai, “Learning object-language alignments for open-vocabulary object detection,” arXiv preprint arXiv:2211.14843, 2022.
[53] X. Zhou, R. Girdhar, A. Joulin, P. Krähenbühl, and I. Misra, “Detecting twenty-thousand classes using image-level supervision,” in European Conference on Computer Vision. Springer, 2022, pp. 350–368.
[54] C. F. Van Loan, “The ubiquitous kronecker product,” Journal of computational and applied mathematics, vol. 123, no. 1-2, pp. 85–100, 2000.
[55] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang, “Dn-detr: Accelerate detr training by introducing query denoising,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13 619–13 627.
[56] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
[57] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 658–666.
[58] S. Shao, Z. Li, T. Zhang, C. Peng, G. Yu, X. Zhang, J. Li, and J. Sun, “Objects365: A large-scale, high-quality dataset for object detection,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 8430–8439.
[59] A. Kamath, M. Singh, Y. LeCun, G. Synnaeve, I. Misra, and N. Carion, “Mdetr-modulated detection for end-to-end multi-modal understanding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1780–1790.
[60] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2641–2649.
[61] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[62] A. Dave, P. Dollár, D. Ramanan, A. Kirillov, and R. Girshick, “Evaluating large-vocabulary object detectors: The devil is in the details,” arXiv preprint arXiv:2102.01066, 2021.
[63] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., “Huggingface’s transformers: State-of-the-art natural language processing,” arXiv preprint arXiv:1910.03771, 2019.
[64] X. Dai, Y. Chen, B. Xiao, D. Chen, M. Liu, L. Yuan, and L. Zhang, “Dynamic head: Unifying object detection heads with attentions,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 7373–7382.