Summary
CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection (arxiv.org)
8,577 words - PDF document
One Line
CoDet improves open-vocabulary object detection by discovering and aligning co-occurring objects across images, using region correspondences and text guidance to achieve state-of-the-art performance.
Key Points
- CoDet is a novel approach to open-vocabulary object detection that addresses the challenge of deriving reliable region-word alignment from image-text pairs.
- CoDet reformulates region-word alignment as a co-occurring object discovery problem by grouping images that mention a shared concept in their captions and leveraging visual similarities.
- CoDet outperforms existing methods in open-vocabulary detection, achieving superior performance and strong scalability.
- CoDet avoids reliance on a pre-aligned vision-language space, relying solely on the vision space to discover region-word correspondences.
- CoDet introduces text guidance into similarity estimation between region proposals, making it concept-aware so that it more accurately reflects the closeness of objects with respect to the shared semantic concept.
- Experimental results demonstrate CoDet's effectiveness, with superior performance in novel object detection on the OV-LVIS and OV-COCO benchmarks.
- CoDet presents a novel approach to open-vocabulary object detection by reformulating region-word alignment and leveraging co-occurrence information.
Summaries
26 word summary
CoDet improves open-vocabulary object detection by discovering and aligning co-occurring objects, outperforming existing methods. It leverages region correspondences and text guidance to achieve state-of-the-art performance.
80 word summary
CoDet is a novel approach to open-vocabulary object detection that improves region-word alignment by discovering and aligning co-occurring objects. It groups images whose captions mention a shared concept and leverages region correspondences across those images, with text guidance making similarity estimation concept-aware. CoDet outperforms existing methods in open-vocabulary detection, with especially strong gains on novel categories. Overall, by effectively leveraging cross-image region correspondences and text guidance to discover co-occurring objects for alignment, CoDet achieves state-of-the-art performance.
172 word summary
CoDet is a novel approach to open-vocabulary object detection that improves region-word alignment by discovering and aligning co-occurring objects. It achieves superior performance and scalability, outperforming existing methods in open-vocabulary detection. CoDet achieves 37.0 AP on novel categories and 44.7 AP on all categories on the OV-LVIS benchmark, surpassing the previous state of the art by 4.2 AP and 9.8 AP, respectively.
CoDet focuses on the open-vocabulary setting and leverages region correspondences across images to discover and align co-occurring objects. It introduces text guidance into similarity estimation to more accurately reflect the closeness of objects with respect to the shared semantic concept.
Experimental results show CoDet's effectiveness in novel object detection on various benchmark datasets. Text guidance significantly improves performance, particularly on novel categories, and the prototype-based strategy for co-occurring object discovery yields better results than a heuristic one.
Overall, CoDet effectively leverages cross-image region correspondences and text guidance to discover co-occurring objects for alignment, achieving state-of-the-art performance on various benchmark datasets. It presents a novel approach to open-vocabulary object detection and has potential for combining with vision-language models.
417 word summary
CoDet is a novel approach to open-vocabulary object detection that reformulates region-word alignment as a co-occurring object discovery problem. It groups images that mention a shared concept in their captions and uses visual similarities to discover and align co-occurring objects with the shared concept. CoDet outperforms existing methods in open-vocabulary detection, achieving superior performance and scalability. It achieves 37.0 AP on novel categories and 44.7 AP on all categories on the OV-LVIS benchmark, surpassing the previous state of the art by 4.2 AP and 9.8 AP, respectively.
Object detection traditionally has a fixed vocabulary, but CoDet focuses on the open-vocabulary setting, training the detector to recognize objects of arbitrary categories. Existing methods rely on vision-language models (VLMs) for region-word alignments, but these models have limitations in alignment accuracy for novel concepts. CoDet proposes a new approach that leverages region correspondences across images for co-occurring concept discovery and alignment. It constructs semantic groups by sampling images that mention a shared concept in their captions, inferring the existence of a common object corresponding to the shared concept across images.
CoDet avoids reliance on a pre-aligned vision-language space and relies solely on the vision space to discover region-word correspondences. It introduces text guidance into similarity estimation between region proposals to make it concept-aware and more accurately reflect the closeness of objects with respect to the shared semantic concept. Experimental results demonstrate the effectiveness of CoDet in achieving superior performance in novel object detection on the OV-LVIS and OV-COCO benchmarks.
The authors evaluate CoDet's performance on different benchmark datasets and compare it with existing methods, showing its superior results in terms of average precision (AP) on COCO and Objects365 datasets. They investigate different alignment strategies and find that alignments based on region-region similarity produce more accurate pseudo labels. Text guidance significantly improves the performance of CoDet, particularly on novel categories. The prototype-based strategy for co-occurring object discovery achieves better results.
Increasing the group size in co-occurring concept discovery reduces ambiguity when there are multiple co-occurring concepts in a group. However, preferences for concept group size differ between human-curated caption data and web-crawled image-text pairs. Overall, CoDet effectively leverages cross-image region correspondences and text guidance to discover co-occurring objects for alignment, achieving state-of-the-art performance on various benchmark datasets.
In conclusion, CoDet presents a novel approach to open-vocabulary object detection by reformulating region-word alignment as a co-occurring object discovery problem. It outperforms existing methods and demonstrates scalability with visual backbones. The authors highlight the potential of combining CoDet with previous efforts in aligning regions and words with vision-language models.
992 word summary
CoDet is a novel approach to open-vocabulary object detection that addresses the challenge of deriving reliable region-word alignment from image-text pairs. Unlike existing methods that rely on pre-trained or self-trained vision-language models for alignment, CoDet reformulates region-word alignment as a co-occurring object discovery problem. The key idea is to group images that mention a shared concept in their captions and leverage visual similarities to discover and align co-occurring objects with the shared concept.
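The grouping step described above can be sketched as follows. This is an illustrative assumption, not CoDet's actual implementation: the function name, the naive substring-based concept matching, and the toy captions are all made up.

```python
from collections import defaultdict

def build_concept_groups(image_caption_pairs, vocabulary, group_size=4):
    """Group image IDs by the concepts their captions mention (naive matching)."""
    concept_to_images = defaultdict(list)
    for image_id, caption in image_caption_pairs:
        caption_lower = caption.lower()
        for concept in vocabulary:
            if concept in caption_lower:
                concept_to_images[concept].append(image_id)
    # Keep only concepts with enough images to form a semantic group.
    return {
        concept: ids[:group_size]
        for concept, ids in concept_to_images.items()
        if len(ids) >= group_size
    }

pairs = [
    (0, "A dog chasing a ball in the park"),
    (1, "Two dogs playing on the grass"),
    (2, "A cat on the sofa"),
    (3, "A dog sleeping next to a cat"),
    (4, "A small dog with a red collar"),
]
groups = build_concept_groups(pairs, vocabulary=["dog", "cat"], group_size=4)
print(groups)  # {'dog': [0, 1, 3, 4]}
```

Each resulting group shares a concept in its captions, so a common object can be assumed to co-occur across the grouped images.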
CoDet outperforms existing methods in open-vocabulary detection, achieving superior performances and scalability. For example, by scaling up the visual backbone, CoDet achieves 37.0 AP on novel categories and 44.7 AP on all categories in the OV-LVIS benchmark, surpassing the previous state-of-the-art by 4.2 AP and 9.8 AP, respectively. The code for CoDet is available on GitHub.
Object detection is a fundamental vision task, but traditional detectors are mostly limited to a fixed vocabulary defined by training data. To address this limitation, CoDet focuses on the open-vocabulary setting of object detection, where the detector is trained to recognize objects of arbitrary categories. Recent advancements in vision-language pretraining on web-scale image-text pairs have inspired the adaptation of this paradigm to object detection. However, human-annotated region-text pairs are limited and difficult to scale, leading to the need for methods that can mine additional region-text pairs from image-text pairs.
Existing methods typically rely on vision-language models (VLMs) to determine region-word alignments. However, the quality of generated pseudo region-text pairs is subject to limitations of VLMs. VLMs pre-trained with image-level supervision are largely unaware of localization quality, while detector-based VLMs mitigate this issue but still face limitations in alignment accuracy for novel concepts. Additionally, obtaining high-quality region-word pairs requires a VLM with object-level vision-language knowledge, but training such a VLM depends on a large number of aligned region-word pairs.
CoDet proposes a new approach by leveraging region correspondences across images for co-occurring concept discovery and alignment. The method constructs semantic groups by sampling images that mention a shared concept in their captions. By grouping images, CoDet infers the existence of a common object corresponding to the shared concept across images. It then uses cross-image region similarity to identify regions potentially containing the common object and constructs a prototype from them. The prototype and the shared concept form a natural region-text pair, which is used to supervise the training of an open-vocabulary object detector.
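A minimal sketch of this prototype-based discovery step, assuming L2-normalized region features and using mean cross-image cosine similarity as the agreement score; the exact scoring in CoDet may differ, and all names and numbers below are synthetic, for illustration only.

```python
import numpy as np

def discover_prototype(group_features):
    """group_features: list of (num_regions_i, dim) L2-normalized arrays."""
    selected = []
    for i, feats_i in enumerate(group_features):
        # Pool region features from all *other* images in the concept group.
        others = np.concatenate(
            [f for j, f in enumerate(group_features) if j != i], axis=0
        )
        sim = feats_i @ others.T    # cosine similarity to other images' regions
        scores = sim.mean(axis=1)   # cross-image agreement per region
        # The region agreeing most with other images likely holds the common object.
        selected.append(feats_i[scores.argmax()])
    prototype = np.mean(selected, axis=0)
    return prototype / np.linalg.norm(prototype)

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Synthetic group: each image has one region near the shared object's feature
# and one unrelated distractor region.
rng = np.random.default_rng(0)
shared = unit(rng.normal(size=32))                        # common object feature
group = [
    unit(np.vstack([shared + 0.1 * rng.normal(size=32),   # region with the object
                    rng.normal(size=32)]))                # distractor region
    for _ in range(3)
]
proto = discover_prototype(group)
print(float(proto @ shared))  # high cosine similarity: the prototype recovers the shared object
```

The selected prototype, paired with the group's shared concept word, then serves as a pseudo region-text pair for training.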
CoDet avoids reliance on a pre-aligned vision-language space and relies solely on the vision space to discover region-word correspondences. However, there could be multiple co-occurring concepts in the same group, and the same concept may exhibit high intra-category variation in appearance. To address this, CoDet introduces text guidance into similarity estimation between region proposals, making it concept-aware so that it more accurately reflects the closeness of objects with respect to the shared semantic concept.
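One way to picture concept-aware similarity is to modulate the visual similarity of two regions by their relevance to the concept's text embedding, so that regions unrelated to the concept contribute little even if they look alike. The multiplicative scheme below is an illustrative assumption, not CoDet's actual formulation.

```python
import numpy as np

def concept_aware_similarity(region_a, region_b, text_emb):
    """All inputs are L2-normalized 1-D feature vectors."""
    # Relevance of each region to the concept, clipped to [0, 1].
    rel_a = max(float(region_a @ text_emb), 0.0)
    rel_b = max(float(region_b @ text_emb), 0.0)
    visual_sim = float(region_a @ region_b)
    # Visual similarity weighted by how concept-relevant both regions are.
    return rel_a * rel_b * visual_sim

def unit(v):
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Toy 2-D features: two concept-relevant regions vs. two similar but off-concept regions.
concept = unit([1.0, 0.0])
dog_a, dog_b = unit([0.9, 0.3]), unit([0.8, 0.4])        # relevant and similar
grass_a, grass_b = unit([-0.2, 1.0]), unit([-0.3, 1.0])  # similar but off-concept
print(concept_aware_similarity(dog_a, dog_b, concept) >
      concept_aware_similarity(grass_a, grass_b, concept))  # True
```

Under this weighting, visually similar background regions no longer dominate the cross-image matching, which is the effect text guidance is meant to achieve.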
The main contributions of CoDet include introducing a novel perspective in discovering region-word correspondences from image-text pairs, proposing an open-vocabulary detection framework that learns object-level vision-language alignment directly from web-crawled image-text pairs, consistently outperforming existing methods on the challenging OV-LVIS benchmark, and exhibiting strong scalability with visual backbones.
Experimental results demonstrate the effectiveness of CoDet. It achieves superior performance in novel object detection on OV-LVIS, outperforming other state-of-the-art methods. CoDet also performs well on OV-COCO, achieving the second-best performance among existing methods. The scalability of CoDet is validated by testing with more powerful visual backbones, which leads to significant performance gains. Ablation studies show that CoDet works well on web-crawled data, which is a more practical setting and can easily be scaled up.
In conclusion, CoDet presents a novel approach to open-vocabulary object detection by reformulating region-word alignment as a co-occurring object discovery problem.
The paper presents CoDet, a method for open-vocabulary object detection that leverages co-occurrence information to discover region-word alignments. The authors conduct experiments to demonstrate the effectiveness of CoDet in various scenarios.
In the first set of experiments, the authors evaluate CoDet's performance on different benchmark datasets. They compare CoDet with several existing methods and show that CoDet achieves superior results in terms of average precision (AP). CoDet outperforms the second-best method by 2.0% and 2.1% AP on COCO and Objects365 datasets, respectively. This demonstrates the generalization capability of CoDet across different image domains and vocabularies.
Next, the authors investigate different alignment strategies for discovering reliable region-word alignments. They compare strategies based on region-word similarity, hand-crafted prior, and region-region similarity. The results show that alignments based on region-region similarity produce more accurate pseudo labels compared to the other two strategies. Additionally, the authors find that their method can benefit from self-training, which leads to steadily increasing pseudo-label quality. However, this pattern is not observed in similarly self-trained models relying on region-word similarity for alignment.
The authors provide visualizations of pseudo bounding box labels generated by different alignment strategies. The visualizations show that strategies based on region-region similarity produce more accurate and reliable results compared to the other strategies.
In an ablation study, the authors analyze the impact of text guidance and the strategy for co-occurring object discovery on the performance of CoDet. They find that text guidance significantly improves the performance of CoDet, particularly on novel categories. They also compare a prototype-based strategy with a heuristic strategy for co-occurring object discovery and show that the prototype-based strategy achieves better results.
The authors further analyze the impact of the size of the concept group on the performance of CoDet. They find that increasing the group size can effectively reduce ambiguity when there are multiple co-occurring concepts in a group. However, they observe different preferences for concept group size on human-curated caption data and web-crawled image-text pairs.
In conclusion, the authors make the first attempt to explore co-occurrence information for open-vocabulary object detection. They propose CoDet, which effectively leverages cross-image region correspondences and text guidance to discover co-occurring objects for alignment. The experiments demonstrate the state-of-the-art performance of CoDet on various benchmark datasets. The authors also highlight the potential of combining their method with previous efforts in aligning regions and words with vision-language models.