YOLO-World: A Technical Summary

	info
paper	https://arxiv.org/abs/2401.17270
code	https://github.com/AILab-CVC/YOLO-World
org	Tencent
demo	https://huggingface.co/spaces/stevengrove/YOLO-World
personal blog	http://www.myhz0606.com/article/yolo_world

1 Motivation

This paper tackles open-vocabulary object detection (OVD) from the angle of computational efficiency.


2 Method

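In classic object detection, an instance annotation is a pair of bounding box and category, $\Omega = \{ B_i, c_i\}^{N}_{i=1}$ . For OVD the annotation becomes $\Omega = \{ B_i, t_i\}^{N}_{i=1}$ , where $t_i$ can be a category name, a noun phrase, or an object description. In addition, given an image and text, YOLO-World outputs the predicted boxes together with the corresponding object embeddings.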
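To make the difference between the two annotation formats concrete, here is a minimal sketch. The class names, the `(x1, y1, x2, y2)` box layout, and the `vocabulary` helper are my own illustrative choices, not anything defined by the paper or the YOLO-World code base.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box layout: (x1, y1, x2, y2) corner coordinates in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class ClosedSetInstance:
    """Classic detection annotation (B_i, c_i)."""
    box: Box
    class_id: int  # index into a fixed, predefined label list


@dataclass
class OpenVocabInstance:
    """Open-vocabulary annotation (B_i, t_i)."""
    box: Box
    text: str  # category name, noun phrase, or free-form description


def vocabulary(instances: List[OpenVocabInstance]) -> List[str]:
    """Collect the distinct texts of one image, keeping first-seen order."""
    seen: List[str] = []
    for inst in instances:
        if inst.text not in seen:
            seen.append(inst.text)
    return seen


annotations = [
    OpenVocabInstance((10, 20, 110, 220), "a brown dog"),
    OpenVocabInstance((150, 40, 300, 260), "person"),
    OpenVocabInstance((0, 0, 50, 50), "person"),
]
print(vocabulary(annotations))  # ['a brown dog', 'person']
```

The point of the sketch is that the label space is no longer a fixed integer range: each image carries its own set of texts, and those texts are what the text encoder has to embed.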

2.1 Model architecture

[Figure: YOLO-World model architecture]

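The model architecture consists of three parts:

YOLO backbone: extracts multi-scale image features.
Text encoder: extracts features of noun phrases. The flow is as follows: given a piece of text, the noun phrases in it are extracted first, and each extracted noun phrase is then fed into CLIP to obtain a vector. The output of the text encoder is therefore $W \in \mathbb{R}^{C \times D}$ , where $C$ is the number of noun phrases and $D$ is the embedding dimension.
Vision-Language PAN: predicts the bounding boxes and object embeddings. Its architecture is shown in the figure below. It has two core components, the Text-guided CSPLayer and Image-Pooling Attention, which are briefly introduced next.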

**Text-guided CSPLayer**

The purpose of this layer is to enhance the image features with the text embeddings. It is computed as follows:

$X_l^{\prime} = X_l \cdot \delta \left( \max_{j \in \{1..C\}} \left( X_l W_j^{\top} \right) \right)^{\top} \tag{1}$

Here $X_l \in \mathbb{R}^{H \times W \times D}$ ( $l \in \{3, 4, 5\}$ ) are the multi-scale image features, $W_j$ is the text embedding of the $j$-th noun phrase, and $\delta$ is the sigmoid function.
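The max-sigmoid gating in Eq. (1) can be sketched in a few lines of NumPy. This is only an illustration of the formula under my own simplifications: in the paper the gating sits inside a CSPLayer with convolutions, which are omitted here, and the function names and tensor sizes are made up for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def max_sigmoid_gate(X, W):
    """Apply Eq. (1) to one feature level.

    X: (H, W_img, D) image features of one scale.
    W: (C, D) text embeddings, one row per noun phrase.
    Returns an array with the same shape as X.
    """
    sim = X @ W.T                             # (H, W_img, C): pixel-to-text similarity
    best = sim.max(axis=-1, keepdims=True)    # (H, W_img, 1): best-matching text per pixel
    return X * sigmoid(best)                  # scale each pixel's feature by its gate


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4, 8))   # toy feature map, D = 8
W = rng.normal(size=(3, 8))      # C = 3 noun phrases
X_new = max_sigmoid_gate(X, W)
print(X_new.shape)  # (4, 4, 8)
```

Each spatial position is scaled by a single scalar in (0, 1), so positions that match at least one noun phrase well keep most of their feature magnitude while unrelated positions are damped.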

[Figure: Vision-Language PAN with the Text-guided CSPLayer and Image-Pooling Attention]

**Image-Pooling Attention**

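The purpose of this layer is to enhance the text embeddings with the image features. Concretely, each scale of the multi-scale image features is max-pooled to a size of $3 \times 3 \times D$ , i.e. 9 patch tokens; with 3 scales this gives 27 patch tokens in total, denoted $\tilde{X} \in \mathbb{R}^{27 \times D}$ . These 27 patch tokens then serve as the key and value of a cross-attention, with the text embeddings as the query, which yields image-aware text embeddings.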

$W^{\prime} = W + \mathrm{MultiHeadAttention}(W, \tilde{X}, \tilde{X}) \tag{2}$
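The pooling and cross-attention described above can be sketched in NumPy as follows. To keep it short I make several simplifying assumptions that differ from a real implementation: a single attention head, no learned query/key/value projections, and non-overlapping pooling bins built with `np.array_split` (each map must be at least 3 x 3). All function names are hypothetical.

```python
import numpy as np


def adaptive_max_pool(X, out=3):
    """Max-pool an (H, W_img, D) map down to (out, out, D)."""
    H, Wd, D = X.shape
    rows = np.array_split(np.arange(H), out)
    cols = np.array_split(np.arange(Wd), out)
    pooled = np.empty((out, out, D), dtype=X.dtype)
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            pooled[i, j] = X[r][:, c].max(axis=(0, 1))
    return pooled


def pool_tokens(feature_maps):
    """Turn the multi-scale maps into one (9 * num_scales, D) token matrix."""
    return np.concatenate(
        [adaptive_max_pool(X).reshape(-1, X.shape[-1]) for X in feature_maps],
        axis=0,
    )


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def image_pooling_attention(W, tokens):
    """Single-head version of W' = W + Attention(W, X~, X~)."""
    scale = np.sqrt(W.shape[1])
    attn = softmax(W @ tokens.T / scale)   # (C, 27): text queries over patch tokens
    return W + attn @ tokens               # residual update of the text embeddings


rng = np.random.default_rng(0)
feature_maps = [rng.normal(size=(s, s, 8)) for s in (12, 6, 3)]  # three toy scales
W = rng.normal(size=(4, 8))                                      # C = 4 noun phrases
tokens = pool_tokens(feature_maps)
W_new = image_pooling_attention(W, tokens)
print(tokens.shape, W_new.shape)  # (27, 8) (4, 8)
```

Because the keys and values are only 27 tokens regardless of image resolution, the cost of this attention depends on the number of noun phrases rather than on the number of pixels.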

2.2 Optimization objective

The optimization objective has two parts: a region-text contrastive loss $\mathcal{L}_{\mathrm{con}}$ for the semantics, and, for the boxes, an IoU loss $\mathcal{L}_{\mathrm{iou}}$ and a distribution focal loss $\mathcal{L}_{\mathrm{dfl}}$ . The overall objective is:

$\mathcal{L}(I) = \mathcal{L}_{\mathrm{con}} + \lambda_I \cdot ( \mathcal{L}_{\mathrm{iou}} + \mathcal{L}_{\mathrm{dfl}} ) \tag{3}$
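A small sketch of how the three terms combine. As I read the paper, $\lambda_I$ acts as an indicator: 1 when the image comes from detection or grounding data, and 0 when it comes from image-text data whose boxes are pseudo labels, so that noisy boxes do not drive the box regression. Treat that reading, the function name, and the `has_accurate_boxes` flag as assumptions of this example.

```python
def total_loss(l_con, l_iou, l_dfl, has_accurate_boxes):
    """Combine the loss terms as in Eq. (3).

    has_accurate_boxes: hypothetical flag standing in for lambda_I
    (True -> 1.0, False -> 0.0).
    """
    lam = 1.0 if has_accurate_boxes else 0.0
    return l_con + lam * (l_iou + l_dfl)


# Detection/grounding sample: all three terms contribute.
print(total_loss(0.5, 0.2, 0.3, has_accurate_boxes=True))
# Image-text sample with pseudo boxes: only the contrastive term remains.
print(total_loss(0.5, 0.2, 0.3, has_accurate_boxes=False))
```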

2.3 Some details

2.3.1 Generating training annotations automatically at scale

Image-text pair data is easy to obtain nowadays. The goal here is to convert image-text pairs into images with instance annotations of the form $\Omega = \{ B_i, t_i\}^{N}_{i=1}$ .

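The authors' approach is as follows:

[STEP1]: Extract the noun phrases from the text. One way to do this with NLTK: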

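import nltk
from nltk import word_tokenize, pos_tag

# One-time download of the tokenizer and POS-tagger models.
# Newer NLTK releases may ask for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead of the two names below.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# An NP chunk: optional determiner, any adjectives, then one or more nouns.
GRAMMAR = 'NP: {<DT>?<JJ.*>*<NN.*>+}'

def extract_noun_phrases(text):
    """Return the noun phrases found in `text`, in order of appearance."""
    # Tag the raw tokens with punctuation left in: punctuation tags never
    # match the grammar, so a comma keeps "a dog, a cat" as two phrases
    # instead of letting adjacent nouns merge into one chunk.
    tagged = pos_tag(word_tokenize(text))
    tree = nltk.RegexpParser(GRAMMAR).parse(tagged)

    noun_phrases = []
    for subtree in tree.subtrees():
        if subtree.label() == 'NP':
            noun_phrases.append(' '.join(word for word, _tag in subtree.leaves()))
    return noun_phrases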

[STEP2]: Feed the image and the extracted noun phrases into GLIP to detect bounding boxes.

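[STEP3]: Feed (region_img, region_text) and (img, text) into CLIP to compute relevance scores. If the relevance is low, the image is filtered out (the authors' rule is to keep it when $\sqrt{ s^{img} \cdot s^{region} } > 0.3$ ). Redundant bounding boxes are then removed with NMS.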
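The scoring rule and the NMS step can be sketched in plain Python. This is an illustration, not the authors' pipeline: the CLIP scores are assumed to be already computed and to lie in [0, 1], boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the IoU threshold of 0.5 is my assumption.

```python
import math


def keep_sample(s_img, s_region, threshold=0.3):
    """Keep a sample when sqrt(s_img * s_region) exceeds the threshold."""
    return math.sqrt(s_img * s_region) > threshold


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))        # [0, 2]
print(keep_sample(0.5, 0.4))     # True
print(keep_sample(0.2, 0.1))     # False
```

The geometric mean means a sample survives only if both the image-level and the region-level scores are reasonably high; one very low score pulls the product down even when the other is high.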