FamNet: A Few-Shot Counting Network (Learning To Count Everything)

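Most counting methods target a single object category, such as crowd counting, vehicle counting, or animal counting. Some methods can count several categories, but the categories in the training set and the test set must be identical.
To make counting more extensible, Viresh Ranjan et al. proposed FamNet, a network for few-shot counting (FSC), at CVPR 2021.
Paper: Learning To Count Everything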

1 FamNet Overview

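Few-shot counting is illustrated in the figure below: given one or a few support exemplars that describe the object category to count, and a query image to be counted, FSC aims to compute how many objects of that category appear in the query.
Besides the categories seen in the training set (called base classes), at test time FSC must also handle categories it has never seen (called novel classes).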

[Figure: the few-shot counting task]

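From today's perspective, few-shot counting baselines fall into the following main groups:

The first group is feature-based methods (panel a in the figure below): the support and query features are concatenated, and a regression head is trained on them to predict the density map. A typical network is Class-Agnostic Counting (Lu et al., ACCV 2018).
The second group is similarity-based methods (panel b): a distance metric is computed between the support and query features to obtain a similarity map, from which the density map is regressed. Learning To Count Everything (Ranjan et al., CVPR 2021), the subject of this post, belongs to this group.
Each approach has strengths and weaknesses. Feature-based methods preserve semantic information better, whereas similarity-based methods capture the relationship between support and query better, so some methods combine the two (panel c), for example Few-shot Object Counting with Similarity-Aware Feature Enhancement (You et al., WACV 2023).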

[Figure: (a) feature-based, (b) similarity-based, and (c) combined few-shot counting pipelines]

1.1 FamNet Architecture

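The vast majority of counting methods handle only a single category, for two reasons:
- Counting requires very fine-grained annotation
- There was no dataset containing many categories
To address the first problem, the paper treats counting as a few-shot regression task.
- It proposes the FamNet architecture for few-shot counting and introduces a few-shot adaptation scheme at test time to improve performance
- Notably, the categories used at test time are completely different from those used during training
To address the second problem, the authors built their own dataset (see the figure below): FSC-147, a few-shot counting dataset.
- It contains more than 6000 images across 147 categories, mainly kitchen utensils, office stationery, vehicles, and animals.
- Object counts range from 7 to 3731 per image, with an average of 56 objects per image. Each object instance is annotated with a point at its approximate center.
- In addition, three object instances are randomly selected in each image and annotated with bounding boxes to serve as exemplars.
- Note the dataset split: training set, 3659 images and 89 categories; validation set, 1286 images and 29 categories; test set, 1190 images and 29 categories; 147 categories in total. The training, validation, and test sets therefore share no categories.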

[Figure: sample images from FSC-147]

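The figure below shows the FamNet architecture:

Input: an RGB image plus a few exemplar bounding boxes of the object to count (the fish in red boxes in the figure below; about 3 boxes are enough).
Output: a density map (same spatial size as the input image, with a single channel), from which the count of the target object is obtained by summing.
FamNet consists of two core modules: a multi-scale feature extraction module (green rectangle) and a density prediction module (green rectangle).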

[Figure: FamNet architecture]

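Multi-scale feature extraction module

To handle a wide range of object categories, the paper extracts features with a network pre-trained on ImageNet and keeps its parameters frozen during training, with no fine-tuning.
Specifically, FamNet uses the first four blocks of an ImageNet-pretrained ResNet-50 for multi-scale feature extraction, and the weights of these blocks do not change while FamNet is trained.
The feature maps from the third and fourth blocks of ResNet-50 represent the input image, and ROI pooling (purple rectangle) is applied to these maps to obtain multi-scale exemplar features.
- In the FamNet source code, ROI pooling is implemented very simply, directly with PyTorch's interpolation function.
- ROI pooling was introduced in Fast R-CNN and is also used in Faster R-CNN. In Faster R-CNN, the proposals generated by the RPN are projected onto the feature map to obtain the corresponding feature patches; because these patches differ in size, an ROI Pooling layer rescales them to feature maps of the same size (7×7 in Faster R-CNN), which are then flattened and passed through a series of fully connected layers to produce the predictions.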
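The interpolation-based pooling described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `crop_and_resize` and the toy tensor sizes are assumptions, and the only PyTorch call relied on is `F.interpolate`.

```python
import torch
import torch.nn.functional as F

def crop_and_resize(feature_map, boxes):
    """Crop each (y1, x1, y2, x2) box from a (1, C, H, W) feature map and
    resize all crops to the largest box height/width with bilinear interpolation."""
    out_h = max(y2 - y1 for y1, x1, y2, x2 in boxes)
    out_w = max(x2 - x1 for y1, x1, y2, x2 in boxes)
    crops = []
    for y1, x1, y2, x2 in boxes:
        crop = feature_map[:, :, y1:y2, x1:x2]
        crop = F.interpolate(crop, size=(out_h, out_w), mode="bilinear", align_corners=False)
        crops.append(crop)
    # Stack along the batch axis: (num_boxes, C, out_h, out_w)
    return torch.cat(crops, dim=0)

feature_map = torch.randn(1, 8, 32, 48)      # toy feature map (hypothetical sizes)
boxes = [(2, 3, 9, 12), (10, 20, 14, 30)]    # two boxes of different sizes
pooled = crop_and_resize(feature_map, boxes)
print(pooled.shape)  # torch.Size([2, 8, 7, 10])
```

Resizing every crop to a common size is what later allows the exemplar features to be stacked and used together as one convolution kernel.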

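Density prediction module

To keep the model from being tied to specific classes, the density prediction module is designed to be class-agnostic.
Clearly, to stay class-agnostic the extracted image features cannot be fed directly into the density prediction module. Instead, the authors use a correlation map to represent the similarity between the exemplar features and the image features (gray rectangle in the figure above) and then perform density prediction on it; the similarity is computed with a convolution that uses the exemplar features as the kernel.
Likewise, because objects may vary in scale, the authors rescale the exemplar features (by factors of 0.9 and 1.1) to generate several correlation maps, which are finally concatenated and fed into the density prediction module.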
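As a rough sketch of the correlation step, the exemplar features can be used directly as convolution weights. The function name `correlation_map` and the tensor sizes below are assumptions for illustration; the padding rule is chosen so that the output keeps the spatial size of the image features.

```python
import torch
import torch.nn.functional as F

def correlation_map(image_features, exemplar_features):
    """image_features: (1, C, H, W); exemplar_features: (M, C, h, w).
    Returns (1, M, H, W): one similarity map per exemplar."""
    h, w = exemplar_features.shape[2], exemplar_features.shape[3]
    # F.pad order for the last two dims is (left, right, top, bottom).
    # left + right = w - 1 and top + bottom = h - 1, so the output stays H x W.
    padded = F.pad(image_features, (w // 2, (w - 1) // 2, h // 2, (h - 1) // 2))
    return F.conv2d(padded, exemplar_features)

image_features = torch.randn(1, 16, 20, 30)   # toy image features
exemplars = torch.randn(3, 16, 4, 5)          # three exemplars, already resized to 4 x 5
corr = correlation_map(image_features, exemplars)
print(corr.shape)  # torch.Size([1, 3, 20, 30])
```

Each output value is the dot product between one exemplar and the image patch under it, which is why the result is independent of any class label.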

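1.2 Loss Functions in FamNet

During training, the loss is the standard MSE loss (only the density prediction module is trained).
At prediction or test time, the loss is a weighted sum of the Min-Count loss and the Perturbation loss.
You may wonder why a loss is needed at prediction time at all.
- During training, only the appearance features of the exemplars are used.
- At inference, the authors propose a new scheme that exploits the location information of the exemplar bounding boxes to improve the density prediction module (as the figure below shows, adding Adaptation at inference improves accuracy; the source code runs 100 gradient steps by default).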

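Min-Count loss
$L_{MinCount}=\sum_{b\in B}\max(0,\ 1-\|Z_b\|_1)$
where $B$ is the set of all given exemplar boxes, $Z_b$ is the crop of the density map $Z$ at box $b$, and $\|Z_b\|_1$ is the sum of all values in $Z_b$.
If the mathematical expression looks opaque, let's go straight to the code.

From the code below, we can see that MinCountLoss simply constrains every exemplar box to contain at least one object.
When $\|Z_b\|_1 \ge 1$, the term for box $b$ is 0. The Min-Count loss therefore only penalizes boxes whose predicted density sums to less than 1.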

import torch
import torch.nn.functional as F

# Constrains every exemplar box to contain at least one object
def MincountLoss(output, boxes, use_gpu=True):
    ones = torch.ones(1)
    if use_gpu:
        ones = ones.cuda()
    Loss = 0.
    if boxes.shape[1] > 1:
        boxes = boxes.squeeze()
        # Iterate over every provided bbox (columns 1 to 4 hold y1, x1, y2, x2)
        for tempBoxes in boxes.squeeze():
            y1 = int(tempBoxes[1])
            y2 = int(tempBoxes[3])
            x1 = int(tempBoxes[2])
            x2 = int(tempBoxes[4])
            # Sum the density inside the bbox
            X = output[:, :, y1:y2, x1:x2].sum()
            # Penalize only when the sum is at most 1, so that every exemplar box holds at least one object
            if X.item() <= 1:
                Loss += F.mse_loss(X, ones)
    else:
        boxes = boxes.squeeze()
        y1 = int(boxes[1])
        y2 = int(boxes[3])
        x1 = int(boxes[2])
        x2 = int(boxes[4])
        X = output[:, :, y1:y2, x1:x2].sum()
        if X.item() <= 1:
            Loss += F.mse_loss(X,ones)  
    return Loss
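One detail worth noting: the formula is a hinge, max(0, 1 - sum), whereas the listing above applies a squared error (via `F.mse_loss`) once the sum is at most 1. The small NumPy sketch below computes both per-box penalties side by side; the function name `min_count_terms` and the toy density map are assumptions for illustration.

```python
import numpy as np

def min_count_terms(density, boxes):
    """Return (hinge, squared) per-box penalties for a 2D density map.
    hinge   = max(0, 1 - sum)                 (form written in the formula)
    squared = (1 - sum)^2 if sum <= 1 else 0  (form used by the listing via MSE)"""
    hinge, squared = [], []
    for y1, x1, y2, x2 in boxes:
        s = float(density[y1:y2, x1:x2].sum())
        hinge.append(max(0.0, 1.0 - s))
        squared.append((1.0 - s) ** 2 if s <= 1.0 else 0.0)
    return hinge, squared

density = np.zeros((8, 8), dtype=np.float32)
density[1:3, 1:3] = 0.125   # first box sums to 0.5, so it is penalized
density[5:7, 5:7] = 0.5     # second box sums to 2.0, so there is no penalty
hinge, squared = min_count_terms(density, [(1, 1, 3, 3), (5, 5, 7, 7)])
print(hinge, squared)  # [0.5, 0.0] [0.25, 0.0]
```

Both variants vanish exactly when a box already contains at least one count; they differ only in how strongly a shortfall is penalized.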

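Perturbation loss

The density map $Z$ is essentially the correlation (convolution) response between the exemplars and the image.
Ideally, the density values around each exemplar should also follow a Gaussian distribution, so any deviation from that distribution incurs a loss.
$G_{h \times w}$ is a 2D Gaussian window of size $h \times w$.
The combined adaptation loss used at prediction time is the weighted sum of the Min-Count loss and the Perturbation loss.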
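Written out with the symbols already defined in this section, the two losses take roughly the following form. The weights $\lambda_1$ and $\lambda_2$ are notation assumed here for the weighted sum; in the prediction code they appear to correspond to `args.weight_mincount` and `args.weight_perturbation`.

$L_{Per}=\sum_{b\in B}\|Z_b-G_{h\times w}\|_2^2$

$L_{Adapt}=\lambda_1 L_{MinCount}+\lambda_2 L_{Per}$

Note that the listing for the Perturbation loss uses `F.mse_loss`, which averages the squared differences over the box instead of summing them, so the code matches the formula only up to a per-box scale factor.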

[Figure]

def PerturbationLoss(output, boxes, sigma=8, use_gpu=True):
    Loss = 0.
    if boxes.shape[1] > 1:
        boxes = boxes.squeeze()
        # Iterate over every provided bbox
        for tempBoxes in boxes.squeeze():
            y1 = int(tempBoxes[1])
            y2 = int(tempBoxes[3])
            x1 = int(tempBoxes[2])
            x2 = int(tempBoxes[4])
            out = output[:, :, y1:y2, x1:x2]
            # The authors assume the predicted density inside each bbox follows a Gaussian, so deviations from it are penalized
            GaussKernel = matlab_style_gauss2D(shape=(out.shape[2], out.shape[3]), sigma=sigma)
            GaussKernel = torch.from_numpy(GaussKernel).float()
            if use_gpu:
                GaussKernel = GaussKernel.cuda()
            Loss += F.mse_loss(out.squeeze(), GaussKernel)
    else:
        boxes = boxes.squeeze()
        y1 = int(boxes[1])
        y2 = int(boxes[3])
        x1 = int(boxes[2])
        x2 = int(boxes[4])
        out = output[:, :, y1:y2, x1:x2]
        Gauss = matlab_style_gauss2D(shape=(out.shape[2], out.shape[3]), sigma=sigma)
        GaussKernel = torch.from_numpy(Gauss).float()
        if use_gpu:
            GaussKernel = GaussKernel.cuda()
        Loss += F.mse_loss(out.squeeze(), GaussKernel)
    return Loss
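The listing relies on a helper, `matlab_style_gauss2D`, that is not shown in this post. A plausible sketch of such a helper is given below under the name `gauss2d` (a hypothetical stand-in, not necessarily identical to the repository's version): it builds a Gaussian window of the requested shape and normalizes it to sum to 1.

```python
import numpy as np

def gauss2d(shape, sigma):
    """Normalized 2D Gaussian window of the given (h, w) shape, centred on the window."""
    m, n = [(s - 1.0) / 2.0 for s in shape]
    y, x = np.ogrid[-m:m + 1, -n:n + 1]
    g = np.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
    # Zero out numerically negligible values before normalizing
    g[g < np.finfo(g.dtype).eps * g.max()] = 0
    total = g.sum()
    if total != 0:
        g /= total
    return g

kernel = gauss2d((5, 7), sigma=8)
print(kernel.shape)  # (5, 7)
```

Because the window sums to 1, matching the density inside an exemplar box to it pushes that box toward holding exactly one count, concentrated near the box center.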

The experiments are not covered here; see the original paper. Only one comparison with other few-shot methods is shown.

[Figure: comparison with other few-shot methods]

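2 FamNet Source Code Analysis

The source code comes from the authors' official implementation: GitHub - cvlab-stonybrook/LearningToCountEverything

Below, we walk through the source code to build a deeper understanding of FamNet.

The listings below add explanatory comments to the original source and make minor adjustments. Repository: FamNet few-shot counting network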

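2.1 Model Creation

Features are extracted with the third and fourth blocks of a ResNet-50 pre-trained on ImageNet (torchvision's layer2 and layer3, exposed below as map3 and map4).
The density prediction module CountRegressor consists of 5 convolution layers and 3 upsampling layers (each upsampling by 2×).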

# LearningToCountEverything/model.py
from collections import OrderedDict

import torch
import torch.nn as nn
import torchvision

class Resnet50FPN(nn.Module):
    def __init__(self):
        super(Resnet50FPN, self).__init__()
        self.resnet = torchvision.models.resnet50(pretrained=True)
        children = list(self.resnet.children())
        self.conv1 = nn.Sequential(*children[:4])
        self.conv2 = children[4]
        self.conv3 = children[5]
        self.conv4 = children[6]

    def forward(self, im_data):
        feat = OrderedDict()
        feat_map = self.conv1(im_data)
        feat_map = self.conv2(feat_map)
        feat_map3 = self.conv3(feat_map)
        feat_map4 = self.conv4(feat_map3)
        feat['map3'] = feat_map3
        feat['map4'] = feat_map4
        return feat

class CountRegressor(nn.Module):
    def __init__(self, input_channels, pool='mean'):
        super(CountRegressor, self).__init__()
        self.pool = pool
        # Three 2x upsampling steps bring the feature map back to the original image size
        self.regressor = nn.Sequential(
            nn.Conv2d(in_channels=input_channels, out_channels=196, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.UpsamplingBilinear2d(scale_factor=2),
            nn.Conv2d(in_channels=196, out_channels=128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.UpsamplingBilinear2d(scale_factor=2),
            nn.Conv2d(in_channels=128, out_channels=64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.UpsamplingBilinear2d(scale_factor=2),
            nn.Conv2d(in_channels=64, out_channels=32, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(in_channels=32, out_channels=1, kernel_size=1),  # single output channel
            nn.ReLU(),
        )

    def forward(self, im):
        num_sample = im.shape[0]
        if num_sample == 1:
            # im.squeeze(0) drops the leading batch dimension
            output = self.regressor(im.squeeze(0))
            if self.pool == 'mean':
                # Average the density maps predicted for the individual exemplar boxes
                output = torch.mean(output, dim=(0), keepdim=True)
                return output
            elif self.pool == 'max':
                output, _ = torch.max(output, 0, keepdim=True)
                return output
        else:
            for i in range(0, num_sample):
                output = self.regressor(im[i])
                if self.pool == 'mean':
                    output = torch.mean(output, dim=(0), keepdim=True)
                elif self.pool == 'max':
                    output, _ = torch.max(output, 0, keepdim=True)
                if i == 0:
                    Output = output
                else:
                    Output = torch.cat((Output, output), dim=0)
            return Output

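2.2 Model Training

Training uses the MSE loss.
The multi-scale feature extraction module is Resnet50FPN, and the density prediction module is CountRegressor.
During training the ResNet-50 block weights stay fixed; only the CountRegressor parameters are updated.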

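# LearningToCountEverything/train.py
import random
from os.path import join

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from PIL import Image
from tqdm import tqdm
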
def train(args):
    device = torch.device(args.gpu if torch.cuda.is_available() else "cpu")

    # MSE loss
    criterion = nn.MSELoss().to(device)

    # 1. Multi-scale feature extraction module
    resnet50_conv = Resnet50FPN().to(device)
    resnet50_conv.eval()
    # 2. Density prediction module
    # input_channels=6 because map3 and map4 each yield correlation maps at three scales (0.9, 1.0, 1.1)
    regressor = CountRegressor(input_channels=6, pool='mean').to(device)
    weights_normal_init(regressor, dev=0.001)
    regressor.train()

    # 3. Resnet50FPN stays frozen; only the CountRegressor parameters are passed to the optimizer
    optimizer = optim.Adam(regressor.parameters(), lr=args.learning_rate)

    best_mae, best_rmse = 1e7, 1e7
    stats = list()
    for epoch in range(0, args.epochs):
        regressor.train()
        # Train for one epoch
        train_loss, train_mae, train_rmse = train_one_epoch(args, resnet50_conv, optimizer, regressor, criterion, device)
        regressor.eval()
        # Evaluate on the validation set
        val_mae, val_rmse = eval(args, resnet50_conv, regressor, device)
        # Save the loss and metrics collected so far to a file
        stats.append((train_loss, train_mae, train_rmse, val_mae, val_rmse))
        stats_file = join(args.output_dir, "stats" + ".txt")
        with open(stats_file, 'w') as f:
            for s in stats:
                f.write("%s\n" % ','.join([str(x) for x in s]))
        if val_mae <= best_mae:  # keep the checkpoint whenever validation MAE improves
            best_mae = val_mae
            best_rmse = val_rmse
            model_name = args.output_dir + '/' + f"FamNet_{best_mae}.pth"
            torch.save(regressor.state_dict(), model_name)
        print("Epoch {}, Avg. Epoch Loss: {} Train MAE: {} Train RMSE: {} Val MAE: {} Val RMSE: {} Best Val MAE: {} Best Val RMSE: {} ".format(
                epoch + 1, stats[-1][0], stats[-1][1], stats[-1][2], stats[-1][3], stats[-1][4], best_mae, best_rmse))

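train_one_epoch is shown below.
During training, the image size must be preprocessed:
- During training the density map must be resized along with the image; at prediction or test time there is no density map to process.
- If both height and width are below the specified limit, no resizing is done.
- If the width or height exceeds the limit, the image is resized so that:
  - the larger of the new height and new width does not exceed the limit
  - the new height and new width are divisible by 8
  - the aspect ratio is preserved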
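The resizing rule above can be sketched with plain integer arithmetic. The function name `resized_hw` and the default limit `max_hw=1504` are assumptions made for this illustration, and rounding down to a multiple of 8 means the aspect ratio is preserved only approximately.

```python
def resized_hw(height, width, max_hw=1504):
    """Return the (new_height, new_width) implied by the resizing rule."""
    if height <= max_hw and width <= max_hw:
        return height, width  # small images are left untouched
    longest = max(height, width)
    # Scale so the longer side equals max_hw, then round down to a multiple of 8
    new_h = (height * max_hw // longest) // 8 * 8
    new_w = (width * max_hw // longest) // 8 * 8
    return new_h, new_w

print(resized_hw(400, 300))    # (400, 300)
print(resized_hw(3000, 2000))  # (1504, 1000)
```

Divisibility by 8 matters because map3 has a stride of 8 and the regressor upsamples by 2× three times, so the predicted density map then matches the input size exactly.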
The core function is extract_features, which produces correlation maps at different scales; the 6 correlation maps are then concatenated and fed into the density prediction module regressor.

# LearningToCountEverything/train.py
# Train the model for one epoch
def train_one_epoch(args, resnet50_conv, optimizer, regressor, criterion, device):
    print("Training on FSC147 train set data")
    # Load the dataset and its metadata
    annotations, data_split, im_dir, gt_dir = load_data(args)
    # Image file names in the training split, e.g. ['7.jpg', ...]
    im_ids = data_split['train']
    random.shuffle(im_ids)
    train_mae = 0
    train_rmse = 0
    train_loss = 0
    cnt = 0

    pbar = tqdm(im_ids)
    for im_id in pbar:
        cnt += 1
        anno = annotations[im_id]
        bboxes = anno['box_examples_coordinates']
        # 1. Collect the bboxes of the few exemplar objects in this image
        rects = list()
        for bbox in bboxes:
            x1 = bbox[0][0]
            y1 = bbox[0][1]
            x2 = bbox[2][0]
            y2 = bbox[2][1]
            rects.append([y1, x1, y2, x2])
        # 2. Load the image
        image = Image.open('{}/{}'.format(im_dir, im_id))
        image.load()
        # 3. Load the ground-truth density map
        density_path = gt_dir + '/' + im_id.split(".jpg")[0] + ".npy"
        density = np.load(density_path).astype('float32')
        # 4. Pack everything into a dict
        sample = {'image': image,
                  'lines_boxes': rects,
                  'gt_density': density}
        # Preprocess the image (and density map) for training
        sample = TransformTrain(sample)
        image, boxes, gt_density = sample['image'].to(device), sample['boxes'].to(device), sample['gt_density'].to(device)

        with torch.no_grad():
            # Compute the correlation maps at the different scales and concatenate them
            features = extract_features(resnet50_conv, image.unsqueeze(0), boxes.unsqueeze(0), MAPS, Scales)
        features.requires_grad = True
        optimizer.zero_grad()
        # Feed the 6 concatenated correlation maps into the density prediction module
        output = regressor(features)

        # if image size isn't divisible by 8, gt size is slightly different from output size
        if output.shape[2] != gt_density.shape[2] or output.shape[3] != gt_density.shape[3]:
            orig_count = gt_density.sum().detach().item()
            gt_density = F.interpolate(gt_density, size=(output.shape[2], output.shape[3]), mode='bilinear')
            new_count = gt_density.sum().detach().item()
            if new_count > 0:
                # Rescale so that the resized gt_density keeps the original count
                gt_density = gt_density * (orig_count / new_count)
        # Compute the MSE loss
        loss = criterion(output, gt_density)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
        pred_cnt = torch.sum(output).item()
        gt_cnt = torch.sum(gt_density).item()
        cnt_err = abs(pred_cnt - gt_cnt)
        # Accumulate the absolute and squared count errors for MAE and RMSE
        train_mae += cnt_err
        train_rmse += cnt_err ** 2
        pbar.set_description('GT cnt: {:6.1f}, Pred cnt: {:6.1f}, Error: {:6.1f}. Current MAE: {:5.2f}, RMSE: {:5.2f}'.format(
            gt_cnt, pred_cnt, abs(pred_cnt - gt_cnt), train_mae/cnt, (train_rmse/cnt)**0.5))
    # Average train_loss, train_mae and train_rmse over the epoch
    train_loss = train_loss / len(im_ids)
    train_mae = (train_mae / len(im_ids))
    train_rmse = (train_rmse / len(im_ids))**0.5
    return train_loss, train_mae, train_rmse
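The running sums kept in train_one_epoch reduce to the usual MAE and RMSE over per-image count errors. A minimal standalone sketch (the function name `count_metrics` and the sample counts are made up for illustration):

```python
import math

def count_metrics(pred_counts, gt_counts):
    """MAE and RMSE of predicted counts against ground-truth counts."""
    errors = [abs(p - g) for p, g in zip(pred_counts, gt_counts)]
    mae = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

# Three images with absolute count errors 2, 2 and 0
mae, rmse = count_metrics([10.0, 52.0, 7.0], [12.0, 50.0, 7.0])
print(mae, rmse)
```

RMSE weights large misses more heavily than MAE, which is why counting papers usually report both.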

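The core function is extract_features

It uses a correlation map to represent the similarity between the exemplar features and the image features (implemented with a convolution)
Bilinear interpolation implements the ROI Pooling step, so that all exemplars in one image share the same height and width
Because objects may vary in scale, the exemplar features are rescaled (0.9, 1.0, 1.1)
Several correlation maps are generated (map3 and map4 each yield correlation maps at three scales); they are finally concatenated and fed into the density prediction module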

# LearningToCountEverything/utils.py
import math

import torch
import torch.nn.functional as F

def extract_features(feature_model, image, boxes, feat_map_keys=['map3','map4'], exemplar_scales=[0.9, 1.1]):
    """
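      1. A correlation map represents the similarity between the exemplar features and the image features (implemented with a convolution)
      2. Because objects may vary in scale, the exemplar features are also rescaled (0.9 and 1.1)
      3. Several correlation maps are generated (map3 and map4 each yield three scales); they are concatenated and fed into the density prediction module
    """
    N, M = image.shape[0], boxes.shape[2] # number of images and number of exemplars

    # The feature maps from the third and fourth ResNet-50 blocks represent the input image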
    # Getting features for the image N * C * H * W
    Image_features = feature_model(image)

    # Getting features for the examples (N*M) * C * h * w
    for ix in range(0, N):
        # boxes = boxes.squeeze(0)
        boxes = boxes[ix][0]  # exemplar bboxes of one image (note: this rebinds boxes, so the loop only works for N == 1)
        cnter = 0
        for keys in feat_map_keys:
            image_features = Image_features[keys][ix].unsqueeze(0)
            if keys == 'map1' or keys == 'map2':
                Scaling = 4.0
            elif keys == 'map3':
                Scaling = 8.0
            elif keys == 'map4':
                Scaling = 16.0
            else:
                Scaling = 32.0
            # Scale the exemplar bboxes to the feature-map resolution
            boxes_scaled = boxes / Scaling
            boxes_scaled[:, 1:3] = torch.floor(boxes_scaled[:, 1:3])
            boxes_scaled[:, 3:5] = torch.ceil(boxes_scaled[:, 3:5])
            # make the end indices exclusive
            boxes_scaled[:, 3:5] = boxes_scaled[:, 3:5] + 1
            feat_h, feat_w = image_features.shape[-2], image_features.shape[-1]
            # make sure exemplars don't go out of bound
            boxes_scaled[:, 1:3] = torch.clamp_min(boxes_scaled[:, 1:3], 0)
            boxes_scaled[:, 3] = torch.clamp_max(boxes_scaled[:, 3], feat_h)
            boxes_scaled[:, 4] = torch.clamp_max(boxes_scaled[:, 4], feat_w)            
            box_hs = boxes_scaled[:, 3] - boxes_scaled[:, 1]
            box_ws = boxes_scaled[:, 4] - boxes_scaled[:, 2]            
            # Largest exemplar height and width on the feature map within this image
            max_h = math.ceil(max(box_hs))
            max_w = math.ceil(max(box_ws))            
            for j in range(0, M):
                y1, x1 = int(boxes_scaled[j, 1]), int(boxes_scaled[j, 2])
                y2, x2 = int(boxes_scaled[j, 3]), int(boxes_scaled[j, 4])
                # print(y1,y2,x1,x2,max_h,max_w)
                if j == 0:
                    # Crop the feature patch of this exemplar
                    examples_features = image_features[:, :, y1:y2, x1:x2]
                    if examples_features.shape[2] != max_h or examples_features.shape[3] != max_w:
                        # Bilinear resize so that all exemplars in one image share the same height and width
                        examples_features = F.interpolate(examples_features, size=(max_h, max_w), mode='bilinear')
                else:
                    feat = image_features[:, :, y1:y2, x1:x2]
                    if feat.shape[2] != max_h or feat.shape[3] != max_w:
                        # Bilinear resize so that all exemplars in one image share the same height and width
                        feat = F.interpolate(feat, size=(max_h, max_w), mode='bilinear')
                    # concat
                    examples_features = torch.cat((examples_features, feat), dim=0)
            """
                Convolving example features over image features
                [Similarity computed with a convolution]
                examples_features acts as the convolution kernel that slides over image_features, producing the correlation map
            """
            h, w = examples_features.shape[2], examples_features.shape[3]
            # input shape(minibatch, in_channels, iH, iW)
            # weight shape(out_channels, in_channels/groups, kH, kW)
            # out_channels is the number of bboxes in one image
            features = F.conv2d(
                    # pad so that the correlation map keeps the height and width of image_features
                    input=F.pad(input=image_features, pad=((int(w/2)), int((w-1)/2), int(h/2), int((h-1)/2))),
                    weight=examples_features
                )
            combined = features.permute([1, 0, 2, 3])
            # computing features for scales 0.9 and 1.1
            # Rescale the exemplars to account for objects at different scales
            # Each rescaled exemplar yields a new correlation map
            for scale in exemplar_scales:
                    h1 = math.ceil(h * scale)
                    w1 = math.ceil(w * scale)
                    if h1 < 1: # use original size if scaled size is too small
                        h1 = h
                    if w1 < 1:
                        w1 = w
                    examples_features_scaled = F.interpolate(examples_features, size=(h1,w1), mode='bilinear')
                    features_scaled = F.conv2d(
                        F.pad(image_features, ((int(w1/2)), int((w1-1)/2), int(h1/2), int((h1-1)/2))),
                        examples_features_scaled
                    )
                    features_scaled = features_scaled.permute([1, 0, 2, 3])
                    # Concatenate all correlation maps
                    combined = torch.cat((combined, features_scaled), dim=1)
            if cnter == 0:
                Combined = 1.0 * combined
            else:
                # Upsample the map4 result so its height and width match map3
                if Combined.shape[2] != combined.shape[2] or Combined.shape[3] != combined.shape[3]:
                    combined = F.interpolate(combined, size=(Combined.shape[2], Combined.shape[3]), mode='bilinear')
                Combined = torch.cat((Combined, combined), dim=1)
            cnter += 1
        if ix == 0:
            All_feat = 1.0 * Combined.unsqueeze(0)
        else:
            All_feat = torch.cat((All_feat, Combined.unsqueeze(0)), dim=0)
    return All_feat

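2.3 Model Prediction

Prediction requires a few annotated bboxes on the image; if no bbox file is provided, the user is prompted to draw a set of bounding boxes.
Preprocessing at prediction time is essentially the same as in training, except that there is no density map to process.
After the multi-scale feature extraction module has extracted the features, test-time adaptation can optionally be enabled (100 gradient steps by default) to fine-tune the density prediction module before the final prediction.
The loss used here is the weighted sum of MincountLoss and PerturbationLoss. Both were covered in Section 1.2 and are not repeated here.
In some cases the loss can become zero, in which case it is a plain zero-valued scalar rather than a tensor. Gradient descent is therefore performed only in the non-zero case.
The official pre-trained model is used for detection here; the visualized result is shown in the figure below.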

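# LearningToCountEverything/demo.py
# Standard imports; the remaining helpers come from the repository's own modules
import os

import cv2
import torch
import torch.optim as optim
from PIL import Image
from tqdm import tqdm
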
def detect(args):
    if not torch.cuda.is_available() or args.gpu_id < 0:
        use_gpu = False
        print("===> Using CPU mode.")
    else:
        use_gpu = True
        os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
        os.environ["CUDA_VISIBLE_DEVICES"] = str(args.gpu_id)
    # 1. Multi-scale feature extraction module
    resnet50_conv = Resnet50FPN()
    # 2. Density prediction module
    # input_channels=6 because map3 and map4 of Resnet50FPN each yield correlation maps at three scales (0.9, 1.0, 1.1)
    regressor = CountRegressor(input_channels=6, pool='mean')

    # 3. Load the trained weights
    if use_gpu:
        resnet50_conv.cuda()
        regressor.cuda()
        regressor.load_state_dict(torch.load(args.model_path))
    else:
        regressor.load_state_dict(torch.load(args.model_path, map_location=torch.device('cpu')))

    resnet50_conv.eval()
    regressor.eval()

    image_name = os.path.basename(args.input_image)
    image_name = os.path.splitext(image_name)[0]

    # if no bounding box file is given, prompt the user for a set of bounding boxes
    if args.bbox_file is None:
        out_bbox_file = "{}/{}_box.txt".format(args.output_dir, image_name)
        fout = open(out_bbox_file, "w")

        im = cv2.imread(args.input_image)
        cv2.imshow('image', im)
        rects = select_exemplar_rois(im)

        rects1 = list()
        for rect in rects:
            y1, x1, y2, x2 = rect
            rects1.append([y1, x1, y2, x2])
            print(rects1)
            fout.write("{} {} {} {}\n".format(y1, x1, y2, x2))

        fout.close()
        cv2.destroyAllWindows()
        print("selected bounding boxes are saved to {}".format(out_bbox_file))
    else:
        with open(args.bbox_file, "r") as fin:
            lines = fin.readlines()

        rects1 = list()
        for line in lines:
            data = line.split()
            y1 = int(data[0])
            x1 = int(data[1])
            y2 = int(data[2])
            x2 = int(data[3])
            rects1.append([y1, x1, y2, x2])

    print("Bounding boxes: ", end="")
    print(rects1)
    # 3. Load the image and preprocess it
    image = Image.open(args.input_image)
    image.load()
    sample = {'image': image, 'lines_boxes': rects1}
    sample = Transform(sample)
    image, boxes = sample['image'], sample['boxes']

    if use_gpu:
        image = image.cuda()
        boxes = boxes.cuda()

    # 4. Extract features with the multi-scale feature extraction module
    with torch.no_grad():
        features = extract_features(resnet50_conv, image.unsqueeze(0), boxes.unsqueeze(0), MAPS, Scales)

    # 5. Produce the prediction
    if not args.adapt:
        # Without adaptation: predict directly
        with torch.no_grad():
            output = regressor(features)
    else:
        # With adaptation: fine-tune first (100 gradient steps by default), then predict
        features.requires_grad = True
        adapted_regressor = regressor
        adapted_regressor.train()
        # Still only the regressor parameters are fine-tuned
        optimizer = optim.Adam(adapted_regressor.parameters(), lr=args.learning_rate)

        pbar = tqdm(range(args.gradient_steps))
        for step in pbar:
            optimizer.zero_grad()
            output = adapted_regressor(features)
            # The loss here is the weighted sum of MincountLoss and PerturbationLoss
            lCount = args.weight_mincount * MincountLoss(output, boxes, use_gpu=use_gpu)
            lPerturbation = args.weight_perturbation * PerturbationLoss(output, boxes, sigma=8, use_gpu=use_gpu)
            Loss = lCount + lPerturbation
            # In some cases the loss becomes zero and is then a plain scalar rather than a tensor,
            # so gradient descent is performed only in the non-zero case.
            if torch.is_tensor(Loss):
                Loss.backward()
                optimizer.step()

            pbar.set_description('Adaptation step: {:<3}, loss: {}, predicted-count: {:6.1f}'.format(step, Loss.item(),
                                                                                                     output.sum().item()))

        features.requires_grad = False
        output = adapted_regressor(features)

    print(output.shape)
    print('===> The predicted count is: {:6.2f}'.format(output.sum().item()))
    # 6. Visualize the prediction
    rslt_file = "{}/{}_out.png".format(args.output_dir, image_name)
    visualize_output_and_save(image.detach().cpu(), output.detach().cpu(), boxes.cpu(), rslt_file)
    print("===> Visualized output is saved to {}".format(rslt_file))
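To make the adaptation loop easier to follow in isolation, here is a toy version that can be run without the dataset or the pre-trained weights. Everything in it is a simplified assumption: a single 1×1 convolution (`toy_regressor`) stands in for CountRegressor, random features stand in for the correlation maps, and only a hinge-style min-count term is used as the loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def hinge_min_count(density, boxes):
    """Sum over boxes of max(0, 1 - density mass inside the box)."""
    loss = density.new_zeros(())
    for y1, x1, y2, x2 in boxes:
        loss = loss + F.relu(1.0 - density[:, :, y1:y2, x1:x2].sum())
    return loss

def adapt(regressor, features, boxes, steps=50, lr=1e-3):
    """Run a few gradient steps on the regressor only; the features stay fixed."""
    optimizer = torch.optim.Adam(regressor.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = hinge_min_count(regressor(features), boxes)
        if loss.item() > 0:  # skip the update once every box holds at least one count
            loss.backward()
            optimizer.step()
    return regressor

torch.manual_seed(0)
features = torch.rand(1, 6, 16, 16)              # stand-in for the 6 correlation maps
toy_regressor = nn.Conv2d(6, 1, kernel_size=1)   # hypothetical stand-in for CountRegressor
nn.init.zeros_(toy_regressor.weight)
nn.init.zeros_(toy_regressor.bias)
boxes = [(2, 2, 6, 6)]                           # one exemplar box (y1, x1, y2, x2)

with torch.no_grad():
    before = toy_regressor(features)[:, :, 2:6, 2:6].sum().item()
adapt(toy_regressor, features, boxes)
with torch.no_grad():
    after = toy_regressor(features)[:, :, 2:6, 2:6].sum().item()
print(before, after)
```

The density mass inside the exemplar box starts at zero and is pushed up until it reaches about one, after which the updates stop; this mirrors, in miniature, how the exemplar boxes supply a training signal at test time.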