OpenCLIP paper reading notes

Taking a look at the OpenCLIP paper:
Learning Transferable Visual Models From Natural Language Supervision
These results suggest that the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets.

CNNs trained to predict words in image captions learn useful image representations

learn image representations from text

I'm curious how this is evaluated on OCR.
How do you prepare CLIP's training samples? 400 million (image, text) pairs: how is a sample set at that scale actually assembled?

The paper says that with this kind of pre-training, zero-shot CLIP can rival models built with supervised learning. I'd put a question mark on that; it doesn't seem quite enough on domain-specific business data.

Learning from natural language has several potential strengths over other training methods. It’s much easier to scale natural language supervision compared to standard crowd-sourced labeling for image classification since it does not require annotations to be in a classic “machine learning compatible format” such as the canonical 1-of-N majority vote “gold label”

MS-COCO and Visual Genome are high-quality crowd-labeled datasets, but they are small by modern standards, with approximately 100,000 training photos each

YFCC100M, at 100 million photos, is a possible alternative, but the metadata for each image is sparse and of varying quality
Many images use automatically generated filenames like 20160716 113957.JPG as “titles” or contain “descriptions” of camera exposure settings. After filtering to keep only images with natural language titles and/or descriptions in English, the dataset shrunk by a factor of 6 to only 15 million photos. This is approximately the same size as ImageNet

A major motivation for natural language supervision is the large quantities of data of this form available publicly on the internet.

we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet

We approximately class balance the results by including up to 20,000 (image, text) pairs per query. The resulting dataset has a similar total word count as the WebText dataset used to train GPT-2. We refer to this dataset as WIT for WebImageText
Note the emphasis on class balance in the training samples.
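As a side note, here is a minimal sketch of what that per-query cap might look like. The 20,000 cap comes from the paper; the pair schema and function name are my own assumptions:

```python
from collections import defaultdict

MAX_PAIRS_PER_QUERY = 20_000  # cap reported in the paper

def balance_by_query(pairs):
    """Keep at most MAX_PAIRS_PER_QUERY (image, text) pairs per search query.

    pairs: iterable of (query, image_url, text) tuples (hypothetical schema).
    """
    counts = defaultdict(int)
    kept = []
    for query, image_url, text in pairs:
        if counts[query] < MAX_PAIRS_PER_QUERY:
            counts[query] += 1
            kept.append((image_url, text))
    return kept
```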

we found training efficiency was key to successfully scaling natural language supervision and we selected our final pre-training method based on this metric
Agreed. Training on a dataset of this scale takes a daunting amount of time, so getting training efficiency right is key.

Recent work in contrastive representation learning for images has found that contrastive objectives can learn better representations than their equivalent predictive objective
This is an interesting finding: it means we don't need to accurately predict each image's text caption, which would be far too hard.
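The paper's Figure 3 gives numpy-style pseudocode for this contrastive objective. Here is a minimal PyTorch-style sketch of the symmetric loss, with variable names of my own choosing:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N aligned (image, text) pairs.

    image_emb, text_emb: [N, d] outputs of the image and text encoders.
    """
    # L2-normalize so dot products become cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # [N, N] similarity matrix; the diagonal holds the true pairs
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    # cross-entropy in both directions: image -> text and text -> image
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```

Each image only has to be more similar to its own caption than to the other N-1 captions in the batch, rather than reproduce the caption word for word.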

Other work has found that although generative models of images can learn high quality image representations, they require over an order of magnitude more compute than contrastive models with the same performance
Generative models come up again here. For representation learning we now have supervised CNNs, contrastive learning (CLIP), and generative models (Stable Diffusion).

We train CLIP from scratch without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights.
On a dataset this large, even larger than ImageNet, loading pre-trained ImageNet weights or a pre-trained text encoder really is unnecessary.

CLIP is pre-trained to predict if an image and a text snippet are paired together in its dataset. To perform zero-shot classification, we reuse this capability. For each dataset, we use the names of all the classes in the dataset as the set of potential text pairings and predict the most probable (image, text) pair according to CLIP. In a bit more detail, we first compute the feature embedding of the image and the feature embedding of the set of possible texts by their respective encoders. The cosine similarity of these embeddings is then calculated, scaled by a temperature parameter τ, and normalized into a probability distribution via a softmax. Note that this prediction layer is a multinomial logistic regression classifier with L2-normalized inputs, L2-normalized weights, no bias, and temperature scaling
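A minimal sketch of that zero-shot pipeline. encode_image and encode_text are stand-ins for the two encoders, not the actual CLIP API, and the temperature value is illustrative (CLIP learns its logit scale during training):

```python
import torch.nn.functional as F

def zero_shot_classify(image, class_names, encode_image, encode_text,
                       temperature=0.01):
    """Pick the class whose prompt embedding best matches the image embedding."""
    prompts = [f"A photo of a {label}." for label in class_names]
    img = F.normalize(encode_image(image), dim=-1)   # [d]
    txt = F.normalize(encode_text(prompts), dim=-1)  # [num_classes, d]
    # cosine similarity, temperature scaling, softmax -> class probabilities
    probs = (img @ txt.t() / temperature).softmax(dim=-1)
    return class_names[probs.argmax().item()], probs
```

Because the text embeddings act as the classifier's weight vectors, they can be computed once per dataset and reused for every image.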

Another issue we encountered is that it’s relatively rare in our pre-training dataset for the text paired with the image to be just a single word. Usually the text is a full sentence describing the image in some way. To help bridge this distribution gap, we found that using the prompt template “A photo of a {label}.” to be a good default that helps specify the text is about the content of the image. This often improves performance over the baseline of using only the label text
The root cause of this problem is that the model doesn't truly understand language. GPT-4 can do that now, so perhaps there's a breakthrough to be had here?
Similar to the “prompt engineering” discussion around GPT3 (Brown et al., 2020; Gao et al., 2020), we have also observed that zero-shot performance can be significantly improved by customizing the prompt text to each task. A few, non exhaustive, examples follow. We found on several fine-grained image classification datasets that it helped to specify the category. For example on Oxford-IIIT Pets, using “A photo of a {label}, a type of pet.” to help provide context worked well. Likewise, on Food101 specifying a type of food and on FGVC Aircraft a type of aircraft helped too. For OCR datasets, we found that putting quotes around the text or number to be recognized improved performance. Finally, we found that on satellite image classification datasets it helped to specify that the images were of this form and we use variants of “a satellite photo of a {label}.”.
This kind of prompt engineering definitely improves performance.

We also experimented with ensembling over multiple zero-shot classifiers as another way of improving performance. These classifiers are computed by using different context prompts such as “A photo of a big {label}.” and “A photo of a small {label}.”. We construct the ensemble over the embedding space instead of probability space. This allows us to cache a single set of averaged text embeddings so that the compute cost of the ensemble is the same as using a single classifier when amortized over many predictions
Here ensembling is used to boost performance.
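A rough sketch of embedding-space ensembling, with the same stand-in encode_text as above; the template list here is only illustrative (the paper reports ensembling over 80 templates on ImageNet):

```python
import torch.nn.functional as F

PROMPT_TEMPLATES = [
    "A photo of a {label}.",
    "A photo of a big {label}.",
    "A photo of a small {label}.",
]

def ensembled_class_embedding(label, encode_text):
    """Average one class's prompt embeddings in embedding space, then
    re-normalize; the result can be cached and used like a single
    classifier weight vector, so inference cost stays the same."""
    prompts = [t.format(label=label) for t in PROMPT_TEMPLATES]
    emb = F.normalize(encode_text(prompts), dim=-1)  # [num_prompts, d]
    return F.normalize(emb.mean(dim=0), dim=0)
```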

Put simply, it beats supervised learning in three ways:
1. More data
2. More kinds of tasks
3. Learning semantic information from text, not just spatial information

we see that zero-shot CLIP is quite weak on several specialized, complex, or abstract tasks such as satellite image classification (EuroSAT and RESISC45), lymph node tumor detection (PatchCamelyon), counting objects in synthetic scenes (CLEVRCounts), self-driving related tasks such as German traffic sign recognition (GTSRB), recognizing distance to the nearest car (KITTI Distance). These results highlight the poor capability of zero-shot CLIP on more complex tasks.
This feels a bit like GPT-4: strong generality, but not all-capable, especially on complex tasks. My feeling is that it's mostly a data problem, though it may also be that current model architectures can't handle complex tasks and they need to be decomposed into simpler sub-tasks. What's undeniable is that it can speed up the labeling of business data.

First, CLIP’s zero-shot classifier is generated via natural language which allows for visual concepts to be directly specified (“communicated”). By contrast, “normal” supervised learning must infer concepts indirectly from training examples. Context-less example-based learning has the drawback that many different hypotheses can be consistent with the data, especially in the one-shot case. A single image often contains many different visual concepts. Although a capable learner is able to exploit visual cues and heuristics, such as assuming that the concept being demonstrated is the primary object in an image, there is no guarantee
Right. The hard part of few-shot learning is that you don't know which features the model has tied to the final label, so you need more sample data before the model can find the correct path.
Zero-shot here rests on pre-training over a huge amount of data, so it really is different from few-shot.
I'm optimistic about large models pre-trained on large data.


Isn't there a problem with this evaluation, though? How do you judge robustness? These are just results on a constructed test set, not a large-scale evaluation, let alone analysis in a real business setting. I think a bit more data per class is no bad thing: it strengthens the feature-to-label connection, and in particular helps capture the right features.

If we assume that evaluation datasets are large enough that the parameters of linear classifiers trained on them are well estimated, then, because CLIP’s zero-shot classifier is also a linear classifier, the performance of the fully supervised classifiers roughly sets an upper bound for what zero-shot transfer can achieve
In terms of fitting capacity, the ceiling that supervised learning can fit is also the ceiling that zero-shot transfer can reach.
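For reference, the paper's fully supervised baseline is a linear probe: logistic regression trained on frozen features. A toy sketch with random stand-in features (real usage would substitute CLIP image embeddings and dataset labels):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data; shapes are illustrative only.
rng = np.random.default_rng(0)
train_x, train_y = rng.normal(size=(1000, 512)), rng.integers(0, 10, 1000)
test_x, test_y = rng.normal(size=(200, 512)), rng.integers(0, 10, 200)

# A supervised linear classifier on the same (frozen) features; since the
# zero-shot classifier is also linear, this score roughly upper-bounds it.
probe = LogisticRegression(max_iter=1000)
probe.fit(train_x, train_y)
print("linear-probe accuracy:", probe.score(test_x, test_y))
```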

Learning general representation ability on big data and doing supervised training on a specific dataset are not in conflict.

Over the past few years, empirical studies of deep learning systems have documented that performance is predictable as a function of important quantities such as training compute and dataset size
They note here that in recent years the performance of deep learning systems has become predictable and can be estimated in advance, which is genuinely interesting.
