[晓理紫]每日论文推送(有中文摘要,源码或项目地址)--大模型相关、扩散模型、视觉导航

专属领域论文订阅

VX关注{晓理紫|小李子},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持。
VX关注晓理紫,并留下邮箱可免费获取每日论文推送服务

分类:

  • 大语言模型LLM
  • 视觉模型VLM
  • 扩散模型
  • 视觉导航
  • 具身智能,机器人
  • 强化学习
  • 开放词汇,检测分割

== LLM ==

标题: A Closer Look at AUROC and AUPRC under Class Imbalance

作者: Matthew B. A. McDermott, Lasse Hyldig Hansen, Haoran Zhang

中文摘要: 在机器学习(ML)中,一句流传甚广的说法是:对于类别不平衡的二分类任务,精确率-召回率曲线下面积(AUPRC)是比受试者工作特征曲线下面积(AUROC)更好的模型比较指标。本文通过新颖的数学分析对这一观点提出质疑,说明AUROC和AUPRC可以用概率形式简明地联系起来。我们证明,与普遍认知相反,AUPRC在类别不平衡情形下并不更优,甚至可能是一个有害的指标,因为它倾向于过度偏向在阳性标签更频繁的子群体中改进模型。这种偏差可能会无意中加剧算法差异。受这些见解的推动,我们对现有ML文献进行了彻底审查,利用大型语言模型分析了arXiv上超过150万篇论文。我们的调查重点是"AUPRC更优"这一说法的流行程度及其依据。结果表明其经验支持严重不足,且存在以讹传讹的错误归因,这助长了对AUPRC所谓优势的广泛接受。我们的发现是双重贡献:在理解指标行为方面的重要技术进展,以及对ML社区中未经检验的假设的严厉警告。所有实验代码见 https://github.com/mmcdermott/AUC_is_all_you_need

摘要: In machine learning (ML), a widespread adage is that the area under the precision-recall curve (AUPRC) is a superior metric for model comparison to the area under the receiver operating characteristic (AUROC) for binary classification tasks with class imbalance. This paper challenges this notion through novel mathematical analysis, illustrating that AUROC and AUPRC can be concisely related in probabilistic terms. We demonstrate that AUPRC, contrary to popular belief, is not superior in cases of class imbalance and might even be a harmful metric, given its inclination to unduly favor model improvements in subpopulations with more frequent positive labels. This bias can inadvertently heighten algorithmic disparities. Prompted by these insights, a thorough review of existing ML literature was conducted, utilizing large language models to analyze over 1.5 million papers from arXiv. Our investigation focused on the prevalence and substantiation of the purported AUPRC superiority. The results expose a significant deficit in empirical backing and a trend of misattributions that have fuelled the widespread acceptance of AUPRC’s supposed advantages. Our findings represent a dual contribution: a significant technical advancement in understanding metric behaviors and a stark warning about unchecked assumptions in the ML community. All experiments are accessible at https://github.com/mmcdermott/AUC_is_all_you_need.

[Downlink:]http://arxiv.org/abs/2401.06091v1

[GitHub:]https://github.com/mmcdermott/AUC_is_all_you_need|
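下面给出一个极简示意(非该论文的官方代码,数据生成方式与参数均为假设),用 scikit-learn 在正例约占 2% 的合成数据上同时计算 AUROC 与 AUPRC,便于直观体会两者在类别不平衡下的差异:

```python
# 极简示意:在类别不平衡数据上比较 AUROC 与 AUPRC(假设性示例,非论文官方实现)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# 构造正例比例约为 2% 的不平衡二分类数据
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# AUROC 对类别先验不敏感,AUPRC(平均精确率)则强烈依赖正例比例
print("AUROC :", roc_auc_score(y_te, scores))
print("AUPRC :", average_precision_score(y_te, scores))
print("正例比例:", y_te.mean())
```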


标题: Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint

作者: Zhipeng Chen, Kun Zhou, Wayne Xin Zhao

中文摘要: 强化学习(RL)已被广泛用于训练大型语言模型,以防止意外输出,例如减少危害和错误。然而,现有的RL方法大多采用实例级奖励,无法对复杂的推理任务提供细粒度的监督,也无法关注导致错误的少数关键令牌。为了解决这一问题,我们提出了一种新的RL方法,名为\textbf{RLMEC},该方法结合了一个生成模型作为奖励模型,该模型由错误解重写任务在最小编辑约束下进行训练,并可以为RL训练产生令牌级奖励。基于生成奖励模型,我们设计了用于训练的令牌级RL目标和用于稳定RL过程的基于模仿的正则化。这两个目标都集中在学习错误解决方案的关键令牌上,减少其他不重要令牌的影响。数学任务和问答任务的实验结果证明了该方法的有效性。我们的代码和数据位于\url{https://github.com/RUCAIBox/RLMEC}.

摘要: Reinforcement learning (RL) has been widely used in training large language models~(LLMs) for preventing unexpected outputs, \eg reducing harmfulness and errors. However, existing RL methods mostly adopt the instance-level reward, which is unable to provide fine-grained supervision for complex reasoning tasks, and can not focus on the few key tokens that lead to the incorrectness. To address it, we propose a new RL method named \textbf{RLMEC} that incorporates a generative model as the reward model, which is trained by the erroneous solution rewriting task under the minimum editing constraint, and can produce token-level rewards for RL training. Based on the generative reward model, we design the token-level RL objective for training and an imitation-based regularization for stabilizing RL process. And the both objectives focus on the learning of the key tokens for the erroneous solution, reducing the effect of other unimportant tokens. The experiment results on mathematical tasks and question-answering tasks have demonstrated the effectiveness of our approach. Our code and data are available at \url{https://github.com/RUCAIBox/RLMEC}.

[Downlink:]http://arxiv.org/abs/2401.06081v1

[GitHub:]https://github.com/RUCAIBox/RLMEC|
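下面是一个假设性的示意草图(并非 RLMEC 官方实现,奖励的来源与函数名均为假设),展示"逐 token 奖励加权序列损失"这一思路,把学习集中到关键 token 上:

```python
# 示意:用逐 token 奖励加权序列损失(假设性草图,非 RLMEC 官方实现)
import torch
import torch.nn.functional as F

def token_weighted_loss(logits, target_ids, token_rewards):
    """logits: [B, T, V];target_ids: [B, T];token_rewards: [B, T],
    例如由生成式奖励模型对"关键错误 token"给出更高权重。"""
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # [B, T]
    # 奖励越高的 token 在目标中占比越大,从而把学习集中到关键 token 上
    loss = -(token_rewards * token_logp).sum() / token_rewards.sum().clamp(min=1e-8)
    return loss

# 随机数据演示
B, T, V = 2, 8, 100
logits = torch.randn(B, T, V, requires_grad=True)
targets = torch.randint(0, V, (B, T))
rewards = torch.rand(B, T)          # 假设已由奖励模型给出
token_weighted_loss(logits, targets, rewards).backward()
```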


标题: CORAL: Expert-Curated medical Oncology Reports to Advance Language Model Inference

作者: Madhumita Sushil, Vanessa E. Kennedy, Divneet Mandair

中文摘要: 肿瘤学的医疗护理和观察性研究都需要全面了解患者的疾病进展和治疗史,这些信息通常详细记录在临床笔记中。尽管临床笔记作用至关重要,但目前没有一种肿瘤学信息表示与标注模式能够完全涵盖这些笔记中所记录信息的多样性。尽管大型语言模型(LLM)最近在各种医学自然语言处理任务上表现出了令人印象深刻的性能,但由于目前缺乏全面标注的肿瘤学数据集,LLM在肿瘤学笔记复杂行文中的抽取与推理能力尚未得到充分评估。我们开发了一个详细的模式来标注文本肿瘤学信息,包括患者特征、肿瘤特征、检查、治疗和时间性。使用加州大学旧金山分校的40份去标识化的乳腺癌和胰腺癌病程记录,我们应用该模式评估了最近三种LLM(GPT-4、GPT-3.5-turbo和FLAN-UL2)的零样本能力,以从临床病程记录的两个叙述部分提取详细的肿瘤病史。我们的团队标注了9028个实体、9986个修饰词和5312个关系。GPT-4模型总体表现最佳,平均BLEU得分为0.73,平均ROUGE得分为0.72,精确匹配F1得分为0.51,在复杂任务上的平均准确率为68%(专家在子集上人工评估)。值得注意的是,它擅长肿瘤特征和药物提取,并在不良事件检测等关系推理方面表现出色。然而,在将其用于从癌症病程记录中可靠提取临床研究、复杂人群管理和记录优质患者护理所需的重要事实之前,仍需进一步改进。

摘要: Both medical care and observational studies in oncology require a thorough understanding of a patient’s disease progression and treatment history, often elaborately documented in clinical notes. Despite their vital role, no current oncology information representation and annotation schema fully encapsulates the diversity of information recorded within these notes. Although large language models (LLMs) have recently exhibited impressive performance on various medical natural language processing tasks, due to the current lack of comprehensively annotated oncology datasets, an extensive evaluation of LLMs in extracting and reasoning with the complex rhetoric in oncology notes remains understudied. We developed a detailed schema for annotating textual oncology information, encompassing patient characteristics, tumor characteristics, tests, treatments, and temporality. Using a corpus of 40 de-identified breast and pancreatic cancer progress notes at University of California, San Francisco, we applied this schema to assess the zero-shot abilities of three recent LLMs (GPT-4, GPT-3.5-turbo, and FLAN-UL2) to extract detailed oncological history from two narrative sections of clinical progress notes. Our team annotated 9028 entities, 9986 modifiers, and 5312 relationships. The GPT-4 model exhibited overall best performance, with an average BLEU score of 0.73, an average ROUGE score of 0.72, an exact-match F1-score of 0.51, and an average accuracy of 68% on complex tasks (expert manual evaluation on subset). Notably, it was proficient in tumor characteristic and medication extraction, and demonstrated superior performance in relational inference like adverse event detection. However, further improvements are needed before using it to reliably extract important facts from cancer progress notes needed for clinical research, complex population management, and documenting quality patient care.

[Downlink:]http://arxiv.org/abs/2308.03853v3

[GitHub:]https://github.com/MadhumitaSushil/OncLLMExtraction|


标题: LLM-as-a-Coauthor: The Challenges of Detecting LLM-Human Mixcase

作者: Chujie Gao, Dongping Chen, Qihui Zhang

中文摘要: 随着大型语言模型(LLM)的显著发展和广泛应用,机器生成文本(MGT)的使用越来越普遍。这一趋势带来了潜在的风险,特别是对新闻和教育等领域信息的质量和完整性。目前的研究主要针对纯MGT的检测,而没有充分解决混合场景,包括人工智能修订的人类书面文本(HWT)或人类修订的MGT。为了应对这一挑战,我们引入了mixcase,这是一个新颖的概念,代表了一种同时包含机器生成与人工生成内容的混合文本形式。我们收集了从多个日常文本编辑场景中生成的mixcase实例,并构建了MixSet,这是第一个专门研究这些混合修改场景的数据集。我们进行实验来评估流行的MGT检测器的功效,评估它们的有效性、鲁棒性和泛化性能。我们的发现表明,现有的检测器很难将mixcase识别为一个单独的类别或MGT,特别是在处理细微修改和风格适应性方面。这项研究强调了迫切需要更多针对mixcase量身定制的细粒度检测器,为未来的研究提供有价值的见解。代码和模型见 https://github.com/Dongping-Chen/MixSet

摘要: With the remarkable development and widespread applications of large language models (LLMs), the use of machine-generated text (MGT) is becoming increasingly common. This trend brings potential risks, particularly to the quality and completeness of information in fields such as news and education. Current research predominantly addresses the detection of pure MGT without adequately addressing mixed scenarios including AI-revised Human-Written Text (HWT) or human-revised MGT. To confront this challenge, we introduce mixcase, a novel concept representing a hybrid text form involving both machine-generated and human-generated content. We collected mixcase instances generated from multiple daily text-editing scenarios and composed MixSet, the first dataset dedicated to studying these mixed modification scenarios. We conduct experiments to evaluate the efficacy of popular MGT detectors, assessing their effectiveness, robustness, and generalization performance. Our findings reveal that existing detectors struggle to identify mixcase as a separate class or MGT, particularly in dealing with subtle modifications and style adaptability. This research underscores the urgent need for more fine-grain detectors tailored for mixcase, offering valuable insights for future research. Code and Models are available at https://github.com/Dongping-Chen/MixSet.

[Downlink:]http://arxiv.org/abs/2401.05952v1

[GitHub:]https://github.com/Dongping-Chen/MixSet|


标题: EarthPT: a time series foundation model for Earth Observation

作者: Michael J. Smith, Luke Fleming, James E. Geach

中文摘要: 我们介绍了EarthPT——一种地球观测(EO)预训练Transformer。EarthPT是一个拥有7亿参数的解码器型Transformer基础模型,以自回归自监督方式训练,并专门针对EO用例进行开发。我们证明,EarthPT是一种有效的预测器,能够准确预测未来较长时间内400-2300 nm波段范围的像素级地表反射率。例如,在五个月的测试集时间范围内,对归一化植被指数(NDVI)演变的预测在像素级别上的典型误差约为0.05(其自然取值范围为-1到1),优于基于历史平均的简单相位折叠模型。我们还证明了EarthPT学习到的嵌入包含语义上有意义的信息,可用于下游任务,例如高度细粒度的动态土地利用分类。令人兴奋的是,我们注意到丰富的EO数据在理论上可以提供数千万亿(quadrillions)量级的训练token。因此,如果我们假设EarthPT遵循类似于大型语言模型(LLM)的神经缩放定律,那么目前对EarthPT以及其他类似的"大型观测模型"的扩展并不存在数据上的限制。

摘要: We introduce EarthPT – an Earth Observation (EO) pretrained transformer. EarthPT is a 700 million parameter decoding transformer foundation model trained in an autoregressive self-supervised manner and developed specifically with EO use-cases in mind. We demonstrate that EarthPT is an effective forecaster that can accurately predict future pixel-level surface reflectances across the 400-2300 nm range well into the future. For example, forecasts of the evolution of the Normalised Difference Vegetation Index (NDVI) have a typical error of approximately 0.05 (over a natural range of -1 -> 1) at the pixel level over a five month test set horizon, out-performing simple phase-folded models based on historical averaging. We also demonstrate that embeddings learnt by EarthPT hold semantically meaningful information and could be exploited for downstream tasks such as highly granular, dynamic land use classification. Excitingly, we note that the abundance of EO data provides us with – in theory – quadrillions of training tokens. Therefore, if we assume that EarthPT follows neural scaling laws akin to those derived for Large Language Models (LLMs), there is currently no data-imposed limit to scaling EarthPT and other similar `Large Observation Models.’

[Downlink:]http://arxiv.org/abs/2309.07207v2

[Project:]https://www.climatechange.ai/papers/neurips2023/2|

[GitHub:]https://github.com/aspiaspace/EarthPT|
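NDVI 的计算公式本身很简单,下面给出一个与 EarthPT 官方代码无关的小示例(反射率数值为假设),演示 NDVI = (NIR - Red)/(NIR + Red) 以及摘要中"约 0.05 的典型误差"的含义:

```python
# 示意:计算 NDVI 并评估预测误差(假设性示例,与 EarthPT 官方代码无关)
import numpy as np

def ndvi(nir, red, eps=1e-6):
    """NDVI = (NIR - Red) / (NIR + Red),取值范围约为 [-1, 1]"""
    return (nir - red) / (nir + red + eps)

# 假设的像素级反射率(0~1),以及模型对未来时刻的预测
nir_true, red_true = 0.45, 0.08
nir_pred, red_pred = 0.43, 0.09

err = abs(ndvi(nir_pred, red_pred) - ndvi(nir_true, red_true))
print("NDVI 真值:", round(ndvi(nir_true, red_true), 3))
print("NDVI 预测误差:", round(err, 3))   # 摘要中报告的典型误差约为 0.05
```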


标题: Fine-Tuning Language Models with Just Forward Passes

作者: Sadhika Malladi, Tianyu Gao, Eshaan Nichani

中文摘要: 微调语言模型(LM)在各种下游任务上取得了成功,但随着LM规模的增长,反向传播需要大得难以承受的内存。零阶(ZO)方法原则上只需两次前向传播即可估计梯度,但理论上被认为在优化大型模型时慢到难以接受。在这项工作中,我们提出了一种内存高效的零阶优化器(MeZO),将经典的ZO-SGD方法改造为原位(in-place)操作,从而以与推理相同的内存占用来微调LM。例如,使用单块A100 80GB GPU,MeZO可以训练300亿参数的模型,而使用反向传播进行微调在相同预算下只能训练2.7B的LM。我们在不同模型类型(掩码式和自回归LM)、模型规模(最高66B)和下游任务(分类、多项选择和生成)上进行了全面实验。结果表明:(1)MeZO显著优于上下文学习和线性探测;(2)MeZO在多个任务上取得了与反向传播微调相当的性能,在我们的实现中内存最多减少12倍、GPU小时最多减少2倍;(3)MeZO兼容全参数微调和参数高效微调技术,如LoRA和前缀微调(prefix tuning);(4)MeZO可以有效优化不可微目标(例如最大化准确率或F1)。我们用理论见解支持这些实证结果,强调充分的预训练和任务提示如何使MeZO能够微调大模型,尽管经典的ZO分析表明情况并非如此。

摘要: Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction and up to 2x GPU-hour reduction in our implementation; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.

[Downlink:]http://arxiv.org/abs/2305.17333v3

[GitHub:]https://github.com/princeton-nlp/MeZO|
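MeZO 的核心是 SPSA 式零阶梯度估计:对参数施加 +εz 与 -εz 两次原位扰动、各做一次前向,用损失差估计方向导数,并借助固定随机种子复现扰动向量以节省内存。下面是一个示意实现(非官方代码,超参数均为假设):

```python
# 示意:MeZO 风格的零阶优化单步(SPSA 估计 + 原位扰动,非官方实现)
import torch

def zo_step(model, loss_fn, eps=1e-3, lr=1e-6):
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        # 用同一随机种子重新生成扰动 z,避免为每个参数额外保存一份 z
        g = torch.Generator().manual_seed(seed)
        for p in model.parameters():
            z = torch.randn(p.shape, generator=g)
            p.data.add_(scale * eps * z)

    perturb(+1); loss_pos = loss_fn(model)   # 第一次前向
    perturb(-2); loss_neg = loss_fn(model)   # 第二次前向
    perturb(+1)                              # 恢复原参数

    grad_scale = (loss_pos - loss_neg) / (2 * eps)
    g = torch.Generator().manual_seed(seed)
    for p in model.parameters():
        z = torch.randn(p.shape, generator=g)
        p.data.add_(-lr * grad_scale * z)    # 沿扰动方向做一步更新
    return float(loss_pos)

# 最小演示:用线性模型拟合随机数据
model = torch.nn.Linear(10, 1)
X, y = torch.randn(64, 10), torch.randn(64, 1)
loss_fn = lambda m: torch.nn.functional.mse_loss(m(X), y).item()
for _ in range(5):
    print(zo_step(model, loss_fn))
```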


== VLM ==

标题: Dubbing for Everyone: Data-Efficient Visual Dubbing using Neural Rendering Priors

作者: Jack Saunders, Vinay Namboodiri

中文摘要: 视觉配音是生成视频中演员的嘴唇动作以与给定音频同步的过程。最近的进展在实现这一目标方面取得了进步,但尚未产生适合大规模采用的方法。现有方法分为通用型(person-generic)或特定个人型(person-specific)模型。特定个人模型产生的结果几乎与真实无异,但依赖于使用大型单人数据集的长时间训练。通用型方法允许在无需进一步训练的情况下将任何视频配音到任何音频,但无法捕捉特定个人的细微差别,并且往往存在视觉伪影。我们的方法基于数据高效的神经渲染先验,克服了现有方法的局限性。我们的流程包括学习一个延迟神经渲染先验网络,以及使用神经纹理进行演员特定的自适应。这种方法只需几秒钟的数据即可实现高质量的视觉配音,从而可为任何演员(从一线明星到背景演员)进行视频配音。我们通过两项用户研究表明,我们在视觉质量和可识别性方面在定量和定性上均达到了最先进水平。我们的先验学习和自适应方法在有限数据下的泛化能力优于现有的特定个人模型,并且更具可扩展性。我们在真实世界有限数据场景中的实验发现,我们的模型比其他所有模型更受青睐。项目页面见 https://dubbingforeveryone.github.io/

摘要: Visual dubbing is the process of generating lip motions of an actor in a video to synchronise with given audio. Recent advances have made progress towards this goal but have not been able to produce an approach suitable for mass adoption. Existing methods are split into either person-generic or person-specific models. Person-specific models produce results almost indistinguishable from reality but rely on long training times using large single-person datasets. Person-generic works have allowed for the visual dubbing of any video to any audio without further training, but these fail to capture the person-specific nuances and often suffer from visual artefacts. Our method, based on data-efficient neural rendering priors, overcomes the limitations of existing approaches. Our pipeline consists of learning a deferred neural rendering prior network and actor-specific adaptation using neural textures. This method allows for \textbf{high-quality visual dubbing with just a few seconds of data}, that enables video dubbing for any actor - from A-list celebrities to background actors. We show that we achieve state-of-the-art in terms of \textbf{visual quality} and \textbf{recognisability} both quantitatively, and qualitatively through two user studies. Our prior learning and adaptation method \textbf{generalises to limited data} better and is more \textbf{scalable} than existing person-specific models. Our experiments on real-world, limited data scenarios find that our model is preferred over all others. The project page may be found at https://dubbingforeveryone.github.io/

[Downlink:]http://arxiv.org/abs/2401.06126v1

[Project:]https://dubbingforeveryone.github.io/|


标题: Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning

作者: Xiang Li, Varun Belagali, Jinghuan Shang

中文摘要: 序列建模方法在机器人模仿学习中显示出了良好的效果。最近,扩散模型以序列建模的方式被用于行为克隆,这得益于它们在建模复杂数据分布方面的卓越能力。基于标准扩散的策略从以输入状态为条件的随机噪声迭代地生成动作序列。尽管如此,扩散策略模型可以在视觉表示方面得到进一步改进。在这项工作中,我们提出了Crossway Diffusion,这是一种简单而有效的方法,通过精心设计的状态解码器和辅助自监督学习(SSL)目标来增强基于扩散的视觉运动策略学习。状态解码器从反向扩散过程的中间表示重建原始图像像素和其他状态信息。整个模型由SSL目标和原始扩散损失共同优化。我们的实验证明了Crossway Diffusion在各种模拟和真实世界的机器人任务中的有效性,证实了其相对于标准的基于扩散的策略的一致优势,以及相对于基线的显著改进。

摘要: Sequence modeling approaches have shown promising results in robot imitation learning. Recently, diffusion models have been adopted for behavioral cloning in a sequence modeling fashion, benefiting from their exceptional capabilities in modeling complex data distributions. The standard diffusion-based policy iteratively generates action sequences from random noise conditioned on the input states. Nonetheless, the model for diffusion policy can be further improved in terms of visual representations. In this work, we propose Crossway Diffusion, a simple yet effective method to enhance diffusion-based visuomotor policy learning via a carefully designed state decoder and an auxiliary self-supervised learning (SSL) objective. The state decoder reconstructs raw image pixels and other state information from the intermediate representations of the reverse diffusion process. The whole model is jointly optimized by the SSL objective and the original diffusion loss. Our experiments demonstrate the effectiveness of Crossway Diffusion in various simulated and real-world robot tasks, confirming its consistent advantages over the standard diffusion-based policy and substantial improvements over the baselines.

[Downlink:]http://arxiv.org/abs/2307.01849v3

[Project:]https://youtu.be/9deKHueZBuk|

[GitHub:]https://github.com/LostXine/crossway_diffusion|
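下面是一个高度简化的草图(网络结构与损失权重均为假设,并非 Crossway Diffusion 官方实现),示意"扩散去噪损失 + 状态解码器的自监督重建损失"如何联合优化:

```python
# 示意:扩散损失 + 自监督重建损失的联合优化(假设性草图,非 Crossway Diffusion 官方实现)
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPolicy(nn.Module):
    def __init__(self, obs_dim=64, act_dim=7):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, 128)              # 观测编码器
        self.denoiser = nn.Linear(128 + act_dim, act_dim)   # 预测加在动作上的噪声
        self.state_decoder = nn.Linear(128, obs_dim)        # 从中间表示重建观测

    def forward(self, obs, noisy_act):
        h = torch.relu(self.encoder(obs))
        eps_pred = self.denoiser(torch.cat([h, noisy_act], dim=-1))
        obs_rec = self.state_decoder(h)
        return eps_pred, obs_rec

model = TinyPolicy()
obs = torch.randn(32, 64)
act = torch.randn(32, 7)
noise = torch.randn_like(act)
noisy_act = act + noise                       # 简化的加噪(真实实现依噪声日程而定)

eps_pred, obs_rec = model(obs, noisy_act)
loss_diff = F.mse_loss(eps_pred, noise)       # 原始扩散(去噪)损失
loss_ssl = F.mse_loss(obs_rec, obs)           # 辅助自监督重建损失
(loss_diff + 0.1 * loss_ssl).backward()       # 两个目标联合优化,0.1 为假设的权重
```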


标题: Surgical-DINO: Adapter Learning of Foundation Model for Depth Estimation in Endoscopic Surgery

作者: Cui Beilei, Islam Mobarakol, Bai Long

中文摘要: 目的:机器人手术中的深度估计在3D重建、手术导航和增强现实可视化中至关重要。尽管基础模型在许多视觉任务中表现出出色的性能,包括深度估计(例如,DINOv2),但最近的工作观察到其在医学和外科领域特定应用中的局限性。这项工作提出了一种用于手术深度估计的基础模型低秩自适应(LoRA)。方法:我们设计了一种基于基础模型的深度估计方法,称为Surgical-DINO,这是对DINOv2的低秩自适应,用于内窥镜手术中的深度估计。我们构建了LoRA层,并将其集成到DINO中,以适应手术特定领域的知识,而不是传统的微调。在训练过程中,我们冻结了显示出出色视觉表示能力的DINO图像编码器,并仅优化了LoRA层和深度解码器,以集成来自手术场景的特征。结果:我们的模型在MICCAI挑战赛数据集SCARED上得到了广泛验证,该数据集是从达芬奇Xi内窥镜手术中收集的。我们的经验表明,Surgical-DINO在内窥镜深度估计任务中显著优于所有最先进的模型。消融研究的分析表明,我们的LoRA层和自适应策略效果显著。结论:Surgical-DINO为基础模型成功适应外科领域进行深度估计提供了一些启示。结果中有明确证据表明,对计算机视觉数据集中预先训练的权重进行零样本预测或简单微调不足以直接在外科领域使用基础模型。代码位于 https://github.com/BeileiCui/SurgicalDINO。

摘要: Purpose: Depth estimation in robotic surgery is vital in 3D reconstruction, surgical navigation and augmented reality visualization. Although the foundation model exhibits outstanding performance in many vision tasks, including depth estimation (e.g., DINOv2), recent works observed its limitations in medical and surgical domain-specific applications. This work presents a low-ranked adaptation (LoRA) of the foundation model for surgical depth estimation. Methods: We design a foundation model-based depth estimation method, referred to as Surgical-DINO, a low-rank adaptation of the DINOv2 for depth estimation in endoscopic surgery. We build LoRA layers and integrate them into DINO to adapt with surgery-specific domain knowledge instead of conventional fine-tuning. During training, we freeze the DINO image encoder, which shows excellent visual representation capacity, and only optimize the LoRA layers and depth decoder to integrate features from the surgical scene. Results: Our model is extensively validated on a MICCAI challenge dataset of SCARED, which is collected from da Vinci Xi endoscope surgery. We empirically show that Surgical-DINO significantly outperforms all the state-of-the-art models in endoscopic depth estimation tasks. The analysis with ablation studies has shown evidence of the remarkable effect of our LoRA layers and adaptation. Conclusion: Surgical-DINO shed some light on the successful adaptation of the foundation models into the surgical domain for depth estimation. There is clear evidence in the results that zero-shot prediction on pre-trained weights in computer vision datasets or naive fine-tuning is not sufficient to use the foundation model in the surgical domain directly. Code is available at https://github.com/BeileiCui/SurgicalDINO.

[Downlink:]http://arxiv.org/abs/2401.06013v1

[GitHub:]https://github.com/BeileiCui/SurgicalDINO|
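LoRA 的基本做法是冻结预训练权重,只训练一对低秩矩阵作为增量。下面是一个通用的 LoRA 线性层示意(维度、秩等均为假设,并非 Surgical-DINO 官方实现):

```python
# 示意:LoRA 低秩自适应层(冻结原始权重,仅训练低秩增量;非 Surgical-DINO 官方实现)
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r=4, alpha=4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # 冻结预训练权重
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # 输出 = 冻结的原始投影 + 低秩增量 B·A·x
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(384, 384), r=8)
x = torch.randn(2, 16, 384)                            # [batch, tokens, dim]
y = layer(x)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(y.shape, "可训练参数:", trainable)
```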


标题: Interaction Region Visual Transformer for Egocentric Action Anticipation

作者: Debaditya Roy, Ramanathan Rajendiran, Basura Fernando

中文摘要: 人-物交互是最重要的视觉线索之一,我们提出了一种新的方法来表示以自我为中心的动作预期中的人-物交互。我们提出了一种新的Transformer变体,通过计算由于执行动作而导致的物体和人手外观的变化来对交互进行建模,并使用这些变化来细化视频表示。具体而言,我们使用空间交叉注意力(SCA)对手和物体之间的交互进行建模,并使用轨迹交叉注意力进一步注入上下文信息,以获得环境细化的交互token。使用这些token,我们构建了一个以交互为中心的动作预期视频表示。我们将模型命名为InAViT,它在大规模以自我为中心的数据集EPIC-KITCHENS-100(EK100)和EGTEA Gaze+上实现了最先进的动作预期性能。InAViT优于其他基于视觉Transformer的方法,包括以对象为中心的视频表示。在EK100评估服务器上,InAViT是公共排行榜上表现最好的方法(截至提交时),其平均Top-5召回率比第二名模型高3.3%。

摘要: Human-object interaction is one of the most important visual cues and we propose a novel way to represent human-object interactions for egocentric action anticipation. We propose a novel transformer variant to model interactions by computing the change in the appearance of objects and human hands due to the execution of the actions and use those changes to refine the video representation. Specifically, we model interactions between hands and objects using Spatial Cross-Attention (SCA) and further infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens. Using these tokens, we construct an interaction-centric video representation for action anticipation. We term our model InAViT which achieves state-of-the-art action anticipation performance on large-scale egocentric datasets EPIC-KITCHENS-100 (EK100) and EGTEA Gaze+. InAViT outperforms other visual transformer-based methods including object-centric video representation. On the EK100 evaluation server, InAViT is the top-performing method on the public leaderboard (at the time of submission) where it outperforms the second-best model by 3.3% on mean-top5 recall.

[Downlink:]http://arxiv.org/abs/2211.14154v7

[Project:]https://codalab.lisn.upsaclay.fr/competitions/702|
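摘要中的空间交叉注意力(SCA)本质上是以手部 token 为 query、物体 token 为 key/value 的交叉注意力。下面用 PyTorch 自带的多头注意力给出一个与官方实现无关的示意(维度与 token 数均为假设):

```python
# 示意:手部 token 与物体 token 之间的交叉注意力(假设性草图,非 InAViT 官方实现)
import torch
import torch.nn as nn

dim = 256
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

hand_tokens = torch.randn(4, 2, dim)     # [batch, 手部token数, dim]
obj_tokens = torch.randn(4, 5, dim)      # [batch, 物体token数, dim]

# 以手部 token 作为 query,物体 token 作为 key/value,建模交互引起的外观变化
refined_hand, attn_w = cross_attn(query=hand_tokens, key=obj_tokens, value=obj_tokens)
print(refined_hand.shape, attn_w.shape)  # [4, 2, 256], [4, 2, 5]
```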


标题: Improving Continuous Sign Language Recognition with Consistency Constraints and Signer Removal

作者: Ronglai Zuo, Brian Mak

中文摘要: 大多数基于深度学习的连续手语识别(CSLR)模型共享由视觉模块、序列模块和对齐模块组成的类似主干。然而,由于训练样本有限,连接时序分类(CTC)损失可能无法充分训练这种CSLR主干。在这项工作中,我们提出了三项辅助任务来增强CSLR主干。第一个任务从一致性的角度增强对训练不足问题敏感的视觉模块。具体而言,由于手语的信息主要包含在手语者的面部表情和手部动作中,我们开发了一个关键点引导的空间注意力模块,促使视觉模块关注信息丰富的区域,即空间注意力一致性。其次,注意到视觉模块和序列模块的输出特征都表示同一个句子,为了更好地发挥主干的能力,在视觉模块和序列模块之间施加了句子嵌入一致性约束,以增强两种特征的表示能力。我们将用上述辅助任务训练的CSLR模型称为一致性增强的CSLR,它在依赖手语者的数据集(所有手语者同时出现在训练和测试中)上表现良好。为了使其在独立于手语者的设置下更具鲁棒性,我们进一步提出了一种基于特征解耦的手语者移除模块,以从主干中移除手语者信息。我们进行了广泛的消融研究来验证这些辅助任务的有效性。更值得注意的是,通过基于Transformer的主干网络,我们的模型在五个基准(PHOENIX-2014、PHOENIX-2014-T、PHOENIX-2014-SI、CSL和CSL-Daily)上实现了最先进或有竞争力的性能。代码和模型见 https://github.com/2000ZRL/LCSA_C2SLR_SRM

摘要: Most deep-learning-based continuous sign language recognition (CSLR) models share a similar backbone consisting of a visual module, a sequential module, and an alignment module. However, due to limited training samples, a connectionist temporal classification loss may not train such CSLR backbones sufficiently. In this work, we propose three auxiliary tasks to enhance the CSLR backbones. The first task enhances the visual module, which is sensitive to the insufficient training problem, from the perspective of consistency. Specifically, since the information of sign languages is mainly included in signers’ facial expressions and hand movements, a keypoint-guided spatial attention module is developed to enforce the visual module to focus on informative regions, i.e., spatial attention consistency. Second, noticing that both the output features of the visual and sequential modules represent the same sentence, to better exploit the backbone’s power, a sentence embedding consistency constraint is imposed between the visual and sequential modules to enhance the representation power of both features. We name the CSLR model trained with the above auxiliary tasks as consistency-enhanced CSLR, which performs well on signer-dependent datasets in which all signers appear during both training and testing. To make it more robust for the signer-independent setting, a signer removal module based on feature disentanglement is further proposed to remove signer information from the backbone. Extensive ablation studies are conducted to validate the effectiveness of these auxiliary tasks. More remarkably, with a transformer-based backbone, our model achieves state-of-the-art or competitive performance on five benchmarks, PHOENIX-2014, PHOENIX-2014-T, PHOENIX-2014-SI, CSL, and CSL-Daily. Code and Models are available at https://github.com/2000ZRL/LCSA_C2SLR_SRM.

[Downlink:]http://arxiv.org/abs/2212.13023v2

[GitHub:]https://github.com/2000ZRL/LCSA_C2SLR_SRM|
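下面给出句子嵌入一致性约束的一个假设性草图(特征维度与时间汇聚方式均为假设,并非原论文实现):把视觉模块与序列模块的特征各自汇聚为句子级嵌入,再约束二者方向一致:

```python
# 示意:视觉模块与序列模块之间的句子嵌入一致性约束(假设性草图,非原论文实现)
import torch
import torch.nn.functional as F

def sentence_consistency_loss(visual_feat, sequential_feat):
    """visual_feat / sequential_feat: [B, T, D]。先沿时间维汇聚成句子级嵌入,
    再约束两者方向一致(余弦相似度尽量接近 1)。"""
    v = F.normalize(visual_feat.mean(dim=1), dim=-1)
    s = F.normalize(sequential_feat.mean(dim=1), dim=-1)
    return (1 - (v * s).sum(dim=-1)).mean()

v = torch.randn(8, 120, 512, requires_grad=True)
s = torch.randn(8, 120, 512, requires_grad=True)
loss = sentence_consistency_loss(v, s)
loss.backward()
print(float(loss))
```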


标题: CLIP-Driven Semantic Discovery Network for Visible-Infrared Person Re-Identification

作者: Xiaoyan Yu, Neng Dong, Liehuang Zhu

中文摘要: 可见光-红外行人重识别(VIReID)主要处理来自不同模态的行人图像之间的身份匹配。由于可见光和红外图像之间的模态差异,跨模态身份匹配带来了重大挑战。认识到行人外观的高级语义(如性别、体型和服装风格)在不同模态中保持一致,本文旨在通过将视觉特征与高级语义相融合来弥合模态差距。鉴于CLIP能够感知与视觉表示相对应的高级语义信息,我们探索了CLIP在VIReID领域中的应用。因此,我们提出了一个CLIP驱动的语义发现网络(CSDN),它由模态特定的提示学习器(Prompt Learner)、语义信息集成(SII)和高级语义嵌入(HSE)组成。具体而言,考虑到模态差异在语言描述中带来的多样性,我们设计了双模态可学习文本token,分别捕获可见光和红外图像的模态私有语义信息。此外,考虑到不同模态的语义细节具有互补性,我们整合了双模态语言描述中的文本特征,以获得全面的语义。最后,我们在整合后的文本特征和跨模态的视觉特征之间建立了联系。该过程将丰富的高级语义信息嵌入到视觉表示中,从而提升视觉表示的模态不变性。我们提出的CSDN相对于现有方法的有效性和优越性已在多个广泛使用的基准上通过实验评估得到证实。代码将发布在 \url{https://github.com/nengdong96/CSDN}。

摘要: Visible-infrared person re-identification (VIReID) primarily deals with matching identities across person images from different modalities. Due to the modality gap between visible and infrared images, cross-modality identity matching poses significant challenges. Recognizing that high-level semantics of pedestrian appearance, such as gender, shape, and clothing style, remain consistent across modalities, this paper intends to bridge the modality gap by infusing visual features with high-level semantics. Given the capability of CLIP to sense high-level semantic information corresponding to visual representations, we explore the application of CLIP within the domain of VIReID. Consequently, we propose a CLIP-Driven Semantic Discovery Network (CSDN) that consists of Modality-specific Prompt Learner, Semantic Information Integration (SII), and High-level Semantic Embedding (HSE). Specifically, considering the diversity stemming from modality discrepancies in language descriptions, we devise bimodal learnable text tokens to capture modality-private semantic information for visible and infrared images, respectively. Additionally, acknowledging the complementary nature of semantic details across different modalities, we integrate text features from the bimodal language descriptions to achieve comprehensive semantics. Finally, we establish a connection between the integrated text features and the visual features across modalities. This process embed rich high-level semantic information into visual representations, thereby promoting the modality invariance of visual representations. The effectiveness and superiority of our proposed CSDN over existing methods have been substantiated through experimental evaluations on multiple widely used benchmarks. The code will be released at \url{https://github.com/nengdong96/CSDN}.

[Downlink:]http://arxiv.org/abs/2401.05806v1

[GitHub:]https://github.com/nengdong96/CSDN|


== 扩散模型 ==

标题: Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning

作者: Xiang Li, Varun Belagali, Jinghuan Shang

中文摘要: 序列建模方法在机器人模仿学习中显示出了良好的效果。最近,扩散模型以序列建模的方式被用于行为克隆,这得益于它们在建模复杂数据分布方面的卓越能力。基于标准扩散的策略从以输入状态为条件的随机噪声迭代地生成动作序列。尽管如此,扩散策略模型可以在视觉表示方面得到进一步改进。在这项工作中,我们提出了Crossway Diffusion,这是一种简单而有效的方法,通过精心设计的状态解码器和辅助自监督学习(SSL)目标来增强基于扩散的视觉运动策略学习。状态解码器从反向扩散过程的中间表示重建原始图像像素和其他状态信息。整个模型由SSL目标和原始扩散损失共同优化。我们的实验证明了Crossway Diffusion在各种模拟和真实世界的机器人任务中的有效性,证实了其相对于标准的基于扩散的策略的一致优势,以及相对于基线的显著改进。

摘要: Sequence modeling approaches have shown promising results in robot imitation learning. Recently, diffusion models have been adopted for behavioral cloning in a sequence modeling fashion, benefiting from their exceptional capabilities in modeling complex data distributions. The standard diffusion-based policy iteratively generates action sequences from random noise conditioned on the input states. Nonetheless, the model for diffusion policy can be further improved in terms of visual representations. In this work, we propose Crossway Diffusion, a simple yet effective method to enhance diffusion-based visuomotor policy learning via a carefully designed state decoder and an auxiliary self-supervised learning (SSL) objective. The state decoder reconstructs raw image pixels and other state information from the intermediate representations of the reverse diffusion process. The whole model is jointly optimized by the SSL objective and the original diffusion loss. Our experiments demonstrate the effectiveness of Crossway Diffusion in various simulated and real-world robot tasks, confirming its consistent advantages over the standard diffusion-based policy and substantial improvements over the baselines.

[Downlink:]http://arxiv.org/abs/2307.01849v3

[Project:]https://youtu.be/9deKHueZBuk|

[GitHub:]https://github.com/LostXine/crossway_diffusion|


标题: Efficient Image Deblurring Networks based on Diffusion Models

作者: Kang Chen, Yuanjie Liu

中文摘要: 本文介绍了一种用于散焦去模糊的滑动窗口模型,该模型以极低的内存使用率实现了迄今为止的最佳性能。该方法名为Swintormer,利用扩散模型生成潜在的先验特征,有助于恢复更详细的图像。它还将滑动窗口策略扩展到专门的Transformer块,以实现高效推理。此外,我们还进一步优化了乘法累加运算(MACs)。与目前性能最好的GRL方法相比,我们的Swintormer模型将计算复杂度从140.35 GMACs大幅降低到8.02 GMACs,同时还将散焦去模糊的信噪比(SNR)从27.04dB提高到27.07dB。这种新方法允许在内存有限的设备上处理更高分辨率的图像,大大扩展了潜在的应用场景。文章最后进行了消融研究,深入分析了每个网络模块对最终性能的影响。源代码和模型将在以下网站上提供:https://github.com/bnm6900030/swintormer

摘要: This article introduces a sliding window model for defocus deblurring that achieves the best performance to date with extremely low memory usage. Named Swintormer, the method utilizes a diffusion model to generate latent prior features that assist in restoring more detailed images. It also extends the sliding window strategy to specialized Transformer blocks for efficient inference. Additionally, we have further optimized Multiply-Accumulate operations (Macs). Compared to the currently top-performing GRL method, our Swintormer model drastically reduces computational complexity from 140.35 GMACs to 8.02 GMacs, while also improving the Signal-to-Noise Ratio (SNR) for defocus deblurring from 27.04 dB to 27.07 dB. This new method allows for the processing of higher resolution images on devices with limited memory, significantly expanding potential application scenarios. The article concludes with an ablation study that provides an in-depth analysis of the impact of each network module on final performance. The source code and model will be available at the following website: https://github.com/bnm6900030/swintormer.

[Downlink:]http://arxiv.org/abs/2401.05907v1

[GitHub:]https://github.com/bnm6900030/swintormer|


标题: Relation-Aware Diffusion Model for Controllable Poster Layout Generation

作者: Fengheng Li, An Liu, Wei Feng

中文摘要: 海报布局是海报设计的一个重要方面。现有的方法主要关注视觉内容和图形元素之间的相关性。然而,一个愉快的布局也应该考虑视觉和文本内容之间的关系以及元素之间的关系。在这项研究中,我们引入了一个用于海报布局生成的关系感知扩散模型,该模型在生成过程中结合了这两种关系。首先,我们设计了一个视觉-文本关系感知模块,该模块跨模态对齐视觉和文本表示,从而增强布局在传递文本信息方面的功效。随后,我们提出了一个几何关系感知模块,该模块通过综合考虑上下文信息来学习元素之间的几何关系。此外,所提出的方法可以基于用户约束生成不同的布局。为了推进这一领域的研究,我们构建了一个名为CGL数据集V2的海报布局数据集。我们提出的方法在CGL数据集V2上优于最先进的方法。数据和代码将在https://github.com/liuan0803/RADM.

摘要: Poster layout is a crucial aspect of poster design. Prior methods primarily focus on the correlation between visual content and graphic elements. However, a pleasant layout should also consider the relationship between visual and textual contents and the relationship between elements. In this study, we introduce a relation-aware diffusion model for poster layout generation that incorporates these two relationships in the generation process. Firstly, we devise a visual-textual relation-aware module that aligns the visual and textual representations across modalities, thereby enhancing the layout’s efficacy in conveying textual information. Subsequently, we propose a geometry relation-aware module that learns the geometry relationship between elements by comprehensively considering contextual information. Additionally, the proposed method can generate diverse layouts based on user constraints. To advance research in this field, we have constructed a poster layout dataset named CGL-Dataset V2. Our proposed method outperforms state-of-the-art methods on CGL-Dataset V2. The data and code will be available at https://github.com/liuan0803/RADM.

[Downlink:]http://arxiv.org/abs/2306.09086v2

[GitHub:]https://github.com/liuan0803/RADM|


标题: Controllable 3D Face Generation with Conditional Style Code Diffusion

作者: Xiaolong Shen, Jianxin Ma, Chang Zhou

中文摘要: 根据给定条件生成照片级真实感的三维人脸是一项具有挑战性的任务。现有方法通常依赖于耗时的逐例优化,在对同一分布的内容(例如人脸)进行建模时效率低下。此外,理想的可控3D人脸生成模型应该同时考虑人脸属性和表情。因此,我们提出了一种称为TEx-Face(TExt & Expression-to-Face)的新方法,通过将任务分为三个部分来应对这些挑战,即3D GAN反演、条件风格编码扩散和3D人脸解码。对于3D GAN反演,我们介绍了两种方法,旨在增强风格编码的表示并减轻三维不一致性。此外,我们设计了一个风格编码去噪器,将多个条件融合到风格编码中,并提出了一种数据增强策略,以解决配对视觉-语言数据不足的问题。在FFHQ、CelebA-HQ和CelebA-Dialog上进行的大量实验证明了TEx-Face在高效可控地生成照片级真实感3D人脸方面的良好性能。代码将发布在 https://github.com/sxl142/TEx-Face。

摘要: Generating photorealistic 3D faces from given conditions is a challenging task. Existing methods often rely on time-consuming one-by-one optimization approaches, which are not efficient for modeling the same distribution content, e.g., faces. Additionally, an ideal controllable 3D face generation model should consider both facial attributes and expressions. Thus we propose a novel approach called TEx-Face(TExt & Expression-to-Face) that addresses these challenges by dividing the task into three components, i.e., 3D GAN Inversion, Conditional Style Code Diffusion, and 3D Face Decoding. For 3D GAN inversion, we introduce two methods which aim to enhance the representation of style codes and alleviate 3D inconsistencies. Furthermore, we design a style code denoiser to incorporate multiple conditions into the style code and propose a data augmentation strategy to address the issue of insufficient paired visual-language data. Extensive experiments conducted on FFHQ, CelebA-HQ, and CelebA-Dialog demonstrate the promising performance of our TEx-Face in achieving the efficient and controllable generation of photorealistic 3D faces. The code will be available at https://github.com/sxl142/TEx-Face.

[Downlink:]http://arxiv.org/abs/2312.13941v2

[GitHub:]https://github.com/sxl142/TEx-Face|


标题: The Devil's Advocate: Shattering the Illusion of Unexploitable Data using Diffusion Models

作者: Hadi M. Dolatabadi, Sarah Erfani, Christopher Leckie

中文摘要: 保护个人数据免受机器学习模型的利用至关重要。最近,可用性攻击显示出巨大的前景,可以提供一层额外的保护,防止未经授权使用数据来训练神经网络。这些方法旨在为干净的数据添加难以察觉的噪声,使神经网络无法从受保护的数据中提取有意义的模式,并声称它们能使个人数据"不可利用"。本文针对这些方法提供了强有力的对策,表明不可利用的数据可能只是一种幻觉。特别是,我们利用了扩散模型的力量,并表明精心设计的去噪过程可以抵消数据保护扰动的有效性。我们严格分析了我们的算法,并从理论上证明了所需的去噪量与数据保护扰动的大小直接相关。我们的方法称为AVATAR,在各种场景中针对最近的一系列可用性攻击提供了最先进的性能,即使在扩散模型和受保护数据之间的分布不匹配的情况下,也优于对抗性训练。我们的发现呼吁对个人数据的不可利用性进行更多的研究,表明这一目标远未实现。我们的实现可在此代码库中获得:https://github.com/hmdolatabadi/AVATAR

摘要: Protecting personal data against exploitation of machine learning models is crucial. Recently, availability attacks have shown great promise to provide an extra layer of protection against the unauthorized use of data to train neural networks. These methods aim to add imperceptible noise to clean data so that the neural networks cannot extract meaningful patterns from the protected data, claiming that they can make personal data “unexploitable.” This paper provides a strong countermeasure against such approaches, showing that unexploitable data might only be an illusion. In particular, we leverage the power of diffusion models and show that a carefully designed denoising process can counteract the effectiveness of the data-protecting perturbations. We rigorously analyze our algorithm, and theoretically prove that the amount of required denoising is directly related to the magnitude of the data-protecting perturbations. Our approach, called AVATAR, delivers state-of-the-art performance against a suite of recent availability attacks in various scenarios, outperforming adversarial training even under distribution mismatch between the diffusion model and the protected data. Our findings call for more research into making personal data unexploitable, showing that this goal is far from over. Our implementation is available at this repository: https://github.com/hmdolatabadi/AVATAR.

[Downlink:]http://arxiv.org/abs/2303.08500v2

[GitHub:]https://github.com/hmdolatabadi/AVATAR|


标题: Stimulating the Diffusion Model for Image Denoising via Adaptive Embedding and Ensembling

作者: Tong Li, Hansen Feng, Lizhi Wang

中文摘要: 图像去噪是计算摄影中的一个基本问题,要求在低失真的同时实现高感知质量。当前的方法要么难以保证感知质量,要么存在显著失真。最近,新兴的扩散模型在各种任务中都取得了最先进的性能,并展现出在图像去噪方面的巨大潜力。然而,要激发扩散模型用于图像去噪并不简单,需要解决几个关键问题。一方面,输入的不一致性阻碍了扩散模型和图像去噪之间的联系。另一方面,生成图像和期望的去噪图像之间的内容不一致会引入失真。为了解决这些问题,我们从去噪的角度理解并重新思考扩散模型,提出了一种称为图像去噪扩散模型(DMID)的新策略。我们的DMID策略包括一种将噪声图像嵌入预训练无条件扩散模型的自适应嵌入方法,以及一种减少去噪图像失真的自适应集成方法。我们的DMID策略在基于失真和基于感知的度量上均实现了最先进的性能,适用于高斯噪声和真实世界图像去噪。代码位于 https://github.com/Li-Tong-621/DMID。

摘要: Image denoising is a fundamental problem in computational photography, where achieving high perception with low distortion is highly demanding. Current methods either struggle with perceptual quality or suffer from significant distortion. Recently, the emerging diffusion model has achieved state-of-the-art performance in various tasks and demonstrates great potential for image denoising. However, stimulating diffusion models for image denoising is not straightforward and requires solving several critical problems. For one thing, the input inconsistency hinders the connection between diffusion models and image denoising. For another, the content inconsistency between the generated image and the desired denoised image introduces distortion. To tackle these problems, we present a novel strategy called the Diffusion Model for Image Denoising (DMID) by understanding and rethinking the diffusion model from a denoising perspective. Our DMID strategy includes an adaptive embedding method that embeds the noisy image into a pre-trained unconditional diffusion model and an adaptive ensembling method that reduces distortion in the denoised image. Our DMID strategy achieves state-of-the-art performance on both distortion-based and perception-based metrics, for both Gaussian and real-world image denoising.The code is available at https://github.com/Li-Tong-621/DMID.

[Downlink:]http://arxiv.org/abs/2307.03992v3

[GitHub:]https://github.com/Li-Tong-621/DMID|
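DMID 的"自适应嵌入"思想是把带噪图像插入到噪声强度相匹配的扩散时间步,再从该时间步开始反向扩散。下面用一个假设的线性 beta 日程给出示意(非官方实现):

```python
# 示意:按噪声强度选择扩散时间步(DMID"自适应嵌入"思想的假设性草图,非官方实现)
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # 假设的线性 beta 日程
alphas_bar = np.cumprod(1.0 - betas)

def match_timestep(sigma):
    """对 x_t = sqrt(a_t)*x0 + sqrt(1-a_t)*eps,其相对 x0 的等效噪声标准差为
    sqrt(1-a_t)/sqrt(a_t);返回与给定 sigma 最接近的时间步 t。"""
    eff_sigma = np.sqrt(1.0 - alphas_bar) / np.sqrt(alphas_bar)
    return int(np.argmin(np.abs(eff_sigma - sigma)))

for sigma in [0.05, 0.1, 0.2]:                # 假设的真实噪声水平(相对 [0,1] 图像)
    print(f"sigma={sigma} -> 从时间步 t={match_timestep(sigma)} 开始反向扩散")
```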


== 视觉导航 ==

标题: Surgical-DINO: Adapter Learning of Foundation Model for Depth Estimation in Endoscopic Surgery

作者: Cui Beilei, Islam Mobarakol, Bai Long

中文摘要: 目的:机器人手术中的深度估计在3D重建、手术导航和增强现实可视化中至关重要。尽管基础模型在许多视觉任务中表现出出色的性能,包括深度估计(例如,DINOv2),但最近的工作观察到其在医学和外科领域特定应用中的局限性。这项工作提出了一种用于手术深度估计的基础模型低秩自适应(LoRA)。方法:我们设计了一种基于基础模型的深度估计方法,称为Surgical-DINO,这是对DINOv2的低秩自适应,用于内窥镜手术中的深度估计。我们构建了LoRA层,并将其集成到DINO中,以适应手术特定领域的知识,而不是传统的微调。在训练过程中,我们冻结了显示出出色视觉表示能力的DINO图像编码器,并仅优化了LoRA层和深度解码器,以集成来自手术场景的特征。结果:我们的模型在MICCAI挑战赛数据集SCARED上得到了广泛验证,该数据集是从达芬奇Xi内窥镜手术中收集的。我们的经验表明,Surgical-DINO在内窥镜深度估计任务中显著优于所有最先进的模型。消融研究的分析表明,我们的LoRA层和自适应策略效果显著。结论:Surgical-DINO为基础模型成功适应外科领域进行深度估计提供了一些启示。结果中有明确证据表明,对计算机视觉数据集中预先训练的权重进行零样本预测或简单微调不足以直接在外科领域使用基础模型。代码位于 https://github.com/BeileiCui/SurgicalDINO。

摘要: Purpose: Depth estimation in robotic surgery is vital in 3D reconstruction, surgical navigation and augmented reality visualization. Although the foundation model exhibits outstanding performance in many vision tasks, including depth estimation (e.g., DINOv2), recent works observed its limitations in medical and surgical domain-specific applications. This work presents a low-ranked adaptation (LoRA) of the foundation model for surgical depth estimation. Methods: We design a foundation model-based depth estimation method, referred to as Surgical-DINO, a low-rank adaptation of the DINOv2 for depth estimation in endoscopic surgery. We build LoRA layers and integrate them into DINO to adapt with surgery-specific domain knowledge instead of conventional fine-tuning. During training, we freeze the DINO image encoder, which shows excellent visual representation capacity, and only optimize the LoRA layers and depth decoder to integrate features from the surgical scene. Results: Our model is extensively validated on a MICCAI challenge dataset of SCARED, which is collected from da Vinci Xi endoscope surgery. We empirically show that Surgical-DINO significantly outperforms all the state-of-the-art models in endoscopic depth estimation tasks. The analysis with ablation studies has shown evidence of the remarkable effect of our LoRA layers and adaptation. Conclusion: Surgical-DINO shed some light on the successful adaptation of the foundation models into the surgical domain for depth estimation. There is clear evidence in the results that zero-shot prediction on pre-trained weights in computer vision datasets or naive fine-tuning is not sufficient to use the foundation model in the surgical domain directly. Code is available at https://github.com/BeileiCui/SurgicalDINO.

[Downlink:]http://arxiv.org/abs/2401.06013v1

[GitHub:]https://github.com/BeileiCui/SurgicalDINO|


标题: On State Estimation in Multi-Sensor Fusion Navigation: Optimization and Filtering

作者: Feng Zhu, Zhuo Xu, Xveqing Zhang

中文摘要: 导航、感知和决策是智能机器人的基本任务,其本质是估计必要的系统状态。其中,导航是其他上层应用程序的基础,通过集成来自多个传感器的测量,提供精确的位置和方向。通过对每个传感器的观测值进行适当的建模,将导航的多传感器融合任务简化为状态估计问题,该问题可以通过两种方法解决:优化和滤波。最近的研究表明,基于优化的框架在准确性方面优于基于过滤的框架。然而,这两种方法都是基于最大似然估计(MLE)的,并且在理论上应该与相同的线性化点、观测模型、测量和高斯噪声假设等效。在本文中,我们深入挖掘了基于优化和基于过滤的方法中使用的理论和现有策略。结果表明,这两种方法在理论上是相等的,但由于在实时操作中应用的策略不同,这种等价性会破坏。通过调整现有的基于滤波的方法的策略,基于视觉里程计(VO)的蒙特卡洛模拟和车载消融实验表明,策略调整后的滤波严格等于优化。因此,未来对传感器融合问题的研究应该集中在它们自己的算法和策略上,而不是状态估计方法

摘要: The essential of navigation, perception, and decision-making which are basic tasks for intelligent robots, is to estimate necessary system states. Among them, navigation is fundamental for other upper applications, providing precise position and orientation, by integrating measurements from multiple sensors. With observations of each sensor appropriately modelled, multi-sensor fusion tasks for navigation are reduced to the state estimation problem which can be solved by two approaches: optimization and filtering. Recent research has shown that optimization-based frameworks outperform filtering-based ones in terms of accuracy. However, both methods are based on maximum likelihood estimation (MLE) and should be theoretically equivalent with the same linearization points, observation model, measurements, and Gaussian noise assumption. In this paper, we deeply dig into the theories and existing strategies utilized in both optimization-based and filtering-based approaches. It is demonstrated that the two methods are equal theoretically, but this equivalence corrupts due to different strategies applied in real-time operation. By adjusting existing strategies of the filtering-based approaches, the Monte-Carlo simulation and vehicular ablation experiments based on visual odometry (VO) indicate that the strategy adjusted filtering strictly equals to optimization. Therefore, future research on sensor-fusion problems should concentrate on their own algorithms and strategies rather than state estimation approaches.

[Downlink:]http://arxiv.org/abs/2401.05836v1
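摘要的核心论点是:在线性高斯假设下,滤波(卡尔曼更新)与优化(加权最小二乘/MAP)给出相同的估计。下面用一个数值例子验证这一等价性(矩阵数值均为假设):

```python
# 示意:线性高斯情形下,滤波(卡尔曼更新)与优化(加权最小二乘/MAP)给出相同估计
import numpy as np

x_prior = np.array([1.0, 2.0])            # 先验均值
P = np.diag([0.5, 0.3])                   # 先验协方差
H = np.array([[1.0, 0.0], [1.0, 1.0]])    # 线性观测模型
R = np.diag([0.2, 0.1])                   # 观测噪声协方差
z = np.array([1.2, 3.4])                  # 观测值

# 滤波视角:一次卡尔曼更新
K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
x_kf = x_prior + K @ (z - H @ x_prior)

# 优化视角:最小化 (x-x_prior)'P^{-1}(x-x_prior) + (z-Hx)'R^{-1}(z-Hx)
A = np.linalg.inv(P) + H.T @ np.linalg.inv(R) @ H
b = np.linalg.inv(P) @ x_prior + H.T @ np.linalg.inv(R) @ z
x_opt = np.linalg.solve(A, b)

print(x_kf, x_opt, np.allclose(x_kf, x_opt))   # 两者一致
```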


标题: MatSAM: Efficient Materials Microstructure Extraction via Visual Large Model

作者: Changtai Li, Xu Han, Chao Yao

中文摘要: 准确有效地提取材料微观图像中的微观结构,在探索结构-性能关系和优化工艺参数方面发挥着关键作用。基于深度学习的图像分割技术依赖于手动注释,耗时耗力,难以满足模型可移植性和泛化的要求。Segment Anything Model(SAM)是一种具有强大的深度特征表示和零样本泛化能力的大型视觉模型,为图像分割提供了新的解决方案。然而,在没有人为注释的情况下直接应用SAM来分割材料微观图像中的微观结构并不能达到预期的结果,因为难以使其原生的即时工程适应材料微观图像的关键微观结构的密集和分散特征。在本文中,我们提出了一种基于SAM的通用高效的微观结构提取解决方案MatSAM。根据材料微观结构的分布和形状,设计了一种新的基于点的提示生成策略。它为不同的微观图像生成提示,融合感兴趣区域(ROI)关键点和网格关键点的提示,并集成后处理方法对材料微观结构进行定量表征。对于包括晶界和相在内的常见微观结构,MatSAM实现了优于传统方法的分割性能,甚至优于对光学显微镜(OM)和扫描电子显微镜(SEM)成像的18种材料微观结构进行评估的监督学习方法。我们相信,MatSAM可以显著降低材料微观结构定量表征的成本,并加快新材料的设计

摘要: Accurate and efficient extraction of microstructures in microscopic images of materials plays a critical role in the exploration of structure-property relationships and the optimization of process parameters. Deep learning-based image segmentation techniques that rely on manual annotation are time-consuming and labor-intensive and hardly meet the demand for model transferability and generalization. Segment Anything Model (SAM), a large visual model with powerful deep feature representation and zero-shot generalization capabilities, has provided new solutions for image segmentation. However, directly applying SAM to segmenting microstructures in microscopic images of materials without human annotation cannot achieve the expected results, as the difficulty of adapting its native prompt engineering to the dense and dispersed characteristics of key microstructures in materials microscopy images. In this paper, we propose MatSAM, a general and efficient microstructure extraction solution based on SAM. A new point-based prompts generation strategy is designed, grounded on the distribution and shape of materials microstructures. It generates prompts for different microscopic images, fuses the prompts of the region of interest (ROI) key points and grid key points, and integrates post-processing methods for quantitative characterization of materials microstructures. For common microstructures including grain boundary and phase, MatSAM achieves superior segmentation performance to conventional methods and is even preferable to supervised learning methods evaluated on 18 materials microstructures imaged by the optical microscope (OM) and scanning electron microscope (SEM). We believe that MatSAM can significantly reduce the cost of quantitative characterization of materials microstructures and accelerate the design of new materials.

[Downlink:]http://arxiv.org/abs/2401.05638v1
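下面是 MatSAM 中"网格关键点提示"这一思路的假设性草图(仅生成点提示本身;SAM 的调用接口与后处理以官方实现为准):

```python
# 示意:为 SAM 生成网格关键点提示(MatSAM 思想的假设性草图,非官方实现)
import numpy as np

def grid_point_prompts(h, w, n_per_side=16):
    """在 h×w 图像上均匀铺设 n_per_side×n_per_side 个前景点提示,
    返回 (坐标数组 [N,2],标签数组 [N],1 表示前景点)。"""
    xs = np.linspace(0.5, w - 0.5, n_per_side)
    ys = np.linspace(0.5, h - 0.5, n_per_side)
    xx, yy = np.meshgrid(xs, ys)
    coords = np.stack([xx.ravel(), yy.ravel()], axis=-1)
    labels = np.ones(len(coords), dtype=np.int64)
    return coords, labels

coords, labels = grid_point_prompts(1024, 1024, n_per_side=8)
print(coords.shape, labels.shape)
# 这些网格点可与 ROI 关键点合并后,作为点提示坐标与标签
# 传入 segment-anything 的预测接口(具体接口与后处理以官方实现为准)。
```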


标题: Exploring Vulnerabilities of No-Reference Image Quality Assessment Models: A Query-Based Black-Box Method

作者: Chenxi Yang, Yujia Liu, Dingquan Li

中文摘要: 无参考图像质量评估(NR-IQA)旨在不依赖原始参考图像的情况下预测与人类感知一致的图像质量分数,是各种视觉任务的关键组成部分。确保NR-IQA方法的稳健性,对于可靠比较不同图像处理技术以及在推荐中提供一致的用户体验至关重要。针对NR-IQA的攻击方法为测试NR-IQA的鲁棒性提供了强大的工具。然而,当前NR-IQA的攻击方法严重依赖NR-IQA模型的梯度,在梯度信息不可用时受到限制。在本文中,我们提出了一种开创性的针对NR-IQA方法的基于查询的黑盒攻击。我们提出了\emph{分数边界}的概念,并利用具有多个分数边界的自适应迭代方法。同时,初始攻击方向也被设计为利用人类视觉系统(HVS)的特性。实验表明,我们的攻击方法优于所有对比的最先进方法,并且远远领先于以前的黑盒方法。有效的DBCNN模型在受到我们的方法攻击时,Spearman等级相关系数(SROCC)下降了0.6972,揭示了NR-IQA对黑盒攻击的脆弱性。所提出的攻击方法也为进一步探索NR-IQA的鲁棒性提供了有力的工具。

摘要: No-Reference Image Quality Assessment (NR-IQA) aims to predict image quality scores consistent with human perception without relying on pristine reference images, serving as a crucial component in various visual tasks. Ensuring the robustness of NR-IQA methods is vital for reliable comparisons of different image processing techniques and consistent user experiences in recommendations. The attack methods for NR-IQA provide a powerful instrument to test the robustness of NR-IQA. However, current attack methods of NR-IQA heavily rely on the gradient of the NR-IQA model, leading to limitations when the gradient information is unavailable. In this paper, we present a pioneering query-based black box attack against NR-IQA methods. We propose the concept of \emph{score boundary} and leverage an adaptive iterative approach with multiple score boundaries. Meanwhile, the initial attack directions are also designed to leverage the characteristics of the Human Visual System (HVS). Experiments show our attack method outperforms all compared state-of-the-art methods and is far ahead of previous black-box methods. The effective DBCNN model suffers a Spearman rank-order correlation coefficient (SROCC) decline of 0.6972 attacked by our method, revealing the vulnerability of NR-IQA to black-box attacks. The proposed attack method also provides a potent tool for further exploration into NR-IQA robustness.

[Downlink:]http://arxiv.org/abs/2401.05217v1
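下面给出一个通用的基于查询的黑盒攻击示意(简单随机搜索,打分函数为假设的占位实现,并非论文中的 score boundary 方法),说明"只靠查询分数"也能找到压低 NR-IQA 分数的扰动:

```python
# 示意:基于查询的黑盒攻击(简单随机搜索,目标是压低 NR-IQA 预测分数;
# 仅为通用草图,并非论文中的 score boundary 方法)
import numpy as np

def fake_iqa_model(img):
    """假设的打分函数,仅用于演示;真实场景中只能查询模型输出。"""
    return float(img.mean() * 100.0)

def query_attack(img, score_fn, eps=8/255, steps=200, sigma=2/255,
                 rng=np.random.default_rng(0)):
    adv, best = img.copy(), score_fn(img)
    for _ in range(steps):
        cand = adv + rng.normal(0, sigma, img.shape)                # 随机扰动候选
        cand = np.clip(np.clip(cand, img - eps, img + eps), 0, 1)   # L∞ 约束与像素范围
        s = score_fn(cand)                                          # 仅通过查询获得分数
        if s < best:                                                # 保留使分数下降的扰动
            adv, best = cand, s
    return adv, best

img = np.random.default_rng(1).random((32, 32, 3))
adv, best = query_attack(img, fake_iqa_model)
print("原始分数:", fake_iqa_model(img), "攻击后分数:", best)
```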


标题: Amplifying robotics capacities with a human touch: An immersive low-latency panoramic remote system

作者: Junjie Li, Kang Li, Dewei Han

中文摘要: 人工智能和机器人技术在过去十年中取得了显著进步,改变了各个领域的工作模式和机会。这些技术的应用将社会推向了一个人与机器共生的时代。为了促进人类与智能机器人之间的高效通信,我们提出了“阿凡达”系统,这是一个沉浸式低延迟全景人机交互平台。我们设计并测试了一个坚固的移动平台原型,该平台集成了边缘计算单元、全景视频捕获设备、动力电池、机械臂和网络通信设备。在良好的网络条件下,我们实现了延迟357ms的低延迟高清全景视觉体验。操作员可以利用VR耳机和控制器对机器人和设备进行实时沉浸式控制。该系统能够实现跨越校园、省份、国家甚至大洲(纽约到深圳)的远距离远程控制。此外,该系统结合了用于地图和轨迹记录的视觉SLAM技术,提供了自主导航功能。我们相信,这个直观的系统平台可以提高人机协作的效率和情景体验,随着相关技术的进一步进步,它将成为人工智能与人类高效共生合作的通用工具

摘要: AI and robotics technologies have witnessed remarkable advancements in the past decade, revolutionizing work patterns and opportunities in various domains. The application of these technologies has propelled society towards an era of symbiosis between humans and machines. To facilitate efficient communication between humans and intelligent robots, we propose the “Avatar” system, an immersive low-latency panoramic human-robot interaction platform. We have designed and tested a prototype of a rugged mobile platform integrated with edge computing units, panoramic video capture devices, power batteries, robot arms, and network communication equipment. Under favorable network conditions, we achieved a low-latency high-definition panoramic visual experience with a delay of 357ms. Operators can utilize VR headsets and controllers for real-time immersive control of robots and devices. The system enables remote control over vast physical distances, spanning campuses, provinces, countries, and even continents (New York to Shenzhen). Additionally, the system incorporates visual SLAM technology for map and trajectory recording, providing autonomous navigation capabilities. We believe that this intuitive system platform can enhance efficiency and situational experience in human-robot collaboration, and with further advancements in related technologies, it will become a versatile tool for efficient and symbiotic cooperation between AI and humans.

[Downlink:]http://arxiv.org/abs/2401.03398v2


标题: Autonomous robotic re-alignment for face-to-face underwater human-robot interaction

作者: Demetrious T. Kutzke, Ashwin Wariar, Junaed Sattar

中文摘要: 得益于传感、导航、操纵和机载计算技术的进步,利用自主水下航行器(AUV)完成传统上具有挑战性和危险性的任务的做法迅速增多。相比之下,AUV在水下人机交互(UHRI)中的应用增长较慢,原因在于双向通信的局限性,以及将陆地交互策略类比迁移到水下领域所面临的重大技术障碍。支持UHRI的一个必要组成部分,是建立一套让机器人安全接近潜水员的系统,以便在考虑非标准人体姿态的情况下建立面对面交流。在这项工作中,我们介绍了一种用于增强UHRI的立体视觉系统,该系统利用立体图像对的三维重建和机器学习来定位人体关节估计。然后,我们为坐标系建立了一个约定,用于编码人相对于相机坐标系的朝向。这允许自动计算设定点,该设定点保持人体比例,并可作为基于图像的视觉伺服控制方案的输入。我们表明,我们的设定点计算在定量和定性上都与实验设定点基线基本一致。所介绍的方法有望通过改善机器人对水下人体朝向的感知来增强UHRI。

摘要: The use of autonomous underwater vehicles (AUVs) to accomplish traditionally challenging and dangerous tasks has proliferated thanks to advances in sensing, navigation, manipulation, and on-board computing technologies. Utilizing AUVs in underwater human-robot interaction (UHRI) has witnessed comparatively smaller levels of growth due to limitations in bi-directional communication and significant technical hurdles to bridge the gap between analogies with terrestrial interaction strategies and those that are possible in the underwater domain. A necessary component to support UHRI is establishing a system for safe robotic-diver approach to establish face-to-face communication that considers non-standard human body pose. In this work, we introduce a stereo vision system for enhancing UHRI that utilizes three-dimensional reconstruction from stereo image pairs and machine learning for localizing human joint estimates. We then establish a convention for a coordinate system that encodes the direction the human is facing with respect to the camera coordinate frame. This allows automatic setpoint computation that preserves human body scale and can be used as input to an image-based visual servo control scheme. We show that our setpoint computations tend to agree both quantitatively and qualitatively with experimental setpoint baselines. The methodology introduced shows promise for enhancing UHRI by improving robotic perception of human orientation underwater.

[Downlink:]http://arxiv.org/abs/2401.04320v1

