[晓理紫] Daily Paper Digest (with abstracts and source/project links) -- Large Models

Subscribe to papers in your field

Follow {晓理紫} on WeChat (VX) for daily paper updates. If you find this useful, please share it with others who need it. Thank you for your support.

If you find it helpful, please follow me, and the latest papers will be delivered to you on time every day.

>> Due to limited time and energy, the latest paper updates will no longer be posted on CSDN. The WeChat official account will continue to be updated; follow it there if needed. Thanks for your support.

Categories:

  • Large Language Models (LLM)
  • Vision Language Models (VLM)
  • Diffusion Models
  • Vision-Language Navigation (VLN)
  • Reinforcement Learning (RL)
  • Imitation Learning (IL)
  • Robotics
  • Open-Vocabulary Detection and Segmentation

== LLM ==

Title: NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention

Authors: Tianyi Zhang, Jonah Wonkyu Yi, Bowen Yao

PubTime: 2024-03-02

Downlink: http://arxiv.org/abs/2403.01273v1

GitHub: https://github.com/tonyzhang617/nomad-dist

Abstract: Large language model inference on Central Processing Units (CPUs) is challenging due to the vast quantities of expensive Multiply-Add (MAD) matrix operations in the attention computations. In this paper, we argue that there is a rare gem in modern CPUs, Single-Instruction-Multiple-Data (SIMD) registers, which allow for ultra-low-latency lookups in batch. We leverage this unique capability of CPUs to propose NoMAD-Attention, an efficient attention algorithm that replaces MAD operations with in-register lookups. Through hardware-aware algorithmic designs, NoMAD-Attention achieves the computation of attention scores using repeated fast accesses to SIMD registers despite their highly limited sizes. Moreover, NoMAD-Attention works with pre-trained attention-based LLMs without model finetuning. Empirical evaluations demonstrate that NoMAD-Attention maintains the quality of the original LLMs well, and speeds up the 4-bit quantized LLaMA-7B-based model by up to 2× at 16k context length. Our results are reproducible at https://github.com/tonyzhang617/nomad-dist.
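
The lookup idea is easiest to see as product quantization of the key cache: for each query, build one tiny table of query-centroid dot products per sub-space, then score every key by summing the table entries selected by its codes, so per-key multiply-adds disappear. Below is a minimal NumPy sketch of that flow; the sizes, the random stand-in for codebook training, and all names are illustrative assumptions, not the paper's implementation, which performs the gathers inside SIMD registers.

```python
import numpy as np

d, n_sub, n_centroids = 64, 8, 16   # head dim, sub-spaces, codes per sub-space
sub_d = d // n_sub
rng = np.random.default_rng(0)

# Offline: one codebook per sub-space (random centroids stand in for k-means).
codebooks = rng.normal(size=(n_sub, n_centroids, sub_d))

def quantize_keys(K):
    """Compress each key into n_sub one-byte centroid indices."""
    codes = np.empty((K.shape[0], n_sub), dtype=np.uint8)
    for s in range(n_sub):
        sub = K[:, s * sub_d:(s + 1) * sub_d]                    # (n, sub_d)
        dists = ((sub[:, None] - codebooks[s][None]) ** 2).sum(-1)
        codes[:, s] = dists.argmin(-1)
    return codes

def lookup_scores(q, codes):
    """Approximate q·k for all cached keys via table lookups, not MADs."""
    # The only multiply-add work: one 16-entry dot-product table per
    # sub-space, independent of the number of cached keys.
    lut = np.stack([codebooks[s] @ q[s * sub_d:(s + 1) * sub_d]
                    for s in range(n_sub)])                      # (n_sub, 16)
    # Per key: gather-and-accumulate -- the step SIMD registers make cheap.
    return lut[np.arange(n_sub), codes].sum(-1)                  # (n,)

K = rng.normal(size=(1000, d))
q = rng.normal(size=(d,))
approx, exact = lookup_scores(q, quantize_keys(K)), K @ q
print(np.corrcoef(approx, exact)[0, 1])  # should correlate strongly with q·K
```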


Title: DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning

Authors: Jing Xiong, Zixuan Li, Chuanyang Zheng

PubTime: 2024-03-02

Downlink: http://arxiv.org/abs/2310.02954v5

GitHub: https://github.com/AI4fun/DQ-LoRe

Abstract: Recent advances in natural language processing, primarily propelled by Large Language Models (LLMs), have showcased their remarkable capabilities grounded in in-context learning. A promising avenue for guiding LLMs in intricate reasoning tasks involves the utilization of intermediate reasoning steps within the Chain-of-Thought (CoT) paradigm. Nevertheless, the central challenge lies in the effective selection of exemplars for facilitating in-context learning. In this study, we introduce a framework that leverages Dual Queries and Low-rank approximation Re-ranking (DQ-LoRe) to automatically select exemplars for in-context learning. Dual Queries first query the LLM to obtain LLM-generated knowledge such as CoT, then query the retriever to obtain the final exemplars via both the question and the knowledge. Moreover, for the second query, LoRe employs dimensionality reduction techniques to refine exemplar selection, ensuring close alignment with the input question's knowledge. Through extensive experiments, we demonstrate that DQ-LoRe significantly outperforms prior state-of-the-art methods in the automatic selection of exemplars for GPT-4, enhancing performance from 92.5% to 94.2%. Our comprehensive analysis further reveals that DQ-LoRe consistently outperforms retrieval-based approaches in terms of both performance and adaptability, especially in scenarios characterized by distribution shifts. DQ-LoRe pushes the boundary of in-context learning and opens up new avenues for addressing complex reasoning challenges. Our code is released at https://github.com/AI4fun/DQ-LoRe.
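
As a rough operational picture of the two queries plus re-ranking, the sketch below first asks an LLM for CoT-style knowledge, retrieves candidates with the concatenated question and knowledge, then re-ranks inside a low-rank PCA subspace. `llm` and `embed` are hypothetical callables, `pool` is assumed to be a list of `{"text": ...}` exemplars, and the prompts, models, and hyperparameters in the paper differ.

```python
import numpy as np
from sklearn.decomposition import PCA

def dq_lore_select(question, pool, llm, embed,
                   k_retrieve=64, k_final=8, rank=32):
    # Query 1: ask the LLM for intermediate knowledge (a CoT draft).
    cot = llm(f"Let's think step by step about: {question}")
    # Query 2: retrieve candidates using question + generated knowledge.
    q_vec = embed(question + " " + cot)                        # (d,)
    pool_vecs = np.stack([embed(ex["text"]) for ex in pool])   # (n, d)
    cand = np.argsort(-(pool_vecs @ q_vec))[:k_retrieve]
    # LoRe: re-rank inside a low-rank subspace fit on the candidates,
    # emphasizing directions shared with the question's knowledge.
    pca = PCA(n_components=rank).fit(pool_vecs[cand])
    low = pca.transform(pool_vecs[cand])
    q_low = pca.transform(q_vec[None])[0]
    keep = cand[np.argsort(-(low @ q_low))[:k_final]]
    return [pool[i] for i in keep]
```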


Title: Consistency Matters: Explore LLMs Consistency From a Black-Box Perspective

Authors: Fufangchen Zhao, Guoqiang Jin, Jiaheng Huang

PubTime: 2024-03-02

Downlink: http://arxiv.org/abs/2402.17411v2

GitHub: https://github.com/heavenhellchen/Consistency.git

Abstract: Nowadays, both commercial and open-source academic LLMs have become the mainstream models of NLP. However, there is still a lack of research on LLM consistency, meaning that throughout the various stages of LLM research and deployment, a model's internal parameters and capabilities should remain unchanged. This issue exists in both the industrial and academic sectors. The solution to this problem is often time-consuming and labor-intensive, and there is also the additional cost of secondary deployment, resulting in economic and time losses. To fill this gap, we build an LLM consistency task dataset and design several baselines. Additionally, we choose models of diverse scales for the main experiments. Specifically, in the LightGBM experiment, we used traditional NLG metrics (i.e., ROUGE, BLEU, METEOR) as the features needed for model training. The final result exceeds manual evaluation, GPT-3.5, and the other models in the main experiment, achieving the best performance. In the end, we use the best-performing LightGBM model as the base model to build an evaluation tool, which can effectively assist in the deployment of business models. Our code and tool demo are available at https://github.com/heavenhellchen/Consistency.git
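
A hedged sketch of the setup the abstract outlines: overlap metrics between outputs of two deployments of the same model become features for a LightGBM classifier that predicts whether behaviour is consistent. The metric functions below are toy unigram stand-ins for ROUGE/BLEU/METEOR, and the tiny dataset exists only to make the snippet runnable.

```python
import numpy as np
import lightgbm as lgb

def overlap_features(a: str, b: str) -> list:
    ta, tb = a.split(), b.split()
    inter = len(set(ta) & set(tb))
    prec = inter / max(len(tb), 1)                 # BLEU-like unigram precision
    rec = inter / max(len(ta), 1)                  # ROUGE-like unigram recall
    f1 = 2 * prec * rec / max(prec + rec, 1e-9)    # METEOR-ish harmonic mean
    return [prec, rec, f1]

# Each pair: the same prompt's output before and after re-deployment;
# label 1 means the model behaviour is judged unchanged.
pairs = [("the cat sat", "the cat sat"),
         ("the cat sat", "a dog ran off"),
         ("paris is in france", "paris is in france"),
         ("paris is in france", "berlin is a city")]
labels = [1, 0, 1, 0]
X = np.array([overlap_features(a, b) for a, b in pairs])
clf = lgb.LGBMClassifier(n_estimators=20, min_child_samples=1).fit(X, labels)
print(clf.predict(X))
```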


Title: Measurement of ChatGPT Performance in Mapping Natural Language Specification into an Entity Relationship Diagram

Authors: Mussa A. Omar

PubTime: 2023-12

Downlink: https://ieeexplore.ieee.org/document/10449869/

Journal: 2023 IEEE 11th International Conference on Systems and Control (ICSC)

Abstract: This paper explores the entity relationship diagram, a popular conceptual model used to depict entities, attributes, and relationships graphically. To help with this, we use ChatGPT, a sophisticated language model based on the GPT architecture, which can translate natural language text into an entity relationship diagram. The paper details the process of evaluating how well ChatGPT can perform compared to other state-of-the-art approaches for entity and relationship extraction. Our experimental findings demonstrate the strong ability of ChatGPT to translate natural language text into entity relationship diagrams, which has potential applications for knowledge graph building, data integration, and database schema design. Moreover, it can aid in automating the extraction and organization of information from unstructured text data, thereby simplifying the study of complex systems.


Title: Analyzing the Competitive Mathematical Problem-Solving Skills of ChatGPT

Authors: Rohin R. Teegavarapu, Harshal Sanghvi

PubTime: 2023-12

Downlink: https://ieeexplore.ieee.org/document/10452659/

Journal: 2023 International Conference on Data Science, Agents & Artificial Intelligence (ICDSAAI)

Abstract: The recent emergence of ChatGPT (CGPT) has been considered a machine-learning breakthrough by many for its ability to "speak" like a human and its ability to explain, tutor, and assist. In this study, the ability of CGPT to solve competitive mathematics problems from the AMC (American Mathematics Competitions) of varying type and complexity was evaluated. The problems were categorized into five fields of competitive mathematics specified by number theory, geometry, algebra, combinatorics, and computation, and then further categorized into three different complexity levels (easy, normal, and hard). Results indicated that CGPT performed the best in the number theory and computation categories but did not perform well in subjects such as geometry and combinatorics. On average, 50 to 60 percent of the problems were correctly solved by CGPT; however, it could not solve most problems of higher complexity. This indicated that CGPT failed to correctly apply concepts crucial to solving specific problems. Therefore, CGPT was found to be weak in skills involving spatial reasoning and problem-solving.


Title: PD-UVU: Prompts-Driven Unified Visual Understanding Model Based on Specific Queries

Authors: Lu Xin, Chunyuan Zhang

PubTime: 2023-08

Downlink: https://ieeexplore.ieee.org/document/10448966/

Journal: 2023 IEEE Smart World Congress (SWC)

Abstract: All visual understanding tasks aim to locate specified objects based on language expressions, category names, or annotation information as queries. Currently, tasks with this common feature are completed independently as multiple sub-tasks. In this paper, we propose a universal visual understanding model, PD-UVU. PD-UVU redefines diverse visual understanding tasks as a unified prompts-driven query paradigm, where changing the prompts can retrieve objects of a specific category. This unified representation brings the following benefits: (1) a universal object-centered semantic representation can be trained based on a large amount of data from different visual tasks and label vocabularies, which is particularly advantageous for tasks with limited training data; (2) a unified visual understanding model has strong parameter efficiency and domain generalization ability, enabling the simultaneous handling of multiple tasks and saving redundant computations required by small models for independent tasks. PD-UVU shows outstanding performance on 20 benchmarks from 10 challenging visual understanding tasks, including image-level tasks such as object detection and instance segmentation, expression understanding and language-image segmentation tasks, as well as object tracking tasks.


== CLIP @ ViT @ VLM @ Visual Models ==

Title: Deformable One-shot Face Stylization via DINO Semantic Guidance

Authors: Yang Zhou, Zichong Chen, Hui Huang

PubTime: 2024-03-04

Downlink: http://arxiv.org/abs/2403.00459v2

GitHub: https://github.com/zichongc/DoesFS

Abstract: This paper addresses the complex issue of one-shot face stylization, focusing on the simultaneous consideration of appearance and structure, where previous methods have fallen short. We explore deformation-aware face stylization that diverges from traditional single-image style reference, opting for a real-style image pair instead. The cornerstone of our method is the utilization of a self-supervised vision transformer, specifically DINO-ViT, to establish a robust and consistent facial structure representation across both real and style domains. Our stylization process begins by adapting the StyleGAN generator to be deformation-aware through the integration of spatial transformers (STN). We then introduce two innovative constraints for generator fine-tuning under the guidance of DINO semantics: i) a directional deformation loss that regulates directional vectors in DINO space, and ii) a relative structural consistency constraint based on DINO token self-similarities, ensuring diverse generation. Additionally, style-mixing is employed to align the color generation with the reference, minimizing inconsistent correspondences. This framework delivers enhanced deformability for general one-shot face stylization, achieving notable efficiency with a fine-tuning duration of approximately 10 minutes. Extensive qualitative and quantitative comparisons demonstrate our superiority over state-of-the-art one-shot face stylization methods. Code is available at https://github.com/zichongc/DoesFS.
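
The two fine-tuning constraints can be written down compactly. In the PyTorch sketch below, `dino` is assumed to be a feature extractor returning patch tokens of shape (B, N, D); the loss weights, token selection, and exact DINO-ViT variant follow the paper rather than this toy.

```python
import torch
import torch.nn.functional as F

def directional_deformation_loss(dino, real, style, gen_real, gen_style):
    """The real->style direction in DINO space should match gen_real->gen_style."""
    d_ref = F.normalize(dino(style).mean(1) - dino(real).mean(1), dim=-1)
    d_gen = F.normalize(dino(gen_style).mean(1) - dino(gen_real).mean(1), dim=-1)
    return (1.0 - (d_ref * d_gen).sum(-1)).mean()   # cosine alignment

def self_similarity_loss(dino, src, gen):
    """Relative structure: token-to-token cosine maps should agree."""
    def sim_map(x):
        t = F.normalize(dino(x), dim=-1)            # (B, N, D) patch tokens
        return t @ t.transpose(1, 2)                # (B, N, N) self-similarity
    return F.mse_loss(sim_map(gen), sim_map(src))
```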


Title: Towards Accurate Lip-to-Speech Synthesis in-the-Wild

Authors: Sindhu Hegde, Rudrabha Mukhopadhyay, C. V. Jawahar

PubTime: 2024-03-02

Downlink: http://arxiv.org/abs/2403.01087v1

Project: http://cvit.iiit.ac.in/research/projects/cvit-projects/ms-l2s-itw

Abstract: In this paper, we introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements. The traditional approach of directly generating speech from lip videos faces the challenge of not being able to learn a robust language model from speech alone, resulting in unsatisfactory outcomes. To overcome this issue, we propose incorporating noisy text supervision using a state-of-the-art lip-to-text network that instills language information into our model. The noisy text is generated using a pre-trained lip-to-text model, enabling our approach to work without text annotations during inference. We design a visual text-to-speech network that utilizes the visual stream to generate accurate speech, which is in sync with the silent input video. We perform extensive experiments and ablation studies, demonstrating our approach's superiority over the current state-of-the-art methods on various benchmark datasets. Further, we demonstrate an essential practical application of our method in assistive technology by generating speech for an ALS patient who has lost their voice but can still make mouth movements. Our demo video, code, and additional details can be found at http://cvit.iiit.ac.in/research/projects/cvit-projects/ms-l2s-itw.


Title: G3DR: Generative 3D Reconstruction in ImageNet

Authors: Pradyumna Reddy, Ismail Elezi, Jiankang Deng

PubTime: 2024-03-01

Downlink: http://arxiv.org/abs/2403.00939v1

GitHub: https://github.com/preddy5/G3DR

Abstract: We introduce a novel 3D generative method, Generative 3D Reconstruction (G3DR) in ImageNet, capable of generating diverse and high-quality 3D objects from single images, addressing the limitations of existing methods. At the heart of our framework is a novel depth regularization technique that enables the generation of scenes with high geometric fidelity. G3DR also leverages a pretrained language-vision model, such as CLIP, to enable reconstruction in novel views and improve the visual realism of generations. Additionally, G3DR designs a simple but effective sampling procedure to further improve the quality of generations. G3DR offers diverse and efficient 3D asset generation based on class or text conditioning. Despite its simplicity, G3DR is able to beat state-of-the-art methods, improving over them by up to 22% in perceptual metrics and 90% in geometry scores, while needing only half of the training time. Code is available at https://github.com/preddy5/G3DR.


Title: Rethinking Inductive Biases for Surface Normal Estimation

Authors: Gwangbin Bae, Andrew J. Davison

PubTime: 2024-03-01

Downlink: http://arxiv.org/abs/2403.00712v1

GitHub: https://github.com/baegwangbin/DSINE

Abstract: Despite the growing demand for accurate surface normal estimation models, existing methods use general-purpose dense prediction models, adopting the same inductive biases as other tasks. In this paper, we discuss the inductive biases needed for surface normal estimation and propose to (1) utilize the per-pixel ray direction and (2) encode the relationship between neighboring surface normals by learning their relative rotation. The proposed method can generate crisp - yet piecewise smooth - predictions for challenging in-the-wild images of arbitrary resolution and aspect ratio. Compared to a recent ViT-based state-of-the-art model, our method shows stronger generalization ability, despite being trained on an orders-of-magnitude smaller dataset. The code is available at https://github.com/baegwangbin/DSINE.
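
The first proposed bias, per-pixel ray direction, is cheap to compute from camera intrinsics and can be concatenated to the input as three extra channels. Below is a small NumPy sketch with illustrative intrinsics; the paper's second bias, learning relative rotations between neighboring normals, is not shown.

```python
import numpy as np

def pixel_ray_directions(h, w, fx, fy, cx, cy):
    """Unit ray direction through every pixel center, shape (h, w, 3)."""
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    rays = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u)], axis=-1)
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

rays = pixel_ray_directions(480, 640, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
# model_input = np.concatenate([image, rays], axis=-1)   # 6-channel input
```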


Title: Abductive Ego-View Accident Video Understanding for Safe Driving Perception

Authors: Jianwu Fang, Lei-lei Li, Junfei Zhou

PubTime: 2024-03-01

Downlink: http://arxiv.org/abs/2403.00436v1

Project: http://www.lotvsmmau.net

Abstract: We present MM-AU, a novel dataset for Multi-Modal Accident video Understanding. MM-AU contains 11,727 in-the-wild ego-view accident videos, each with temporally aligned text descriptions. We annotate over 2.23 million object boxes and 58,650 pairs of video-based accident reasons, covering 58 accident categories. MM-AU supports various accident understanding tasks, particularly multimodal video diffusion to understand accident cause-effect chains for safe driving. With MM-AU, we present an Abductive accident Video understanding framework for Safe Driving perception (AdVersa-SD). AdVersa-SD performs video diffusion via an Object-Centric Video Diffusion (OAVD) method which is driven by an abductive CLIP model. This model involves a contrastive interaction loss to learn the pair co-occurrence of normal, near-accident, and accident frames with the corresponding text descriptions, such as accident reasons, prevention advice, and accident categories. OAVD enforces causal region learning while fixing the content of the original frame background in video generation, to find the dominant cause-effect chain for certain accidents. Extensive experiments verify the abductive ability of AdVersa-SD and the superiority of OAVD against the state-of-the-art diffusion models. Additionally, we provide careful benchmark evaluations for object detection and accident reason answering since AdVersa-SD relies on precise object and accident reason information.
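
The contrastive interaction loss pairs frames with their text descriptions. As a generic reference point only, here is a standard CLIP-style symmetric contrastive loss over matched frame/text embeddings; the paper's actual loss, which handles normal, near-accident, and accident phases jointly, is more involved.

```python
import torch
import torch.nn.functional as F

def contrastive_pair_loss(frame_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over matched (frame, description) pairs."""
    f = F.normalize(frame_emb, dim=-1)              # (B, D) frame features
    t = F.normalize(text_emb, dim=-1)               # (B, D) paired text features
    logits = f @ t.T / temperature                  # (B, B) similarity matrix
    labels = torch.arange(f.size(0), device=f.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```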


Title: CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models

Authors: Santiago Castro, Amir Ziai, Avneesh Saluja

PubTime: 2024-03-01

Downlink: http://arxiv.org/abs/2402.15021v2

GitHub: https://github.com/netflix/clove

Abstract: Recent years have witnessed a significant increase in the performance of Vision and Language tasks. Foundational Vision-Language Models (VLMs), such as CLIP, have been leveraged in multiple settings and demonstrated remarkable performance across several tasks. Such models excel at object-centric recognition yet learn text representations that seem invariant to word order, failing to compose known concepts in novel ways. However, no evidence exists that any VLM, including large-scale single-stream models such as GPT-4V, identifies compositions successfully. In this paper, we introduce a framework to significantly improve the ability of existing models to encode compositional language, with over 10% absolute improvement on compositionality benchmarks, while maintaining or improving the performance on standard object-recognition and retrieval benchmarks. Our code and pre-trained models are publicly available at https://github.com/netflix/clove.


== Diffusion Policy @ Diffusion Formulation @ Diffusion Models ==

Title: GCD-DDPM: A Generative Change Detection Model Based on Difference-Feature Guided DDPM

Authors: Yihan Wen, Xianping Ma, Xiaokang Zhang

PubTime: 2024-03-02

Downlink: http://arxiv.org/abs/2306.03424v4

GitHub: https://github.com/udrs/GCD

Abstract: Deep learning (DL)-based methods have recently shown great promise in bitemporal change detection (CD). Existing discriminative methods based on Convolutional Neural Networks (CNNs) and Transformers rely on discriminative representation learning for change recognition while struggling to explore local and long-range contextual dependencies. As a result, it is still challenging to obtain fine-grained and robust CD maps in diverse ground scenes. To cope with this challenge, this work proposes a generative change detection model called GCD-DDPM to directly generate CD maps by exploiting the Denoising Diffusion Probabilistic Model (DDPM), instead of classifying each pixel into changed or unchanged categories. Furthermore, the Difference Conditional Encoder (DCE) is designed to guide the generation of CD maps by exploiting multi-level difference features. Leveraging the variational inference (VI) procedure, GCD-DDPM can adaptively re-calibrate the CD results through an iterative inference process, while accurately distinguishing subtle and irregular changes in diverse scenes. Finally, a Noise Suppression-based Semantic Enhancer (NSSE) is specifically designed to mitigate noise in the current step's change-aware feature representations from the CD Encoder. This refinement, serving as an attention map, can guide subsequent iterations while enhancing CD accuracy. Extensive experiments on four high-resolution CD datasets confirm the superior performance of the proposed GCD-DDPM. The code for this work will be available at https://github.com/udrs/GCD.
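
Generating a CD map with a conditional DDPM amounts to running the usual reverse diffusion loop while feeding difference features to the denoiser at every step. The sketch below shows that loop with a standard linear beta schedule; `eps_model`, the schedule, and the final sigmoid decoding are placeholder assumptions, and the paper's DCE and NSSE modules are omitted.

```python
import torch

@torch.no_grad()
def sample_cd_map(eps_model, diff_feats, steps=1000, shape=(1, 1, 256, 256)):
    betas = torch.linspace(1e-4, 0.02, steps)       # linear noise schedule
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                          # start from pure noise
    for t in reversed(range(steps)):
        # Denoiser conditioned on multi-level difference features.
        eps = eps_model(x, torch.tensor([t]), diff_feats)
        mean = (x - betas[t] / (1 - abar[t]).sqrt() * eps) / alphas[t].sqrt()
        x = mean + betas[t].sqrt() * torch.randn_like(x) if t > 0 else mean
    return x.sigmoid()                              # per-pixel change score
```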


Title: LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model

Authors: Chenjie Cao, Yunuo Cai, Qiaole Dong

PubTime: 2024-03-02

Downlink: http://arxiv.org/abs/2305.11577v3

Project: https://ewrfcas.github.io/LeftRefill

GitHub: https://github.com/ewrfcas/LeftRefill

Abstract: This paper introduces LeftRefill, an innovative approach to efficiently harness large Text-to-Image (T2I) diffusion models for reference-guided image synthesis. As the name implies, LeftRefill horizontally stitches reference and target views together as a whole input. The reference image occupies the left side, while the target canvas is positioned on the right. Then, LeftRefill paints the right-side target canvas based on the left-side reference and specific task instructions. Such a task formulation shares some similarities with contextual inpainting, akin to the actions of a human painter. This novel formulation efficiently learns both structural and textured correspondence between reference and target without other image encoders or adapters. We inject task and view information through cross-attention modules in T2I models, and further exhibit multi-view reference ability via the re-arranged self-attention modules. These enable LeftRefill to perform consistent generation as a generalized model without requiring test-time fine-tuning or model modifications. Thus, LeftRefill can be seen as a simple yet unified framework to address reference-guided synthesis. As an exemplar, we leverage LeftRefill to address two different challenges: reference-guided inpainting and novel view synthesis, based on the pre-trained Stable Diffusion. Codes and models are released at https://github.com/ewrfcas/LeftRefill.
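
The input formulation is simple enough to sketch directly: paste the reference view on the left half of a double-width canvas and mask the right half for inpainting. The commented pipeline call at the end is a hypothetical illustration of how such an input would be consumed, not the released code.

```python
from PIL import Image

def make_leftrefill_input(reference: Image.Image, size=512):
    """Stitch reference (left) with a blank target canvas (right) plus mask."""
    ref = reference.resize((size, size))
    canvas = Image.new("RGB", (2 * size, size))
    canvas.paste(ref, (0, 0))                       # left half: reference view
    mask = Image.new("L", (2 * size, size), 0)
    mask.paste(Image.new("L", (size, size), 255), (size, 0))  # right: to fill
    return canvas, mask

# canvas, mask = make_leftrefill_input(Image.open("ref.png"))
# out = inpaint_pipe(prompt=task_instruction, image=canvas, mask_image=mask)
```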


Title: SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation

Authors: Chongyu Fan, Jiancheng Liu, Yihua Zhang

PubTime: 2024-03-01

Downlink: http://arxiv.org/abs/2310.12508v4

GitHub: https://github.com/OPTML-Group/Unlearn-Saliency

Abstract: With evolving data regulations, machine unlearning (MU) has become an important tool for fostering trust and safety in today's AI models. However, existing MU methods focusing on data and/or weight perspectives often suffer limitations in unlearning accuracy, stability, and cross-domain applicability. To address these challenges, we introduce the concept of 'weight saliency' for MU, drawing parallels with input saliency in model explanation. This innovation directs MU's attention toward specific model weights rather than the entire model, improving effectiveness and efficiency. The resultant method, which we call saliency unlearning (SalUn), narrows the performance gap with 'exact' unlearning (model retraining from scratch after removing the forgetting data points). To the best of our knowledge, SalUn is the first principled MU approach that can effectively erase the influence of forgetting data, classes, or concepts in both image classification and generation tasks. For example, SalUn yields a stability advantage in high-variance random data forgetting, e.g., with a 0.2% gap compared to exact unlearning on the CIFAR-10 dataset. Moreover, in preventing conditional diffusion models from generating harmful images, SalUn achieves nearly 100% unlearning accuracy, outperforming current state-of-the-art baselines like Erased Stable Diffusion and Forget-Me-Not. Codes are available at https://github.com/OPTML-Group/Unlearn-Saliency. (WARNING: This paper contains model outputs that may be offensive in nature.)
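
A hedged PyTorch sketch of the weight-saliency step described above: gradients of a forgetting loss rank the weights, and a threshold turns the top fraction into a binary mask that gates subsequent unlearning updates. The quantile threshold and sparsity level here are illustrative choices, not the paper's settings.

```python
import torch

def weight_saliency_masks(model, forget_loss, sparsity=0.5):
    """Mask of weights whose forgetting-loss gradient magnitudes are largest."""
    forget_loss.backward()
    masks = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        thresh = p.grad.abs().flatten().quantile(sparsity)
        masks[name] = (p.grad.abs() >= thresh).float()
    return masks

# During each unlearning step, gate the update so only salient weights move:
# for name, p in model.named_parameters():
#     if p.grad is not None:
#         p.grad *= masks[name]
```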


Title: BS-Diff: Effective Bone Suppression Using Conditional Diffusion Models from Chest X-Ray Images

Authors: Zhanghao Chen, Yifei Sun, Wenjian Qin

PubTime: 2024-02-29

Downlink: http://arxiv.org/abs/2311.15328v3

GitHub: https://github.com/Benny0323/BS-Diff

Abstract: Chest X-rays (CXRs) are commonly utilized as a low-dose modality for lung screening. Nonetheless, the efficacy of CXRs is somewhat impeded, given that approximately 75% of the lung area overlaps with bone, which in turn hampers the detection and diagnosis of diseases. As a remedial measure, bone suppression techniques have been introduced. The current dual-energy subtraction imaging technique in the clinic requires costly equipment and exposes subjects to high radiation. To circumvent these issues, deep learning-based image generation algorithms have been proposed. However, existing methods fall short in terms of producing high-quality images and capturing texture details, particularly with pulmonary vessels. To address these issues, this paper proposes a new bone suppression framework, termed BS-Diff, that comprises a conditional diffusion model equipped with a U-Net architecture and a simple enhancement module incorporating an autoencoder. Our proposed network can not only generate soft tissue images with a high bone suppression rate but also capture fine image details. Additionally, we compiled the largest dataset since 2010, including data from 120 patients with high-definition, high-resolution paired CXRs and soft tissue images collected by our affiliated hospital. Extensive experiments, comparative analyses, ablation studies, and clinical evaluations indicate that the proposed BS-Diff outperforms several bone-suppression models across multiple metrics. Our code can be accessed at https://github.com/Benny0323/BS-Diff.


Title: Bespoke Non-Stationary Solvers for Fast Sampling of Diffusion and Flow Models

Authors: Neta Shaul, Uriel Singer, Ricky T. Q. Chen

PubTime: 2024-03-02

Downlink: http://arxiv.org/abs/2403.01329v1

Abstract: This paper introduces Bespoke Non-Stationary (BNS) Solvers, a solver distillation approach to improve the sample efficiency of Diffusion and Flow models. BNS solvers are based on a family of non-stationary solvers that provably subsumes existing numerical ODE solvers and consequently demonstrate considerable improvement in sample approximation (PSNR) over these baselines. Compared to model distillation, BNS solvers benefit from a tiny parameter space (<200 parameters), fast optimization (two orders of magnitude faster), maintain diversity of samples, and, in contrast to previous solver distillation approaches, nearly close the gap from standard distillation methods such as Progressive Distillation in the low-medium NFE regime. For example, the BNS solver achieves 45 PSNR / 1.76 FID using 16 NFE in class-conditional ImageNet-64. We experimented with BNS solvers for conditional image generation, text-to-image generation, and text-to-audio generation, showing significant improvement in sample approximation (PSNR) in all.


Title: SDXL-Lightning: Progressive Adversarial Diffusion Distillation

Authors: Shanchuan Lin, Anran Wang, Xiao Yang

PubTime: 2024-03-02

Downlink: http://arxiv.org/abs/2402.13929v3

Abstract: We propose a diffusion distillation method that achieves new state-of-the-art in one-step/few-step 1024px text-to-image generation based on SDXL. Our method combines progressive and adversarial distillation to achieve a balance between quality and mode coverage. In this paper, we discuss the theoretical analysis, discriminator design, model formulation, and training techniques. We open-source our distilled SDXL-Lightning models both as LoRA and full UNet weights.


== Visual Navigation @ VLN @ Vision-Language Navigation ==

Title: Active propulsion noise shaping for multi-rotor aircraft localization

Authors: Gabriele Serussi, Tamir Shor, Tom Hirshberg

PubTime: 2024-02-29

Downlink: http://arxiv.org/abs/2402.17289v2

Abstract: Multi-rotor aerial autonomous vehicles (MAVs) primarily rely on vision for navigation purposes. However, visual localization and odometry techniques suffer from poor performance in low or direct sunlight, a limited field of view, and vulnerability to occlusions. Acoustic sensing can serve as a complementary or even alternative modality for vision in many situations, and it also has the added benefits of lower system cost and energy footprint, which is especially important for micro aircraft. This paper proposes actively controlling and shaping the aircraft propulsion noise generated by the rotors to benefit localization tasks, rather than considering it a harmful nuisance. We present a neural network architecture for self-noise-based localization in a known environment. We show that training it simultaneously with learning time-varying rotor phase modulation achieves accurate and robust localization. The proposed methods are evaluated using a computationally affordable simulation of MAV rotor noise in 2D acoustic environments that is fitted to real recordings of rotor pressure fields.


