[晓理紫]每日论文分享(有中文摘要,源码或项目地址)--大模型、扩散模型、视觉导航

专属领域论文订阅

VX 扫码关注{晓理紫},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持


分类:

  • 大语言模型LLM
  • 视觉模型VLM
  • 扩散模型
  • 视觉导航
  • 具身智能,机器人
  • 强化学习
  • 开放词汇,检测分割

[晓理紫]每日论文分享(有中文摘要,源码或项目地址)

== LLM ==

标题: LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition

作者: Chengsong Huang, Qian Liu, Bill Yuchen Lin

中文摘要: 低秩自适应(LoRA)经常被用来为新任务微调大型语言模型(LLM)。本文研究了用于跨任务泛化的LoRA可组合性,并介绍了LoraHub,这是一个简单的框架,用于有目的地组装在不同给定任务上训练的LoRA模块,目标是在未见过的任务上获得自适应性能。只需新任务的少量示例,LoraHub就可以灵活地组合多个LoRA模块,无需人工专业知识和额外假设。值得注意的是,该组合既不需要额外的模型参数,也不需要梯度。在Big-Bench Hard基准上的实证结果表明,LoraHub虽然没有超过上下文学习(in-context learning)的性能,但通过在推理时为每个样本使用显著更少的token,在少样本场景中提供了可观的性能与效率折中。此外,当与不同的演示示例配对时,LoraHub相比上下文学习建立了更高的性能上限,展示了其未来发展的潜力。我们的愿景是建立一个LoRA模块共享平台,使用户能够分享他们训练好的LoRA模块。这种协作方式有助于LoRA模块无缝应用于新任务,促进自适应生态系统的形成。我们的代码可在 https://github.com/sail-sg/lorahub 获得,所有预训练的LoRA模块发布于 https://huggingface.co/lorahub 。

摘要: Low-rank adaptations (LoRA) are often employed to fine-tune large language models (LLMs) for new tasks. This paper investigates LoRA composability for cross-task generalization and introduces LoraHub, a simple framework devised for the purposive assembly of LoRA modules trained on diverse given tasks, with the objective of achieving adaptable performance on unseen tasks. With just a few examples from a new task, LoraHub can fluidly combine multiple LoRA modules, eliminating the need for human expertise and assumptions. Notably, the composition requires neither additional model parameters nor gradients. Empirical results on the Big-Bench Hard benchmark suggest that LoraHub, while not surpassing the performance of in-context learning, offers a notable performance-efficiency trade-off in few-shot scenarios by employing a significantly reduced number of tokens per example during inference. Notably, LoraHub establishes a better upper bound compared to in-context learning when paired with different demonstration examples, demonstrating its potential for future development. Our vision is to establish a platform for LoRA modules, empowering users to share their trained LoRA modules. This collaborative approach facilitates the seamless application of LoRA modules to novel tasks, contributing to an adaptive ecosystem. Our code is available at https://github.com/sail-sg/lorahub, and all the pre-trained LoRA modules are released at https://huggingface.co/lorahub.

[Downlink:]http://arxiv.org/abs/2307.13269v2

[Project:]https://huggingface.co/lorahub|

[GitHub:]https://github.com/sail-sg/lorahub|
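
下面给出一个极简的代码示意,帮助理解摘要中"无梯度地加权组合多个LoRA模块"这一思路:组合后的权重增量是各模块低秩增量的加权和,组合系数用少量新任务样本上的无梯度搜索确定。其中随机搜索、系数范围与 few_shot_loss 接口均为说明性假设,并非LoraHub官方实现。

```python
# 极简示意:将多个已训练 LoRA 模块加权组合(无需梯度、无需新增参数)。
# 并非 LoraHub 官方实现;组合系数这里用随机搜索代替论文中的无梯度优化器。
import numpy as np

def compose_lora(weight_coeffs, lora_modules):
    """lora_modules: [(A_i, B_i), ...],每个模块的增量为 B_i @ A_i;返回加权后的 ΔW。"""
    return sum(w * (B @ A) for w, (A, B) in zip(weight_coeffs, lora_modules))

def random_search(lora_modules, few_shot_loss, n_trials=200, seed=0):
    """few_shot_loss(delta_w) -> float:在少量新任务样本上评估组合后的适配效果(接口为假设)。"""
    rng = np.random.default_rng(seed)
    best_w, best_loss = None, float("inf")
    for _ in range(n_trials):
        w = rng.uniform(-1.5, 1.5, size=len(lora_modules))  # 此处假设系数范围为 [-1.5, 1.5]
        loss = few_shot_loss(compose_lora(w, lora_modules))
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss

if __name__ == "__main__":
    d, r = 16, 4
    rng = np.random.default_rng(1)
    modules = [(rng.normal(size=(r, d)), rng.normal(size=(d, r))) for _ in range(3)]
    target = modules[0][1] @ modules[0][0]                  # 玩具目标:希望逼近第一个模块的增量
    loss_fn = lambda dw: float(np.linalg.norm(dw - target))
    w, loss = random_search(modules, loss_fn)
    print("best weights:", np.round(w, 2), "loss:", round(loss, 3))
```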


标题: CLadder: Assessing Causal Reasoning in Language Models

作者: Zhijing Jin, Yuen Chen, Felix Leeb

中文摘要: 执行因果推理的能力被广泛认为是智能的核心特征。在这项工作中,我们研究了大型语言模型(LLM)是否能够连贯地进行因果推理。自然语言处理(NLP)中的许多现有工作集中于评估LLM中的常识性因果推理,因此无法评估模型是否能够按照一组定义良好的形式规则执行因果推断。为了解决这个问题,我们提出了一个新的NLP任务:自然语言中的因果推断,其灵感来自Judea Pearl等人提出的"因果推理引擎"。我们构建了一个包含10K样本的大型数据集CLadder:基于一组因果图和查询(关联、干预和反事实),我们通过一个oracle因果推理引擎得到符号化问题及其真值答案,然后将它们翻译成自然语言。我们在该数据集上评估了多个LLM,并引入和评估了一种定制的思维链提示策略CausalCoT。我们发现这一任务对LLM来说极具挑战性,并进行了深入分析,以更深入地理解LLM的因果推理能力。我们的数据已在 https://huggingface.co/datasets/causalNLP/cladder 开源,代码可以在 https://github.com/causalNLP/cladder 找到。

摘要: The ability to perform causal reasoning is widely considered a core feature of intelligence. In this work, we investigate whether large language models (LLMs) can coherently reason about causality. Much of the existing work in natural language processing (NLP) focuses on evaluating commonsense causal reasoning in LLMs, thus failing to assess whether a model can perform causal inference in accordance with a set of well-defined formal rules. To address this, we propose a new NLP task, causal inference in natural language, inspired by the “causal inference engine” postulated by Judea Pearl et al. We compose a large dataset, CLadder, with 10K samples: based on a collection of causal graphs and queries (associational, interventional, and counterfactual), we obtain symbolic questions and ground-truth answers, through an oracle causal inference engine. These are then translated into natural language. We evaluate multiple LLMs on our dataset, and we introduce and evaluate a bespoke chain-of-thought prompting strategy, CausalCoT. We show that our task is highly challenging for LLMs, and we conduct an in-depth analysis to gain deeper insights into the causal reasoning abilities of LLMs. Our data is open-sourced at https://huggingface.co/datasets/causalNLP/cladder, and our code can be found at https://github.com/causalNLP/cladder.

[Downlink:]http://arxiv.org/abs/2312.04350v3

[Project:]https://huggingface.co/datasets/causalNLP/cladder|

[GitHub:]https://github.com/causalNLP/cladder|
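
作为补充,下面用一个简短的示意说明摘要中CausalCoT这类"因果思维链提示"的大致形态:先让模型抽取因果图、判断查询类型、写出形式化表达式,再结合数值求解。提示词的具体措辞是假设性示例,并非论文的官方提示。

```python
# 示意:按摘要描述的思路构造一个"因果思维链"提示(CausalCoT 风格)。
# 具体措辞为假设性示例,并非论文使用的官方提示词。
def causal_cot_prompt(question: str) -> str:
    steps = (
        "1. 提取问题中的因果图(变量与有向边)。\n"
        "2. 判断查询类型:关联 / 干预 / 反事实。\n"
        "3. 写出对应的形式化因果表达式(如 P(Y|do(X)))。\n"
        "4. 结合题目给出的数值推导并给出最终答案。\n"
    )
    return f"请按以下步骤进行因果推理:\n{steps}\n问题:{question}\n答案:"

print(causal_cot_prompt("吸烟是否导致肺癌风险升高?请基于给定概率作答。"))
```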


标题: EarthPT: a time series foundation model for Earth Observation

作者: Michael J. Smith, Luke Fleming, James E. Geach

中文摘要: 我们引入了EarthPT,一种面向地球观测(EO)的预训练Transformer。EarthPT是一个拥有7亿参数的解码器式Transformer基础模型,以自回归自监督方式训练,并专门针对EO用例开发。我们证明EarthPT是一个有效的预测器,能够准确预测400-2300 nm波段范围内未来的像素级地表反射率。例如,在五个月的测试集范围内,对归一化植被指数(NDVI)演变的预测在像素级的典型误差约为0.05(其自然取值范围为-1到1),优于基于历史平均的简单相位折叠模型。我们还证明了EarthPT学习到的嵌入包含语义上有意义的信息,可用于下游任务,如高粒度、动态的土地利用分类。令人兴奋的是,我们注意到丰富的EO数据在理论上可以提供数千万亿(quadrillion)量级的训练token。因此,如果我们假设EarthPT遵循与大型语言模型(LLM)类似的神经缩放定律,那么目前对扩展EarthPT和其他类似的"大型观测模型"并不存在数据上的限制。

摘要: We introduce EarthPT – an Earth Observation (EO) pretrained transformer. EarthPT is a 700 million parameter decoding transformer foundation model trained in an autoregressive self-supervised manner and developed specifically with EO use-cases in mind. We demonstrate that EarthPT is an effective forecaster that can accurately predict future pixel-level surface reflectances across the 400-2300 nm range well into the future. For example, forecasts of the evolution of the Normalised Difference Vegetation Index (NDVI) have a typical error of approximately 0.05 (over a natural range of -1 -> 1) at the pixel level over a five month test set horizon, out-performing simple phase-folded models based on historical averaging. We also demonstrate that embeddings learnt by EarthPT hold semantically meaningful information and could be exploited for downstream tasks such as highly granular, dynamic land use classification. Excitingly, we note that the abundance of EO data provides us with – in theory – quadrillions of training tokens. Therefore, if we assume that EarthPT follows neural scaling laws akin to those derived for Large Language Models (LLMs), there is currently no data-imposed limit to scaling EarthPT and other similar `Large Observation Models.’

[Downlink:]http://arxiv.org/abs/2309.07207v2

[Project:]https://www.climatechange.ai/papers/neurips2023/2|

[GitHub:]https://github.com/aspiaspace/EarthPT|
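
为说明摘要中"自回归地预测未来像素级反射率"的基本流程,下面给出一个与具体模型无关的滚动预测示意:每一步用模型预测下一时刻,并把预测结果拼回输入继续外推。其中的持久性基线 persistence 只是占位,并非EarthPT本身。

```python
# 示意:自回归滚动预测的最小化流程(非 EarthPT 官方代码)。
# model 可以是任何接收历史序列、输出下一步反射率向量的函数,这里用持久性基线代替。
import numpy as np

def autoregressive_forecast(history, model, horizon):
    """history: (T, C) 历史多波段反射率;逐步把预测值拼回输入,滚动预测 horizon 步。"""
    seq = list(history)
    preds = []
    for _ in range(horizon):
        nxt = model(np.asarray(seq))
        preds.append(nxt)
        seq.append(nxt)
    return np.asarray(preds)

persistence = lambda seq: seq[-1]           # 假设性基线:下一步等于最近一次观测
hist = np.random.rand(12, 10)               # 12 个时间步、10 个波段的玩具数据
print(autoregressive_forecast(hist, persistence, horizon=5).shape)  # (5, 10)
```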


标题: T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step

作者: Zehui Chen, Weihua Du, Wenwei Zhang

中文摘要: 大型语言模型(LLM)在各种NLP任务上取得了显著的性能,并借助工具扩展到更广泛的应用。然而,如何评估和分析LLM的工具利用能力仍未得到充分探索。与以往整体评估模型的工作不同,我们将工具利用全面分解为多个子过程,包括指令遵循、规划、推理、检索、理解和审查。在此基础上,我们进一步引入T-Eval来逐步评估工具利用能力。T-Eval将工具利用评估按模型能力拆解为若干子领域,有助于从整体和单项两个层面理解LLM的能力。我们在T-Eval上进行了广泛的实验,并对各种LLM进行了深入分析。T-Eval不仅与面向结果的评估保持一致,还能对LLM的能力提供更细粒度的分析,为LLM工具利用能力的评估提供了一个新的视角。该基准将在 https://github.com/open-compass/T-Eval 上提供。

摘要: Large language models (LLM) have achieved remarkable performance on various NLP tasks and are augmented by tools for broader applications. Yet, how to evaluate and analyze the tool-utilization capability of LLMs is still under-explored. In contrast to previous works that evaluate models holistically, we comprehensively decompose the tool utilization into multiple sub-processes, including instruction following, planning, reasoning, retrieval, understanding, and review. Based on that, we further introduce T-Eval to evaluate the tool utilization capability step by step. T-Eval disentangles the tool utilization evaluation into several sub-domains along model capabilities, facilitating the inner understanding of both holistic and isolated competency of LLMs. We conduct extensive experiments on T-Eval and in-depth analysis of various LLMs. T-Eval not only exhibits consistency with the outcome-oriented evaluation but also provides a more fine-grained analysis of the capabilities of LLMs, providing a new perspective in LLM evaluation on tool-utilization ability. The benchmark will be available at https://github.com/open-compass/T-Eval.

[Downlink:]http://arxiv.org/abs/2312.14033v3

[Project:]https://open-compass.github.io/T-Eval|

[GitHub:]https://github.com/open-compass/T-Eval|


标题: Physio: An LLM-Based Physiotherapy Advisor

作者: Rúben Almeida, Hugo Sousa, Luís F. Cunha

中文摘要: 最新语言模型能力的提升增加了将其集成到现实世界应用中的兴趣。然而,这些模型会生成看似合理但不正确的文本,这限制了它们在若干领域中的使用。医疗保健就是一个典型例子:在该领域,文本生成的可信度是保障患者健康的硬性要求。在本文中,我们介绍了Physio,一个基于聊天的物理康复应用。Physio能够做出初步诊断,并引用可靠的健康来源来支持所提供的信息。此外,借助外部知识库,Physio还可以推荐康复锻炼和缓解症状的非处方药。通过结合这些特性,Physio既能利用生成模型的语言处理能力,又能使其回答建立在可靠、可验证的来源之上。Physio的在线演示见 https://physio.inesctec.pt 。

摘要: The capabilities of the most recent language models have increased the
interest in integrating them into real-world applications. However, the fact
that these models generate plausible, yet incorrect text poses a constraint
when considering their use in several domains. Healthcare is a prime example of
a domain where text-generative trustworthiness is a hard requirement to
safeguard patient well-being. In this paper, we present Physio, a chat-based
application for physical rehabilitation. Physio is capable of making an initial
diagnosis while citing reliable health sources to support the information
provided. Furthermore, drawing upon external knowledge databases, Physio can
recommend rehabilitation exercises and over-the-counter medication for symptom
relief. By combining these features, Physio can leverage the power of
generative models for language processing while also conditioning its response
on dependable and verifiable sources. A live demo of Physio is available at
https://physio.inesctec.pt.

[Downlink:]http://arxiv.org/abs/2401.01825v1

[Project:]https://physio.inesctec.pt|
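
下面是一个极简示意,对应摘要中"先从可靠知识库检索、再以检索结果为条件生成并标注来源"的思路。其中 knowledge_base 的条目、打分方式和 generate 接口均为假设,仅用于说明流程,并非Physio的实际实现。

```python
# 示意:检索可靠来源 + 以来源为条件生成回答(非 Physio 官方实现)。
# retrieve 逻辑用朴素的关键词重合度代替,generate 代表任意 LLM 调用。
def answer_with_sources(question, knowledge_base, generate):
    # 朴素检索:按关键词重合度取最相关的两条条目
    scored = sorted(knowledge_base,
                    key=lambda doc: -sum(w in doc["text"] for w in question.split()))
    sources = scored[:2]
    context = "\n".join(f"[{d['id']}] {d['text']}" for d in sources)
    prompt = f"仅依据以下资料回答,并在句末标注来源编号:\n{context}\n\n问题:{question}"
    return generate(prompt), [d["id"] for d in sources]

kb = [{"id": "S1", "text": "膝关节 拉伤 建议 冰敷 与 休息"},
      {"id": "S2", "text": "布洛芬 可 缓解 轻度 疼痛"}]
reply, cited = answer_with_sources("膝关节 拉伤 怎么办", kb,
                                   generate=lambda p: "(此处为模型回答)")
print(cited)   # ['S1', 'S2']
```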


标题: Understanding the Effects of RLHF on LLM Generalisation and Diversity

作者: Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis

中文摘要: 通过人类反馈强化学习(RLHF)微调的大型语言模型(LLM)已被用于迄今部署最广泛的一些人工智能模型中,如OpenAI的ChatGPT或Anthropic的Claude。虽然在开发这些方法方面已经有大量工作,但我们对RLHF各个阶段的优缺点的理解仍然有限。为了填补这一空白,我们对该流程的每个阶段(即监督微调(SFT)、奖励建模和RLHF)如何影响两个关键属性进行了广泛分析:分布外(OOD)泛化和输出多样性。考虑到这些模型被使用的现实场景非常广泛,OOD泛化至关重要;而输出多样性指模型生成多样化输出的能力,对各种用例同样重要。我们在两个基础模型上针对摘要和指令遵循任务进行了分析,后者与当前的LLM用例高度相关。我们发现RLHF比SFT更能泛化到新的输入,特别是当训练和测试之间的分布偏移变大时。然而,与SFT相比,RLHF在多种度量下都显著降低了输出多样性,这意味着当前的LLM微调方法在泛化和多样性之间存在权衡。我们的结果为根据应用选择微调方法提供了指导,并表明需要更多研究来改善泛化与多样性之间的权衡。

摘要: Large language models (LLMs) fine-tuned with reinforcement learning from
human feedback (RLHF) have been used in some of the most widely deployed AI
models to date, such as OpenAI’s ChatGPT or Anthropic’s Claude. % , or Meta’s
LLaMA-2. While there has been significant work developing these methods, our
understanding of the benefits and downsides of each stage in RLHF is still
limited. To fill this gap, we present an extensive analysis of how each stage
of the process (i.e.~supervised fine-tuning (SFT), reward modelling, and RLHF)
affects two key properties: out-of-distribution (OOD) generalisation and output
diversity. OOD generalisation is crucial given the wide range of real-world
scenarios in which these models are being used, while output diversity refers
to the model’s ability to generate varied outputs and is important for a
variety of use cases. We perform our analysis across two base models on both
summarisation and instruction following tasks, the latter being highly relevant
for current LLM use cases. We find that RLHF generalises better than SFT to new
inputs, particularly as the distribution shift between train and test becomes
larger. However, RLHF significantly reduces output diversity compared to SFT
across a variety of measures, implying a tradeoff in current LLM fine-tuning
methods between generalisation and diversity. Our results provide guidance on
which fine-tuning method should be used depending on the application, and show
that more research is needed to improve the tradeoff between generalisation and
diversity.

[Downlink:]http://arxiv.org/abs/2310.06452v2

[GitHub:]https://github.com/facebookresearch/rlfh-gen-div|


== CLIP @ Visual transformers @ VLM @ visual model ==

标题: Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning

作者: Xiang Li, Varun Belagali, Jinghuan Shang

中文摘要: 序列建模方法在机器人模仿学习中显示出了良好的效果。最近,扩散模型以序列建模的方式被用于行为克隆,得益于其在建模复杂数据分布方面的卓越能力。标准的基于扩散的策略以输入状态为条件,从随机噪声迭代地生成动作序列。尽管如此,扩散策略的模型在视觉表示方面仍可进一步改进。在这项工作中,我们提出了Crossway Diffusion,这是一种简单而有效的方法,通过精心设计的状态解码器和辅助自监督学习(SSL)目标来增强基于扩散的视觉运动策略学习。状态解码器从反向扩散过程的中间表示重建原始图像像素和其他状态信息。整个模型由SSL目标和原始扩散损失共同优化。我们的实验证明了Crossway Diffusion在各种模拟和真实世界机器人任务中的有效性,证实了其相对于标准的基于扩散的策略的一致优势,以及相对于基线的显著改进。

摘要: Sequence modeling approaches have shown promising results in robot imitation learning. Recently, diffusion models have been adopted for behavioral cloning in a sequence modeling fashion, benefiting from their exceptional capabilities in modeling complex data distributions. The standard diffusion-based policy iteratively generates action sequences from random noise conditioned on the input states. Nonetheless, the model for diffusion policy can be further improved in terms of visual representations. In this work, we propose Crossway Diffusion, a simple yet effective method to enhance diffusion-based visuomotor policy learning via a carefully designed state decoder and an auxiliary self-supervised learning (SSL) objective. The state decoder reconstructs raw image pixels and other state information from the intermediate representations of the reverse diffusion process. The whole model is jointly optimized by the SSL objective and the original diffusion loss. Our experiments demonstrate the effectiveness of Crossway Diffusion in various simulated and real-world robot tasks, confirming its consistent advantages over the standard diffusion-based policy and substantial improvements over the baselines.

[Downlink:]http://arxiv.org/abs/2307.01849v3

[Project:]https://youtu.be/9deKHueZBuk|

[GitHub:]https://github.com/LostXine/crossway_diffusion|
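
下面用一个玩具化的PyTorch片段示意摘要中的核心思想:在扩散策略的去噪目标之外,增加一个状态解码器的自监督重建损失,并对二者联合优化。网络结构、加噪方式与权重 lambda_ssl 均为示意性假设,并非Crossway Diffusion官方代码。

```python
# 示意:扩散损失 + 状态解码器自监督重建损失的联合优化(非官方实现)。
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    def __init__(self, obs_dim=64, act_dim=7, hid=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hid), nn.ReLU())   # 得到中间表示
        self.denoiser = nn.Linear(hid + act_dim + 1, act_dim)              # 预测噪声
        self.state_decoder = nn.Linear(hid, obs_dim)                       # 辅助重建观测

    def forward(self, obs, noisy_act, t):
        h = self.encoder(obs)
        eps_hat = self.denoiser(torch.cat([h, noisy_act, t], dim=-1))
        recon = self.state_decoder(h)
        return eps_hat, recon

def joint_loss(model, obs, act, lambda_ssl=0.1):
    eps = torch.randn_like(act)
    t = torch.rand(act.size(0), 1)
    noisy_act = act + t * eps                                 # 简化的加噪(示意,非真实噪声调度)
    eps_hat, recon = model(obs, noisy_act, t)
    diff_loss = nn.functional.mse_loss(eps_hat, eps)          # 原扩散目标
    ssl_loss = nn.functional.mse_loss(recon, obs)             # 自监督重建目标
    return diff_loss + lambda_ssl * ssl_loss

model = TinyPolicy()
loss = joint_loss(model, torch.randn(8, 64), torch.randn(8, 7))
loss.backward()
```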


标题: Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

作者: David Junhao Zhang, Dongxu Li, Hung Le

中文摘要: 大多数现有的视频扩散模型(VDM)仅限于文本条件。因此,它们通常缺乏对所生成视频的视觉外观和几何结构的控制。这项工作介绍了Moonshot,一种新的视频生成模型,它同时基于图像和文本的多模态输入。该模型建立在一个核心模块上,称为多模态视频块(MVB),它由用于表示视频特征的传统时空层和用于处理图像和文本输入以进行外观调节的解耦交叉注意力层组成。此外,我们仔细设计了模型架构,使得它可以选择性地与预先训练的图像控制网络模块集成,用于几何视觉条件,而不需要额外的训练开销,这与先前的方法相反。实验表明,与现有模型相比,Moonshot具有多功能的多模态调节机制,在视觉质量和时间一致性方面表现出显著的改善。此外,该模型可以很容易地重新用于各种生成应用,如个性化视频生成、图像动画和视频编辑,揭示了其作为可控视频生成的基本架构的潜力。模型将在https://github.com/salesforce/LAVIS上公开。

摘要: Most existing video diffusion models (VDMs) are limited to mere text
conditions. Thereby, they are usually lacking in control over visual appearance
and geometry structure of the generated videos. This work presents Moonshot, a
new video generation model that conditions simultaneously on multimodal inputs
of image and text. The model builts upon a core module, called multimodal video
block (MVB), which consists of conventional spatialtemporal layers for
representing video features, and a decoupled cross-attention layer to address
image and text inputs for appearance conditioning. In addition, we carefully
design the model architecture such that it can optionally integrate with
pre-trained image ControlNet modules for geometry visual conditions, without
needing of extra training overhead as opposed to prior methods. Experiments
show that with versatile multimodal conditioning mechanisms, Moonshot
demonstrates significant improvement on visual quality and temporal consistency
compared to existing models. In addition, the model can be easily repurposed
for a variety of generative applications, such as personalized video
generation, image animation and video editing, unveiling its potential to serve
as a fundamental architecture for controllable video generation. Models will be
made public on https://github.com/salesforce/LAVIS.

[Downlink:]http://arxiv.org/abs/2401.01827v1

[Project:]https://showlab.github.io/Moonshot/|

[GitHub:]https://github.com/salesforce/LAVIS|


标题: SVGDreamer: Text Guided SVG Generation with Diffusion Model

作者: Ximing Xing, Haitao Zhou, Chuang Wang

中文摘要: 最近,文本引导的可缩放矢量图形(SVG)合成在图标设计和素描等领域显示出了前景。然而,现有的文本到SVG生成方法缺乏可编辑性,且在视觉质量和结果多样性方面表现不佳。为了解决这些限制,我们提出了一种新的文本引导矢量图形合成方法,称为SVGDreamer。SVGDreamer整合了语义驱动的图像矢量化(SIVE)过程,能够将合成分解为前景对象和背景,从而增强可编辑性。具体来说,SIVE过程引入了基于注意力的图元控制和注意力掩码损失函数,用于有效控制和操作单个元素。此外,我们提出了一种基于粒子的矢量化分数蒸馏(VPSD)方法,以解决现有文本到SVG生成方法中颜色过饱和、矢量图元过度平滑和结果多样性有限的问题。此外,在VPSD的基础上,我们引入了奖励反馈学习(ReFL)来加速VPSD收敛并提升美学效果。大量实验验证了SVGDreamer的有效性,证明了它在可编辑性、视觉质量和多样性方面优于基线方法。SVGDreamer的代码和演示可在 https://ximinng.github.io/SVGDreamer-project/ 找到。

摘要: Recently, text-guided scalable vector graphics (SVGs) synthesis has shown
promise in domains such as iconography and sketch. However, existing
text-to-SVG generation methods lack editability and struggle with visual
quality and result diversity. To address these limitations, we propose a novel
text-guided vector graphics synthesis method called SVGDreamer. SVGDreamer
incorporates a semantic-driven image vectorization (SIVE) process that enables
the decomposition of synthesis into foreground objects and background, thereby
enhancing editability. Specifically, the SIVE process introduce attention-based
primitive control and an attention-mask loss function for effective control and
manipulation of individual elements. Additionally, we propose a Vectorized
Particle-based Score Distillation (VPSD) approach to tackle the challenges of
color over-saturation, vector primitives over-smoothing, and limited result
diversity in existing text-to-SVG generation methods. Furthermore, on the basis
of VPSD, we introduce Reward Feedback Learning (ReFL) to accelerate VPSD
convergence and improve aesthetic appeal. Extensive experiments have been
conducted to validate the effectiveness of SVGDreamer, demonstrating its
superiority over baseline methods in terms of editability, visual quality, and
diversity. The code and demo of SVGDreamer can be found at
\href{https://ximinng.github.io/SVGDreamer-project/}{https://ximinng.github.io/SVGDreamer-project/}.

[Downlink:]http://arxiv.org/abs/2312.16476v2

[Project:]https://ximinng.github.io/SVGDreamer-project/|


标题: ODTrack: Online Dense Temporal Token Learning for Visual Tracking

作者: Yaozong Zheng, Bineng Zhong, Qihua Liang

中文摘要: 跨连续视频帧的在线上下文推理与关联对于视觉跟踪中的实例感知至关重要。然而,当前大多数性能领先的跟踪器以离线方式持续依赖参考帧与搜索帧之间稀疏的时间关系。因此,它们只能在每个图像对内部独立交互,建立有限的时间相关性。为缓解上述问题,我们提出了一种简单、灵活且有效的视频级跟踪流水线,名为ODTrack,它以在线token传播的方式密集关联视频帧之间的上下文关系。ODTrack可接收任意长度的视频帧,以捕捉实例的时空轨迹关系,并将目标的判别特征(定位信息)压缩为token序列,实现帧到帧的关联。这一新方案带来以下好处:1)提纯后的token序列可以作为下一视频帧推理的提示,从而利用过去的信息指导未来的推理;2)通过token序列的迭代传播,有效避免了复杂的在线更新策略,从而实现更高效的模型表示与计算。ODTrack在七个基准上取得了新的SOTA性能,同时以实时速度运行。代码和模型可在 https://github.com/GXNU-ZhongLab/ODTrack 获得。

摘要: Online contextual reasoning and association across consecutive video frames
are critical to perceive instances in visual tracking. However, most current
top-performing trackers persistently lean on sparse temporal relationships
between reference and search frames via an offline mode. Consequently, they can
only interact independently within each image-pair and establish limited
temporal correlations. To alleviate the above problem, we propose a simple,
flexible and effective video-level tracking pipeline, named \textbf{ODTrack},
which densely associates the contextual relationships of video frames in an
online token propagation manner. ODTrack receives video frames of arbitrary
length to capture the spatio-temporal trajectory relationships of an instance,
and compresses the discrimination features (localization information) of a
target into a token sequence to achieve frame-to-frame association. This new
solution brings the following benefits: 1) the purified token sequences can
serve as prompts for the inference in the next video frame, whereby past
information is leveraged to guide future inference; 2) the complex online
update strategies are effectively avoided by the iterative propagation of token
sequences, and thus we can achieve more efficient model representation and
computation. ODTrack achieves a new \textit{SOTA} performance on seven
benchmarks, while running at real-time speed. Code and models are available at
\url{https://github.com/GXNU-ZhongLab/ODTrack}.

[Downlink:]http://arxiv.org/abs/2401.01686v1

[GitHub:]https://github.com/GXNU-ZhongLab/ODTrack|


标题: Learning Prompt with Distribution-Based Feature Replay for Few-Shot Class-Incremental Learning

作者: Zitong Huang, Ze Chen, Zhixing Chen

中文摘要: 少样本类增量学习(FSCIL)旨在基于非常有限的训练数据持续学习新类别,同时不遗忘已学过的旧类别。现有研究仅依赖纯视觉网络;而在本文中,我们利用视觉-语言模型(如CLIP)来解决FSCIL问题,并提出了一个简单而有效的框架,名为基于分布特征回放的提示学习(Learning Prompt with Distribution-based Feature Replay, LP-DiF)。我们观察到,仅用CLIP做零样本评估就能大幅超越最有影响力的方法。在此基础上,我们采用提示调优(prompt tuning)技术进一步提升模型的适应能力,使模型能够不断从每个会话中获取特定知识。为了防止可学习提示在新会话中遗忘旧知识,我们提出了一种伪特征回放方法。具体来说,我们通过为每个类别维护一个带对角协方差矩阵的特征级高斯分布来保留旧知识,该分布由训练图像的图像特征和VAE生成的合成特征共同估计。进入新会话时,从旧类分布中采样伪特征,并与当前会话的训练图像相结合来优化提示,从而使模型在学习新知识的同时保留旧知识。在三个常用基准(CIFAR100、mini-ImageNet、CUB-200)以及本文提出的两个更具挑战性的基准(SUN-397和CUB-200*)上的实验展示了LP-DiF的优越性,在FSCIL中取得了新的最先进(SOTA)结果。代码可在 https://github.com/1170300714/LP-DiF 获得。

摘要: Few-shot Class-Incremental Learning (FSCIL) aims to continuously learn new
classes based on very limited training data without forgetting the old ones
encountered. Existing studies solely relied on pure visual networks, while in
this paper we solved FSCIL by leveraging the Vision-Language model (e.g., CLIP)
and propose a simple yet effective framework, named Learning Prompt with
Distribution-based Feature Replay (LP-DiF). We observe that simply using CLIP
for zero-shot evaluation can substantially outperform the most influential
methods. Then, prompt tuning technique is involved to further improve its
adaptation ability, allowing the model to continually capture specific
knowledge from each session. To prevent the learnable prompt from forgetting
old knowledge in the new session, we propose a pseudo-feature replay approach.
Specifically, we preserve the old knowledge of each class by maintaining a
feature-level Gaussian distribution with a diagonal covariance matrix, which is
estimated by the image features of training images and synthesized features
generated from a VAE. When progressing to a new session, pseudo-features are
sampled from old-class distributions combined with training images of the
current session to optimize the prompt, thus enabling the model to learn new
knowledge while retaining old knowledge. Experiments on three prevalent
benchmarks, i.e., CIFAR100, mini-ImageNet, CUB-200, and two more challenging
benchmarks, i.e., SUN-397 and CUB-200* proposed in this paper showcase the
superiority of LP-DiF, achieving new state-of-the-art (SOTA) in FSCIL. Code is
publicly available at https://github.com/1170300714/LP-DiF.

[Downlink:]http://arxiv.org/abs/2401.01598v1

[GitHub:]https://github.com/1170300714/LP-DiF|
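
为说明摘要中"为每个旧类维护对角协方差的高斯分布,并在新会话中采样伪特征与新样本一起优化提示"的做法,下面给出一个最小化的numpy示意;统计量的估计方式和数值均为假设,并非LP-DiF官方实现。

```python
# 示意:按类维护(均值, 对角方差)并采样伪特征做回放(非官方实现)。
import numpy as np

class ClassGaussianBank:
    def __init__(self):
        self.stats = {}                         # class_id -> (mean, diag_var)

    def update(self, class_id, feats):
        # feats: (N, D) 该类训练图像(可含 VAE 合成特征)的特征
        self.stats[class_id] = (feats.mean(0), feats.var(0) + 1e-6)

    def sample(self, class_id, n, rng):
        mean, var = self.stats[class_id]
        return rng.normal(mean, np.sqrt(var), size=(n, mean.shape[0]))

rng = np.random.default_rng(0)
bank = ClassGaussianBank()
bank.update(0, rng.normal(0.0, 1.0, size=(50, 16)))      # 旧类 0 的特征统计
old_pseudo = bank.sample(0, n=20, rng=rng)                # 新会话中回放的伪特征
new_feats = rng.normal(2.0, 1.0, size=(5, 16))            # 新类的少量真实特征
train_batch = np.concatenate([old_pseudo, new_feats])     # 伪特征 + 新样本共同用于优化提示
print(train_batch.shape)                                  # (25, 16)
```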


标题: GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval

作者: Yuting Wang, Jinpeng Wang, Bin Chen

中文摘要: 给定一个文本查询,部分相关视频检索(PRVR)旨在从数据库中找到包含相关时刻的未剪辑视频。对于PRVR来说,片段(clip)建模对于捕捉文本与视频之间的部分相关关系至关重要。目前的PRVR方法采用基于扫描的片段构造来实现显式片段建模,这种方式信息冗余且需要很大的存储开销。为解决PRVR方法的效率问题,本文提出了GMMFormer,一种基于高斯混合模型的Transformer,它隐式地建模片段表示。在帧交互过程中,我们引入高斯混合模型约束,使每一帧聚焦于其相邻帧而非整个视频;由此生成的表示将包含多尺度的片段信息,实现隐式片段建模。此外,现有PRVR方法忽略了与同一视频相关的不同文本查询之间的语义差异,导致嵌入空间稀疏。我们提出了一种查询多样性损失来区分这些文本查询,使嵌入空间更加紧凑并包含更多语义信息。在三个大规模视频数据集(TVR、ActivityNet Captions和Charades-STA)上的大量实验证明了GMMFormer的优越性和效率。代码可在 https://github.com/huangmozhi9527/GMMFormer 获得。

摘要: Given a text query, partially relevant video retrieval (PRVR) seeks to find
untrimmed videos containing pertinent moments in a database. For PRVR, clip
modeling is essential to capture the partial relationship between texts and
videos. Current PRVR methods adopt scanning-based clip construction to achieve
explicit clip modeling, which is information-redundant and requires a large
storage overhead. To solve the efficiency problem of PRVR methods, this paper
proposes GMMFormer, a Gaussian-Mixture-Model based Transformer which models
clip representations implicitly. During frame interactions, we incorporate
Gaussian-Mixture-Model constraints to focus each frame on its adjacent frames
instead of the whole video. Then generated representations will contain
multi-scale clip information, achieving implicit clip modeling. In addition,
PRVR methods ignore semantic differences between text queries relevant to the
same video, leading to a sparse embedding space. We propose a query diverse
loss to distinguish these text queries, making the embedding space more
intensive and contain more semantic information. Extensive experiments on three
large-scale video datasets (i.e., TVR, ActivityNet Captions, and Charades-STA)
demonstrate the superiority and efficiency of GMMFormer. Code is available at
\url{https://github.com/huangmozhi9527/GMMFormer}.

[Downlink:]http://arxiv.org/abs/2310.05195v2

[GitHub:]https://github.com/huangmozhi9527/GMMFormer|
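
下面的小例子示意摘要中"以高斯约束让每帧主要关注相邻帧,并用不同尺度得到多尺度片段信息"的思路;这里只构造了按距离衰减的注意力权重并做简单平均,与Transformer注意力结合的具体方式为假设,并非GMMFormer官方代码。

```python
# 示意:高斯约束的帧间注意力权重与多尺度聚合(非 GMMFormer 官方实现)。
# 不同 sigma 对应不同的感受野,叠加后得到多尺度的片段信息。
import numpy as np

def gaussian_attention_bias(num_frames, sigma):
    idx = np.arange(num_frames)
    dist2 = (idx[:, None] - idx[None, :]) ** 2
    w = np.exp(-dist2 / (2 * sigma ** 2))
    return w / w.sum(axis=1, keepdims=True)      # 每行归一化为注意力权重

frames = np.random.rand(8, 32)                   # 8 帧、32 维的玩具帧特征
multi_scale = [gaussian_attention_bias(8, s) @ frames for s in (1.0, 2.0, 4.0)]
clip_repr = np.mean(multi_scale, axis=0)         # 简单平均得到隐式的多尺度片段表示
print(clip_repr.shape)                           # (8, 32)
```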


== diffusion policy@diffusion formulation@diffusion model ==

标题: VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

作者: Haoxin Chen, Yong Zhang, Xiaodong Cun

中文摘要: 文本到视频生成旨在根据给定的提示生成视频。最近,一些商业视频模型已经能够生成噪声极小、细节出色且美学分数很高的可信视频。然而,这些模型依赖于大规模、经过良好过滤的高质量视频,而社区无法获取这些数据。许多现有研究工作使用低质量的WebVid-10M数据集训练模型,由于模型被优化去拟合WebVid-10M,因此很难生成高质量的视频。在这项工作中,我们探索了从Stable Diffusion扩展而来的视频模型的训练方案,并研究了利用低质量视频和合成的高质量图像来获得高质量视频模型的可行性。我们首先分析了视频模型的空间模块与时间模块之间的联系,以及向低质量视频的分布偏移。我们观察到,与仅训练时间模块相比,对所有模块进行完整训练会导致空间模块与时间模块之间更强的耦合。基于这种更强的耦合,我们通过用高质量图像微调空间模块,将分布转移到更高的质量且不产生运动退化,从而得到一个通用的高质量视频模型。我们进行了评估以证明所提方法的优越性,特别是在画面质量、运动和概念组合方面。

摘要: Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works, which train models using the low-quality WebVid-10M dataset, struggle to generate high-quality videos because the models are optimized to fit WebVid-10M. In this work, we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model. We first analyze the connection between the spatial and temporal modules of video models and the distribution shift to low-quality videos. We observe that full training of all modules results in a stronger coupling between spatial and temporal modules than only training temporal modules. Based on this stronger coupling, we shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model. Evaluations are conducted to demonstrate the superiority of the proposed method, particularly in picture quality, motion, and concept composition.

[Downlink:]http://arxiv.org/abs/2401.09047v1

[Project:]https://ailab-cvc.github.io/videocrafter|

[GitHub:]https://github.com/AILab-CVC/VideoCrafter|


标题: InstantID: Zero-shot Identity-Preserving Generation in Seconds

作者: Qixun Wang, Xu Bai, Haofan Wang

中文摘要: 使用Textual Inversion、DreamBooth和LoRA等方法进行个性化图像合成已经取得了重大进展。然而,它们在现实世界中的适用性受到高存储需求、冗长的微调过程以及对多张参考图像的需求的阻碍。相反,现有的基于ID嵌入的方法虽然只需要单次前向推理,但也面临挑战:它们要么需要对大量模型参数进行广泛微调,要么缺乏与社区预训练模型的兼容性,要么无法保持较高的人脸保真度。针对这些限制,我们引入了InstantID,一个强大的基于扩散模型的解决方案。我们的即插即用模块仅使用一张面部图像就能处理各种风格的图像个性化,同时确保高保真度。为了实现这一点,我们设计了一个新的IdentityNet,通过施加强语义和弱空间条件,将面部图像、关键点图像与文本提示相结合来引导图像生成。InstantID展示了卓越的性能和效率,在身份保持至关重要的实际应用中非常有益。此外,我们的工作可以与SD1.5和SDXL等流行的预训练文本到图像扩散模型无缝集成,作为一个适应性强的插件。我们的代码和预训练检查点将在 https://github.com/InstantID/InstantID 上提供。

摘要: There has been significant progress in personalized image synthesis with methods such as Textual Inversion, DreamBooth, and LoRA. Yet, their real-world applicability is hindered by high storage demands, lengthy fine-tuning processes, and the need for multiple reference images. Conversely, existing ID embedding-based methods, while requiring only a single forward inference, face challenges: they either necessitate extensive fine-tuning across numerous model parameters, lack compatibility with community pre-trained models, or fail to maintain high face fidelity. Addressing these limitations, we introduce InstantID, a powerful diffusion model-based solution. Our plug-and-play module adeptly handles image personalization in various styles using just a single facial image, while ensuring high fidelity. To achieve this, we design a novel IdentityNet by imposing strong semantic and weak spatial conditions, integrating facial and landmark images with textual prompts to steer the image generation. InstantID demonstrates exceptional performance and efficiency, proving highly beneficial in real-world applications where identity preservation is paramount. Moreover, our work seamlessly integrates with popular pre-trained text-to-image diffusion models like SD1.5 and SDXL, serving as an adaptable plugin. Our codes and pre-trained checkpoints will be available at https://github.com/InstantID/InstantID.

[Downlink:]http://arxiv.org/abs/2401.07519v1

[Project:]https://instantid.github.io/|

[GitHub:]https://github.com/InstantID/InstantID|


标题: Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning

作者: Xiang Li, Varun Belagali, Jinghuan Shang

中文摘要: 序列建模方法在机器人模仿学习中显示出有希望的结果。最近,扩散模型已经以序列建模的方式被用于行为克隆,受益于它们在建模复杂数据分布方面的卓越能力。标准的基于扩散的策略以输入状态为条件,从随机噪声中迭代地生成动作序列。尽管如此,扩散策略的模型在视觉表示方面仍可进一步改进。在这项工作中,我们提出了Crossway Diffusion,这是一种简单而有效的方法,通过精心设计的状态解码器和辅助的自监督学习(SSL)目标来增强基于扩散的视觉运动策略学习。状态解码器从反向扩散过程的中间表示重建原始图像像素和其他状态信息。整个模型由SSL目标和原始扩散损失联合优化。我们的实验证明了Crossway Diffusion在各种模拟和真实世界机器人任务中的有效性,证实了它相对于标准的基于扩散的策略的一致优势以及对基线的实质性改进。

摘要: Sequence modeling approaches have shown promising results in robot imitation learning. Recently, diffusion models have been adopted for behavioral cloning in a sequence modeling fashion, benefiting from their exceptional capabilities in modeling complex data distributions. The standard diffusion-based policy iteratively generates action sequences from random noise conditioned on the input states. Nonetheless, the model for diffusion policy can be further improved in terms of visual representations. In this work, we propose Crossway Diffusion, a simple yet effective method to enhance diffusion-based visuomotor policy learning via a carefully designed state decoder and an auxiliary self-supervised learning (SSL) objective. The state decoder reconstructs raw image pixels and other state information from the intermediate representations of the reverse diffusion process. The whole model is jointly optimized by the SSL objective and the original diffusion loss. Our experiments demonstrate the effectiveness of Crossway Diffusion in various simulated and real-world robot tasks, confirming its consistent advantages over the standard diffusion-based policy and substantial improvements over the baselines.

[Downlink:]http://arxiv.org/abs/2307.01849v3

[Project:]https://youtu.be/9deKHueZBuk|

[GitHub:]https://github.com/LostXine/crossway_diffusion|


标题: Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

作者: David Junhao Zhang, Dongxu Li, Hung Le

中文摘要: 大多数现有的视频扩散模型(VDM)仅限于文本条件。因此,它们通常缺乏对所生成视频的视觉外观和几何结构的控制。这项工作介绍了Moonshot,一种新的视频生成模型,它同时基于图像和文本的多模态输入。该模型建立在一个核心模块上,称为多模态视频块(MVB),它由用于表示视频特征的传统时空层和用于处理图像和文本输入以进行外观调节的解耦交叉注意力层组成。此外,我们仔细设计了模型架构,使得它可以选择性地与预先训练的图像控制网络模块集成,用于几何视觉条件,而不需要额外的训练开销,这与先前的方法相反。实验表明,与现有模型相比,Moonshot具有多功能的多模态调节机制,在视觉质量和时间一致性方面表现出显著的改善。此外,该模型可以很容易地重新用于各种生成应用,如个性化视频生成、图像动画和视频编辑,揭示了其作为可控视频生成的基本架构的潜力。模型将在https://github.com/salesforce/LAVIS上公开。

摘要: Most existing video diffusion models (VDMs) are limited to mere text
conditions. Thereby, they are usually lacking in control over visual appearance
and geometry structure of the generated videos. This work presents Moonshot, a
new video generation model that conditions simultaneously on multimodal inputs
of image and text. The model builts upon a core module, called multimodal video
block (MVB), which consists of conventional spatialtemporal layers for
representing video features, and a decoupled cross-attention layer to address
image and text inputs for appearance conditioning. In addition, we carefully
design the model architecture such that it can optionally integrate with
pre-trained image ControlNet modules for geometry visual conditions, without
needing of extra training overhead as opposed to prior methods. Experiments
show that with versatile multimodal conditioning mechanisms, Moonshot
demonstrates significant improvement on visual quality and temporal consistency
compared to existing models. In addition, the model can be easily repurposed
for a variety of generative applications, such as personalized video
generation, image animation and video editing, unveiling its potential to serve
as a fundamental architecture for controllable video generation. Models will be
made public on https://github.com/salesforce/LAVIS.

[Downlink:]http://arxiv.org/abs/2401.01827v1

[Project:]https://showlab.github.io/Moonshot/|

[GitHub:]https://github.com/salesforce/LAVIS|


标题: CoMoSVC: Consistency Model-based Singing Voice Conversion

作者: Yiwen Lu, Zhen Ye, Wei Xue

中文摘要: 基于扩散的歌声转换(SVC)方法已取得显著的性能,能够生成与目标音色高度相似的自然音频。然而,迭代采样过程导致推理速度慢,因此加速变得至关重要。本文提出了一种基于一致性模型的SVC方法CoMoSVC,旨在同时实现高质量生成和高速采样。我们首先专门为SVC设计了一个基于扩散的教师模型,再在自洽性约束下蒸馏出学生模型,以实现一步采样。在单个NVIDIA GTX4090 GPU上的实验表明,尽管CoMoSVC的推理速度明显快于最先进的(SOTA)基于扩散的SVC系统,但基于主观和客观指标,它仍然实现了相当或更好的转换性能。音频样本和代码可在 https://comosvc.github.io/ 获得。

摘要: The diffusion-based Singing Voice Conversion (SVC) methods have achieved
remarkable performances, producing natural audios with high similarity to the
target timbre. However, the iterative sampling process results in slow
inference speed, and acceleration thus becomes crucial. In this paper, we
propose CoMoSVC, a consistency model-based SVC method, which aims to achieve
both high-quality generation and high-speed sampling. A diffusion-based teacher
model is first specially designed for SVC, and a student model is further
distilled under self-consistency properties to achieve one-step sampling.
Experiments on a single NVIDIA GTX4090 GPU reveal that although CoMoSVC has a
significantly faster inference speed than the state-of-the-art (SOTA)
diffusion-based SVC system, it still achieves comparable or superior conversion
performance based on both subjective and objective metrics. Audio samples and
codes are available at https://comosvc.github.io/.

[Downlink:]http://arxiv.org/abs/2401.01792v1

[Project:]https://comosvc.github.io/|


标题: SVGDreamer: Text Guided SVG Generation with Diffusion Model

作者: Ximing Xing, Haitao Zhou, Chuang Wang

中文摘要: 最近,文本引导的可缩放矢量图形(SVG)合成在图标设计和草图等领域显示出了前景。然而,现有的文本到SVG生成方法缺乏可编辑性,且在视觉质量和结果多样性方面表现不佳。为了解决这些限制,我们提出了一种新的文本引导矢量图形合成方法,称为SVGDreamer。SVGDreamer整合了语义驱动的图像矢量化(SIVE)过程,能够将合成分解为前景对象和背景,从而增强可编辑性。具体来说,SIVE过程引入了基于注意力的图元控制和注意力掩码损失函数,用于有效控制和操作单个元素。此外,我们提出了一种基于粒子的矢量化分数蒸馏(VPSD)方法,以解决现有文本到SVG生成方法中颜色过饱和、矢量图元过度平滑和结果多样性有限的问题。此外,在VPSD的基础上,我们引入了奖励反馈学习(ReFL)来加速VPSD收敛并提升美学效果。大量实验验证了SVGDreamer的有效性,证明了它在可编辑性、视觉质量和多样性方面优于基线方法。SVGDreamer的代码和演示可在 https://ximinng.github.io/SVGDreamer-project/ 找到。

摘要: Recently, text-guided scalable vector graphics (SVGs) synthesis has shown
promise in domains such as iconography and sketch. However, existing
text-to-SVG generation methods lack editability and struggle with visual
quality and result diversity. To address these limitations, we propose a novel
text-guided vector graphics synthesis method called SVGDreamer. SVGDreamer
incorporates a semantic-driven image vectorization (SIVE) process that enables
the decomposition of synthesis into foreground objects and background, thereby
enhancing editability. Specifically, the SIVE process introduce attention-based
primitive control and an attention-mask loss function for effective control and
manipulation of individual elements. Additionally, we propose a Vectorized
Particle-based Score Distillation (VPSD) approach to tackle the challenges of
color over-saturation, vector primitives over-smoothing, and limited result
diversity in existing text-to-SVG generation methods. Furthermore, on the basis
of VPSD, we introduce Reward Feedback Learning (ReFL) to accelerate VPSD
convergence and improve aesthetic appeal. Extensive experiments have been
conducted to validate the effectiveness of SVGDreamer, demonstrating its
superiority over baseline methods in terms of editability, visual quality, and
diversity. The code and demo of SVGDreamer can be found at
\href{https://ximinng.github.io/SVGDreamer-project/}{https://ximinng.github.io/SVGDreamer-project/}.

[Downlink:]http://arxiv.org/abs/2312.16476v2

[Project:]https://ximinng.github.io/SVGDreamer-project/|


== Visual Navigation@Visual Exploration @ VSLAM ==

标题: Surgical-DINO: Adapter Learning of Foundation Models for Depth Estimation in Endoscopic Surgery

作者: Beilei Cui, Mobarakol Islam, Long Bai

中文摘要: 目的:机器人手术中的深度估计在三维重建、手术导航和增强现实可视化中至关重要。尽管基础模型在许多视觉任务(包括深度估计,例如DINOv2)中表现出色,但最近的工作观察到它们在医疗和外科等特定领域应用中的局限性。本工作提出了一种针对手术深度估计的基础模型低秩适应(LoRA)方法。方法:我们设计了一种基于基础模型的深度估计方法,称为Surgical-DINO,即针对内窥镜手术深度估计对DINOv2进行低秩适应。我们构建LoRA层并将其集成到DINO中,以适应手术特定领域的知识,而不是采用传统的微调。在训练过程中,我们冻结具有出色视觉表示能力的DINO图像编码器,仅优化LoRA层和深度解码器,以整合来自手术场景的特征。结果:我们的模型在MICCAI挑战赛的SCARED数据集上得到了广泛验证,该数据集采集自达芬奇Xi内窥镜手术。实验表明,Surgical-DINO在内窥镜深度估计任务中显著优于所有最先进的模型。消融研究的分析显示了我们的LoRA层及其适应方式的显著效果。结论:Surgical-DINO为基础模型成功适应外科领域的深度估计提供了一些启示。结果清楚地表明,直接使用在计算机视觉数据集上预训练权重的零样本预测或简单微调,都不足以将基础模型用于外科领域。代码可在 https://github.com/BeileiCui/SurgicalDINO 获得。

摘要: Purpose: Depth estimation in robotic surgery is vital in 3D reconstruction, surgical navigation and augmented reality visualization. Although the foundation model exhibits outstanding performance in many vision tasks, including depth estimation (e.g., DINOv2), recent works observed its limitations in medical and surgical domain-specific applications. This work presents a low-ranked adaptation (LoRA) of the foundation model for surgical depth estimation. Methods: We design a foundation model-based depth estimation method, referred to as Surgical-DINO, a low-rank adaptation of the DINOv2 for depth estimation in endoscopic surgery. We build LoRA layers and integrate them into DINO to adapt with surgery-specific domain knowledge instead of conventional fine-tuning. During training, we freeze the DINO image encoder, which shows excellent visual representation capacity, and only optimize the LoRA layers and depth decoder to integrate features from the surgical scene. Results: Our model is extensively validated on a MICCAI challenge dataset of SCARED, which is collected from da Vinci Xi endoscope surgery. We empirically show that Surgical-DINO significantly outperforms all the state-of-the-art models in endoscopic depth estimation tasks. The analysis with ablation studies has shown evidence of the remarkable effect of our LoRA layers and adaptation. Conclusion: Surgical-DINO shed some light on the successful adaptation of the foundation models into the surgical domain for depth estimation. There is clear evidence in the results that zero-shot prediction on pre-trained weights in computer vision datasets or naive fine-tuning is not sufficient to use the foundation model in the surgical domain directly. Code is available at https://github.com/BeileiCui/SurgicalDINO.

[Downlink:]http://arxiv.org/abs/2401.06013v2

[GitHub:]https://github.com/BeileiCui/SurgicalDINO|
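
下面用一个简短的PyTorch示意说明摘要中的训练方式:冻结预训练编码器的权重,仅训练注入的LoRA层(以及深度解码器)。示例中用普通的Linear层代替DINOv2内部的投影层,秩与缩放系数均为假设,并非Surgical-DINO官方实现。

```python
# 示意:冻结预训练层,仅训练注入的 LoRA 低秩增量(非官方实现)。
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                           # 冻结预训练权重
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

frozen = nn.Linear(384, 384)                                  # 假想的编码器内部投影层
layer = LoRALinear(frozen)
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
print(trainable)                                              # 只剩 ['A', 'B']
```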


标题: UMIE: Unified Multimodal Information Extraction with Instruction Tuning

作者: Lin Sun, Kai Zhang, Qingyuan Li

中文摘要: 随着多媒体内容的普及,多模态信息提取(MIE)获得了极大的关注。然而,当前的MIE方法经常采用特定于任务的模型结构,这导致跨任务的可泛化性有限,并且未能充分利用MIE任务之间的共享知识。为了解决这些问题,我们提出了UMIE,一个统一的多模态信息提取器,利用指令调优将三个MIE任务统一为生成问题,能够有效地提取文本和视觉提及。大量实验表明,我们的单个UMIE在三项任务的六个MIE数据集上优于各种最先进(SoTA)方法。此外,深入的分析证明了UMIE在零样本设置下的强大泛化能力、对指令变体的鲁棒性以及可解释性。我们的研究是迈向统一MIE模型的第一步,并开启了对MIE领域内指令调优和大型语言模型的探索。我们的代码、数据和模型可在 https://github.com/ZUCC-AI/UMIE 获得。

摘要: Multimodal information extraction (MIE) gains significant attention as the popularity of multimedia content increases. However, current MIE methods often resort to using task-specific model structures, which results in limited generalizability across tasks and underutilizes shared knowledge across MIE tasks. To address these issues, we propose UMIE, a unified multimodal information extractor to unify three MIE tasks as a generation problem using instruction tuning, being able to effectively extract both textual and visual mentions. Extensive experiments show that our single UMIE outperforms various state-of-the-art (SoTA) methods across six MIE datasets on three tasks. Furthermore, in-depth analysis demonstrates UMIE’s strong generalization in the zero-shot setting, robustness to instruction variants, and interpretability. Our research serves as an initial step towards a unified MIE model and initiates the exploration into both instruction tuning and large language models within the MIE domain. Our code, data, and model are available at https://github.com/ZUCC-AI/UMIE

[Downlink:]http://arxiv.org/abs/2401.03082v1

[GitHub:]https://github.com/ZUCC-AI/UMIE|


标题: Multi-Technique Sequential Information Consistency For Dynamic Visual Place Recognition In Changing Environments

作者: Bruno Arcanjo, Bruno Ferrarini, Michael Milford

中文摘要: 视觉位置识别(VPR)是机器人导航和定位系统的重要组成部分,它允许机器人仅使用图像数据来识别所处位置。VPR具有挑战性,因为每日光照变化、季节性天气变化和不同视角会导致同一地点的外观发生重大变化。目前,没有任何一种VPR技术能在所有环境条件下都表现出色,每种技术各有优缺点,因此组合多种技术可以获得更可靠的VPR性能。现有的多方法途径要么依赖通常无法获得的在线真值信息,要么依赖暴力的技术组合,当技术集合方差较大时可能降低性能。针对这些缺点,我们提出了一种称为多序列信息一致性(MuSIC)的VPR系统,它利用序列信息,在每帧在线的基础上选择最具一致性的技术。对于集合中的每种技术,MuSIC通过分析其最佳匹配候选的帧间连续性来计算其序列一致性,然后直接比较这些一致性,为当前查询图像选择最优技术。利用序列信息在VPR方法之间进行选择,使不同基准数据集上的整体VPR性能得到提升,同时避免了对运行环境额外真值信息的需求。

摘要: Visual place recognition (VPR) is an essential component of robot navigation and localization systems that allows them to identify a place using only image data. VPR is challenging due to the significant changes in a place’s appearance driven by different daily illumination, seasonal weather variations and diverse viewpoints. Currently, no single VPR technique excels in every environmental condition, each exhibiting unique benefits and shortcomings, and therefore combining multiple techniques can achieve more reliable VPR performance. Present multi-method approaches either rely on online ground-truth information, which is often not available, or on brute-force technique combination, potentially lowering performance with high variance technique sets. Addressing these shortcomings, we propose a VPR system dubbed Multi-Sequential Information Consistency (MuSIC) which leverages sequential information to select the most cohesive technique on an online per-frame basis. For each technique in a set, MuSIC computes their respective sequential consistencies by analysing the frame-to-frame continuity of their top match candidates, which are then directly compared to select the optimal technique for the current query image. The use of sequential information to select between VPR methods results in an overall VPR performance increase across different benchmark datasets, while avoiding the need for extra ground-truth of the runtime environment.

[Downlink:]http://arxiv.org/abs/2401.08263v1
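
为说明摘要中"按各技术最佳匹配候选的帧间连续性在线选择VPR技术"的思路,下面给出一个极简示意;一致性的具体计算方式、窗口大小以及示例中的技术名称(NetVLAD、SAD)均为说明性假设,并非MuSIC论文的原始定义。

```python
# 示意:按匹配索引的帧间连续性为当前帧选择 VPR 技术(非 MuSIC 官方实现)。
import numpy as np

def sequential_consistency(top_matches, window=4):
    """top_matches: 某技术最近若干帧的最佳匹配索引;相邻匹配越连续(差值越接近 1)得分越高。"""
    recent = np.asarray(top_matches[-window:])
    diffs = np.abs(np.diff(recent) - 1)
    return -float(diffs.mean())                    # 差值越小,一致性越高

def select_technique(histories):
    """histories: {技术名: 匹配索引序列};返回当前帧应采用的技术。"""
    return max(histories, key=lambda k: sequential_consistency(histories[k]))

histories = {
    "NetVLAD": [101, 102, 103, 104],               # 匹配索引平滑递增 -> 一致性高
    "SAD":     [40, 87, 12, 95],                   # 匹配跳变 -> 一致性低
}
print(select_technique(histories))                 # NetVLAD
```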


标题: Haptic search with the Smart Suction Cup on adversarial objects

作者: Jungpyo Lee, Sebastian D. Lee, Tae Myung Huh

中文摘要: 吸盘是工业机器人应用中一类重要的抓取器,先前的文献集中于使用基于视觉的规划器来提高这些任务中的抓取成功率。如果不重新训练所学算法,基于视觉的规划器可能会因对抗性物体而失败,或在未见过的场景中失去泛化能力。当视觉抓取规划器失败时,我们提出用触觉探索来改善吸盘抓取。我们展示了Smart Suction Cup(智能吸盘),一种利用内部流量测量进行触觉感知的末端执行器。我们表明,在这些流量测量的引导下,基于模型的触觉搜索方法在拾取料箱任务中,与仅使用视觉规划器相比,可将抓取成功率最高提升2.5倍。在几何边缘和曲线上对智能吸盘进行表征时,我们发现即使姿态误差较大,流量也能准确预测理想的运动方向。智能吸盘本身不包含任何电子元件,因此该设计易于制造,且触觉探索不会损坏传感器。这项工作推动了在特别具有对抗性的场景中使用具备自主触觉搜索能力的吸盘。

摘要: Suction cups are an important gripper type in industrial robot applications, and prior literature focuses on using vision-based planners to improve grasping success in these tasks. Vision-based planners can fail due to adversarial objects or lose generalizability for unseen scenarios, without retraining learned algorithms. We propose haptic exploration to improve suction cup grasping when visual grasp planners fail. We present the Smart Suction Cup, an end-effector that utilizes internal flow measurements for tactile sensing. We show that model-based haptic search methods, guided by these flow measurements, improve grasping success by up to 2.5x as compared with using only a vision planner during a bin-picking task. In characterizing the Smart Suction Cup on both geometric edges and curves, we find that flow rate can accurately predict the ideal motion direction even with large postural errors. The Smart Suction Cup includes no electronics on the cup itself, such that the design is easy to fabricate and haptic exploration does not damage the sensor. This work motivates the use of suction cups with autonomous haptic search capabilities in especially adversarial scenarios.

[Downlink:]http://arxiv.org/abs/2309.07360v2


标题: Zero-Shot Co-salient Object Detection Framework

作者: Haoke Xiao, Lv Tang, Bo Li

中文摘要: 共显著目标检测(CoSOD)致力于复制人类视觉系统识别一组图像中共同且显著目标的能力。尽管深度学习模型最近取得了进展,但这些模型仍然依赖于在标注良好的CoSOD数据集上训练,对免训练的零样本CoSOD框架的探索仍然有限。在本文中,受基础计算机视觉模型零样本迁移能力的启发,我们介绍了第一个零样本CoSOD框架,该框架无需任何训练过程即可利用这些模型。为此,我们在所提框架中引入了两个新组件:组提示生成(GPG)模块和共显著图生成(CMP)模块。我们在广泛使用的数据集上评估了该框架的性能,并观察到令人印象深刻的结果。我们的方法超越了现有的无监督方法,甚至优于2020年之前开发的全监督方法,同时与2022年之前开发的一些全监督方法保持竞争力。

摘要: Co-salient Object Detection (CoSOD) endeavors to replicate the human visual system’s capacity to recognize common and salient objects within a collection of images. Despite recent advancements in deep learning models, these models still rely on training with well-annotated CoSOD datasets. The exploration of training-free zero-shot CoSOD frameworks has been limited. In this paper, taking inspiration from the zero-shot transfer capabilities of foundational computer vision models, we introduce the first zero-shot CoSOD framework that harnesses these models without any training process. To achieve this, we introduce two novel components in our proposed framework: the group prompt generation (GPG) module and the co-saliency map generation (CMP) module. We evaluate the framework’s performance on widely-used datasets and observe impressive results. Our approach surpasses existing unsupervised methods and even outperforms fully supervised methods developed before 2020, while remaining competitive with some fully supervised methods developed before 2022.

[Downlink:]http://arxiv.org/abs/2309.05499v3


标题: On State Estimation in Multi-Sensor Fusion Navigation: Optimization and Filtering

作者: Feng Zhu, Zhuo Xu, Xveqing Zhang

中文摘要: 导航、感知和决策是智能机器人的基本任务,其本质都是估计必要的系统状态。其中,导航是其他上层应用的基础,它通过融合来自多个传感器的测量值来提供精确的位置和姿态。在对每个传感器的观测进行适当建模之后,导航中的多传感器融合任务就归结为状态估计问题,该问题可以通过优化和滤波两类方法求解。最近的研究表明,基于优化的框架在精度上优于基于滤波的框架。然而,这两种方法都基于最大似然估计(MLE),在相同的线性化点、观测模型、测量值和高斯噪声假设下,理论上应当是等价的。在本文中,我们深入探讨了基于优化和基于滤波的方法所使用的理论与现有策略。结果表明,这两种方法在理论上是等价的,但由于实时运行中采用的策略不同,这种等价性会遭到破坏。通过调整基于滤波方法的现有策略,蒙特卡罗仿真和基于视觉里程计(VO)的车载消融实验表明,经过策略调整的滤波严格等价于优化。因此,未来关于传感器融合问题的研究应聚焦于各自的算法和策略,而不是状态估计方法本身。

摘要: The essential of navigation, perception, and decision-making which are basic tasks for intelligent robots, is to estimate necessary system states. Among them, navigation is fundamental for other upper applications, providing precise position and orientation, by integrating measurements from multiple sensors. With observations of each sensor appropriately modelled, multi-sensor fusion tasks for navigation are reduced to the state estimation problem which can be solved by two approaches: optimization and filtering. Recent research has shown that optimization-based frameworks outperform filtering-based ones in terms of accuracy. However, both methods are based on maximum likelihood estimation (MLE) and should be theoretically equivalent with the same linearization points, observation model, measurements, and Gaussian noise assumption. In this paper, we deeply dig into the theories and existing strategies utilized in both optimization-based and filtering-based approaches. It is demonstrated that the two methods are equal theoretically, but this equivalence corrupts due to different strategies applied in real-time operation. By adjusting existing strategies of the filtering-based approaches, the Monte-Carlo simulation and vehicular ablation experiments based on visual odometry (VO) indicate that the strategy adjusted filtering strictly equals to optimization. Therefore, future research on sensor-fusion problems should concentrate on their own algorithms and strategies rather than state estimation approaches.

[Downlink:]http://arxiv.org/abs/2401.05836v1
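
为便于理解摘要中"两类方法在最大似然意义下理论等价"的说法,下面按通用教材写法补充一个简短的公式块(非论文原文公式,符号为本说明自行引入):在高斯噪声假设下,最大似然估计等价于加权非线性最小二乘;基于优化的方法直接迭代求解该目标,而滤波方法在相同线性化点、观测模型与噪声假设下递推地求解同一目标。

```latex
% 高斯噪声下的最大似然估计(通用写法):
% z_i 为第 i 个传感器观测,h_i 为观测模型,\Sigma_i 为噪声协方差
\hat{x} \;=\; \arg\max_{x}\ \prod_i p(z_i \mid x)
        \;=\; \arg\min_{x}\ \sum_i \big\| z_i - h_i(x) \big\|^{2}_{\Sigma_i^{-1}}
```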

