[晓理紫]每日论文分享(有中文摘要,源码或项目地址)--强化学习、模仿学习、机器人

专属领域论文订阅

关注{晓理紫|小李子},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持

如果你觉得本文对你有所帮助,请关注我,每日准时为你推送最新论文。

为了答谢各位网友的支持,从今日起免费为300名读者提供订阅主题论文服务,只需VX关注公号并回复{邮箱+论文主题}(如:123456@xx.com + chatgpt@large language model @LLM),主题必须是同一个领域,最多三个关键词。解释权归博主所有


分类:

  • 大语言模型 LLM
  • 视觉语言模型 VLM
  • 扩散模型
  • 视觉语言导航 VLN
  • 强化学习 RL
  • 模仿学习 IL
  • 机器人
  • 开放词汇检测与分割

== RLHF ==

标题: Interpretable Concept Bottlenecks to Align Reinforcement Learning Agents

作者: Quentin Delfosse, Sebastian Sztwiertnia, Mark Rothermel

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2401.05821v2

GitHub: https://github.com/k4ntz/SCoBots

中文摘要: 目标错位、奖励稀疏和困难的信用分配,只是使深度强化学习(RL)智能体难以学习最优策略的众多问题中的几个。不幸的是,深度神经网络的黑箱性质阻碍了领域专家参与检查模型和修正次优策略。为此,我们引入了连续概念瓶颈智能体(SCoBots),它集成了连续的概念瓶颈(CB)层。与现有CB模型不同,SCoBots不仅将概念表示为单个对象的属性,还表示为对象之间的关系,这对许多RL任务至关重要。实验结果既证明了SCoBots具有可竞争的性能,也表明领域专家能够理解并约束其行为。此外,SCoBots使我们得以发现经典电子游戏Pong中一个此前未知的目标错位问题并加以解决。总的来说,SCoBots能够产生与人类目标更加一致的RL智能体。代码见 https://github.com/k4ntz/SCoBots 。

摘要: Goal misalignment, reward sparsity and difficult credit assignment are only a few of the many issues that make it difficult for deep reinforcement learning (RL) agents to learn optimal policies. Unfortunately, the black-box nature of deep neural networks impedes the inclusion of domain experts for inspecting the model and revising suboptimal policies. To this end, we introduce Successive Concept Bottleneck Agents (SCoBots), that integrate consecutive concept bottleneck (CB) layers. In contrast to current CB models, SCoBots do not just represent concepts as properties of individual objects, but also as relations between objects which is crucial for many RL tasks. Our experimental results provide evidence of SCoBots’ competitive performances, but also of their potential for domain experts to understand and regularize their behavior. Among other things, SCoBots enabled us to identify a previously unknown misalignment problem in the iconic video game, Pong, and resolve it. Overall, SCoBots thus result in more human-aligned RL agents. Our code is available at https://github.com/k4ntz/SCoBots .
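
下面给出一个"属性 + 关系"概念瓶颈策略的极简 PyTorch 草图,仅用于说明"把策略建立在可解释概念之上"这一核心结构;其中的类名、以成对距离作为关系概念的做法和各维度设置均为本文作者的示意性假设,并非 SCoBots 的官方实现(官方代码见上方 GitHub 链接)。

```python
# 概念瓶颈式策略的极简示意(非 SCoBots 官方实现,仅说明"属性 + 关系"概念层的思路)
import torch
import torch.nn as nn

class ConceptBottleneckPolicy(nn.Module):
    def __init__(self, num_objects: int, prop_dim: int, num_actions: int):
        super().__init__()
        # 关系概念:这里仅用对象两两之间的欧氏距离作为示例
        num_pairs = num_objects * (num_objects - 1) // 2
        concept_dim = num_objects * prop_dim + num_pairs
        # 策略头直接作用在可解释概念上,便于领域专家检查与干预
        self.policy_head = nn.Linear(concept_dim, num_actions)

    def forward(self, obj_props: torch.Tensor) -> torch.Tensor:
        # obj_props: [batch, num_objects, prop_dim],假设前两维属性是 (x, y) 位置
        b, n, _ = obj_props.shape
        pos = obj_props[..., :2]
        dists = torch.cdist(pos, pos)                      # [b, n, n]
        iu = torch.triu_indices(n, n, offset=1)
        relation_concepts = dists[:, iu[0], iu[1]]         # 成对距离作为关系概念
        concepts = torch.cat([obj_props.flatten(1), relation_concepts], dim=1)
        return self.policy_head(concepts)                  # 动作 logits

# 用法示例:4 个对象,每个对象 4 维属性(x, y, vx, vy),6 个离散动作
policy = ConceptBottleneckPolicy(num_objects=4, prop_dim=4, num_actions=6)
logits = policy(torch.randn(2, 4, 4))
```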


标题: SERL: A Software Suite for Sample-Efficient Robotic Reinforcement Learning

作者: Jianlan Luo, Zheyuan Hu, Charles Xu

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2401.16013v2

Project: https://serl-robot.github.io/

中文摘要: 近年来,机器人强化学习(RL)领域取得了重大进展,相关方法已能处理复杂的图像观测、在真实世界中训练,并整合演示和先前经验等辅助数据。然而,尽管有这些进步,机器人RL仍然很难使用。从业者普遍认为,这些算法的具体实现细节对性能的影响往往与算法选择同样重要,甚至更为重要。我们认为,阻碍机器人RL被广泛采用以及进一步发展的一个重大挑战,正是这类方法相对难以上手。为应对这一挑战,我们开发了一个精心实现的库,其中包含一个样本高效的离策略(off-policy)深度RL方法、用于计算奖励和重置环境的方法、一个针对广泛使用的机器人的高质量控制器,以及若干具有挑战性的示例任务。我们将该库作为资源提供给社区,介绍其设计选择并给出实验结果。也许令人惊讶的是,我们发现该实现可以达到非常高效的学习,平均每个策略只需25到50分钟的训练即可学会PCB板组装、电缆布线和物体重新摆放等任务,优于文献中针对类似任务报告的最新结果。这些策略达到完美或接近完美的成功率,在扰动下也极为鲁棒,并表现出涌现的恢复与纠正行为。我们希望这些有希望的结果和高质量的开源实现能为机器人社区提供工具,促进机器人RL的进一步发展。我们的代码、文档和视频见 https://serl-robot.github.io/ 。

摘要: In recent years, significant progress has been made in the field of robotic reinforcement learning (RL), enabling methods that handle complex image observations, train in the real world, and incorporate auxiliary data, such as demonstrations and prior experience. However, despite these advances, robotic RL remains hard to use. It is acknowledged among practitioners that the particular implementation details of these algorithms are often just as important (if not more so) for performance as the choice of algorithm. We posit that a significant challenge to widespread adoption of robotic RL, as well as further development of robotic RL methods, is the comparative inaccessibility of such methods. To address this challenge, we developed a carefully implemented library containing a sample efficient off-policy deep RL method, together with methods for computing rewards and resetting the environment, a high-quality controller for a widely-adopted robot, and a number of challenging example tasks. We provide this library as a resource for the community, describe its design choices, and present experimental results. Perhaps surprisingly, we find that our implementation can achieve very efficient learning, acquiring policies for PCB board assembly, cable routing, and object relocation between 25 to 50 minutes of training per policy on average, improving over state-of-the-art results reported for similar tasks in the literature. These policies achieve perfect or near-perfect success rates, extreme robustness even under perturbations, and exhibit emergent recovery and correction behaviors. We hope that these promising results and our high-quality open-source implementation will provide a tool for the robotics community to facilitate further developments in robotic RL. Our code, documentation, and videos can be found at https://serl-robot.github.io/
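
为便于理解"样本高效离策略 RL + 奖励计算 + 环境重置"这类真实机器人训练流程的组织方式,下面给出一个概念性的 Python 骨架;其中的环境接口(reset / step / sample_action)、compute_reward 以及 agent 的 act / update 均为本文作者假设的占位,并非 SERL 库的真实 API。

```python
# 真实机器人离策略 RL 训练流程的概念性骨架(非 SERL 官方 API;环境与智能体接口均为假设)
import random
import collections
import numpy as np

Transition = collections.namedtuple("Transition", "obs action reward next_obs done")

class ReplayBuffer:
    """固定容量的经验回放池,离策略方法的核心组件之一。"""
    def __init__(self, capacity=100_000):
        self.data = collections.deque(maxlen=capacity)

    def add(self, tr):
        self.data.append(tr)

    def sample(self, batch_size):
        return random.sample(self.data, batch_size)

def compute_reward(next_obs):
    # SERL 中奖励可由学习到的成功分类器给出;这里用"状态范数足够小"作稀疏奖励占位
    return float(np.linalg.norm(next_obs) < 0.1)

def run_training(env, agent, total_steps=10_000, start_steps=1_000, batch_size=256):
    """每个环境步采集一条转移并做一次梯度更新(样本高效离策略训练的常见做法)。"""
    buffer = ReplayBuffer()
    obs = env.reset()                       # 真实机器人上 reset 通常由脚本化例程完成
    for step in range(total_steps):
        if step < start_steps:
            action = env.sample_action()    # 初期随机探索,填充回放池
        else:
            action = agent.act(obs)
        next_obs, done = env.step(action)
        buffer.add(Transition(obs, action, compute_reward(next_obs), next_obs, done))
        obs = env.reset() if done else next_obs
        if len(buffer.data) >= batch_size:
            agent.update(buffer.sample(batch_size))
    return agent
```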


标题: DittoGym: Learning to Control Soft Shape-Shifting Robots

作者: Suning Huang, Boyuan Chen, Huazhe Xu

PubTime: 2024-01-29

Downlink: http://arxiv.org/abs/2401.13231v2

Project: https://dittogym.github.io

中文摘要: 机器人协同设计是一个新兴的研究领域,其中机器人的形态与学习到的策略被联合优化以解决特定任务。它对软体机器人尤其有前景,因为软体机器人适合能够实现所学形态和致动器的新型制造技术。受自然界和近期新颖机器人设计的启发,我们提出更进一步,探索新型可重构机器人,即可以在其生命周期内改变形态的机器人。我们将可重构软体机器人的控制形式化为一个高维强化学习(RL)问题。我们在同一个动作空间中统一了形态变化、运动和环境交互,并引入了合适的、由粗到细的课程,使我们能够发现对最终机器人实现细粒度控制的策略。我们还提出了DittoGym,这是一个面向可重构软体机器人的综合RL基准,其任务需要细粒度的形态变化才能完成。最后,我们在DittoGym上评估了所提出的由粗到细算法,并展示了机器人能在一个任务序列中多次学习改变自身形态,这一能力是由我们的RL算法独特实现的。更多结果见 https://dittogym.github.io 。

摘要: Robot co-design, where the morphology of a robot is optimized jointly with a learned policy to solve a specific task, is an emerging area of research. It holds particular promise for soft robots, which are amenable to novel manufacturing techniques that can realize learned morphologies and actuators. Inspired by nature and recent novel robot designs, we propose to go a step further and explore the novel reconfigurable robots, defined as robots that can change their morphology within their lifetime. We formalize control of reconfigurable soft robots as a high-dimensional reinforcement learning (RL) problem. We unify morphology change, locomotion, and environment interaction in the same action space, and introduce an appropriate, coarse-to-fine curriculum that enables us to discover policies that accomplish fine-grained control of the resulting robots. We also introduce DittoGym, a comprehensive RL benchmark for reconfigurable soft robots that require fine-grained morphology changes to accomplish the tasks. Finally, we evaluate our proposed coarse-to-fine algorithm on DittoGym and demonstrate robots that learn to change their morphology several times within a sequence, uniquely enabled by our RL algorithm. More results are available at https://dittogym.github.io.
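
下面用一个简单的动作分辨率适配器示意"由粗到细"课程的思路:早期让策略在低分辨率的形变/驱动网格上学习,之后逐步提高到细粒度网格。该示例为本文作者自拟的草图,类名、分辨率阶段等均为假设,与 DittoGym 的官方实现无关。

```python
# "从粗到细"课程的极简示意(非 DittoGym 官方实现):
# 早期让策略输出低分辨率的驱动/形变网格,再逐步提高分辨率以实现细粒度控制。
import numpy as np

class CoarseToFineActionAdapter:
    def __init__(self, fine_shape=(32, 32), stages=(4, 8, 16, 32)):
        self.fine_shape = fine_shape
        self.stages = stages
        self.stage = 0

    @property
    def action_shape(self):
        r = self.stages[self.stage]
        return (r, r)                      # 当前课程阶段策略需要输出的动作网格大小

    def to_fine(self, coarse_action):
        # 最近邻上采样,将粗粒度动作铺到细粒度执行网格上
        coarse = np.asarray(coarse_action, dtype=np.float32)
        reps = (self.fine_shape[0] // coarse.shape[0], self.fine_shape[1] // coarse.shape[1])
        return np.kron(coarse, np.ones(reps, dtype=np.float32))

    def advance(self):
        # 当前阶段收敛后调用,进入更细的动作分辨率
        self.stage = min(self.stage + 1, len(self.stages) - 1)

# 用法:策略先在 4x4 网格上学习,逐步过渡到 32x32 的细粒度形变/驱动控制
adapter = CoarseToFineActionAdapter()
fine_cmd = adapter.to_fine(np.random.uniform(-1, 1, adapter.action_shape))
```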


标题: Emergent Dominance Hierarchies in Reinforcement Learning Agents

作者: Ram Rachum, Yonatan Nakar, Bill Tomlinson

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2401.12258v2

中文摘要: 现代强化学习(RL)算法能够在各种各样的任务中超越人类。多智能体强化学习(MARL)场景带来了额外的挑战,混合动机的智能体群体中的成功合作取决于个体目标与群体目标之间的微妙平衡。通常受人类制度启发的社会惯例和规范,被用作实现这种平衡的工具。在本文中,我们考察了一个基础且被充分研究的社会惯例,它是动物和人类社会合作的基础:优势等级(dominance hierarchy)。我们将优势等级的动物行为学理论应用于人工智能体,在尽可能少修改的前提下沿用已有的术语和定义。我们证明了在没有显式编程或内在奖励的情况下,RL智能体群体可以发明、学习、执行优势等级,并将其传递给新的群体。涌现出的优势等级与在鸡、小鼠、鱼类和其他物种中观察到的结构相似。

摘要: Modern Reinforcement Learning (RL) algorithms are able to outperform humans in a wide variety of tasks. Multi-agent reinforcement learning (MARL) settings present additional challenges, and successful cooperation in mixed-motive groups of agents depends on a delicate balancing act between individual and group objectives. Social conventions and norms, often inspired by human institutions, are used as tools for striking this balance. In this paper, we examine a fundamental, well-studied social convention that underlies cooperation in both animal and human societies: dominance hierarchies. We adapt the ethological theory of dominance hierarchies to artificial agents, borrowing the established terminology and definitions with as few amendments as possible. We demonstrate that populations of RL agents, operating without explicit programming or intrinsic rewards, can invent, learn, enforce, and transmit a dominance hierarchy to new populations. The dominance hierarchies that emerge have a similar structure to those studied in chickens, mice, fish, and other species.
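
若想量化多智能体交互中涌现的等级结构,动物行为学中常用 Elo 式评分等成对支配度量;下面给出一个这类度量的极简示例,其中的智能体数量、潜在等级与参数均为本文作者的假设,纯属示意,并非论文使用的具体分析代码。

```python
# 用 Elo 式评分量化成对"支配"交互的小示例(动物行为学常用做法,仅作示意)
import numpy as np

def update_elo(ratings, winner, loser, k=16.0):
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
    ratings[winner] += k * (1.0 - expected_win)
    ratings[loser]  -= k * (1.0 - expected_win)

rng = np.random.default_rng(0)
ratings = {agent: 1000.0 for agent in range(4)}
true_rank = [3, 2, 1, 0]                      # 假设的潜在支配顺序(越靠前越占优)
for _ in range(500):
    a, b = rng.choice(4, size=2, replace=False)
    winner, loser = (a, b) if true_rank.index(a) < true_rank.index(b) else (b, a)
    update_elo(ratings, winner, loser)

print(sorted(ratings, key=ratings.get, reverse=True))   # 评分排序应恢复出潜在的等级结构
```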


标题: Toward a Reinforcement-Learning-Based System for Adjusting Medication to Minimize Speech Disfluency

作者: Pavlos Constas, Vikram Rawal, Matthew Honorio Oliveira

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2312.11509v3

中文摘要: 我们提出了一种基于强化学习(RL)的系统,它可以为一个假想的患者自动开具有助于缓解其与心理健康相关的言语不流利问题的药物,并根据对患者流利程度的零成本、高频测量来调整药物种类和剂量。我们展示了该系统的各个组成部分:一个在我们构建的大型数据集上检测并评估言语不流利的模块,以及一个自动寻找良好药物组合的RL算法。为支持这两个模块,我们从文献中收集了精神类药物对言语不流利影响的数据,并构建了一个合理可信的患者模拟系统。我们证明,在某些情况下,该RL系统能够收敛到一个良好的用药方案。我们还收集并标注了一个可能存在言语不流利人群的数据集,并用它演示了我们的方法。我们的工作是一个概念验证:我们表明,利用自动数据收集来解决言语不流利问题的想法是有前景的。

摘要: We propose a reinforcement learning (RL)-based system that would automatically prescribe a hypothetical patient medication that may help the patient with their mental health-related speech disfluency, and adjust the medication and the dosages in response to zero-cost frequent measurement of the fluency of the patient. We demonstrate the components of the system: a module that detects and evaluates speech disfluency on a large dataset we built, and an RL algorithm that automatically finds good combinations of medications. To support the two modules, we collect data on the effect of psychiatric medications for speech disfluency from the literature, and build a plausible patient simulation system. We demonstrate that the RL system is, under some circumstances, able to converge to a good medication regime. We collect and label a dataset of people with possible speech disfluency and demonstrate our methods using that dataset. Our work is a proof of concept: we show that there is promise in the idea of using automatic data collection to address speech disfluency.
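
下面是一个概念验证式的小示例,用 ε-贪心策略在一个假想的患者模拟器上搜索使"不流利度"最低的剂量;患者模型、候选剂量和各项参数均为本文作者的假设,仅用于说明"模拟器 + RL 调药"的基本思路,并非论文的实现。

```python
# 极简示意:用 ε-贪心在模拟患者上搜索降低言语不流利度的剂量(非论文实现)
import numpy as np

rng = np.random.default_rng(1)
doses = [0.0, 0.5, 1.0, 1.5, 2.0]             # 假设的候选剂量(任意单位)

def simulated_disfluency(dose):
    # 假想的患者模型:在某个未知最优剂量附近不流利度最低,并带观测噪声
    optimal = 1.0
    return 0.3 + 0.4 * (dose - optimal) ** 2 + rng.normal(0, 0.05)

estimates, counts = np.zeros(len(doses)), np.zeros(len(doses))
for t in range(300):
    a = rng.integers(len(doses)) if rng.random() < 0.1 else int(np.argmin(estimates))
    reward = -simulated_disfluency(doses[a])   # 奖励 = 负的不流利度(测量零成本、可高频获取)
    counts[a] += 1
    estimates[a] += ((-reward) - estimates[a]) / counts[a]   # 增量均值,估计各剂量的不流利度

print("推荐剂量:", doses[int(np.argmin(estimates))])
```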


标题: Towards a Pretrained Model for Restless Bandits via Multi-arm Generalization

作者: Yunfan Zhao, Nikhil Behari, Edward Hughes

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2310.14526v3

摘要: Restless multi-arm bandits (RMABs), a class of resource allocation problems with broad application in areas such as healthcare, online advertising, and anti-poaching, have recently been studied from a multi-agent reinforcement learning perspective. Prior RMAB research suffers from several limitations, e.g., it fails to adequately address continuous states, and requires retraining from scratch when arms opt-in and opt-out over time, a common challenge in many real world applications. We address these limitations by developing a neural network-based pre-trained model (PreFeRMAB) that has general zero-shot ability on a wide range of previously unseen RMABs, and which can be fine-tuned on specific instances in a more sample-efficient way than retraining from scratch. Our model also accommodates general multi-action settings and discrete or continuous state spaces. To enable fast generalization, we learn a novel single policy network model that utilizes feature information and employs a training procedure in which arms opt-in and out over time. We derive a new update rule for a crucial $\lambda$-network with theoretical convergence guarantees and empirically demonstrate the advantages of our approach on several challenging, real-world inspired problems.
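
为帮助理解 RMAB 的问题设定(各臂状态按马尔可夫链演化,每步只能在预算内干预 k 个臂),下面给出一个极简的模拟与短视基线策略;臂数、转移概率等数值均为随意假设,与 PreFeRMAB 的预训练模型本身无关。

```python
# 不安分多臂赌博机(RMAB)的极简模拟(仅示意问题设定,非 PreFeRMAB 模型):
# 每个臂是一个两状态马尔可夫链,是否"干预"(拉动)对应不同的转移概率,每步只能干预 k 个臂。
import numpy as np

rng = np.random.default_rng(0)
N, k, horizon = 10, 3, 200
# P[arm, action, state] = 下一步处于"好状态"(1)的概率;action: 0=不干预, 1=干预
P = rng.uniform(0.2, 0.9, size=(N, 2, 2))
P[:, 1, :] = np.clip(P[:, 0, :] + 0.2, 0, 1)      # 假设干预总能提高进入好状态的概率

states = rng.integers(0, 2, size=N)
total_reward = 0.0
for t in range(horizon):
    # 一个简单的短视基线:优先干预"干预带来概率提升"最大的臂
    gain = P[np.arange(N), 1, states] - P[np.arange(N), 0, states]
    act_on = np.argsort(-gain)[:k]
    actions = np.zeros(N, dtype=int)
    actions[act_on] = 1
    next_good = rng.random(N) < P[np.arange(N), actions, states]
    states = next_good.astype(int)
    total_reward += states.sum()                   # 奖励 = 处于好状态的臂数

print("平均每步处于好状态的臂数:", total_reward / horizon)
```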


== Imitation Learning ==

== Embodied Artificial Intelligence@robotic agent@human robot interaction ==

标题: Generative Expressive Robot Behaviors using Large Language Models

作者: Karthik Mahadevan, Jonathan Chien, Noah Brown

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.14673v2

Project: https://generative-expressive-motion.github.io/

中文摘要: 人们使用表达性行为来有效地与他人沟通并协调彼此的行动,例如用点头来回应他人投来的目光,或在拥挤的走廊里说"借过"以便从人群中通过。我们希望机器人在人机交互中也能表现出这类表达性行为。已有工作提出的基于规则的方法难以扩展到新的沟通模态或社交场景,而数据驱动的方法则需要为机器人所处的每种社交场景准备专门的数据集。我们提出利用大型语言模型(LLM)所蕴含的丰富社会语境,以及其根据指令或用户偏好生成动作的能力,来生成具有适应性、可组合、可以相互叠加的表达性机器人动作。我们的方法利用少样本思维链提示,将人类语言指令翻译成调用机器人已有技能和已学技能的参数化控制代码。通过用户研究和仿真实验,我们证明了该方法生成的行为在用户看来是胜任且易于理解的。补充材料见 https://generative-expressive-motion.github.io/ 。

摘要: People employ expressive behaviors to effectively communicate and coordinate their actions with others, such as nodding to acknowledge a person glancing at them or saying “excuse me” to pass people in a busy corridor. We would like robots to also demonstrate expressive behaviors in human-robot interaction. Prior work proposes rule-based methods that struggle to scale to new communication modalities or social situations, while data-driven methods require specialized datasets for each social situation the robot is used in. We propose to leverage the rich social context available from large language models (LLMs) and their ability to generate motion based on instructions or user preferences, to generate expressive robot motion that is adaptable and composable, building upon each other. Our approach utilizes few-shot chain-of-thought prompting to translate human language instructions into parametrized control code using the robot’s available and learned skills. Through user studies and simulation experiments, we demonstrate that our approach produces behaviors that users found to be competent and easy to understand. Supplementary material can be found at https://generative-expressive-motion.github.io/.
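
下面示意如何把"机器人技能说明 + 少样本思维链示例 + 新指令"拼成提示并交给 LLM 生成参数化控制代码;其中的技能名、示例内容和 call_llm 接口均为本文作者假设的占位,并非论文的官方提示模板。

```python
# 少样本思维链提示构造的示意(非论文官方代码;call_llm 为假设的占位函数,可替换为任意 LLM 接口)
FEW_SHOT_EXAMPLES = """\
指令: 有人路过并看了机器人一眼,请表示知晓。
思考: 对方只是瞥了一眼,轻量的非语言回应即可,点头幅度要小。
代码: nod(angle_deg=15, speed=0.5)

指令: 在拥挤的走廊里需要从人群中通过。
思考: 需要先获得注意,再缓慢前进,保持礼貌。
代码: say("借过一下"); move_forward(distance_m=0.5, speed=0.2)
"""

SKILL_DOC = "可用技能: nod(angle_deg, speed), say(text), move_forward(distance_m, speed), gaze_at(target)"

def build_prompt(instruction: str) -> str:
    """把机器人技能说明、少样本示例与新指令拼成提示,让 LLM 输出"思考 + 参数化控制代码"。"""
    return (
        "你是一个为机器人生成表达性行为的助手。\n"
        f"{SKILL_DOC}\n\n{FEW_SHOT_EXAMPLES}\n"
        f"指令: {instruction}\n思考:"
    )

def generate_behavior(instruction: str, call_llm) -> str:
    # call_llm(prompt) -> str:由使用者注入的任意大模型调用;返回文本中的"代码:"行即为技能调用序列
    return call_llm(build_prompt(instruction))
```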


== Object Detection@Segmentation@Open vocabulary detection@SAM ==

标题: MAMBA: Multi-level Aggregation via Memory Bank for Video Object Detection

作者: Guanxiong Sun, Yang Hua, Guosheng Hu

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2401.09923v2

GitHub: https://github.com/guanxiongsun/vfe.pytorch

中文摘要: 最新的视频目标检测方法会维护一个记忆结构(滑动窗口或记忆队列),利用注意力机制来增强当前帧。然而,我们认为这些记忆结构效率不高或不够充分,原因在于两个隐含的操作:(1)将记忆中的所有特征拼接起来进行增强,导致沉重的计算开销;(2)逐帧的记忆更新,使记忆无法捕获更多的时间信息。在本文中,我们提出了一种基于记忆库的多层级聚合架构,称为MAMBA。具体来说,我们的记忆库采用两种新操作来消除现有方法的缺点:(1)轻量的键集构建,可以显著降低计算开销;(2)细粒度的特征级更新策略,使我们的方法能够利用来自整个视频的知识。为了更好地增强互补层级的特征(即特征图和候选框),我们进一步提出了广义增强操作(GEO),以统一的方式聚合多层级特征。我们在具有挑战性的ImageNet VID数据集上进行了广泛评估。与现有最先进的方法相比,我们的方法在速度和精度上都取得了更优的性能。更值得注意的是,MAMBA使用ResNet-101在12.6/9.1 FPS下达到了83.7%/84.6%的mAP。代码见 https://github.com/guanxiongsun/vfe.pytorch 。

摘要: State-of-the-art video object detection methods maintain a memory structure, either a sliding window or a memory queue, to enhance the current frame using attention mechanisms. However, we argue that these memory structures are not efficient or sufficient because of two implied operations: (1) concatenating all features in memory for enhancement, leading to a heavy computational cost; (2) frame-wise memory updating, preventing the memory from capturing more temporal information. In this paper, we propose a multi-level aggregation architecture via memory bank called MAMBA. Specifically, our memory bank employs two novel operations to eliminate the disadvantages of existing methods: (1) light-weight key-set construction which can significantly reduce the computational cost; (2) fine-grained feature-wise updating strategy which enables our method to utilize knowledge from the whole video. To better enhance features from complementary levels, i.e., feature maps and proposals, we further propose a generalized enhancement operation (GEO) to aggregate multi-level features in a unified manner. We conduct extensive evaluations on the challenging ImageNetVID dataset. Compared with existing state-of-the-art methods, our method achieves superior performance in terms of both speed and accuracy. More remarkably, MAMBA achieves mAP of 83.7/84.6% at 12.6/9.1 FPS with ResNet-101. Code is available at https://github.com/guanxiongsun/vfe.pytorch.
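
下面用 PyTorch 写了一个记忆库的极简草图,示意摘要中提到的两个操作:随机子采样式的轻量键集构建,以及对被采样记忆条目做动量式的特征级更新;类名与具体更新规则均为本文作者的示意性假设,并非 MAMBA 的官方实现。

```python
# 记忆库两个核心操作的极简示意(非 MAMBA 官方实现):
# (1) 轻量键集构建:从记忆中随机子采样一小部分特征参与注意力增强,降低计算量;
# (2) 特征级更新:用动量方式细粒度地刷新被采样到的记忆特征,而非整帧替换。
import torch

class FeatureMemoryBank:
    def __init__(self, capacity=2048, dim=256, momentum=0.9):
        self.bank = torch.zeros(capacity, dim)
        self.size = 0
        self.momentum = momentum

    def write(self, feats):                       # feats: [n, dim],新一帧的候选特征
        n = min(feats.shape[0], self.bank.shape[0] - self.size)
        if n > 0:
            self.bank[self.size:self.size + n] = feats[:n]
            self.size += n

    def sample_keys(self, num_keys=256):
        # 键集构建:随机子采样代替"拼接全部记忆特征"
        idx = torch.randperm(self.size)[:min(num_keys, self.size)]
        return idx, self.bank[idx]

    def enhance(self, query):                     # query: [m, dim],当前帧特征
        idx, keys = self.sample_keys()
        attn = torch.softmax(query @ keys.t() / keys.shape[1] ** 0.5, dim=-1)
        enhanced = attn @ keys                    # 用记忆特征增强当前帧
        # 特征级更新:每个被采样的记忆条目朝"关注它的查询特征的加权平均"做动量更新
        w = attn / attn.sum(dim=0, keepdim=True).clamp_min(1e-6)
        self.bank[idx] = self.momentum * self.bank[idx] + (1 - self.momentum) * (w.t() @ query)
        return enhanced

bank = FeatureMemoryBank()
bank.write(torch.randn(300, 256))
out = bank.enhance(torch.randn(50, 256))          # [50, 256]
```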


标题: UV-SAM: Adapting Segment Anything Model for Urban Village Identification

作者: Xin Zhang, Yu Liu, Yuming Lin

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2401.08083v2

GitHub: https://github.com/tsinghua-fib-lab/UV-SAM

中文摘要: 城中村被定义为城市中心及其周边的非正规居住区,其特点是基础设施不足和居住条件差,与可持续发展目标中关于贫困、适足住房和可持续城市的议题密切相关。传统上,政府主要依靠实地调查来监测城中村,但这种方式耗时、费力,且可能滞后。得益于广泛可得且及时更新的卫星影像,近期研究开发了计算机视觉技术来高效检测城中村。然而,现有研究要么只做简单的城中村图像分类,要么无法提供准确的边界信息。为了从卫星影像中准确识别城中村边界,我们借助视觉基础模型的能力,将Segment Anything Model(SAM)适配到城中村分割任务,称为UV-SAM。具体来说,UV-SAM首先利用一个小型语义分割模型为城中村生成混合提示(包括掩码、边界框和图像表示),再将其输入SAM进行细粒度的边界识别。在中国两个数据集上的大量实验表明,UV-SAM优于现有基线;多年的识别结果显示,城中村的数量和面积都在随时间减少,这为城中村的发展趋势提供了更深入的洞察,也展示了视觉基础模型在可持续城市中的应用前景。本研究的数据集和代码见 https://github.com/tsinghua-fib-lab/UV-SAM 。

摘要: Urban villages, defined as informal residential areas in or around urban centers, are characterized by inadequate infrastructures and poor living conditions, closely related to the Sustainable Development Goals (SDGs) on poverty, adequate housing, and sustainable cities. Traditionally, governments heavily depend on field survey methods to monitor the urban villages, which however are time-consuming, labor-intensive, and possibly delayed. Thanks to widely available and timely updated satellite images, recent studies develop computer vision techniques to detect urban villages efficiently. However, existing studies either focus on simple urban village image classification or fail to provide accurate boundary information. To accurately identify urban village boundaries from satellite images, we harness the power of the vision foundation model and adapt the Segment Anything Model (SAM) to urban village segmentation, named UV-SAM. Specifically, UV-SAM first leverages a small-sized semantic segmentation model to produce mixed prompts for urban villages, including mask, bounding box, and image representations, which are then fed into SAM for fine-grained boundary identification. Extensive experimental results on two datasets in China demonstrate that UV-SAM outperforms existing baselines, and identification results over multiple years show that both the number and area of urban villages are decreasing over time, providing deeper insights into the development trends of urban villages and sheds light on the vision foundation models for sustainable cities. The dataset and codes of this study are available at https://github.com/tsinghua-fib-lab/UV-SAM.
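
下面的草图示意"粗分割 → 边界框提示 → SAM 细化"这一流程,假设已安装 segment-anything 库并准备好对应权重文件;函数名、checkpoint 路径以及"只用边界框提示"的简化均为本文作者的假设,具体实现请以官方仓库为准。

```python
# UV-SAM 思路的简化示意(非官方代码):先用小型语义分割模型得到粗掩码,
# 再把由粗掩码推出的边界框作为提示喂给 SAM 做细粒度边界分割。
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def coarse_mask_to_box(mask: np.ndarray) -> np.ndarray:
    # 从粗分割掩码中取外接矩形,作为 SAM 的边界框提示(XYXY)
    ys, xs = np.where(mask > 0)
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()])

def refine_with_sam(image_rgb: np.ndarray, coarse_mask: np.ndarray,
                    checkpoint="sam_vit_b_01ec64.pth", model_type="vit_b") -> np.ndarray:
    sam = sam_model_registry[model_type](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)                       # HxWx3 的 uint8 RGB 影像
    box = coarse_mask_to_box(coarse_mask)                # 粗分割结果 -> 边界框提示
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    return masks[0]                                      # 细化后的城中村边界掩码
```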


标题: SAMF: Small-Area-Aware Multi-focus Image Fusion for Object Detection

作者: Xilai Li, Xiaosong Li, Haishu Tan

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2401.08357v2

GitHub: https://github.com/ixilai/SAMF

中文摘要: 现有的多聚焦图像融合(MFIF)方法往往无法保留不确定的过渡区域,也难以准确检测大片散焦区域内的小聚焦区域。为了解决这一问题,本研究提出了一种新的面向小区域的MFIF算法,以增强目标检测能力。首先,我们增强小聚焦区域和边界区域内的像素属性,随后将其与视觉显著性检测相结合,得到用于判别聚焦像素分布的预融合结果。为了准确判定像素是否聚焦,我们将源图像视为聚焦、散焦和不确定三类区域的组合,并提出了三区域分割策略。最后,我们设计了一种有效的像素选择规则来生成分割决策图,得到最终的融合结果。实验表明,所提方法能够准确检测小而平滑的聚焦区域,同时提升目标检测性能,在主观和客观评价上均优于现有方法。源代码见 https://github.com/ixilai/SAMF 。

摘要: Existing multi-focus image fusion (MFIF) methods often fail to preserve the uncertain transition region and detect small focus areas within large defocused regions accurately. To address this issue, this study proposes a new small-area-aware MFIF algorithm for enhancing object detection capability. First, we enhance the pixel attributes within the small focus and boundary regions, which are subsequently combined with visual saliency detection to obtain the pre-fusion results used to discriminate the distribution of focused pixels. To accurately ensure pixel focus, we consider the source image as a combination of focused, defocused, and uncertain regions and propose a three-region segmentation strategy. Finally, we design an effective pixel selection rule to generate segmentation decision maps and obtain the final fusion results. Experiments demonstrated that the proposed method can accurately detect small and smooth focus areas while improving object detection performance, outperforming existing methods in both subjective and objective evaluations. The source code is available at https://github.com/ixilai/SAMF.
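
下面给出一个多聚焦融合"三区域"思想的简化示例:用拉普拉斯能量作清晰度度量,把像素分成 A 清晰、B 清晰和不确定三类,确定区域直接取对应源图,不确定区域做过渡加权;阈值、度量方式与函数名均为本文作者的示意性选择,并非 SAMF 的原始算法。

```python
# 多聚焦图像融合"三区域"思路的简化示意(非 SAMF 官方算法)
import cv2
import numpy as np

def focus_measure(gray: np.ndarray) -> np.ndarray:
    lap = cv2.Laplacian(gray.astype(np.float32), cv2.CV_32F, ksize=3)
    return cv2.GaussianBlur(lap * lap, (7, 7), 0)        # 局部平滑后的拉普拉斯能量

def fuse(img_a: np.ndarray, img_b: np.ndarray, tau: float = 0.1) -> np.ndarray:
    fa = focus_measure(cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY))
    fb = focus_measure(cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY))
    diff = (fa - fb) / (fa + fb + 1e-6)                  # 归一化清晰度差
    # |diff| >= tau 的像素视为"确定区域"(权重落到 0 或 1),其余为不确定过渡区域
    w = np.clip(0.5 + diff / (2 * tau), 0, 1)[..., None]
    return (w * img_a + (1 - w) * img_b).astype(img_a.dtype)

# 用法示例(假设 a.jpg / b.jpg 为同一场景的前景/背景聚焦图像):
# fused = fuse(cv2.imread("a.jpg"), cv2.imread("b.jpg"))
```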


标题: UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation

作者: Qingdong He, Jinlong Peng, Zhengkai Jiang

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2401.11395v2

GitHub: https://github.com/hithqd/UniM-OV3D

中文摘要: 3D开放词汇场景理解旨在识别基础标签空间之外的任意新类别。然而,现有工作既没有充分利用3D领域中所有可用的模态信息,在表示各模态特征时也缺乏足够的粒度。在本文中,我们提出了一个统一的多模态3D开放词汇场景理解网络UniM-OV3D,它将点云与图像、语言和深度对齐。为了更好地整合点云的全局和局部特征,我们设计了一个分层的点云特征提取模块,以学习全面的细粒度特征表示。此外,为了便于从字幕中学习由粗到细的点级语义表示,我们提出利用分层的3D字幕对,借助3D场景不同视角之间的几何约束。大量实验结果证明了我们的方法在开放词汇语义分割和实例分割上的有效性和优越性,在ScanNet、ScanNet200、S3DIS和nuScenes等室内外基准上均达到了最先进的性能。代码见 https://github.com/hithqd/UniM-OV3D 。

摘要: 3D open-vocabulary scene understanding aims to recognize arbitrary novel categories beyond the base label space. However, existing works not only fail to fully utilize all the available modal information in the 3D domain but also lack sufficient granularity in representing the features of each modality. In this paper, we propose a unified multimodal 3D open-vocabulary scene understanding network, namely UniM-OV3D, which aligns point clouds with image, language and depth. To better integrate global and local features of the point clouds, we design a hierarchical point cloud feature extraction module that learns comprehensive fine-grained feature representations. Further, to facilitate the learning of coarse-to-fine point-semantic representations from captions, we propose the utilization of hierarchical 3D caption pairs, capitalizing on geometric constraints across various viewpoints of 3D scenes. Extensive experimental results demonstrate the effectiveness and superiority of our method in open-vocabulary semantic and instance segmentation, which achieves state-of-the-art performance on both indoor and outdoor benchmarks such as ScanNet, ScanNet200, S3IDS and nuScenes. Code is available at https://github.com/hithqd/UniM-OV3D.
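
点云与文本等模态的对齐通常可以用对比学习来实现;下面给出一个 InfoNCE 式对齐损失的极简示例,仅说明"跨模态特征对齐"这一通用做法,函数名与维度设置均为假设,并非 UniM-OV3D 的官方训练代码。

```python
# 点云-文本特征对齐的极简示意(非 UniM-OV3D 官方实现):对成对特征做 InfoNCE 式对比学习
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(point_feats, text_feats, temperature=0.07):
    # point_feats / text_feats: [batch, dim],同一行来自同一个 3D 场景(或局部区域)及其文字描述
    p = F.normalize(point_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = p @ t.t() / temperature                     # [batch, batch] 相似度矩阵
    targets = torch.arange(p.shape[0], device=p.device)  # 对角线为正样本
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```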


标题: CoSSegGaussians: Compact and Swift Scene Segmenting 3D Gaussians with Dual Feature Fusion

作者: Bin Dou, Tianyu Zhang, Yongjia Ma

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.05925v3

Project: https://David-Dou.github.io/CoSSegGaussians

中文摘要: 我们提出了紧凑快速的3D高斯分割方法CoSSegGaussians,仅以RGB图像作为输入,即可在较快的渲染速度下实现紧凑且3D一致的场景分割。此前基于NeRF的分割方法依赖耗时的神经场景优化;虽然最近的3D高斯泼溅(3D Gaussian Splatting)显著提升了速度,但现有基于高斯的分割方法难以生成紧凑的掩码,在零样本分割中尤为明显。这一问题很可能源于它们直接为每个高斯分配可学习参数,导致对跨视角不一致的2D机器生成标签缺乏鲁棒性。我们的方法通过使用双特征融合网络作为高斯的分割场来解决这一问题。具体来说,我们首先在RGB监督下优化3D高斯;在高斯定位之后,通过显式反投影引入从图像中提取的DINO特征,并进一步与来自高效点云处理网络的空间特征相结合;随后采用由全局到局部的特征聚合策略,将二者融合为紧凑的分割特征。实验结果表明,我们的模型在语义和全景零样本分割任务上均优于基线方法,同时推理时间不到基于NeRF方法的10%。代码和更多结果将发布在 https://David-Dou.github.io/CoSSegGaussians 。

摘要: We propose Compact and Swift Segmenting 3D Gaussians(CoSSegGaussians), a method for compact 3D-consistent scene segmentation at fast rendering speed with only RGB images input. Previous NeRF-based segmentation methods have relied on time-consuming neural scene optimization. While recent 3D Gaussian Splatting has notably improved speed, existing Gaussian-based segmentation methods struggle to produce compact masks, especially in zero-shot segmentation. This issue probably stems from their straightforward assignment of learnable parameters to each Gaussian, resulting in a lack of robustness against cross-view inconsistent 2D machine-generated labels. Our method aims to address this problem by employing Dual Feature Fusion Network as Gaussians’ segmentation field. Specifically, we first optimize 3D Gaussians under RGB supervision. After Gaussian Locating, DINO features extracted from images are applied through explicit unprojection, which are further incorporated with spatial features from the efficient point cloud processing network. Feature aggregation is utilized to fuse them in a global-to-local strategy for compact segmentation features. Experimental results show that our model outperforms baselines on both semantic and panoptic zero-shot segmentation task, meanwhile consumes less than 10% inference time compared to NeRF-based methods. Code and more results will be available at https://David-Dou.github.io/CoSSegGaussians
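
下面的函数示意摘要中"显式反投影"这一步的常见做法:把已定位的 3D 高斯中心投影到各视角的图像平面,再从 2D 特征图(如 DINO 特征)上取样作为每个高斯的特征;相机参数、最近邻采样等细节均为本文作者的示意性假设,并非官方实现。

```python
# "显式反投影"步骤的简化示意(非 CoSSegGaussians 官方实现):
# 把 3D 高斯中心投影到图像平面,从 2D 特征图中取样,作为每个高斯的语义/外观特征。
import torch

def unproject_features(points, feat_map, K, w2c):
    """points: [N,3] 世界坐标;feat_map: [C,H,W];K: [3,3] 内参;w2c: [4,4] 世界到相机外参。"""
    N = points.shape[0]
    pts_h = torch.cat([points, torch.ones(N, 1)], dim=1)          # 齐次坐标
    cam = (w2c @ pts_h.t()).t()[:, :3]                            # 变换到相机坐标系
    uv = (K @ cam.t()).t()
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)                   # 透视除法得到像素坐标
    C, H, W = feat_map.shape
    u = uv[:, 0].round().long().clamp(0, W - 1)
    v = uv[:, 1].round().long().clamp(0, H - 1)
    valid = cam[:, 2] > 0                                         # 只保留位于相机前方的点
    feats = feat_map[:, v, u].t()                                 # [N, C] 最近邻采样
    return feats * valid.unsqueeze(1)

# 用法示例:1000 个高斯中心,从 64 维特征图上取样(内外参为随意假设的数值)
feats = unproject_features(torch.rand(1000, 3), torch.randn(64, 60, 80),
                           torch.tensor([[500., 0, 40], [0, 500., 30], [0, 0, 1]]),
                           torch.eye(4))
```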


标题: Tiered approach for rapid damage characterisation of infrastructure enabled by remote sensing and deep learning technologies

作者: Nadiia Kopiika, Andreas Karavias, Pavlos Krassakis

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2401.17759v2

摘要: Critical infrastructure such as bridges are systematically targeted during wars and conflicts. This is because critical infrastructure is vital for enabling connectivity and transportation of people and goods, and hence, underpinning the national and international defence planning and economic growth. Mass destruction of bridges, along with minimal or no accessibility to these assets during natural and anthropogenic disasters, prevents us from delivering rapid recovery. As a result, systemic resilience is drastically reduced. A solution to this challenge is to use technology for stand-off observations. Yet, no method exists to characterise damage at different scales, i.e. regional, asset, and structural (component), and more so there is little or no systematic correlation between assessments at scale. We propose an integrated three-level tiered approach to fill this capability gap, and we demonstrate the methods for damage characterisation enabled by fit-for-purpose digital technologies. Next, this method is applied and validated to a case study in Ukraine that includes 17 bridges. From macro to micro, we deploy technology at scale, from Sentinel-1 SAR images, crowdsourced information, and high-resolution images to deep learning for damaged infrastructure. For the first time, the interferometric coherence difference and semantic segmentation of images were deployed to improve the reliability of damage characterisations from regional to infrastructure component level, when enhanced assessment accuracy is required. This integrated method improves the speed of decision-making, and thus, enhances resilience. Keywords: critical infrastructure, damage characterisation, targeted attacks, restoration
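
摘要中提到的干涉相干性差值(coherence difference)可以按如下思路粗略估计:分别对事件前、跨事件的 SAR 影像对做滑窗相干性估计,再取差值,相干性显著下降的像素指示潜在结构变化;下面的实现只是一个未经优化的示意(窗口大小、归一化方式均为假设),并非论文的实际处理流程。

```python
# 干涉相干性差值的简化示意(非论文官方流程)
import numpy as np

def local_coherence(s1: np.ndarray, s2: np.ndarray, win: int = 5) -> np.ndarray:
    """s1, s2: 同尺寸的复数 SLC 影像;返回滑窗估计的相干性幅值,取值范围 [0, 1]。"""
    num = np.zeros(s1.shape, dtype=complex)
    p1 = np.zeros(s1.shape)
    p2 = np.zeros(s1.shape)
    k = win // 2
    pad = lambda x: np.pad(x, k, mode="edge")
    a, b = pad(s1), pad(s2)
    H, W = s1.shape
    for i in range(H):
        for j in range(W):
            wa = a[i:i + win, j:j + win]
            wb = b[i:i + win, j:j + win]
            num[i, j] = np.sum(wa * np.conj(wb))      # 互相关
            p1[i, j] = np.sum(np.abs(wa) ** 2)
            p2[i, j] = np.sum(np.abs(wb) ** 2)
    return np.abs(num) / np.sqrt(p1 * p2 + 1e-12)

def coherence_difference(pre_pair, co_pair, win=5):
    # 相干性差 = 事件前相干性 - 跨事件相干性;差值大的像素指示潜在损毁
    return local_coherence(*pre_pair, win) - local_coherence(*co_pair, win)
```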

