[晓理紫] Daily Paper Digest (with abstracts and source code or project links): Reinforcement Learning, Imitation Learning

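Subscribe to papers in your own field

Follow {晓理紫|小李子} for daily paper updates. If you find this interesting, please forward it to anyone who might need it. Thank you for your support.

If you find this helpful, please follow me to receive the latest papers every day.

As a thank-you for readers' support, starting today I am offering a free topic-based paper subscription to 300 readers. Just follow the official account on WeChat (VX) and reply with {email + paper topics} (e.g., 123456@xx.com + chatgpt@large language model @LLM). The topics must belong to the same field, with at most three keywords. The blogger reserves the right of final interpretation.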

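Categories:

Large language models (LLM)
Vision-language models (VLM)
Diffusion models
Vision-and-language navigation (VLN)
Reinforcement learning (RL)
Imitation learning (IL)
Robotics
Open vocabulary, detection and segmentation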

== RL @ RLHF ==

Title: Interpretable Concept Bottlenecks to Align Reinforcement Learning Agents

Authors: Quentin Delfosse, Sebastian Sztwiertnia, Mark Rothermel

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2401.05821v2

GitHub: https://github.com/k4ntz/SCoBots

Abstract: Goal misalignment, reward sparsity and difficult credit assignment are only a few of the many issues that make it difficult for deep reinforcement learning (RL) agents to learn optimal policies. Unfortunately, the black-box nature of deep neural networks impedes the inclusion of domain experts for inspecting the model and revising suboptimal policies. To this end, we introduce Successive Concept Bottleneck Agents (SCoBots), which integrate consecutive concept bottleneck (CB) layers. In contrast to current CB models, SCoBots do not just represent concepts as properties of individual objects, but also as relations between objects, which is crucial for many RL tasks. Our experimental results provide evidence of SCoBots’ competitive performance, but also of their potential for domain experts to understand and regularize their behavior. Among other things, SCoBots enabled us to identify a previously unknown misalignment problem in the iconic video game, Pong, and resolve it. Overall, SCoBots thus result in more human-aligned RL agents. Our code is available at https://github.com/k4ntz/SCoBots.

Title: ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning

Authors: Chen-Xiao Gao, Chenyang Wu, Mingjun Cao

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2309.05915v2

GitHub: https://github.com/LAMDA-RL/ACT

Abstract: Decision Transformer (DT), which employs expressive sequence modeling techniques to perform action generation, has emerged as a promising approach to offline policy optimization. However, DT generates actions conditioned on a desired future return, which is known to bear some weaknesses such as the susceptibility to environmental stochasticity. To overcome DT’s weaknesses, we propose to empower DT with dynamic programming. Our method comprises three steps. First, we employ in-sample value iteration to obtain approximated value functions, which involves dynamic programming over the MDP structure. Second, we evaluate action quality in context with estimated advantages. We introduce two types of advantage estimators, IAE and GAE, which are suitable for different tasks. Third, we train an Advantage-Conditioned Transformer (ACT) to generate actions conditioned on the estimated advantages. Finally, during testing, ACT generates actions conditioned on a desired advantage. Our evaluation results validate that, by leveraging the power of dynamic programming, ACT demonstrates effective trajectory stitching and robust action generation in spite of the environmental stochasticity, outperforming baseline methods across various benchmarks. Additionally, we conduct an in-depth analysis of ACT’s various design choices through ablation studies. Our code is available at https://github.com/LAMDA-RL/ACT.
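As a rough illustration of the second step (scoring actions by estimated advantages), here is a minimal sketch of generalized advantage estimation in its usual textbook form. The abstract does not expand the acronym "GAE", so reading it as generalized advantage estimation is an assumption; the function name, `gamma`, and `lam` are illustrative, and the value estimates are plain numbers rather than the output of in-sample value iteration.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), i.e. one extra bootstrap value at the end.
    """
    assert len(values) == len(rewards) + 1
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the discounted sum of later TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy trajectory with made-up numbers.
print(gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
```

With `lam=0` the estimate reduces to the one-step TD error, and with `lam=1` it becomes the discounted return minus the value baseline.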

Title: RLHF and IIA: Perverse Incentives

Authors: Wanqiao Xu, Shi Dong, Xiuyuan Lu

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2312.01057v3

Abstract: Existing algorithms for reinforcement learning from human feedback (RLHF) can incentivize responses at odds with preferences because they are based on models that assume independence of irrelevant alternatives (IIA). The perverse incentives induced by IIA hinder innovations on query formats and learning algorithms.
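To make the IIA assumption concrete, the sketch below uses a softmax (multinomial-logit) choice model, a common example of a model with this property. The abstract does not say which models the paper analyzes, and the scores and response names here are invented for illustration. Under such a model, the odds between two responses stay the same when a third alternative is added, even a near-duplicate of one of them.

```python
import math


def choice_probs(scores):
    """Softmax (multinomial-logit) choice probabilities over named alternatives."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}


# Hypothetical scores for candidate responses (illustrative values only).
two = choice_probs({"A": 1.0, "B": 0.0})
three = choice_probs({"A": 1.0, "B": 0.0, "A_near_duplicate": 1.0})

odds_two = two["A"] / two["B"]
odds_three = three["A"] / three["B"]
print(odds_two, odds_three)
```

The A-to-B odds are unchanged, while the probability of choosing A itself drops once the near-duplicate is offered.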

Title: Towards Efficient and Exact Optimization of Language Model Alignment

Authors: Haozhe Ji, Cheng Lu, Yilin Niu

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2402.00856v1

Abstract: The alignment of language models with human preferences is vital for their application in real-world tasks. The problem is formulated as optimizing the model’s policy to maximize the expected reward that reflects human preferences with minimal deviation from the initial policy. While considered a straightforward solution, reinforcement learning (RL) suffers from high variance in policy updates, which impedes efficient policy improvement. Recently, direct preference optimization (DPO) was proposed to directly optimize the policy from preference data. Though simple to implement, DPO is derived based on the optimal policy that is not assured to be achieved in practice, which undermines its convergence to the intended solution. In this paper, we propose efficient exact optimization (EXO) of the alignment objective. We prove that EXO is guaranteed to optimize in the same direction as the RL algorithms asymptotically for arbitrary parametrization of the policy, while enabling efficient optimization by circumventing the complexities associated with RL algorithms. We compare our method to DPO with both theoretical and empirical analyses, and further demonstrate the advantages of our method over existing approaches on realistic human preference data.
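The objective described in the abstract (maximize expected reward while staying close to the initial policy) is commonly written as a KL-regularized problem. The notation below ($\pi_\theta$ for the policy, $\pi_{\mathrm{ref}}$ for the initial policy, $r$ for the reward, $\beta$ for the regularization weight) is assumed for illustration and is not taken from the paper.

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D}}\Big[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\Big]

% Closed-form maximizer over all policies (the "optimal policy" used in DPO-style derivations):
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big),
\qquad
Z(x) \;=\; \sum_{y'} \pi_{\mathrm{ref}}(y' \mid x)\, \exp\!\Big( \tfrac{1}{\beta}\, r(x, y') \Big)
```

A parametrized policy $\pi_\theta$ may not be able to represent $\pi^{*}$ exactly, which is the gap the abstract refers to when it says the optimal policy "is not assured to be achieved in practice".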

Title: SLIM: Skill Learning with Multiple Critics

Authors: David Emukpere, Bingbing Wu, Julien Perez

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2402.00823v1

Abstract: Self-supervised skill learning aims to acquire useful behaviors that leverage the underlying dynamics of the environment. Latent variable models, based on mutual information maximization, have been particularly successful in this task but still struggle in the context of robotic manipulation. As it requires impacting a possibly large set of degrees of freedom composing the environment, mutual information maximization fails alone in producing useful manipulation behaviors. To address this limitation, we introduce SLIM, a multi-critic learning approach for skill discovery with a particular focus on robotic manipulation. Our main insight is that utilizing multiple critics in an actor-critic framework to gracefully combine multiple reward functions leads to a significant improvement in latent-variable skill discovery for robotic manipulation while overcoming possible interference occurring among rewards which hinders convergence to useful skills. Furthermore, in the context of tabletop manipulation, we demonstrate the applicability of our novel skill discovery approach to acquire safe and efficient motor primitives in a hierarchical reinforcement learning fashion and leverage them through planning, surpassing the state-of-the-art approaches for skill discovery by a large margin.

Title: Leveraging Approximate Model-based Shielding for Probabilistic Safety Guarantees in Continuous Environments

Authors: Alexander W. Goodall, Francesco Belardinelli

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2402.00816v1

Abstract: Shielding is a popular technique for achieving safe reinforcement learning (RL). However, classical shielding approaches come with quite restrictive assumptions making them difficult to deploy in complex environments, particularly those with continuous state or action spaces. In this paper we extend the more versatile approximate model-based shielding (AMBS) framework to the continuous setting. In particular we use Safety Gym as our test-bed, allowing for a more direct comparison of AMBS with popular constrained RL algorithms. We also provide strong probabilistic safety guarantees for the continuous setting. In addition, we propose two novel penalty techniques that directly modify the policy gradient, which empirically provide more stable convergence in our experiments.

== Imitation Learning ==

Title: Robust Path Planning via Learning from Demonstrations for Robotic Catheters in Deformable Environments

Authors: Zhen Li, Chiara Lambranzi, Di Wu

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2402.00537v1

Abstract: Navigation through tortuous and deformable vessels using catheters with limited steering capability underscores the need for reliable path planning. State-of-the-art path planners do not fully account for the deformable nature of the environment. This work proposes a robust path planner via a learning from demonstrations method, named Curriculum Generative Adversarial Imitation Learning (C-GAIL). This path planning framework takes into account the interaction between steerable catheters and vessel walls and the deformable property of vessels. In-silico comparative experiments show that the proposed network achieves smaller targeting errors, and a higher success rate, compared to a state-of-the-art approach based on GAIL. The in-vitro validation experiments demonstrate that the path generated by the proposed C-GAIL path planner aligns better with the actual steering capability of the pneumatic artificial muscle-driven catheter utilized in this study. Therefore, the proposed approach can provide enhanced support to the user in navigating the catheter towards the target with greater precision, in contrast to the conventional centerline-following technique. The targeting and tracking errors are $1.26 \pm 0.55$ mm and $5.18 \pm 3.48$ mm, respectively. The proposed path planning framework exhibits superior performance in managing uncertainty associated with vessel deformation, thereby resulting in lower tracking errors.

Title: ODICE: Revealing the Mystery of Distribution Correction Estimation via Orthogonal-gradient Update

Authors: Liyuan Mao, Haoran Xu, Weinan Zhang

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2402.00348v1

Abstract: In this study, we investigate the DIstribution Correction Estimation (DICE) methods, an important line of work in offline reinforcement learning (RL) and imitation learning (IL). DICE-based methods impose state-action-level behavior constraint, which is an ideal choice for offline learning. However, they typically perform much worse than current state-of-the-art (SOTA) methods that solely use action-level behavior constraint. After revisiting DICE-based methods, we find there exist two gradient terms when learning the value function using true-gradient update: forward gradient (taken on the current state) and backward gradient (taken on the next state). Using forward gradient bears a large similarity to many offline RL methods, and thus can be regarded as applying action-level constraint. However, directly adding the backward gradient may degenerate or cancel out its effect if these two gradients have conflicting directions. To resolve this issue, we propose a simple yet effective modification that projects the backward gradient onto the normal plane of the forward gradient, resulting in an orthogonal-gradient update, a new learning rule for DICE-based methods. We conduct thorough theoretical analyses and find that the projected backward gradient brings state-level behavior regularization, which reveals the mystery of DICE-based methods: the value learning objective does try to impose state-action-level constraint, but needs to be used in a corrected way. Through toy examples and extensive experiments on complex offline RL and IL tasks, we demonstrate that DICE-based methods using orthogonal-gradient updates (O-DICE) achieve SOTA performance and great robustness.
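The projection step described in the abstract can be sketched with plain vectors. This is only the geometric operation (removing the component of the backward gradient that lies along the forward gradient), not the authors' implementation; the function names and the weight `eta` are hypothetical, and gradients are represented as flat Python lists.

```python
def project_onto_normal_plane(backward, forward, eps=1e-12):
    """Remove from `backward` its component along `forward`."""
    dot = sum(b * f for b, f in zip(backward, forward))
    norm_sq = sum(f * f for f in forward)
    coef = dot / (norm_sq + eps)  # eps guards against a zero forward gradient
    return [b - coef * f for b, f in zip(backward, forward)]


def orthogonal_gradient_update(forward, backward, eta=1.0):
    """Forward gradient plus the eta-scaled projected backward gradient.

    `eta` is an illustrative weight, not a value taken from the paper.
    """
    projected = project_onto_normal_plane(backward, forward)
    return [f + eta * p for f, p in zip(forward, projected)]


# The backward gradient partly opposes the forward one along the first axis.
forward = [1.0, 0.0]
backward = [-2.0, 3.0]
print(orthogonal_gradient_update(forward, backward))
```

Because the projected backward gradient is orthogonal to the forward gradient, adding it cannot cancel the forward term, which is the conflict the abstract describes.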

Title: Interpretable Imitation Learning with Dynamic Causal Relations

Authors: Tianxiang Zhao, Wenchao Yu, Suhang Wang

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2310.00489v4

Abstract: Imitation learning, which learns agent policy by mimicking expert demonstration, has shown promising results in many applications such as medical treatment regimes and self-driving vehicles. However, it remains a difficult task to interpret control policies learned by the agent. Difficulties mainly come from two aspects: 1) agents in imitation learning are usually implemented as deep neural networks, which are black-box models and lack interpretability; 2) the latent causal mechanism behind agents’ decisions may vary along the trajectory, rather than staying static throughout time steps. To increase transparency and offer better interpretability of the neural agent, we propose to expose its captured knowledge in the form of a directed acyclic causal graph, with nodes being action and state variables and edges denoting the causal relations behind predictions. Furthermore, we design this causal discovery process to be state-dependent, enabling it to model the dynamics in latent causal graphs. Concretely, we conduct causal discovery from the perspective of Granger causality and propose a self-explainable imitation learning framework. The proposed framework is composed of three parts: a dynamic causal discovery module, a causality encoding module, and a prediction module, and is trained in an end-to-end manner. After the model is learned, we can obtain causal relations among states and action variables behind its decisions, exposing policies learned by it. Experimental results on both synthetic and real-world datasets demonstrate the effectiveness of the proposed framework in learning the dynamic causal graphs for understanding the decision-making of imitation learning while maintaining high prediction accuracy.

Title: Extrinsicaly Rewarded Soft Q Imitation Learning with Discriminator

Authors: Ryoma Furuyama, Daiki Kuyoshi, Satoshi Yamane

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.16772v1

Abstract: Imitation learning is often used in addition to reinforcement learning in environments where reward design is difficult or where the reward is sparse, but it is difficult to imitate well in unknown states from a small amount of expert data and sampling data. Supervised learning methods such as Behavioral Cloning do not require sampling data, but usually suffer from distribution shift. Methods based on reinforcement learning, such as inverse reinforcement learning and Generative Adversarial Imitation Learning (GAIL), can learn from only a small amount of expert data. However, they often need to interact with the environment. Soft Q Imitation Learning (SQIL) addressed these problems, and it was shown that it could learn efficiently by combining Behavioral Cloning and soft Q-learning with constant rewards. To make this algorithm more robust to distribution shift, we propose a more efficient and robust algorithm by adding to this method a reward function based on adversarial inverse reinforcement learning that rewards the agent for performing actions in states similar to the demonstrations. We call this algorithm Discriminator Soft Q Imitation Learning (DSQIL). We evaluated it on MuJoCo environments.
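The "constant rewards" idea in SQIL can be sketched as a relabeling step before soft Q-learning. The values used here (+1 for demonstration transitions, 0 for agent-collected transitions) follow the usual description of SQIL and are treated as assumptions. The optional `bonus_fn` is a hypothetical stand-in for the discriminator-based reward that DSQIL adds, whose exact form the abstract does not give.

```python
def label_rewards(demo_transitions, agent_transitions, bonus_fn=None):
    """Return (state, action, reward) triples with constant SQIL-style rewards.

    demo_transitions / agent_transitions: iterables of (state, action) pairs.
    bonus_fn(state, action): hypothetical extra reward, e.g. from a discriminator.
    """
    # Expert demonstrations always receive the constant reward +1.
    labeled = [(s, a, 1.0) for s, a in demo_transitions]
    # Agent experience receives 0, plus the optional bonus when one is supplied.
    for s, a in agent_transitions:
        bonus = bonus_fn(s, a) if bonus_fn is not None else 0.0
        labeled.append((s, a, 0.0 + bonus))
    return labeled


# Toy transitions with placeholder state and action names.
print(label_rewards([("s0", "a0")], [("s1", "a1")]))
print(label_rewards([("s0", "a0")], [("s1", "a1")], bonus_fn=lambda s, a: 0.5))
```

The relabeled triples would then be fed to an off-policy soft Q-learning update, which is outside the scope of this sketch.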

Title: ILBiT: Imitation Learning for Robot Using Position and Torque Information based on Bilateral Control with Transformer

Authors: Masato Kobayashi, Thanpimon Buamanee, Yuki Uranishi

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.16653v1

Abstract: Autonomous manipulation in robot arms is a complex and evolving field of study in robotics. This paper introduces an innovative approach to this challenge by focusing on imitation learning (IL). Unlike traditional imitation methods, our approach uses IL based on bilateral control, allowing for more precise and adaptable robot movements. Conventional IL methods based on bilateral control have relied on Long Short-Term Memory (LSTM) networks. In this paper, we present the IL for robot using position and torque information based on Bilateral control with Transformer (ILBiT). This proposed method employs the Transformer model, known for its robust performance in handling diverse datasets and its capability to surpass LSTM’s limitations, especially in tasks requiring detailed force adjustments. A standout feature of ILBiT is its high-frequency operation at 100 Hz, which significantly improves the system’s adaptability and response to varying environments and objects of different hardness levels. The effectiveness of the Transformer-based ILBiT method can be seen through comprehensive real-world experiments.

Title: Inverse Reinforcement Learning without Reinforcement Learning

Authors: Gokul Swamy, Sanjiban Choudhury, J. Andrew Bagnell

PubTime: 2024-01-29

Downlink: http://arxiv.org/abs/2303.14623v4

Abstract: Inverse Reinforcement Learning (IRL) is a powerful set of techniques for imitation learning that aims to learn a reward function that rationalizes expert demonstrations. Unfortunately, traditional IRL methods suffer from a computational weakness: they require repeatedly solving a hard reinforcement learning (RL) problem as a subroutine. This is counter-intuitive from the viewpoint of reductions: we have reduced the easier problem of imitation learning to repeatedly solving the harder problem of RL. Another thread of work has proved that access to the side-information of the distribution of states where a strong policy spends time can dramatically reduce the sample and computational complexities of solving an RL problem. In this work, we demonstrate for the first time a more informed imitation learning reduction where we utilize the state distribution of the expert to alleviate the global exploration component of the RL subroutine, providing an exponential speedup in theory. In practice, we find that we are able to significantly speed up the prior art on continuous control tasks.

== robotic agent ==

Title: Generative Expressive Robot Behaviors using Large Language Models

Authors: Karthik Mahadevan, Jonathan Chien, Noah Brown

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.14673v2

Project: https://generative-expressive-motion.github.io/

Abstract: People employ expressive behaviors to effectively communicate and coordinate their actions with others, such as nodding to acknowledge a person glancing at them or saying “excuse me” to pass people in a busy corridor. We would like robots to also demonstrate expressive behaviors in human-robot interaction. Prior work proposes rule-based methods that struggle to scale to new communication modalities or social situations, while data-driven methods require specialized datasets for each social situation the robot is used in. We propose to leverage the rich social context available from large language models (LLMs) and their ability to generate motion based on instructions or user preferences, to generate expressive robot motion that is adaptable and composable, building upon each other. Our approach utilizes few-shot chain-of-thought prompting to translate human language instructions into parametrized control code using the robot’s available and learned skills. Through user studies and simulation experiments, we demonstrate that our approach produces behaviors that users found to be competent and easy to understand. Supplementary material can be found at https://generative-expressive-motion.github.io/.

Title: Towards Unified Interactive Visual Grounding in The Wild

Authors: Jie Xu, Hanbo Zhang, Qingyi Si

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.16699v1

GitHub: https://github.com/jxu124/TiO

Abstract: Interactive visual grounding in Human-Robot Interaction (HRI) is challenging yet practical due to the inevitable ambiguity in natural languages. It requires robots to disambiguate the user input by active information gathering. Previous approaches often rely on predefined templates to ask disambiguation questions, resulting in performance reduction in realistic interactive scenarios. In this paper, we propose TiO, an end-to-end system for interactive visual grounding in human-robot interaction. Benefiting from a unified formulation of visual dialogue and grounding, our method can be trained jointly on extensive public data and shows superior generality to diversified and challenging open-world scenarios. In the experiments, we validate TiO on GuessWhat?! and InViG benchmarks, setting new state-of-the-art performance by a clear margin. Moreover, we conduct HRI experiments on 150 carefully selected challenging scenes as well as real-robot platforms. Results show that our method demonstrates superior generality to diversified visual and language inputs with a high success rate. Codes and demos are available at https://github.com/jxu124/TiO.

Title: Transferring human emotions to robot motions using Neural Policy Style Transfer

Authors: Raul Fernandez-Fernandez, Bartek Łukawski, Juan G. Victores

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2402.00663v1

Abstract: Neural Style Transfer (NST) was originally proposed to use feature extraction capabilities of Neural Networks as a way to perform Style Transfer with images. Pre-trained image classification architectures were selected for feature extraction, leading to new images showing the same content as the original but with a different style. In robotics, Style Transfer can be employed to transfer human motion styles to robot motions. The challenge lies in the lack of pre-trained classification architectures for robot motions that could be used for feature extraction. Neural Policy Style Transfer TD3 (NPST3) is proposed for the transfer of human motion styles to robot motions. This framework allows the same robot motion to be executed in different human-centered motion styles, such as in an angry, happy, calm, or sad fashion. The Twin Delayed Deep Deterministic Policy Gradient (TD3) network is introduced for the generation of control policies. An autoencoder network is in charge of feature extraction for the Style Transfer step. The Style Transfer step can be performed both offline and online: offline for the autonomous executions of human-style robot motions, and online for adapting at runtime the style of e.g., a teleoperated robot. The framework is tested using two different robotic platforms: a robotic manipulator designed for telemanipulation tasks, and a humanoid robot designed for social interaction. The proposed approach was evaluated for both platforms, performing a total of 147 questionnaires asking human subjects to recognize the human motion style transferred to the robot motion for a predefined set of actions.

Title: Artificial intelligence is algorithmic mimicry: why artificial "agents" are not (and won't be) proper agents

Author: Johannes Jaeger

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2307.07515v3

Abstract: What is the prospect of developing artificial general intelligence (AGI)? I investigate this question by systematically comparing living and algorithmic systems, with a special focus on the notion of “agency.” There are three fundamental differences to consider: (1) Living systems are autopoietic, that is, self-manufacturing, and therefore able to set their own intrinsic goals, while algorithms exist in a computational environment with target functions that are both provided by an external agent. (2) Living systems are embodied in the sense that there is no separation between their symbolic and physical aspects, while algorithms run on computational architectures that maximally isolate software from hardware. (3) Living systems experience a large world, in which most problems are ill-defined (and not all definable), while algorithms exist in a small world, in which all problems are well-defined. These three differences imply that living and algorithmic systems have very different capabilities and limitations. In particular, it is extremely unlikely that true AGI (beyond mere mimicry) can be developed in the current algorithmic framework of AI research. Consequently, discussions about the proper development and deployment of algorithmic tools should be shaped around the dangers and opportunities of current narrow AI, not the extremely unlikely prospect of the emergence of true agency in artificial systems.

Title: REACT: Two Datasets for Analyzing Both Human Reactions and Evaluative Feedback to Robots Over Time

Authors: Kate Candon, Nicholas C. Georgiou, Helen Zhou

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2402.00190v1

Abstract: Recent work in Human-Robot Interaction (HRI) has shown that robots can leverage implicit communicative signals from users to understand how they are being perceived during interactions. For example, these signals can be gaze patterns, facial expressions, or body motions that reflect internal human states. To facilitate future research in this direction, we contribute the REACT database, a collection of two datasets of human-robot interactions that display users’ natural reactions to robots during a collaborative game and a photography scenario. Further, we analyze the datasets to show that interaction history is an important factor that can influence human reactions to robots. As a result, we believe that future models for interpreting implicit feedback in HRI should explicitly account for this history. REACT opens up doors to this possibility in the future.

Title: Memory-centered and Affordance-based Framework for Mobile Manipulation

Authors: Christoph Pohl, Fabian Reister, Fabian Peller-Konrad

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.16899v1

Abstract: Performing versatile mobile manipulation actions in human-centered environments requires highly sophisticated software frameworks that are flexible enough to handle special use cases, yet general enough to be applicable across different robotic systems, tasks, and environments. This paper presents a comprehensive memory-centered, affordance-based, and modular uni- and multi-manual grasping and mobile manipulation framework, applicable to complex robot systems with a high number of degrees of freedom such as humanoid robots. By representing mobile manipulation actions through affordances, i.e., interaction possibilities of the robot with its environment, we unify the autonomous manipulation process for known and unknown objects in arbitrary environments. Our framework is integrated and embedded into the memory-centric cognitive architecture of the ARMAR humanoid robot family. This way, robots can not only interact with the physical world but also use common knowledge about objects, and learn and adapt manipulation strategies. We demonstrate the applicability of the framework in real-world experiments, including grasping known and unknown objects, object placing, and semi-autonomous bimanual grasping of objects on two different humanoid robot platforms.

== Object Detection ==

Title: We're Not Using Videos Effectively: An Updated Domain Adaptive Video Segmentation Baseline

Authors: Simar Kareer, Vivek Vijaykumar, Harsh Maheshwari

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2402.00868v1

GitHub: https://github.com/SimarKareer/UnifiedVideoDA

中文摘要: 在语义分割（DAS）的无监督域适应方面已经有大量的工作，寻求将在图像上训练的模型从标记的源域适应到未标记的目标域。虽然绝大多数先前的工作已经将此作为帧级图像-DAS问题来研究，但是少数视频-DAS工作已经寻求额外利用相邻帧中存在的时间信号。然而，视频DAS作品在历史上研究了一套不同于图像DAS的基准，交叉基准很少。在这项工作中，我们解决了这一差距。令人惊讶的是，我们发现（1）即使在仔细控制数据和模型架构后，最先进的图像DAS方法（HRDA和HRDA+MIC）}在已建立的视频DAS基准上优于视频DAS方法（在Viper $\rightarrow$ CityscapesSeq上+14.5 mIoU，在Synthia $\rightarrow$ CityscapesSeq上+19.0 mIoU），以及（2）图像DAS和视频DAS技术的简单组合仅导致跨数据集的边际改进。为了避免图像DAS和视频DAS之间的孤立进展，我们开源了我们的代码库，在一个公共基准上支持一套全面的视频DAS和图像DAS方法。代码可在https://github.com/sii获得。Markareer/UnifiedVideoDA

摘要: There has been abundant work in unsupervised domain adaptation for semantic segmentation (DAS) seeking to adapt a model trained on images from a labeled source domain to an unlabeled target domain. While the vast majority of prior work has studied this as a frame-level Image-DAS problem, a few Video-DAS works have sought to additionally leverage the temporal signal present in adjacent frames. However, Video-DAS works have historically studied a distinct set of benchmarks from Image-DAS, with minimal cross-benchmarking. In this work, we address this gap. Surprisingly, we find that (1) even after carefully controlling for data and model architecture, state-of-the-art Image-DAS methods (HRDA and HRDA+MIC) outperform Video-DAS methods on established Video-DAS benchmarks (+14.5 mIoU on Viper → CityscapesSeq, +19.0 mIoU on Synthia → CityscapesSeq), and (2) naive combinations of Image-DAS and Video-DAS techniques only lead to marginal improvements across datasets. To avoid siloed progress between Image-DAS and Video-DAS, we open-source our codebase with support for a comprehensive set of Video-DAS and Image-DAS methods on a common benchmark. Code available at https://github.com/SimarKareer/UnifiedVideoDA
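
The gains above are reported in mIoU (mean intersection-over-union). As a reader's aid, here is a minimal sketch of how that metric can be computed on flattened label lists. The function name `mean_iou` and the toy labels are illustrative assumptions, not code from the UnifiedVideoDA repository. Benchmarks usually accumulate intersections and unions over the whole dataset rather than per image, which this sketch does not do.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in pred or target.

    pred and target are flat sequences of integer class labels
    of equal length (hypothetical toy format for this sketch).
    """
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
# Per-class IoU here is 1/3, 2/3 and 1/2, so the mean is about 0.5.
print(mean_iou(pred, target, num_classes=3))
```

A difference of "+14.5 mIoU" then means the mean IoU, expressed in percentage points, is 14.5 higher for one method than for the other.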

标题: MAMBA: Multi-level Aggregation via Memory Bank for Video Object Detection

作者: Guanxiong Sun, Yang Hua, Guosheng Hu

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2401.09923v2

GitHub: https://github.com/guanxiongsun/vfe.pytorch

中文摘要: 最先进的视频目标检测方法都维护一种记忆结构（滑动窗口或记忆队列），通过注意力机制来增强当前帧。然而，我们认为这些记忆结构既不高效也不充分，因为它们隐含了两种操作：（1）将记忆中的所有特征拼接起来进行增强，导致沉重的计算成本；（2）逐帧更新记忆，使记忆无法捕获更多的时间信息。在本文中，我们提出了一种基于记忆库的多级聚合架构，称为MAMBA。具体来说，我们的记忆库采用两种新操作来消除现有方法的缺点：（1）轻量级的键集构造，可以显著降低计算成本；（2）细粒度的逐特征更新策略，使我们的方法能够利用整个视频的知识。为了更好地增强互补层级的特征，即特征图和候选框（proposals），我们进一步提出了广义增强操作（GEO），以统一的方式聚合多级特征。我们在具有挑战性的ImageNetVID数据集上进行了广泛的评估。与现有的最先进方法相比，我们的方法在速度和准确性两方面都取得了更优的性能。更值得注意的是，MAMBA使用ResNet-101以12.6/9.1 FPS的速度实现了83.7/84.6%的mAP。代码可在 https://github.com/guanxiongsun/vfe.pytorch 获得。

摘要: State-of-the-art video object detection methods maintain a memory structure, either a sliding window or a memory queue, to enhance the current frame using attention mechanisms. However, we argue that these memory structures are not efficient or sufficient because of two implied operations: (1) concatenating all features in memory for enhancement, leading to a heavy computational cost; (2) frame-wise memory updating, preventing the memory from capturing more temporal information. In this paper, we propose a multi-level aggregation architecture via memory bank called MAMBA. Specifically, our memory bank employs two novel operations to eliminate the disadvantages of existing methods: (1) light-weight key-set construction which can significantly reduce the computational cost; (2) fine-grained feature-wise updating strategy which enables our method to utilize knowledge from the whole video. To better enhance features from complementary levels, i.e., feature maps and proposals, we further propose a generalized enhancement operation (GEO) to aggregate multi-level features in a unified manner. We conduct extensive evaluations on the challenging ImageNetVID dataset. Compared with existing state-of-the-art methods, our method achieves superior performance in terms of both speed and accuracy. More remarkably, MAMBA achieves mAP of 83.7/84.6% at 12.6/9.1 FPS with ResNet-101. Code is available at https://github.com/guanxiongsun/vfe.pytorch.
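
To make the cost argument concrete, the following sketch contrasts attending over an entire memory with attending over a small key set drawn from it. It is only an illustration of the general idea: the abstract does not say how MAMBA builds its key set, so the random sampling used here, the function names `attend` and `enhance_with_key_set`, and the toy feature sizes are all assumptions and not the authors' implementation.

```python
import math
import random


def attend(query, keys):
    """Scaled dot-product attention of one query over key vectors.

    The keys also serve as values, to keep the sketch short.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    return [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]


def enhance_with_key_set(query, memory, key_set_size, rng):
    """Attend over a sampled subset of the memory instead of all of it.

    Random sampling is a placeholder assumption, not the paper's rule.
    """
    size = min(key_set_size, len(memory))
    key_set = rng.sample(memory, size)
    return attend(query, key_set), size


rng = random.Random(0)
memory = [[rng.random() for _ in range(4)] for _ in range(100)]
query = [rng.random() for _ in range(4)]
enhanced, used = enhance_with_key_set(query, memory, 8, rng)
print(used, "of", len(memory), "memory features used")
```

The attention cost grows with the number of keys, so restricting it to 8 of 100 stored features reduces the score computation by roughly the same ratio in this toy setting.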

标题: Vehicle Perception from Satellite

作者: Bin Zhao, Pengfei Han, Xuelong Li

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2402.00703v1

GitHub: https://github.com/Chenxi1510/Vehicle-Perception-from-Satellite-Videos

中文摘要: 卫星能够拍摄高分辨率视频，这使得从卫星感知车辆成为可能。与街道监控、行车记录仪或其他设备相比，卫星视频提供了更广阔的城市级视野，因此可以捕捉并呈现全局的交通动态场景。卫星交通监测是一项具有巨大应用潜力的新任务，应用包括交通拥堵预测、路径规划、车辆调度等。实际上，受分辨率和视野的限制，捕捉到的车辆非常小（只有几个像素），且移动缓慢。更糟糕的是，为了拍摄如此高分辨率的视频，这些卫星位于低地球轨道（LEO），所以背景也在移动。在这种情况下，从卫星视角进行交通监测是一项极具挑战性的任务。为了吸引更多的研究人员进入这一领域，我们建立了一个大规模的卫星交通监测基准。它支持多种任务，包括微小目标检测、计数和密度估计。该数据集基于12个卫星视频和14个从GTA-V录制的合成视频构建。它们被分成408个视频片段，其中包含7,336幅真实卫星图像和1,960幅合成图像。总共标注了128,801辆车，每幅图像中的车辆数量从0到101不等。我们在该数据集上评估了传统计算机视觉中的几种经典方法和最先进方法，以比较不同方法的性能，分析这项任务中的挑战，并讨论未来的前景。该数据集可从以下网址获得：https://github.com/Chenxi1510/Vehicle-Perception-from-Satellite-Videos。

摘要: Satellites are capable of capturing high-resolution videos. It makes vehicle perception from satellite become possible. Compared to street surveillance, drive recorder or other equipments, satellite videos provide a much broader city-scale view, so that the global dynamic scene of the traffic are captured and displayed. Traffic monitoring from satellite is a new task with great potential applications, including traffic jams prediction, path planning, vehicle dispatching, etc. Practically, limited by the resolution and view, the captured vehicles are very tiny (a few pixels) and move slowly. Worse still, these satellites are in Low Earth Orbit (LEO) to capture such high-resolution videos, so the background is also moving. Under this circumstance, traffic monitoring from the satellite view is an extremely challenging task. To attract more researchers into this field, we build a large-scale benchmark for traffic monitoring from satellite. It supports several tasks, including tiny object detection, counting and density estimation. The dataset is constructed based on 12 satellite videos and 14 synthetic videos recorded from GTA-V. They are separated into 408 video clips, which contain 7,336 real satellite images and 1,960 synthetic images. 128,801 vehicles are annotated totally, and the number of vehicles in each image varies from 0 to 101. Several classic and state-of-the-art approaches in traditional computer vision are evaluated on the datasets, so as to compare the performance of different approaches, analyze the challenges in this task, and discuss the future prospects. The dataset is available at: https://github.com/Chenxi1510/Vehicle-Perception-from-Satellite-Videos.

标题: Understanding the Role of the Projector in Knowledge Distillation

作者: Roy Miles, Krystian Mikolajczyk

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2303.11098v5

GitHub: https://github.com/roymiles/Simple-Recipe-Distillation

中文摘要: 在本文中，我们将知识蒸馏视为函数匹配和度量学习问题，重新审视其有效性。在此过程中，我们验证了三个重要的设计决策，即归一化、软最大值函数和投影层是关键要素。我们从理论上表明，投影器隐式编码了过去样本的信息，从而为学生模型提供关系梯度。然后，我们表明，表征的归一化与该投影器的训练动态紧密耦合，这会对学生模型的性能产生很大影响。最后，我们表明，一个简单的软最大值函数可以用来解决任何显著的容量差距问题。在各种基准数据集上的实验结果表明，利用这些见解可以获得优于或相当于最先进知识蒸馏技术的性能，同时计算效率高得多。特别是，我们在图像分类（CIFAR100和ImageNet）、目标检测（COCO2017）以及更困难的蒸馏目标（如训练数据高效的Transformer）上都获得了这些结果，其中我们使用DeiT-Ti在ImageNet上取得了77.2%的top-1准确率。代码和模型已公开。

摘要: In this paper we revisit the efficacy of knowledge distillation as a function matching and metric learning problem. In doing so we verify three important design decisions, namely the normalisation, soft maximum function, and projection layers as key ingredients. We theoretically show that the projector implicitly encodes information on past examples, enabling relational gradients for the student. We then show that the normalisation of representations is tightly coupled with the training dynamics of this projector, which can have a large impact on the student's performance. Finally, we show that a simple soft maximum function can be used to address any significant capacity gap problems. Experimental results on various benchmark datasets demonstrate that using these insights can lead to superior or comparable performance to state-of-the-art knowledge distillation techniques, despite being much more computationally efficient. In particular, we obtain these results across image classification (CIFAR100 and ImageNet), object detection (COCO2017), and on more difficult distillation objectives, such as training data efficient transformers, whereby we attain a 77.2% top-1 accuracy with DeiT-Ti on ImageNet. Code and models are publicly available.

标题: Lightweight Pixel Difference Networks for Efficient Visual Representation Learning

作者: Zhuo Su, Jiehua Zhang, Longguang Wang

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2402.00422v1

GitHub: https://github.com/hellozhuo/pidinet

中文摘要: 最近，人们在开发精度令人满意的轻量级深度神经网络（DNN）方面付出了巨大努力，这使得DNN能够在边缘设备上广泛部署。开发紧凑高效的DNN的核心挑战在于如何平衡高精度和高效率这两个相互竞争的目标。在本文中，我们提出了两种新型卷积，称为像素差分卷积（PDC）和二值PDC（Bi-PDC），它们具有以下优点：能够捕捉高阶局部微分信息，计算效率高，并且能够与现有的DNN集成。基于PDC和Bi-PDC，我们进一步提出了两个轻量级深度网络，分别命名为像素差分网络（PiDiNet）和二值PiDiNet（Bi-PiDiNet），用于为边缘检测和目标识别等视觉任务学习高效且更准确的表示。在流行数据集（BSDS500、ImageNet、LFW、YTF等）上的大量实验表明，PiDiNet和Bi-PiDiNet实现了最佳的精度-效率权衡。对于边缘检测，PiDiNet是第一个无需ImageNet即可训练的网络，并能以100 FPS的速度、少于100万的参数量在BSDS500上达到人类水平的性能。对于目标识别，在现有的二值DNN中，Bi-PiDiNet取得了最佳的准确率，并在ResNet18上将计算成本降低了近2倍。代码可在 https://github.com/hellozhuo/pidinet 获得。

摘要: Recently, there have been tremendous efforts in developing lightweight Deep Neural Networks (DNNs) with satisfactory accuracy, which can enable the ubiquitous deployment of DNNs in edge devices. The core challenge of developing compact and efficient DNNs lies in how to balance the competing goals of achieving high accuracy and high efficiency. In this paper we propose two novel types of convolutions, dubbed Pixel Difference Convolution (PDC) and Binary PDC (Bi-PDC) which enjoy the following benefits: capturing higher-order local differential information, computationally efficient, and able to be integrated with existing DNNs. With PDC and Bi-PDC, we further present two lightweight deep networks named Pixel Difference Networks (PiDiNet) and Binary PiDiNet (Bi-PiDiNet) respectively to learn highly efficient yet more accurate representations for visual tasks including edge detection and object recognition. Extensive experiments on popular datasets (BSDS500, ImageNet, LFW, YTF, etc.) show that PiDiNet and Bi-PiDiNet achieve the best accuracy-efficiency trade-off. For edge detection, PiDiNet is the first network that can be trained without ImageNet, and can achieve the human-level performance on BSDS500 at 100 FPS and with <1M parameters. For object recognition, among existing Binary DNNs, Bi-PiDiNet achieves the best accuracy and a nearly 2× reduction of computational cost on ResNet18. Code available at https://github.com/hellozhuo/pidinet.
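
The abstract does not spell out how a pixel difference convolution is computed. One plausible reading, assumed here purely for illustration, is a "central" variant in which the 3×3 kernel weights are applied to the difference between each neighbor and the center pixel instead of to raw intensities. The sketch below implements that reading in plain Python; the function names and the toy image are hypothetical and this is not the code from the pidinet repository.

```python
def conv2d_valid(image, kernel):
    """Plain 3x3 cross-correlation without padding ('valid' output)."""
    h, w = len(image), len(image[0])
    return [
        [
            sum(kernel[a][b] * image[i + a][j + b]
                for a in range(3) for b in range(3))
            for j in range(w - 2)
        ]
        for i in range(h - 2)
    ]


def central_pixel_difference_conv(image, kernel):
    """Apply kernel weights to (neighbor - center) differences.

    This 'central difference' form is an assumed illustration of the idea.
    """
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - 2):
        row = []
        for j in range(w - 2):
            center = image[i + 1][j + 1]
            row.append(sum(kernel[a][b] * (image[i + a][j + b] - center)
                           for a in range(3) for b in range(3)))
        out.append(row)
    return out


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 2, 1], [0, 1, 0], [-1, 0, 2]]
print(central_pixel_difference_conv(image, kernel))
```

Under this reading the result equals an ordinary convolution minus the center pixel times the sum of the kernel weights, so it can be computed at roughly the cost of a standard convolution, and a constant image region always produces zero response.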

标题: Towards Open Vocabulary Learning: A Survey

作者: Jianzong Wu, Xiangtai Li, Shilin Xu

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2306.15880v4

GitHub: https://github.com/jianzongwu/Awesome-Open-Vocabulary

中文摘要: 在视觉场景理解领域，深度神经网络在分割、跟踪和检测等各种核心任务中取得了令人瞩目的进展。然而，大多数方法都基于闭集假设，这意味着模型只能识别训练集中存在的预定义类别。最近，得益于视觉语言预训练的快速发展，开放词汇设置被提出。这些新方法旨在定位和识别标注标签空间之外的类别。与弱监督和零样本设置相比，开放词汇方法更通用、更实用、更有效。本文对开放词汇学习进行了全面的综述，总结并分析了该领域的最新进展。具体而言，我们首先将其与零样本学习、开集识别和分布外检测等相关概念进行比较。然后，我们回顾了分割和检测中几个密切相关的任务，包括长尾问题、少样本和零样本设置。在方法综述部分，我们首先介绍闭集检测和分割的基础知识作为预备知识。接下来，我们考察使用开放词汇学习的各种场景，归纳常见的设计要素和核心思想。然后，我们在常用的数据集和基准上比较了近期的检测和分割方法。最后，我们总结了关于未来研究方向的见解、问题和讨论。据我们所知，这是第一篇关于开放词汇学习的全面文献综述。我们在 https://github.com/jianzongwu/Awesome-Open-Vocabulary 持续跟踪相关工作。

摘要: In the field of visual scene understanding, deep neural networks have made impressive advancements in various core tasks like segmentation, tracking, and detection. However, most approaches operate on the close-set assumption, meaning that the model can only identify pre-defined categories that are present in the training set. Recently, open vocabulary settings were proposed due to the rapid progress of vision language pre-training. These new approaches seek to locate and recognize categories beyond the annotated label space. The open vocabulary approach is more general, practical, and effective compared to weakly supervised and zero-shot settings. This paper provides a thorough review of open vocabulary learning, summarizing and analyzing recent developments in the field. In particular, we begin by comparing it to related concepts such as zero-shot learning, open-set recognition, and out-of-distribution detection. Then, we review several closely related tasks in the case of segmentation and detection, including long-tail problems, few-shot, and zero-shot settings. For the method survey, we first present the basic knowledge of detection and segmentation in close-set as the preliminary knowledge. Next, we examine various scenarios in which open vocabulary learning is used, identifying common design elements and core ideas. Then, we compare the recent detection and segmentation approaches in commonly used datasets and benchmarks. Finally, we conclude with insights, issues, and discussions regarding future research directions. To our knowledge, this is the first comprehensive literature review of open vocabulary learning. We keep tracing related works at https://github.com/jianzongwu/Awesome-Open-Vocabulary.