dm_control 翻译: Software and Tasks for Continuous Control

dm_control: Software and Tasks for Continuous Control

dm_control:连续控制软件及任务集

文章目录

  • dm_control: Software and Tasks for Continuous Control
  • dm_control:连续控制软件及任务集
    • Abstract
    • 1 Introduction
    • 1 引言
      • 1.1 Software for research
      • 1.1 研究软件
      • 1.2 Tasks
      • 1.2 任务
  • Part I
  • Software Infrastructure
  • 软件基础设施
    • 2 MuJoCo Python interface
    • 2 MuJoCo Python接口
      • 2.1 The Physics class
      • 2.1 物理类
      • 2.2 Interactive Viewer
      • 2.2 交互式查看器
      • 2.3 Wrapper bindings
      • 2.3 包装绑定
    • 3 The PyMJCF library
    • 3 PyMJCF库
      • 3.1 PyMJCF Tutorial
      • 3.1 PyMJCF 教程
      • 3.2 Debugging
      • 3.2 调试
    • 4 Reinforcement learning interface
    • 4 强化学习接口
      • 4.1 Reinforcement Learning API
      • 4.1 强化学习API
        • 4.1.1 The Environment class
        • 4.1.1 环境类
    • 5 The Composer task definition library
    • 5 Composer任务定义库
      • 5.1 The observable module
      • 5.1 观察模块
      • 5.2 The variation module
      • 5.2 变化模块
      • 5.3 The Composer callback lifecycle
      • 5.3 Composer回调生命周期
      • 5.4 Composer tutorial
      • 5.4 Composer教程
  • Part II
  • Tasks
  • 任务
    • 6 The Control Suite
    • 6 控制套件
      • 6.1 Control Suite design conventions
      • 6.1 控制套件设计约定
      • 6.2 Domains and Tasks
      • 6.2 领域和任务
      • 6.3 Additional domains
      • 6.3 附加领域
    • 7 Locomotion tasks
    • 7 运动任务
      • 7.1 Humanoid running along corridor with obstacles
      • 7.1 在带有障碍物的走廊中奔跑的人形机器人
      • 7.2 Maze navigation and foraging
      • 7.2 迷宫导航和觅食
      • 7.3 Multi-Agent soccer
      • 7.3 多智能体足球
    • 8 Manipulation tasks
    • 8 操纵任务
      • 8.1 Studded brick model
      • 8.1 凸起砖模型
      • 8.2 Task descriptions
      • 8.2 任务描述

Abstract

The dm_control software package is a collection of Python libraries and task suites for reinforcement learning agents in an articulated-body simulation. A MuJoCo wrapper provides convenient bindings to functions and data structures. The PyMJCF and Composer libraries enable procedural model manipulation and task authoring. The Control Suite is a fixed set of tasks with standardised structure, intended to serve as performance benchmarks. The Locomotion framework provides high-level abstractions and examples of locomotion tasks. A set of configurable manipulation tasks with a robot arm and snap-together bricks is also included.

dm_control软件包是一系列Python库和任务套件的集合,旨在为关节体仿真中的强化学习智能体提供支持。MuJoCo封装器方便地绑定了函数和数据结构,以便于使用。PyMJCF和Composer库支持程序化的模型操作和任务创作。Control Suite是一组具有固定和标准化结构的任务,用于作为性能测试的基准。Locomotion框架提供了高级抽象概念和运动任务示例。此外,还包括一组可配置的操纵任务,涉及机器人手臂和可拼接的积木。

Figure 1: The Control Suite benchmarking domains, described in Section 6 (see video).

图 1:在第6节中描述的Control Suite基准测试领域(可观看视频了解更多)。

Figure 2: Procedural domains built with the PyMJCF (Sec. 3) and Composer (Sec. 5) task-authoring libraries. Left: Multi-legged creatures from the tutorial in Sec. 3.1. Middle: The “run through corridor” example task from Sec. 7.1. Right: The “stack 3 bricks” example task from Sec. 8.

图 2:使用PyMJCF(第3节介绍)和Composer(第5节介绍)任务创作库生成的程序化领域示例。左图展示了第3.1节教程中的多足生物。中图是第7.1节中的“穿越走廊”任务示例。右图则是第8节中的“堆叠3块积木”任务示例。

1 Introduction

1 引言

Controlling the physical world is an integral part and arguably a prerequisite of general intelligence (Wolpert et al., 2003). Indeed, the only known example of general-purpose intelligence emerged in primates whose behavioural niche was already contingent on two-handed manipulation for millions of years.

控制物理世界不仅是通用智能的核心组成部分,而且很可能是实现通用智能的基础(Wolpert等人,2003年)。事实上,目前已知的通用智能唯一实例出现在灵长类动物中,这些动物的行为生态位依赖于双手操作,这种依赖关系已延续数百万年。

Unlike board games, language and other symbolic domains, physical tasks are fundamentally continuous in state, time and action. Physical dynamics are subject to second-order equations of motion – the underlying state is composed of positions and velocities. Sensory signals (i.e. observations) carry meaningful physical units and vary over corresponding timescales. These properties, along with their prevalence and importance, make control problems a unique subset of general Markov Decision Processes. The most familiar physical control tasks have a fixed subset of degrees of freedom (the body) that are directly actuated, while the rest are unactuated (the environment). Such embodied tasks are the focus of dm_control.

与棋盘游戏、语言和其他符号领域不同,物理任务在状态、时间和行动上具有本质上的连续性。物理动态遵循二阶运动方程——其基础状态由位置和速度构成。感觉信号(即观察数据)携带有意义的物理单位,并且会随着相应的时间尺度变化。这些特性及其普遍性和重要性,使得控制问题成为一般马尔可夫决策过程的一个独特子集。在熟悉的物理控制任务中,有一部分自由度(如身体部分)可以直接驱动,而其他自由度(如环境)则不可驱动。dm_control的关注点正是这类具身任务。

1.1 Software for research

1.1 研究软件

The dm_control package was designed by DeepMind scientists and engineers to facilitate their own continuous control and robotics needs, and is therefore well-suited for research. It is written in Python, exploiting the agile workflow of a dynamic language, while relying on the C-based MuJoCo physics library, a fast and accurate simulator (Erez et al., 2015), itself designed to facilitate research (Todorov et al., 2012). It is composed of the following modules:
dm_control软件包由DeepMind的科学家和工程师设计,旨在满足他们自己的连续控制和机器人技术需求,因此非常适合用于研究目的。它采用Python编写,利用了动态语言的灵活工作流程,同时依赖于基于C的MuJoCo物理库,这是一个快速且准确的模拟器(Erez等人,2015年),本身就是为了促进研究而设计的(Todorov等人,2012年)。该软件包由以下模块组成:

The ctypes-based MuJoCo wrapper (Sec. 2) provides full access to the simulator, conveniently exposing quantities with named indexing. A Python-based interactive visualiser (Sec. 2.2) allows the user to examine and perturb scene elements with a mouse.

  • 基于ctypes的MuJoCo封装器(第2节)提供了对模拟器的完全访问权限,方便地公开了带有命名索引的物理量。基于Python的交互式可视化工具(第2.2节)允许用户使用鼠标检查和干扰场景元素。

The PyMJCF library (Sec. 3) can procedurally assemble model elements and allows the user to configure or randomise parameters and initial states.

  • PyMJCF库(第3节)能够程序化地组装模型元素,并允许用户配置或随机化参数和初始状态。

An environment API that exposes actions, observations, rewards and terminations in a consistent yet flexible manner (Sec. 4).

  • 一个环境API,以一种一致且灵活的方式公开动作、观察、奖励和终止条件(第4节)。

Finally, we combine the above functionality in the high-level task-definition framework Composer (Sec. 5). Amongst other things it provides a Model Variation module (Sec. 5.2) for policy robustification, and an Observable module for delayed, corrupted, and stacked sensor data (Sec. 5.1).

  • 最后,我们在高级任务定义框架Composer(第5节)中整合了上述功能。Composer提供了许多功能,包括用于策略健壮化的模型变化模块(第5.2节),以及用于处理延迟、损坏和堆叠的传感器数据的可观察模块(第5.1节)。

dm_control has been used extensively in DeepMind, serving as a fundamental component of continuous control research. See youtu.be/CMjoiU482Jk for a montage of clips from selected publications.

dm_control已在DeepMind中得到广泛使用,成为连续控制研究的基础组成部分。您可以在youtu.be/CMjoiU482Jk观看由精选论文中的片段剪辑而成的集锦视频,以了解其在实际研究中的应用示例。

1.2 Tasks

1.2 任务

Recent years have seen rapid progress in the application of Reinforcement Learning (RL) to difficult problem domains such as video games (Mnih, 2015). The Arcade Learning Environment (ALE, Bellemare et al. 2012) was and continues to be a vital facilitator of these developments, providing a set of standard benchmarks for evaluating and comparing learning algorithms. Similarly, it could be argued that control and robotics require well-designed task suites as a standardised playing field, where different approaches can compete and new ones can emerge.
近年来,强化学习(RL)在视频游戏等复杂问题领域的应用取得了飞速发展(Mnih,2015年)。街机学习环境(ALE,Bellemare等人,2012年)一直是,并且仍然是这些进展的重要推动力,它提供了一套标准化的基准,用于评估和比较各种学习算法。同样地,可以认为控制和机器人技术领域也需要一套精心设计的任务套件,作为标准化的竞技场,不同的方法可以在这里竞争,新的方法也可以涌现。

The OpenAI Gym (Brockman et al., 2016) includes a set of continuous control domains that have become a popular benchmark in continuous RL (Duan et al., 2016; Henderson et al., 2017). More recent task suites such as Meta-world (Yu et al., 2019), SURREAL (Fan et al., 2018), RLbench (James et al., 2019) and IKEA (Lee et al., 2019), have been published in an attempt to satisfy the demand for tasks suites that facilitate the study of algorithms related to multi-scale control, multitask transfer, and meta learning. dm_control includes its own sets of control tasks, in three categories:
OpenAI Gym(Brockman等人,2016年)包含一组连续控制域,这些域已成为连续RL(Duan等人,2016年;Henderson等人,2017年)中广泛采用的基准。更近期的任务套件,如Meta-world(Yu等人,2019年)、SURREAL(Fan等人,2018年)、RLbench(James等人,2019年)和IKEA(Lee等人,2019年),已经被发布,以满足与多尺度控制、多任务迁移和元学习相关的算法研究的任务套件需求。dm_control包括自己的控制任务集,分为三个类别:

Control Suite
控制套件

The DeepMind Control Suite (Section 6), first introduced in (Tassa et al., 2018), built directly with the MuJoCo wrapper, provides a set of standard benchmarks for continuous control problems. The unified reward structure offers interpretable learning curves and aggregated suite-wide performance measures. Furthermore, we emphasise high-quality, well-documented code using uniform design patterns, offering a readable, transparent and easily extensible codebase.
DeepMind Control Suite(第6节),最初在(Tassa等人,2018年)中提出,是直接利用MuJoCo封装器构建的,为连续控制问题提供了一套标准化基准。统一的奖励结构使得学习曲线具有可解释性,并且能够提供整个套件的聚合性能评估。此外,我们重视使用统一设计模式的优质、文档齐全的代码,以确保代码库的可读性、透明性和易于扩展性。

Locomotion
运动(Locomotion)

The Locomotion framework (Section 7) was inspired by our work in Heess et al. (2017). It is designed to facilitate the implementation of a wide range of locomotion tasks for RL algorithms by introducing self-contained, reusable components which compose into different task variants. The Locomotion framework has enabled a number of research efforts including Merel et al. (2017), Merel et al. (2019b), Merel et al. (2019a) and more recently has been employed to support Multi-Agent domains in Liu et al. (2019), Sunehag et al. (2019) and Banarse et al. (2019).
Locomotion框架(第7节)是在Heess等人(2017年)的研究启发下开发的。它旨在通过引入独立、可重用的组件来简化各种运动任务在强化学习算法中的实现,这些组件可以组合成不同的任务版本。Locomotion框架已经促进了许多研究工作,包括Merel等人(2017年)、Merel等人(2019b)、Merel等人(2019a),并且最近还被用于支持Liu等人(2019年)、Sunehag等人(2019年)和Banarse等人(2019年)研究中的多智能体领域。

Manipulation
操纵

We also provide examples of constructing robotic manipulation tasks (Sec. 8). These tasks involve grabbing and manipulating objects with a 3D robotic arm. The set of tasks includes examples of reaching, placing, stacking, throwing, assembly and disassembly. The tasks are designed to be solved using a simulated 6 degree-of-freedom robotic arm based on the Kinova Jaco (Campeau-Lecours et al., 2017), though their modular design permits the use of other arms with minimal changes. These tasks make use of reusable components such as bricks that snap together, and provide examples of reward functions for manipulation. Tasks can be run using vision, low-level features, or combinations of both.
我们还提供了构建机器人操纵任务(第8节)的示例。这些任务涉及到使用3D机器手臂抓取和操纵物体。任务集合包括伸手、放置、堆叠、投掷、组装和拆卸等多种动作的例子。这些任务被设计为通过基于Kinova Jaco的6自由度模拟机器手臂来解决,尽管它们的模块化设计允许使用其他类型的手臂,只需进行最小的修改。这些任务使用了可重用的组件,如可拼接的积木,并提供了操纵任务的奖励函数示例。任务可以通过使用视觉、低级特征或二者的结合来执行。

Part I

Software Infrastructure

软件基础设施

Sections 2, 3, 4 and 5 include code snippets showing how to use dm_control software. These snippets are collated in a Google Colab notebook: github.com/deepmind/dm_control/tutorial.ipynb
第2节、第3节、第4节和第5节包括了展示如何使用dm_control软件的代码片段。这些片段被整理在一个Google Colab笔记本中:github.com/deepmind/dm_control/tutorial.ipynb

2 MuJoCo Python interface

2 MuJoCo Python接口

The mujoco module provides a general-purpose wrapper of the MuJoCo engine, using Python’s ctypes library to auto-generate bindings to MuJoCo structs, enums and API functions. We provide a brief introductory overview which assumes familiarity with Python; see in-code documentation for more detail.

mujoco模块为MuJoCo引擎提供了一个通用的包装器,利用Python的ctypes库自动生成与MuJoCo的结构体、枚举和API函数的绑定。我们提供了一个简短的入门概述,假设您已经熟悉Python;更多详细信息请参阅代码文档。

MuJoCo physics

MuJoCo物理

MuJoCo (Todorov et al., 2012) is a fast, reduced-coordinate, continuous-time physics engine. It compares favourably to other popular simulators (Erez et al., 2015), especially for articulated, low-to-medium degree-of-freedom regimes (≲ 100) in the presence of contacts. The MJCF model definition format and reconfigurable computation pipeline have made MuJoCo a popular choice for robotics and reinforcement learning research (e.g. Schulman et al. 2015; Heess et al. 2015; Lillicrap et al. 2015; Duan et al. 2016).

MuJoCo(Todorov等人,2012年)是一个快速、降维(reduced-coordinate)、连续时间的物理引擎。与其他流行的模拟器相比(Erez等人,2015年),尤其是在存在接触、关节体、低到中等自由度(≲ 100)的情形下,MuJoCo表现更为出色。MJCF模型定义格式和可重构计算流水线使得MuJoCo成为机器人学和强化学习研究中的一个受欢迎的选择(例如 Schulman等人2015;Heess等人2015;Lillicrap等人2015;Duan等人2016)。

2.1 The Physics class

2.1 物理类

The Physics class encapsulates MuJoCo’s most commonly used functionality.
物理类封装了MuJoCo最常用的功能。

Loading an MJCF model
加载MJCF模型

The Physics.from_xml_string() constructor loads an MJCF model and returns a Physics instance:

Physics.from_xml_string()构造函数用于加载一个MJCF模型,并返回一个物理实例:

MJCF 是 MuJoCo 的 XML 模型文件格式,用于描述仿真场景、物体、关节等参数,以便在 MuJoCo 引擎中进行仿真。

import PIL.Image
from dm_control import mujoco

# 定义了一个包含Mujoco XML代码的多行字符串,描述了一个简单的物理仿真环境。
# 这个环境包括一个具有旋转关节、红色盒子和绿色球的物体。
simple_MJCF = """
<mujoco>
	<worldbody> 
	<light name="top" pos="0 0 1"/>
	<body name="box_and_sphere" euler="0 0 -30">
		<joint name="swing" type="hinge" axis="1 -1 0" pos="-.2 -.2 -.2"/>
		<geom name="red_box" type="box" size=".2 .2 .2" rgba="1 0 0 1"/>
		<geom name="green_sphere" pos=".2 .2 .2" size=".1" rgba="0 1 0 1"/>
	</body>
	</worldbody>
</mujoco>
"""
# 将XML字符串解析为Mujoco Physics对象,这个对象表示了仿真环境的物理状态。
physics = mujoco.Physics.from_xml_string(simple_MJCF)

XML 文件解释 :

<mujoco>: XML的根元素,表示Mujoco仿真模型的开始。
	<worldbody>: 描述了模型的物理世界,包含光源和物体的定义。
	<light name="top" pos="0 0 1"/>: 定义光源,pos三维空间中的坐标。
	<body name="box_and_sphere" euler="0 0 -30">: 定义一个刚体,euler欧拉角。
		<joint name="swing" type="hinge" axis="1 -1 0" pos="-.2 -.2 -.2"/>: 定义了旋转关节,axis轴向,pos位置。
		<geom name="red_box" type="box" size=".2 .2 .2" rgba="1 0 0 1"/>: 定义了几何形状,类型为盒子,size尺寸,颜色为红色。
		<geom name="green_sphere" pos=".2 .2 .2" size=".1" rgba="0 1 0 1"/>: 定义了几何形状,类型为球,pos位置,size尺寸,颜色为绿色。
	</body>: 结束对刚体的定义。
	</worldbody>: 结束对物理世界的定义。
</mujoco>: 结束Mujoco仿真模型的定义。

Rendering
渲染

The Physics.render() method outputs a NumPy array of pixel values.
物理类的Physics.render()方法生成一个包含像素值的NumPy数组。

# 生成当前仿真环境的图像,渲染的图像以像素数组的形式存储在变量pixels中。
pixels = physics.render()
# Display the rendered image in the jupyter notebook
PIL.Image.fromarray(pixels)


Optional arguments to render() specify the resolution, camera ID, whether to render RGB, depth or segmentation images, and other visualisation options (e.g. the joint visualisation on the left). dm_control on Linux supports both OSMesa software rendering and hardware-accelerated rendering using either EGL or GLFW. The rendering backend can be selected by setting the MUJOCO_GL environment variable to glfw, egl, or osmesa, respectively.

可选的render()方法参数包括分辨率、相机ID、是否渲染RGB、深度或分割图像,以及其他可视化选项(例如左侧的关节可视化)。dm_control在Linux上支持OSMesa软件渲染和硬件加速渲染,分别使用EGL或GLFW。通过设置MUJOCO_GL环境变量为glfw、egl或osmesa,可以选择渲染后端。
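
下面是一个示意性的小片段(假设沿用前文创建的 physics 对象;height、width、camera_id、depth 是 render() 的常用可选参数),演示如何指定分辨率、相机以及渲染深度图。渲染后端则通过在启动程序前设置环境变量选择,例如 export MUJOCO_GL=egl:

# 以指定分辨率和相机渲染 RGB 图像;camera_id=-1 表示自由(free)相机。
pixels_rgb = physics.render(height=240, width=320, camera_id=-1)
# depth=True 时返回深度值数组,而不是 RGB 像素。
depth = physics.render(height=240, width=320, depth=True)
print(pixels_rgb.shape, depth.shape)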

Physics.model and Physics.data

MuJoCo’s underlying mjModel and mjData data structures, describing static and dynamic simulation parameters respectively, can be accessed via the model and data properties of Physics. They contain NumPy arrays that have direct, writeable views onto MuJoCo’s internal memory. Because the memory is owned by MuJoCo, an attempt to overwrite an entire array throws an error:
物理类的model和data属性分别提供了对MuJoCo底层mjModel和mjData数据结构的访问,这两个数据结构分别描述了静态和动态模拟参数。它们包含的NumPy数组是直接映射到MuJoCo内部内存的可写视图。由于内存由MuJoCo所有,尝试覆盖整个数组将引发错误:

import numpy as np
# This fails with ‘AttributeError: can’t set attribute‘: 尝试分配一个新的数组
physics.data.qpos = np.random.randn(physics.model.nq)
# This succeeds: 这是因为切片操作实际上是在现有的数组上进行修改,而不是尝试分配一个新的数组。
physics.data.qpos[:] = np.random.randn(physics.model.nq)

mj_step() and Physics.step()

MuJoCo’s top-level mj_step() function computes the next state — the joint-space configuration qpos and velocity qvel — in two stages. Quantities that depend only on the state are computed in the first stage, mj_step1(), and those that also depend on the control (including forces) are computed in the subsequent mj_step2(). Physics.step() calls these sub-functions in reverse order, as follows. Assuming that mj_step1() has already been called, it first completes the computation of the new state with mj_step2(), and then calls mj_step1(), updating the quantities that depend on the state alone. In particular, this means that after a Physics.step(), rendered pixels will correspond to the current state, rather than the previous one. Quantities that depend on forces, like accelerometer and touch sensor readings, are still with respect to the last transition

MuJoCo的顶层mj_step()函数在两个阶段计算下一个状态——关节空间配置qpos和速度qvel。在第一阶段,即mj_step1()中,仅依赖于状态的量被计算,而在随后的mj_step2()中,那些同时依赖于控制(包括力)的量被计算。Physics.step()以相反的顺序调用这些子函数。假设mj_step1()已经被调用,它首先使用mj_step2()完成新状态的计算,然后调用mj_step1(),更新仅依赖于状态的量。这意味着在执行Physics.step()之后,渲染的像素将对应于当前状态,而不是上一个状态。依赖于力的量,如加速度计和触摸传感器的读数,仍然与上一次转换有关。

(由于没有mujoco基础,以下问题答案结合 chatgpt 和个人理解,可能会存在错误)

问题1:mj_step()和Physics.step()区别?

在MuJoCo中,mj_step() 是一个MuJoCo库中的函数,用C语言写的。Physics.step() 是 MuJoCo Python 接口中的一个函数。

问题2:依赖于状态的量和依赖于控制的量区别?

  • 依赖于状态的量: 这些量是与系统的当前状态直接相关的,而与外部输入或控制信号无关。包括关节的位置 qpos 和速度 qvel,质心的位置和速度等,这些变量可以通过对系统当前状态的直接查询或计算得到。
  • 依赖于控制的量:这些量是与外部输入或控制信号直接相关的,其计算可能受到控制输入的影响。在机器人控制中,例如,关节上的施加的力或扭矩就是依赖于控制的量。其他例子可能包括外部的电机命令、力传感器的读数等。

问题3:mj_step1()和mj_step2()为什么要分开计算?

  • 数值稳定性: 在物理模拟中,数值稳定性是一个关键问题。将系统的内部动力学和外部控制的影响分离开来,可以降低数值不稳定性的风险。这种分离使得在mj_step1()阶段可以专注于解决只依赖于当前状态的动力学方程,而不受到控制输入的影响。

  • 模拟的精度:分离内部动力学和外部控制可以提高模拟的精度。在mj_step1()阶段,模拟器可以采用更精确的数值方法来解决只依赖于当前状态的动力学方程,而不会受到外部控制的影响。这有助于确保在没有外部干扰的情况下,系统的内部动力学得到准确模拟。

  • 可扩展性:将内部动力学和外部控制分阶段计算使得物理引擎更加灵活和可扩展。不同的模型可能有不同的内部动力学,但可以共享相同的控制模块。这种设计使得模拟器更易于适应不同的物理系统和控制策略。

虽然两者可以结合起来计算,但分离它们有助于提高模拟的稳定性和精度,并提供更好的可扩展性。

问题4:Physics.step()的分开计算,举个例子

假设有一个物体在MuJoCo中运动,你想要测量它在某一时刻的加速度。我们考虑一个加速度计(accelerometer)来测量加速度。加速度计是基于控制的量。

  • 先执行依赖于上一次状态过渡的控制相关量: Physics.step()会先执行mj_step2(),如在上一次状态中有一个力F作用于物体,质量为m,这个力将影响物体在当前时刻的状态。而当前加速度计 a = F/m 的测量值,就和上一次状态中的力F有关。
  • 然后执行依赖于当前状态的量: 在mj_step1()中,计算了仅依赖于当前状态的量,例如物体在加速度a的情况下,计算当前状态的位置和速度。

Physics.step()之后就是渲染,所以渲染的像素就是当前状态计算之后的结果,而不是上一个状态。

Setting the state with reset_context()
使用reset_context()设置状态

For the above assumption to hold, mj_step1() must always be called after setting the state. We therefore provide the Physics.reset_context(), within which the state should be set:
为了保持上述假设,必须在设置状态之后始终调用mj_step1()。因此,我们提供了Physics.reset_context(),在其中应该设置状态:

with physics.reset_context():
	# mj_reset() is called upon entering the context: default state.
	# 进入上下文时,模拟环境被重置为初始状态
	
	# 设置新的位置和速度
	physics.data.qpos[:] = ... # Set position.
	physics.data.qvel[:] = ... # Set velocity.
# mj_forward() is called upon exiting the context. Now all derived
# quantities and sensor measurements are up-to-date.
# 在退出上下文时调用mj_forward(),它包含mj_step1()并继续计算到加速度(但不推进状态)。
# 此时所有派生量和传感器测量都是最新的,相当于完成了一次完整的初始化。

Note that we call mj_forward() upon exit, which includes mj_step1(), but continues up to the computation of accelerations (but does not increment the state). This is so that force- or acceleration-dependent sensors have sensible values even at the initial state, before any steps have been taken.

注意,我们在退出时调用mj_forward(),它包括mj_step1(),但会一直进行到加速度的计算(但不会增加状态)。这样做是为了确保,在执行任何步骤之前,即使在初始状态,力或加速度依赖的传感器也有合理的值。

Named indexing
命名索引

Everything in a MuJoCo model can be named. It is often more convenient and less error-prone to refer to model elements by name rather than by index. To address this, Physics.named.model and Physics.named.data provide array-like containers that provide convenient named views. Using our simple model above:

通常,通过名称而不是索引来引用MuJoCo模型中的元素更为方便且更少出错。为了实现这一点,Physics.named.model和Physics.named.data提供了类似数组的容器,它们提供方便的命名视图。使用我们上面简单的模型:

# 比较直接访问data属性和使用named属性的差异,后者在检查特定数据时更为方便。
print("The ‘geom_xpos‘ array:")
print(physics.data.geom_xpos)
print("Is much easier to inspect using ‘Physics.named‘:")
print(physics.named.data.geom_xpos)


输出:
The ‘geom_xpos‘ array:
[[0.         0.         0.        ]
 [0.27320508 0.07320508 0.2       ]]
Is much easier to inspect using ‘Physics.named‘:
FieldIndexer(geom_xpos):
                 x         y         z         
0      red_box [ 0         0         0       ]
1 green_sphere [ 0.273     0.0732    0.2     ]

These containers can be indexed by name for both reading and writing, and support most forms of NumPy indexing:
这些容器可以通过名称进行读写索引,并支持NumPy的大部分索引形式:

# 在Mujoco仿真环境中通过修改关节角度来重新设置物体的位置,并输出绿色球的新位置信息。
import numpy as np 
with physics.reset_context():
	# 通过xml文件中关节名称“swing”,设置关节的位置(qpos)为π(弧度),即将关节旋转到π弧度。
	physics.named.data.qpos['swing'] = np.pi
# 获取绿色球的位置信息,'green_sphere' 同样是xml定义的球名称
print(physics.named.data.geom_xpos['green_sphere', ['z']])

输出:[-0.6]

Note that in the example above we use a joint name in order to index into the generalised position array qpos. Indexing into a multi-DoF ball or free joint outputs the appropriate slice. We also provide convenient access to MuJoCo’s mj_id2name and mj_name2id:
注意,在上面的示例中,我们使用关节名称来索引通用位置数组qpos。对一个多自由度球形关节或自由关节进行索引,将输出适当的切片。我们还提供了方便访问MuJoCo的mj_id2name和mj_name2id的方法:

# 通过id访问名称,"geom" 这里是类型,表示是几何体
physics.model.id2name(0, "geom")

输出:'red_box'

2.2 Interactive Viewer

2.2 交互式查看器

The viewer module provides playback and interaction with physical models using mouse input. This type of visual debugging is often critical for cases when an agent finds an “exploit” in the physics.
viewer模块允许用户通过鼠标输入与物理模型进行交互和回放。当智能体在物理引擎中发现可以“钻空子”(exploit)的漏洞时,这种可视化调试往往至关重要。

See the documentation at dm_control/tree/master/dm_control/viewer for a screen capture of the viewer application.

有关查看器应用程序的屏幕截图,请参考 dm_control/tree/master/dm_control/viewer 的文档。

TODO 待学习:https://github.com/google-deepmind/dm_control/tree/main/dm_control/viewer
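
作为补充,下面给出一个最小的示意性用法(假设本地有可用的显示和 OpenGL 环境;viewer.launch() 接受一个环境和一个可选的 policy 回调):

from dm_control import suite
from dm_control import viewer
import numpy as np

env = suite.load(domain_name="cartpole", task_name="swingup")
action_spec = env.action_spec()

# 一个随机策略:输入 TimeStep,返回一个动作。
def random_policy(time_step):
    return np.random.uniform(action_spec.minimum, action_spec.maximum,
                             size=action_spec.shape)

# 启动交互式查看器,可用鼠标拖拽场景元素、施加扰动并回放。
viewer.launch(env, policy=random_policy)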

2.3 Wrapper bindings

2.3 包装绑定

The bindings provide easy access to all MuJoCo library functions and enums, automatically converting NumPy arrays to data pointers where appropriate.
Wrapper 绑定使得访问MuJoCo库的所有函数和枚举变得简单,它能在需要时自动处理NumPy数组与数据指针之间的转换。

# 使用MuJoCo库的底层mjlib模块来将给定的四元数转换为旋转矩阵。
# 四元数和旋转矩阵都是描述三维空间中旋转的数学表达形式,当某个向量和旋转矩阵相乘后,结果是旋转后的向量。
from dm_control.mujoco.wrapper.mjbindings import mjlib
import numpy as np
# 定义一个四元数
quat = np.array((.5, .5, .5, .5))
# 初始化一个全零的9维数组,用于存储旋转矩阵,三维向量旋转矩阵就是3x3,二维向量旋转矩阵是2x2
mat = np.zeros(9)
mjlib.mju_quat2Mat(mat, quat) # 用于将四元数,转换为旋转矩阵,两者是可以相互转换的。
# 打印转换前的四元数
print("MuJoCo converts this quaternion:")
print(quat)
# 打印转换后的旋转矩阵
print("To this rotation matrix:")
print(mat.reshape(3,3))

输出:
MuJoCo converts this quaternion:
[ 0.5 0.5 0.5 0.5]
To this rotation matrix:
[[ 0. 0. 1.]
[ 1. 0. 0.]
[ 0. 1. 0.]]

Enums are exposed as a submodule:
Enums枚举作为子模块:

from dm_control.mujoco.wrapper.mjbindings import enums
print(enums.mjtJoint) # 用于枚举不同类型的关节,包括自由、球、滑块、旋转关节

输出:
mjtJoint(mjJNT_FREE=0, mjJNT_BALL=1, mjJNT_SLIDE=2, mjJNT_HINGE=3)

3 The PyMJCF library

3 PyMJCF库

The PyMJCF library provides a Python object model for MuJoCo's MJCF modelling language, which can describe complex scenes with articulated bodies. The goal of the library is to allow users to interact with and modify MJCF models programmatically using Python, similarly to what the JavaScript DOM does for HTML.

PyMJCF库为MuJoCo的MJCF建模语言提供了一个Python对象模型,该语言可以描述具有关节连接的复杂场景。该库的目标是允许用户使用Python以编程方式与MJCF模型交互并进行修改,类似于JavaScript DOM对HTML的操作。

A key feature of the library is the ability to compose multiple MJCF models into a larger one, while automatically maintaining a consistent, collision-free namespace. Additionally, it provides Pythonic access to the underlying C data structures with the bind() method of mjcf.Physics, a subclass of Physics which associates a compiled model with the PyMJCF object tree.
该库的一个关键特性是能够将多个MJCF模型组合成一个更大的模型,同时自动维护一致的、无碰撞的命名空间。此外,它通过mjcf.Physics的bind()方法提供对底层C数据结构的Python风格访问,mjcf.Physics是Physics的子类,将编译的模型与PyMJCF对象树关联起来。

3.1 PyMJCF Tutorial

3.1 PyMJCF 教程

The following code snippets constitute a tutorial example of a typical use case.
以下代码片段构成了一个典型用例的教程示例。

from dm_control import mjcf
# 用于创建具有两自由度的腿的类,其中包含位置执行器的定义。
class Leg(object):
	""" A 2-DoF leg with position actuators."""
	def __init__(self, length, rgba):
		self.model = mjcf.RootElement()
		# Defaults:
		self.model.default.joint.damping = 2 # 设置关节的阻尼为2。
		self.model.default.joint.type = 'hinge' # 设置关节的类型为“hinge”(旋转关节)
		self.model.default.geom.type = 'capsule' # 设置几何体的类型为“capsule”(胶囊体)
		self.model.default.geom.rgba = rgba # Continued below... # 设置几何体的颜色。
		# Thigh : 大腿部分
		self.thigh = self.model.worldbody.add('body') # 大腿
		self.hip = self.thigh.add('joint', axis=[0, 0, 1]) # 髋关节,指定旋转轴为z轴
		self.thigh.add('geom', fromto=[0, 0, 0, length, 0, 0], size=[length/4])  # # 大腿几何体定义,使用了胶囊体
		# Shin: 小腿部分
		self.shin = self.thigh.add('body', pos=[length, 0, 0]) # 小腿,指定位置
		self.knee = self.shin.add('joint', axis=[0, 1, 0]) # 膝盖关节,指定旋转轴为y轴。
		self.shin.add('geom', fromto=[0, 0, 0, 0, 0, -length], size=[length/5]) # 小腿几何体定义,使用了胶囊体
		# Position actuators: 位置执行器是一种控制机制,它通过调整关节的位置来实现控制。
		self.model.actuator.add('position', joint=self.hip, kp=10) # 髋关节位置执行器,比例增益为10
		self.model.actuator.add('position', joint=self.knee, kp=10) # 膝盖关节位置执行器,比例增益为10

The Leg class describes an abstract articulated leg, with two joints and corresponding proportional-derivative actuators. Note the following.
Leg 类描述了一个抽象的关节连接的腿,具有两个关节和相应的比例-微分执行器。请注意以下几点。

  • MJCF attributes correspond directly to arguments of the add() method.
  • When referencing elements, e.g. when specifying the joint to which an actuator is attached in the last two lines above, the MJCF element itself can be used, rather than its name (though a name string is also supported).

  • MJCF 属性直接对应于 add() 方法的参数。
  • 在引用元素时,例如在上述代码的最后两行中指定执行器连接到哪个关节时,可以使用 MJCF 元素本身,而不仅仅是其名称(尽管名称字符串也受支持)。
# 定义了身体的半径和大小。
BODY_RADIUS = 0.1
BODY_SIZE = (BODY_RADIUS, BODY_RADIUS, BODY_RADIUS / 2)
def make_creature(num_legs):
	"""Constructs a creature with ‘num_legs‘ legs."""
	rgba = np.random.uniform([0, 0, 0, 1], [1, 1, 1, 1]) # 随机生成颜色
	model = mjcf.RootElement() #  创建MuJoCo XML模型的根元素
	model.compiler.angle = 'radian' # Use radians. 设置编译器使用弧度制。
	# Make the torso geom. 在模型的世界部分添加一个椭球形状的身体几何体,表示生物模型的主体部分。
	torso = model.worldbody.add('geom', name='torso', type='ellipsoid', size=BODY_SIZE, rgba=rgba)
	# Attach legs to equidistant sites on the circumference.  
	# 循环添加腿部分
	for i in range(num_legs): 
		theta = 2 * i * np.pi / num_legs # 计算腿的角度,确保它们均匀分布在环上。
		hip_pos = BODY_RADIUS * np.array([np.cos(theta), np.sin(theta), 0]) # 计算每条腿的髋关节位置。
		# 在模型的世界部分添加一个站点,表示髋关节的位置,并设置站点的欧拉角。
		# 站点(site)是模型中的虚拟点,通常用于定义连接约束、观察点或其他用途。
		hip_site = model.worldbody.add('site', pos=hip_pos, euler=[0, 0, theta])
		leg = Leg(length=BODY_RADIUS, rgba=rgba) # 创建一个腿的实例,使用之前定义的Leg类。
		hip_site.attach(leg.model) # 将腿连接到之前定义的站点上,实现腿与主体的连接。
	return model # 返回构建好的生物模型的MuJoCo XML根元素。

The make_creature function uses PyMJCF’s attach() method to procedurally attach legs to the torso. Note that both the torso and hip attachment sites are children of the worldbody, since their parent body has yet to be instantiated. We will now make an arena with a chequered floor and two lights:

make_creature 函数使用了 PyMJCF 的 attach() 方法来动态地将腿附加到躯干上。请注意,躯干和髋关节的附着点都是 worldbody 的子元素,因为它们的父体尚未被实例化。接下来,我们将创建一个带有方格地板和两个灯光的模拟场地:

# 创建了一个简单的仿真环境,包括地面、纹理和两个光源。
arena = mjcf.RootElement() # 创建MuJoCo XML模型的根元素
# 在MJCF的资产(asset)中添加一个2D纹理。这个纹理是一个棋盘图案,具有指定的宽度、高度和两种颜色(rgb1和rgb2)。
checker = arena.asset.add('texture', type='2d', builtin='checker', width=300,
                          height=300, rgb1=[.2, .3, .4], rgb2=[.3, .4, .5])
# 在MJCF的asset中添加一个材质。这个材质使用前面创建的纹理作为贴图,指定了贴图的重复次数(texrepeat)和反射率(reflectance)。
grid = arena.asset.add('material', name='grid', texture=checker, texrepeat=[5,5], reflectance=.2)
# 在模型的世界部分(worldbody)添加一个平面几何体。这个平面具有指定的大小(size)和之前创建的材质(material)。
arena.worldbody.add('geom', type='plane', size=[2, 2, .1], material=grid)
#  在世界部分添加两个光源。每个光源有指定的位置(pos)和方向(dir)。这些光源被放置在地面上方,用于照明仿真场景。
for x in [-2, 2]:
    arena.worldbody.add('light', pos=[x, -1, 3], dir=[-x, 1, -2])

Placing several creatures in the arena, arranged in a grid:
将多个生物放置在场地中,以网格形式排列:

import PIL.Image
# 利用上一步创建的arena仿真环境,继续生成多个具有不同腿数的生物模型,并通过物理仿真和渲染展示它们的状态。
# Instantiate 6 creatures with 3 to 8 legs. 创建6个生物模型,腿的数量分别为3到8条。
creatures = [make_creature(num_legs=num_legs) for num_legs in (3,4,5,6,7,8)]
# Place them on a grid in the arena.  指定生物模型放置的高度。
height = .15
grid = 5 * BODY_RADIUS # 计算一个用于放置生物模型的网格的大小。
xpos, ypos, zpos = np.meshgrid([-grid, 0, grid], [0, grid], [height]) # 创建生物模型在X、Y、Z轴上的位置网格。
for i, model in enumerate(creatures): # 
    # Place spawn sites on a grid. 
    spawn_pos = (xpos.flat[i], ypos.flat[i], zpos.flat[i]) # 获取当前生物模型的放置位置。
    spawn_site = arena.worldbody.add('site', pos=spawn_pos, group=3) #  在场景的世界部分添加站点,用于指定生物模型的生成位置。
    # Attach to the arena at the spawn sites, with a free joint. 
    # 将生物模型通过自由关节(free joint)连接到生成位置的站点上。
    spawn_site.attach(model).add('freejoint')
# Instantiate the physics and render. : 使用MuJoCo创建物理引擎,从场景的MJCF模型中实例化物理引擎。
physics = mjcf.Physics.from_mjcf_model(arena) 
pixels = physics.render() # 进行物理仿真并渲染场景,生成像素数组。
PIL.Image.fromarray(pixels) # notebook中展示用


Multi-legged creatures, ready to roam! Let us inject some controls and watch them move. We will generate a sinusoidal open-loop control signal of fixed frequency and random phase, recording both a video and the horizontal positions of the torso geoms, in order to plot the movement trajectories.
多腿生物,准备好漫游!现在,我们将注入一些控制信号并观察它们的运动。我们将生成一个具有固定频率和随机相位的正弦开环控制信号,并记录视频和躯干几何体的水平位置,以绘制运动轨迹:

%matplotlib inline # 在Jupyter Notebook中使用matplotlib的内联模式,使图形能够直接嵌入到Notebook中。
import matplotlib
import matplotlib.animation as animation # 用于创建动画
import matplotlib.pyplot as plt
from IPython.display import HTML
import PIL.Image

# 用于notebook 展示video用
def display_video(frames, framerate=30): # 接受帧序列和帧率作为输入参数。
    height, width, _ = frames[0].shape # 获取帧的高度和宽度。
    dpi = 70
    orig_backend = matplotlib.get_backend() # 获取当前使用的matplotlib后端。
    matplotlib.use('Agg')  # Switch to headless 'Agg' to inhibit figure rendering. 切换到无界面的'Agg'后端,以抑制图形渲染。
    fig, ax = plt.subplots(1, 1, figsize=(width / dpi, height / dpi), dpi=dpi)
    matplotlib.use(orig_backend)  # Switch back to the original backend. 切换回原始的matplotlib后端
	
	# 设置轴和图像的一些信息
    ax.set_axis_off()
    ax.set_aspect('equal') # 设置图形的纵横比为相等,以避免图像变形。
    ax.set_position([0, 0, 1, 1]) #  设置轴的位置,使其占据整个图形。
    im = ax.imshow(frames[0]) # 在轴上显示第一帧图像
    def update(frame): # 用于更新图像内容。
        im.set_data(frame)
        return [im]
    interval = 1000/framerate # 计算动画帧之间的时间间隔,以控制帧率。
    # 用于生成动画。
    anim = animation.FuncAnimation(fig=fig, func=update, frames=frames,
                                   interval=interval, blit=True, repeat=False)
    return HTML(anim.to_html5_video()) # 将动画转换为HTML5视频格式,并返回HTML对象,以在Notebook中显示动画。

duration = 10   # (Seconds) 设置模拟的时间为10秒。
framerate = 30  # (Hz) 设置动画的帧率为30帧每秒。
video = []
pos_x = []
pos_y = []
# 存储所有生物模型的torso(躯干) geom元素和actuator元素。
# 后面会用torsos生成运动轨迹。actuators 用于控制关节运动。
torsos = []  # List of torso geom elements. 
actuators = []  # List of actuator elements.
# 遍历每个生物模型,将其torso和actuator元素添加到相应的列表中。
for creature in creatures:
    torsos.append(creature.find('geom', 'torso'))
    actuators.extend(creature.find_all('actuator'))

# Control signal frequency, phase, amplitude.设置控制信号的频率、相位和振幅。
freq = 5
phase = 2 * np.pi * np.random.rand(len(actuators))
amp = 0.9

# Simulate, saving video frames and torso locations. 模拟,保存视频帧和torso位置
physics.reset()
while physics.data.time < duration: # 在指定的模拟时间内进行循环。
    # Inject controls and step the physics. 注入控制信号并进行物理仿真步进
    physics.bind(actuators).ctrl = amp * np.sin(freq * physics.data.time + phase)
    physics.step() # 推进物理仿真一步(内部依次调用mj_step2和mj_step1)
    
    # Save torso horizontal positions using bind(). 使用bind()保存torso的水平位置,后续用于绘制运动轨迹
    pos_x.append(physics.bind(torsos).xpos[:, 0].copy())
    pos_y.append(physics.bind(torsos).xpos[:, 1].copy())

  # Save video frames.  在适当的时间间隔内保存渲染的视频帧。
    if len(video) < physics.data.time * framerate:
        pixels = physics.render()
        video.append(pixels.copy())

display_video(video, framerate)

Plotting the movement trajectories, getting creature colours using bind():
绘制运动轨迹,使用 bind() 获取生物的颜色:

# 通过绘制轨迹,可视化模拟过程中每个生物模型 torso geom 元素的运动路径
# 绘制的运动轨迹颜色使用了生物模型相同的颜色
creature_colors = physics.bind(torsos).rgba[:, :3] 
fig, ax = plt.subplots(figsize=(8, 8))
ax.set_prop_cycle(color=creature_colors)
ax.plot(pos_x, pos_y, linewidth=4)


youtu.be/0Lw_77PErjg shows a clip of the locomotion. The plot on the right shows the corresponding movement trajectories of creature positions. Note how physics.bind(torsos) was used to access both xpos and rgba values. Once the Physics had been instantiated by from_mjcf_model(), the bind() method will expose both the associated mjData and mjModel fields of an mjcf element, providing unified access to all quantities in the simulation.

youtu.be/0Lw_77PErjg 显示了运动的片段。图表显示了生物位置的对应运动轨迹。请注意,physics.bind(torsos) 用于获取 xpos 和 rgba 值。一旦通过 from_mjcf_model() 实例化了 Physics,bind() 方法将公开 mjcf 元素的关联 mjData 和 mjModel 字段,提供对模拟中所有量的统一访问。

3.2 Debugging

3.2 调试

In order to aid in troubleshooting MJCF compilation problems on models that are assembled programmatically, PyMJCF implements a debug mode where individual XML elements and attributes can be traced back to the line of Python code that last modified it. This feature is not enabled by default since the tracking mechanism is expensive to run. However when a compilation error is encountered the user is instructed to restart the program with the --pymjcf_debug runtime flag. This flag causes PyMJCF to internally log the Python stack trace each time the model is modified. MuJoCo’s error message is parsed to determine the line number in the generated XML document, which can be used to cross-reference to the XML element that is causing the error. The logged stack trace then allows PyMJCF to report the line of Python code that is likely to be responsible.

为了帮助排查以编程方式组装的模型在MJCF编译时出现的问题,PyMJCF实现了一种调试模式,可以将单个XML元素和属性追溯到最后修改它的Python代码行。由于跟踪机制运行成本较高,此功能默认未启用。当遇到编译错误时,程序会提示用户使用 --pymjcf_debug 运行标志重新启动。此标志使PyMJCF在每次修改模型时在内部记录Python堆栈跟踪。通过解析MuJoCo的错误消息,可以确定生成的XML文档中的行号,进而交叉引用到导致错误的XML元素。记录的堆栈跟踪随后允许PyMJCF报告可能负责的那行Python代码。

Occasionally, the XML compilation error arises from incompatibility between attached models or broken cross-references. Such errors are not necessarily local to the line of code that last modified a particular element. For such a scenario, PyMJCF provides an additional --pymjcf_debug_full_dump_dir flag that causes the entirety of the internal stack trace logs to be written to files at the specified directory.
偶尔,XML编译错误是由于连接的模型之间的不兼容或失效的交叉引用引起的。这类错误不一定局限于最后修改某个元素的那行代码。针对这种情况,PyMJCF提供了额外的 --pymjcf_debug_full_dump_dir 标志,它会把全部内部堆栈跟踪日志写入到指定目录的文件中。

4 Reinforcement learning interface

4 强化学习接口

Reinforcement learning is a computational framework wherein an agent, through sequential interactions with an environment, tries to learn a behaviour policy that maximises future rewards (Sutton and Barto, 2018).
强化学习是一种计算框架,在这种框架中,代理通过与环境的序列交互,试图学习一个最大化未来奖励的行为策略(Sutton和Barto,2018)。

4.1 Reinforcement Learning API

4.1 强化学习API

Environments in dm_control adhere to DeepMind’s dm_env interface, defined in the github.com/deepmind/dm_env repository. In brief, a run-loop using dm_env may look like
dm_control中的环境遵循DeepMind的dm_env接口,该接口在github.com/deepmind/dm_env存储库中定义。简而言之,使用dm_env的运行循环可能如下所示:

for _ in range(num_episodes):
  timestep = env.reset() # 重置环境,获取第一个时间步的状态
  while True:
    action = agent.step(timestep) # agent根据当前时间步的状态选择一个动作
    timestep = env.step(action) # 将选定的动作应用于环境,获取下一个时间步的状态。
    if timestep.last(): # 检查当前时间步是否是回合的最后一个时间步。
      _ = agent.step(timestep)
      break

Each call to an environment’s step() method returns a TimeStep namedtuple with step_type, reward, discount and observation fields. Each episode starts with a step_type of FIRST, ends with a step_type of LAST, and has a step_type of MID for all intermediate timesteps. A TimeStep also has corresponding first(), mid() and last() methods, as illustrated above. Please see the dm_env repository documentation for more details.
每次对环境的 step() 方法的调用都会返回一个带有 step_type、reward、discount 和 observation 字段的 TimeStep 元组。每个 episode 以 FIRST 的 step_type 开始,以 LAST 的 step_type 结束,对于所有中间时间步,都有一个 MID 的 step_type。TimeStep 还具有相应的 first()、mid() 和 last() 方法,如上所示。有关更多详细信息,请参阅 dm_env 存储库的文档。

TODO 待学习:https://github.com/google-deepmind/dm_env

4.1.1 The Environment class
4.1.1 环境类

The class Environment, found within the dm_control.rl.control module, implements the dm_env environment interface:

  • reset() initialises the state, sampling from some initial state distribution.
  • step() accepts an action, advances the simulation by one time-step, and returns a TimeStep namedtuple.
  • action_spec() describes the actions accepted by an Environment. The method returns an ArraySpec, with attributes that describe the shape, data type, and optional lower and upper bounds for the action arrays. For example, random agent interaction can be implemented as shown below.
  • observation_spec() returns an OrderedDict of ArraySpecs describing the shape and data type of each corresponding observation.

A TimeStep namedtuple contains:

  • step_type, an enum with a value in [FIRST, MID, LAST].
  • reward, a floating point scalar, representing the reward from the previous transition.
  • discount, a scalar floating point number γ ∈ [0, 1].
  • observation, an OrderedDict of NumPy arrays matching the specification returned by observation_spec().

Environment 类位于 dm_control.rl.control 模块中,实现了 dm_env 环境接口:

  • reset() 初始化状态,从某个初始状态分布中进行采样。
  • step() 接受一个动作,推进模拟一步时间,并返回一个 TimeStep 元组。
  • action_spec() 描述环境接受的动作。该方法返回一个 ArraySpec,具有描述动作数组的形状、数据类型以及可选的下界和上界的属性。例如,可以将随机代理交互实现为:
spec = env.action_spec() # 包含了有关动作的信息,例如动作的形状、最小值和最大值等
time_step = env.reset()
while not time_step.last():
    action = np.random.uniform(spec.minimum, spec.maximum, spec.shape) # 根据action_spec,生成随机动作
    time_step = env.step(action)
  • observation_spec() 返回一个 OrderedDict,其中包含描述每个相应观测的形状和数据类型的 ArraySpec。

TimeStep 元组包含:

  • step_type,一个枚举,其值在 [FIRST, MID, LAST] 中。
  • reward,一个浮点数标量,表示上一次转换的奖励。
  • discount,一个标量浮点数 γ ∈ [0, 1]。
  • observation,一个与 observation_spec() 返回的规范相匹配的 NumPy 数组 OrderedDict。

Whereas the step_type specifies whether or not the episode is terminating, it is the discount γ that determines the termination type. γ = 0 corresponds to a terminal state as in the first-exit or finite-horizon formulations. A terminal TimeStep with γ = 1 corresponds to the infinite-horizon formulation; in this case an agent interacting with the environment should treat the episode as if it could have continued indefinitely, even though the sequence of observations and rewards is truncated. In this case a parametric value function may be used to estimate future returns.
虽然 step_type 指定了该回合是否正在终止,但是是 discount γ (折扣因子: 影响未来奖励的权重) 决定了终止类型。当 γ = 0 时,对应于终止状态,就像在第一次退出或有限时间概念的情况下一样。当 γ = 1 时,对应于无限时间概念的终止状态;在这种情况下,与环境互动的代理应该将该回合视为如果没有截断的观察和奖励序列,它本可以无限地继续。在这种情况下,可以使用参数值函数来估计未来的回报。

“有限时间概念”
指的是在强化学习中考虑回合有一个固定的时间限制,即回合在一定的步数后会终止。这与无限时间概念形成对比,即回合不受时间限制,可以无限地持续下去。在有限时间概念中,强化学习代理必须在有限的步数内学会执行任务,而在无限时间概念中,代理可以在不受时间限制的情况下尽可能地积累奖励。
假设有一个强化学习任务,代理需要控制一个机器人在一个迷宫中找到目标点。如果任务是有限时间的,那么这个机器人可能只有一定数量的步数(时间步)来寻找目标,然后这个回合就会结束。机器人必须在这有限的步数内找到目标点,否则任务失败。
相反,如果任务是无限时间的,机器人就可以花费任意多的时间来找到目标,而不受时间步数的限制。这样,机器人可以更加从容地探索环境,找到最优的路径。
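
结合上面的区分,下面给出一个示意性的片段(仅为说明,其中 value_fn 是一个假设的价值函数,用于估计被截断的未来回报;env、action 沿用前文的运行循环),展示智能体在回合结束时如何区别对待这两种情况:

timestep = env.step(action)
if timestep.last():
    if timestep.discount == 0.0:
        # γ = 0:真正的终止状态,未来回报为 0,不需要自举(bootstrap)。
        target = timestep.reward
    else:
        # γ = 1:无限时域式的截断,用价值函数估计被截断的未来回报。
        target = timestep.reward + timestep.discount * value_fn(timestep.observation)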

4.2 Reward functions
4.2 奖励函数

Rewards in dm_control tasks are in the unit interval, r(s, a) ∈ [0, 1]. Some tasks have “sparse” rewards, i.e., r(s, a) ∈ {0, 1}. This structure is facilitated by the tolerance() function, see Figure 3. Since terms output by tolerance() are in the unit interval, both averaging and multiplication operations maintain that property, facilitating reward design.
dm_control 任务中的奖励位于单位区间内,即 r(s, a) ∈ [0, 1]。一些任务具有“稀疏”奖励,即 r(s, a) ∈ {0, 1}。此结构由 tolerance() 函数实现,参见图3。由于 tolerance() 输出的项位于单位区间内,因此平均和乘法操作都保持该属性,有助于奖励设计。


Figure 3: The tolerance(x, bounds=(lower, upper)) function returns 1 if x is within the bounds interval and 0 otherwise. If the optional margin argument is given, the output will decrease smoothly with distance from the interval, taking a value of value_at_margin at a distance of margin. Several types of sigmoid-like functions are available by setting the sigmoid argument. Top: Four infinite-support reward functions, for which value_at_margin must be positive. Bottom: Three finite-support reward functions with value_at_margin=0.
图3: tolerance(x, bounds=(lower, upper)) 函数在 x 在区间内时返回1,否则返回0。如果给定了可选的 margin 参数,输出将随着到达间隔的距离平滑减小,并在距离 margin 处取值为 value_at_margin。通过设置 sigmoid 参数,可以使用几种类似 Sigmoid 的函数。顶部:四个具有无限支持的奖励函数,其中 value_at_margin 必须为正。底部:三个有限支持的奖励函数,其中 value_at_margin=0。

x 可以是与任务相关的任何值,比如机器人的位置、目标状态的偏离程度等。tolerance() 函数的目的是帮助设计奖励函数,使得在给定的条件下,奖励值能够在 [0, 1] 范围内合理变化,以便于强化学习算法的训练。
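
下面用 dm_control.utils.rewards 模块中的 tolerance() 给出一个简单示意(其中 height = 0.6 只是一个假设的观测值):

from dm_control.utils import rewards

height = 0.6  # 假设的观测值,例如躯干高度(米)。

# 稀疏奖励:height 落在 [0.8, 1.0] 区间内得 1,否则得 0。
sparse = rewards.tolerance(height, bounds=(0.8, 1.0))

# 平滑奖励:区间外随距离平滑衰减,margin 控制衰减宽度,
# 距离区间 margin 处的取值为 value_at_margin。
shaped = rewards.tolerance(height, bounds=(0.8, 1.0), margin=0.5,
                           value_at_margin=0.1, sigmoid='gaussian')

# 由于各项都在 [0, 1] 内,可直接用乘法或平均来组合多个子奖励。
total = 0.5 * (sparse + shaped)
print(sparse, shaped, total)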

5 The Composer task definition library

5 Composer任务定义库

The Composer framework organises RL environments into a common structure and endows scene elements with optional event handlers. At a high level, Composer defines three main abstractions for task design:

  • composer.Entity represents a reusable self-contained building block that consists of an MJCF model, observables (Section 5.1), and possibly callbacks executed at specific stages of the environment's life time, as detailed in Section 5.3. A collection of entities can be organised into a tree structure by attaching one or more child entities to a parent. The root entity is conventionally referred to as an “arena”, and forms the root of the final, combined MJCF model.
  • composer.Task consists of a tree of composer.Entity objects that occupy the physical scene and provides reward, observation and termination methods. A task may also define callbacks to implement “game logic”, e.g. to modify the scene in response to various events, and provide additional task observables.
  • composer.Environment wraps a composer.Task instance with an RL environment that agents can interact with. It is responsible for compiling the MJCF model, triggering callbacks at appropriate points of an episode (see Section 5.3), and determining when to terminate, either through task-defined termination criteria or a user-defined time limit. It also holds a random number generator state that is used by the callbacks, enabling reproducibility.

Composer框架将强化学习环境组织成一个通用结构,并为场景元素提供可选的事件处理程序。在高层次上,Composer为任务设计定义了三个主要的抽象:

  • composer.Entity 代表一个可重复使用的自包含构建模块,包括一个MJCF模型、可观察项(第5.1节)和可能在环境生命周期的特定阶段执行的回调,详见第5.3节。一组实体可以通过将一个或多个子实体附加到父实体来组织成树结构。根实体通常被称为“场地”(arena),并构成最终组合的MJCF模型的根。

  • composer.Task 由占据物理场景的 composer.Entity 对象树组成,并提供奖励、观察和终止方法。任务还可以定义回调来实现“游戏逻辑”,例如根据各种事件修改场景,并提供额外的任务可观察项。

  • composer.Environment 封装了一个 composer.Task 实例,其中包含代理可以与之交互的RL环境。它负责编译MJCF模型,在回合的适当时刻触发回调(参见第5.3节),并通过任务定义的终止标准或用户定义的时间限制,确定何时终止。它还保存一个由回调使用的随机数生成器状态,实现可重现性。

Section 5.1 describes observable, a Composer module for exposing observations, supporting noise, buffering and delays. Section 5.2 describes variation, a module for implementing model variations. Section 5.3 describes the callbacks used by Composer to implement these and additional user-defined behaviours. A self-contained Composer tutorial follows in Section 5.4

第5.1节描述了 observable,这是一个Composer模块,用于暴露观察结果,支持噪声、缓冲和延迟。第5.2节描述了 variation,一个实现模型变化的模块。第5.3节描述了Composer用于实现这些和其他用户定义行为的回调。第5.4节提供了一个独立的Composer教程。
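
作为一个简短的示意(其中 task 指代任意已构建好的 composer.Task 实例,完整的任务构建过程见 5.4 节的教程),一个任务通常这样被包装成可交互的环境:

from dm_control import composer

# time_limit 以秒为单位;random_state 用于保证回调中随机数的可复现性。
env = composer.Environment(task=task, time_limit=10.0, random_state=42)
timestep = env.reset()
print(env.action_spec())
print(list(timestep.observation.keys()))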

5.1 The observable module

5.1 观察模块

An “observable” represents a quantity derived from the state of the simulation, that may be returned as an observation to the agent. Observables may be bound to a particular entity (e.g. sensors belonging to a robot), or they may be defined at the task level. The latter is often used for providing observations that relate to more than one entity in the scene (e.g. the distance between an end effector of a robot entity and a site on a different entity). A particular entity may define any number of observables (such as joint angles, pressure sensors, cameras), and it is up to the task designer to select which of these should appear in the agent’s observations. Observables can be configured in a number of ways:
  • enabled: (boolean) Whether the observable is computed and returned to the agent. Set to False by default.
  • update_interval: (integer or callable returning an integer) Specifies the interval, in simulation steps, at which the values of the observable will be updated. The last value will be repeated between updates. This parameter may be used to simulate sensors with different sample rates. Sensors with stochastic rates may be modelled by passing a callable that returns a random integer.
  • buffer_size: (integer) Controls the size of the internal FIFO buffer used to store observations that were sampled on previous simulation time-steps. In the default case where no aggregator is provided (see below), the entire contents of the buffer is returned as an observation at each control timestep. This can be used to avoid discarding observations from sensors whose values may change significantly within the control timestep. If the buffer size is sufficiently large, it will contain observations from previous control timesteps, endowing the environment with a simple form of memory.
  • corruptor: (callable) Performs a point-wise transformation of each observation value before it is inserted into the buffer. Corruptors are most commonly used to simulate observation noise.
  • aggregator: (callable or predefined string) Performs a reduction over all of the elements in the observation buffer. For example this can be used to take a moving average over previous observation values.
  • delay: (integer or callable returning an integer) Specifies a delay (in terms of simulation timesteps) between when the value of the observable is sampled, and when it will appear in the observations returned by the environment. This parameter can be used to model sensor latency. Stochastic latencies may be modelled by passing a callable that returns a randomly sampled integer.

“Observable”(可观测量)表示从模拟状态派生的数量,可能作为观察结果返回给代理。可观察可以绑定到特定的实体(例如,属于机器人的传感器),也可以在任务级别定义。后者通常用于提供与场景中多个实体相关的观察结果(例如,机器人实体的末端执行器与另一个实体上的一个位置之间的距离)。特定实体可以定义任意数量的可观察项(例如,关节角度、压力传感器、摄像机),由任务设计者选择哪些应该出现在代理的观察结果中。可观察项可以以多种方式配置:

  • enabled:(布尔值)可观察项是否计算并返回给代理。默认设置为False。

  • update_interval:(整数或返回整数的可调用对象)指定可观察项的值将在模拟步骤中更新的间隔。在更新之间,将重复上一个值。此参数可用于模拟具有不同采样率的传感器。可以通过传递返回随机整数的可调用对象来建模具有随机速率的传感器。

  • buffer_size:(整数)控制用于存储在先前模拟时间步骤上采样的观察的内部FIFO缓冲区的大小。在默认情况下,如果没有提供聚合器(见下文),则缓冲区的整个内容将在每个控制时间步长返回为一个观察。这可以用于避免丢弃来自传感器的观察,其值在控制时间步长内可能会发生显著变化。如果缓冲区大小足够大,它将包含先前控制时间步长的观察,为环境赋予一种简单的记忆形式。

  • corruptor:(可调用对象)在将每个观察值插入缓冲区之前,对每个观察值进行逐点转换。corruptor最常用于模拟观察噪声。

  • aggregator:(可调用对象或预定义的字符串)对观察缓冲区中的所有元素执行缩减操作。例如,这可用于对先前观察值进行移动平均。

  • delay:(整数或返回整数的可调用对象)指定可观察项的值在采样时与出现在环境返回的观察结果中的时间之间的延迟(以模拟时间步长为单位)。此参数可用于建模传感器延迟。可以通过传递返回随机采样整数的可调用对象来建模随机延迟。
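
下面是一个示意性的配置片段,把上面列出的几个属性串起来(假设 creature 是 5.4 节教程中的 Creature 实体,噪声分布的具体参数仅为举例):

from dm_control.composer.variation import distributions, noises

obs = creature.observables.joint_positions
obs.enabled = True          # 启用该可观测量(默认是 False)
obs.update_interval = 2     # 每 2 个仿真步采样一次
obs.buffer_size = 5         # 内部 FIFO 缓冲区保留最近 5 次采样
obs.delay = 1               # 延迟 1 个仿真步,模拟传感器延迟
obs.corruptor = noises.Additive(distributions.Normal(scale=0.01))  # 加性高斯噪声
obs.aggregator = 'mean'     # 对缓冲区内容取平均后作为观测返回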

During each control step the evaluation of observables is optimized such that only callables for observables that can appear in future observations are evaluated. For example, if we have an observable with update_interval=1 and buffer_size=1 then it will only be evaluated once per control step, even if there are multiple simulation steps per control step. This avoids the overhead of computing intermediate observations that would be discarded.

在每个控制步骤中,我们对可观测量的评估进行了优化,只评估可能在未来观测中出现的可观测量。例如,如果某个可观测量的更新间隔update_interval为1,缓冲区buffer_size大小为1,那么即使每个控制步骤中有多个模拟步骤,它也只会在每个控制步骤中评估一次。这样做可以避免计算会被丢弃的中间观测,从而减少开销。

可观察量的评估指的是对于在模拟环境中定义的可观察量(observable)进行计算和获取其值的过程。在强化学习环境中,可观察量是代理(智能体)可以接收并利用的信息,通常是从环境的状态中派生出来的。评估可观察量意味着计算这些信息,并将其提供给代理,以便代理能够做出相应的决策和行动。
在给定的代码上下文中,对可观察量的评估是通过计算观测值并将其提供给代理来实现的。这些可观察量可能涉及模拟环境中的各种信息,如关节角度、传感器读数、物体位置等。评估可观察量的方式可以通过配置参数来调整,例如更新间隔、缓冲区大小、延迟等。这些参数的设置影响着如何从模拟环境中获取和利用观测信息。

5.2 The variation module

5.2 变化模块

To improve the realism of the simulation, it is often desirable to randomise elements of the environment, especially those whose values are uncertain. Stochasticity can be added both to the observables, e.g. sensor noise (the corruptor of the previous section), as well as to the model itself (a.k.a. “domain randomisation”). The latter is a popular method for increasing the robustness of learned control policies. The variation module provides methods to add and configure forms of stochasticity in the Composer framework. The following base API is provided:

  • Variation: The base class. Subclasses should implement the abstract method __call__(self, initial_value, current_value, random_state), which returns a numerical value, possibly depending on an initial_value (e.g. original geom mass) and current_value (e.g. previously sampled geom mass). Variation objects support arithmetic operations with numerical primitives and other Variation objects.
  • MJCFVariator: A class for varying attributes of MJCF elements, e.g. geom size. The MJCFVariator keeps track of initial and current attribute values and passes them to the Variation object. It should be called in the initialize_episode_mjcf stage, before the model is compiled.
  • PhysicsVariator: Similar to MJCFVariator, except for bound attributes, e.g. external forces. Should be called in the initialize_episode stage after the model has been compiled.
  • evaluate: Method to traverse an arbitrarily nested structure of callables or constant values, and evaluate any callables (such as Variation objects).

为了提高模拟的真实感,通常希望随机化环境的元素,特别是那些值不确定的元素。可以在可观察项上添加随机性,例如传感器噪声(前一节中的corruptor),以及模型本身(也称为“领域随机化”)。后者是增加学到的控制策略的鲁棒性的流行方法。variation模块提供了在Composer框架中添加和配置随机性形式的方法。以下是提供的基本API:

  • Variation:基类。子类应该实现抽象方法__call__(self, initial_value, current_value, random_state),该方法返回一个数值,可能取决于initial_value(例如,原始geom质量)和current_value(例如,先前采样的geom质量)。变化对象支持与数值原语和其他变化对象的算术操作。

  • MJCFVariator:用于变化MJCF元素属性的类,例如geom大小。MJCFVariator跟踪初始和当前属性值,并将它们传递给变化对象。应该在编译模型之前的initialize_episode_mjcf阶段调用。

  • PhysicsVariator:类似于MJCFVariator,但用于绑定属性,例如外部力。应该在模型编译后的initialize_episode阶段调用。

  • evaluate:用于遍历任意嵌套结构的可调用对象或常量值,并评估任何可调用对象(如变化对象)的方法。

A number of submodules provide classes for commonly occurring use cases:

  • colors: Used to define variations in different colour spaces, such as RGB, HSV and grayscale.
  • deterministic: Deterministic variations such as constant and fixed sequences of values, in case more control over the exact values is required.
  • distributions: Wraps a number of distributions available in numpy.random as Variation objects. Any distribution parameters passed can themselves also be Variation objects.
  • noises: Used to define additive and multiplicative noise using the distributions mentioned above, e.g. for modelling sensor noise.
  • rotations: Useful for defining variations in quaternion space, e.g. random rotation on a composer.Entity’s pose.

多个子模块提供了用于常见用例的类:

  • colors:用于在不同颜色空间(例如RGB、HSV和灰度)中定义变化。

  • deterministic:确定性变化,例如常量和固定序列的值,以便更精确地控制值。

  • distributions:将numpy.random中的许多分布包装为变化对象。传递的任何分布参数本身也可以是变化对象。

  • noises:用于使用上述分布定义添加和乘法噪声,例如建模传感器噪声。

  • rotations:用于在四元数空间中定义变化,例如对composer.Entity姿势进行随机旋转。
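
下面是一个不依赖具体模型的示意性片段,演示文中提到的两点:Variation 对象支持与数值做算术运算,以及用 evaluate() 对嵌套结构求值(数值 0.8、1.2 等仅为举例):

import numpy as np
from dm_control.composer import variation
from dm_control.composer.variation import distributions

random_state = np.random.RandomState(0)

# Variation 对象可以与普通数值做算术运算,得到新的 Variation。
base_mass = 1.0
mass_variation = base_mass * distributions.Uniform(0.8, 1.2)

# evaluate() 会遍历结构中的 Variation/可调用对象并求值,常量原样返回。
sampled_mass, damping = variation.evaluate(
    (mass_variation, 2.0), random_state=random_state)
print(sampled_mass, damping)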

5.3 The Composer callback lifecycle

5.3 Composer回调生命周期

Figure 4 illustrates the lifecycle of Composer callbacks. These can be organised into those that occur when an RL episode is reset, and those that occur when the environment is stepped. For a given callback, the one that is defined in the Task is executed first, followed by those defined in each Entity following depth-first traversal of the Entity tree starting from the root (which by convention is the arena) and the order in which Entities were attached.
图4说明了Composer回调的生命周期。这些可以分为在reset RL回合时发生的回调和在环境进行步进时发生的回调。对于给定的回调,首先执行在任务中定义的回调,然后是在从根(通常是arena)开始的Entity树的深度优先遍历中定义的每个Entity的回调,以及附加Entity的顺序。
图4:显示Composer回调生命周期的图表。圆角矩形表示任务(Tasks)和实体(Entities)可能实现的回调。蓝色矩形表示内置的Composer操作。

The first of the two callbacks in reset is initialize_episode_mjcf, which allows the MJCF model to be modified between episodes. It is useful for changing quantities that are fixed once the model has been compiled. These modifications affect the generated XML which is then compiled into a Physics instance and passed to the initialize_episode callback, where the initial state can be set.
reset中的两个回调中的第一个是initialize_episode_mjcf,它允许在回合之间修改MJCF模型。它适用于在模型编译后不再更改的量。这些修改会影响生成的XML,然后将其编译成Physics实例并传递给initialize_episode回调,其中可以设置初始状态。

The Environment.step sequence begins at the before_step callback. One key role of this callback is to translate agent actions into the Physics control vector.
Environment.step序列始于before_step回调。此回调的一个关键作用是将代理动作转换为Physics控制向量。

To guarantee stability, it is often necessary to reduce the time-step of the physics simulation. In order to decouple these possibly very small steps and the agent’s control time-step, we introduce a substep loop. Each Physics substep is preceded by before_substep and followed by after_substep. These callbacks are useful for detecting transient events that may occur in the middle of an environment step, e.g. a button press. The internal observation buffers are then updated according to the configured update_interval of each individual Observable, unless the substep happens to be the last one in the environment step, in which case the after_step callback is called first before the final update of the observation buffers. The internal observation buffers are then processed according to the delay, buffer_size, and aggregator settings of each Observable to generate “output buffers” that are returned externally.
为了保证稳定性,通常需要减小物理模拟的时间步长。为了解耦这些可能非常小的步骤和代理的控制时间步长,我们引入了一个子步骤循环。每个Physics子步骤前面是before_substep,后面是after_substep。这些回调对于检测可能发生在环境步骤中间的瞬态事件(例如按钮按下)非常有用。然后,根据每个单独Observable的配置的update_interval,更新内部观察缓冲区,除非子步骤碰巧是环境步骤中的最后一个子步骤,在这种情况下,会先调用after_step回调,然后再更新观察缓冲区。然后,根据每个Observable的delay、buffer_size和aggregator设置,处理内部观察缓冲区,以生成外部返回的“输出缓冲区”。

At the end of each Environment.step, the Task's get_reward, get_discount, and should_terminate_episode callbacks are called in order to obtain the step's reward, discount, and termination status respectively. Usually, these three are not entirely independent of each other, and it is therefore recommended to compute all of them in the after_step callback, cache the values in the Task instance, and return them in the respective callbacks.
在每个Environment.step结束时,调用Task的get_reward、get_discount和should_terminate_episode回调,以便获取步骤的奖励、折扣和终止状态。通常,这三者不是完全独立的,因此建议在after_step回调中计算所有这些,将值缓存到Task实例中,并在相应的回调中返回它们。
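
按照这一建议,下面给出一个极简的任务骨架示意(省略了 root_entity 等其余必需部分;其中“仿真时间超过 5 秒即终止”只是一个假设性的演示条件):

from dm_control import composer

class CachedOutcomeTask(composer.Task):
  """在 after_step 中统一计算并缓存奖励、折扣和终止状态的示意写法。"""

  def after_step(self, physics, random_state):
    self._should_terminate = physics.time() > 5.0   # 假设性的终止条件
    self._reward = 0.0 if self._should_terminate else 1.0
    self._discount = 0.0 if self._should_terminate else 1.0

  def get_reward(self, physics):
    return self._reward

  def get_discount(self, physics):
    return self._discount

  def should_terminate_episode(self, physics):
    return self._should_terminate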

5.4 Composer tutorial

5.4 Composer教程

In this tutorial we will create a task requiring our “creature” from Section 3.1 to press a colour-changing button on the floor with a prescribed force. We begin by implementing our “creature” as a composer.Entity:
The Creature Entity includes generic Observables for joint angles and velocities. Because find_all() is called on the Creature's MJCF model, it will only return the creature's leg joints, and not the “free” joint with which it will be attached to the world. Note that Composer Entities should override the _build and _build_observables methods rather than __init__. The implementation of __init__ in the base class calls _build and _build_observables, in that order, to ensure that the entity's MJCF model is created before its observables. This was a design choice which allows the user to refer to an observable as an attribute (entity.observables.foo) while still making it clear which attributes are observables. The stateful Button class derives from composer.Entity and implements the initialize_episode and after_substep callbacks.

在本教程中,我们将创建一个任务,要求我们在第3.1节中创建的“生物”按照预定的力按下地板上的一个变色按钮。我们首先将我们的“生物”实现为一个composer.Entity:

from dm_control import composer
from dm_control.composer.observation import observable


class Creature(composer.Entity):
  """A multi-legged creature derived from `composer.Entity`."""
  def _build(self, num_legs):
    self._model = make_creature(num_legs) # 创建了生物体的 MJCF 模型。

  def _build_observables(self):
    return CreatureObservables(self)

  @property
  def mjcf_model(self): # 返回生物体的 MJCF 模型。
    return self._model

  @property
  def actuators(self):
    return tuple(self._model.find_all('actuator')) # 返回模型中所有的执行器(actuator)。


# Add simple observable features for joint angles and velocities.
# 表示生物体的可观测特征。
class CreatureObservables(composer.Observables):

  @composer.observable
  def joint_positions(self): # 定义了关节角度的可观测特征。
    all_joints = self._entity.mjcf_model.find_all('joint') # 只返回生物体的关节,不返回其他关节,如世界连接的 free 关节
    return observable.MJCFFeature('qpos', all_joints)

  @composer.observable
  def joint_velocities(self): # 定义了关节速度的可观测特征
    all_joints = self._entity.mjcf_model.find_all('joint')
    return observable.MJCFFeature('qvel', all_joints)

Creature (Entity类)包括用于关节角度和速度的通用Observables。通过在 Creature 的 MJCF 模型上调用 find_all() 方法,仅返回生物体的腿关节,而不包括其与世界连接的“free”关节。请注意,Composer Entities应该重写_build和_build_observables方法,而不是__init__。基类中__init__的实现调用_build和_build_observables,以确保实体的MJCF模型在其可观察项之前创建。这是一个设计选择,允许用户将可观察项引用为属性(entity.observables.foo),同时仍然清楚哪些属性是可观察项。Stateful Button类派生自composer.Entity,并实现initialize_episode和after_substep回调。


NUM_SUBSTEPS = 25  # The number of physics substeps per control timestep.  


class Button(composer.Entity):
  """A button Entity which changes colour when pressed with certain force."""
  def _build(self, target_force_range=(5, 10)):
  #  初始化按钮的属性,包括目标力范围、MuJoCo XML 模型中的几何体、站点和传感器
    self._min_force, self._max_force = target_force_range
    self._mjcf_model = mjcf.RootElement()
    self._geom = self._mjcf_model.worldbody.add(
        'geom', type='cylinder', size=[0.25, 0.02], rgba=[1, 0, 0, 1])
    self._site = self._mjcf_model.worldbody.add(
        'site', type='cylinder', size=self._geom.size*1.01, rgba=[1, 0, 0, 0])
    self._sensor = self._mjcf_model.sensor.add('touch', site=self._site)
    self._num_activated_steps = 0

  def _build_observables(self):
    return ButtonObservables(self)

  @property
  def mjcf_model(self):
    return self._mjcf_model
  # Update the activation (and colour) if the desired force is applied.
  def _update_activation(self, physics): # 在物理仿真中更新按钮的激活状态和颜色。
    current_force = physics.bind(self.touch_sensor).sensordata[0]
    self._is_activated = (current_force >= self._min_force and
                          current_force <= self._max_force)
    physics.bind(self._geom).rgba = (
        [0, 1, 0, 1] if self._is_activated else [1, 0, 0, 1])
    self._num_activated_steps += int(self._is_activated)

  def initialize_episode(self, physics, random_state): # 在每个仿真周期开始时初始化按钮,包括重置奖励和激活步数。
    self._reward = 0.0
    self._num_activated_steps = 0
    self._update_activation(physics)

  def after_substep(self, physics, random_state): # 在每个物理子步骤之后更新按钮的激活状态。
    self._update_activation(physics)

  @property
  def touch_sensor(self): # 返回按钮的触摸传感器
    return self._sensor

  @property
  def num_activated_steps(self): # 返回按钮被激活的步数。
    return self._num_activated_steps

# 定义按钮实体的可观察特性
class ButtonObservables(composer.Observables):
  """A touch sensor which averages contact force over physics substeps."""
  @composer.observable
  def touch_force(self):
    # 按钮受到的触摸力在各物理子步上的平均值作为 Observable 特征暴露。
    return observable.MJCFFeature('sensordata', self._entity.touch_sensor,
                                  buffer_size=NUM_SUBSTEPS, aggregator='mean')

Note how the Button counts the number of sub-steps during which it is pressed with the desired force. It also exposes an Observable of the force being applied to the button, whose value is an average of the readings over the physics time-steps.
请注意,Button 会统计其以目标力被按下的物理子步数。它还暴露了一个表示施加在按钮上的力的 Observable,其值是各物理时间步读数的平均值。

We import some variation modules and an arena factory (see Section 7):
A simple Variation samples the initial position of the target:
我们导入一些variation模块和一个arena工厂(参见第7节):
一个简单的 Variation 用于采样目标(按钮)的初始位置:

import numpy as np

from dm_control.composer import variation
from dm_control.composer.variation import distributions
from dm_control.composer.variation import noises
from dm_control.locomotion.arenas import floors

# 用于在一个半径为 distance 的圆上均匀采样水平点。
# 可以模拟机器人以不同的方向和位置来按压按钮,有助于确保机器人更全面地模拟机器人与按钮交互的情况。
class UniformCircle(variation.Variation):
  """A uniformly sampled horizontal point on a circle of radius `distance`."""
  def __init__(self, distance):
    self._distance = distance
    self._heading = distributions.Uniform(0, 2*np.pi)

  def __call__(self, initial_value=None, current_value=None, random_state=None):
    distance, heading = variation.evaluate(
        (self._distance, self._heading), random_state=random_state)
    return (distance*np.cos(heading), distance*np.sin(heading), 0)
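
A quick sanity check of this Variation (a sketch; it assumes numpy has been imported as np, as in the snippet above):
对这个 Variation 做一个简单的检查(示意性草图;假设已如上方代码片段那样以 np 的名义导入 numpy):

rng = np.random.RandomState(0)
point = UniformCircle(distance=0.6)(random_state=rng)
print(point)  # 半径为 0.6 的圆上的一点,形如 (x, y, 0)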

We will now define the PressWithSpecificForce Task, which combines all the above elements. The __init__ constructor sets up the scene:
Continuing the PressWithSpecificForce Task definition, we now implement our Composer callbacks, including the reward function:

现在我们将定义PressWithSpecificForce任务,将所有上述元素组合在一起。__init__构造函数设置场景:
继续PressWithSpecificForce任务定义,我们现在实现我们的Composer回调,包括奖励函数:

class PressWithSpecificForce(composer.Task):

  def __init__(self, creature):
    self._creature = creature
    self._arena = floors.Floor() # 创建了一个地板作为仿真环境的地面环境
    self._arena.add_free_entity(self._creature) # 将生物体添加到环境中
    self._arena.mjcf_model.worldbody.add('light', pos=(0, 0, 4))
    self._button = Button()
    self._arena.attach(self._button)  # 将按钮添加到环境中

    # Configure initial poses 配置初始姿势
    self._creature_initial_pose = (0, 0, 0.15)
    button_distance = distributions.Uniform(0.5, .75)
    self._button_initial_pose = UniformCircle(button_distance)

    # Configure variators
    self._mjcf_variator = variation.MJCFVariator()
    self._physics_variator = variation.PhysicsVariator()

    # Configure and enable observables 配置并启用可观察量
    pos_corruptor = noises.Additive(distributions.Normal(scale=0.01))
    self._creature.observables.joint_positions.corruptor = pos_corruptor
    self._creature.observables.joint_positions.enabled = True
    vel_corruptor = noises.Multiplicative(distributions.LogNormal(sigma=0.01))
    self._creature.observables.joint_velocities.corruptor = vel_corruptor
    self._creature.observables.joint_velocities.enabled = True
    self._button.observables.touch_force.enabled = True

    def to_button(physics):
      button_pos, _ = self._button.get_pose(physics)
      return self._creature.global_vector_to_local_frame(physics, button_pos)

    self._task_observables = {}
    self._task_observables['button_position'] = observable.Generic(to_button)

    for obs in self._task_observables.values():
      obs.enabled = True

    self.control_timestep = NUM_SUBSTEPS * self.physics_timestep

  @property
  def root_entity(self):
    return self._arena

  @property
  def task_observables(self):
    return self._task_observables

  def initialize_episode_mjcf(self, random_state): # 这个方法在每个仿真周期开始时调用,用于对 MJCF 模型进行一些修改。
    self._mjcf_variator.apply_variations(random_state)

  def initialize_episode(self, physics, random_state): # 在每个仿真周期开始时调用,用于对物理引擎进行一些初始化。
    self._physics_variator.apply_variations(physics, random_state)
    creature_pose, button_pose = variation.evaluate(
        (self._creature_initial_pose, self._button_initial_pose),
        random_state=random_state)
    self._creature.set_pose(physics, position=creature_pose)
    self._button.set_pose(physics, position=button_pose)

  def get_reward(self, physics): # 用于计算任务的奖励。
    return self._button.num_activated_steps / NUM_SUBSTEPS  

Finally, we can instantiate a Creature Entity, pass it to the PressWithSpecificForce constructor to instantiate the task, and expose it as an environment complying with the dm_env.Environment API:

最后,我们可以实例化一个Creature Entity,将其传递给PressWithSpecificForce构造函数以实例化任务,并将其公开为符合dm_env.Environment API的环境:

import numpy as np
import PIL.Image

creature = Creature(num_legs=4)
task = PressWithSpecificForce(creature)
env = composer.Environment(task, random_state=np.random.RandomState(42))

env.reset()
PIL.Image.fromarray(env.physics.render())

Here is our creature with a large red button, waiting to be pressed.

这是我们的生物,带有一个大红色按钮,等待被按下。

在这里插入图片描述
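
Like any dm_env.Environment, the environment can now be stepped. The sketch below drives it with uniform random actions for a fixed number of steps; with the reward defined above, the value grows with the number of physics substeps during which the button has been pressed with the target force:
与任何 dm_env.Environment 一样,现在可以对该环境执行 step。下面的草图用均匀随机动作驱动它固定的步数;按照上面定义的奖励,其数值随按钮以目标力被按压的物理子步数而增长:

action_spec = env.action_spec()
time_step = env.reset()
for _ in range(100):
  action = np.random.uniform(action_spec.minimum, action_spec.maximum,
                             size=action_spec.shape)
  time_step = env.step(action)
print(time_step.reward)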

Part II

Tasks

任务

6 The Control Suite

6 控制套件

The Control Suite is a set of stable, well-tested tasks designed to serve as a benchmark for continuous control learning agents. Tasks are written using the basic MuJoCo interface of Section 2. Standardised action, observation and reward structures make suite-wide benchmarking simple and learning curves easy to interpret. Unlike the more elaborate domains of Sections 7 and 8, Control Suite domains are not meant to be modified, in order to facilitate benchmarking. For more details regarding benchmarking, refer to our original publication (Tassa et al., 2018). A video montage of Control Suite domains can be found at youtu.be/rAai4QzcYbs.
控制套件是一组稳定、经过充分测试的任务,旨在作为连续控制学习代理的基准。任务使用第2节的基本MuJoCo接口编写。标准化的动作、观察和奖励结构使整个套件的基准测试变得简单,并且学习曲线易于解释。与第7和第8节中更为复杂的领域不同,控制套件领域不应该被修改,以便进行基准测试。有关基准测试的更多详细信息,请参阅我们的原始出版物(Tassa等人,2018)。控制套件领域的视频剪辑可以在youtu.be/rAai4QzcYbs找到。

6.1 Control Suite design conventions

6.1 控制套件设计约定

Action: With the exception of the LQR domain (see below), the action vector is in the unit box, i.e., a ∈ A ≡ [−1, 1]^dim(A).
Dynamics: While the state notionally evolves according to a continuous ordinary differential equation ṡ = f_c(s, a), in practice temporal integration is discrete, with some fixed, finite time-step: s_{t+h} = f(s_t, a_t).
Observation: When using the default observations (rather than pixels, see below), all tasks are strongly observable, i.e. the state can be recovered from a single observation. Observation features which depend only on the state (position and velocity) are functions of the current state. Features which are also dependent on controls (e.g. touch sensor readings) are functions of the previous transition.
Reward: Rewards in the Control Suite, with the exception of the LQR domain, are in the unit interval, i.e., r(s, a) ∈ [0, 1]. Some rewards are “sparse”, i.e., r(s, a) ∈ {0, 1}. This structure is facilitated by the tolerance() function (see Figure 3 and the short example below).

动作: 除了LQR领域(见下文),动作向量都在单位盒内,即 a ∈ A ≡ [−1, 1]^dim(A)。
动力学: 虽然状态在概念上按照连续常微分方程 ṡ = f_c(s, a) 演化,但在实践中,时间积分是离散的,使用固定的有限时间步长:s_{t+h} = f(s_t, a_t)。
观察: 当使用默认观察项(而不是像素,见下文)时,所有任务都是强可观测的,即可以从单一观察中恢复状态。仅依赖于状态(位置和速度)的观察特征是当前状态的函数。还依赖于控制的特征(例如触觉传感器读数)是前一次转移的函数。
奖励: 除了LQR领域,控制套件中的奖励都在单位区间内,即 r(s, a) ∈ [0, 1]。一些奖励是“稀疏的”,即 r(s, a) ∈ {0, 1}。这种结构由 tolerance() 函数实现(参见图3和下方的简短示例)。
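
As a brief illustration of tolerance() (a sketch using dm_control.utils.rewards; the bounds and margin values here are arbitrary):
作为 tolerance() 的简要示例(使用 dm_control.utils.rewards 的草图;此处的 bounds 和 margin 取值是任意选择的):

from dm_control.utils import rewards

# With the default margin=0 the reward is sparse: 1 inside `bounds`, 0 outside.
sparse_r = rewards.tolerance(0.95, bounds=(0.9, 1.1))
# With a positive margin the reward decays smoothly outside `bounds`.
smooth_r = rewards.tolerance(0.5, bounds=(0.9, 1.1), margin=0.5)
print(sparse_r, smooth_r)  # 1.0 以及一个位于 (0, 1) 内的值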

Termination and Discount: Control problems are usually classified as finite-horizon, first-exit and infinite-horizon (Bertsekas, 1995). Control Suite tasks have no terminal states or time limit and are therefore of the infinite-horizon variety. Notionally the objective is the infinite-horizon average return lim_{T→∞} (1/T) ∫_0^T r(s_t, a_t) dt, but in practice our agents internally use the discounted formulation ∫_0^∞ e^{−t/τ} r(s_t, a_t) dt, or in discrete time Σ_{i=0}^∞ γ^i r(s_i, a_i), where γ = e^{−h/τ} is the discount factor. In the limit τ → ∞ (equivalently γ → 1), the policies of the discounted-horizon and average-return formulations are identical. All Control Suite tasks, with the exception of LQR (see the note below), return γ = 1 at every step, including on termination.
终止和折扣: 控制问题通常被分为有限时域、首次退出和无限时域三类(Bertsekas,1995)。控制套件任务没有终止状态或时间限制,因此属于无限时域类型。概念上,目标是无限时域平均回报 lim_{T→∞} (1/T) ∫_0^T r(s_t, a_t) dt,但在实践中,我们的智能体内部使用折扣形式 ∫_0^∞ e^{−t/τ} r(s_t, a_t) dt,或离散时间形式 Σ_{i=0}^∞ γ^i r(s_i, a_i),其中 γ = e^{−h/τ} 是折扣因子。在 τ → ∞(等价地 γ → 1)的极限下,折扣回报和平均回报两种形式的策略是相同的。除 LQR 外(见下注),所有控制套件任务在每一步都返回 γ = 1,包括终止时。

Note: The LQR task terminates with γ = 0 when the state is very close to the origin, as a proxy for the exponential convergence of stabilised linear systems.
注:当状态非常接近原点时,LQR 任务以 γ = 0 终止,以此作为稳定线性系统指数收敛的替代。

Evaluation: While agents are expected to optimise for infinite-horizon returns, these are difficult to measure. As a proxy we use fixed-length episodes of 1000 time steps. Since all reward functions are designed so that r ≈ 1 near goal states, learning curves measuring total returns can all have the same y-axis limits of [0, 1000], making them easier to interpret and to average over all tasks. While a perfect score of 1000 is not usually achievable, scores outside the [800, 1000] range can be confidently said to be sub-optimal.
评估: 虽然我们期望代理优化无限时间回报,但这很难测量。作为替代,我们使用固定长度的 1000 个时间步的回合。由于所有奖励函数都设计得使 r 在目标状态附近接近 1,测量总回报的学习曲线都可以具有相同的 y 轴限制 [0, 1000],这样更容易解释并在所有任务上进行平均。虽然通常无法达到完美的 1000 分,但在 [800, 1000] 范围之外的分数可以确定地说是次优的。
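
Concretely, the return of a single fixed-length episode can be accumulated as follows (a sketch with a uniform random policy; a learning agent’s action would replace the random sample):
具体来说,单个固定长度回合的总回报可以按如下方式累加(使用均匀随机策略的草图;学习智能体的动作可以替换这里的随机采样):

import numpy as np
from dm_control import suite

env = suite.load("cartpole", "swingup")
action_spec = env.action_spec()

episode_return = 0.0
time_step = env.reset()
while not time_step.last():  # 控制套件的回合固定为 1000 个时间步。
  action = np.random.uniform(action_spec.minimum, action_spec.maximum,
                             size=action_spec.shape)
  time_step = env.step(action)
  episode_return += time_step.reward
print(episode_return)  # 位于 [0, 1000] 区间内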

Model and Task verification: Verification in this context means making sure that the physics simulation is stable and that the task is solvable:
模型和任务验证:在这个上下文中,验证意味着确保物理模拟是稳定的,任务是可解的:

  • Simulated physics can easily destabilise and diverge, mostly due to errors introduced by time discretisation. Smaller time-steps are more stable, but require more computation per unit of simulation time, so the choice of time-step is always a trade-off between stability and speed (Erez et al., 2015). What’s more, learning agents are very good at discovering and exploiting instabilities.

  • It is surprisingly easy to write tasks that are much easier or harder than intended, that are impossible to solve, or that can be solved by very different strategies than expected (i.e. “cheats”). To prevent these situations, the Atari™ games that make up ALE were extensively tested over more than 10 man-years. However, many continuous control domains cannot be solved by humans with standard input devices, due to the large action space, so a different approach must be taken.

  • 模拟物理系统很容易因为时间离散化引入的错误而变得不稳定并发散。较小的时间步长更稳定,但每个单位的模拟时间需要更多计算,因此选择时间步长总是在稳定性和速度之间进行权衡(Erez等人,2015)。此外,学习代理非常擅长发现和利用不稳定性。

  • 令人惊讶的是,人们很容易写出比预期更容易或更困难的任务、根本无法解决的任务,或者可以用与预期截然不同的策略(即“作弊”)解决的任务。为了防止这些情况,构成ALE的Atari™游戏经过了超过10人年的广泛测试。然而,由于动作空间很大,许多连续控制领域无法由人类使用标准输入设备解决,因此必须采用不同的方法。

In order to tackle both of these challenges, we ran a variety of learning agents (e.g. Lillicrap et al. 2015; Mnih et al. 2016) against all tasks, and iterated on each task’s design until we were satisfied that the physics was stable and non-exploitable, and that the task was solved correctly by at least one agent. Tasks that were solvable were collated into the benchmarking set. Tasks which were not yet solved at the time of development are in the extra set of tasks.
为了解决这两个挑战,我们对所有任务运行了各种学习代理(例如Lillicrap等人,2015; Mnih等人,2016),并对每个任务的设计进行了迭代,直到我们确信物理是稳定且无法利用的,并且至少有一个代理正确解决了任务。可解决的任务被整理到基准集中。在开发时尚未解决的任务属于额外的任务集。

The suite module: To load an environment representing a task from the suite, use suite.load():
套件模块: 要加载表示套件任务的环境,请使用suite.load():

from dm_control import suite
# Load one task:
env = suite.load(domain_name="cartpole", task_name="swingup") # 加载的任务是名为 "swingup" 的 "cartpole" 领域(倒立摆问题)
# Iterate over a task set:
for domain_name, task_name in suite.BENCHMARKING: # 你遍历整个基准集,加载每个任务的环境。
    env = suite.load(domain_name, task_name)

Wrappers can be used to modify the behaviour of environments:
Pixel observations: By default, Control Suite environments return feature observations. The pixel.Wrapper adds or replaces these with images.
可以使用包装器修改环境的行为:
像素观察: 默认情况下,控制套件环境返回特征观察。pixel.Wrapper添加或替换这些观察为图像。

from dm_control.suite.wrappers import pixels
env = suite.load("cartpole", "swingup")
# Replace existing features by pixel observations:
# 该环境是在原始环境 env 的基础上添加了像素观察的包装器。这样,原始环境的其他观察特征被替换为像素观察。
env_only_pixels = pixels.Wrapper(env)
# Pixel observations in addition to existing features.
# 该环境在原始环境 env 的基础上添加了像素观察的包装器,但保留了原始环境的其他观察特征。
# 这意味着现在环境中既有像素观察,又保留了其他原始特征。
env_plus_pixels = pixels.Wrapper(env, pixels_only=False)

Reward visualisation: Models in the Control Suite use a common set of colours and textures for visual uniformity. As illustrated in the video, this also allows us to modify colours in proportion to the reward, providing a convenient visual cue.
奖励可视化: 控制套件中的模型使用一组共同的颜色和纹理,以实现视觉上的一致性。正如视频中所示,这也允许我们按奖励的比例修改颜色,从而提供方便的视觉提示。

# 加载 "fish" 领域的 "swim" 任务,并开启奖励可视化。
# 可视化的奖励信号有助于在图形界面中调试和分析模型的性能。
env = suite.load("fish", "swim", visualize_reward=True)

6.2 Domains and Tasks

6.2 领域和任务

A domain refers to a physical model, while a task refers to an instance of that model with a particular MDP structure. For example, the difference between the swingup and balance tasks of the cartpole domain is whether the pole is initialised pointing downwards or upwards, respectively. In some cases, e.g. when the model is procedurally generated, different tasks might have different physical properties. Tasks in the Control Suite are collated into tuples according to predefined tags. Tasks used for benchmarking are in the BENCHMARKING tuple (Figure 1), while those not used for benchmarking (because they are particularly difficult, or because they do not conform to the standard structure) are in the EXTRA tuple. All suite tasks are accessible via the ALL_TASKS tuple. In the domain descriptions below, names are followed by three integers specifying the dimensions of the state, control and observation spaces: Name (dim(S), dim(A), dim(O)).
领域指的是一个物理模型,而任务是该模型的一个具体实例,具有特定的MDP结构。例如,cartpole领域的swingup和balance任务之间的区别在于摆杆的初始方向是向下还是向上。在一些情况下,例如当模型是程序生成的时,不同的任务可能具有不同的物理属性。在控制套件中,任务根据预定义的标签被整理成元组。用于基准测试的任务位于BENCHMARKING元组中(图1),而不用于基准测试的任务(因为它们特别困难,或者因为它们不符合标准结构)位于EXTRA元组中。所有套件任务都可以通过ALL_TASKS元组访问。在下面的领域描述中,名称后面跟着三个整数,指定状态、控制和观察空间的维度:Name (dim(S), dim(A), dim(O))。
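
These collections are ordinary tuples of (domain_name, task_name) pairs, so they can be inspected directly (a small sketch):
这些集合是由 (domain_name, task_name) 对组成的普通元组,因此可以直接查看(一个小示例):

from dm_control import suite

print(len(suite.BENCHMARKING), len(suite.EXTRA), len(suite.ALL_TASKS))
for domain_name, task_name in suite.ALL_TASKS[:3]:
  print(domain_name, task_name)  # 每个条目都是一个 (domain_name, task_name) 对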

Pendulum (2, 1, 3): The classic inverted pendulum. The torque-limited actuator is 1/6th as strong as required to lift the mass from motionless horizontal, necessitating several swings to swing up and balance. The swingup task has a simple sparse reward: 1 when the pole is within 30° of the vertical position and 0 otherwise.
Pendulum (2, 1, 3): 经典的倒立摆。扭矩限制执行器的强度是将质量从静止的水平位置提起所需强度的1/6,需要多次摆动才能摆起并保持平衡。摆动任务具有简单的稀疏奖励:当杆在垂直位置的30°范围内时为1,否则为0。
在这里插入图片描述

Acrobot (4, 1, 6): The underactuated double pendulum, with torque applied to the second joint. The goal is to swing up and balance. Despite being low-dimensional, this is not an easy control problem. The physical model conforms to (Coulom, 2002) rather than the earlier (Spong, 1995). The swingup and swingup_sparse tasks have smooth and sparse rewards, respectively.
Acrobot (4, 1, 6): 欠驱动的双摆,扭矩施加在第二个关节上。目标是摆起并保持平衡。尽管维度较低,但这并不是一个简单的控制问题。物理模型遵循Coulom(2002)而不是更早的Spong(1995)。swingup和swingup_sparse任务分别具有平滑和稀疏的奖励。
在这里插入图片描述

Cart-pole (4, 1, 5): Swing up and balance an unactuated pole by applying forces to a cart at its base. The physical model conforms to Barto et al. 1983. Four benchmarking tasks: in swingup and swingup_sparse the pole starts pointing down while in balance and balance_sparse the pole starts near the upright position.
Cart-pole (4, 1, 5): 通过对底座上的小车施加力来摆起和平衡一个不可控制的竖直杆。物理模型符合Barto等人的1983年模型。有四个基准任务:在swingup和swingup_sparse中,杆的起始方向向下,而在balance和balance_sparse中,杆的起始位置接近竖直位置。
在这里插入图片描述

Cart-k-pole (2k+2, 1, 3k+2): The cart-pole domain allows more poles, connected serially, to be added procedurally. Two non-benchmarking tasks, two_poles and three_poles, are available.
Cart-k-pole (2k+2, 1, 3k+2): cart-pole领域允许以程序化方式添加更多串联连接的杆。提供了两个非基准任务:two_poles和three_poles。
在这里插入图片描述

Ball in cup (8, 2, 8): A planar ball-in-cup task. An actuated planar receptacle can translate in the vertical plane in order to swing and catch a ball attached to its bottom. The catch task has a sparse reward: 1 when the ball is in the cup, 0 otherwise.
Ball in cup (8, 2, 8): 平面杯中的球任务。一个可驱动的平面容器可以在垂直平面内平移,以摆动并捕捉连接到其底部的球。catch任务具有稀疏奖励:当球在杯中时为1,否则为0。

在这里插入图片描述

Point-mass (4, 2, 4): A planar point mass receives a reward of 1 when within a target at the origin. In the easy task, one of the simplest in the suite, the two actuators correspond to the global x and y axes. In the hard task, the gain matrix from the controls to the axes is randomised for each episode, making it impossible to solve by memoryless agents. This task is not in the benchmarking set.
Point-mass (4, 2, 4): 当平面点质量位于原点处的目标内时获得奖励1。在easy任务(套件中最简单的任务之一)中,两个执行器分别对应全局x轴和y轴。在hard任务中,从控制到各轴的增益矩阵在每个回合中都被随机化,使得无记忆的智能体无法解决该任务。此任务不在基准集中。
在这里插入图片描述

Reacher (4, 2, 6): The simple two-link planar reacher with a randomised target location. The reward is one when the end effector penetrates the target sphere. In the easy task the target sphere is bigger than on the hard task (shown on the left).
Reacher (4, 2, 6): 简单的两链接平面reacher,目标位置随机化。当末端执行器穿过目标球时,奖励为1。在easy任务中,目标球比hard任务中的大(左侧显示)。
在这里插入图片描述

Finger (6, 2, 12): A 3-DoF toy manipulation problem based on (Tassa and Todorov, 2010). A planar ‘finger’ is required to rotate a body on an unactuated hinge. In the turn_easy and turn_hard tasks, the tip of the free body must overlap with a target (the target is smaller for the turn_hard task). In the spin task, the body must be continually rotated.
Finger (6, 2, 12): 基于(Tassa和Todorov,2010)的3自由度玩具操纵问题。需要一个平面“手指”去转动安装在不受驱动铰链上的物体。在turn_easy和turn_hard任务中,自由体的尖端必须与目标重叠(turn_hard任务的目标较小)。在spin任务中,必须持续旋转该物体。

在这里插入图片描述

Hopper (14, 4, 15): The planar one-legged hopper introduced in (Lillicrap et al., 2015), initialised in a random configuration. In the stand task it is rewarded for bringing its torso to a minimal height. In the hop task it is rewarded for torso height and forward velocity.
Hopper (14, 4, 15): 平面单腿跳跃器,由(Lillicrap等人,2015)引入,以随机配置初始化。在stand任务中,它因将躯干提升到最低高度以上而获得奖励。在hop任务中,奖励取决于躯干高度和前进速度。
在这里插入图片描述

Fish (26, 5, 24): A fish is required to swim to a target. This domain relies on MuJoCo’s simplified fluid dynamics. There are two tasks: in the upright task, the fish is rewarded only for righting itself with respect to the vertical, while in the swim task it is also rewarded for swimming to the target.
Fish (26, 5, 24): 需要鱼游到目标。该领域依赖于MuJoCo的简化流体动力学。有两个任务:在upright任务中,仅当鱼相对于垂直方向正确时才奖励;而在swim任务中,还奖励鱼游向目标。
在这里插入图片描述

Cheetah (18, 6, 17): A running planar biped based on (Wawrzyński, 2009). The reward r is linearly proportional to the forward velocity v up to a maximum of 10 m/s, i.e. r(v) = max(0, min(v/10, 1)).
Cheetah (18, 6, 17): 基于(Wawrzyński,2009)的平面两足奔跑动物。奖励r与前进速度v成线性比例,最高为10 m/s,即 r(v) = max(0, min(v/10, 1))。
在这里插入图片描述

Walker (18, 6, 24): An improved planar walker based on the one introduced in (Lillicrap et al., 2015). In the stand task reward is a combination of terms encouraging an upright torso and some minimal torso height. The walk and run tasks include a component encouraging forward velocity.
Walker (18, 6, 24): 基于(Lillicrap等人,2015)引入的改进平面行走者。在stand任务中,奖励是鼓励竖直躯干和一些最小躯干高度的组合。walk和run任务包括鼓励前进速度的组件。
在这里插入图片描述

Manipulator (22, 5, 37): A planar manipulator is rewarded for bringing an object to a target location. In order to assist with exploration, in 10% of episodes the object is initialised in the gripper or at the target. Four manipulator tasks: {bring, insert} × {ball, peg}, of which only bring_ball is in the benchmarking set. The other three are shown below.
Manipulator (22, 5, 37): 平面机械臂因将物体带到目标位置而获得奖励。为了帮助探索,在10%的回合中,物体初始化在夹爪中或目标处。共有四个机械臂任务:{bring, insert} × {ball, peg},其中只有bring_ball在基准集中。其他三个任务如下所示。
在这里插入图片描述

Manipulator extra: insert_ball: place the ball inside the basket. bring_peg: bring the peg to the target peg. insert_peg: insert the peg into the slot. See this video for solutions of insertion tasks.
Manipulator extra: insert_ball: 将球放入篮子中。 bring_peg: 将销带到目标销。 insert_peg: 将销插入插槽。请查看此视频以获取插入任务的解决方案。
在这里插入图片描述

Stacker (6k+16, 5, 11k+26): Stack k boxes. Reward is given when a box is at the target and the gripper is away from the target, making stacking necessary. The height of the target is sampled uniformly from {1, . . . , k}.
Stacker (6k+16, 5, 11k+26): 堆叠k个箱子。当箱子处于目标位置且夹爪远离目标时,奖励会给予,这使得堆叠成为必要。目标的高度从{1, . . . , k}中均匀采样。
在这里插入图片描述

Swimmer (2k+4, k−1, 4k+1): This procedurally generated k-link planar swimmer is based on (Coulom, 2002), but uses MuJoCo’s high-Reynolds fluid drag model. A reward of 1 is provided when the nose is inside the target and decreases smoothly with distance, like a Lorentzian. The two instances provided in the benchmarking set are the 6-link and 15-link swimmers.
Swimmer (2k+4, k−1, 4k+1): 这是一种基于Coulom 2002的程序生成的k链接平面游泳者,但使用了MuJoCo的高雷诺流体阻力模型。当鼻子在目标内时,奖励为1,并且随着距离的增加而平滑减小,就像洛伦兹曲线一样。基准集中提供的两个实例是6链接和15链接的游泳者。
在这里插入图片描述

Humanoid (54, 21, 67): A simplified humanoid with 21 joints, based on the model in (Tassa et al., 2012). Three tasks: stand, walk and run are differentiated by the desired horizontal speed of 0, 1 and 10m/s, respectively. Observations are in an egocentric frame and many movement styles are possible solutions e.g. running backwards or sideways. This facilitates exploration of local optima.
Humanoid (54, 21, 67): 一个简化的具有21个关节的人形机器人,基于(Tassa等人,2012)的模型。有三个任务:站立,行走和奔跑,分别由所需的水平速度为0、1和10m/s区分。观察是在自我中心坐标系中进行的,许多运动风格都是可能的解决方案,例如向后或向侧面奔跑。这有助于探索局部最优解。
在这里插入图片描述

Humanoid_CMU (124, 56, 137): A humanoid body with 56 joints, adapted from (Merel et al., 2017) and based on the ASF model of subject #8 in the CMU Motion Capture Database. This domain has the same stand, walk and run tasks as the simpler humanoid. We include tools for parsing and playback of the CMU MoCap data, see below. A newer version of this model is now available; see Section 7.
Humanoid_CMU (124, 56, 137): 具有56个关节的人形机器人,改编自(Merel等人,2017)并基于CMU Motion Capture数据库中主题#8的ASF模型。此领域与较简单的人形机器人具有相同的stand,walk和run任务。我们提供了解析和播放CMU MoCap数据的工具,参见下文。此模型的更新版本现已推出;请参阅第7节。
在这里插入图片描述

LQR (2n, m, 2n): n masses, of which m (≤ n) are actuated, move on linear joints which are connected serially. The reward is quadratic in the positions and controls. Analytic transition and control-gain matrices are derived, and the optimal policy and value functions are computed in lqr_solver.py using Riccati iterations. Since both controls and reward are unbounded, LQR is not in the benchmarking set.
LQR (2n, m, 2n): n个质点(其中m个被驱动,m ≤ n)在串联连接的线性关节上移动。奖励是位置和控制的二次函数。在lqr_solver.py中解析地推导了转移矩阵和控制增益矩阵,并使用Riccati迭代计算最优策略和值函数。由于控制和奖励都是无界的,LQR不在基准集中。
在这里插入图片描述

Control Suite Benchmarking: Please see the original tech report for the Control Suite (Tassa et al., 2018) for detailed benchmarking results of the BENCHMARKING tasks with several popular Reinforcement Learning algorithms.
Control Suite 基准测试: 请参阅Control Suite的原始技术报告(Tassa等人,2018),了解使用几种流行的强化学习算法在BENCHMARKING任务上的详细基准测试结果。

6.3 Additional domains

6.3 附加领域

CMU Motion Capture Data: We enable humanoid_CMU to be used for imitation learning as in Merel et al. (2017), by providing tools for parsing, conversion and playback of human motion capture data from the CMU Motion Capture Database. The convert() function in the parse_amc module loads an AMC data file and returns a sequence of configurations for the humanoid_CMU model. The example script CMU_mocap_demo.py uses this function to generate a video.
卡内基梅隆大学运动捕捉数据: 我们通过提供解析、转换和播放CMU Motion Capture Database人体运动捕捉数据的工具,使humanoid_CMU可以像Merel等人(2017)那样用于模仿学习。parse_amc模块中的convert()函数加载AMC数据文件,并返回humanoid_CMU模型的一系列位形。示例脚本CMU_mocap_demo.py使用此函数生成视频。

Quadruped (56, 12, 58): The quadruped (Figure 5) has 56 state dimensions. Each leg has 3 actuators, for a total of 12 actions. Besides the basic walk and run tasks on flat ground, in the escape task the quadruped must climb over procedural random terrain using an array of 20 range-finder sensors (Figure 5, middle). In the fetch task (Figure 5, right), the quadruped must run after a moving ball and dribble it to a target at the centre of an enclosed arena. See solutions in youtu.be/RhRLjbb7pBE.
Quadruped (56, 12, 58): 四足机器人(图5)具有56个状态维度。每条腿有3个执行器,共12维动作。除了在平坦地面上的基本行走和奔跑任务外,在escape任务中,四足机器人必须利用由20个测距传感器组成的阵列攀越程序生成的随机地形(图5,中)。在fetch任务中(图5,右),四足机器人必须追逐移动的球,并将其运到封闭竞技场中央的目标处。解决方案视频参见 youtu.be/RhRLjbb7pBE。
在这里插入图片描述

Figure 5: Left: The Quadruped. Middle: In the escape task the Quadruped must escape from random mountainous terrain using its rangefinder sensors. Right: In the fetch task the quadruped must fetch a moving ball and bring it to the red target.
图 5: 左侧:四足机器人。中间:在逃脱任务中,四足机器人必须利用其测距仪传感器从随机生成的崎岖地形中逃离。右侧:在取物任务中,四足机器人必须追逐移动的球并将其带到红色目标区域。

Dog (158, 38, 227): A realistic model of a Pharaoh Dog (Figure 6) was prepared for DeepMind by leo3Dmodels and is made available to the wider research community. The kinematics, skinning weights and collision geometry are created procedurally using PyMJCF. The static model includes muscles and tendon attachment points (Figure 6, right). Including these in the dynamical model using MuJoCo’s support for tendons and muscles requires detailed anatomical knowledge and remains future work.
Dog (158, 38, 227): 逼真的法老猎犬模型(图6),由leo3Dmodels为DeepMind制作,并向更广泛的研究社区开放。该模型的运动学、蒙皮权重和碰撞几何是使用PyMJCF程序化创建的。静态模型包含肌肉和肌腱附着点(图6,右)。利用MuJoCo对肌腱和肌肉的支持将它们纳入动力学模型需要详细的解剖学知识,这仍是未来的工作。
在这里插入图片描述

Figure 6: Left: Dog skeleton, joints visualised as light blue elements. Middle-Left: Collision geometry, overlain with skeleton. Middle-Right: Textured skin, used for visualisation only. Right: Dog muscles, included but not rigged to the skeleton. See youtu.be/i0_OjDil0Fg for preliminary solution of the run and fetch tasks.
图 6: 左侧展示的是狗的骨骼结构,其中关节以浅蓝色显示。中间左侧展示的是碰撞检测的几何形状,它叠加在骨骼之上。中间右侧是用于视觉呈现的纹理皮肤。右侧展示了狗的肌肉组织,虽然已经包含在内,但尚未与骨骼结构相连接。关于机器人进行奔跑和取物任务的初步解决方案,可以通过观看 youtu.be/i0_OjDil0Fg 中的视频来了解。

Rodent (184, 38, 107 + 64×64×3 pixels): In order to better compare learned behaviour with experimental settings common in the life sciences, we have built a model of a rodent. See (Merel et al., 2020) for initial research training a policy to control this model using visual inputs and analysing the resulting neural representations. Related videos of rodent tasks therein: “forage”, “gaps”, “escape” and “two-tap”. The skeleton reference model was made by leo3Dmodels (not included).

Rodent (184, 38, 107 + 64×64×3 像素): 为了更好地将学到的行为与生命科学中常见的实验设置进行比较,我们构建了一个啮齿动物模型。关于使用视觉输入训练控制此模型的策略并分析由此产生的神经表征的初步研究,请参见(Merel et al., 2020)。其中相关的啮齿动物任务视频包括“forage”(觅食)、“gaps”(间隙)、“escape”(逃生)和“two-tap”(两次轻敲)。骨架参考模型由leo3Dmodels制作(未包含在内)。
在这里插入图片描述

Figure 7: Rodent; figure reproduced from Merel et al. 2020. Left: Anatomical skeleton of a rodent (as reference; not part of physical simulation). Middle-Left: Collision geometry designed around the skeleton. Middle-Right: Cosmetic skin to cover the body. Right: Semitransparent visualisation of the three layers overlain.
图 7 展示了一个啮齿动物的模型;该图像是根据 Merel 等人在 2020 年的研究复制的。左侧展示的是啮齿动物的解剖学骨骼,用作参考但不在物理模拟中实际使用。中间左侧显示的是为碰撞检测设计的几何形状,它紧密贴合在骨骼周围。中间右侧是用于外观展示的皮肤模型,它覆盖在身体表面。右侧则是将这三层(解剖骨骼、碰撞几何体和装饰性皮肤)以半透明方式叠加在一起的视觉效果。

7 Locomotion tasks

7 运动任务

Inspired by our early work in Heess et al. 2017, the Locomotion library provides a framework and a set of high-level components for creating rich locomotion-related task domains. The central abstractions are the Walker, an agent-controlled Composer Entity that can move itself, and the Arena, the physical environment in which behaviour takes place. Walkers expose locomotion-specific methods, like observation transformations into an egocentric frame, while Arenas can re-scale themselves to fit Walkers of different sizes. Together with the Task, which includes a specification of episode initialisation, termination, and reward logic, a full RL environment is specified. The library currently includes navigating a corridor with obstacles, foraging for rewards in a maze, traversing rough terrain and multi-agent soccer. Many of these tasks were first introduced in Merel et al. 2019a and Liu et al. 2019.
受Heess等人2017年早期工作的启发,Locomotion库提供了一个框架和一组高级组件,用于创建丰富的运动相关任务领域。其核心抽象是Walker(一个可以自行移动、由智能体控制的Composer Entity)和Arena(行为发生的物理环境)。Walker公开了特定于运动的方法,例如将观测变换到自我中心坐标系;Arena则可以重新缩放自身以适应不同尺寸的Walker。再加上Task(其中包含回合初始化、终止和奖励逻辑的规范),就构成了一个完整的RL环境。该库目前包括在有障碍物的走廊中导航、在迷宫中觅食、穿越崎岖地形以及多智能体足球等任务。其中许多任务最早出现在Merel等人2019a和Liu等人2019的论文中。

7.1 Humanoid running along corridor with obstacles

7.1 在带有障碍物的走廊中奔跑的人形机器人

As an illustrative example of using the Locomotion infrastructure to build an RL environment, consider placing a humanoid in a corridor with walls, and a task specifying that the humanoid will be rewarded for running along this corridor, navigating around the wall obstacles using vision. We instantiate the environment as a composition of the Walker, Arena, and Task as follows. First, we build a position-controlled CMU humanoid walker.
作为使用Locomotion基础设施构建RL环境的说明性示例,考虑将一个人形机器人放置在带有墙壁的走廊中,并指定这样一个任务:人形机器人因沿走廊奔跑并利用视觉绕过墙壁障碍物而获得奖励。我们将环境实例化为Walker、Arena和Task的组合,具体如下。首先,我们构建一个位置控制的CMU人形机器人行走者。

from dm_control.locomotion.walkers import cmu_humanoid

walker = cmu_humanoid.CMUHumanoidPositionControlledV2020(
    observable_options={'egocentric_camera': dict(enabled=True)})

Note that this CMU humanoid is “V2020”, an improved version from the initial one released in Tassa et al. (2018). Modifications include overall body height and mass better matching a typical human, more realistic body proportions, and better tuned gains and torque limits for position-control actuators.
请注意,这个CMU人形机器人是“V2020”,是Tassa等人(2018年)发布的初始版本的改进版。修改包括整体身高和质量更符合典型人类,更现实的身体比例以及为位置控制执行器调整得更好的增益和扭矩限制。
在这里插入图片描述

Figure 8: A perspective of the environment in which the humanoid is tasked with navigating around walls along a corridor.
图 8: 环境的一个透视视图,人形机器人的任务是在走廊中绕过墙壁障碍进行导航。

Next, we construct a corridor-shaped arena that is obstructed by walls.
接下来,我们构建一个由墙壁阻挡的走廊形状的场地。

from dm_control.composer.variation import distributions
from dm_control.locomotion.arenas import corridors as arenas

arena = arenas.WallsCorridor(wall_gap=3.,
                             wall_width=distributions.Uniform(2., 3.),
                             wall_height=distributions.Uniform(2.5, 3.5),
                             corridor_width=4.,
                             corridor_length=30.)

Finally, a task that rewards the agent for running down the corridor at a specific velocity is instantiated as a composer.Environment.
最后,一个奖励代理以特定速度沿走廊奔跑的任务被实例化为composer.Environment。

from dm_control import composer
from dm_control.locomotion.tasks import corridors as tasks

task = tasks.RunThroughCorridor(walker=walker,
                                arena=arena,
                                walker_spawn_position=(0.5, 0, 0),
                                target_velocity=3.0,
                                physics_timestep=0.005,
                                control_timestep=0.03)
environment = composer.Environment(time_limit=10,
                                   task=task,
                                   strip_singleton_obs_buffer_dim=True)

youtu.be/UfSHdOg-bOA shows a video of a solution of this task, produced with Abdolmaleki et al. 2018’s MPO agent.
youtu.be/UfSHdOg-bOA 展示了使用Abdolmaleki等人(2018年)的MPO代理生成的此任务的解决方案的视频。

7.2 Maze navigation and foraging

7.2 迷宫导航和觅食

We include a procedural maze generator for arenas (the same one used in Beattie et al. 2016), to construct navigation and foraging tasks. On the right is the CMU humanoid in a human-sized maze, with spherical rewarding elements. youtu.be/vBIV1qJpJK8 shows the Rodent navigating a rodent-scale maze arena.

我们在环境中包含了一个程序生成的迷宫生成器(与Beattie等人2016年使用的相同),用于构建导航和觅食任务。右侧是在人类大小的迷宫中的CMU人形机器人,周围有球形的奖励元素。youtu.be/vBIV1qJpJK8 展示了啮齿动物在啮齿动物尺度的迷宫场地中导航的视频。
在这里插入图片描述

7.3 Multi-Agent soccer

7.3 多智能体足球

在这里插入图片描述

Figure 9: Rendered scenes of Locomotion multi-agent soccer Left: 2-vs-2 with BoxHead walkers. Right: 3-vs-3 with Ant.
图 9 展示了多智能体运动足球的渲染场景。左侧是 2 对 2 的比赛,其中智能体采用了 BoxHead 形状的行走者。右侧是 3 对 3 的比赛,智能体则采用了蚂蚁形状。

Building on the Composer and Locomotion libraries, the multi-agent soccer environments, introduced in Liu et al. 2019, follow the same task structure of Walkers, Arena, and Task, except that instead of a single walker we inject multiple walkers that can interact with each other physically in the same scene. The code snippet below shows how to instantiate a 2-vs-2 multi-agent soccer environment with the simple, 5 degree-of-freedom BoxHead walker type. It can trivially be replaced by, for example, WalkerType.ANT, as shown in Figure 9.

在Composer和Locomotion库的基础上,引入了在Liu等人2019年提出的多智能体足球环境,其遵循了Walkers、Arena和Task的一致任务结构,其中我们注入了多个Walker,它们可以在同一场景中进行物理交互。下面的代码片段展示了如何使用简单的5自由度BoxHead walker类型实例化2对2的多智能体足球环境。例如,可以轻松地将其替换为WalkerType.ANT,如图9所示。

from dm_control.locomotion import soccer
team_size = 2
num_walkers = 2 * team_size
env = soccer.load(team_size=team_size,
				  time_limit=45,
				  walker_type=soccer.WalkerType.BOXHEAD)

To implement a synchronous multi-agent environment, we adopt the convention that each TimeStep contains a sequence of per-agent observation dictionaries and expects a sequence of per-agent action arrays in return.

为了实现同步的多智能体环境,我们采用了这样的约定,即每个TimeStep包含一系列每个代理的观察字典,并期望返回一系列每个代理的动作数组。

assert len(env.action_spec()) == num_walkers
assert len(env.observation_spec()) == num_walkers

# Reset and initialize the environment.
timestep = env.reset()

# Generates a random action according to the `action_spec`.
random_actions = [spec.generate_value() for spec in env.action_spec()]
timestep = env.step(random_actions)

# Check that timestep respects multi-agent action and observation convention.
assert len(timestep.observation) == num_walkers
assert len(timestep.reward) == num_walkers

8 Manipulation tasks

8 操纵任务

The manipulation module provides a robotic arm, a set of simple objects, and tools for building reward functions for manipulation tasks. Each example environment comes in two different versions that differ in the types of observation available to the agent:
操纵模块提供了一个机械臂、一组简单的物体以及构建操纵任务的奖励函数的工具。每个示例环境都有两个不同版本,代理可以使用不同类型的观察来感知环境:
1. features
• Arm joint positions, velocities, and torques.
• Task-specific privileged features (including the positions and velocities of other movable objects in the scene).

机械臂关节的位置、速度和扭矩。
任务特定的特权特征(包括场景中其他可移动对象的位置和速度)。

2. vision
• Arm joint positions, velocities, and torques.
• Fixed RGB camera view showing the workspace.

机械臂关节的位置、速度和扭矩。
显示工作区的固定RGB相机视图。
在这里插入图片描述

Figure 10: Randomly sampled initial configurations for the lift_brick, place_cradle, and stack_3_bricks environments (left to right). The rightmost panel shows the corresponding 84x84 pixel visual observation returned by stack_3_bricks_vision. Note the stack of three translucent bricks to the right of the workspace, representing the goal configuration.

图10: lift_brick、place_cradle 和 stack_3_bricks 环境的随机采样初始配置(从左到右)。最右边的面板显示了 stack_3_bricks_vision 返回的相应 84x84 像素视觉观察。请注意工作区右侧由三块半透明砖块组成的堆叠,它代表目标配置。

All of the manipulation environments return a reward r(s, a) ∈ [0, 1] per timestep, and have an episode time limit of 10 seconds. The following code snippet shows how to import the manipulation tasks and view all of the available environments:
所有操纵环境都返回每个时间步的奖励 r(s, a) ∈ [0, 1],并且具有10秒的回合时间限制。以下代码片段展示了如何导入操纵任务并查看所有可用的环境:

from dm_control import manipulation
# `ALL` is a tuple containing the names of all of the environments.
print('\n'.join(manipulation.ALL))

输出:
stack_2_bricks_features
stack_2_bricks_vision
stack_2_bricks_moveable_base_features
stack_2_bricks_moveable_base_vision
stack_3_bricks_features
stack_3_bricks_vision
stack_3_bricks_random_order_features
stack_2_of_3_bricks_random_order_features
stack_2_of_3_bricks_random_order_vision
reassemble_3_bricks_fixed_order_features
reassemble_3_bricks_fixed_order_vision
reassemble_5_bricks_random_order_features
reassemble_5_bricks_random_order_vision
lift_brick_features
lift_brick_vision
lift_large_box_features
lift_large_box_vision
place_brick_features
place_brick_vision
place_cradle_features
place_cradle_vision
reach_duplo_features
reach_duplo_vision
reach_site_features
reach_site_vision

Environments are also tagged according to what types of observation they return. get_environments_by_tag lists the names of environments with specific tags:
环境还根据它们返回的观察类型进行了标记。get_environments_by_tag 列出了带有特定标记的环境的名称:

print('\n'.join(manipulation.get_environments_by_tag('vision')))

Environments are instantiated by name using the load method, which also takes an optional seed argument that can be used to seed the random number generator used by the environment.
环境通过名称使用 load 方法实例化,该方法还接受一个可选的种子参数,可用于为环境使用的随机数生成器提供种子。

env = manipulation.load('stack_3_bricks_vision', seed=42)
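
Once loaded, the environment follows the same dm_env.Environment API as the rest of dm_control. The sketch below steps reach_duplo_features with random actions and checks that each per-step reward lies in [0, 1]:
加载后,该环境遵循与dm_control其余部分相同的dm_env.Environment API。下面的草图用随机动作对reach_duplo_features执行step,并检查每一步的奖励都位于[0, 1]内:

import numpy as np

env = manipulation.load('reach_duplo_features', seed=42)
action_spec = env.action_spec()

time_step = env.reset()
while not time_step.last():  # 回合时长上限为 10 个仿真秒。
  action = np.random.uniform(action_spec.minimum, action_spec.maximum,
                             size=action_spec.shape)
  time_step = env.step(action)
  assert 0.0 <= time_step.reward <= 1.0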

8.1 Studded brick model

8.1 凸起砖模型

The stack_3_bricks_vision task, like most of the included manipulation tasks, makes use of the studded bricks shown on the right. These were modelled on Lego Duplo® bricks, snapping together when properly aligned and force is applied, and holding together using friction.

stack_3_bricks_vision 任务,就像包含的大多数操纵任务一样,使用了右侧显示的凸起砖块。这些砖块是基于乐高 Duplo® 砖块建模的,在正确对齐并施加力时会咔哒一声连接在一起,并通过摩擦力保持在一起。
在这里插入图片描述

8.2 Task descriptions

8.2 任务描述

Brief descriptions of each task are given below:
• reach_site: Move the end effector to a target location in 3D space.
将末端执行器移动到三维空间中的目标位置。
• reach_brick: Move the end effector to a brick resting on the ground.
将末端执行器移动到静置在地面上的砖块上。
• lift_brick: Elevate a brick above a threshold height.
将砖块提升到一个阈值高度以上。
• lift_large_box: Elevate a large box above a threshold height. The box is too large to be grasped by the gripper, requiring non-prehensile manipulation.
将大箱子提升到一个阈值高度以上。这个箱子太大了,无法被夹具抓住,需要使用非抓持操纵。
• place_cradle: Place a brick inside a concave ‘cradle’ situated on a pedestal.
将砖块放在位于基座上的凹陷“摇篮”中。
• place_brick: Place a brick on top of another brick that is attached to the top of a pedestal. Unlike the stacking tasks below, the two bricks are not required to be snapped together in order to obtain maximum reward.
将砖块放在附着在基座顶部的另一块砖上。与下面的堆叠任务不同,为了获得最大奖励,这两个砖块不需要咔哒一声连接在一起。
• stack_2_bricks: Snap together two bricks, one of which is attached to the floor.
将两块砖块咔哒一声连接在一起,其中一块附着在地板上。
• stack_2_bricks_moveable_base: Same as stack_2_bricks, except both bricks are movable.
与 stack_2_bricks 相同,只是两个砖块都是可移动的。
• stack_2_of_3_bricks_random_order: Same as stack_2_bricks, except there is a choice of two color-coded movable bricks, and the agent must place the correct one on top of the fixed bottom brick. The goal configuration is represented by a visual hint consisting of a stack of translucent, contactless bricks to the side of the workspace. In the features version of the task the observations also contain a vector of indices representing the desired order of the bricks.
与 stack_2_bricks 相同,只是有两块彩色的可移动砖块可供选择,代理必须将正确的砖块放在固定的底部砖块上。目标配置由一组半透明的、无接触的砖块组成,位于工作区的一侧。在任务的 features 版本中,观察还包含表示砖块期望顺序的索引向量。
• stack_3_bricks: Assemble a tower of 3 bricks. The bottom brick is attached to the floor, whereas the other two are movable. The top two bricks must be assembled in a specific order.
组装一个由3块砖块组成的塔。底部的砖块附着在地板上,而其他两块是可移动的。必须以特定顺序组装顶部两块砖块。
• stack_3_bricks_random_order: Same as stack_3_bricks, except the order of the top two bricks does not matter.
与 stack_3_bricks 相同,只是顶部两块砖块的顺序无关紧要。
• reassemble_3_bricks_fixed_order: The episode begins with all three bricks already assembled in a stack, with the bottom brick being attached to the floor. The agent must disassemble the top two bricks in the stack, and reassemble them in the opposite order.
回合开始时,三块砖块已经组装成一摞,底部的砖块附着在地板上。智能体必须拆开堆叠中最上面的两块砖,并以相反的顺序重新组装它们。
• reassemble_5_bricks_random_order: Same as the previous task, except there are 5 bricks in the initial stack. There are therefore 4! − 1 possible alternative configurations in which the top 4 bricks in the stack can be reassembled, of which only one is correct.
与前一个任务相同,只是初始堆栈中有5块砖块。因此,顶部4块砖块可以以4! − 1 种可能的替代配置重新组装,其中只有一种是正确的。

9 Conclusion
9 结论

dm_control is a starting place for the testing and performance comparison of reinforcement learning algorithms for physics-based control. It offers a wide range of pre-designed RL tasks and a rich framework for designing new ones. We are excited to be sharing these tools with the wider community and hope that they will be found useful. We look forward to the diverse research the Control Suite and associated libraries may enable, and to integrating community contributions in future releases.

dm_control是测试和性能比较基于物理的控制强化学习算法的起点。它提供了各种预设计的强化学习任务和一个设计新任务的丰富框架。我们很高兴与更广泛的社区分享这些工具,并希望它们会被发现有用。我们期待着Control Suite和相关库可能促成的多样化研究,并期待在未来的发布中整合社区的贡献。

10 Acknowledgements

We would like to thank Raia Hadsell, Yori Zwols and Joseph Modayil for their reviews; Yazhe Li and Diego de Las Casas for their help with the Control Suite; Ali Eslami and Guy Lever for their contributions to the soccer environment.

我们要感谢Raia Hadsell、Yori Zwols和Joseph Modayil对本文的审查;感谢Yazhe Li和Diego de Las Casas在Control Suite方面的帮助;感谢Ali Eslami和Guy Lever对足球环境的贡献。
