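onav_rim reproduction notes

Goal
Reproduction steps
- Clone the project and create the environment
- Build habitat-sim from source
- Install CLIP from GitHub
- Finishing the environment setup
- Downloading the datasets
- Model evaluation
- Open questions
- Training
- Training the model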

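Goal

My last attempt, reproducing one4all, failed, but I still want to find out whether my laptop can handle an end-to-end visual navigation task. This time I came across the paper "Object Goal Navigation with Recursive Implicit Maps", which also looks like very interesting work. No more stubbornness this time: I will follow the README to the letter, and only if that succeeds will I switch back to a python3.10 conda environment.

Reproduction steps

Clone the project and create the environment

Clone the project: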

git clone --recursive https://github.com/cshizhe/onav_rim.git

--recursive clones all submodules recursively, so this command is slow.
Create the conda environment:

conda create -n ONAV python=3.8
conda activate ONAV
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=10.2 -c pytorch

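No problems here.

Build habitat-sim from source

The recursive clone already put habitat-sim into the dependencies directory, which I think is a nice design. Nothing in its requirements looked risky, so I went into that directory and installed them directly: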

pip3 install -r requirements.txt

pip printed a dependency warning:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pykitti 0.3.1 requires pandas, which is not installed.
Successfully installed attrs-23.2.0 contourpy-1.1.1 cycler-0.12.1 fonttools-4.49.0 gitdb-4.0.11 gitpython-3.1.42 imageio-2.34.0 imageio-ffmpeg-0.4.9 importlib-metadata-7.0.1 importlib-resources-6.1.2 kiwisolver-1.4.5 llvmlite-0.41.1 matplotlib-3.7.5 numba-0.58.1 numpy-quaternion-2023.0.2 packaging-23.2 pyparsing-3.1.1 scipy-1.10.1 six-1.16.0 smmap-5.0.1 tqdm-4.66.2 zipp-3.17.0

[notice] A new release of pip is available: 23.3.2 -> 24.0
[notice] To update, run: pip install --upgrade pip

Nothing serious: one notice says a newer pip is available, and the other says pandas is not installed.
So:

pip install --upgrade pip
pip3 install pandas

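Next comes the build script. The README's command builds the headless version with bullet, but I want both a display and bullet. I couldn't be bothered to read the code, so I simply dropped the headless flag: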

./build.sh --bullet

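It compiled and ran fine!
Next, the environment variables. I prefer to edit bashrc by hand, so that I can add comments and change things easily later:

sudo gedit ~/.bashrc

Add: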

# Visual Navigation TEST

# Object Goal Navigation with Recursive Implicit Maps
export PYTHONPATH=$PYTHONPATH:/home/lcy-magic/VisualNav_TEST/onav_rim/dependencies/habitat-sim/
	# avoid logging
export GLOG_minloglevel=2
export MAGNUM_LOG=quiet

Reload:

source ~/.bashrc
conda activate ONAV

Install CLIP from GitHub

Go back to the repository root (though I doubt it's necessary) and run:

pip3 install ftfy regex tqdm
pip3 install git+https://github.com/openai/CLIP.git

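No problems, everything went smoothly. Here, pip3 install git+https: is roughly equivalent to cloning the repository and then installing it through its setup.py. That explanation comes from a blog post I used as reference (reference blog).

Finishing the environment setup

After checking that requirements.txt contains nothing unusual: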

pip3 install -r requirements.txt
python setup.py develop --all

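Everything went smoothly.

Downloading the datasets

The README is far too brief on this part. It says to follow the instructions in habitat-imitation-baselines to download the MP3D scene dataset, the object assets and the demonstration dataset. MP3D is Matterport3D.
Opening that link shows:
[screenshot]
Downloading the scene dataset requires going to yet another link.

For the object assets, it seems you just click, download and extract to the specified location.

The demonstration dataset is likewise a click to download, then extract.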

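For convenience I downloaded the object assets first:
clicking the "here" link gives the archive objects.gz.
Then the demonstration dataset:
there are many of them, but since this paper is only about navigation, it presumably only uses the MP3D scenes under ObjectNav-HD. So I clicked the first one and got the archive objectnav_mp3d_70k.json.gz.
Then I followed the scene-dataset instructions to the new link, which turns out to be the habitat-lab project.
Its datasets section sends you to yet another link:
[screenshot]
So I followed it and located the MP3D dataset entry.

It turns out that downloading a single scene only takes running a script, whereas downloading the whole dataset means following an official link. Fine, one more jump, to the dataset download section.

You have to send a form by email before you can download. I realized I had already sent this email: last year I wanted to try MP3D, so I dug out the reply I got back then:
[screenshot]
The full dataset is a whopping 1.3T, and I only have a bit over 100G free, sob. At best I could mount the part of my disk that isn't mounted yet, and that is only 800G. And who knows how long 1.3T would take to download. I don't think I downloaded all of it back then, did I?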
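Let me look at what I downloaded back then.

There are many scenes, 40+G in total:
[screenshot]
I no longer remember how I installed it back then; that's the downside of not writing blog posts. That project's README says:

As far as I recall, I only downloaded matterport_skybox_images. Now I remember: I followed a blog post at the time (reference blog), which even has my comment and the reply.

That post says to download like this:
[screenshot]
The command is: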

download-mp.py -o [directory in which to download] --type matterport_skybox_images undistorted_camera_parameters

This is the download script as I modified it back then so that it runs under python3:

#!/usr/bin/env python
# Downloads MP public data release
# Run with ./download_mp.py (or python download_mp.py on Windows)
# -*- coding: utf-8 -*-
import argparse
import collections
import os
import tempfile
import urllib.request

BASE_URL = 'http://kaldir.vc.in.tum.de/matterport/'
RELEASE = 'v1/scans'
RELEASE_TASKS = 'v1/tasks/'
RELEASE_SIZE = '1.3TB'
TOS_URL = BASE_URL + 'MP_TOS.pdf'
FILETYPES = [
    'cameras',
    'matterport_camera_intrinsics',
    'matterport_camera_poses',
    'matterport_color_images',
    'matterport_depth_images',
    'matterport_hdr_images',
    'matterport_mesh',
    'matterport_skybox_images',
    'undistorted_camera_parameters',
    'undistorted_color_images',
    'undistorted_depth_images',
    'undistorted_normal_images',
    'house_segmentations',
    'region_segmentations',
    'image_overlap_data',
    'poisson_meshes',
    'sens'
]
TASK_FILES = {
    'keypoint_matching_data': ['keypoint_matching/data.zip'],
    'keypoint_matching_models': ['keypoint_matching/models.zip'],
    'surface_normal_data': ['surface_normal/data_list.zip'],
    'surface_normal_models': ['surface_normal/models.zip'],
    'region_classification_data': ['region_classification/data.zip'],
    'region_classification_models': ['region_classification/models.zip'],
    'semantic_voxel_label_data': ['semantic_voxel_label/data.zip'],
    'semantic_voxel_label_models': ['semantic_voxel_label/models.zip'],
    'minos': ['mp3d_minos.zip'],
    'gibson': ['mp3d_for_gibson.tar.gz'],
    'habitat': ['mp3d_habitat.zip'],
    'pixelsynth': ['mp3d_pixelsynth.zip'],
    'igibson': ['mp3d_for_igibson.zip'],
    'mp360': ['mp3d_360/data_00.zip', 'mp3d_360/data_01.zip', 'mp3d_360/data_02.zip', 'mp3d_360/data_03.zip', 'mp3d_360/data_04.zip', 'mp3d_360/data_05.zip', 'mp3d_360/data_06.zip']
}


def get_release_scans(release_file):
    scan_lines = urllib.request.urlopen(release_file)
    scans = []
    for scan_line in scan_lines:
        scan_id = scan_line.rstrip(b'\n').decode('utf-8')
        scans.append(scan_id)
    return scans


def download_release(release_scans, out_dir, file_types):
    print('Downloading MP release to ' + out_dir + '...')
    for scan_id in release_scans:
        scan_out_dir = os.path.join(out_dir, scan_id)
        download_scan(scan_id, scan_out_dir, file_types)
    print('Downloaded MP release.')


def download_file(url, out_file):
    out_dir = os.path.dirname(out_file)
    if not os.path.isfile(out_file):
        print('\t' + url + ' > ' + out_file)
        fh, out_file_tmp = tempfile.mkstemp(dir=out_dir)
        f = os.fdopen(fh, 'w')
        f.close()
        urllib.request.urlretrieve(url, out_file_tmp)
        os.rename(out_file_tmp, out_file)
    else:
        print('WARNING: skipping download of existing file ' + out_file)

def download_scan(scan_id, out_dir, file_types):
    print('Downloading MP scan ' + scan_id + ' ...')
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    for ft in file_types:
        url = BASE_URL + RELEASE + '/' + scan_id + '/' + ft + '.zip'
        out_file = out_dir + '/' + ft + '.zip'
        download_file(url, out_file)
    print('Downloaded scan ' + scan_id)


def download_task_data(task_data, out_dir):
    print('Downloading MP task data for ' + str(task_data) + ' ...')
    for task_data_id in task_data:
        if task_data_id in TASK_FILES:
            file = TASK_FILES[task_data_id]
            for filepart in file:
                url = BASE_URL + RELEASE_TASKS + '/' + filepart
                localpath = os.path.join(out_dir, filepart)
                localdir = os.path.dirname(localpath)
                if not os.path.isdir(localdir):
                    os.makedirs(localdir)
                # Download even when the target directory already exists.
                download_file(url, localpath)
            print('Downloaded task data ' + task_data_id)


def main():
    parser = argparse.ArgumentParser(description=
        '''
        Downloads MP public data release.
        Example invocation:
          python download_mp.py -o base_dir --id ALL --type object_segmentations --task_data semantic_voxel_label_data semantic_voxel_label_models
        The -o argument is required and specifies the base_dir local directory.
        After download base_dir/v1/scans is populated with scan data, and base_dir/v1/tasks is populated with task data.
        Unzip scan files from base_dir/v1/scans and task files from base_dir/v1/tasks/task_name.
        The --type argument is optional (all data types are downloaded if unspecified).
        The --id ALL argument will download all house data. Use --id house_id to download specific house data.
        The --task_data argument is optional and will download task data and model files.
        ''',
        formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument('-o', '--out_dir', required=True, help='directory in which to download')
    parser.add_argument('--task_data', default=[], nargs='+', help='task data files to download. Any of: ' + ','.join(TASK_FILES.keys()))
    parser.add_argument('--id', default='ALL', help='specific scan id to download or ALL to download entire dataset')
    parser.add_argument('--type', nargs='+', help='specific file types to download. Any of: ' + ','.join(FILETYPES))
    args = parser.parse_args()

    print('By pressing any key to continue you confirm that you have agreed to the MP terms of use as described at:')
    print(TOS_URL)
    print('***')
    print('Press any key to continue, or CTRL-C to exit.')
    key = input('')

    release_file = BASE_URL + RELEASE + '.txt'
    release_scans = get_release_scans(release_file)
    file_types = FILETYPES

    # download task data
    if args.task_data:
        if set(args.task_data) & set(TASK_FILES.keys()):  # download task data
            out_dir = os.path.join(args.out_dir, RELEASE_TASKS)
            download_task_data(args.task_data, out_dir)
        else:
            print('ERROR: Unrecognized task data id: ' + str(args.task_data))
        print('Done downloading task_data for ' + str(args.task_data))
        key = input('Press any key to continue on to main dataset download, or CTRL-C to exit.')

    # download specific file types?
    if args.type:
        if not set(args.type) & set(FILETYPES):
            print('ERROR: Invalid file type: ' + str(args.type))
            return
        file_types = args.type

    if args.id and args.id != 'ALL':  # download single scan
        scan_id = args.id
        if scan_id not in release_scans:
            print('ERROR: Invalid scan id: ' + scan_id)
        else:
            out_dir = os.path.join(args.out_dir, RELEASE, scan_id)
            download_scan(scan_id, out_dir, file_types)
    elif 'minos' not in args.task_data and args.id == 'ALL' or args.id == 'all':  # download entire release
        if len(file_types) == len(FILETYPES):
            print('WARNING: You are downloading the entire MP release which requires ' + RELEASE_SIZE + ' of space.')
        else:
            print('WARNING: You are downloading all MP scans of type ' + file_types[0])
        print('Note that existing scan directories will be skipped. Delete partially downloaded directories to re-download.')
        print('***')
        print('Press any key to continue, or CTRL-C to exit.')
        key = input('')
        out_dir = os.path.join(args.out_dir, RELEASE)
        download_release(release_scans, out_dir, file_types)

if __name__ == "__main__": main()

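So far so good, I thought: I just move over the data I downloaded earlier. But something is off, because the onav project expects a directory layout like this:
[screenshot]
It clearly contains model checkpoint files too, not just a few datasets placed side by side.
Judging from its acknowledgements, I think the cause lies in habitat-imitation-baselines: this layout is probably that baseline project's layout. It seems I should set up that project first and then copy things over.
[screenshot]
But when I looked at the layout of habitat-imitation-baselines:

it isn't quite the same either!

Never mind, one step at a time. First, put the data I already have in place: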
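[screenshot]
Here, the objectnav directory under datasets comes from the demonstration dataset and contains a train folder, the same as in both onav and the baseline. I guess the mp3d directory under datasets is the scene dataset; I don't have it yet, so I only created the folder. In the baseline, datasets also contains a pick_place folder, but onav doesn't need it, so I skipped it. I guess ddppo-models holds the baseline's model weights, because onav doesn't use reinforcement learning. The same should go for rednet-models, since the onav paper doesn't mention rednet either. On reflection, I now think scene_datasets is the real path for the scene dataset. The test_assets folder is what I extracted from the object assets, so it should be fine.
So my guesses are:

Without scene_datasets/mp3d it probably won't work; that should be where the scene dataset lives.
Without ddppo-models and rednet-models it might still work, since the paper doesn't use them. I looked up the baseline paper, and it does use rednet and ddppo, so these are definitely the baseline's files and onav probably doesn't need them.
Without datasets/mp3d, I don't know; the baseline doesn't have that directory either.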
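
To keep track of these guesses, a small check like the following can list which directories are still missing. It is only a sketch: the directory names are copied from my notes above (not verified against the onav_rim code), and missing_dirs is a helper name I made up.

```python
import tempfile
from pathlib import Path

# Directory names are taken from the notes above; they are my reading of
# the expected layout, not something verified against the onav_rim code.
EXPECTED_DIRS = [
    "datasets/objectnav",
    "scene_datasets/mp3d",
    "ddppo-models",
    "rednet-models",
    "test_assets",
]


def missing_dirs(data_root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).is_dir()]


# Demo on a throwaway directory instead of the real data folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "datasets" / "objectnav").mkdir(parents=True)
    (Path(tmp) / "test_assets").mkdir()
    demo_missing = missing_dirs(tmp)
    print(demo_missing)
```

On the real machine the call would be missing_dirs("/home/lcy-magic/VisualNav_TEST/onav_rim/data"); an empty list only means the folders exist, not that their contents are right.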

As for getting the scene dataset, I plan to download just one scene first:

python download-mp.py -o  /home/lcy-magic/VisualNav_TEST/onav_rim/data/scene_datasets/mp3d --id 17DRP5sb8fy

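The download-mp.py here is the Python script shown above. Remember to make it executable first.
[screenshot]
Press Enter to continue. The download takes quite a while.

Damn! It turns out the data is already available here:
[screenshot]

It took a long time to download this one as well, but after extracting it, it wasn't what I needed either. Its directory looks like this:

It looks like I have to merge my two data folders:

Now only the mp3d directory is missing a val folder. What the script downloaded is a folder named scans, full of zip archives of all kinds of data. There is a val folder here, though: /home/lcy-magic/VisualNav_TEST/onav_rim/data/trn.all-imap.3.pos.sgtype-rgb.rn50.clip-depth-layer.4-hidden.512-history.200-enc.concat-act.linear-attn.pos/step_54426_heuristic_nocld/results/sem_seg_pred/val, but it only contains four empty directories: depth, rgb, seg and top_down_map. I'll figure out what's going on later. That's it for the datasets for now.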
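
Since the script leaves nothing but zip archives inside each scan folder, here is a rough sketch of how they could be unpacked in one go. extract_all_zips is a hypothetical helper, and the file layout inside the demo archive is invented purely for illustration.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_all_zips(scan_dir):
    """Extract every .zip directly inside scan_dir into scan_dir.

    Returns the names of the archives that were extracted.
    """
    scan_dir = Path(scan_dir)
    extracted = []
    for archive in sorted(scan_dir.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(scan_dir)
        extracted.append(archive.name)
    return extracted


# Demo with a tiny fake archive; the inner path is made up.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "matterport_skybox_images.zip"
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("17DRP5sb8fy/matterport_skybox_images/dummy.txt", "placeholder")
    demo_extracted = extract_all_zips(tmp)
    demo_file_exists = (
        Path(tmp) / "17DRP5sb8fy" / "matterport_skybox_images" / "dummy.txt"
    ).is_file()
```

Keep the disk space in mind: the archives stay in place after extraction, so the folder roughly doubles in size.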

Model evaluation

Since trained weights are already available, I tried the evaluation script right away to see whether it works:

(ONAV) lcy-magic@lcy-magic:~/VisualNav_TEST/onav_rim$ sbatch job_scripts/onav_eval_imap_transformers.sh

Error:

Command 'sbatch' not found, but can be installed with:

sudo apt install slurm-client

So I installed it:

sudo apt install slurm-client

Running it again gives another error:

sbatch: error: s_p_parse_file: unable to status file /etc/slurm-llnl/slurm.conf: No such file or directory, retrying in 1sec up to 60sec

Looking at the script, a few paths are clearly wrong for my machine, so I changed them:

# cd /home/shichen/codes/onav_rim
cd /home/lcy-magic/VisualNav_TEST/onav_rim

# python_bin=$HOME/miniconda3/envs/onav/bin/python
python_bin=$HOME/anaconda3/envs/ONAV/bin/python

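Same error again, so the problem is the command itself. I looked up slurm and found that it is a job scheduler for Linux clusters. Well, I only have a laptop, so I shouldn't need it. I also noticed that the training section of the README distinguishes between running with and without slurm:
[screenshot]
So I shouldn't need it for evaluation either. I decided to skip this step and run the Python script directly: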

python /home/lcy-magic/VisualNav_TEST/onav_rim/evaluate_results.py /home/lcy-magic/VisualNav_TEST/onav_rim/test_result/results/val_pred_trajectories.jsonl

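I replaced the out_dir variable of the original command with the path of a new empty folder, because I currently have no output_dir environment variable:
[screenshot]
It failed: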

Traceback (most recent call last):
  File "/home/lcy-magic/VisualNav_TEST/onav_rim/evaluate_results.py", line 101, in <module>
    main()
  File "/home/lcy-magic/VisualNav_TEST/onav_rim/evaluate_results.py", line 95, in main
    eval_result_file(
  File "/home/lcy-magic/VisualNav_TEST/onav_rim/evaluate_results.py", line 18, in eval_result_file
    with jsonlines.open(result_file, 'r') as f:
  File "/home/lcy-magic/anaconda3/envs/ONAV/lib/python3.8/site-packages/jsonlines/jsonlines.py", line 643, in open
    fp = builtins.open(file, mode=mode + "t", encoding=encoding)
FileNotFoundError: [Errno 2] No such file or directory: '/home/lcy-magic/VisualNav_TEST/onav_rim/test_result/results/val_pred_trajectories.jsonl'

So this path cannot be chosen arbitrarily: the directory must contain results/val_pred_trajectories.jsonl. I searched and found that this location has such a file:

/home/lcy-magic/VisualNav_TEST/onav_rim/data/trn.all-imap.3.pos.sgtype-rgb.rn50.clip-depth-layer.4-hidden.512-history.200-enc.concat-act.linear-attn.pos/step_54426_heuristic_nocld/results
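
Searching by hand works, but the same lookup can be sketched in a few lines of Python. find_result_files is a helper name I made up, and the demo directory names are placeholders rather than the real experiment folder.

```python
import tempfile
from pathlib import Path


def find_result_files(root, name="val_pred_trajectories.jsonl"):
    """Return all files called `name` below root, as sorted relative POSIX paths."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob(name) if p.is_file()
    )


# Demo on a throwaway tree with a placeholder experiment folder.
with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "some_experiment" / "results"
    results.mkdir(parents=True)
    (results / "val_pred_trajectories.jsonl").write_text("{}\n")
    demo_found = find_result_files(tmp)
    print(demo_found)
```

Pointing it at the project's data directory should list every experiment that already ships with prediction results.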

Let's try it with that one:

python /home/lcy-magic/VisualNav_TEST/onav_rim/evaluate_results.py /home/lcy-magic/VisualNav_TEST/onav_rim/data/trn.all-imap.3.pos.sgtype-rgb.rn50.clip-depth-layer.4-hidden.512-history.200-enc.concat-act.linear-attn.pos/step_54426_heuristic_nocld/results/val_pred_trajectories.jsonl

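Sure enough, it ran successfully:
[screenshot]

Open questions

A few questions remain so far:

What is going on with the data directory layout?
How is the output_dir environment variable supposed to be set?
Can my laptop run the training?

My reasoning:
val_pred_trajectories.jsonl looks like one representation of the results produced by a trained model, so once training works I should be able to generate it. The data directory layout should also become clear along the way.

Training

Run: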

python /home/lcy-magic/VisualNav_TEST/onav_rim/offline_bc/preprocess/extract_demo_observations.py \
  --cfg_file habitat_baselines/config/objectnav/il_ddp_objectnav.yaml \
  --scene_id gZ6f7yhEvPG \
  --save_semantic_fts --encode_depth --encode_rgb_clip \
  --outdir data/datasets/objectnav/mp3d_70k_demos_prefts

Error:

2024-03-05 20:01:54,627 Initializing dataset ObjectNav-v2
2024-03-05 20:01:54,852 initializing sim Sim-v0
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0305 20:01:54.859373 133312 StageAttributesManager.cpp:90] StageAttributesManager::registerObjectFinalize : Render asset template handle : data/scene_datasets/mp3d/gZ6f7yhEvPG/gZ6f7yhEvPG.glb specified in stage template with handle : data/scene_datasets/mp3d/gZ6f7yhEvPG/gZ6f7yhEvPG.glb does not correspond to any existing file or primitive render asset.  Aborting. 
Segmentation fault (core dumped)
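
The log boils down to one missing file, data/scene_datasets/mp3d/<scene_id>/<scene_id>.glb. A pre-flight check along these lines would have caught it before the simulator crashed. This is a sketch: the path pattern is inferred from the error message only, and both function names are my own.

```python
import tempfile
from pathlib import Path


def scene_glb_path(scene_id, scenes_root="data/scene_datasets/mp3d"):
    """Build the .glb path in the <root>/<scene_id>/<scene_id>.glb pattern seen in the log."""
    return Path(scenes_root) / scene_id / f"{scene_id}.glb"


def has_scene(scene_id, scenes_root="data/scene_datasets/mp3d"):
    """True if the scene's .glb file exists under scenes_root."""
    return scene_glb_path(scene_id, scenes_root).is_file()


# Demo: one scene present (as an empty placeholder file), one absent.
with tempfile.TemporaryDirectory() as tmp:
    scene_dir = Path(tmp) / "17DRP5sb8fy"
    scene_dir.mkdir()
    (scene_dir / "17DRP5sb8fy.glb").write_bytes(b"")
    demo_present = has_scene("17DRP5sb8fy", tmp)
    demo_absent = has_scene("gZ6f7yhEvPG", tmp)
```

The relative default root only makes sense when run from the repository root, which is also where the extraction script is launched from.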

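I noticed that none of the scene files downloaded for this project include a .glb file, i.e. the 3D model. But I remember that the previous project, one4all, downloaded lots of .glb files. There I found /home/lcy-magic/VLN_TEST/habitat-sim/data/mp3d_example/17DRP5sb8fy17DRP5sb8fy.glb. I think this is what a scene file really is. I put it into /onav_rim/data/scene_datasets/mp3d of this project and ran: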

python /home/lcy-magic/VisualNav_TEST/onav_rim/offline_bc/preprocess/extract_demo_observations.py \
  --cfg_file habitat_baselines/config/objectnav/il_ddp_objectnav.yaml \
  --scene_id 17DRP5sb8fy17DRP5sb8fy \
  --save_semantic_fts --encode_depth --encode_rgb_clip \
  --outdir data/datasets/objectnav/mp3d_70k_demos_prefts

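And it started running:
[screenshot]
I can't tell whether it is still running, since it shows no progress. But the fan is spinning, so it probably is.
A long time later the fan stopped; I looked, and it had failed: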

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
  0%|                                                                                                                                                  | 0/1141 [01:05<?, ?it/s]
Traceback (most recent call last):
  File "/home/lcy-magic/VisualNav_TEST/onav_rim/offline_bc/preprocess/extract_demo_observations.py", line 392, in <module>
    main()
  File "/home/lcy-magic/VisualNav_TEST/onav_rim/offline_bc/preprocess/extract_demo_observations.py", line 388, in main
    extract_demo_obs_and_fts(args)
  File "/home/lcy-magic/VisualNav_TEST/onav_rim/offline_bc/preprocess/extract_demo_observations.py", line 257, in extract_demo_obs_and_fts
    rgb_fts['clip'] = rgb_clip_encoder.extract_fts(resized_rgb_images)
  File "/home/lcy-magic/VisualNav_TEST/onav_rim/offline_bc/preprocess/encoders.py", line 44, in extract_fts
    fts.append(self.model(inputs).data.cpu().numpy())
  File "/home/lcy-magic/anaconda3/envs/ONAV/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lcy-magic/anaconda3/envs/ONAV/lib/python3.8/site-packages/clip/model.py", line 147, in forward
    x = stem(x)
  File "/home/lcy-magic/anaconda3/envs/ONAV/lib/python3.8/site-packages/clip/model.py", line 140, in stem
    x = self.relu1(self.bn1(self.conv1(x)))
  File "/home/lcy-magic/anaconda3/envs/ONAV/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lcy-magic/anaconda3/envs/ONAV/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 98, in forward
    return F.relu(input, inplace=self.inplace)
  File "/home/lcy-magic/anaconda3/envs/ONAV/lib/python3.8/site-packages/torch/nn/functional.py", line 1297, in relu
    result = torch.relu_(input)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

The key part is this:

NVIDIA GeForce RTX 3070 Laptop GPU with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37.
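
To make the mismatch concrete, here is a simplified check. The rule I use, that the build must contain at least one sm_XY entry with the GPU's major version, is my own simplification and not necessarily PyTorch's exact logic. torch.cuda.get_arch_list() and torch.cuda.get_device_capability() are real calls, but they are only used in the helper that is not executed here.

```python
def sm_majors(arch_list):
    """Major versions of all sm_XY entries, e.g. 'sm_75' -> 7."""
    return {int(a.split("_")[1]) // 10 for a in arch_list if a.startswith("sm_")}


def build_supports(capability, arch_list):
    """Simplified rule (an assumption): same major version in some sm_ entry."""
    major, _minor = capability
    return major in sm_majors(arch_list)


# Architecture list copied from the warning above.
ARCHS_FROM_WARNING = ["sm_37", "sm_50", "sm_60", "sm_61", "sm_70", "sm_75", "compute_37"]

# The RTX 3070 Laptop GPU reports capability sm_86, i.e. (8, 6).
rtx3070_ok = build_supports((8, 6), ARCHS_FROM_WARNING)
print(rtx3070_ok)


def check_local_gpu():
    """Same check against the installed PyTorch build (needs torch and a GPU)."""
    import torch

    capability = torch.cuda.get_device_capability(0)
    return build_supports(capability, torch.cuda.get_arch_list())
```

Calling check_local_gpu() inside the ONAV environment before a long preprocessing run would flag this kind of mismatch right away.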

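In other words, the installed PyTorch build is too old for my GPU: it ships no kernels for sm_86. The project asked for pytorch==1.10.0; let me check:
[screenshot]
Indeed. So I tried simply upgrading torch automatically: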

pip3 install --upgrade torch

It failed partway through:

ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
    unknown package:
        Expected sha256 5c0c83aa7d94569997f1f474595e808072d80b04d34912ce6f1a0e1c24b0c12a
             Got        c1a3c8ea5ee507c012dd093414c31a27932d2283c6da256d4367332664a8d16d

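I ignored it and ran the command again; this time it seems to have installed:
[screenshot]
I ran the previous script again, and it died:

No error message at all, just a core dump. My guess is that it is related to the earlier error. Following a blog post (reference blog), I first wrote down the current versions of the relevant packages: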

torch 2.2.1
torchvision 0.11.0
torchaudio 0.10.0
nvcc 11.8

Then I ran:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

It printed "Requirement already satisfied" and updated nothing.
Then I ran:

pip3 install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio===0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

This time the versions are:

Name: torch
Version: 1.8.1+cu111
Name: torchvision
Version: 0.9.1+cu111
Name: torchaudio
Version: 0.8.1
Build cuda_11.8.r11.8/compiler.31833905_0

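Running the script again, it gets further now. Let's wait for good news.
[screenshot]
Success!

For this project I downloaded the scene dataset like this: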

python download-mp.py -o  /home/lcy-magic/VisualNav_TEST/onav_rim/data/scene_datasets/mp3d --id 17DRP5sb8fy

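What it downloads looks like this:
[screenshot]
These are clearly just sensor data, not scene files.
In the previous project, the scene dataset was downloaded like this: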

wget http://dl.fbaipublicfiles.com/habitat/mp3d_example.zip

What it downloads looks like this:
[screenshot]
Or like this:

python -m habitat_sim.utils.datasets_download --uids habitat_test_scenes --data-path /home/lcy-magic/VLN_TEST/Habitat_data
python -m habitat_sim.utils.datasets_download --uids replica_cad_dataset

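In other words, the habitat_sim.utils.datasets_download script can download scene datasets.
As an aside, these two websites can display .glb files online:
Website 1
Website 2
I checked whether that script could download more scenes, but it only contains this one example: mp3d_example_scene, which is the one I already have. So that doesn't work.
Then I came across this reference link, and while writing a reply there I realized I had probably run the wrong command. Here is what I wrote:

I ran into the same confusion. Following the instructions in this link, I can download one mp3d scene with habitat-sim's datasets_download script, and it contains a .glb file. But I want other scenes as well, and looking at that script's code, it implements no download for them. The only way is to follow the link above: fill in the form, send the email, and get the download script download_mp.py from the reply. But the mp3d data downloaded with that script only contains the per-scene sensor data, stored as zip archives. None of the archives contains a .glb file.
The download command I used was:
python download-mp.py -o /home/lcy-magic/VisualNav_TEST/onav_rim/data/scene_datasets/mp3d --id 17DRP5sb8fy
The reason may be that I didn't follow the template from the link above: python download_mp.py --task habitat -o path/to/download/
I will run python download_mp.py --task habitat -o /home/lcy-magic/VisualNav_TEST/onav_rim/data/scene_datasets/mp3d --id 17DRP5sb8fy
and see.

It is downloading right now; I'll check tomorrow:
[screenshot]
Download finished: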

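The v1 folder now contains not only scans (the sensor data) but also a tasks folder, which holds the scene files for all scenes. My disk isn't big enough to extract everything, so I only extracted the scene 17DRP5sb8fy:

That was pretty silly, though. I already had this scene, so why download it again...
I should have downloaded this one: gZ6f7yhEvPG

Training the model

Following the README, add to bashrc: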

# non-slurm servers
export WORLD_SIZE=1
export MASTER_ADDR='gpu003'
export MASTER_PORT=10000

Source bashrc again, then set a variable for the config file:

configfile=/home/lcy-magic/VisualNav_TEST/onav_rim/offline_bc/config/onav_imap_single_transformer.yaml

Or add it to bashrc:

sudo gedit ~/.bashrc
export configfile=/home/lcy-magic/VisualNav_TEST/onav_rim/offline_bc/config/onav_imap_single_transformer.yaml
source ~/.bashrc

Check:

echo $configfile

All good.

Then run:

python /home/lcy-magic/VisualNav_TEST/onav_rim/offline_bc/train_models.py --exp-config $configfile

Error:

03/06/2024 14:56:06 - WARNING - __main__ -   Output directory (data/exprs_iros23_release/imap_single_transformer/trn.all-imap.3.pos.sgtype-rgb.rn50.clip-depth-layer.4-hidden.512-history.200-enc.concat-act.linear-attn.pos) already exists and is not empty.
03/06/2024 14:56:06 - INFO - __main__ -   device: cuda n_gpu: 1, distributed training: False
03/06/2024 14:56:06 - INFO - __main__ -   #scenes: 1, #episodes: 1140
03/06/2024 14:56:06 - INFO - __main__ -   #num_steps_per_epoch: 36
  0%|                                                                       | 0/900 [00:00<?, ?it/s]03/06/2024 14:56:09 - INFO - __main__ -   

Setting up GPS sensor
03/06/2024 14:56:09 - INFO - __main__ -   

Setting up Compass sensor
03/06/2024 14:56:09 - INFO - __main__ -   Object categories: 28
03/06/2024 14:56:09 - INFO - __main__ -   

Setting up Object Goal sensor
03/06/2024 14:56:09 - INFO - __main__ -   step input size: 512
03/06/2024 14:56:09 - INFO - __main__ -   Model: nweights 44006630 nparams 402
03/06/2024 14:56:09 - INFO - __main__ -   Model: trainable nweights 13409958 nparams 75
Traceback (most recent call last):
  File "/home/lcy-magic/VisualNav_TEST/onav_rim/offline_bc/train_models.py", line 315, in <module>
    main(config)
  File "/home/lcy-magic/VisualNav_TEST/onav_rim/offline_bc/train_models.py", line 124, in main
    LOGGER.info(f"***** Running training with {config.world_size} GPUs *****")
  File "/home/lcy-magic/anaconda3/envs/ONAV/lib/python3.8/site-packages/yacs/config.py", line 141, in __getattr__
    raise AttributeError(name)
AttributeError: world_size
  0%|                                                                       | 0/900 [00:03<?, ?it/s]

Looking at the code, this statement just reads a value from config and logs it:

LOGGER.info(f"***** Running training with {config.world_size} GPUs *****")

The problem is that world_size cannot be found in config. As we can see, config comes from build_args:

if __name__ == '__main__':
    config = build_args()

    main(config)

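In build_args, config is built from the config file path passed on the command line:
[screenshot]
That is:

/home/lcy-magic/VisualNav_TEST/onav_rim/offline_bc/config/onav_imap_single_transformer.yaml

In principle this file hasn't been modified, so it should be correct. Opening it, there is indeed no world_size in it.
Since these are just log messages, I commented out the two logging lines: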

# LOGGER.info(f"***** Running training with {config.world_size} GPUs *****")
# LOGGER.info("  Batch size = %d", config.train_batch_size if config.local_rank == -1 else config.train_batch_size * config.world_size)
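
Commenting the lines out works, but a gentler alternative would be to fall back to a default when world_size is absent. The traceback shows the config object raising AttributeError for missing keys, so getattr with a default should cover it. The sketch below uses MiniConfig, a stand-in class I wrote for illustration (not the project's config class), and describe_run is likewise a hypothetical helper.

```python
class MiniConfig:
    """Stand-in for the project's config object (hypothetical, for illustration)."""

    def __init__(self, **values):
        self._values = dict(values)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails; missing keys
        # raise AttributeError, like the config in the traceback above.
        try:
            return self.__dict__["_values"][name]
        except KeyError:
            raise AttributeError(name)


def describe_run(config):
    """Return (world_size, effective batch size) without failing on a missing key."""
    world_size = getattr(config, "world_size", 1)  # default to a single process
    if config.local_rank == -1:
        batch = config.train_batch_size
    else:
        batch = config.train_batch_size * world_size
    return world_size, batch


demo = describe_run(MiniConfig(local_rank=-1, train_batch_size=8))
print(demo)
```

With this pattern the two log lines could stay in place and simply report one GPU on a single-machine run.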

Running again, another error:

Traceback (most recent call last):
  File "/home/lcy-magic/VisualNav_TEST/onav_rim/offline_bc/train_models.py", line 318, in <module>
    main(config)
  File "/home/lcy-magic/VisualNav_TEST/onav_rim/offline_bc/train_models.py", line 145, in main
    loss, logits = model(batch, compute_loss=True)
  File "/home/lcy-magic/anaconda3/envs/ONAV/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/lcy-magic/VisualNav_TEST/onav_rim/offline_bc/models/onav_imap_models.py", line 165, in forward
    inputs = self.encode_step_obs(batch, step_embeddings=stepid_embeds)
  File "/home/lcy-magic/VisualNav_TEST/onav_rim/offline_bc/models/onav_base.py", line 520, in encode_step_obs
    return self.encode_step_obs_concat(batch, **kwargs)
  File "/home/lcy-magic/VisualNav_TEST/onav_rim/offline_bc/models/onav_base.py", line 417, in encode_step_obs_concat
    compass_observations = torch.concat(
AttributeError: module 'torch' has no attribute 'concat'

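According to a blog post (reference blog), the fix is to replace concat with cat. torch.concat is an alias of torch.cat that only newer PyTorch releases provide (the project targets 1.10.0), so it is missing from the 1.8.1 I downgraded to. I searched the whole project for torch.concat, found 5 occurrences, and changed them all to cat: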

# compass_observations = torch.concat(
compass_observations = torch.cat(

# compass_observations = torch.concat(
compass_observations = torch.cat(

# compass_observations = torch.concat(
compass_observations = torch.cat(

# compass_fts = torch.concat(
compass_fts = torch.cat(

# masked_input_embeds = self.gps_embedding(batch['infer_gps']) + \
#                       self.compass_embedding(torch.concat([torch.cos(batch['infer_compass']),
#                       torch.sin(batch['infer_compass'])], -1))
masked_input_embeds = self.gps_embedding(batch['infer_gps']) + \
                         self.compass_embedding(torch.cat([torch.cos(batch['infer_compass']),
                         torch.sin(batch['infer_compass'])], -1))
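
Instead of editing five call sites, one could also patch the missing alias once at start-up. The sketch below shows the idea on a fake module so that it runs without PyTorch; ensure_alias is a name I made up, and monkey-patching a library like this is a workaround, not something the project does.

```python
import types


def ensure_alias(module, alias, target):
    """Make module.<alias> point at module.<target> if the alias is missing.

    Returns True if the alias was added, False if it already existed.
    """
    if hasattr(module, alias):
        return False
    setattr(module, alias, getattr(module, target))
    return True


# Fake "torch" with only cat(), standing in for an old PyTorch release.
fake_torch = types.SimpleNamespace(cat=lambda parts, dim=0: sum(parts, []))

added = ensure_alias(fake_torch, "concat", "cat")
demo_result = fake_torch.concat([[1, 2], [3]])
print(added, demo_result)
```

In the real project this would amount to calling ensure_alias(torch, "concat", "cat") right after import torch in the training script, leaving the model code untouched.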

Running again, another error:

RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 7.79 GiB total capacity; 5.38 GiB already allocated; 35.62 MiB free; 5.44 GiB reserved in total by PyTorch)

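Out of GPU memory! Sob!
Don't lose heart: maybe my GPU isn't the problem and something else is hogging the memory.
Open another terminal and watch GPU memory in real time: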

watch -n 1 nvidia-smi

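Without the training code running:
[screenshot]
While it is running:
[screenshot: https://img-blog.csdnimg.cn/direct/3afaab52b67f4ba3b8831aa8e87272eb.png]
As you can see, it really is almost full.
Following a blog post (reference blog), let's see whether this can be fixed:
I tried lowering the training batch_size in the config file from 32 to 4: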

train_batch_size: 4 #32

Running again gives a new error:

Traceback (most recent call last):
  File "/home/lcy-magic/VisualNav_TEST/onav_rim/offline_bc/train_models.py", line 318, in <module>
    main(config)
  File "/home/lcy-magic/VisualNav_TEST/onav_rim/offline_bc/train_models.py", line 180, in main
    (logits.max(dim=-1)[1].data.cpu() == batch['demonstration'])[batch['inflection_weight'] > 0].float()
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

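Last time it died at line 145, this time at line 180, so that's some progress.
Next, how large can batch_size go? 16 works but nearly fills the memory. 8 uses a bit more than half. I'll go with 8 for now, and switch to 16 once everything runs end to end. After that I can consider rendering the desktop on the integrated GPU and reserving the discrete GPU for CUDA and OpenGL, to free up more memory for training.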
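
A smaller batch also means more optimizer steps per epoch. The earlier log reported 1140 episodes and 36 steps per epoch at batch size 32, which matches rounding up 1140 / 32. Assuming the trainer always rounds up like that (my inference from that one log line), the other batch sizes work out as follows:

```python
import math


def steps_per_epoch(num_episodes, batch_size):
    """Number of batches needed to see every episode once, rounding up."""
    return math.ceil(num_episodes / batch_size)


# 1140 episodes is the figure from the training log above.
for batch_size in (32, 16, 8, 4):
    print(batch_size, steps_per_epoch(1140, batch_size))
```

So going from 32 to 8 roughly quadruples the number of steps per epoch, which is worth remembering when comparing training time across batch sizes.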

As for the tensors being on different devices, I'll try changing the line directly to:

# (logits.max(dim=-1)[1].data.cpu() == batch['demonstration'])[batch['inflection_weight'] > 0].float()
(logits.max(dim=-1)[1].data.cuda() == batch['demonstration'])[batch['inflection_weight'] > 0].float()

It seems to run now:
[screenshot]
Let's wait for good news.
Training completed successfully:

How cool is that? I'm over the moon.
Run one more evaluation:

python /home/lcy-magic/VisualNav_TEST/onav_rim/evaluate_results.py /home/lcy-magic/VisualNav_TEST/onav_rim/data/trn.all-imap.3.pos.sgtype-rgb.rn50.clip-depth-layer.4-hidden.512-history.200-enc.concat-act.linear-attn.pos/step_54426_heuristic_nocld/results/val_pred_trajectories.jsonl

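No problems either:
[screenshot]
A complete victory! I still don't understand the structure of his code, but at least everything runs. It also proves that my laptop really can train an end-to-end network for this task (although it finishing this fast was a bit unexpected): training took close to 1h. Now I can happily go work out and eat. A day without bugs is a good day.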