mmdetection报错汇总

2022/7/21 mmdetection

# mmdetection训练报错汇总

  • one of the variables needed for gradient computation has been modified by an inplace operation

    • 详细报错信息

      Traceback (most recent call last):
      File "tools/train.py", line 242, in <module>
          main()
      File "tools/train.py", line 231, in main
          train_detector(
      File "/home/lgh/code/mmlab/mmdetection/mmdet/apis/train.py", line 244, in train_detector
          runner.run(data_loaders, cfg.workflow)
      File "/home/lgh/miniconda3/envs/mmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
          epoch_runner(data_loaders[i], **kwargs)
      File "/home/lgh/miniconda3/envs/mmlab/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 51, in train
          self.call_hook('after_train_iter')
      File "/home/lgh/miniconda3/envs/mmlab/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 309, in call_hook
          getattr(hook, fn_name)(self)
      File "/home/lgh/miniconda3/envs/mmlab/lib/python3.8/site-packages/mmcv/runner/hooks/optimizer.py", line 56, in after_train_iter
          runner.outputs['loss'].backward()
      File "/home/lgh/miniconda3/envs/mmlab/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
          torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "/home/lgh/miniconda3/envs/mmlab/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
          Variable._execution_engine.run_backward(
      RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2, 2048, 25, 25]], which is output 0 of ReluBackward1, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
      
    • 问题分析:
      所谓inplace操作就是直接修改地址上的值。

      首先是torch中所有加 _ 的函数,x.squeeze_(),x.unsqueeze_()。这些操作直接修改变量,不返回值。

      其次一些函数可以通过设置是否inplace,比如Pytorch中 torch.relu()和torch.sigmoid()等激活函数不是inplace操作,其中ReLU可通过设置inplace=True进行inplace操作。

      以及一些算术操作,x += res是inplace操作,x = x + res不是。以及一些赋值操作。

      如果一些用于backward的值被inplace操作修改了就会报错,所以最好不使用inplace操作,以及将赋值操作放在各种需要计算梯度运算之前,gather要在赋值之后。

    • 解决方案:
      ReLuinplace=True移除

    • 参考文章:
      https://blog.csdn.net/qq_38163755/article/details/110957133 (opens new window)