OOM in evaluation #17

Open
Luo-Z13 opened this issue Sep 5, 2023 · 1 comment

Luo-Z13 commented Sep 5, 2023

Environment (I have tried PyTorch 1.9.0 and PyTorch 1.8.0 with their corresponding cuDNN builds, plus mmcv-full 1.4.0; both cause GPU memory to grow steadily during training — see the monitoring sketch after the environment dump.)

sys.platform: linux
Python: 3.8.17 (default, Jul 5 2023, 21:04:15) [GCC 11.2.0]
CUDA available: True
GPU 0: NVIDIA TITAN RTX
CUDA_HOME: /usr/local/cuda-10.2
NVCC: Cuda compilation tools, release 10.2, V10.2.89
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.10.0
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX512
  • CUDA Runtime 11.3
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.2
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.2.0, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

TorchVision: 0.11.1
OpenCV: 4.7.0
MMCV: 1.4.0
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.3
MMDetection: 2.13.0+c820f32
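
As a quick way to confirm the steady memory growth described above, one can log allocator statistics once per iteration. This is a generic diagnostic sketch, not part of the repository; where to call it (e.g. at the end of each train step) is an assumption.

```python
import torch

def log_gpu_memory(tag, device=0):
    """Print PyTorch CUDA allocator stats; call once per training iteration.
    If `allocated` keeps climbing across iterations, tensors are being
    retained (e.g. losses kept with their graphs) rather than freed."""
    alloc = torch.cuda.memory_allocated(device) / 2**30
    reserved = torch.cuda.memory_reserved(device) / 2**30
    peak = torch.cuda.max_memory_allocated(device) / 2**30
    print(f'[{tag}] allocated={alloc:.2f} GiB '
          f'reserved={reserved:.2f} GiB peak={peak:.2f} GiB')
```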

2023-09-05 21:09:00,423 - mmdet - INFO - Distributed training: False
2023-09-05 21:09:01,221 - mmdet - INFO - Config:
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
custom_hooks = [dict(type='NumClassCheckHook')]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
norm_cfg = dict(type='GN', num_groups=32, requires_grad=True)

Bug Description

[>>> ] 426/3822, 16.0 task/s, elapsed: 27s, ETA: 213s
[>>> ] 427/3822, 15.9 task/s, elapsed: 27s, ETA: 214s
[>>> ] 428/3822, 15.9 task/s, elapsed: 27s, ETA: 213s
[>>> ] 429/3822, 15.9 task/s, elapsed: 27s, ETA: 213s
[>>> ] 430/3822, 15.9 task/s, elapsed: 27s, ETA: 213s
[>>> ] 431/3822, 15.9 task/s, elapsed: 27s, ETA: 214s
[>>> ] 432/3822, 15.9 task/s, elapsed: 27s, ETA: 213s
[>>> ] 433/3822, 15.9 task/s, elapsed: 27s, ETA: 213s
[>>> ] 434/3822, 15.9 task/s, elapsed: 27s, ETA: 213s
Traceback (most recent call last):
File "tools/train.py", line 192, in
main()
File "tools/train.py", line 181, in main
train_detector(
File "/media/dell/data1/ljw/code/test1/CPR/P2BNet/TOV_mmdetection/mmdet/apis/train.py", line 172, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/media/dell/data1/miniconda3/envs/mmdetp2b/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/media/dell/data1/miniconda3/envs/mmdetp2b/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
self.call_hook('after_train_epoch')
File "/media/dell/data1/miniconda3/envs/mmdetp2b/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
getattr(hook, fn_name)(self)
File "/media/dell/data1/miniconda3/envs/mmdetp2b/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 267, in after_train_epoch
self._do_evaluate(runner)
File "/media/dell/data1/ljw/code/test1/CPR/P2BNet/TOV_mmdetection/mmdet/core/evaluation/eval_hooks.py", line 44, in _do_evaluate
results = single_gpu_test(runner.model, self.dataloader, show=False)
File "/media/dell/data1/ljw/code/test1/CPR/P2BNet/TOV_mmdetection/mmdet/apis/test.py", line 27, in single_gpu_test
result = model(return_loss=False, rescale=True, **data)
File "/media/dell/data1/miniconda3/envs/mmdetp2b/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/media/dell/data1/miniconda3/envs/mmdetp2b/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 50, in forward
return super().forward(*inputs, **kwargs)
File "/media/dell/data1/miniconda3/envs/mmdetp2b/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
return self.module(*inputs[0], **kwargs[0])
File "/media/dell/data1/miniconda3/envs/mmdetp2b/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/media/dell/data1/miniconda3/envs/mmdetp2b/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
return old_func(*args, **kwargs)
File "/media/dell/data1/ljw/code/test1/CPR/P2BNet/TOV_mmdetection/mmdet/models/detectors/base.py", line 177, in forward
return self.forward_test(img, img_metas, **kwargs)
File "/media/dell/data1/ljw/code/test1/CPR/P2BNet/TOV_mmdetection/mmdet/models/detectors/base.py", line 150, in forward_test
return self.simple_test(imgs[0], img_metas[0], **kwargs)
File "/media/dell/data1/ljw/code/test1/CPR/P2BNet/TOV_mmdetection/mmdet/models/detectors/P2BNet.py", line 375, in simple_test
test_result, pseudo_boxes = self.roi_head.simple_test(stage,
File "/media/dell/data1/ljw/code/test1/CPR/P2BNet/TOV_mmdetection/mmdet/models/roi_heads/P2B_head.py", line 446, in simple_test
det_bboxes, det_labels, pseudo_bboxes = self.simple_test_bboxes(
File "/media/dell/data1/ljw/code/test1/CPR/P2BNet/TOV_mmdetection/mmdet/models/roi_heads/P2B_head.py", line 474, in simple_test_bboxes
bbox_results = self._bbox_forward(x, rois, gt_bboxes, stage)
File "/media/dell/data1/ljw/code/test1/CPR/P2BNet/TOV_mmdetection/mmdet/models/roi_heads/P2B_head.py", line 210, in _bbox_forward
bbox_feats = self.bbox_roi_extractor(
File "/media/dell/data1/miniconda3/envs/mmdetp2b/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in call_impl
return forward_call(*input, **kwargs)
File "/media/dell/data1/miniconda3/envs/mmdetp2b/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 186, in new_func
return old_func(*args, **kwargs)
File "/media/dell/data1/ljw/code/test1/CPR/P2BNet/TOV_mmdetection/mmdet/models/roi_heads/roi_extractors/single_level_roi_extractor.py", line 102, in forward
roi_feats_t = self.roi_layers[i](feats[i], rois
)
File "/media/dell/data1/miniconda3/envs/mmdetp2b/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/media/dell/data1/miniconda3/envs/mmdetp2b/lib/python3.8/site-packages/mmcv/ops/roi_align.py", line 212, in forward
return roi_align(input, rois, self.output_size, self.spatial_scale,
File "/media/dell/data1/miniconda3/envs/mmdetp2b/lib/python3.8/site-packages/mmcv/ops/roi_align.py", line 84, in forward
output = input.new_zeros(output_shape)
RuntimeError: CUDA out of memory. Tried to allocate 2.28 GiB (GPU 0; 23.65 GiB total capacity; 14.99 GiB already allocated; 1.28 GiB free; 21.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
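
The failing allocation is the RoIAlign output tensor (num_rois × channels × out_h × out_w), so peak memory scales directly with how many ROIs are fed to the extractor at once. Two generic mitigations, sketched below under assumptions (the function name, chunk size, and call site are illustrative, not the repository's API): set the allocator hint the error message itself suggests, and process the ROIs in fixed-size chunks.

```python
import os
# Allocator hint from the error message; must be set before CUDA initializes.
os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'max_split_size_mb:128')

import torch

def roi_align_chunked(roi_layer, feat, rois, chunk_size=512):
    """Run a RoIAlign layer over `rois` in chunks of at most `chunk_size`
    rows, so the temporary output tensor (the 2.28 GiB allocation above)
    is bounded regardless of the total ROI count."""
    outputs = []
    for start in range(0, rois.size(0), chunk_size):
        outputs.append(roi_layer(feat, rois[start:start + chunk_size]))
    return torch.cat(outputs, dim=0)
```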

Luo-Z13 commented Sep 5, 2023

@Darren-pfchen Could you give some advice on controlling the number of ROIs, please?
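
For reference, one generic way to cap the per-image ROI count before extraction is to subsample the proposal set. The names below are hypothetical, not P2BNet's actual interface; random subsampling trades a little recall for a hard bound on RoIAlign memory.

```python
import torch

def cap_rois(rois, max_rois=2000):
    """Randomly keep at most `max_rois` boxes from an (N, 5) roi tensor."""
    if rois.size(0) <= max_rois:
        return rois
    keep = torch.randperm(rois.size(0), device=rois.device)[:max_rois]
    return rois[keep]
```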

Luo-Z13 changed the title from "OOM even in evaluation" to "OOM in evaluation" on Sep 5, 2023