Requires 800 gigabytes of video memory #147

Mrguanglei · 2024-09-03T05:41:07Z

I finished rendering, and when I started training NeRF on only 20 samples, training tried to allocate more than 800 GiB of GPU memory (see the error at the end of the log below). What is going on? I would appreciate your help.

(instantmesh1) mrguanglei@guanglei:~/3D/InstantMesh$ python train.py --base configs/instant-nerf-large-train.yaml --gpus 0 --num_nodes 1
/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
Seed set to 42
Running on GPUs 0
/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
Some weights of ViTModel were not initialized from the model checkpoint at facebook/dino-vitb16 and are newly initialized: ['encoder.layer.10.adaLN_modulation.1.weight', 'encoder.layer.9.adaLN_modulation.1.bias', 'encoder.layer.5.adaLN_modulation.1.weight', 'encoder.layer.2.adaLN_modulation.1.weight', 'encoder.layer.3.adaLN_modulation.1.bias', 'encoder.layer.10.adaLN_modulation.1.bias', 'encoder.layer.2.adaLN_modulation.1.bias', 'encoder.layer.11.adaLN_modulation.1.weight', 'encoder.layer.0.adaLN_modulation.1.weight', 'encoder.layer.11.adaLN_modulation.1.bias', 'encoder.layer.6.adaLN_modulation.1.weight', 'encoder.layer.7.adaLN_modulation.1.bias', 'encoder.layer.5.adaLN_modulation.1.bias', 'encoder.layer.7.adaLN_modulation.1.weight', 'encoder.layer.6.adaLN_modulation.1.bias', 'encoder.layer.0.adaLN_modulation.1.bias', 'encoder.layer.1.adaLN_modulation.1.bias', 'encoder.layer.3.adaLN_modulation.1.weight', 'encoder.layer.9.adaLN_modulation.1.weight', 'encoder.layer.8.adaLN_modulation.1.bias', 'encoder.layer.8.adaLN_modulation.1.weight', 'encoder.layer.4.adaLN_modulation.1.weight', 'encoder.layer.1.adaLN_modulation.1.weight', 'encoder.layer.4.adaLN_modulation.1.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=VGG16_Weights.IMAGENET1K_V1`. You can also use `weights=VGG16_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
============= length of dataset 12 =============
============= length of dataset 11 =============
accumulate_grad_batches = 1
++++ NOT USING LR SCALING ++++
Setting learning rate to 4.00e-04
[rank: 0] Seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

You are using a CUDA device ('NVIDIA GeForce RTX 4060 Ti') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
============= length of dataset 12 =============
============= length of dataset 11 =============
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Project config
model:
  base_learning_rate: 0.0004
  target: src.model.MVRecon
  params:
    input_size: 320
    render_size: 192
    lrm_generator_config:
      target: src.models.lrm.InstantNeRF
      params:
        encoder_feat_dim: 768
        encoder_freeze: false
        encoder_model_name: facebook/dino-vitb16
        transformer_dim: 512
        transformer_layers: 8
        transformer_heads: 8
        triplane_low_res: 32
        triplane_high_res: 64
        triplane_dim: 80
        rendering_samples_per_ray: 128
data:
  target: src.data.objaverse.DataModuleFromConfig
  params:
    batch_size: 1
    num_workers: 4
    train:
      target: src.data.objaverse.ObjaverseData
      params:
        root_dir: /home/mrguanglei/3D/InstantMesh/data
        meta_fname: valid_paths.json
        input_image_dir: rendering_random_32views
        target_image_dir: rendering_random_32views
        input_view_num: 6
        target_view_num: 4
        total_view_n: 32
        fov: 50
        camera_rotation: true
        validation: false
    validation:
      target: src.data.objaverse.ValidationData
      params:
        root_dir: /home/mrguanglei/3D/InstantMesh/data/vaild
        input_view_num: 6
        input_image_size: 320
        fov: 30
lightning:
  modelcheckpoint:
    params:
      every_n_train_steps: 1000
      save_top_k: -1
      save_last: true
  callbacks: {}
  trainer:
    benchmark: true
    max_epochs: -1
    gradient_clip_val: 1.0
    val_check_interval: 1000
    num_sanity_val_steps: 0
    accumulate_grad_batches: 1
    check_val_every_n_epoch: null
    accelerator: gpu
    devices: 1

  | Name          | Type                                  | Params
0 | lrm_generator | InstantNeRF                           | 152 M
1 | lpips         | LearnedPerceptualImagePatchSimilarity | 14.7 M

152 M     Trainable params
14.7 M    Non-trainable params
166 M     Total params
667.701   Total estimated model params size (MB)
Epoch 0: | | 0/? [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/mrguanglei/3D/InstantMesh/train.py", line 284, in <module>
[rank0]: trainer.fit(model, data)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
[rank0]: call._call_and_handle_interrupt(
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank0]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
[rank0]: return function(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
[rank0]: self._run(model, ckpt_path=ckpt_path)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
[rank0]: results = self._run_stage()
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
[rank0]: self.fit_loop.run()
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
[rank0]: self.advance()
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
[rank0]: self.epoch_loop.run(self._data_fetcher)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
[rank0]: self.advance(data_fetcher)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 240, in advance
[rank0]: batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 187, in run
[rank0]: self._optimizer_step(batch_idx, closure)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 265, in _optimizer_step
[rank0]: call._call_lightning_module_hook(
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
[rank0]: output = fn(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1282, in optimizer_step
[rank0]: optimizer.step(closure=optimizer_closure)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 151, in step
[rank0]: step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 264, in optimizer_step
[rank0]: optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 230, in optimizer_step
[rank0]: return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision.py", line 117, in optimizer_step
[rank0]: return optimizer.step(closure=closure, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
[rank0]: return wrapped(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank0]: out = func(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
[rank0]: ret = func(self, *args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/optim/adamw.py", line 165, in step
[rank0]: loss = closure()
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision.py", line 104, in _wrap_closure
[rank0]: closure_result = closure()
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 140, in __call__
[rank0]: self._result = self.closure(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 126, in closure
[rank0]: step_output = self._step_fn()
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 315, in _training_step
[rank0]: training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
[rank0]: output = fn(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 381, in training_step
[rank0]: return self._forward_redirection(self.model, self.lightning_module, "training_step", *args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 633, in __call__
[rank0]: wrapper_output = wrapper_module(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
[rank0]: else self._run_ddp_forward(*inputs, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1411, in _run_ddp_forward
[rank0]: return self.module(*inputs, **kwargs) # type: ignore[index]
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 626, in wrapped_forward
[rank0]: out = method(*_args, **_kwargs)
[rank0]: File "/home/mrguanglei/3D/InstantMesh/src/model.py", line 196, in training_step
[rank0]: lrm_generator_input, render_gt = self.prepare_batch_data(batch)
[rank0]: File "/home/mrguanglei/3D/InstantMesh/src/model.py", line 84, in prepare_batch_data
[rank0]: target_depths = v2.functional.resize(
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torchvision/transforms/v2/functional/_geometry.py", line 189, in resize
[rank0]: return kernel(inpt, size=size, interpolation=interpolation, max_size=max_size, antialias=antialias)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torchvision/transforms/v2/functional/_geometry.py", line 254, in resize_image
[rank0]: image = interpolate(
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/functional.py", line 4028, in interpolate
[rank0]: return torch._C._nn.upsample_nearest2d(input, output_size, scale_factors)
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 844.10 GiB. GPU
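
A possible reading of the 844.10 GiB figure (an editor's sketch, not confirmed by the log): with a single int `size`, torchvision's `v2.functional.resize` treats the last three dimensions as (C, H, W), scales the shorter of H and W to `size`, and stretches the longer edge to keep the aspect ratio. If `target_depths` reached the resize in `prepare_batch_data` channel-last, for example shaped `(1, 10, 512, 512, 1)` for 6 input + 4 target views, the trailing 1 would be read as the image width and the 512-pixel edge would be stretched by a factor of `size`. The pure-Python sketch below re-derives that output shape and its byte count. The input shapes, 512-pixel renders, float32 depths, and the render size of 294 are all assumptions chosen for illustration. Printing `target_depths.shape` just before the failing `v2.functional.resize` call would confirm or rule this out.

```python
# Back-of-the-envelope estimate of the tensor that a single-int resize asks
# the interpolation kernel to allocate. Pure Python, no torch needed.
# ASSUMPTIONS: float32 data (4 bytes per element), max_size=None, and the
# shorter-edge rule as described in the lead-in (a re-statement, not torchvision code).

GIB = 1024 ** 3


def resized_hw(old_h: int, old_w: int, size: int) -> tuple[int, int]:
    """Output (H, W) when `size` is a single int: the shorter edge becomes
    `size`, the longer edge keeps the aspect ratio (truncated to int)."""
    short, long = (old_w, old_h) if old_w <= old_h else (old_h, old_w)
    new_short, new_long = size, int(size * long / short)
    return (new_long, new_short) if old_w <= old_h else (new_short, new_long)


def resize_output_bytes(shape: tuple[int, ...], size: int, itemsize: int = 4) -> int:
    """Bytes needed for the resize output. The last three dims are read as
    (C, H, W); every leading dim is folded into one batch dimension."""
    *lead, c, h, w = shape
    n = 1
    for d in lead:
        n *= d
    new_h, new_w = resized_hw(h, w, size)
    return n * c * new_h * new_w * itemsize


if __name__ == "__main__":
    # Hypothetical channel-first layout (batch, views, 1, H, W): a few MiB.
    ok = resize_output_bytes((1, 10, 1, 512, 512), 294)
    print(f"channel-first: {ok / 2**20:.1f} MiB")
    # Hypothetical channel-last layout (batch, views, H, W, 1): W is read as 1,
    # so H is stretched to 294 * 512 = 150528 rows.
    bad = resize_output_bytes((1, 10, 512, 512, 1), 294)
    print(f"channel-last: {bad / GIB:.2f} GiB")  # channel-last: 844.10 GiB
```

Under these assumptions the channel-last shape reproduces the reported 844.10 GiB exactly, while a channel-first layout needs only a few MiB. If the shape check confirms it, the problem lies in how the rendered depth maps are loaded, not in the amount of GPU memory available.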


Mrguanglei · 2024-09-03T05:45:57Z

@zawa999 How do I use this? Looking forward to your reply.

Biggaoga · 2024-09-13T15:24:43Z

Same question.
