Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Requires 800 gigabytes of video memory #147

Open
Mrguanglei opened this issue Sep 3, 2024 · 2 comments
Open

Requires 800 gigabytes of video memory #147

Mrguanglei opened this issue Sep 3, 2024 · 2 comments

Comments

@Mrguanglei
Copy link

I finished rendering and when I was ready to train nerf, I only used 20 data sets and found out that I needed quite a lot of memory. What happened? I need your help。

(instantmesh1) mrguanglei@guanglei:~/3D/InstantMesh$ python train.py --base configs/instant-nerf-large-train.yaml --gpus 0 --num_nodes 1
/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torchvision/image.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev'If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
Seed set to 42
Running on GPUs 0
/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
Some weights of ViTModel were not initialized from the model checkpoint at facebook/dino-vitb16 and are newly initialized: ['encoder.layer.10.adaLN_modulation.1.weig
ht', 'encoder.layer.9.adaLN_modulation.1.bias', 'encoder.layer.5.adaLN_modulation.1.weight', 'encoder.layer.2.adaLN_modulation.1.weight', 'encoder.layer.3.adaLN_modu
lation.1.bias', 'encoder.layer.10.adaLN_modulation.1.bias', 'encoder.layer.2.adaLN_modulation.1.bias', 'encoder.layer.11.adaLN_modulation.1.weight', 'encoder.layer.0
.adaLN_modulation.1.weight', 'encoder.layer.11.adaLN_modulation.1.bias', 'encoder.layer.6.adaLN_modulation.1.weight', 'encoder.layer.7.adaLN_modulation.1.bias', 'enc
oder.layer.5.adaLN_modulation.1.bias', 'encoder.layer.7.adaLN_modulation.1.weight', 'encoder.layer.6.adaLN_modulation.1.bias', 'encoder.layer.0.adaLN_modulation.1.bi
as', 'encoder.layer.1.adaLN_modulation.1.bias', 'encoder.layer.3.adaLN_modulation.1.weight', 'encoder.layer.9.adaLN_modulation.1.weight', 'encoder.layer.8.adaLN_modu
lation.1.bias', 'encoder.layer.8.adaLN_modulation.1.weight', 'encoder.layer.4.adaLN_modulation.1.weight', 'encoder.layer.1.adaLN_modulation.1.weight', 'encoder.layer.4.adaLN_modulation.1.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or None
for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing weights=VGG16_Weights.IMAGENET1K_V1. You can also use weights=VGG16_Weights.DEFAULT to get the most up-to-date weights.
warnings.warn(msg)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
============= length of dataset 12 =============
============= length of dataset 11 =============
accumulate_grad_batches = 1
++++ NOT USING LR SCALING ++++
Setting learning rate to 4.00e-04
[rank: 0] Seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1

distributed_backend=nccl
All distributed processes registered. Starting with 1 processes

You are using a CUDA device ('NVIDIA GeForce RTX 4060 Ti') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('mediu m' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
============= length of dataset 12 =============
============= length of dataset 11 =============
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Project config
model:
base_learning_rate: 0.0004
target: src.model.MVRecon
params:
input_size: 320
render_size: 192
lrm_generator_config:
target: src.models.lrm.InstantNeRF
params:
encoder_feat_dim: 768
encoder_freeze: false
encoder_model_name: facebook/dino-vitb16
transformer_dim: 512
transformer_layers: 8
transformer_heads: 8
triplane_low_res: 32
triplane_high_res: 64
triplane_dim: 80
rendering_samples_per_ray: 128
data:
target: src.data.objaverse.DataModuleFromConfig
params:
batch_size: 1
num_workers: 4
train:
target: src.data.objaverse.ObjaverseData
params:
root_dir: /home/mrguanglei/3D/InstantMesh/data
meta_fname: valid_paths.json
input_image_dir: rendering_random_32views
target_image_dir: rendering_random_32views
input_view_num: 6
target_view_num: 4
total_view_n: 32
fov: 50
camera_rotation: true
validation: false
validation:
target: src.data.objaverse.ValidationData
params:
root_dir: /home/mrguanglei/3D/InstantMesh/data/vaild
input_view_num: 6
input_image_size: 320
fov: 30
lightning:
modelcheckpoint:
params:
every_n_train_steps: 1000
save_top_k: -1
save_last: true
callbacks: {}
trainer:
benchmark: true
max_epochs: -1
gradient_clip_val: 1.0
val_check_interval: 1000
num_sanity_val_steps: 0
accumulate_grad_batches: 1
check_val_every_n_epoch: null
accelerator: gpu
devices: 1

| Name | Type | Params

0 | lrm_generator | InstantNeRF | 152 M
1 | lpips | LearnedPerceptualImagePatchSimilarity | 14.7 M

152 M Trainable params
14.7 M Non-trainable params
166 M Total params
667.701 Total estimated model params size (MB)
Epoch 0: | | 0/? [00:00<?, ?it/s][rank0]: Traceback (most recent call last):
[rank0]: File "/home/mrguanglei/3D/InstantMesh/train.py", line 284, in
[rank0]: trainer.fit(model, data)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
[rank0]: call._call_and_handle_interrupt(
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank0]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
[rank0]: return function(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
[rank0]: self._run(model, ckpt_path=ckpt_path)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
[rank0]: results = self._run_stage()
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
[rank0]: self.fit_loop.run()
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
[rank0]: self.advance()
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
[rank0]: self.epoch_loop.run(self._data_fetcher)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
[rank0]: self.advance(data_fetcher)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 240, in advance
[rank0]: batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 187, in run
[rank0]: self._optimizer_step(batch_idx, closure)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 265, in _optimizer_step
[rank0]: call._call_lightning_module_hook(
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
[rank0]: output = fn(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1282, in optimizer_step
[rank0]: optimizer.step(closure=optimizer_closure)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 151, in step
[rank0]: step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 264, in optimizer_step
[rank0]: optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 230, in optimizer_step
[rank0]: return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision.py", line 117, in optimizer_step
[rank0]: return optimizer.step(closure=closure, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 75, in wrapper
[rank0]: return wrapped(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/optim/optimizer.py", line 391, in wrapper
[rank0]: out = func(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
[rank0]: ret = func(self, *args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/optim/adamw.py", line 165, in step
[rank0]: loss = closure()
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision.py", line 104, in _wrap_closure
[rank0]: closure_result = closure()
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 140, in call
[rank0]: self._result = self.closure(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 126, in closure
[rank0]: step_output = self._step_fn()
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 315, in _training_step
[rank0]: training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 309, in _call_strategy_hook
[rank0]: output = fn(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 381, in training_step
[rank0]: return self._forward_redirection(self.model, self.lightning_module, "training_step", *args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 633, in call
[rank0]: wrapper_output = wrapper_module(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1593, in forward
[rank0]: else self._run_ddp_forward(*inputs, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1411, in _run_ddp_forward
[rank0]: return self.module(*inputs, **kwargs) # type: ignore[index]
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(args, **kwargs)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 626, in wrapped_forward
[rank0]: out = method(
_args, **_kwargs)
[rank0]: File "/home/mrguanglei/3D/InstantMesh/src/model.py", line 196, in training_step
[rank0]: lrm_generator_input, render_gt = self.prepare_batch_data(batch)
[rank0]: File "/home/mrguanglei/3D/InstantMesh/src/model.py", line 84, in prepare_batch_data
[rank0]: target_depths = v2.functional.resize(
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torchvision/transforms/v2/functional/_geometry.py", line 189, in resize
[rank0]: return kernel(inpt, size=size, interpolation=interpolation, max_size=max_size, antialias=antialias)
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torchvision/transforms/v2/functional/_geometry.py", line 254, in resize_image
[rank0]: image = interpolate(
[rank0]: File "/home/mrguanglei/anaconda3/envs/instantmesh1/lib/python3.10/site-packages/torch/nn/functional.py", line 4028, in interpolate
[rank0]: return torch._C._nn.upsample_nearest2d(input, output_size, scale_factors)
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 844.10 GiB. GPU

@Mrguanglei
Copy link
Author

@zawa999 How to use this, looking forward to your reply

@Biggaoga
Copy link

same question

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants
@Mrguanglei @Biggaoga and others