Amazing repo! I wanted to put together a short guide to fine-tuning this in 2024 on RunPod or similar, since getting it working took piecing together info from a few different issues.
I struggled a little; hope this helps someone else.
Follow the repo instructions to set up the repo and download the data.
Get an A100 from RunPod or similar.
To run this on more recent GPUs that don't support the older CUDA settings used in the repo, the following environment script worked for me. Note that it uses a newer Python than the one in the repo and a newer CUDA. I couldn't get training running on an A100 with any of the setups in the repo: the arch they used was too old, and it defaulted to CPU on every GPU I had available. Might just be me; I'm sure I missed something.
#!/bin/bash
set -e
# User configuration
ENV_DIR="inpenv"
PYTHON_PATH="/usr/bin/python3"
REQUIREMENTS_FILE="requirements.txt"
# PyTorch + CUDA versions
TORCH_VERSION="1.8.0"
TORCHVISION_VERSION="0.9.0"
TORCHAUDIO_VERSION="0.8.0"
CUDA_VERSION_TAG="+cu111"
TORCH_WHL_URL="https://download.pytorch.org/whl/torch_stable.html"
# Install system dependencies
echo "[INFO] Installing system dependencies..."
apt-get update && apt-get install -y \
    build-essential \
    git \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    libgl1-mesa-glx \
    wget \
    gnupg2
# Add NVIDIA repository and install CUDA
echo "[INFO] Setting up NVIDIA CUDA repository..."
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
dpkg -i cuda-keyring_1.0-1_all.deb
apt-get update
apt-get install -y cuda-11-1
# Remove existing environment if it exists
if [ -d "$ENV_DIR" ]; then
    echo "[INFO] Removing existing virtual environment..."
    rm -rf "$ENV_DIR"
fi
echo "[INFO] Creating virtual environment..."
$PYTHON_PATH -m pip install --upgrade pip
$PYTHON_PATH -m pip install virtualenv
$PYTHON_PATH -m virtualenv $ENV_DIR --python=$PYTHON_PATH
echo "[INFO] Activating virtual environment..."
source $ENV_DIR/bin/activate
# Ensure pip is up to date in the virtual environment
python -m pip install --upgrade pip
# Install PyTorch first
echo "[INFO] Installing PyTorch $TORCH_VERSION with CUDA 11.1..."
pip install torch==${TORCH_VERSION}${CUDA_VERSION_TAG} \
    torchvision==${TORCHVISION_VERSION}${CUDA_VERSION_TAG} \
    torchaudio==${TORCHAUDIO_VERSION} \
    -f $TORCH_WHL_URL
# Verify PyTorch installation
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
# Install core dependencies first
echo "[INFO] Installing core dependencies..."
pip install protobuf==3.20.0
pip install hydra-core==1.1.0
pip install pytorch-lightning==1.2.9
pip install numpy==1.19.2
# Now install from requirements
if [ -f "$REQUIREMENTS_FILE" ]; then
    echo "[INFO] Installing packages from $REQUIREMENTS_FILE..."
    pip install -r $REQUIREMENTS_FILE
else
    echo "[WARN] $REQUIREMENTS_FILE not found, skipping requirements installation."
fi
# Set environment variables
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export TORCH_HOME=$(pwd)
export LD_LIBRARY_PATH="/usr/local/cuda-11.1/lib64:$LD_LIBRARY_PATH"
echo "[INFO] Setup completed successfully!"
echo "[INFO] Environment variables set:"
echo "PYTHONPATH=$PYTHONPATH"
echo "TORCH_HOME=$TORCH_HOME"
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
echo ""
echo "To activate the environment, run:"
echo "source $ENV_DIR/bin/activate"
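# Add persistent environment variables to the activate script.
# The repo path is resolved now, so it stays correct no matter which
# directory the environment is later activated from.
REPO_DIR="$(pwd)"
echo "export PYTHONPATH=\"\${PYTHONPATH}:${REPO_DIR}\"" >> "$ENV_DIR/bin/activate"
echo "export TORCH_HOME=\"${REPO_DIR}\"" >> "$ENV_DIR/bin/activate"
echo 'export LD_LIBRARY_PATH="/usr/local/cuda-11.1/lib64:$LD_LIBRARY_PATH"' >> "$ENV_DIR/bin/activate"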
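Since the failure mode described above was a silent fallback to CPU, it may be worth checking more than `torch.cuda.is_available()` once the script has finished. The following is a minimal sketch, not part of the repo: the helper names `arch_supported` and `describe_gpu_support` are made up for illustration, and the exact-match test is a conservative assumption (PyTorch can sometimes run on a GPU through a compatible architecture or PTX even without an exact `sm_XY` entry). An A100 reports compute capability 8.0, so the entry to look for is `sm_80`.

```python
"""Sketch: check whether the installed PyTorch build targets the visible GPU."""


def arch_supported(capability, arch_list):
    """Return True if a (major, minor) capability has an exact sm_XY entry.

    Hypothetical helper; an exact match is a conservative check.
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def describe_gpu_support():
    """Return a one-line summary of GPU support for device 0."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "CUDA is not available; training would fall back to CPU"
    # Guarded lookup, in case the installed version lacks this function.
    get_arch_list = getattr(torch.cuda, "get_arch_list", None)
    if get_arch_list is None:
        return "This PyTorch version does not expose torch.cuda.get_arch_list()"
    capability = torch.cuda.get_device_capability(0)
    arch_list = get_arch_list()
    name = torch.cuda.get_device_name(0)
    verdict = "is" if arch_supported(capability, arch_list) else "is NOT"
    return (
        f"{name} (sm_{capability[0]}{capability[1]}) {verdict} "
        f"in the build's arch list: {arch_list}"
    )


if __name__ == "__main__":
    print(describe_gpu_support())
```

Run it inside the activated environment; if the summary says the GPU's architecture is missing, the wheel tag (`+cu111` here) is the first thing to revisit.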
With the following requirements.txt
pyyaml
tqdm
numpy==1.19.2
easydict==1.9.0
scikit-image==0.17.2
scikit-learn==0.24.2
opencv-python
tensorflow==2.6.0
joblib
matplotlib
pandas==1.2.5
albumentations==0.5.2
hydra-core==1.1.0
pytorch-lightning==1.2.9
tabulate
kornia==0.5.0
webdataset
packaging
wldhx.yadisk-direct
protobuf==3.20.0 # Add this line to fix the protobuf issue
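The setup script installs four pinned packages before running `pip install -r requirements.txt`, so the two sets of pins need to agree or pip will quietly replace the earlier versions. The sketch below is one way to check that, assuming simple `name==version` lines like the ones above; `CORE_PINS`, `parse_pins`, and `find_conflicts` are hypothetical names, and the parser deliberately ignores anything more complex (extras, markers, version ranges).

```python
"""Sketch: compare the script's core pins against requirements.txt."""
from pathlib import Path

# Versions the setup script installs before the requirements file.
CORE_PINS = {
    "protobuf": "3.20.0",
    "hydra-core": "1.1.0",
    "pytorch-lightning": "1.2.9",
    "numpy": "1.19.2",
}


def parse_pins(lines):
    """Return {package: version} for every 'name==version' line."""
    pins = {}
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if "==" not in line:
            continue  # unpinned packages such as 'tqdm' are skipped
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_conflicts(core, pins):
    """Return {package: (core_version, file_version)} where the pins differ."""
    return {
        name: (version, pins[name])
        for name, version in core.items()
        if name in pins and pins[name] != version
    }


if __name__ == "__main__":
    req = Path("requirements.txt")
    if req.is_file():
        conflicts = find_conflicts(CORE_PINS, parse_pins(req.read_text().splitlines()))
        print(conflicts or "No conflicting pins found")
    else:
        print("requirements.txt not found")
```

With the requirements.txt listed above, all four core pins match, so no conflicts would be reported.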
baptistecumin changed the title from "[not an issue] Finetuning helper for Runpod a100" to "[not an issue] Finetuning step by step helper for Runpod a100" on Dec 13, 2024
Set the environment variables:
export TORCH_HOME=$(pwd) && export PYTHONPATH=$(pwd) && export USER=root
Build your dataset. The steps in the repo README work well for this. How you do it is up to you, but 3 pointers that helped me
Look at this issue to get the model weights! #96 (comment)
python3 bin/train.py -cn big-lama location=my_dataset ++trainer.kwargs.resume_from_checkpoint=/workspace/lama/model_weights_manual/big-lama-with-discr-remove-loss_segm_pl.ckpt ++data.batch_size=24
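The training command above packs several Hydra overrides into one line, and a wrong checkpoint path is easy to miss until the run fails. As a rough sketch, the command can be assembled from named parts and the checkpoint checked first. `build_train_command` is a hypothetical helper, not something the repo provides; the override keys and default values are copied from the command above, and the script only prints the command rather than launching it.

```python
"""Sketch: assemble the fine-tuning command and sanity-check the checkpoint."""
import shlex
from pathlib import Path


def build_train_command(config_name, location, checkpoint, batch_size):
    """Return the train.py invocation as an argument list (hypothetical helper)."""
    return [
        "python3",
        "bin/train.py",
        "-cn",
        config_name,
        f"location={location}",
        f"++trainer.kwargs.resume_from_checkpoint={checkpoint}",
        f"++data.batch_size={batch_size}",
    ]


if __name__ == "__main__":
    # Values taken from the command in this guide; adjust to your own setup.
    checkpoint = Path(
        "/workspace/lama/model_weights_manual/"
        "big-lama-with-discr-remove-loss_segm_pl.ckpt"
    )
    if not checkpoint.is_file():
        print(f"[WARN] checkpoint not found: {checkpoint}")
    cmd = build_train_command("big-lama", "my_dataset", checkpoint, 24)
    # Print a copy-pasteable command instead of running it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere; the argument list could be passed to `subprocess.run` from the repo root once the path check passes.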
The evaluation output is great: check workspace/lama/experiments/<your_experiment>/samples/epoch0000_test/batch0000001.jpg
These images show visual examples of how your model is doing as training progresses.
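To avoid hunting for the newest sample images by hand, something like the following sketch could list them. It assumes the layout shown in the path above (a `samples` folder containing zero-padded `epochNNNN_test` directories with `.jpg` files); `latest_samples` is a hypothetical helper, and the experiment directory is passed on the command line.

```python
"""Sketch: list the sample images from the most recent evaluation epoch."""
import sys
from pathlib import Path


def latest_samples(experiment_dir):
    """Return sorted .jpg paths from the newest epoch*_test directory.

    Relies on zero-padded epoch numbers so that name order is epoch order.
    """
    samples = Path(experiment_dir) / "samples"
    if not samples.is_dir():
        return []
    epoch_dirs = sorted(p for p in samples.glob("epoch*_test") if p.is_dir())
    if not epoch_dirs:
        return []
    return sorted(epoch_dirs[-1].glob("*.jpg"))


if __name__ == "__main__":
    # Pass the experiment directory as the first argument; defaults to ".".
    experiment = sys.argv[1] if len(sys.argv) > 1 else "."
    images = latest_samples(experiment)
    if not images:
        print(f"No sample images found under {experiment}")
    for image in images:
        print(image)
```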