[not an issue] Finetuning step by step helper for Runpod a100 #341

baptistecumin commented Dec 13, 2024

Amazing repo! I just wanted to put together a short guide to finetuning this in 2024 on RunPod or similar, since getting it working took piecing together info from a few different issues.

I struggled a little, hope this helps someone else.

  1. Follow the instructions to set up the repo + download the data.
  2. Get an A100 from RunPod or similar.
  3. To run this on more recent GPUs that don't support the older CUDA build arguments, the env setup below worked for me. Note that it upgrades Python relative to the one in the repo, and upgrades CUDA. I couldn't get it running on an A100 with any of the setups in the repo: the arch they target is too old, so it fell back to CPU on every GPU I had available. Might just be me, I'm sure I missed something.
```bash
#!/bin/bash
set -e

# User configuration
ENV_DIR="inpenv"
PYTHON_PATH="/usr/bin/python3"
REQUIREMENTS_FILE="requirements.txt"

# PyTorch + CUDA versions
TORCH_VERSION="1.8.0"
TORCHVISION_VERSION="0.9.0"
TORCHAUDIO_VERSION="0.8.0"
CUDA_VERSION_TAG="+cu111"
TORCH_WHL_URL="https://download.pytorch.org/whl/torch_stable.html"

# Install system dependencies
echo "[INFO] Installing system dependencies..."
apt-get update && apt-get install -y \
    build-essential \
    git \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    libgl1-mesa-glx \
    wget \
    gnupg2

# Add NVIDIA repository and install CUDA
echo "[INFO] Setting up NVIDIA CUDA repository..."
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
dpkg -i cuda-keyring_1.0-1_all.deb
apt-get update
apt-get install -y cuda-11-1

# Remove existing environment if it exists
if [ -d "$ENV_DIR" ]; then
    echo "[INFO] Removing existing virtual environment..."
    rm -rf $ENV_DIR
fi

echo "[INFO] Creating virtual environment..."
$PYTHON_PATH -m pip install --upgrade pip
$PYTHON_PATH -m pip install virtualenv
$PYTHON_PATH -m virtualenv $ENV_DIR --python=$PYTHON_PATH

echo "[INFO] Activating virtual environment..."
source $ENV_DIR/bin/activate

# Ensure pip is up to date in the virtual environment
python -m pip install --upgrade pip

# Install PyTorch first
echo "[INFO] Installing PyTorch $TORCH_VERSION with CUDA 11.1..."
pip install torch==${TORCH_VERSION}${CUDA_VERSION_TAG} \
            torchvision==${TORCHVISION_VERSION}${CUDA_VERSION_TAG} \
            torchaudio==${TORCHAUDIO_VERSION} \
            -f $TORCH_WHL_URL

# Verify PyTorch installation
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

# Install core dependencies first
echo "[INFO] Installing core dependencies..."
pip install protobuf==3.20.0
pip install hydra-core==1.1.0
pip install pytorch-lightning==1.2.9
pip install numpy==1.19.2

# Now install from requirements
if [ -f "$REQUIREMENTS_FILE" ]; then
    echo "[INFO] Installing packages from $REQUIREMENTS_FILE..."
    pip install -r $REQUIREMENTS_FILE
else
    echo "[WARN] $REQUIREMENTS_FILE not found, skipping requirements installation."
fi

# Set environment variables
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export TORCH_HOME=$(pwd)
export LD_LIBRARY_PATH="/usr/local/cuda-11.1/lib64:$LD_LIBRARY_PATH"

echo "[INFO] Setup completed successfully!"
echo "[INFO] Environment variables set:"
echo "PYTHONPATH=$PYTHONPATH"
echo "TORCH_HOME=$TORCH_HOME"
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
echo ""
echo "To activate the environment, run:"
echo "source $ENV_DIR/bin/activate"

# Add persistent environment variables
echo 'export PYTHONPATH="${PYTHONPATH}:$(pwd)"' >> $ENV_DIR/bin/activate
echo 'export TORCH_HOME=$(pwd)' >> $ENV_DIR/bin/activate
echo 'export LD_LIBRARY_PATH="/usr/local/cuda-11.1/lib64:$LD_LIBRARY_PATH"' >> $ENV_DIR/bin/activate
```

With the following requirements.txt

```
pyyaml
tqdm
numpy==1.19.2
easydict==1.9.0
scikit-image==0.17.2
scikit-learn==0.24.2
opencv-python
tensorflow==2.6.0
joblib
matplotlib
pandas==1.2.5
albumentations==0.5.2
hydra-core==1.1.0
pytorch-lightning==1.2.9
tabulate
kornia==0.5.0
webdataset
packaging
wldhx.yadisk-direct
protobuf==3.20.0  # add this to fix the protobuf issue
```
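
Once the setup script finishes, it's worth double-checking that this torch build actually targets the A100 instead of silently falling back to CPU. A minimal check, assuming the cu111 wheels above (torch.cuda.get_arch_list() should exist on torch 1.8; skip that line if your version doesn't have it):

```python
# Quick sanity check that the install targets the A100 (sm_80).
import torch

print(torch.__version__)              # expect 1.8.0+cu111
print(torch.cuda.is_available())      # expect True
print(torch.cuda.get_device_name(0))  # expect an A100
print(torch.cuda.get_arch_list())     # should include 'sm_80'
```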
  4. Set env variables
    export TORCH_HOME=$(pwd) && export PYTHONPATH=$(pwd) && export USER=root

  5. Build your dataset. The steps in the repo readme are good for this. This part is up to you, but three pointers that helped me:

  • Make sure you update InpaintingTrainDataset if your images are PNGs, otherwise the data loader will be empty (see the sketch after this list).
  • I had to update abl-04-256-mh-dist.yaml with absolute paths to my datasets for train, eval and test, otherwise it wasn't picking up my data.
  • I also had to lower val_check_interval in any_gpu_large_ssim_ddp_final.yaml because I didn't have much validation data.
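
For the PNG point above: in my checkout the training dataset builds its file list with a glob that only matches *.jpg, so a folder of PNGs gives an empty loader. A rough sketch of the kind of change I mean (look for wherever InpaintingTrainDataset builds its file list, e.g. under saicinpainting/training/data/ in my copy; yours may differ):

```python
# Sketch only: extend the training image glob so PNGs are picked up too.
# The default list is built roughly like the jpg-only line below; the exact
# code in your InpaintingTrainDataset may look a bit different.
import glob
import os

def list_train_images(indir):
    # jpg-only (roughly the default -> empty loader for PNG datasets)
    jpgs = glob.glob(os.path.join(indir, '**', '*.jpg'), recursive=True)
    # also pick up PNGs so the loader actually sees the images
    pngs = glob.glob(os.path.join(indir, '**', '*.png'), recursive=True)
    return sorted(jpgs + pngs)
```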
  6. Train.

Look at this issue to get the model weights! #96 (comment)

python3 bin/train.py -cn big-lama location=my_dataset ++trainer.kwargs.resume_from_checkpoint=/workspace/lama/model_weights_manual/big-lama-with-discr-remove-loss_segm_pl.ckpt ++data.batch_size=24

  7. Eval.

The evaluation output is great: check /workspace/lama/experiments/<your_experiment>/samples/epoch0000_test/batch0000001.jpg

They show visual examples of how your model is training as it goes.
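
If you don't want to hunt for the filenames by hand, a tiny (hypothetical) helper like this prints the most recently written samples; adjust the experiments path to wherever your run writes its outputs:

```python
# Hypothetical convenience script: list the newest sample images so you can
# eyeball training progress without digging through the experiments folder.
import glob
import os

samples = glob.glob('/workspace/lama/experiments/*/samples/**/*.jpg', recursive=True)
for path in sorted(samples, key=os.path.getmtime)[-5:]:
    print(path)
```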
