[not an issue] Finetuning step by step helper for Runpod a100 #341

baptistecumin commented Dec 13, 2024

Amazing repo! I just wanted to put together a short guide to finetuning this in 2024 on RunPod or similar, since getting it working took piecing together info from a few different issues.

I struggled a little, hope this helps someone else.

  1. Follow the instructions to set up the repo + download the data.
  2. Get an A100 from RunPod or similar.
  3. To run this on more recent GPUs that don't support the older CUDA build arguments, the env setup below worked for me. Note that it upgrades Python relative to the one in the repo, and upgrades CUDA. I couldn't get it running on an A100 with any of the setups in the repo: the arch they target is too old, so it fell back to CPU on every GPU I had available. Might just be me, I'm sure I missed something.
```bash
#!/bin/bash
set -e

# User configuration
ENV_DIR="inpenv"
PYTHON_PATH="/usr/bin/python3"
REQUIREMENTS_FILE="requirements.txt"

# PyTorch + CUDA versions
TORCH_VERSION="1.8.0"
TORCHVISION_VERSION="0.9.0"
TORCHAUDIO_VERSION="0.8.0"
CUDA_VERSION_TAG="+cu111"
TORCH_WHL_URL="https://download.pytorch.org/whl/torch_stable.html"

# Install system dependencies
echo "[INFO] Installing system dependencies..."
apt-get update && apt-get install -y \
    build-essential \
    git \
    libglib2.0-0 \
    libsm6 \
    libxext6 \
    libxrender-dev \
    libgl1-mesa-glx \
    wget \
    gnupg2

# Add NVIDIA repository and install CUDA
echo "[INFO] Setting up NVIDIA CUDA repository..."
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
dpkg -i cuda-keyring_1.0-1_all.deb
apt-get update
apt-get install -y cuda-11-1

# Remove existing environment if it exists
if [ -d "$ENV_DIR" ]; then
    echo "[INFO] Removing existing virtual environment..."
    rm -rf $ENV_DIR
fi

echo "[INFO] Creating virtual environment..."
$PYTHON_PATH -m pip install --upgrade pip
$PYTHON_PATH -m pip install virtualenv
$PYTHON_PATH -m virtualenv $ENV_DIR --python=$PYTHON_PATH

echo "[INFO] Activating virtual environment..."
source $ENV_DIR/bin/activate

# Ensure pip is up to date in the virtual environment
python -m pip install --upgrade pip

# Install PyTorch first
echo "[INFO] Installing PyTorch $TORCH_VERSION with CUDA 11.1..."
pip install torch==${TORCH_VERSION}${CUDA_VERSION_TAG} \
            torchvision==${TORCHVISION_VERSION}${CUDA_VERSION_TAG} \
            torchaudio==${TORCHAUDIO_VERSION} \
            -f $TORCH_WHL_URL

# Verify PyTorch installation
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

# Install core dependencies first
echo "[INFO] Installing core dependencies..."
pip install protobuf==3.20.0
pip install hydra-core==1.1.0
pip install pytorch-lightning==1.2.9
pip install numpy==1.19.2

# Now install from requirements
if [ -f "$REQUIREMENTS_FILE" ]; then
    echo "[INFO] Installing packages from $REQUIREMENTS_FILE..."
    pip install -r $REQUIREMENTS_FILE
else
    echo "[WARN] $REQUIREMENTS_FILE not found, skipping requirements installation."
fi

# Set environment variables
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
export TORCH_HOME=$(pwd)
export LD_LIBRARY_PATH="/usr/local/cuda-11.1/lib64:$LD_LIBRARY_PATH"

echo "[INFO] Setup completed successfully!"
echo "[INFO] Environment variables set:"
echo "PYTHONPATH=$PYTHONPATH"
echo "TORCH_HOME=$TORCH_HOME"
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
echo ""
echo "To activate the environment, run:"
echo "source $ENV_DIR/bin/activate"

# Add persistent environment variables
echo 'export PYTHONPATH="${PYTHONPATH}:$(pwd)"' >> $ENV_DIR/bin/activate
echo 'export TORCH_HOME=$(pwd)' >> $ENV_DIR/bin/activate
echo 'export LD_LIBRARY_PATH="/usr/local/cuda-11.1/lib64:$LD_LIBRARY_PATH"' >> $ENV_DIR/bin/activate
```

With the following requirements.txt

```
pyyaml
tqdm
numpy==1.19.2
easydict==1.9.0
scikit-image==0.17.2
scikit-learn==0.24.2
opencv-python
tensorflow==2.6.0
joblib
matplotlib
pandas==1.2.5
albumentations==0.5.2
hydra-core==1.1.0
pytorch-lightning==1.2.9
tabulate
kornia==0.5.0
webdataset
packaging
wldhx.yadisk-direct
protobuf==3.20.0  # add this to fix the protobuf issue
```
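
Once the setup script finishes, it's worth double-checking that this torch build actually targets the A100 instead of silently falling back to CPU. A minimal check, assuming the cu111 wheels above (torch.cuda.get_arch_list() should exist on torch 1.8; skip that line if your version doesn't have it):

```python
# Quick sanity check that the install targets the A100 (sm_80).
import torch

print(torch.__version__)              # expect 1.8.0+cu111
print(torch.cuda.is_available())      # expect True
print(torch.cuda.get_device_name(0))  # expect an A100
print(torch.cuda.get_arch_list())     # should include 'sm_80'
```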
  4. Set env variables
    export TORCH_HOME=$(pwd) && export PYTHONPATH=$(pwd) && export USER=root

  5. Build your dataset. The steps in the repo readme are good for this. This part is up to you, but three pointers that helped me:

  • Make sure you update InpaintingTrainDataset if your images are PNGs, otherwise the data loader will be empty (see the sketch after this list).
  • I had to update abl-04-256-mh-dist.yaml with absolute paths to my datasets for train, eval and test, otherwise it wasn't picking up my data.
  • I also had to lower val_check_interval in any_gpu_large_ssim_ddp_final.yaml because I didn't have much validation data.
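
For the PNG point above: in my checkout the training dataset builds its file list with a glob that only matches *.jpg, so a folder of PNGs gives an empty loader. A rough sketch of the kind of change I mean (look for wherever InpaintingTrainDataset builds its file list, e.g. under saicinpainting/training/data/ in my copy; yours may differ):

```python
# Sketch only: extend the training image glob so PNGs are picked up too.
# The default list is built roughly like the jpg-only line below; the exact
# code in your InpaintingTrainDataset may look a bit different.
import glob
import os

def list_train_images(indir):
    # jpg-only (roughly the default -> empty loader for PNG datasets)
    jpgs = glob.glob(os.path.join(indir, '**', '*.jpg'), recursive=True)
    # also pick up PNGs so the loader actually sees the images
    pngs = glob.glob(os.path.join(indir, '**', '*.png'), recursive=True)
    return sorted(jpgs + pngs)
```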
  6. Train.

Look at this issue to get the model weights! #96 (comment)

python3 bin/train.py -cn big-lama location=my_dataset ++trainer.kwargs.resume_from_checkpoint=/workspace/lama/model_weights_manual/big-lama-with-discr-remove-loss_segm_pl.ckpt ++data.batch_size=24

  7. Eval.

The evaluation output is great: check /workspace/lama/experiments/<your_experiment>/samples/epoch0000_test/batch0000001.jpg

They show visual examples of how your model is training as it goes.
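
If you don't want to hunt for the filenames by hand, a tiny (hypothetical) helper like this prints the most recently written samples; adjust the experiments path to wherever your run writes its outputs:

```python
# Hypothetical convenience script: list the newest sample images so you can
# eyeball training progress without digging through the experiments folder.
import glob
import os

samples = glob.glob('/workspace/lama/experiments/*/samples/**/*.jpg', recursive=True)
for path in sorted(samples, key=os.path.getmtime)[-5:]:
    print(path)
```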
