Flexible Vision Transformer

A flexible PyTorch implementation of the Vision Transformer (ViT) model for image classification tasks, inspired by the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Dosovitskiy et al.

Overview

This repository provides a modular and customizable Vision Transformer (ViT) model that adapts the Transformer architecture for image classification. By treating an image as a sequence of patches, the model leverages self-attention mechanisms to capture global contextual relationships within the image.
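
To make the patch-sequence idea concrete, the snippet below (a minimal illustration, not code from this repository) uses einops to turn a batch of images into the flattened patch sequence a ViT consumes:

import torch
from einops import rearrange

images = torch.randn(8, 3, 224, 224)   # batch of 8 RGB 224x224 images
patches = rearrange(
    images,
    'b c (h p1) (w p2) -> b (h w) (p1 p2 c)',
    p1=16, p2=16,
)                                       # non-overlapping 16x16 patches, flattened
print(patches.shape)                    # torch.Size([8, 196, 768])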

Features

  • Patch Embedding: Divides images into fixed-size patches and embeds them.
  • Positional Embedding: Adds positional information to patch embeddings to retain spatial structure.
  • Transformer Encoder Blocks: Utilizes multi-head self-attention and feed-forward networks with residual connections and layer normalization.
  • Classification Head: Outputs class probabilities from the encoded features.
  • Configurable Parameters: Easily adjust model dimensions, number of layers, attention heads, and more.
  • Checkpointing: Save and load model checkpoints during training (a save-side sketch follows this list).
  • Visualization: Utility functions to visualize image samples.
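
The load side of checkpointing is shown later via load_model. For the save side, a minimal sketch is given below; it assumes the checkpoint bundles model state, optimizer state, epoch, and loss, which may not match the exact keys used in ViT/train.py:

import torch

def save_checkpoint(model, optimizer, epoch, loss, path):
    # Bundle everything needed to resume training into a single file.
    torch.save({
        'epoch': epoch,
        'loss': loss,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, path)

# e.g. save_checkpoint(model, optimizer, epoch, loss, 'ViT/models/vit_checkpoint.pt')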

Installation

Clone the Repository and Install Dependencies

git clone https://github.com/T4ras123/Flexible-ViT.git
cd Flexible-ViT
pip install -r requirements.txt

Install via PyPI

pip install vision-transformer

Usage

Training the Model

Train the ViT model using the provided train.py script with default parameters:

python train.py --data_path /path/to/dataset --epochs 100

Customizing Training Parameters

You can customize the training process by providing additional command-line arguments:

python train.py \
    --data_path ./data \
    --epochs 200 \
    --learning_rate 0.0005 \
    --batch_size 64 \
    --image_size 224 \
    --patch_size 16 \
    --emb_dim 768 \
    --n_layers 12 \
    --heads 12 \
    --dropout 0.1
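
Note that image_size should be divisible by patch_size; the sequence length seen by the encoder follows directly from the two:

image_size, patch_size = 224, 16
num_patches = (image_size // patch_size) ** 2   # 14 * 14 = 196 patch tokens per image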

Available Arguments

  • --data_path: Path to the dataset.
  • --weights: Path to a saved model weights file to load.
  • --epochs: Number of training epochs.
  • --learning_rate: Learning rate for the optimizer.
  • --batch_size: Number of samples per batch.
  • --image_size: Size (height and width) of input images (default: 144).
  • --patch_size: Size of each image patch (default: 4).
  • --emb_dim: Embedding dimension (default: 32).
  • --n_layers: Number of Transformer encoder layers (default: 6).
  • --heads: Number of attention heads (default: 2).
  • --dropout: Dropout rate (default: 0.1).

Loading a Saved Model

Load a previously saved model checkpoint:

import torch
import torch.optim as optim
from ViT.train import ViT, load_model

# Rebuild the model with the same hyperparameters it was trained with.
model = ViT(
    ch=3,
    img_size=224,
    patch_size=16,
    emb_dim=768,
    n_layers=12,
    out_dim=1000,
    dropout=0.1,
    heads=12
).to('cuda')

optimizer = optim.AdamW(model.parameters(), lr=0.0005)

# Restore model and optimizer state from the checkpoint;
# returns the epoch and loss recorded when it was saved.
epoch, loss = load_model(model, optimizer, 'ViT/models/vit_checkpoint.pt')

Evaluating the Model

Evaluate the trained model on the test dataset:

python evaluate.py --data_path /path/to/dataset --model_path ViT/models/vit_checkpoint.pt
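
If you prefer to evaluate from Python rather than the script, an accuracy loop is only a few lines; the sketch below is illustrative (model and test_loader are placeholders, and evaluate.py may report different metrics):

import torch

@torch.no_grad()
def top1_accuracy(model, loader, device='cuda'):
    model.eval()
    correct = total = 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)        # predicted class per image
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return correct / total

# e.g. print(f'Top-1 accuracy: {top1_accuracy(model, test_loader):.3f}')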

Model Architecture

The Vision Transformer model consists of the following components:

  1. Patch Embedding: Converts input images into a sequence of flattened patch embeddings.
  2. Positional Embedding: Adds positional information to each patch embedding.
  3. Transformer Encoder Blocks: Comprises layers of multi-head self-attention and feed-forward networks with residual connections and layer normalization.
  4. Classification Head: Maps the encoded features to output class probabilities.
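
Put together, the forward pass follows the standard ViT pipeline. The sketch below shows only the composition; the attribute names (patch_embed, pos_embed, blocks, head) and the mean-pooling readout are illustrative assumptions, not necessarily what ViT/train.py does (the original ViT prepends a class token instead):

import torch
import torch.nn as nn

class ViTSketch(nn.Module):
    # Composition only; names and readout are placeholders.
    def __init__(self, num_patches=196, patch_dim=16 * 16 * 3,
                 emb_dim=768, n_layers=12, heads=12, out_dim=1000):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, emb_dim)                     # 1. patch embedding
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, emb_dim))  # 2. positional embedding
        self.blocks = nn.ModuleList([                                        # 3. encoder blocks
            nn.TransformerEncoderLayer(emb_dim, heads, batch_first=True)
            for _ in range(n_layers)
        ])
        self.head = nn.Linear(emb_dim, out_dim)                              # 4. classification head

    def forward(self, patches):            # patches: (batch, num_patches, patch_dim)
        x = self.patch_embed(patches) + self.pos_embed
        for block in self.blocks:
            x = block(x)
        return self.head(x.mean(dim=1))    # pool the tokens, then classify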

Key Components

  • PatchEmbedding: Splits the image into patches and projects them into an embedding space.
  • Attention: Implements multi-head self-attention mechanisms.
  • FeedForward: A two-layer fully connected network with GELU activation and dropout.
  • Block: Combines attention and feed-forward layers with layer normalization and residual connections.
  • ViT: The main Vision Transformer model class that assembles all components.
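
For orientation, here is what a pre-norm Block of this kind typically looks like, written against torch.nn rather than the repository's own Attention and FeedForward classes; treat it as a sketch of the pattern, since the actual implementation in ViT/train.py may differ in details such as norm placement:

import torch
import torch.nn as nn

class Block(nn.Module):
    # Pre-norm Transformer encoder block: attention and MLP sublayers,
    # each wrapped in LayerNorm and a residual connection.
    def __init__(self, emb_dim=768, heads=12, mlp_ratio=4, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(emb_dim)
        self.ff = nn.Sequential(                    # FeedForward: two linear layers with GELU
            nn.Linear(emb_dim, mlp_ratio * emb_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_ratio * emb_dim, emb_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual around self-attention
        x = x + self.ff(self.norm2(x))                       # residual around the MLP
        return x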

Example Code

import torch
from ViT.train import ViT

model = ViT(
    ch=3,
    img_size=224,
    patch_size=16,
    emb_dim=768,
    n_layers=12,
    out_dim=1000,
    dropout=0.1,
    heads=12
)

inputs = torch.randn(1, 3, 224, 224)  # dummy batch: one 3-channel 224x224 image
outputs = model(inputs)               # predictions over out_dim=1000 classes
print(outputs.shape)                  # torch.Size([1, 1000])

Requirements

  • Python ≥ 3.8
  • PyTorch
  • torchvision
  • einops
  • matplotlib
  • numpy

Install Dependencies

pip install -r requirements.txt

License

This project is licensed under the MIT License - see the LICENSE file for details.

References

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR 2021. https://arxiv.org/abs/2010.11929

Citation

If you use this implementation in your research, please cite:

@misc{vision-transformer,
  author       = {vover},
  title        = {Flexible Vision Transformer Implementation},
  year         = {2024},
  publisher    = {vover},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/T4ras123/Flexible-ViT}},
}
