# Vision Transformer Fine-Tuning

This repository contains code for fine-tuning Vision Transformers using two approaches, each targeting a different computer vision task.

## Repository Structure
```
vision-transformer-finetuning/
├── lora-vit-classification/        # LoRA fine-tuning for image classification
│   ├── lora.ipynb                  # Jupyter notebook with LoRA implementation
│   └── README.md                   # Detailed documentation
├── dino2-object-detection/         # DINO v2 fine-tuning for object detection
│   ├── finetune_dino2_simple.py    # Simple DINO detector
│   ├── improved_dino_detector.py   # Enhanced DINO detector
│   ├── comprehensive_validation.py # Complete evaluation
│   ├── plot_results.py             # Performance visualization
│   └── README.md                   # Detailed documentation
└── README.md                       # This file
```
## Approach 1: LoRA Fine-Tuning for Image Classification

Location: `lora-vit-classification/`

- Task: Image Classification
- Datasets: CIFAR-10, ImageNet
- Model: Vision Transformer (ViT) with LoRA adaptation
- Key Benefit: Parameter-efficient fine-tuning (only 1-5% of parameters are trainable)
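The parameter-efficiency figure comes from freezing the pretrained weights and training only small low-rank factors. A minimal sketch of the idea, assuming a plain PyTorch wrapper (the actual notebook may use a library such as PEFT instead; `LoRALinear` is a hypothetical name, not from this repo):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update (LoRA)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Low-rank factors: A projects down to rank r, B projects back up.
        # B starts at zero so the wrapped layer initially matches the base layer.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus the scaled low-rank update (B @ A) applied to x.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable}/{total} ({100 * trainable / total:.1f}%)")
# → trainable: 12288/602880 (2.0%)
```

With rank 8 on a 768-dimensional ViT projection, only about 2% of the layer's parameters are trainable, consistent with the 1-5% range above.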
Quick Start:

```bash
cd lora-vit-classification
# Open lora.ipynb in Jupyter Notebook
```

## Approach 2: DINO v2 Fine-Tuning for Object Detection

Location: `dino2-object-detection/`
- Task: Object Detection
- Dataset: COCO Detection Dataset (91 classes)
- Model: DINO v2 backbone with detection heads
- Key Benefit: Outperforms YOLOv8 on the reported accuracy and IoU metrics
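The overall design is a frozen self-supervised backbone producing patch features, with small trainable heads for classification and box regression on top. The sketch below is illustrative only: `SimpleDetectionHead`, the pooling scheme, and the layer sizes are assumptions, and the actual scripts (`finetune_dino2_simple.py`, `improved_dino_detector.py`) may differ.

```python
import torch
import torch.nn as nn

class SimpleDetectionHead(nn.Module):
    """Hypothetical detection heads attached to DINO v2-style patch features."""

    def __init__(self, embed_dim: int = 768, num_classes: int = 91):
        super().__init__()
        self.cls_head = nn.Linear(embed_dim, num_classes)  # class logits
        self.box_head = nn.Sequential(                     # (cx, cy, w, h) in [0, 1]
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, 4), nn.Sigmoid(),
        )

    def forward(self, patch_features: torch.Tensor):
        # patch_features: (batch, num_patches, embed_dim) from a frozen backbone
        pooled = patch_features.mean(dim=1)  # simple global average pooling
        return self.cls_head(pooled), self.box_head(pooled)

head = SimpleDetectionHead()
feats = torch.randn(2, 256, 768)  # stand-in for backbone output
logits, boxes = head(feats)
print(logits.shape, boxes.shape)  # torch.Size([2, 91]) torch.Size([2, 4])
```

The 91-class output matches the COCO detection label space listed above; a real detector would predict multiple boxes per image rather than one pooled box.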
Quick Start:

```bash
cd dino2-object-detection
python improved_dino_detector.py    # Train the enhanced model
python comprehensive_validation.py  # Evaluate performance
python plot_results.py              # Generate comparison charts
```

## Performance Highlights

LoRA ViT (classification):

- Parameter Efficiency: 95%+ accuracy with only 1-5% trainable parameters
- Memory Reduction: 50-70% less GPU memory than full fine-tuning
- Speed: Faster training and inference
DINO v2 (object detection):

- Best Performance: 64.2% detection accuracy (vs. YOLOv8x: 56.0%)
- Superior IoU: 0.322 mean IoU (vs. YOLO: ~0.13)
- Dramatic Improvement: 197% better than the simple baseline detector
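For reference, the IoU metric reported above measures box overlap as intersection area divided by union area. A minimal single-pair implementation (the validation script likely computes this in vectorized form):

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.1429
```

An IoU of 0.322 therefore means predicted boxes overlap ground truth by roughly a third of their combined area, versus ~0.13 for the YOLO baseline.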
## Installation

```bash
# Clone the repository
git clone <repository-url>
cd vision-transformer-finetuning

# Install core dependencies
pip install torch torchvision transformers datasets pillow accelerate

# Optional: for YOLO comparison and visualization
pip install ultralytics matplotlib seaborn
```

## Comparison Summary

| Approach | Task | Dataset | Best Model | Key Metric |
|---|---|---|---|---|
| LoRA ViT | Classification | CIFAR-10/ImageNet | LoRA-adapted ViT | 95%+ accuracy, 1-5% parameters |
| DINO v2 | Object Detection | COCO | Improved DINO | 64.2% accuracy, 0.322 IoU |
## Getting Started

1. Choose your task:
   - Image Classification → `lora-vit-classification/`
   - Object Detection → `dino2-object-detection/`
2. Follow the folder-specific README for detailed instructions
3. Run the code and compare the results with the baselines
## Documentation

Each folder contains detailed documentation covering:
- Technical implementation details
- Training procedures
- Evaluation metrics
- Performance analysis
- Usage instructions
## Key Techniques

This repository demonstrates:
- Parameter-efficient fine-tuning with LoRA
- Self-supervised model adaptation with DINO v2
- Advanced loss functions for object detection
- Comprehensive evaluation and comparison methodologies
- Performance optimization techniques
## License

[Add your license information here]

## Contributing

[Add contribution guidelines here]