LLM-Fine-Tuning-LoRA

Fine-tuning Large Language Models (LLaMA, Mistral) with parameter-efficient methods (LoRA, QLoRA, PEFT) for domain-specific enterprise use cases.


Overview

Production-ready pipeline for fine-tuning open-source LLMs using parameter-efficient techniques. Achieves domain-specific performance comparable to hosted APIs while reducing inference costs by 40% and maintaining full data privacy.

Developed and deployed at Verticiti and Reallytics.ai for enterprise clients requiring domain-specific language models in regulated industries.

Supported Methods

| Method | Description | Memory Reduction | Use Case |
|--------|-------------|------------------|----------|
| LoRA | Low-Rank Adaptation of attention matrices | ~60% | General fine-tuning |
| QLoRA | Quantized LoRA with a 4-bit base model | ~75% | Memory-constrained environments |
| PEFT | Parameter-Efficient Fine-Tuning framework | ~65% | Multi-task adaptation |
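As a rough sense of why these methods cut memory, the number of trainable parameters under LoRA is tiny compared with the base model. A minimal sketch, assuming a LLaMA-2-7B-style shape (hidden size 4096, 32 layers) and an illustrative rank of 16 on the q/k/v projections; these are not necessarily the project's actual settings:

```python
# Back-of-the-envelope estimate of LoRA trainable parameters.
# Shapes loosely follow a LLaMA-2-7B-style model; rank and target
# count are illustrative assumptions.

def lora_trainable_params(hidden: int, layers: int, rank: int,
                          targets_per_layer: int) -> int:
    """Each adapted d x d projection adds two low-rank factors:
    A (rank x d) and B (d x rank), i.e. 2 * d * rank parameters."""
    return layers * targets_per_layer * 2 * hidden * rank

full = 7_000_000_000                      # ~7B base parameters
adapter = lora_trainable_params(hidden=4096, layers=32, rank=16,
                                targets_per_layer=3)  # q/k/v_proj
print(f"adapter params: {adapter:,}")     # 12,582,912
print(f"trainable fraction: {adapter / full:.4%}")
```

Only the adapter parameters need gradients and optimizer states, which is where the memory-reduction figures in the table come from.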

Architecture

┌──────────────────────────────────────────┐
│            Training Pipeline              │
│                                          │
│  ┌──────────┐    ┌───────────────────┐   │
│  │ Dataset  │───▶│  Tokenization &   │   │
│  │ Loader   │    │  Preprocessing    │   │
│  └──────────┘    └────────┬──────────┘   │
│                           │              │
│  ┌────────────────────────▼──────────┐   │
│  │       Base Model Loading          │   │
│  │  (LLaMA / Mistral / Falcon)       │   │
│  │  4-bit quantized (bitsandbytes)   │   │
│  └────────────────────────┬──────────┘   │
│                           │              │
│  ┌────────────────────────▼──────────┐   │
│  │     LoRA Adapter Injection        │   │
│  │  - Target: q_proj, v_proj, k_proj │   │
│  │  - Rank: 16-64                    │   │
│  │  - Alpha: 32-128                  │   │
│  └────────────────────────┬──────────┘   │
│                           │              │
│  ┌────────────────────────▼──────────┐   │
│  │      SFTTrainer (HuggingFace)     │   │
│  │  - Gradient accumulation          │   │
│  │  - Mixed precision (bf16)         │   │
│  │  - Cosine LR scheduler            │   │
│  └────────────────────────┬──────────┘   │
│                           │              │
│  ┌────────────────────────▼──────────┐   │
│  │     Evaluation & Metrics          │   │
│  │  - Perplexity, BLEU, ROUGE        │   │
│  │  - Domain-specific benchmarks     │   │
│  └───────────────────────────────────┘   │
└──────────────────────────────────────────┘

┌──────────────────────────────────────────┐
│           Serving Pipeline                │
│                                          │
│  ┌───────────┐   ┌──────────────────┐    │
│  │  vLLM     │──▶│  FastAPI Server  │    │
│  │  Engine   │   │  (REST + gRPC)   │    │
│  └───────────┘   └──────────────────┘    │
│                                          │
│  Deployed on: AWS SageMaker / ECS+Docker │
└──────────────────────────────────────────┘
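For illustration, a vLLM OpenAI-compatible server with a LoRA adapter attached can be launched roughly like this; the model name and adapter path are placeholders, and the exact flags depend on the installed vLLM version (a sketch, not the project's deployment script):

```shell
# Serve a base model with a named LoRA adapter over an
# OpenAI-compatible REST API (flag names per recent vLLM releases).
vllm serve meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules my-adapter=/path/to/adapter \
    --dtype bfloat16 \
    --port 8000
```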

Key Features

  • Multi-Model Support: Fine-tune LLaMA-2 (7B/13B/70B), Mistral-7B, Falcon, and other HuggingFace models
  • Memory Efficient: QLoRA enables fine-tuning 70B models on a single A100 GPU
  • Production Serving: vLLM integration for optimized GPU utilization and high-throughput inference
  • CUDA Optimized: Custom CUDA kernels for attention computation and quantization
  • Automated Pipeline: End-to-end from data preparation to model deployment
  • Evaluation Suite: Comprehensive benchmarking with perplexity, BLEU, ROUGE, and custom domain metrics
  • Cost Reduction: 40% reduction vs hosted API costs (GPT-4, Claude)
  • Docker + SageMaker: Containerized deployment on AWS ECS/ECR or SageMaker endpoints
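The single-A100 claim for 70B models follows from simple arithmetic: 4-bit weights take a quarter of the fp16 footprint. A back-of-the-envelope estimate (all numbers are rough, not measured values from this project):

```python
# Rough memory budget behind fitting a 70B model on one 80 GB A100.
def qlora_weight_memory_gb(params: float, bits: int = 4) -> float:
    """Memory for the base weights alone at the given bit width."""
    return params * bits / 8 / 1e9

base = qlora_weight_memory_gb(70e9, bits=4)
print(f"4-bit base weights: {base:.0f} GB")      # 35 GB
print(f"fp16 equivalent: {qlora_weight_memory_gb(70e9, 16):.0f} GB")  # 140 GB
# The remaining headroom on an 80 GB card covers the LoRA adapters,
# their optimizer states (only adapter params are trained),
# activations, and quantization constants -- which is why fp16
# weights alone would not fit, but QLoRA does.
```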

Tech Stack

| Category | Technologies |
|----------|--------------|
| Framework | HuggingFace Transformers, PEFT, TRL |
| Models | LLaMA-2, Mistral-7B, Falcon |
| Quantization | bitsandbytes (4-bit, 8-bit), GPTQ |
| Serving | vLLM, Text Generation Inference |
| Compute | CUDA, PyTorch, mixed precision (bf16/fp16) |
| Cloud | AWS SageMaker, ECS/ECR, Docker |
| Monitoring | Weights & Biases, TensorBoard |

Results

| Model | Method | Training Time | GPU Memory | Cost vs GPT-4 API |
|-------|--------|---------------|------------|-------------------|
| LLaMA-2 7B | LoRA | 4 hours | 16 GB | -60% |
| LLaMA-2 13B | QLoRA | 8 hours | 24 GB | -50% |
| Mistral 7B | LoRA | 3.5 hours | 14 GB | -65% |
| LLaMA-2 70B | QLoRA | 24 hours | 48 GB | -40% |
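Perplexity, reported by the evaluation suite, is simply the exponential of the mean per-token negative log-likelihood (cross-entropy loss). A minimal sketch with made-up loss values for illustration:

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

losses = [2.1, 1.8, 2.4, 1.9]           # per-token NLL in nats (illustrative)
print(round(perplexity(losses), 2))     # 7.77
```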

Source Code: The production source code for this project is maintained in a private repository due to proprietary and client confidentiality requirements. This repository documents the architecture, design decisions, and technical approach. For code-level discussions or collaboration inquiries, feel free to reach out.

Author

Rehan Malik - CTO @ Reallytics.ai

