DeepStream SAHI

GStreamer plugins that bring SAHI slicing to NVIDIA DeepStream, keeping slicing, inference, and the cross-tile merge inside the pipeline so it composes with standard DeepStream components (tracking, analytics, brokers, display):

nvstreammux -> nvsahipreprocess -> nvinfer -> nvsahipostprocess -> nvtracker -> nvdsosd

nvsahipreprocess — computes per-frame slices, GPU-crops/rescales them, and feeds nvinfer.
nvsahipostprocess — merges duplicate detections from overlapping slices (two-phase GreedyNMM).
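To make the merge step concrete, here is a minimal single-pass sketch of greedy non-maximum merging in Python. It illustrates the general technique only and is not the plugin's two-phase implementation: the `Det` type, the IoU match metric, the 0.5 threshold, and the union-box merge rule are all assumptions chosen for this example.

```python
from dataclasses import dataclass, replace


@dataclass
class Det:
    """Hypothetical detection record: box corners, confidence, class id."""
    x1: float
    y1: float
    x2: float
    y2: float
    score: float
    cls: int


def iou(a: Det, b: Det) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = min(a.x2, b.x2) - max(a.x1, b.x1)
    ih = min(a.y2, b.y2) - max(a.y1, b.y1)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    return inter / (area_a + area_b - inter)


def greedy_nmm(dets, thresh=0.5):
    """Merge same-class boxes that overlap a higher-scoring box.

    Greedy and non-transitive: candidates are compared against the
    original high-score box, not against the growing merged box.
    """
    order = sorted(dets, key=lambda d: d.score, reverse=True)
    used = [False] * len(order)
    merged = []
    for i, keep in enumerate(order):
        if used[i]:
            continue
        used[i] = True
        box = replace(keep)  # copy, so the input list is not mutated
        for j in range(i + 1, len(order)):
            other = order[j]
            if used[j] or other.cls != keep.cls:
                continue
            if iou(keep, other) >= thresh:
                used[j] = True
                # Assumed merge rule: union of the two boxes, keep top score.
                box.x1, box.y1 = min(box.x1, other.x1), min(box.y1, other.y1)
                box.x2, box.y2 = max(box.x2, other.x2), max(box.y2, other.y2)
        merged.append(box)
    return merged


# Two overlapping tiles report the same object; one box is far away,
# and one overlapping box belongs to a different class.
dets = [
    Det(100, 100, 200, 300, 0.9, 0),
    Det(110, 100, 210, 300, 0.8, 0),
    Det(500, 500, 550, 600, 0.7, 0),
    Det(100, 100, 200, 300, 0.6, 1),
]
merged = greedy_nmm(dets)
for d in merged:
    print(d)
```

The first two detections collapse into one box spanning both, while the distant box and the different-class box are left untouched.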

The YOLO26 (NMS-free) and YOLOv9-C/GELAN (EfficientNMS) detector families are pre-trained and selectable at run time with --model. Full plugin reference: docs/PLUGINS.md.

Compatibility

| Component | DeepStream 8.0 | DeepStream 9.0 |
| --- | --- | --- |
| DeepStream SDK | 8.0 | 9.0 |
| CUDA Toolkit | 12.8 | 13.1 |
| TensorRT | 10.9.0 | 10.14.1 |
| GStreamer | 1.24.2 | 1.24.2 |
| Python bindings | `pyds 1.2.2` | built from source |

install.sh detects the DeepStream version, builds the SAHI plugins, and builds libnvds_infer_yolo.so — the custom parser required by the bundled ONNX models (they use TensorRT EfficientNMS / NMS-free outputs the stock sample parser does not decode). Details: docs/INSTALL.md.

AI coding assistant support

The repository ships a CLAUDE.md and .claude/skills/ so AI coding assistants (e.g. Claude Code) can work with this project out of the box — they pick up the build/run workflow, the registered models, and the pipeline's non-obvious rules (cluster-mode=2, batch = tiles/frame, the overload alert) automatically. If you use an AI code assistant, just open it at the repository root; no extra setup is needed.

Quick Start

This repository uses Git LFS for ONNX model files.

git lfs install
git clone https://github.com/levipereira/deepstream-sahi.git
cd deepstream-sahi

Run a DeepStream container:

docker run -it --name deepstream-sahi --net=host --gpus all \
    -v "$(pwd)":/apps/deepstream-sahi \
    -w /apps/deepstream-sahi \
    nvcr.io/nvidia/deepstream:9.0-triton-multiarch

Inside the container:

/apps/deepstream-sahi/install.sh
source /opt/nvidia/deepstream/deepstream/sources/deepstream_python_apps/pyds/bin/activate
cd /apps/deepstream-sahi/python_test/deepstream-test-sahi
python3 deepstream_test_sahi.py --model visdrone-full-640 --no-display --csv -i ../videos/aerial_crowding_01.mp4

Test videos are hosted on Google Drive; download them and place them in python_test/videos/. Container variants, display notes, and rebuild mode: docs/INSTALL.md.

Models — pick one and run

Models are pre-trained and selected with --model. Each maps to a pgie + preprocess config; the ONNX (Git LFS) and per-model training/accuracy provenance live in model_zoo/visdrone_yolo26/. Full table + how to add a model: docs/USAGE.md.

| `--model` | Family | Input | Output → parser |
| --- | --- | --- | --- |
| `visdrone-full-640` | YOLOv9-C (GELAN) | 640 | EfficientNMS → `NvDsInferYoloNMS` |
| `visdrone-sliced-448` | YOLOv9-C (GELAN) | 448 | EfficientNMS → `NvDsInferYoloNMS` |
| `visdrone-yolo26n-sliced-416` | YOLO26 (NMS-free) | 416 | `[N,6]` → `NvDsInferYoloE2E` |
| `visdrone-yolo26s-sliced-448` | YOLO26 (NMS-free) | 448 | `[N,6]` → `NvDsInferYoloE2E` |

# inside the container, pyds venv active, from python_test/deepstream-test-sahi
python3 deepstream_test_sahi.py --model visdrone-yolo26n-sliced-416 --no-display --csv \
    -i ../videos/aerial_crowding_01.mp4            # CSV = per-frame detections; PERF lines = FPS
python3 deepstream_test_sahi.py --model <model> --output-mp4 results/out.mp4 -i <video>   # annotated video

Tuning (important for FPS and accuracy): set batch-size in the pgie config and network-input-shape[0] in the preprocess config to the number of tiles per frame (41 at slice 416, 29 at slice 448, 16 at slice 640 on a 2560×1440 source), so that each frame runs as a single inference. This gives roughly 3× FPS with no accuracy change. Slice size is the speed/recall knob: smaller slices improve recall but run slower. For multi-camera pipelines the batch is cameras × tiles/frame, so aim for a low tile count. Always use cluster-mode=2. See docs/SAHI_MODEL_BENCHMARK.md → Deployment planning.
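The tile counts quoted above can be reproduced with a short calculation. The sketch below uses a sliding window in the style of the SAHI library's slicing and assumes a 0.2 overlap ratio plus one extra full-frame tile. Neither value is stated in this README, so treat both as assumptions and check docs/PLUGINS.md for the plugin's actual properties. The function names are hypothetical.

```python
def slice_boxes(width, height, slice_size, overlap_ratio=0.2):
    """Return (x1, y1, x2, y2) square tiles covering a width x height frame.

    Tiles that would overrun the frame are shifted back inside it,
    so every tile keeps the full slice_size where the frame allows.
    """
    step = slice_size - int(slice_size * overlap_ratio)
    boxes = []
    y = 0
    while True:
        y2 = min(y + slice_size, height)
        y1 = max(0, y2 - slice_size)
        x = 0
        while True:
            x2 = min(x + slice_size, width)
            x1 = max(0, x2 - slice_size)
            boxes.append((x1, y1, x2, y2))
            if x + slice_size >= width:   # this tile reached the right edge
                break
            x += step
        if y + slice_size >= height:      # this row reached the bottom edge
            break
        y += step
    return boxes


def tiles_per_frame(width, height, slice_size, overlap_ratio=0.2, full_frame=True):
    """Tile count per frame; full_frame adds one whole-image tile (assumed)."""
    n = len(slice_boxes(width, height, slice_size, overlap_ratio))
    return n + 1 if full_frame else n


def batch_size(cameras, tiles):
    """Batch needed so every camera's frame is one inference."""
    return cameras * tiles


for s in (416, 448, 640):
    print(s, tiles_per_frame(2560, 1440, s))

print(batch_size(4, tiles_per_frame(2560, 1440, 448)))
```

Under these assumptions the counts come out as 41, 29, and 16, matching the figures above. With four cameras at slice 448 the batch would be 4 × 29 = 116, which is why a low tile count matters for multi-camera deployments.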

Documentation

| Document | Description |
| --- | --- |
| Installation Guide | container setup, dependencies, plugin build |
| Usage Guide | pipeline execution, CLI arguments, adding a model |
| Plugin Reference | plugin properties, algorithms, tuning guide |
| SAHI Model Benchmark | in-pipeline FPS/detection (YOLO26 vs GELAN), bottleneck analysis, batch/tile tuning |
| Plugin Review | nvsahipre/postprocess code review + optimizations |
| Model Zoo | training provenance, accuracy, TensorRT benchmarks |
| Training Guide | training workflow for the sliced YOLOv9-C samples |
| Test Results | full evaluation data and charts |
| Parameter Tests · Dense Crowd | postprocess parameter validation |

Results Summary

Detection counts per frame, 2560×1440 input, FP16 — SAHI recovers small-object scale that a single full-frame resize loses:

| Video | `full-640` no-SAHI → SAHI | `sliced-448` no-SAHI → SAHI |
| --- | --- | --- |
| `aerial_crowding_01` | 13.8 → 84.2 | 2.3 → 85.3 |
| `aerial_crowding_02` | 206.2 → 664.7 | 35.9 → 614.9 |
| `aerial_vehicles` | 92.3 → 252.5 | 28.6 → 226.7 |

Pipeline FPS

Real, in-pipeline FPS (RTX 4090, FP16, fakesink sync=false, 2560×1440 source, slice = 416/41 tiles for YOLO26n, 448/29 for GELAN). Setting batch-size = tiles/frame makes each frame one inference:

| Model | Input | Median FPS | With `batch = tiles/frame` |
| --- | --- | --- | --- |
| YOLO26n | 416 | 135.6 | 389 (~3×) |
| YOLOv9-C / GELAN | 448 | 76.5 | — |

The postprocess merge is ~0.18 ms/frame — not the bottleneck; tiling and detection volume dominate. Full benchmark: docs/SAHI_MODEL_BENCHMARK.md · docs/TEST_RESULTS.md. All nvsahipostprocess parameters are validated by an automated suite (21/21 across moderate and very-dense scenes): docs/PARAMETER_TESTS.md.

Trained Models — Accuracy & TensorRT Throughput

Val accuracy (VisDrone sliced, 11 classes) and pure-GPU throughput in inferences per second (img/s), swept over TensorRT batch size: FP16, RTX 4090, trtexec v10.14, single stream. Bold marks the peak. Provenance: model_zoo/visdrone_yolo26/.

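| Model | Input | mAP .50:.95 | mAP .50 | b1 | b8 | b16 | b32 | b64 | b128 | b256 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| YOLO26n | 416 | 0.439 | 0.694 | 2,313 | 11,389 | 15,557 | 18,010 | **18,356** | 17,021 | 15,997 |
| YOLO26s | 448 | 0.368 | 0.649 | 1,858 | 6,576 | **7,604** | 7,555 | 7,057 | 6,767 | 6,508 |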

Latency per batch (GPU-compute mean, ms — same runs as above). Per-image latency = batch latency ÷ batch, so larger batches are far more efficient per image:

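| Model | b1 | b8 | b16 | b32 | b64 | b128 | b256 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| YOLO26n | 0.43 | 0.70 | 1.03 | 1.78 | 3.49 | 7.52 | 16.00 |
| YOLO26s | 0.54 | 1.22 | 2.10 | 4.23 | 9.07 | 18.92 | 39.33 |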

Peak: yolo26n 18,356 img/s @ batch 64, yolo26s 7,604 img/s @ batch 16; latency/throughput knee at batch 16 for both. Full sweep + per-image latency: model_zoo/visdrone_yolo26/03_results/PERFORMANCE_TRT.md.
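The per-image relation (batch latency ÷ batch) can be checked directly against the latency table. The sketch below copies the table values and derives per-image latency and an implied throughput. The implied figure uses GPU-compute time only, so it approximates rather than reproduces the measured img/s. The helper names are illustrative.

```python
BATCHES = [1, 8, 16, 32, 64, 128, 256]

# Latency per batch in ms, copied from the table above.
LATENCY_MS = {
    "YOLO26n": [0.43, 0.70, 1.03, 1.78, 3.49, 7.52, 16.00],
    "YOLO26s": [0.54, 1.22, 2.10, 4.23, 9.07, 18.92, 39.33],
}


def per_image_ms(model):
    """Per-image latency in ms for each batch size."""
    return {b: lat / b for b, lat in zip(BATCHES, LATENCY_MS[model])}


def implied_img_per_s(model):
    """Throughput implied by GPU-compute latency alone (approximation)."""
    return {b: 1000.0 * b / lat for b, lat in zip(BATCHES, LATENCY_MS[model])}


def most_efficient_batch(model):
    """Batch size with the lowest per-image latency."""
    table = per_image_ms(model)
    return min(table, key=table.get)


for model in LATENCY_MS:
    best = most_efficient_batch(model)
    print(
        f"{model}: best batch {best}, "
        f"{per_image_ms(model)[best]:.4f} ms/img, "
        f"~{implied_img_per_s(model)[best]:.0f} img/s implied"
    )
```

The lowest per-image latency falls at batch 64 for YOLO26n and batch 16 for YOLO26s, the same batch sizes as the measured throughput peaks.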

Example Charts

Video Demos

Dense Pedestrian Crowd	Very Dense Crowd	Dense Vehicle Traffic

Training (not in this repo)

This repository deploys pre-trained models; it does not train them. Reproduce/retrain with the original upstream repos — YOLO26 via official Ultralytics (yolo detect train …, then yolo export format=onnx dynamic=True simplify=True) and YOLOv9-C/GELAN via the upstream YOLOv9 repo. The exact dataset (VisDrone sliced 416, 11 classes), commands, hyperparameters, accuracy and TensorRT benchmarks are bundled under model_zoo/visdrone_yolo26/; YOLOv9-C notes in docs/TRAINING.md.

Limitations

Object-count overload: above ~2000 objects in a single frame, the OSD draw and GreedyNMM merge become the bottleneck and FPS drops sharply (the pipeline emits a one-time [WARN]). Bound it by raising pre-cluster-threshold, using fewer/larger tiles, or capping max-detections.
cluster-mode=2 required. cluster-mode=4 renders wrong boxes in the OSD.
Bidirectional NMM (transitive merge chains) is not implemented; GreedyNMM covers real-time use.
Merged mask resolution is capped at 512×512. Multi-source runs use OpenMP parallelism but are not benchmarked end-to-end.

See docs/PLUGINS.md for the full property reference and algorithm details.

License

This repository is multi-licensed per component — the SPDX header in each source file is authoritative; see LICENSE at the root for the summary.

| Component | License |
| --- | --- |
| `nvsahipostprocess` plugin (original work) | Apache-2.0 |
| `nvsahipreprocess` plugin (derivative of NVIDIA's `gst-nvdspreprocess` sample) + rest of the repo | NVIDIA DeepStream SDK EULA |
| `nvdsinfer_yolo` parser | see its `LICENSE` |

The Apache-2.0 license covers the postprocess plugin's source; building and running it still requires the NVIDIA DeepStream SDK, which is governed by NVIDIA's agreement. Preserve all per-file copyright/SPDX notices when redistributing.
