37 changes: 37 additions & 0 deletions AGENTS.md
@@ -0,0 +1,37 @@
# ncnn - AI Agent Developer Guide

ncnn is Tencent's high-performance neural network inference framework, optimized for mobile and embedded platforms and released under the BSD-3-Clause license. It is written in C/C++ with minimal dependencies, supports x86, ARM, RISC-V, LoongArch, and MIPS CPUs as well as Vulkan-capable GPUs, and includes PNNX for converting PyTorch/ONNX models to ncnn.

## Repository Layout

```
src/ Core library (mat.h, net.h, layer.h, option.h, ...)
src/layer/ Generic layer implementations
src/layer/{x86,arm,riscv,loongarch,mips}/ Arch-optimized layers
src/layer/vulkan/ Vulkan GPU layers + shader/ (.comp GLSL shaders)
tools/pnnx/ PyTorch Neural Network eXchange converter
tools/{caffe,onnx}/ Legacy model converters
tests/ Unit tests (test_<layername>.cpp)
cmake/ Build modules (ncnn_add_layer.cmake)
toolchains/ Cross-compilation toolchain files
docs/ Documentation
.clang-format Code formatting (Allman, 4-space, C++03)
.github/workflows/ CI (build, test, coverage, format)
```

## Agent Documentation Index

Read these docs selectively based on the task at hand:

| Topic | Doc | When to read |
|---|---|---|
| Key data structures | [docs/agents/data-structures.md](docs/agents/data-structures.md) | Working with Mat, Layer, Net, Blob, ParamDict |
| Build and test | [docs/agents/build-and-test.md](docs/agents/build-and-test.md) | Building, testing, cross-compilation, coverage |
| Code style and portability | [docs/agents/code-style.md](docs/agents/code-style.md) | Writing code for src/ (C++03, simplestl, OpenMP rules) |
| CPU/GPU dispatch | [docs/agents/dispatch.md](docs/agents/dispatch.md) | Understanding layer registration, packing, Vulkan flow |
| PNNX architecture | [docs/agents/pnnx.md](docs/agents/pnnx.md) | Model conversion pipeline, IR, pass system |
| Task: Add ncnn operator | [docs/agents/task-add-operator.md](docs/agents/task-add-operator.md) | Adding a new layer to ncnn |
| Task: Add PNNX operator | [docs/agents/task-add-pnnx-operator.md](docs/agents/task-add-pnnx-operator.md) | Adding PyTorch op support to PNNX |
| Task: x86 SIMD optimization | [docs/agents/task-x86-optimization.md](docs/agents/task-x86-optimization.md) | SSE/AVX/AVX-512 layer optimization |
| Task: Vulkan optimization | [docs/agents/task-vulkan-optimization.md](docs/agents/task-vulkan-optimization.md) | GPU compute shader layer |
| Task: Cross-arch optimization | [docs/agents/task-cross-arch-optimization.md](docs/agents/task-cross-arch-optimization.md) | ARM NEON/SVE, RISC-V RVV, QEMU testing |
172 changes: 172 additions & 0 deletions docs/agents/build-and-test.md
@@ -0,0 +1,172 @@
# Build and Test

## Basic Build (Linux)

```bash
cd ncnn
mkdir build && cd build
cmake ..
cmake --build . -j$(nproc)
```

## Key CMake Options

| Option | Default | Description |
|--------|---------|-------------|
| `NCNN_VULKAN` | OFF | Enable Vulkan GPU support |
| `NCNN_OPENMP` | ON | Enable OpenMP multi-threading |
| `NCNN_BUILD_TESTS` | OFF | Build unit tests |
| `NCNN_BUILD_TOOLS` | ON* | Build converter tools |
| `NCNN_BUILD_EXAMPLES` | ON* | Build example programs |
| `NCNN_BUILD_BENCHMARK` | ON | Build benchmark tool |
| `NCNN_SHARED_LIB` | OFF | Build shared library |
| `NCNN_RUNTIME_CPU` | ON | Runtime CPU feature detection & dispatch |
| `NCNN_SSE2` | ON | x86 SSE2 support |
| `NCNN_AVX` | ON | x86 AVX support |
| `NCNN_AVX2` | ON | x86 AVX2/FMA support |
| `NCNN_AVX512` | ON* | x86 AVX-512 support |
| `NCNN_ARM82` | ON | AArch64 fp16 (ARMv8.2) |
| `NCNN_ARM82DOT` | ON | AArch64 dot product |
| `NCNN_ARM84BF16` | ON | AArch64 BFloat16 |
| `NCNN_ARM84I8MM` | ON | AArch64 Int8 matrix multiply |
| `NCNN_ARM86SVE` | ON | AArch64 SVE |
| `NCNN_RVV` | ON | RISC-V Vector extension |
| `NCNN_SIMPLEMATH` | OFF | Use built-in math (no libm) |
| `NCNN_SIMPLESTL` | OFF | Use built-in STL (no libstdc++) |
| `WITH_LAYER_xxx` | ON | Enable/disable individual layers |

\* `NCNN_BUILD_TOOLS` and `NCNN_BUILD_EXAMPLES` default to OFF when cross-compiling or targeting Android/iOS. `NCNN_AVX512` defaults to ON only when the compiler supports it and `NCNN_AVX2` is ON.

## Build with Vulkan

```bash
cmake -DNCNN_VULKAN=ON ..
cmake --build . -j$(nproc)
```

Requires the Vulkan SDK. The bundled `glslang/` submodule compiles GLSL shaders to SPIR-V at build time.

## Build with Tests

```bash
cmake -DNCNN_BUILD_TESTS=ON -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF ..
cmake --build . -j$(nproc)
ctest --output-on-failure -j$(nproc)
```

## Cross-Compilation

Toolchain files are in `toolchains/`. Example for AArch64:

```bash
cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/aarch64-linux-gnu.toolchain.cmake \
-DNCNN_BUILD_TESTS=ON ..
cmake --build . -j$(nproc)
```

Run tests with QEMU:

```bash
TESTS_EXECUTABLE_LOADER=qemu-aarch64-static \
TESTS_EXECUTABLE_LOADER_ARGUMENTS="-L;/usr/aarch64-linux-gnu" \
ctest --output-on-failure -j8
```

For RISC-V with RVV:

```bash
export RISCV_ROOT_PATH=/path/to/riscv-toolchain
cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/riscv64-unknown-linux-gnu.toolchain.cmake \
-DNCNN_RVV=ON -DNCNN_BUILD_TESTS=ON ..
cmake --build . -j$(nproc)

# Test with QEMU (vlen=256)
TESTS_EXECUTABLE_LOADER=qemu-riscv64 \
TESTS_EXECUTABLE_LOADER_ARGUMENTS="-cpu;rv64,v=true,zfh=true,zvfh=true,vlen=256,elen=64,vext_spec=v1.0;-L;/path/to/sysroot" \
ctest --output-on-failure -j8
```

## Intel SDE for x86 ISA Testing

The CI uses Intel SDE to test advanced ISA extensions (AVX-512, AVX-VNNI, etc.) on machines that do not natively support them:

```bash
TESTS_EXECUTABLE_LOADER=/path/to/sde64 \
TESTS_EXECUTABLE_LOADER_ARGUMENTS="-spr;--" \
ctest --output-on-failure -j8
```

## Testing

Tests are in `tests/`. Each layer has a `test_<layername>.cpp` file.

### Test Pattern

Tests use `testutil.h`, which provides `test_layer()`. It creates a layer from the given `ParamDict` and weights, runs a random input through the naive (generic, non-optimized) implementation to produce a reference output, then runs the same input through the CPU-optimized and Vulkan paths (when available) and compares the results within numerical tolerance.

```cpp
// tests/test_relu.cpp
#include "testutil.h"

static int test_relu(const ncnn::Mat& a, float slope)
{
ncnn::ParamDict pd;
pd.set(0, slope);
std::vector<ncnn::Mat> weights(0);
int ret = test_layer("ReLU", pd, weights, a);
if (ret != 0)
fprintf(stderr, "test_relu failed a.dims=%d a=(%d %d %d %d) slope=%f\n",
a.dims, a.w, a.h, a.d, a.c, slope);
return ret;
}

int main()
{
SRAND(7767517);
return test_relu(RandomMat(5, 6, 7, 24), 0.f)
|| test_relu(RandomMat(128), 0.1f);
}
```

### Adding a New Test

1. Create `tests/test_<layername>.cpp`
2. Add to `tests/CMakeLists.txt`: `ncnn_add_test(test_<layername>)`
3. Test all dimension ranks (1D, 2D, 3D, 4D) with various sizes, including:
- Sizes divisible by common pack sizes (4, 8, 16)
- Non-aligned sizes to test remainder loops
- Multiple parameter combinations
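
Non-aligned sizes matter because optimized kernels typically process a fixed pack of elements per iteration and fall back to a scalar remainder loop for the tail. A minimal standalone sketch (illustration only, not ncnn code) of the loop structure such sizes exercise:

```cpp
#include <cassert>

// Pack-of-4 main loop plus scalar remainder, mirroring the shape of
// ncnn's optimized kernels. A size like 7 forces the remainder loop
// to run; a size like 8 never enters it.
static void relu_pack4_style(const float* in, float* out, int n)
{
    int i = 0;
    for (; i + 3 < n; i += 4) // main loop: 4 elements per iteration
    {
        for (int j = 0; j < 4; j++)
            out[i + j] = in[i + j] > 0.f ? in[i + j] : 0.f;
    }
    for (; i < n; i++) // remainder loop: only runs when n % 4 != 0
        out[i] = in[i] > 0.f ? in[i] : 0.f;
}
```

A test that only ever uses sizes divisible by 4 would leave the remainder loop completely unexecuted, which is why both aligned and non-aligned sizes belong in every test.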

## Code Coverage

CI runs code coverage on every push/PR (see `.github/workflows/test-coverage.yml`). It builds with `NCNN_COVERAGE=ON` which adds `-coverage -fprofile-arcs -ftest-coverage` flags and links `-lgcov`. After tests run, `lcov` collects the `.gcda` / `.gcno` data and uploads to Codecov.

When developing, you should measure coverage locally to ensure your new code is well tested:

```bash
# Build with coverage
mkdir build-coverage && cd build-coverage
cmake -DCMAKE_BUILD_TYPE=debug \
-DNCNN_COVERAGE=ON \
-DNCNN_RUNTIME_CPU=OFF \
-DNCNN_OPENMP=OFF \
-DNCNN_BUILD_TOOLS=OFF \
-DNCNN_BUILD_EXAMPLES=OFF \
-DNCNN_BUILD_TESTS=ON ..
cmake --build . -j$(nproc)

# Run tests
ctest --output-on-failure -j$(nproc)

# Collect coverage
lcov -d ./src -c -o lcov.info
lcov -r lcov.info '/usr/*' -o lcov.info
lcov -r lcov.info '*/build-coverage/*' -o lcov.info
lcov --list lcov.info

# (Optional) Generate HTML report
genhtml lcov.info --output-directory coverage-html
# Open coverage-html/index.html in a browser
```

Aim for high coverage of your new or modified code paths. The CI coverage matrix tests multiple configurations — x86 ISA variants (none/sse2/avx/avx2/avx512/avx512vnni), cross-compiled architectures (ARM, RISC-V RVV, MIPS, LoongArch, PowerPC) via QEMU, Vulkan GPU (llvmpipe and SwiftShader), and OpenMP on/off — so make sure your tests exercise the relevant branches.
62 changes: 62 additions & 0 deletions docs/agents/code-style.md

@@ -0,0 +1,62 @@
# Code Style and Portability

## Formatting

The project uses **Allman brace style** with 4-space indentation, no tabs. Defined in `.clang-format` and `.astylerc`.

Key conventions:
- **Indentation**: 4 spaces, no tabs
- **Braces**: Allman style (opening brace on new line for functions, classes, control statements)
- **Namespaces**: No indentation inside `namespace ncnn { ... }`
- **Pointers**: Left-aligned (`float* ptr`, not `float *ptr`)
- **Column limit**: None (no hard line length limit)
- **Includes**: Not sorted by clang-format
- **Naming**: `snake_case` for variables/functions, `PascalCase` for class names, `UPPER_CASE` for macros
- **Comments**: `//` style, minimal — code is expected to be self-explanatory
- **Copyright header**: Every ncnn-authored source file starts with `// Copyright YYYY Tencent` and `// SPDX-License-Identifier: BSD-3-Clause`
- **SIMD code**: Uses `#if __SSE2__` / `#if __AVX__` / `#if __ARM_NEON` preprocessor guards, nested from wider to narrower
- **OpenMP**: `#pragma omp parallel for num_threads(opt.num_threads)` on the outer channel loop
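
For orientation, here is a short hypothetical function showing several of these conventions together (copyright header, Allman braces, 4-space indent, left-aligned pointer, `snake_case` naming):

```cpp
// Copyright YYYY Tencent
// SPDX-License-Identifier: BSD-3-Clause

#include <cassert>

// Hypothetical helper, written in the ncnn house style.
static float sum_channel(const float* ptr, int size)
{
    float sum = 0.f;
    for (int i = 0; i < size; i++)
    {
        sum += ptr[i];
    }
    return sum;
}
```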

Format code with:
```bash
./codeformat.sh # runs clang-format + astyle twice for stable output
```

You do **not** need to run this locally before submitting. The GitHub CI workflow (`.github/workflows/code-format.yml`) automatically formats all C/C++ source files and GLSL shaders on every push/PR and commits the formatting changes back. Just write code following the conventions above, and CI will fix any minor formatting deviations.

## Code Portability (Core Library)

ncnn's core library (`src/`) is designed for maximum compiler and platform compatibility. Strict portability rules apply to all code under `src/`:

### Language Standard

- **C code**: C99
- **C++ code**: C++03 (`.clang-format` enforces `Standard: c++03`)
- **Do NOT use** C++11 or later features in `src/`: no `auto`, `nullptr`, range-based for loops, `constexpr`, `std::move`, lambda expressions, `override`/`final` keywords, uniform initialization `{}`, `<thread>`, `<mutex>`, `<atomic>`, etc.
- Use `0` instead of `nullptr`, explicit type declarations instead of `auto`, traditional for loops instead of range-for.
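
For example, a hypothetical helper written the C++03 way, with the disallowed C++11 spelling shown only in a comment:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C++11 (NOT allowed in src/):
//     for (auto& v : values) v *= s;
// C++03 replacement: explicit types and an index loop.
static void scale_all(std::vector<float>& values, float s)
{
    for (size_t i = 0; i < values.size(); i++)
    {
        values[i] *= s;
    }
}
```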

### STL Restrictions

ncnn provides its own minimal STL implementation in `src/simplestl.h` (enabled with `NCNN_SIMPLESTL=ON`) to support environments without a C++ standard library (bare-metal, some embedded systems). All core library code must be compatible with this subset:

- **Allowed**: `std::vector`, `std::string`, `std::pair`, `std::list`, `std::stack`, `std::swap`, `std::min`, `std::max`, `std::partial_sort`, `std::less`, `std::greater`
- **Not available in simplestl**: `std::map`, `std::set`, `std::unordered_map`, `std::shared_ptr`, `std::unique_ptr`, `<algorithm>` (beyond `partial_sort`), `<functional>`, `<iostream>`, streams, smart pointers, etc.
- When writing core library code, only use STL templates that are implemented in `simplestl.h`.

### Math Restrictions

ncnn also provides `src/simplemath.h` / `src/simplemath.cpp` (enabled with `NCNN_SIMPLEMATH=ON`) as a drop-in replacement for `<math.h>` / `<cmath>`, for platforms without a math library. Core code should stick to standard C99 math functions.

### OpenMP Restrictions

ncnn provides a minimal OpenMP runtime (`src/simpleomp.h` / `src/simpleomp.cpp`, enabled with `NCNN_SIMPLEOMP=ON`) that supports both the LLVM libomp ABI and the GCC libgomp ABI. Only the following OpenMP usage is allowed in the core library:

```cpp
#pragma omp parallel for num_threads(opt.num_threads)
```

Do not use any other OpenMP directives such as `critical`, `atomic`, `reduction`, `task`, `simd`, `sections`, or `barrier`. The `collapse(2)` clause is used in a few places but should be limited to simple cases.
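
A standalone sketch of the one permitted pattern, parallelizing the outer channel loop. Here `num_threads` is a plain parameter standing in for `opt.num_threads`, and the pragma is simply ignored when the code is compiled without OpenMP:

```cpp
#include <cassert>

// The only OpenMP directive allowed in the core library: a parallel-for
// over the outermost (channel) loop. Each iteration touches a disjoint
// slice of data, so no other synchronization directives are needed.
static void fill_channels(float* data, int channels, int cstep, float value, int num_threads)
{
    #pragma omp parallel for num_threads(num_threads)
    for (int q = 0; q < channels; q++)
    {
        float* ptr = data + q * cstep;
        for (int i = 0; i < cstep; i++)
        {
            ptr[i] = value;
        }
    }
}
```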

### Tools and PNNX — No Restriction

Code outside the core library — specifically `tools/pnnx/`, `tools/caffe/`, `tools/onnx/`, `examples/`, `tests/`, `python/` — is **not** subject to these portability restrictions. PNNX in particular uses **C++17** (or C++14 for PyTorch < 2.1) and freely uses modern C++ features, the full standard library, protobuf, etc.
97 changes: 97 additions & 0 deletions docs/agents/data-structures.md
@@ -0,0 +1,97 @@
# Key Data Structures

## Mat (`src/mat.h`)

The core tensor type. Supports 1D to 4D data with element packing for SIMD.

```cpp
class Mat {
void* data; // Raw data pointer
int* refcount; // Reference counting (NULL for external data)
size_t elemsize; // Bytes per element (4=fp32, 2=fp16, 1=int8 when elempack=1;
// equals scalar_size * elempack when packed, e.g., 16 for pack4 fp32)
int elempack; // Packed elements (1=scalar, 4=SSE/NEON, 8=AVX/fp16)
Allocator* allocator;
int dims; // 0=empty, 1=1D, 2=2D, 3=3D, 4=4D
int w, h, d, c; // Width, height, depth, channels
size_t cstep; // Channel stride (elements per channel)
};
```

Key concepts:
- **Element packing (`elempack`)**: Multiple elements stored together for SIMD. E.g., `elempack=4` means 4 floats packed as one unit (for SSE/NEON 128-bit). `elempack=8` for AVX 256-bit. Channel count `c` is divided by `elempack`.
- **Channel step (`cstep`)**: Aligned stride between channels for SIMD alignment.
- GPU variants: `VkMat` (Vulkan buffer), `VkImageMat` (Vulkan image).
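
To make the packing concrete, here is a simplified index calculation (an illustration, not ncnn code) for a pack4 fp32 Mat. It assumes `cstep` counts packed elements and ignores the extra alignment padding ncnn applies when computing the real `cstep`:

```cpp
#include <cassert>
#include <cstddef>

// Float offset of logical element (channel q, flat position i) in a
// pack4 layout: 4 consecutive logical channels are interleaved, so
// channel q lands in packed channel q/4 at lane q%4.
static size_t pack4_offset(int q, int i, size_t cstep)
{
    size_t packed_channel = q / 4; // which group of 4 channels
    size_t lane = q % 4;           // position inside the 4-wide pack
    return packed_channel * cstep * 4 + (size_t)i * 4 + lane;
}
```

So logical channels 0-3 interleave within packed channel 0: element (q=1, i=0) sits one float after (q=0, i=0), while (q=0, i=1) sits four floats after it.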

## Net (`src/net.h`)

The inference engine. Loads param (graph) and model (weights), creates `Extractor` for inference.

```cpp
class Net {
Option opt; // Runtime options
int load_param(const char*); // Load graph structure (.param)
int load_model(const char*); // Load weights (.bin)
Extractor create_extractor(); // Create inference session
};

class Extractor {
int input(const char* name, const Mat& in); // Set input
int extract(const char* name, Mat& out); // Get output (runs inference)
};
```

## Layer (`src/layer.h`)

Base class for all operators. Key behavioral flags set in constructor:

```cpp
class Layer {
bool one_blob_only; // Single input/output (e.g., ReLU)
bool support_inplace; // Can modify input in-place
bool support_packing; // Accepts packed Mat (elempack > 1)
bool support_vulkan; // Has Vulkan implementation
bool support_bf16_storage;
bool support_fp16_storage;
bool support_int8_storage;
bool support_any_packing; // Layer handles any elempack internally (skip auto packing conversion)
bool support_vulkan_any_packing; // Same as above, but for Vulkan path

// CPU forward
virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;
virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;
virtual int forward_inplace(std::vector<Mat>& bottom_top_blobs, const Option& opt) const;
virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;

// Vulkan forward
virtual int forward(const VkMat& bottom_blob, VkMat& top_blob, VkCompute& cmd, const Option& opt) const;
virtual int forward_inplace(VkMat& bottom_top_blob, VkCompute& cmd, const Option& opt) const;

virtual int load_param(const ParamDict& pd); // Load params from .param
virtual int load_model(const ModelBin& mb); // Load weights from .bin
virtual int create_pipeline(const Option& opt); // Setup (e.g., create Vulkan pipelines)
virtual int destroy_pipeline(const Option& opt);
virtual int upload_model(VkTransfer& cmd, const Option& opt); // Upload weights to GPU
};
```

Forward interface selection table:

| one_blob_only | support_inplace | Required interface |
|---|---|---|
| false | false | `forward(vector<Mat>, vector<Mat>)` |
| false | true | `forward_inplace(vector<Mat>)` (must), `forward(vector<Mat>, vector<Mat>)` (optional) |
| true | false | `forward(Mat, Mat)` |
| true | true | `forward_inplace(Mat)` (must), `forward(Mat, Mat)` (optional) |
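
The table can be summarized as a toy dispatch sketch (hypothetical code, loosely modeled on how `Net` picks a layer's entry point from its flags):

```cpp
#include <cassert>
#include <cstring>

// Stand-in for ncnn::Layer with only the two flags that drive dispatch.
struct FakeLayer
{
    bool one_blob_only;
    bool support_inplace;
};

// Returns the forward interface the engine would invoke.
static const char* pick_interface(const FakeLayer& layer)
{
    if (layer.one_blob_only)
        return layer.support_inplace ? "forward_inplace(Mat&)" : "forward(Mat, Mat&)";
    return layer.support_inplace ? "forward_inplace(vector<Mat>&)" : "forward(vector<Mat>, vector<Mat>&)";
}
```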

## Blob (`src/blob.h`)

A named tensor edge in the computation graph. Each blob has a producer layer and consumer layers.

## ParamDict (`src/paramdict.h`)

Key-value store for layer parameters. Keys are integers (0, 1, 2, ...). Values can be int, float, or arrays thereof. Used in `.param` files as `key=value`.
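
For illustration only, a minimal parse of one such `key=value` entry (the real ParamDict also distinguishes int from float values and supports arrays):

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical sketch of reading a .param entry such as "0=0.1":
// the integer before '=' is the key, the text after it is the value.
// Returns the value if the key matches, else 0 (ncnn's defaults come
// from the second argument of ParamDict::get instead).
static float parse_param_value(const char* entry, int key)
{
    int k = atoi(entry);
    const char* eq = strchr(entry, '=');
    if (!eq || k != key)
        return 0.f;
    return (float)atof(eq + 1);
}
```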

## Option (`src/option.h`)

Runtime configuration: `num_threads`, `use_vulkan_compute`, `use_fp16_packed`, `use_bf16_storage`, blob/workspace allocators, etc.