feat: implement UE8M0 scale format support for FP8 inference#1023

Open
Libres-coder wants to merge 1 commit into deepseek-ai:main from Libres-coder:main

Conversation


@Libres-coder Libres-coder commented Oct 26, 2025

Summary

Implements the UE8M0 (uint8 exponent) scale format for FP8 quantization, as specified in config_v3.1.json. This reduces activation scale memory by 75% (4 bytes → 1 byte) and replaces float scale multiplications with integer exponent additions.

Changes

Core Implementation (inference/kernel.py)

New functions:

  • convert_scale_to_ue8m0() - Convert float32 scales to uint8 format
  • convert_scale_from_ue8m0() - Convert uint8 back to float32

Updated functions:

  • act_quant() / act_quant_kernel() - Support UE8M0 output (uint8 scales)
  • fp8_gemm() / fp8_gemm_kernel() - Use exponent addition for UE8M0
  • weight_dequant() / weight_dequant_kernel() - Support UE8M0 input

Format specification:

# Encoding: uint8 = ceil(log2(scale)) + 127
# Decoding: scale = 2^(uint8 - 127)
# Optimization: exp_a + exp_b instead of scale_a * scale_b
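The encode/decode round-trip described above can be sketched in NumPy as follows. The function names match the ones listed in this PR, but the body is an illustrative sketch, not the repo's Triton implementation:

```python
import numpy as np

def convert_scale_to_ue8m0(scale: np.ndarray) -> np.ndarray:
    """Encode positive float32 scales as uint8 biased exponents:
    uint8 = ceil(log2(scale)) + 127."""
    exp = np.ceil(np.log2(scale)).astype(np.int32) + 127
    return np.clip(exp, 0, 255).astype(np.uint8)

def convert_scale_from_ue8m0(encoded: np.ndarray) -> np.ndarray:
    """Decode uint8 biased exponents back to float32:
    scale = 2^(uint8 - 127)."""
    return np.exp2(encoded.astype(np.int32) - 127).astype(np.float32)
```

Power-of-two scales round-trip exactly; other values round up to the next power of two because of the ceil, so the decoded scale is always ≥ the original, which is the conservative direction for avoiding FP8 overflow.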

Integration (inference/model.py)

  • Added scale_fmt parameter to linear() function
  • Pass scale_fmt from config through model to kernels
  • Configuration flow: config.json → ModelArgs.scale_fmt → Linear.scale_fmt → kernels
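The first hop of that flow (config.json → ModelArgs.scale_fmt) could look like the sketch below. The field names mirror this PR's description; the loader itself is hypothetical, not the repo's actual code:

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelArgs:
    dtype: str = "bf16"
    # None selects the original float32 scale path; "ue8m0" selects
    # uint8 exponent scales (backward compatible by default).
    scale_fmt: Optional[str] = None

def load_args(config_text: str) -> ModelArgs:
    """Parse a config.json payload into ModelArgs (illustrative)."""
    cfg = json.loads(config_text)
    return ModelArgs(dtype=cfg.get("dtype", "bf16"),
                     scale_fmt=cfg.get("scale_fmt"))
```

From there, ModelArgs.scale_fmt is threaded through Linear.scale_fmt into the kernels, so a single config key switches the whole scale path.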

Benefits

| Aspect | Improvement |
| --- | --- |
| Activation scale memory | -75% (4 bytes → 1 byte) |
| Memory bandwidth | -75% for scale transfers |
| Computation | Exponent addition (faster than float multiplication) |
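The computation win comes from the identity 2^(ea-127) · 2^(eb-127) = 2^(ea+eb-254): multiplying two UE8M0-encoded scales reduces to one integer addition. A minimal sketch (the helper name is hypothetical):

```python
import numpy as np

def mul_scales_ue8m0(exp_a: int, exp_b: int) -> float:
    """Product of two UE8M0-encoded scales via exponent addition:
    scale_a * scale_b = 2^(exp_a + exp_b - 254)."""
    return float(np.exp2(int(exp_a) + int(exp_b) - 254))
```

For example, 0.5 encodes to 126 and 4.0 encodes to 129, and 126 + 129 - 254 = 1 gives 2.0, which matches 0.5 × 4.0.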

Backward Compatibility

Zero breaking changes:

  • Default scale_fmt=None uses original float32 path
  • Weight scales remain float32 (safetensors compatibility)
  • Automatic runtime conversion for mixed formats
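The mixed-format case (uint8 activation scales meeting float32 weight scales) can be handled by decoding on the fly. A sketch of such a dispatch helper, assuming the name `as_float_scale` (not from the PR):

```python
import numpy as np

def as_float_scale(scale: np.ndarray) -> np.ndarray:
    """Normalize a scale tensor to float32 (illustrative):
    uint8 UE8M0 scales are decoded, float32 scales pass through."""
    if scale.dtype == np.uint8:
        return np.exp2(scale.astype(np.int32) - 127).astype(np.float32)
    return scale
```

This keeps weight scales as float32 for safetensors compatibility while still letting UE8M0 activation scales flow through the same GEMM path.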

Usage

Already configured in config_v3.1.json:

{
    "dtype": "fp8",
    "scale_fmt": "ue8m0"
}

Run inference:

python inference/generate.py --config configs/config_v3.1.json ...

Files Changed

  • inference/kernel.py - Core implementation (~80 lines)
  • inference/model.py - Integration (~15 lines)

Resolves: #994

@Libres-coder
Author

PTAL, thx @GeeeekExplorer @haswelliris @mowentian
