feat: implement UE8M0 scale format support for FP8 inference #1023
Open · Libres-coder wants to merge 1 commit into deepseek-ai:main
Conversation

Author: ptal, thx @GeeeekExplorer @haswelliris @mowentian
Summary

Implements the UE8M0 (uint8 exponent) scale format for FP8 quantization, as specified in `config_v3.1.json`. Reduces activation scale memory by 75% (4 bytes → 1 byte) and optimizes computation through exponent-based operations.

Changes
Core Implementation (`inference/kernel.py`)

New functions:

- `convert_scale_to_ue8m0()`: convert float32 scales to the uint8 format
- `convert_scale_from_ue8m0()`: convert uint8 scales back to float32

Updated functions:

- `act_quant()` / `act_quant_kernel()`: support UE8M0 output (uint8 scales)
- `fp8_gemm()` / `fp8_gemm_kernel()`: use exponent addition for UE8M0 scales
- `weight_dequant()` / `weight_dequant_kernel()`: support UE8M0 input

Format specification: UE8M0 is an unsigned, exponent-only encoding (8 exponent bits, 0 mantissa bits), so every scale is a power of two stored as a single biased-exponent byte.
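The conversion helpers and the exponent-addition trick can be sketched in scalar Python. The function names come from this PR, but the bodies, the bias of 127, and the round-up rounding are assumptions inferred from the format name; the real kernels operate on tensors in Triton.

```python
import math

UE8M0_BIAS = 127  # assumed bias, mirroring float32's exponent bias

def convert_scale_to_ue8m0(scale: float) -> int:
    """Encode a positive float32 scale as a uint8 biased exponent.
    Rounding up to the next power of two keeps quantized values from
    overflowing (the actual kernel may choose a different rounding)."""
    assert scale > 0.0
    e = math.ceil(math.log2(scale)) + UE8M0_BIAS
    return max(0, min(255, e))

def convert_scale_from_ue8m0(e: int) -> float:
    """Decode a uint8 biased exponent back to its power-of-two scale."""
    return 2.0 ** (e - UE8M0_BIAS)

def combined_scale(e_act: int, e_wgt: int) -> float:
    """With power-of-two scales, the per-tile scale product inside the
    GEMM reduces to an integer addition of the two exponents."""
    return 2.0 ** (e_act + e_wgt - 2 * UE8M0_BIAS)
```

Because every UE8M0 scale is a power of two, the kernel can work with integer exponents throughout and defer the `2**e` evaluation to the final dequantization step.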
Integration (`inference/model.py`)

- Added a `scale_fmt` parameter to the `linear()` function
- Threaded `scale_fmt` from the config through the model to the kernels: `config.json` → `ModelArgs.scale_fmt` → `Linear.scale_fmt` → kernels

Benefits

- 75% smaller activation scale storage (1 byte per scale instead of 4)
- Scale multiplication in the FP8 GEMM becomes integer exponent addition
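To make the 75% figure concrete, here is the arithmetic for a hypothetical tile count (the count itself is illustrative, not from the PR):

```python
# Each quantization tile carries one scale. Storing it as a float32
# costs 4 bytes; as a UE8M0 exponent it costs 1 byte.
num_tiles = 1_000_000              # hypothetical number of tiles
fp32_bytes = num_tiles * 4         # original scale storage
ue8m0_bytes = num_tiles * 1        # UE8M0 scale storage
savings = 1 - ue8m0_bytes / fp32_bytes  # fraction of memory saved
```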
Backward Compatibility
Zero breaking changes: `scale_fmt=None` uses the original float32 path.

Usage
Already configured in `config_v3.1.json`:

```json
{ "dtype": "fp8", "scale_fmt": "ue8m0" }
```

Run inference:
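A sketch of how the new field might be read out of the config; the field names match `config_v3.1.json`, while the loading code is illustrative:

```python
import json

# Field names as they appear in config_v3.1.json (from the PR description).
cfg = json.loads('{"dtype": "fp8", "scale_fmt": "ue8m0"}')

# .get() yields None for older configs that lack the field,
# which selects the original float32 scale path.
scale_fmt = cfg.get("scale_fmt")
```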
Files Changed
- `inference/kernel.py`: core implementation (~80 lines)
- `inference/model.py`: integration (~15 lines)

Resolves: #994