BF16 vs GGUF, FP8 Scaled, NVFP4 Speed & Quality Compared + ComfyUI CUDA 13 Gains + FLUX 2 Klein 9B #357
FurkanGozukara
announced in
Tutorials
BF16 vs GGUF, FP8 Scaled, NVFP4 Speed & Quality Compared + ComfyUI CUDA 13 Gains + FLUX 2 Klein 9B
Full tutorial: https://www.youtube.com/watch?v=XDzspWgnzxI
How much quality and speed difference exists between the BF16, GGUF, FP8 Scaled, and NVFP4 precisions has long been an open question. In this tutorial I compare all of these precision and quantization variants for both speed and quality, and the results are quite surprising. Moreover, we have developed and published an NVFP4 model quant generator app and an FP8 Scaled quant generator app; the links are below if you want to use them. Furthermore, upgrading ComfyUI to CUDA 13 with properly compiled libraries is now strongly recommended: we have observed noticeable performance gains with CUDA 13. So for both SwarmUI users and standalone ComfyUI users, the CUDA 13 ComfyUI is now recommended.
📂 Resources & Links:
📥 Download ComfyUI CUDA 13 Installer: [ https://www.patreon.com/posts/ComfyUI-Installers-105023709 ]
📥 SwarmUI & ComfyUI Unified Model Downloader: [ https://www.patreon.com/posts/SwarmUI-Install-Download-Models-Presets-114517862 ]
🤖 NVFP4 Model Quantizer App: [ https://www.patreon.com/posts/nvfp4-quantizer-app-148217625 ]
🤖 SECourses Musubi Trainer (FP8 Scaled Quantization App): [ https://www.patreon.com/posts/nvfp4-quantizer-app-148217625 ]
🛠️ Image Comparison Slider Tool: [ https://www.patreon.com/posts/image-video-comparison-slider-app-133935178 ]
☁️ SimplePod AI: [ https://simplepod.ai/ref?user=secourses ]
New Model FLUX 2 Klein 9B: [ https://huggingface.co/black-forest-labs/FLUX.2-klein-9B ]
🎥 FLUX 1 Kontext Dev Tutorial (inpaint - outpaint - image fix): [ https://youtu.be/XWzZ2wnzNuQ ]
🎥 Previous ComfyUI Installation Tutorial: [ https://youtu.be/yOj9PYq3XYM ]
How to Use SwarmUI Presets & Workflows in ComfyUI + Custom Model Paths Setup for ComfyUI & SwarmUI Tutorial: [ https://youtu.be/EqFilBM3i7s ]
SECourses Discord Channel for 7/24 Support: [ https://discord.com/invite/software-engineering-courses-secourses-772774097734074388 ]
SECourses Musubi Tuner Tutorial: [ https://youtu.be/DPX3eBTuO_Y ]
NVIDIA NVFP4 Blog Post to learn More: [ https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/ ]
⏱️ Video Chapters:
00:00:00 Introduction: GGUF Q8 vs NVFP4 vs BF16 vs FP8 Precision Comparison
00:00:38 FP8 Quantization & New NVFP4 Model Quantizer App in Musubi Trainer
00:01:08 The New FLUX SRPO Mixed NVFP4 Model & FLUX 2 Klein 9B Announcement
00:01:56 Speed Comparison Setup: ComfyUI CUDA 13 & Compiled Libraries
00:02:41 Z Image Turbo Speed Test: GGUF Q8 vs NVFP4 (87% Faster)
00:03:09 Z Image Turbo Speed Test: BF16 vs FP8 Scaled vs GGUF Improvements
00:03:32 Installing & Using Image Comparison Slider Tool for Quality Check
00:03:55 Z Image Turbo Quality: BF16 vs GGUF Q8 vs FP8 Scaled
00:04:13 Z Image Turbo Quality: NVFP4 Degradation Analysis
00:04:27 FLUX 2 Dev Speed Test: GGUF Q8 vs NVFP4 (100% Faster)
00:04:43 FLUX 2 Dev Speed Test: FP8 Scaled vs BF16 Performance
00:05:12 FLUX 2 Dev Quality: BF16 vs GGUF Q8 vs Mixed FP8 Scaled
00:05:38 FLUX 2 Dev Quality: NVFP4 Mixed Precision Analysis
00:05:54 Benchmark Settings: 2048px Resolution & Quality 1 Preset Details
00:06:25 FLUX 1 Dev Speed Test: GGUF Q8 vs NVFP4 (118% Faster)
00:07:21 FLUX 1 Dev Speed Test: BF16 & FP8 Scaled Performance Stats
00:07:42 FLUX 1 Dev Quality: BF16 vs GGUF Q8 vs FP8 Scaled
00:07:55 FLUX 1 Dev Quality: NVFP4 Visual Degradation Review
00:08:06 FLUX 1 Kontext Dev: Model Intro & Outpainting Tutorial Reference
00:08:40 FLUX 1 Kontext Dev Speed: GGUF Q8 vs NVFP4 (93% Faster)
00:08:59 FLUX 1 Kontext Dev Speed: BF16 & FP8 Scaled Comparisons
00:09:12 FLUX 1 Kontext Dev Quality: Original vs Edited Image (Hair Change)
00:09:36 FLUX 1 Kontext Dev Quality: BF16 vs GGUF Q8 vs FP8 Scaled
00:09:51 How to Use SwarmUI Unified Model Downloader & Bundles
00:10:36 Downloading Models via URL from CivitAI & Hugging Face to Cloud
00:11:45 SECourses Musubi Trainer: Creating Custom FP8 Quantized Models
00:12:44 The New FLUX SRPO NVFP4 Mixed Precision Model Overview
00:13:15 Live Demo: FLUX SRPO NVFP4 Speed Test on RTX 5090 (5.7s)
00:13:52 VRAM Usage Analysis: NVFP4 on RTX 5090 (14GB Usage)
00:14:16 Live Comparison: BF16 Speed & VRAM Test on RTX 5090
00:15:15 Troubleshooting: Fixing Low RAM/VRAM Issues with Arguments
00:16:25 Why You Should Upgrade to ComfyUI CUDA 13 Version
00:16:51 SimplePod AI: Updated Instructions & Template Setup
00:17:29 RTX 6000 Blackwell Fix & nvitop Utilization Verification
00:18:18 Conclusion, Contact Info & Support Channels
In this video, you will learn:
Speed differences between GGUF Q8, NVFP4, BF16, and FP8.
Visual quality analysis using the Image Comparison Slider.
How to use the new NVFP4 and FP8 Quantizer tools.
How to fix Low VRAM/RAM issues with specific arguments.
Performance benchmarks on RTX 5090 and RTX 6000.
Video Transcription
00:00:00 Greetings everyone. Today I am going to compare the performance of the GGUF Q8,
00:00:08 NVFP4, BF16, and FP8 scaled precisions for the Z Image Turbo, FLUX 2 Dev,
00:00:18 FLUX 1 Dev, and FLUX 1 Kontext Dev models. Moreover, I will compare their quality one by one, side by side,
00:00:25 so you will see how much quality degrades or changes between these precisions. Moreover,
00:00:32 I will show how you can download them and I will talk about my FP8
00:00:38 quantization application implemented into SECourses Musubi Trainer. And furthermore,
00:00:45 my new NVFP4 model quantizer application. To develop this application, I spent a
00:00:52 massive amount of time and money. I used the SimplePod RTX PRO 6000 GPU for over a day to
00:01:02 make this application work. This was not a trivial task. However, as a result, we got
00:01:08 an amazing FLUX SRPO mixed NVFP4 model. You will see how good its quality is and how much faster it is.
00:01:16 Moreover, Black Forest Labs just published a new model called FLUX 2 Klein, with 9 billion parameters. Hopefully,
00:01:24 I will also cover this model. This model is supposed to be easier to train, and
00:01:30 since it is much smaller than the official FLUX 2, it will run much faster.
00:01:37 I hope the quality is amazing so that we can start using this model as well. Hopefully,
00:01:42 I will cover it fully with presets and 1-click downloads. You see the BF16 is only
00:01:47 18 GB, while FLUX 2 was 60 GB as BF16. So this is another model, and a tutorial is hopefully coming.
00:01:56 So let's begin with the speed differences. To obtain these speeds you need to install
00:02:02 the ComfyUI CUDA 13 version. Our ComfyUI installer has already been updated for CUDA 13 with the
00:02:09 latest libraries. I have compiled every one of them myself for you. My compiled libraries,
00:02:16 Flash Attention, Sage Attention, xFormers, are compiled for these CUDA archs. Therefore,
00:02:22 they are working for every GPU out there that you can think of. If you
00:02:26 have watched my latest tutorial, you will learn how to install and use this
00:02:31 latest ComfyUI version. The link will be in the description of the video.
00:02:34 So let's begin with the speed comparison. The first model is Z Image Turbo and GGUF Q8
00:02:41 speed is 2.26 it/s. This is for 1536 x 1536 pixel image generation. When we look at the
00:02:52 NVFP4 variant, it becomes 4.23 it/s. That is 87 percent faster compared to the
00:03:01 GGUF Q8. The BF16 version is only 10 percent faster. I saw a significant speed-up with the CUDA
00:03:09 13 version for GGUF models, so the ComfyUI team is cooking. They are improving GGUF significantly,
00:03:17 and you need the CUDA 13 version for this one. With FP8 scaled, it is only 7 percent faster
00:03:24 than the GGUF Q8. So the GGUF is becoming much faster compared to before for Z Image Turbo.
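As a side note, the "percent faster" figures in these comparisons follow directly from the it/s readings. A minimal sketch of that arithmetic, using the Z Image Turbo numbers above:

```python
def percent_faster(new_its: float, baseline_its: float) -> float:
    """How much faster (in percent) new_its is versus the baseline,
    both measured in iterations per second."""
    return (new_its / baseline_its - 1.0) * 100.0

# Z Image Turbo, 1536 x 1536, values from the speed test above
gguf_q8 = 2.26  # it/s
nvfp4 = 4.23    # it/s

print(round(percent_faster(nvfp4, gguf_q8)))  # → 87
```

The same formula reproduces the other percentages in the video when you plug in the corresponding it/s values.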
00:03:32 What about the quality? For quality comparison, I am going to use image comparison slider. You can
00:03:37 download it from here and install it. Installation is very simple. All you need to do is run the
00:03:43 install/update .bat file, then the start image comparison .bat file. The link will be in the description
00:03:49 of the video. So I have selected my files. Let's full screen. On the left we see the BF16 version,
00:03:55 the highest quality. And on the right now we see the GGUF Q8. You see the GGUF Q8
00:04:01 is almost the same as BF16. There is no visible quality degradation, so it is working amazingly well.
00:04:07 When we switch to FP8 scaled, the quality is still very good. We don't see quality degradation,
00:04:13 almost the same quality. When we look at the NVFP4, we see some quality degradation for the Z Image
00:04:20 Turbo. Maybe a better variant, a better version of the NVFP4 will be published. We will see.
00:04:27 Let's look at the FLUX 2 Dev version. With the FLUX 2 Dev version, the GGUF Q8 is very slow,
00:04:35 at 7.97 s/it. When we look at the NVFP4, it is 100 percent faster compared to
00:04:43 the GGUF Q8. The FP8 scaled version is still significantly faster than the GGUF Q8. Therefore,
00:04:52 you should use the FP8 scaled variants for FLUX 2. There is a massive difference. And the
00:04:57 BF16 version is also slow, even though I tested on an RTX 6000 PRO. So for FLUX 2 Dev,
00:05:06 I recommend either NVFP4 or FP8 scaled. Let's look at the quality difference. So these are
00:05:12 the images. Now on the left we see the BF16 and on the right we see the GGUF Q8. Almost the same image;
00:05:20 there is no quality degradation. When we look at the mixed FP8 scaled,
00:05:24 we see almost the same quality. Mixed means that some of the layers of the model are
00:05:31 not quantized, so they stay in BF16 precision in this case. So it is almost the same quality,
00:05:38 really cool. When we look at the NVFP4, we see some degradation in quality, but it is still very good.
00:05:45 Still perfectly usable. Therefore, you can use the NVFP4 variant for the FLUX 2 Dev model. Amazing.
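As a rough illustration of the "mixed" idea, here is a hypothetical sketch of how a quantizer might decide which layers stay in BF16. The keyword list and layer names below are assumptions for illustration only, not the actual rules the quantizer apps use:

```python
# Illustrative assumption: quantization-sensitive layers (norms, biases,
# embeddings, final projection) are kept in BF16; bulk weights go to FP8.
KEEP_BF16_KEYWORDS = ("norm", "bias", "embed", "final_layer")

def plan_precision(layer_names):
    """Return a {layer_name: 'bf16' | 'fp8'} mixed-precision quant plan."""
    plan = {}
    for name in layer_names:
        if any(kw in name for kw in KEEP_BF16_KEYWORDS):
            plan[name] = "bf16"  # left untouched: quantizing these hurts quality most
        else:
            plan[name] = "fp8"   # bulk transformer weights: quantized
    return plan

layers = [
    "blocks.0.attn.qkv.weight",
    "blocks.0.norm1.weight",
    "final_layer.linear.weight",
]
print(plan_precision(layers))
```

The upshot is that a "mixed" file is larger than a fully quantized one but keeps the most fragile layers at full precision, which is why its quality tracks BF16 so closely.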
00:05:54 Before we move to the FLUX 1 Dev variant, I need to tell you that I have generated
00:05:58 with 2048 pixel resolution and Quality 1 preset. Our Quality 1 preset is using
00:06:07 a heavy sampler with 2 seeds. Therefore, it is 2x slower. This also applies to
00:06:12 our Z Image Turbo and other presets. I usually use a heavy sampler,
00:06:18 so they are twice as slow. So these tests were made at the best quality.
00:06:25 So with the FLUX 1 Dev variant, we see GGUF Q8 at 3.54 it/s. You need to think
00:06:33 about the relative speeds, not the exact speeds, because exact speeds depend on your resolution,
00:06:39 your preset, and your GPU. However, relative speeds are valid for the same GPU. I did
00:06:46 the tests on SimplePod AI, and for every test I generated multiple images because the initial
00:06:53 image is slow and subsequent images generate faster. And in the debug menu,
00:07:00 I looked at the final s/it or it/s and the duration. This is how I
00:07:06 calculated the durations. So when we look at the NVFP4, it is 118 percent faster
00:07:14 than the GGUF Q8. A massive speed difference. BF16 is still 28 percent faster than
00:07:21 GGUF, and FP8 scaled is still 19 percent faster than GGUF. So for the FLUX Dev model,
00:07:28 use FP8 scaled or NVFP4 if your generation is slow, but not GGUF. GGUF is still slow for FLUX Dev.
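The measurement method described here, generate several images, ignore the slow first run (which includes model loading), and use the later readings, can be sketched like this. The per-image readings below are made-up illustrative values, not actual benchmark data:

```python
def steady_state_its(per_image_its, warmup=1):
    """Average it/s after discarding warm-up generations, whose
    model-loading overhead makes the first image(s) slower."""
    steady = per_image_its[warmup:]
    return sum(steady) / len(steady)

# hypothetical per-image readings: the first run includes the model load
runs = [1.9, 3.5, 3.55, 3.57]
print(round(steady_state_its(runs), 2))  # → 3.54
```

Because only relative speeds transfer across setups, this warm-up discard matters: including the first image would understate every variant's speed by a different amount.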
00:07:36 Let's look at the quality of the FLUX Dev model. So the left one is BF16 and the
00:07:42 right one is now GGUF. GGUF is almost the same as the BF16. When we look at the FP8 scaled,
00:07:50 there is just a little bit of difference; still very good quality that can perfectly be used. However,
00:07:55 when we look at the NVFP4, we see some degradation in quality. It is a
00:08:00 noticeable quality loss. So it is up to you whether to use it or not; you can test it.
00:08:06 So let's also look at the FLUX Kontext Dev model speed differences and quality
00:08:11 differences. This is an image editing model. If you don't know how to use this model or what it does,
00:08:16 I have an excellent tutorial for it. The link will be in the description of the video.
00:08:21 When you open that tutorial and look at the video chapters,
00:08:26 you will see how to do outpainting, how to use FLUX Kontext to fix images and
00:08:31 more information. This is a really good tutorial. So I recommend you to watch
00:08:35 this tutorial as well if you don't know how to use this FLUX Kontext Dev model.
00:08:40 So we see that the GGUF Q8 speed is 1.83 it/s. The NVFP4 is 93 percent faster than
00:08:50 GGUF Q8. BF16 is 14 percent faster than GGUF, and FP8 scaled is almost the same as GGUF,
00:08:59 only 9 percent faster. So you can use GGUF or FP8 scaled for this model. NVFP4 is very
00:09:06 fast. But what about the quality difference? So let's select the files. This is how you select
00:09:12 multiple files. And let's full screen. Okay, on the left we see the original image and on
00:09:17 the right we see the Kontext Edit image. So I changed my hair, I made it longer. This was the
00:09:24 prompt. You see the face is almost not changed, very good. Only the hair is changed. This is the
00:09:29 test case. So let's select BF16 and GGUF Q8. We almost don't see any difference. It is almost
00:09:36 unnoticeable. When we look at the FP8 scaled, still very good quality, almost no noticeable
00:09:43 difference. And this FP8 scaled quant is a model that I generated myself. It is in our
00:09:51 downloader application. If you don't know how to use our downloader application, it is so simple.
00:09:56 You download the latest SwarmUI model downloader installer from this post. Extract it into
00:10:03 your SwarmUI installation folder. Then just double-click the Windows start download models .bat file.
00:10:10 Then you will get to this screen. You can give your custom model paths, you can download anywhere
00:10:15 you want. We have image bundles. You see NVFP4 images bundle, Z Image Turbo models core bundle,
00:10:22 FLUX models core bundle. You can download every file individually or as a bundle.
00:10:28 Another thing is that downloading models on cloud machines is much harder than downloading
00:10:36 and using them on your own computer. Therefore, our unified model downloader also supports URL downloads so
00:10:42 that you can download models from CivitAI, from Hugging Face, or any other platform. Just paste
00:10:49 the link here and select the folder wherever you want to download to. Then it will download
00:10:55 it at maximum speed with hash calculation and hash verification. For example, as a demo,
00:11:01 let's download this model. So I will right click and copy the link address of this model, paste it
00:11:07 here, and I will download it into here. Then I will click download. It will start the download at
00:11:15 the maximum speed of my internet connection. You see it is downloading with 16 connections. Therefore,
00:11:21 I am reaching 100 megabytes per second on my personal Windows computer. On the cloud you can
00:11:27 reach 1 gigabyte per second. This download tool is extremely useful if you want to use it.
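For the curious, multi-connection downloaders like this typically split the file into byte ranges and fetch each range over a separate HTTP Range request. A minimal sketch of the range-splitting step follows; this illustrates the general technique, not the tool's actual code:

```python
def split_ranges(total_size: int, connections: int):
    """Split a file of total_size bytes into contiguous (start, end) byte
    ranges, one per connection, as used for HTTP Range-request downloads."""
    chunk = total_size // connections
    ranges = []
    for i in range(connections):
        start = i * chunk
        # the last connection absorbs the remainder bytes
        end = total_size - 1 if i == connections - 1 else start + chunk - 1
        ranges.append((start, end))
    return ranges

# e.g. a 1000-byte file over 3 connections
print(split_ranges(1000, 3))  # → [(0, 332), (333, 665), (666, 999)]
```

Each range would then be requested with a `Range: bytes=start-end` header and the chunks reassembled in order, which is why more connections can saturate a fast link.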
00:11:33 You can search for files from here, like "Kontext", and it will list all the files.
00:11:39 You see the FLUX Kontext Dev quantized model. And to make this quantization we use the
00:11:45 SECourses Musubi Trainer application. The link will be in the description of the video. And we
00:11:50 have a full tutorial on how to use SECourses Musubi Trainer. It is in this tutorial, so you
00:11:55 can watch this one or the Wan 2.2 training tutorial. Either of them works. So this application has
00:12:01 an FP8 model converter. Normally we were using the Musubi style, but I recently added a quant
00:12:09 version. This is much more advanced; it uses more specific quantization. This is how I
00:12:15 generated this amazing FLUX Kontext Dev Quant FP8 scaled. I also used it to generate FLUX
00:12:22 Dev Quant FP8 scaled. These are models I generated myself. They are available in our downloader.
00:12:30 So we don't see any quality difference. It is amazing quality, almost the same as BF16. And when we
00:12:36 look at the NVFP4, it also did an amazing job. It is still very good quality, so perfectly usable.
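For background, "FP8 scaled" quantization generally stores a per-tensor scale chosen so the largest weight fits the FP8 range. Here is a simplified sketch of that idea in plain Python; the e4m3 maximum of 448 is a property of the FP8 format itself, but everything else is illustrative and not the app's actual implementation:

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def fp8_scale_and_quantize(weights):
    """Per-tensor 'scaled' FP8 sketch: pick a scale so the largest weight
    maps onto the FP8 e4m3 range, then store scaled values plus the scale."""
    amax = max(abs(w) for w in weights)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    quantized = [w / scale for w in weights]  # a real quantizer casts these to FP8
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate original weights at inference time."""
    return [q * scale for q in quantized]

w = [0.5, -2.0, 1.25]
q, s = fp8_scale_and_quantize(w)
print(max(abs(x) for x in q))  # close to 448.0: the largest weight hits the FP8 max
```

In a real converter the scaled values are actually cast to 8-bit floats, so precision is lost there; the stored scale is what lets the runtime recover values near the original BF16 weights.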
00:12:44 And finally, the new FLUX SRPO model NVFP4 mixed precision. To make this model I spent literally
00:12:53 2 days and a lot of money, because quantization is only possible with 48 GB GPUs. Therefore I had
00:13:02 to use PRO 6000 on SimplePod. So you see there is almost no quality difference. They are both
00:13:08 amazing and it is extremely fast. Let's make a live demonstration. This is running locally,
00:13:15 my local SwarmUI. And let me show you my nvitop as well. So let's generate 8 images with the FLUX
00:13:22 SRPO mixed NVFP4. You will be shocked by the speed of it. So it started generation. Okay
00:13:30 it is generating. You are watching it live. It takes like 5 seconds for 40 steps and the highest
00:13:37 quality. You see the quality is amazing. It is working amazing. It is taking only 5.7 seconds
00:13:44 to generate on an RTX 5090. And how much memory is it using? It is using 14 GB of VRAM. It
00:13:52 would use less memory if I had a lower-end GPU, like an 8 GB GPU. However, it is working amazingly.
00:13:59 By the way, NVFP4 models bring a speed-up only on RTX 5000 series, but you can use them on other GPUs
00:14:08 as well, like the 4000 and 3000 series. So it is amazing. When we compare it with the BF16,
00:14:16 let's look at the BF16 difference. So here. It will take more than twice the time, something like
00:14:23 12 seconds. So now it is loading the model and it will also use much more VRAM. So if you have 5000
00:14:29 series GPUs, like 5060 or 5070, 5080, this will work amazing. Okay it is loading the model. Yes,
00:14:38 it is using 26 GB of VRAM. As I said numerous times, don't worry about your VRAM. Even if you
00:14:46 have 8 GB GPU or 6 GB GPU, it will still work as long as you have RAM memory. Because ComfyUI
00:14:52 is automatically doing all the block swapping, VRAM streaming for you. So it will still work
00:14:57 very fast and it will work on low end GPUs. As long as you have RAM memory don't worry about
00:15:03 it. So it is taking about 14 seconds, more than twice as long. This is an amazing speed difference.
00:15:10 However, since ComfyUI has recently been doing a lot of updates, you may need to add some
00:15:15 arguments to your ComfyUI or SwarmUI backend. Which ones? If you are low on RAM,
00:15:22 RAM not VRAM, you can use --cache-none. If you use --cache-none,
00:15:26 it will not keep any model in RAM or VRAM. It will unload them,
00:15:31 even the VAE or text encoder, so this gives minimal RAM and VRAM usage. Or you can add
00:15:38 --disable-smart-memory, and it will return to the older VRAM management. How do you do that? You add it
00:15:45 into here and it will use it. Or in ComfyUI, as I have shown in our previous tutorials,
00:15:52 you edit this run GPU .bat file like this and add the argument like this. So it will use that
00:16:01 argument. This is how you add arguments to your ComfyUI installation, or to a SwarmUI installation that
00:16:06 uses our ComfyUI installation. They will fix your out-of-VRAM errors or stuck/freeze
00:16:13 issues. Because ComfyUI keeps updating, they are fixing things and sometimes breaking things. So you can
00:16:19 use these arguments to fix the issues. But I can say that you should definitely upgrade to ComfyUI
00:16:25 CUDA 13. Why? Because it is faster overall, especially for GGUF models, and it will
00:16:33 keep being developed further from now on. Therefore, it is the recommended version going forward.
00:16:40 Our installers are all up to date. We have all the presets. I have also converted
00:16:46 these presets into ComfyUI so you can use them. Moreover, as you may remember, in
00:16:51 our latest tutorial we introduced SimplePod AI. I have updated all the zip files.
00:16:59 So you can just open the "run pod SimplePod comfyui instructions". You will have all the links. So
00:17:05 please use these links to register. Use this link as a template. And the team fixed the
00:17:11 errors. So when you select this template, edit and use it, deploy your persistent volume, and
00:17:16 set your mount storage as workspace. You can also use it without persistent storage, and then it will be
00:17:24 ready to use. Select any GPU you want. And if you watched the previous tutorial, you remember
00:17:29 that the RTX 6000 Blackwell GPUs were not working. They fixed that issue. I ran this test on the
00:17:36 RTX PRO 6000 GPUs and it is working perfectly with 100 percent utilization. Let me demonstrate.
00:17:44 For example, this is running on SimplePod AI. Let's continue. Open a terminal. Let's pip
00:17:50 install nvitop. And run nvitop. We see that 600 watts is being drawn, so the GPU is fully utilized. Moreover,
00:17:58 they have also added an extract option to the JupyterLab interface. So now you can extract zip files
00:18:07 as well. You see when I right click, extract archive is there. So it is also fixed and working.
00:18:13 And you can see how much time and how many tests this took. Let me show you. You
00:18:18 see all these mixed-precision NVFP4 tests. All of them were failing. I did a lot,
00:18:25 a lot of testing and fixing. So this NVFP4 converter application was really hard to
00:18:31 program. I spent a huge amount of money and time on it. All of the links will be in the description of
00:18:37 the video. Like any other of my videos, you see I put the links like this. So this is how
00:18:43 you will find the links. You can contact me by replying to my videos or by joining our
00:18:49 Discord server. The link is here. You can also message me from LinkedIn. You can also send me
00:18:54 an email. Everything is fine. I reply to all of them. Discord is the best way to contact me, or
00:19:01 Patreon or replying to the YouTube. Hopefully see you in another amazing tutorial video.