[BUG] Fatal access violation crash on startup with recent NVIDIA drivers (bfloat16 probe) #552

@Wontfallo

Description

Describe the bug

ComfyUI crashes on startup with a Windows fatal exception: access violation originating from _probe_bfloat16_support() in src/optimization/compatibility.py (line 688). The crash is a hard segfault that Python's try/except cannot catch.

Environment

  • OS: Windows 11
  • GPU: NVIDIA GeForce RTX 4090
  • NVIDIA Driver: 595.79 (recent update)
  • PyTorch: 2.9.1+cu130
  • cuDNN: 91200
  • Python: 3.13.11
  • ComfyUI: 0.16.4
  • SeedVR2: v2.5.23 (commit 4490bd1)

Stack trace

Windows fatal exception: access violation

Stack (most recent call first):
  File "...\seedvr2_videoupscaler\src\optimization\compatibility.py", line 688 in _probe_bfloat16_support
  File "...\seedvr2_videoupscaler\src\optimization\compatibility.py", line 697 in <module>
  File "<frozen importlib._bootstrap>", line 488 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 1023 in exec_module
  ...
  File "...\ComfyUI\nodes.py", line 2225 in load_custom_node

Root cause

The function _probe_bfloat16_support() performs a raw CUDA allocation (torch.randn(..., dtype=torch.bfloat16, device='cuda:0')) at module import time. With recent NVIDIA drivers (595.xx series), this triggers a fatal access violation during the CUDA/cuDNN initialization phase. Since it's a segfault, the try/except RuntimeError block cannot catch it, and the entire ComfyUI process terminates.

The GPU (RTX 4090, sm_89) fully supports bfloat16 — the crash is specifically about when and how the probe runs, not about actual bfloat16 capability.
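The compute-capability rule that the proposed fallback relies on can be stated in isolation. A minimal sketch (the function name is hypothetical; the `major >= 8` threshold mirrors the sm_80 rule used in the fix below, and the tuples are in the format returned by `torch.cuda.get_device_capability()`):

```python
def bf16_supported_by_capability(major: int, minor: int) -> bool:
    """Heuristic: Ampere (sm_80) and newer GPUs support bfloat16 natively."""
    return major >= 8

# Capability tuples as torch.cuda.get_device_capability() would report them
print(bf16_supported_by_capability(8, 9))  # RTX 4090 (sm_89) -> True
print(bf16_supported_by_capability(7, 5))  # Turing (sm_75)   -> False
```

This is why the crash cannot be a genuine capability problem on an sm_89 card: the hardware check passes; only the import-time CUDA allocation fails.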

Proposed fix

Run the bfloat16 probe in a subprocess so that if it crashes, the main process is unaffected:

def _probe_bfloat16_support() -> bool:
    if not torch.cuda.is_available():
        return True

    import os
    import subprocess
    import sys

    # Subprocess-based probe (isolated from access violations in the child)
    try:
        probe_script = (
            "import torch; "
            "a = torch.randn(8, 8, dtype=torch.bfloat16, device='cuda:0'); "
            "_ = torch.matmul(a, a); "
            "print('OK')"
        )

        result = subprocess.run(
            [sys.executable, "-c", probe_script],
            capture_output=True,
            text=True,
            timeout=30,
            env={**os.environ, "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES", "0")},
        )

        # A crashed or failing child (non-zero exit, no "OK") means bf16 is not safe here
        return result.returncode == 0 and "OK" in result.stdout
    except (subprocess.TimeoutExpired, FileNotFoundError, OSError):
        pass

    # Fallback: check GPU compute capability (sm_80+ supports bfloat16)
    try:
        major, _ = torch.cuda.get_device_capability(0)
        return major >= 8
    except Exception:
        return True

The subprocess probe adds roughly two seconds to startup but prevents the fatal crash regardless of driver version. The probe reaches the same bfloat16 verdict as the in-process version, so there is no performance impact at runtime.
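The isolation principle behind the fix can be demonstrated with plain stdlib code, independent of CUDA: when a child process dies from a native-level fault, the parent sees only a non-zero exit status instead of crashing. A minimal sketch (using `os.abort()` to stand in for the access violation):

```python
import subprocess
import sys

# The child aborts at the native level, similar to an access violation:
# no Python exception propagates to the parent, only a failed exit status.
result = subprocess.run(
    [sys.executable, "-c", "import os; os.abort()"],
    capture_output=True,
    timeout=30,
)

print(result.returncode != 0)  # True: the crash shows up as a failed exit status
print("parent still alive")    # the parent process keeps running
```

This is exactly why `result.returncode != 0` in the patch is a reliable "bf16 not safe" signal even though `try/except` around the original in-process probe was not.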

Likely impact

Anyone on Windows with PyTorch 2.9+ and recent NVIDIA drivers (595.xx series, March 2026) will hit this crash. It likely affects all GPU models, not just RTX 4090.


Full Diff

diff --git a/src/optimization/compatibility.py b/src/optimization/compatibility.py
index c462022..ab60146 100644
--- a/src/optimization/compatibility.py
+++ b/src/optimization/compatibility.py
@@ -682,17 +682,51 @@ if not os.environ.get("SEEDVR2_OPTIMIZATIONS_LOGGED"):
 
 # Bfloat16 CUBLAS support
 def _probe_bfloat16_support() -> bool:
+    """
+    Probe bfloat16 CUBLAS support using a subprocess to prevent fatal access
+    violations from crashing the main ComfyUI process.
+    
+    On PyTorch 2.9+ with cuDNN >= 91200, calling torch.randn(..., dtype=bfloat16, device='cuda')
+    during module import can trigger a Windows fatal exception: access violation.
+    Running the probe in a subprocess isolates this crash.
+    """
     if not torch.cuda.is_available():
         return True
+    
+    # First try: subprocess-based probe (safe from access violations)
     try:
-        a = torch.randn(8, 8, dtype=torch.bfloat16, device='cuda:0')
-        _ = torch.matmul(a, a)
-        del a
-        return True
-    except RuntimeError as e:
-        if "CUBLAS_STATUS_NOT_SUPPORTED" in str(e):
+        import subprocess
+        import sys
+        
+        probe_script = (
+            "import torch; "
+            "a = torch.randn(8, 8, dtype=torch.bfloat16, device='cuda:0'); "
+            "_ = torch.matmul(a, a); "
+            "print('OK')"
+        )
+        
+        result = subprocess.run(
+            [sys.executable, "-c", probe_script],
+            capture_output=True,
+            text=True,
+            timeout=30,
+            env={**os.environ, "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES", "0")},
+        )
+        
+        if result.returncode == 0 and "OK" in result.stdout:
+            return True
+        else:
+            # Subprocess crashed or returned error - bf16 not safe
             return False
-        raise
+    except (subprocess.TimeoutExpired, FileNotFoundError, OSError):
+        pass
+    
+    # Fallback: check GPU compute capability (sm_80+ supports bfloat16)
+    try:
+        major, _ = torch.cuda.get_device_capability(0)
+        return major >= 8
+    except Exception:
+        return True
 
 BFLOAT16_SUPPORTED = _probe_bfloat16_support()
 COMPUTE_DTYPE = torch.bfloat16 if BFLOAT16_SUPPORTED else torch.float16
