Fixing ComfyUI Crashes on AMD/ROCm

I run ComfyUI on an AMD Radeon (gfx1201, 16 GB VRAM) with ROCm on Arch Linux. It was crashing after a few generations — usually when switching checkpoints. No Python traceback, just a dead server. Here’s what caused it and how to fix it.

Setup

  • AMD Radeon gfx1201, 16 GB VRAM
  • Arch Linux, CachyOS kernel
  • ROCm 7.2.1 system packages
  • PyTorch 2.10.0.dev20251123+rocm7.1 (nightly)
  • ComfyUI v0.3.75, SDXL models (~6.5 GB each)

Symptoms

The ComfyUI log would end mid-operation during model loading:

--- Logging error ---
Stopped server

No OOM, no traceback. The process just dies. This typically happened during CLIP or checkpoint transfers to cuda:0 (ROCm devices appear under the CUDA device namespace in PyTorch) when switching between models.

Diagnosis

GPU core dumps

Six gpucore.* files sat in the ComfyUI directory, totaling 47 GB. These are written by the amdgpu driver on a GPU crash, which confirms the failure is happening below Python.
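A quick way to see how much disk these dumps are eating, as a minimal sketch (the helper name is mine; the path matches the cleanup step later in this post):

```python
from pathlib import Path

def gpucore_usage(directory):
    """Count amdgpu core dumps (gpucore.*) and sum their size in bytes."""
    dumps = list(Path(directory).glob("gpucore.*"))
    return len(dumps), sum(p.stat().st_size for p in dumps)

count, total = gpucore_usage(Path.home() / "comfy" / "ComfyUI")
print(f"{count} dumps, {total / 2**30:.1f} GiB")
```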

Kernel log

journalctl -k -b | grep amdgpu
amdgpu: VM memory stats for proc (0) task (0) is non-zero when fini

The GPU driver isn’t freeing virtual memory properly. Memory leaks across generations until allocation fails during a model swap.

The mismatch

pacman -Q rocm-core        # 7.2.1
python -c "import torch; print(torch.version.hip)"  # 7.1

System ROCm was 7.2.1; PyTorch was compiled against ROCm 7.1, and it was a nightly dev build at that. A HIP runtime mismatch like this can make the memory allocator misbehave silently. Setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True surfaced a related warning:

expandable_segments not supported on this platform
(Triggered internally at /pytorch/c10/hip/HIPAllocatorConfig.h:40.)
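The comparison that matters is the major.minor of the HIP runtime, not the full patch version. A throwaway helper to make the check explicit (the function name is mine; the version strings are the ones from my system):

```python
def rocm_mismatch(system_rocm, torch_hip):
    """True when system ROCm and PyTorch's HIP build differ at major.minor."""
    major_minor = lambda v: v.split(".")[:2]
    return major_minor(system_rocm) != major_minor(torch_hip)

print(rocm_mismatch("7.2.1", "7.1"))   # True  -- the broken combination
print(rocm_mismatch("7.2.1", "7.2"))   # False -- what the fix produces
```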

Fix

1. Match PyTorch to system ROCm

pip install --force-reinstall torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/rocm7.2

--force-reinstall is required: pip doesn't compare the +rocmX.Y local version suffix, so without it pip treats the existing install as up to date and skips the new wheel. This pulled torch-2.11.0+rocm7.2 (stable).

2. Fix numpy compatibility

The new torch pulled in numpy 2.4.3, which broke opencv and numba:

pip install "numpy<2.3"

3. Launch flags

export HSA_ENABLE_SDMA=0
python main.py --disable-cuda-malloc --reserve-vram 2048
  • HSA_ENABLE_SDMA=0 — disables System DMA engine, avoids memory corruption on some AMD hardware during rapid alloc/free cycles
  • --disable-cuda-malloc — uses default PyTorch allocator instead of cudaMallocAsync
  • --reserve-vram 2048 — keeps 2 GB headroom so model swaps don’t push the driver to its limit
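The env var and flags above can be bundled into a small launcher sketch so they aren't forgotten between sessions (the function and constant names are mine; the values are exactly those from this section):

```python
import os

def comfy_launch_env(base=None):
    """Copy the environment and add the AMD workaround variable."""
    env = dict(os.environ if base is None else base)
    env["HSA_ENABLE_SDMA"] = "0"   # disable the System DMA engine
    return env

# Flags from this section, ready for subprocess.run(COMFY_ARGS, env=comfy_launch_env())
# from the ComfyUI checkout directory.
COMFY_ARGS = ["python", "main.py", "--disable-cuda-malloc", "--reserve-vram", "2048"]
```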

4. Cleanup

rm ~/comfy/ComfyUI/gpucore.*          # reclaim 47 GB
export GPU_COREDUMP_ENABLE=0           # prevent future dumps

Result

Model swapping now works reliably after matching the ROCm versions, and the "VM memory stats for proc (0) task (0) is non-zero when fini" kernel messages stopped.

Diagnostic checklist

If you’re hitting similar crashes on AMD:

  1. Check for gpucore.* files in your ComfyUI directory
  2. Check kernel log: journalctl -k -b | grep amdgpu
  3. Compare ROCm versions: pacman -Q rocm-core vs python -c "import torch; print(torch.version.hip)"
  4. If mismatched: pip install --force-reinstall torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocmX.Y
  5. Launch with --disable-cuda-malloc and HSA_ENABLE_SDMA=0