Fixing ComfyUI Crashes on AMD/ROCm
I run ComfyUI on an AMD Radeon (gfx1201, 16 GB VRAM) with ROCm on Arch Linux. It was crashing after a few generations — usually when switching checkpoints. No Python traceback, just a dead server. Here’s what caused it and how to fix it.
Setup
- AMD Radeon gfx1201, 16 GB VRAM
- Arch Linux, CachyOS kernel
- ROCm 7.2.1 system packages
- PyTorch 2.10.0.dev20251123+rocm7.1 (nightly)
- ComfyUI v0.3.75, SDXL models (~6.5 GB each)
Symptoms
The ComfyUI log would end mid-operation during model loading:
--- Logging error ---
Stopped server
No OOM, no traceback. The process just dies. This typically happened during CLIP or checkpoint transfers to cuda:0 when switching between models.
Diagnosis
GPU core dumps
Six gpucore.* files sat in the ComfyUI directory, totaling 47 GB. The amdgpu driver writes these on a GPU crash, which confirms the failure happens below Python.
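Checking for these dumps is easy to script; a minimal sketch, assuming only that the dumps match the gpucore.* pattern in the given directory:

```python
from pathlib import Path

def scan_gpucore(directory="."):
    """Find amdgpu core dumps (gpucore.*) and report their total size in bytes."""
    files = sorted(Path(directory).glob("gpucore.*"))
    total = sum(f.stat().st_size for f in files)
    return files, total

files, total = scan_gpucore()
print(f"{len(files)} dump(s), {total / 2**30:.1f} GiB")
```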
Kernel log
journalctl -k -b | grep amdgpu
amdgpu: VM memory stats for proc (0) task (0) is non-zero when fini
The GPU driver isn’t freeing virtual memory properly. Memory leaks across generations until allocation fails during a model swap.
The mismatch
pacman -Q rocm-core # 7.2.1
python -c "import torch; print(torch.version.hip)" # 7.1
System ROCm was 7.2.1. PyTorch was compiled against ROCm 7.1 — and it was a nightly dev build. The HIP runtime mismatch causes the memory allocator to silently leak. Setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True confirmed the issue:
expandable_segments not supported on this platform
(Triggered internally at /pytorch/c10/hip/HIPAllocatorConfig.h:40.)
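The version comparison behind the two commands above can be scripted; a sketch that compares at the major.minor level (the level that differed here), assuming version strings shaped like torch.version.hip ("7.1") and pacman's rocm-core version ("7.2.1"):

```python
def rocm_mismatch(torch_hip: str, system_rocm: str) -> bool:
    """True when torch's HIP build and the system ROCm differ at major.minor."""
    major_minor = lambda v: tuple(v.split(".")[:2])
    return major_minor(torch_hip) != major_minor(system_rocm)

# Values from this machine: torch nightly built for ROCm 7.1, rocm-core 7.2.1.
print(rocm_mismatch("7.1", "7.2.1"))  # -> True
```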
Fix
1. Match PyTorch to system ROCm
pip install --force-reinstall torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/rocm7.2
--force-reinstall is required: with torch already installed, pip reports the requirement as satisfied and never compares ROCm suffixes. This pulled torch-2.11.0+rocm7.2 (stable).
2. Fix numpy compatibility
The new torch pulled in numpy 2.4.3, which broke opencv and numba:
pip install "numpy<2.3"
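The same compatibility check can be done in code, under the assumption from this setup that numpy below 2.3 is the safe range (numpy_ok is a hypothetical helper, not part of any library):

```python
from importlib.metadata import version

def numpy_ok(v=None):
    """True when the numpy version predates the 2.3 line; in this setup,
    anything newer broke opencv and numba (matching the <2.3 pin above)."""
    v = v or version("numpy")
    major, minor = (int(x) for x in v.split(".")[:2])
    return (major, minor) < (2, 3)

print(numpy_ok("2.4.3"))  # the version the torch upgrade pulled in -> False
```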
3. Launch flags
export HSA_ENABLE_SDMA=0
python main.py --disable-cuda-malloc --reserve-vram 2048
- HSA_ENABLE_SDMA=0 — disables the System DMA engine; avoids memory corruption on some AMD hardware during rapid alloc/free cycles
- --disable-cuda-malloc — uses the default PyTorch allocator instead of cudaMallocAsync
- --reserve-vram 2048 — keeps 2 GB of headroom so model swaps don't push the driver to its limit
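The environment and flags can be bundled into a small launcher so they are never forgotten; a hypothetical sketch, assuming it runs from the ComfyUI directory (GPU_COREDUMP_ENABLE=0 is the core-dump suppression from the cleanup step below):

```python
import os
import subprocess

def comfy_env():
    """Build the launch environment, mirroring the exports above."""
    env = dict(os.environ)
    env["HSA_ENABLE_SDMA"] = "0"       # avoid SDMA-related memory corruption
    env["GPU_COREDUMP_ENABLE"] = "0"   # stop multi-GB gpucore.* dumps
    return env

CMD = ["python", "main.py", "--disable-cuda-malloc", "--reserve-vram", "2048"]

def launch(comfy_dir="."):
    """Start ComfyUI with the environment and flags above (not invoked here)."""
    return subprocess.run(CMD, cwd=comfy_dir, env=comfy_env(), check=True)

print(comfy_env()["HSA_ENABLE_SDMA"], " ".join(CMD))
```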
4. Cleanup
rm ~/comfy/ComfyUI/gpucore.* # reclaim 47 GB
export GPU_COREDUMP_ENABLE=0 # prevent future dumps
Result
Model swapping works reliably after matching the ROCm versions. The "non-zero when fini" kernel messages stopped.
Diagnostic checklist
If you’re hitting similar crashes on AMD:
- Check for gpucore.* files in your ComfyUI directory
- Check the kernel log: journalctl -k -b | grep amdgpu
- Compare ROCm versions: pacman -Q rocm-core vs python -c "import torch; print(torch.version.hip)"
- If mismatched: pip install --force-reinstall torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocmX.Y
- Launch with --disable-cuda-malloc and HSA_ENABLE_SDMA=0