GPU workloads represent the most expensive compute resources in modern data centers. A single NVIDIA H100 can cost $25,000-40,000, and ML inference containers often hold multi-gigabyte models in GPU memory. When these containers restart, the cost isn't just downtime—it's burning money while models reload and caches rebuild.
Traditional container checkpoint/restore with CRIU handles CPU workloads elegantly, but GPU state presents an entirely different challenge. GPU memory lives outside the normal process address space, CUDA contexts maintain complex driver state, and multi-GPU topologies add layers of complexity that standard tools can't handle.
A solution has emerged and is maturing quickly, but it still comes with real constraints. Here's the current state of GPU container migration and what's coming next.
The GPU State Problem
GPU workloads maintain state across multiple layers:
CUDA Runtime State:
- Device memory allocations (often GBs of model weights)
- CUDA contexts and streams
- cuDNN handle states
- Memory pool configurations
Driver-Level State:
- GPU scheduling contexts
- Memory management unit (MMU) mappings
- PCIe configuration state
- Multi-GPU communication channels
Container-Specific State:
- GPU device assignments
- Resource limits and quotas
- Runtime configuration (nvidia-container-runtime)
A standard CRIU checkpoint captures none of this. The process might restore, but it wakes up to find its GPU resources gone.
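You can see part of the problem directly: device memory never appears in the process's own address space, so a CPU-only dump has nothing to copy. A quick way to inspect what running CUDA processes hold on the GPU:
# Device memory held by running CUDA processes - state a CPU-only checkpoint never sees
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv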
Current Approaches: Why API Interception Fails
Most existing solutions use API interception with device proxies - a fundamentally flawed approach that introduces significant challenges:
Challenge 1: Performance Overhead
API interception sits in the critical path of every GPU operation. Research shows exponential overhead growth with training iterations - what starts as manageable latency becomes prohibitive for long-running workloads.
Challenge 2: Static vs Dynamic Linking
The CUDA runtime has defaulted to static linking since CUDA 5.5, but API interception requires dynamically linked libraries. That forces frameworks like PyTorch to be recompiled from source, which is often impractical in production environments. A quick way to check how a binary links against the CUDA runtime is shown below.
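To see whether interception is even an option for a given binary, check whether it links libcudart dynamically (the binary path here is illustrative):
# A dynamically linked binary lists libcudart.so and can be intercepted via LD_PRELOAD;
# a statically linked one (the CUDA default) shows no libcudart entry at all
ldd /usr/local/bin/my_cuda_app | grep libcudart || echo "libcudart is not dynamically linked"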
Challenge 3: Complex GPU State Management
GPUs maintain complex runtime state across streams, contexts, and memory hierarchies. Device proxies must reverse-engineer and replay this state, leading to reliability issues and non-deterministic behavior.
Challenge 4: Limited Ecosystem Support
Solutions like Cricket work for simple workloads but break with real-world applications that use advanced features like CUDA graphs, multi-GPU communication, or complex memory management patterns.
The Emerging Solution: CRIUgpu
The breakthrough came in 2025 with CRIUgpu, a research project that integrates NVIDIA's cuda-checkpoint utility with CRIU to achieve fully transparent GPU container checkpointing. Unlike approaches that rely on API interception, CRIUgpu creates unified CPU-GPU snapshots with no steady-state performance overhead. It operates at the CUDA driver level, capturing GPU memory and context state directly rather than replaying intercepted API calls.
How CRIUgpu Works
CRIUgpu leverages NVIDIA's cuda-checkpoint utility integrated with CRIU plugins:
# Install CRIU 4.0+ (the CUDA plugin ships with upstream CRIU)
git clone https://github.com/checkpoint-restore/criu
cd criu
make && sudo make install
# The CUDA plugin handles GPU state automatically
# Checkpoint a running GPU container (transparent to the application)
podman container checkpoint my-gpu-container
# Restore on the same node, or on another node with identical GPU topology
podman container restore my-gpu-container
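Moving a checkpoint between nodes works the same way, provided the target node has an identical GPU topology; a sketch with illustrative paths and hostnames:
# On the source node: checkpoint and export to an archive
podman container checkpoint --export /tmp/my-gpu-container.tar.gz my-gpu-container
# Copy the archive to a node with the same GPU type, count, and memory size
scp /tmp/my-gpu-container.tar.gz gpu-node-2:/tmp/
# On the target node: restore from the archive
podman container restore --import /tmp/my-gpu-container.tar.gz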
The process:
- Lock: CUDA APIs are locked, preventing new GPU operations
- Complete: Active GPU work finishes (with configurable timeout)
- Checkpoint: GPU memory copied to host, unified with CPU state
- Release: GPU resources released, container becomes CPU-only
Restore process:
- Acquire: GPU resources re-acquired
- Restore: GPU memory and contexts restored at original addresses
- Unlock: CUDA APIs unlocked, application resumes
Key advantages:
- No API interception overhead
- Works with statically linked applications
- Supports both CUDA and ROCm
- Unified CPU-GPU snapshots
- Deterministic restore behavior
What gets captured:
- Device memory contents (copied to host during checkpoint)
- CUDA contexts, streams, and events
- GPU memory mappings (restored at original addresses)
- CUDA driver state
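Outside a container, the same lock/checkpoint/restore cycle can be driven by hand with the cuda-checkpoint utility and plain CRIU. A minimal sketch for a standalone CUDA process (the process name and image directory are illustrative):
# Assumes a running CUDA process, an r550+ driver, and CRIU 4.0+
PID=$(pgrep -f my_cuda_app)          # illustrative target process
IMGDIR=/var/lib/criu/my_cuda_app     # illustrative image directory
mkdir -p "$IMGDIR"
# Toggle CUDA into the checkpointed state (device memory is copied to the host)
cuda-checkpoint --toggle --pid "$PID"
# Dump the now CPU-only process with CRIU
criu dump --shell-job --images-dir "$IMGDIR" --tree "$PID"
# Later: restore the process, then toggle CUDA back to running
criu restore --shell-job --restore-detached --images-dir "$IMGDIR"
cuda-checkpoint --toggle --pid "$PID"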
Production Performance Results
Recent research demonstrates CRIUgpu's production readiness with large-scale workloads:
Large Language Models
LLaMA 3.1 (8B parameters) on H100:
- Checkpoint time: 77 seconds
- Restore time: 39 seconds
- Checkpoint size: 56GB (97% GPU memory)
GPT-2 XL (1.5B parameters) on A100:
- Checkpoint time: 131 seconds
- Restore time: 145 seconds
- Checkpoint size: 60GB (96% GPU memory)
Multi-GPU Scaling
CRIUgpu scales linearly with GPU count:
- 1x A100: 13 seconds checkpoint, 8 seconds restore
- 2x A100: 26 seconds checkpoint, 17 seconds restore
- 4x A100: 55 seconds checkpoint, 35 seconds restore
Zero Runtime Overhead
Unlike API interception approaches, CRIUgpu introduces no steady-state performance overhead. Applications run at native speed until checkpoint/restore operations.
Container Runtime Integration
Custom runtime hooks can coordinate GPU and CPU state. The handler paths and arguments below are illustrative rather than a shipped interface:
{
  "runtimeArgs": [
    "--gpu-checkpoint-handler=/usr/bin/cuda-checkpoint-handler"
  ],
  "hooks": {
    "prestart": [
      {
        "path": "/usr/bin/cuda-checkpoint-restore",
        "args": ["restore", "checkpoint-id"]
      }
    ]
  }
}
Production Challenges
Memory Transfer Overhead
GPU memory dumps are massive, but specific timing depends on:
- GPU memory size and utilization
- Storage I/O bandwidth
- Network transfer for cross-node migration
- Memory access patterns during dump
Performance characteristics need measurement in your specific environment.
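A back-of-the-envelope estimate helps set expectations. Assuming an 80 GB device-memory dump, roughly 2 GB/s of effective storage bandwidth, and a 10 Gbit/s link for cross-node copies:
# ~40 seconds just to persist the dump at 2 GB/s
echo "80 / 2" | bc
# ~64 more seconds to move it across a 10 Gbit/s network (80 GB * 8 bits / 10 Gbit/s)
echo "80 * 8 / 10" | bc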
CUDA Version Compatibility
# Illustrative pseudo-commands: a checkpoint created on a node running CUDA 11.8...
cuda-checkpoint-create --cuda-version 11.8 checkpoint1
# ...fails to restore on a node running CUDA 12.1
cuda-checkpoint-restore checkpoint1 # version mismatch error
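Before restoring on another node, compare the driver and CUDA versions on both ends; nvidia-smi reports them directly:
# Run on both the checkpoint source and the restore target, then compare
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-smi | grep "CUDA Version"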
Multi-GPU Topology Preservation
Complex topologies don't restore cleanly:
# Original: 4x A100 with NVLink
nvidia-smi topo -m
# Restored: Different PCIe layout
# NVLink connections lost, performance degraded
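A practical safeguard is to record the topology at checkpoint time and compare it before restoring on a candidate node (paths are illustrative):
# At checkpoint time, save the topology next to the checkpoint
nvidia-smi topo -m > /var/lib/checkpoints/my-gpu-container.topo
# Before restoring elsewhere, diff it against the target node's topology
nvidia-smi topo -m > /tmp/target.topo
diff /var/lib/checkpoints/my-gpu-container.topo /tmp/target.topo \
  && echo "topology matches" || echo "topology differs: expect degraded or failed restore"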
Current Limitations and Requirements
Hardware Requirements:
- Display driver 570+ (full feature set)
- Display driver 550+ (basic functionality)
- Linux x86_64 only
- Same GPU topology for restore (type, count, memory size)
Current Limitations:
- No UVM (Unified Virtual Memory) support
- No GPU-to-GPU migration between different hardware
- No NCCL support for multi-node distributed training
- Multi-node checkpointing requires additional coordination
Container Integration:
- Requires CRIU 4.0+
- Podman support available
- Container Device Interface (CDI) integration
- NVIDIA Container Toolkit compatibility
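A typical single-node setup generates a CDI specification with the NVIDIA Container Toolkit, runs the container against it, and checkpoints with Podman (the image name is illustrative):
# Generate the CDI spec for this node's GPUs
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# Run a GPU workload via CDI
podman run -d --name my-gpu-container --device nvidia.com/gpu=all my-inference-image
# Checkpoint and restore it transparently
podman container checkpoint my-gpu-container
podman container restore my-gpu-container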
Real-World Readiness
CRIUgpu has been integrated into the upstream CRIU project (version 4.0+) and is available for production use. The technology has moved beyond research prototypes to provide a robust foundation for GPU container checkpointing in enterprise environments.
Container runtimes like Podman already support CRIUgpu through their native CRIU integration, enabling transparent GPU container checkpointing with little more than an up-to-date CRIU and NVIDIA driver stack.
When to Consider GPU Checkpointing
Strong candidates:
- Long-running ML training jobs
- Inference services with expensive model loading
- Multi-tenant GPU sharing with SLA requirements
- Research workloads with checkpoint/restart patterns
Avoid for now:
- Latency-sensitive real-time inference
- Simple stateless GPU applications
- Production systems requiring reliability guarantees
Building Toward Production
If you're planning GPU checkpoint/restore:
- Start with application-level approaches
- Prototype with cuda-checkpoint in development
- Measure checkpoint and restore costs carefully (see the sketch after this list)
- Plan for manual orchestration initially
- Monitor NVIDIA's roadmap for production-ready features
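The simplest way to get real numbers is to time the operations against a representative workload (the container name is illustrative):
# Rough timing for checkpoint and restore in your environment
time podman container checkpoint my-gpu-container
time podman container restore my-gpu-container
The dominant cost is usually the device-memory copy, so run the test with the model and batch sizes you actually serve.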
Conclusion
GPU container checkpointing has evolved from experimental research into technology you can deploy today. CRIUgpu's approach eliminates the fundamental flaws of API interception and makes transparent GPU workload migration practical, with downtime limited to the checkpoint and restore operations themselves.
The technology is no longer "coming soon": it is merged into upstream CRIU and supported by container runtimes. For organizations running GPU-intensive workloads, CRIUgpu offers:
- Transparent checkpointing without application changes
- Zero runtime overhead during normal operation
- Unified CPU-GPU snapshots for complete state preservation
- Linear scaling across multiple GPUs
- Deterministic restore behavior
The business case is compelling: GPU resources are too expensive to waste on cold restarts that reload models and rebuild caches from scratch, and the technology to avoid them now exists. Teams that adopt it stand to gain more efficient GPU utilization and far less time lost to unnecessary restarts.
For platform teams managing GPU infrastructure, the question is no longer whether to adopt GPU checkpointing, but how quickly you can integrate CRIUgpu into your container orchestration pipeline.