GPU Container Checkpoint/Restore with CRIUgpu: Zero-Downtime Live Migration for ML Workloads

Debo Ray

Co-Founder, CEO

July 11, 2025

GPU workloads represent the most expensive compute resources in modern data centers. A single NVIDIA H100 can cost $25,000-40,000, and ML inference containers often hold multi-gigabyte models in GPU memory. When these containers restart, the cost isn't just downtime—it's burning money while models reload and caches rebuild.

Traditional container checkpoint/restore with CRIU handles CPU workloads elegantly, but GPU state presents an entirely different challenge. GPU memory lives outside the normal process address space, CUDA contexts maintain complex driver state, and multi-GPU topologies add layers of complexity that standard tools can't handle.

The solution has arrived, though it comes with real constraints. Here's the current state of GPU container migration and what's coming next.

The GPU State Problem

GPU workloads maintain state across multiple layers:

CUDA Runtime State:

  • Device memory allocations (often GBs of model weights)
  • CUDA contexts and streams
  • cuDNN handle states
  • Memory pool configurations

Driver-Level State:

  • GPU scheduling contexts
  • Memory management unit (MMU) mappings
  • PCIe configuration state
  • Multi-GPU communication channels

Container-Specific State:

  • GPU device assignments
  • Resource limits and quotas
  • Runtime configuration (nvidia-container-runtime)

A standard CRIU checkpoint captures none of this. The process might restore, but it wakes up to find its GPU resources gone.
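
You can see the gap directly on a running node: the driver, not the process, accounts for device memory, so none of it appears in the address space CRIU walks. A minimal check (the process name is hypothetical):

# The driver's view: which processes hold GPU memory, and how much
nvidia-smi   # see the Processes table at the bottom

# The process's view: only mappings of /dev/nvidia* and driver libraries show
# up here; the VRAM behind them is not readable by a plain CRIU dump
PID=$(pgrep -f my-inference-server)   # hypothetical process name
grep -i nvidia /proc/$PID/maps | head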

Current Approaches: Why API Interception Fails

Most existing solutions use API interception with device proxies, a fundamentally flawed approach that introduces significant challenges:

Challenge 1: Performance Overhead

API interception sits in the critical path of every GPU operation. Research shows overhead growing exponentially with training iterations: what starts as manageable latency becomes prohibitive for long-running workloads.

Challenge 2: Static vs Dynamic Linking

CUDA has defaulted to statically linking its runtime since version 5.5, but API interception requires dynamic linking. This forces frameworks like PyTorch to be recompiled from source, which is often impractical in production environments.
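
You can check whether a given binary is even interceptable before investing in a proxy-based approach; the binary name and build command below are illustrative:

# LD_PRELOAD-style interception only works if the CUDA runtime is a shared library
ldd ./train | grep -i libcudart || echo "libcudart is statically linked: interception shims cannot hook it"

# nvcc has linked the runtime statically by default since CUDA 5.5;
# opting back into dynamic linking means rebuilding from source
nvcc --cudart=shared -o train train.cu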

Challenge 3: Complex GPU State Management

GPUs maintain complex runtime state across streams, contexts, and memory hierarchies. Device proxies must reverse-engineer and replay this state, leading to reliability issues and non-deterministic behavior.

Challenge 4: Limited Ecosystem Support

Solutions like Cricket work for simple workloads but break with real-world applications that use advanced features like CUDA graphs, multi-GPU communication, or complex memory management patterns.

The Emerging Solution: CRIUgpu

The breakthrough came in 2025 with CRIUgpu, a research project that integrates NVIDIA's cuda-checkpoint with CRIU to achieve fully transparent GPU container checkpointing. Unlike previous approaches that rely on API interception, CRIUgpu creates unified CPU-GPU snapshots with no steady-state performance overhead. It operates at the CUDA driver level, capturing GPU memory, contexts, and driver state directly rather than replaying intercepted API calls.

How CRIUgpu Works

CRIUgpu leverages NVIDIA's cuda-checkpoint utility integrated with CRIU plugins:

# Install CRIUgpu (requires CRIU 4.0+)
git clone https://github.com/checkpoint-restore/criu
cd criu && make && sudo make install
# The CUDA plugin automatically handles GPU state

# Checkpoint a GPU container (transparent to the application)
podman container checkpoint my-gpu-container

# Restore on the same or a different node
podman container restore my-gpu-container
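
To actually move a checkpoint between hosts, podman's export/import flow is one option; the hostnames and paths below are illustrative, and the target node needs the same GPU type, count, and driver:

# Export the checkpoint as a portable archive
podman container checkpoint --export=/tmp/my-gpu-container.tar.gz my-gpu-container

# Ship it to the destination node
scp /tmp/my-gpu-container.tar.gz gpu-node-2:/tmp/

# Recreate the container from the archive on the destination
ssh gpu-node-2 "podman container restore --import=/tmp/my-gpu-container.tar.gz"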

Checkpoint process:

  1. Lock: CUDA APIs are locked, preventing new GPU operations
  2. Complete: Active GPU work finishes (with configurable timeout)
  3. Checkpoint: GPU memory copied to host, unified with CPU state
  4. Release: GPU resources released, container becomes CPU-only

Restore process:

  1. Acquire: GPU resources re-acquired by the process
  2. Copy: Device memory copied back to the GPU
  3. Restore: CUDA contexts, objects, and memory mappings restored at their original addresses
  4. Unlock: CUDA APIs unlocked, application resumes
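
These lock/copy/release and acquire/copy/unlock steps are driven by NVIDIA's cuda-checkpoint utility, which can also be exercised by hand against a single process; the PID lookup below is illustrative, and the flags follow the utility's public repository:

# Find the CUDA process to checkpoint
PID=$(pgrep -f my-inference-server)   # hypothetical process name

# Toggle into the checkpointed state: lock CUDA APIs, drain submitted work,
# copy device memory to host, release the GPU
cuda-checkpoint --toggle --pid "$PID"

# ...dump CPU state with CRIU, optionally move the image to another node...

# Toggle back: re-acquire the GPU, copy memory back, unlock CUDA APIs
cuda-checkpoint --toggle --pid "$PID"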

Key advantages:

  • No API interception overhead
  • Works with statically linked applications
  • Supports both CUDA and ROCm
  • Unified CPU-GPU snapshots
  • Deterministic restore behavior

What gets captured:

  • Device memory contents (copied to host during checkpoint)
  • CUDA contexts, streams, and events
  • GPU memory mappings (restored at original addresses)
  • CUDA driver state

Production Performance Results

Recent research demonstrates CRIUgpu's production readiness with large-scale workloads:

Large Language Models

LLaMA 3.1 (8B parameters) on H100:

  • Checkpoint time: 77 seconds
  • Restore time: 39 seconds
  • Checkpoint size: 56GB (97% GPU memory)

GPT-2 XL (1.5B parameters) on A100:

  • Checkpoint time: 131 seconds
  • Restore time: 145 seconds
  • Checkpoint size: 60GB (96% GPU memory)

Multi-GPU Scaling

CRIUgpu scales linearly with GPU count:

  • 1x A100: 13 seconds checkpoint, 8 seconds restore
  • 2x A100: 26 seconds checkpoint, 17 seconds restore
  • 4x A100: 55 seconds checkpoint, 35 seconds restore

Zero Runtime Overhead

Unlike API interception approaches, CRIUgpu introduces no steady-state performance overhead. Applications run at native speed until checkpoint/restore operations.

Container Runtime Integration

Custom runtime hooks can coordinate GPU and CPU state:

{
  "runtimeArgs": [
    "--gpu-checkpoint-handler=/usr/bin/cuda-checkpoint-handler"
  ],
  "hooks": {
    "prestart": [
      {
        "path": "/usr/bin/cuda-checkpoint-restore",
        "args": ["restore", "checkpoint-id"]
      }
    ]
  }
}

Production Challenges

Memory Transfer Overhead

GPU memory dumps are massive, but specific timing depends on:

  • GPU memory size and utilization
  • Storage I/O bandwidth
  • Network transfer for cross-node migration
  • Memory access patterns during dump

Performance characteristics need measurement in your specific environment.
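
A minimal way to get those numbers for your own workloads is to time a checkpoint and inspect the image it produces; the container name and path are illustrative:

# Time a full checkpoint to local disk
time podman container checkpoint --export=/tmp/ckpt.tar.gz my-gpu-container

# Image size roughly tracks allocated GPU memory plus CPU state
du -h /tmp/ckpt.tar.gz

# Effective bandwidth = image size / checkpoint time; compare it against your
# storage and network limits before planning cross-node migrations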

CUDA Version Compatibility

# Checkpoint created with CUDA 11.8
cuda-checkpoint-create --cuda-version 11.8 checkpoint1

# Restore fails on a node with CUDA 12.1
cuda-checkpoint-restore checkpoint1  # version mismatch error
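
One way to avoid surprises is to record driver and CUDA versions next to every checkpoint and compare them before restoring; the metadata files below are a convention of this sketch, not something CRIU writes for you:

# At checkpoint time: save version info alongside the image
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1 > /tmp/ckpt.driver
nvcc --version | grep release > /tmp/ckpt.cuda

# At restore time: bail out early on a mismatch
RESTORE_DRIVER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1)
[ "$RESTORE_DRIVER" = "$(cat /tmp/ckpt.driver)" ] || echo "driver mismatch: restore will likely fail"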

Multi-GPU Topology Preservation

Complex topologies don't restore cleanly:

# Original: 4x A100 with NVLink
nvidia-smi topo -m

# Restored: different PCIe layout
# NVLink connections lost, performance degraded
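
A simple guard is to snapshot the topology matrix with the checkpoint and diff it on the restore target before resuming; the file paths are illustrative:

# On the source node, save the interconnect matrix with the checkpoint
nvidia-smi topo -m > /tmp/ckpt.topo

# On the restore target, compare before resuming the container
nvidia-smi topo -m > /tmp/restore.topo
diff -u /tmp/ckpt.topo /tmp/restore.topo && echo "topology matches" \
  || echo "topology differs: NVLink/PCIe paths changed, expect degraded collective performance"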

Current Limitations and Requirements

Hardware Requirements:

  • Display driver 570+ (full feature set; see the version check after this list)
  • Display driver 550+ (basic functionality)
  • Linux x86_64 only
  • Same GPU topology for restore (type, count, memory size)
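
A quick pre-flight check against those driver floors might look like the following; this is a sketch based on the requirements above:

# Check the installed driver against the 570/550 feature floors
DRIVER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1)
MAJOR=${DRIVER%%.*}
if [ "$MAJOR" -ge 570 ]; then
  echo "driver $DRIVER: full checkpoint feature set available"
elif [ "$MAJOR" -ge 550 ]; then
  echo "driver $DRIVER: basic checkpoint functionality only"
else
  echo "driver $DRIVER: too old for GPU checkpointing"
fi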

Current Limitations:

  • No UVM (Unified Virtual Memory) support
  • No GPU-to-GPU migration between different hardware
  • No NCCL support for multi-node distributed training
  • Multi-node checkpointing requires additional coordination

Container Integration:

  • Requires CRIU 4.0+
  • Podman support available
  • Container Device Interface (CDI) integration (see the sketch after this list)
  • NVIDIA Container Toolkit compatibility
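
For the CDI path, a typical Podman setup looks like the following; the image tag is illustrative:

# Generate a CDI spec describing the node's GPUs
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Inject GPUs via CDI instead of the legacy runtime hook
podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable \
  docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi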

Real-World Readiness

CRIUgpu has been integrated into the upstream CRIU project (version 4.0+) and is available for production use. The technology has moved beyond research prototypes to provide a robust foundation for GPU container checkpointing in enterprise environments.

Container runtimes like Podman already support CRIUgpu through native CRIU integration, enabling transparent GPU container checkpointing without additional tooling or infrastructure changes.

When to Consider GPU Checkpointing

Strong candidates:

  • Long-running ML training jobs
  • Inference services with expensive model loading
  • Multi-tenant GPU sharing with SLA requirements
  • Research workloads with checkpoint/restart patterns

Avoid for now:

  • Latency-sensitive real-time inference
  • Simple stateless GPU applications
  • Production systems requiring reliability guarantees

Building Toward Production

If you're planning GPU checkpoint/restore:

  1. Start with application-level approaches
  2. Prototype with cuda-checkpoint in development
  3. Measure performance overhead carefully
  4. Plan for manual orchestration initially
  5. Track NVIDIA driver releases for features still on the roadmap (UVM, NCCL, cross-hardware restore)

Conclusion

GPU container checkpointing has evolved from experimental research to production-ready technology. CRIUgpu's breakthrough approach eliminates the fundamental flaws of API interception while delivering true zero-downtime GPU workload migration.

The technology is no longer "coming soon" - it's here, with production deployments already demonstrating its value. For organizations running GPU-intensive workloads, CRIUgpu offers:

  • Transparent checkpointing without application changes
  • Zero runtime overhead during normal operation
  • Unified CPU-GPU snapshots for complete state preservation
  • Linear scaling across multiple GPUs
  • Deterministic restore behavior

The business case is compelling: GPU resources are too expensive to waste on unnecessary restarts, and the technology now exists to eliminate them entirely. Early adopters are already gaining competitive advantages through more efficient GPU utilization and true zero-downtime operations.

For platform teams managing GPU infrastructure, the question is no longer whether to adopt GPU checkpointing, but how quickly you can integrate CRIUgpu into your container orchestration pipeline.
