GPU Slicing and Multi-Tenant GPU Configuration

Introduction

Graphics Processing Units (GPUs) have evolved far beyond their original purpose of rendering graphics. Today, they are essential for artificial intelligence (AI), machine learning (ML), and high-performance computing workloads. NVIDIA GPUs, in particular, dominate the ML landscape due to their specialized architecture designed for parallel processing, making them indispensable for matrix operations and mathematical computations that power AI-driven insights.

However, GPUs are expensive resources, and maximizing their utilization is crucial for cost-effective operations. Traditional GPU allocation methods often lead to underutilization, where a single workload consumes an entire GPU but only uses a fraction of its computational capacity. This is where GPU sharing strategies become essential.

NVIDIA GPU Sharing Strategies

NVIDIA provides several approaches to enable multiple workloads to share GPU resources efficiently, including:

1. Time-Slicing

Time-slicing divides GPU access into small time intervals, allowing different tasks to use the GPU in predefined time slices. This approach is similar to how CPUs time-slice between different processes.

How it works:

  • Multiple processes share a single GPU by taking turns accessing it
  • The GPU scheduler allocates time slots to different workloads
  • No memory isolation between processes
  • Suitable for workloads with intermittent GPU usage patterns
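In Kubernetes, time-slicing is typically enabled through the NVIDIA device plugin's sharing configuration. A minimal sketch (illustrative values; see the device plugin documentation for the full schema):

```yaml
# NVIDIA Kubernetes device plugin config (illustrative).
# With replicas: 4, each physical GPU is advertised as four
# nvidia.com/gpu resources that workloads take turns using.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```

Note that the replicas only multiply the advertised resource count; they do not partition memory or enforce fair scheduling between the sharing processes.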

Ideal use cases:

  • Development and testing environments
  • Multiple small-scale workloads
  • Batch processing tasks
  • Real-time analytics with streaming data
  • Cost-sensitive environments with budget constraints

2. Multi-Instance GPU (MIG)

MIG is available on NVIDIA A100, A30, and H100 GPUs, allowing a single physical GPU to be partitioned into multiple isolated instances. Each instance has its own dedicated memory, cache, and compute cores.

How it works:

  • Physical GPU is partitioned into up to 7 separate instances
  • Each instance provides guaranteed performance and memory isolation
  • Complete fault isolation between instances
  • Hardware-level partitioning ensures predictable performance
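Once a GPU is partitioned, each MIG instance can be consumed like an ordinary Kubernetes resource. A sketch of a pod requesting one slice, assuming the NVIDIA device plugin's "mixed" MIG strategy (pod name and image are illustrative):

```yaml
# Illustrative pod spec: with the "mixed" MIG strategy, each MIG
# profile is advertised as its own resource type on the node.
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference
spec:
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1  # one 1g.5gb MIG instance
```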

Ideal use cases:

  • Multi-tenant environments requiring strict isolation
  • Cloud service providers offering GPU-as-a-Service
  • Workloads requiring guaranteed performance levels
  • Environments with strict SLA requirements

Time-Slicing vs MIG: Detailed Comparison

| Aspect | Time-Slicing | Multi-Instance GPU (MIG) |
| --- | --- | --- |
| Memory Isolation | No isolation - shared memory space | Complete memory isolation per instance |
| Fault Isolation | No isolation - one crash affects all | Complete fault isolation |
| Performance Guarantees | No guarantees - best-effort sharing | Guaranteed performance per instance |
| GPU Support | All NVIDIA GPUs | Limited to A100, A30, H100 |
| Resource Overhead | Minimal overhead | Some overhead due to partitioning |
| Use Case | Resource optimization | Strong isolation and SLA requirements |

Configuring GPU Slicing DIY

Enabling and managing GPU slicing infrastructure is a complex undertaking that involves multiple layers of technology, careful resource planning, and ongoing operational overhead. While platforms like Omnistrate abstract much of this complexity, understanding the underlying challenges helps appreciate the engineering effort required to make GPU sharing work effectively.

Infrastructure Complexity Overview

GPU slicing requires orchestrating several complex systems that must work together seamlessly:

1. Hardware-Level Considerations

  • GPU Architecture Compatibility: Not all GPUs support the same sharing mechanisms. NVIDIA's MIG is only available on A100, A30, and H100 GPUs, while time-slicing works across different GPU generations but with varying performance characteristics

2. Kernel and Driver Stack Complexity

  • NVIDIA Driver Management: Requires specific driver versions that support sharing features, with complex upgrade paths that can break existing workloads
  • CUDA Runtime Coordination: Managing CUDA contexts across multiple processes requires sophisticated scheduling and memory management

3. Container Orchestration Challenges

  • Device Plugin Architecture: Implementing and maintaining custom Kubernetes device plugins that can advertise virtual GPU resources accurately
  • Resource Scheduling Complexity: The Kubernetes scheduler must understand GPU topology, memory constraints, and performance characteristics to make optimal placement decisions
  • Namespace Isolation: Ensuring proper isolation between different tenants while maintaining GPU access

Operational Management Complexity

Capacity Planning and Resource Allocation

  • Workload Characterization: Understanding the GPU usage patterns of different workloads to optimize sharing ratios
  • Performance Modeling: Predicting how different combinations of workloads will perform when sharing GPU resources
  • Cost Optimization: Balancing the cost of GPU instances against the performance impact of sharing
  • Scaling Strategies: Determining when to scale horizontally (more GPU instances) vs. vertically (more sharing on existing GPUs)

Lifecycle Management

  • Rolling Updates: Updating GPU drivers, CUDA versions, or container runtimes without disrupting running workloads
  • Workload Migration: Moving workloads between different GPU instances during maintenance or optimization
  • Disaster Recovery: Implementing backup and recovery strategies for stateful GPU workloads
  • Version Compatibility: Managing compatibility matrices between CUDA versions, driver versions, and application requirements

Why Managed Solutions Matter

The complexity outlined above explains why managed platforms like Omnistrate provide significant value:

  • Abstraction of Complexity: Hiding the intricate details of GPU driver management, device plugin configuration, and monitoring setup
  • Tested Configurations: Providing pre-validated combinations of hardware, software, and configuration that work reliably together
  • Automated Operations: Handling routine maintenance, updates, and optimization tasks automatically
  • Expert Support: Access to specialists who understand the nuances of GPU sharing infrastructure

Configuring GPU Slicing in Omnistrate

Omnistrate provides built-in support for GPU slicing through the multiTenantGpu feature, making it easy to deploy services that efficiently share GPU resources across multiple tenants.

Configuration Overview

To enable GPU slicing in your Omnistrate service, you need to add the x-internal-integrations section to your compose.yaml file:

x-internal-integrations: 
  multiTenantGpu: 
    instanceType: g4dn.xlarge # instance type to be used for GPU slicing
    timeSlicingReplicas: 2 # number of replicas to be used for time slicing
    migProfile: 1g.5gb # optional: MIG profile for A100/H100 GPUs

Configuration Parameters

instanceType

Specifies the EC2 instance type that will host the GPU slicing functionality. Common GPU-enabled instance types include:

  • g4dn.xlarge: 1 NVIDIA T4 GPU, 4 vCPUs, 16 GB RAM - Cost-effective for inference workloads
  • g4dn.2xlarge: 1 NVIDIA T4 GPU, 8 vCPUs, 32 GB RAM - Balanced compute and memory
  • p3.2xlarge: 1 NVIDIA V100 GPU, 8 vCPUs, 61 GB RAM - High-performance training
  • p4d.24xlarge: 8 NVIDIA A100 GPUs, 96 vCPUs, 1152 GB RAM - Multi-GPU training

timeSlicingReplicas

Defines how many virtual GPU replicas are advertised for each physical GPU, which determines how many concurrent workloads can share a single GPU. Keep in mind that time-slicing does not enforce these shares; the percentages below are best-effort averages, not guarantees.

  • Value of 2: Each physical GPU appears as 2 virtual GPUs (~50% of GPU time per workload)
  • Value of 4: Each physical GPU appears as 4 virtual GPUs (~25% of GPU time per workload)
  • Value of 8: Each physical GPU appears as 8 virtual GPUs (~12.5% of GPU time per workload)

migProfile (Optional)

Specifies the Multi-Instance GPU (MIG) profile to use when the instance type supports MIG (A100, A30, H100 GPUs). This parameter enables hardware-level GPU partitioning with guaranteed isolation and performance.
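For reference, the sketch below annotates the standard MIG profiles on a 40 GB A100; exact availability depends on the GPU model and memory size, so check NVIDIA's MIG documentation for your hardware:

```yaml
x-internal-integrations:
  multiTenantGpu:
    instanceType: p4d.24xlarge
    # Common MIG profiles on a 40 GB A100 (compute slices . memory):
    #   1g.5gb  - up to 7 instances per GPU
    #   2g.10gb - up to 3 instances per GPU
    #   3g.20gb - up to 2 instances per GPU
    #   4g.20gb - 1 instance per GPU
    #   7g.40gb - 1 instance (the full GPU)
    migProfile: 1g.5gb
```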

Note

MIG and time-slicing can be combined to create a multi-layered GPU sharing strategy. You can use both migProfile and timeSlicingReplicas together to further subdivide MIG instances with time-slicing.
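A combined configuration might look like the following sketch, where each hardware-isolated MIG instance is further time-sliced between two workloads (values are illustrative):

```yaml
x-internal-integrations:
  multiTenantGpu:
    instanceType: p4d.24xlarge  # A100 GPUs with MIG support
    migProfile: 1g.5gb          # hardware partition: up to 7 per GPU
    timeSlicingReplicas: 2      # each MIG instance shared by 2 workloads
```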

Complete Example Configuration

Here are complete examples of GPU-sliced service configurations:

Time-Slicing Configuration Example

version: '3.9'

x-omnistrate-service-plan:
  name: 'gpu-slicing-example hosted tier' 
  tenancyType: 'OMNISTRATE_MULTI_TENANCY' 

x-internal-integrations: 
  multiTenantGpu: 
    instanceType: g4dn.xlarge # instance type to be used for GPU slicing
    timeSlicingReplicas: 2 # number of replicas to be used for time slicing

services:
  gpuinfo:
    image: ghcr.io/omnistrate-community/gpu-slicing-example:0.0.3
    ports:
      - 5000:5000
    platform: linux/amd64
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 100M
        reservations:
          cpus: '100m'
          memory: 50M
    x-omnistrate-capabilities:
      autoscaling: 
        maxReplicas: 3
        minReplicas: 1
        idleMinutesBeforeScalingDown: 2
        idleThreshold: 20
        overUtilizedMinutesBeforeScalingUp: 3
        overUtilizedThreshold: 80
      serverlessConfiguration:
        targetPort: 5000
        enableAutoStop: true
        minimumNodesInPool: 1

MIG Configuration Example

version: '3.9'

x-omnistrate-service-plan:
  name: 'gpu-mig-example hosted tier' 
  tenancyType: 'OMNISTRATE_MULTI_TENANCY' 

x-internal-integrations: 
  multiTenantGpu: 
    instanceType: p4d.24xlarge # A100 GPU instance supporting MIG
    migProfile: 1g.5gb # MIG profile for hardware-level isolation

services:
  gpuinfo:
    image: ghcr.io/omnistrate-community/gpu-slicing-example:0.0.3
    ports:
      - 5000:5000
    platform: linux/amd64
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 500M
        reservations:
          cpus: '500m'
          memory: 250M
    x-omnistrate-capabilities:
      autoscaling: 
        maxReplicas: 5
        minReplicas: 1
        idleMinutesBeforeScalingDown: 2
        idleThreshold: 20
        overUtilizedMinutesBeforeScalingUp: 3
        overUtilizedThreshold: 80
      serverlessConfiguration:
        targetPort: 5000
        enableAutoStop: true
        minimumNodesInPool: 1

How GPU Slicing Works in Omnistrate

  1. Infrastructure Provisioning: Omnistrate provisions the specified GPU-enabled EC2 instance
  2. NVIDIA Device Plugin: Automatically installs and configures the NVIDIA Kubernetes device plugin
  3. Time-Slicing Configuration: Configures the device plugin with the specified number of replicas
  4. Resource Advertisement: The Kubernetes scheduler sees multiple virtual GPUs instead of one physical GPU
  5. Workload Scheduling: Multiple pods can be scheduled to share the same physical GPU
  6. Automatic Scaling: Omnistrate can automatically scale the number of replicas based on demand
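From a workload's point of view, each virtual GPU is requested like an ordinary GPU resource. A sketch of the container resource section a tenant pod might carry under timeSlicingReplicas: 2 (assumed standard NVIDIA device plugin resource naming):

```yaml
# Illustrative: with timeSlicingReplicas: 2, a node with one physical
# GPU advertises nvidia.com/gpu: 2, so two pods like this can share it.
resources:
  limits:
    nvidia.com/gpu: 1  # one virtual (time-sliced) GPU
```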

Best Practices

Choosing the Right Instance Type

  • For inference workloads: Use g4dn instances with T4 GPUs for cost-effectiveness
  • For training workloads: Use p3 instances with V100 GPUs for better performance
  • For large-scale training: Consider p4d instances with A100 GPUs and MIG support

Setting Time-Slicing Replicas

  • Start conservative: Begin with 2-4 replicas and monitor performance
  • Monitor GPU utilization: Use tools like nvidia-smi to track actual GPU usage
  • Consider workload characteristics: CPU-bound tasks can share more aggressively than GPU-intensive ones
  • Account for memory usage: Ensure total GPU memory usage doesn't exceed physical limits

Resource Management

  • Set appropriate CPU and memory limits for your containers
  • Use Omnistrate's autoscaling capabilities to handle varying demand
  • Monitor performance metrics to optimize replica counts
  • Consider using different configurations for development vs production environments

Example Use Cases

AI/ML Model Serving

Deploy multiple model inference endpoints that share GPU resources efficiently:

  • Each model gets dedicated time slices for inference
  • Cost-effective serving of multiple models
  • Automatic scaling based on request volume

Development and Testing

Enable multiple developers to share GPU resources:

  • Each developer gets access to GPU acceleration
  • Reduced infrastructure costs for development teams
  • Isolated development environments

Batch Processing

Process multiple data pipelines concurrently:

  • Different batch jobs share GPU resources
  • Improved throughput for data processing workflows
  • Cost optimization for periodic workloads

Conclusion

GPU slicing with Omnistrate provides a powerful way to maximize GPU utilization while minimizing costs. By leveraging NVIDIA's time-slicing and MIG technologies through simple configuration parameters, you can enable multiple workloads to efficiently share expensive GPU resources.

Omnistrate's built-in GPU slicing support makes it easy to implement either approach, allowing you to focus on your application logic while the platform handles the complex GPU resource management automatically.