GPU Slicing and Multi-Tenant GPU Configuration

Introduction

Graphics Processing Units (GPUs) have evolved far beyond their original purpose of rendering graphics. Today, they are essential for artificial intelligence (AI), machine learning (ML), and high-performance computing workloads. NVIDIA GPUs, in particular, dominate the ML landscape due to their specialized architecture designed for parallel processing, making them indispensable for matrix operations and mathematical computations that power AI-driven insights.

However, GPUs are expensive resources, and maximizing their utilization is crucial for cost-effective operations. Traditional GPU allocation methods often lead to underutilization, where a single workload consumes an entire GPU but only uses a fraction of its computational capacity. This is where GPU sharing strategies become essential.

NVIDIA GPU Sharing Strategies

NVIDIA provides several approaches to enable multiple workloads to share GPU resources efficiently, including:

1. Time-Slicing

Time-slicing divides GPU access into small time intervals, allowing different tasks to use the GPU in predefined time slices. This approach is similar to how CPUs time-slice between different processes.

How it works:

  • Multiple processes share a single GPU by taking turns accessing it
  • The GPU scheduler allocates time slots to different workloads
  • No memory isolation between processes
  • Suitable for workloads with intermittent GPU usage patterns
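In Kubernetes, time-slicing is typically enabled through the NVIDIA device plugin's sharing configuration. A minimal sketch (illustrative values; see the device plugin documentation for the full schema):

```yaml
# NVIDIA Kubernetes device plugin config (illustrative).
# With replicas: 4, each physical GPU is advertised as four
# nvidia.com/gpu resources that workloads take turns using.
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4
```

Note that the replicas only multiply the advertised resource count; they do not partition memory or enforce fair scheduling between the sharing processes.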

Ideal use cases:

  • Development and testing environments
  • Multiple small-scale workloads
  • Batch processing tasks
  • Real-time analytics with streaming data
  • Cost-sensitive environments with budget constraints

2. Multi-Instance GPU (MIG)

MIG is available on NVIDIA A100, A30, and H100 GPUs, allowing a single physical GPU to be partitioned into multiple isolated instances. Each instance has its own dedicated memory, cache, and compute cores.

How it works:

  • Physical GPU is partitioned into up to 7 separate instances
  • Each instance provides guaranteed performance and memory isolation
  • Complete fault isolation between instances
  • Hardware-level partitioning ensures predictable performance
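Once a GPU is partitioned, each MIG instance can be consumed like an ordinary Kubernetes resource. A sketch of a pod requesting one slice, assuming the NVIDIA device plugin's "mixed" MIG strategy (pod name and image are illustrative):

```yaml
# Illustrative pod spec: with the "mixed" MIG strategy, each MIG
# profile is advertised as its own resource type on the node.
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference
spec:
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1  # one 1g.5gb MIG instance
```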

Ideal use cases:

  • Multi-tenant environments requiring strict isolation
  • Cloud service providers offering GPU-as-a-Service
  • Workloads requiring guaranteed performance levels
  • Environments with strict SLA requirements

Time-Slicing vs MIG: Detailed Comparison

| Aspect | Time-Slicing | Multi-Instance GPU (MIG) |
| --- | --- | --- |
| Memory Isolation | No isolation - shared memory space | Complete memory isolation per instance |
| Fault Isolation | No isolation - one crash affects all | Complete fault isolation |
| Performance Guarantees | No guarantees - best-effort sharing | Guaranteed performance per instance |
| GPU Support | All NVIDIA GPUs | Limited to A100, A30, H100 |
| Resource Overhead | Minimal overhead | Some overhead due to partitioning |
| Use Case | Resource optimization | Strong isolation and SLA requirements |

Configuring GPU Slicing DIY

Enabling and managing GPU slicing infrastructure is a complex undertaking that involves multiple layers of technology, careful resource planning, and ongoing operational overhead. While platforms like Omnistrate abstract much of this complexity, understanding the underlying challenges helps appreciate the engineering effort required to make GPU sharing work effectively.

Infrastructure Complexity Overview

GPU slicing requires orchestrating several complex systems that must work together seamlessly:

1. Hardware-Level Considerations

  • GPU Architecture Compatibility: Not all GPUs support the same sharing mechanisms. NVIDIA's MIG is only available on A100, A30, and H100 GPUs, while time-slicing works across different GPU generations but with varying performance characteristics

2. Kernel and Driver Stack Complexity

  • NVIDIA Driver Management: Requires specific driver versions that support sharing features, with complex upgrade paths that can break existing workloads
  • CUDA Runtime Coordination: Managing CUDA contexts across multiple processes requires sophisticated scheduling and memory management

3. Container Orchestration Challenges

  • Device Plugin Architecture: Implementing and maintaining custom Kubernetes device plugins that can advertise virtual GPU resources accurately
  • Resource Scheduling Complexity: The Kubernetes scheduler must understand GPU topology, memory constraints, and performance characteristics to make optimal placement decisions
  • Namespace Isolation: Ensuring proper isolation between different tenants while maintaining GPU access

Operational Management Complexity

Capacity Planning and Resource Allocation

  • Workload Characterization: Understanding the GPU usage patterns of different workloads to optimize sharing ratios
  • Performance Modeling: Predicting how different combinations of workloads will perform when sharing GPU resources
  • Cost Optimization: Balancing the cost of GPU instances against the performance impact of sharing
  • Scaling Strategies: Determining when to scale horizontally (more GPU instances) vs. vertically (more sharing on existing GPUs)

Lifecycle Management

  • Rolling Updates: Updating GPU drivers, CUDA versions, or container runtimes without disrupting running workloads
  • Workload Migration: Moving workloads between different GPU instances during maintenance or optimization
  • Disaster Recovery: Implementing backup and recovery strategies for stateful GPU workloads
  • Version Compatibility: Managing compatibility matrices between CUDA versions, driver versions, and application requirements

Why Managed Solutions Matter

The complexity outlined above explains why managed platforms like Omnistrate provide significant value:

  • Abstraction of Complexity: Hiding the intricate details of GPU driver management, device plugin configuration, and monitoring setup
  • Tested Configurations: Providing pre-validated combinations of hardware, software, and configuration that work reliably together
  • Automated Operations: Handling routine maintenance, updates, and optimization tasks automatically
  • Expert Support: Access to specialists who understand the nuances of GPU sharing infrastructure

Configuring GPU Slicing in Omnistrate

Omnistrate provides built-in support for GPU slicing through the multiTenantGpu feature, making it easy to deploy services that efficiently share GPU resources across multiple tenants.

Configuration Overview

To enable GPU slicing in your Omnistrate service, you need to add the x-internal-integrations section to your compose.yaml file:

x-internal-integrations: 
  multiTenantGpu: 
    instanceType: g4dn.xlarge # instance type to be used for GPU slicing
    timeSlicingReplicas: 2 # number of replicas to be used for time slicing
    migProfile: 1g.5gb # optional: MIG profile for A100/H100 GPUs

Configuration Parameters

instanceType

Specifies the EC2 instance type that will host the GPU slicing functionality. Common GPU-enabled instance types include:

  • g4dn.xlarge: 1 NVIDIA T4 GPU, 4 vCPUs, 16 GB RAM - Cost-effective for inference workloads
  • g4dn.2xlarge: 1 NVIDIA T4 GPU, 8 vCPUs, 32 GB RAM - Balanced compute and memory
  • p3.2xlarge: 1 NVIDIA V100 GPU, 8 vCPUs, 61 GB RAM - High-performance training
  • p4d.24xlarge: 8 NVIDIA A100 GPUs, 96 vCPUs, 1152 GB RAM - Multi-GPU training

timeSlicingReplicas

Defines how many virtual GPU replicas are advertised for each physical GPU, which determines how many concurrent workloads can share a single GPU. Keep in mind that time-slicing does not enforce these shares; the percentages below are best-effort averages, not guarantees.

  • Value of 2: Each physical GPU appears as 2 virtual GPUs (~50% of GPU time per workload)
  • Value of 4: Each physical GPU appears as 4 virtual GPUs (~25% of GPU time per workload)
  • Value of 8: Each physical GPU appears as 8 virtual GPUs (~12.5% of GPU time per workload)

migProfile (Optional)

Specifies the Multi-Instance GPU (MIG) profile to use when the instance type supports MIG (A100, A30, H100 GPUs). This parameter enables hardware-level GPU partitioning with guaranteed isolation and performance.
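For reference, the sketch below annotates the standard MIG profiles on a 40 GB A100; exact availability depends on the GPU model and memory size, so check NVIDIA's MIG documentation for your hardware:

```yaml
x-internal-integrations:
  multiTenantGpu:
    instanceType: p4d.24xlarge
    # Common MIG profiles on a 40 GB A100 (compute slices . memory):
    #   1g.5gb  - up to 7 instances per GPU
    #   2g.10gb - up to 3 instances per GPU
    #   3g.20gb - up to 2 instances per GPU
    #   4g.20gb - 1 instance per GPU
    #   7g.40gb - 1 instance (the full GPU)
    migProfile: 1g.5gb
```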

Note

MIG and time-slicing can be combined to create a multi-layered GPU sharing strategy. You can use both migProfile and timeSlicingReplicas together to further subdivide MIG instances with time-slicing.
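A combined configuration might look like the following sketch, where each hardware-isolated MIG instance is further time-sliced between two workloads (values are illustrative):

```yaml
x-internal-integrations:
  multiTenantGpu:
    instanceType: p4d.24xlarge  # A100 GPUs with MIG support
    migProfile: 1g.5gb          # hardware partition: up to 7 per GPU
    timeSlicingReplicas: 2      # each MIG instance shared by 2 workloads
```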

Complete Example Configuration

Here are complete examples of GPU-sliced service configurations:

Time-Slicing Configuration Example

version: '3.9'

x-omnistrate-service-plan:
  name: 'gpu-slicing-example hosted tier' 
  tenancyType: 'OMNISTRATE_MULTI_TENANCY' 

x-internal-integrations: 
  multiTenantGpu: 
    instanceType: g4dn.xlarge # instance type to be used for GPU slicing
    timeSlicingReplicas: 2 # number of replicas to be used for time slicing

services:
  gpuinfo:
    image: ghcr.io/omnistrate-community/gpu-slicing-example:0.0.3
    ports:
      - 5000:5000
    platform: linux/amd64
    deploy:
      resources:
        limits:
          cpus: '1'
          memory: 100M
        reservations:
          cpus: '100m'
          memory: 50M
    x-omnistrate-capabilities:
      autoscaling: 
        maxReplicas: 3
        minReplicas: 1
        idleMinutesBeforeScalingDown: 2
        idleThreshold: 20
        overUtilizedMinutesBeforeScalingUp: 3
        overUtilizedThreshold: 80
      serverlessConfiguration:
        targetPort: 5000
        enableAutoStop: true
        minimumNodesInPool: 1

MIG Configuration Example

version: '3.9'

x-omnistrate-service-plan:
  name: 'gpu-mig-example hosted tier' 
  tenancyType: 'OMNISTRATE_MULTI_TENANCY' 

x-internal-integrations: 
  multiTenantGpu: 
    instanceType: p4d.24xlarge # A100 GPU instance supporting MIG
    migProfile: 1g.5gb # MIG profile for hardware-level isolation

services:
  gpuinfo:
    image: ghcr.io/omnistrate-community/gpu-slicing-example:0.0.3
    ports:
      - 5000:5000
    platform: linux/amd64
    deploy:
      resources:
        limits:
          cpus: '2'
          memory: 500M
        reservations:
          cpus: '500m'
          memory: 250M
    x-omnistrate-capabilities:
      autoscaling: 
        maxReplicas: 5
        minReplicas: 1
        idleMinutesBeforeScalingDown: 2
        idleThreshold: 20
        overUtilizedMinutesBeforeScalingUp: 3
        overUtilizedThreshold: 80
      serverlessConfiguration:
        targetPort: 5000
        enableAutoStop: true
        minimumNodesInPool: 1

How GPU Slicing Works in Omnistrate

  1. Infrastructure Provisioning: Omnistrate provisions the specified GPU-enabled EC2 instance
  2. NVIDIA Device Plugin: Automatically installs and configures the NVIDIA Kubernetes device plugin
  3. Time-Slicing Configuration: Configures the device plugin with the specified number of replicas
  4. Resource Advertisement: The Kubernetes scheduler sees multiple virtual GPUs instead of one physical GPU
  5. Workload Scheduling: Multiple pods can be scheduled to share the same physical GPU
  6. Automatic Scaling: Omnistrate can automatically scale the number of replicas based on demand
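From a workload's point of view, each virtual GPU is requested like an ordinary GPU resource. A sketch of the container resource section a tenant pod might carry under timeSlicingReplicas: 2 (assumed standard NVIDIA device plugin resource naming):

```yaml
# Illustrative: with timeSlicingReplicas: 2, a node with one physical
# GPU advertises nvidia.com/gpu: 2, so two pods like this can share it.
resources:
  limits:
    nvidia.com/gpu: 1  # one virtual (time-sliced) GPU
```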

Best Practices

Choosing the Right Instance Type

  • For inference workloads: Use g4dn instances with T4 GPUs for cost-effectiveness
  • For training workloads: Use p3 instances with V100 GPUs for better performance
  • For large-scale training: Consider p4d instances with A100 GPUs and MIG support

Setting Time-Slicing Replicas

  • Start conservative: Begin with 2-4 replicas and monitor performance
  • Monitor GPU utilization: Use tools like nvidia-smi to track actual GPU usage
  • Consider workload characteristics: CPU-bound tasks can share more aggressively than GPU-intensive ones
  • Account for memory usage: Ensure total GPU memory usage doesn't exceed physical limits

Resource Management

  • Set appropriate CPU and memory limits for your containers
  • Use Omnistrate's autoscaling capabilities to handle varying demand
  • Monitor performance metrics to optimize replica counts
  • Consider using different configurations for development vs production environments

Example Use Cases

AI/ML Model Serving

Deploy multiple model inference endpoints that share GPU resources efficiently:

  • Each model gets dedicated time slices for inference
  • Cost-effective serving of multiple models
  • Automatic scaling based on request volume

Development and Testing

Enable multiple developers to share GPU resources:

  • Each developer gets access to GPU acceleration
  • Reduced infrastructure costs for development teams
  • Isolated development environments

Batch Processing

Process multiple data pipelines concurrently:

  • Different batch jobs share GPU resources
  • Improved throughput for data processing workflows
  • Cost optimization for periodic workloads

Conclusion

GPU slicing with Omnistrate provides a powerful way to maximize GPU utilization while minimizing costs. By leveraging NVIDIA's time-slicing and MIG technologies through simple configuration parameters, you can enable multiple workloads to efficiently share expensive GPU resources.

Omnistrate's built-in GPU slicing support makes it easy to implement either approach, allowing you to focus on your application logic while the platform handles the complex GPU resource management automatically.