
# GPU Coordination

Managing GPU resources across multiple services with GPUBoss and Redis.

## Overview

The @imajin platform runs multiple GPU-intensive services:
- **imajin-prompt**: LLM inference (DeepSeek R1 70B)
- **imajin-diffusion**: Diffusion model inference

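GPUBoss coordinates VRAM allocation across these services through leases tracked in Redis, so that concurrent model loads do not exceed available VRAM and trigger out-of-memory (OOM) errors.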

## Architecture

```mermaid
sequenceDiagram
    participant Service as Service
    participant Boss as GPUBoss
    participant Redis as Redis
    participant GPU as GPU VRAM

    Service->>Boss: Request VRAM lease (8GB)
    Boss->>Redis: Check available VRAM
    Redis-->>Boss: 16GB available on cuda:0
    Boss->>Redis: Register lease (8GB, cuda:0)
    Boss-->>Service: Lease granted (cuda:0)

    Service->>GPU: Load model
    Note over Service,GPU: Model inference

    Service->>Boss: Release lease
    Boss->>Redis: Clear lease
    Boss-->>Service: Lease released
```
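
The bookkeeping in the diagram can be modelled without Redis. The following is a minimal sketch, assuming a plain in-memory dictionary in place of the Redis state; `LeaseTable`, `free_gb`, `acquire`, and `release` are hypothetical names for illustration and are not the GPUBoss API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple
import uuid


@dataclass
class LeaseTable:
    """In-memory stand-in for the lease state GPUBoss keeps in Redis (assumption)."""

    capacity_gb: Dict[str, float]  # total VRAM per device, e.g. {"cuda:0": 16.0}
    leases: Dict[str, Tuple[str, float]] = field(default_factory=dict)  # id -> (device, GB)

    def free_gb(self, device: str) -> float:
        # Free VRAM = capacity minus the sum of all leases on that device.
        used = sum(gb for dev, gb in self.leases.values() if dev == device)
        return self.capacity_gb[device] - used

    def acquire(self, vram_gb: float) -> Optional[Tuple[str, str]]:
        # Grant a lease on the first device with enough room, else return None.
        for device in self.capacity_gb:
            if self.free_gb(device) >= vram_gb:
                lease_id = uuid.uuid4().hex
                self.leases[lease_id] = (device, vram_gb)
                return lease_id, device
        return None

    def release(self, lease_id: str) -> None:
        # Releasing an unknown lease is a no-op, so double release is harmless.
        self.leases.pop(lease_id, None)


table = LeaseTable(capacity_gb={"cuda:0": 16.0})
granted = table.acquire(8)  # mirrors the 8GB request in the diagram
```

After the 8GB lease is granted, `table.free_gb("cuda:0")` drops to 8, and a further 12GB request is refused until the first lease is released.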

## Configuration

### Redis Setup

```bash
# Docker
docker run -d -p 6379:6379 --name redis redis

# System service
sudo systemctl start redis
```

### Service Configuration

```yaml
# config.yaml
gpu:
  enabled: true
  redis_url: redis://localhost:6379
  priority: "normal"  # low, normal, high
```

### Priority Levels

| Priority | Use Case |
|----------|----------|
| `low` | Background tasks, batch processing |
| `normal` | Standard requests |
| `high` | User-facing, latency-sensitive |

When several services are waiting for VRAM, higher-priority services are granted leases first.
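
One way to picture this ordering is a priority queue of waiting requests. This is a sketch under assumptions: the numeric ranks, the first-come-first-served tie-break within a priority level, and the `enqueue` / `next_service` helpers are illustrative and not taken from GPUBoss.

```python
import heapq
import itertools

# Lower rank is served first (assumed mapping for the three documented levels).
PRIORITY_RANK = {"high": 0, "normal": 1, "low": 2}
_arrival = itertools.count()  # tie-breaker: earlier arrivals win within a level


def enqueue(queue: list, service: str, priority: str) -> None:
    heapq.heappush(queue, (PRIORITY_RANK[priority], next(_arrival), service))


def next_service(queue: list) -> str:
    # Pop the waiting request with the best (lowest) rank.
    return heapq.heappop(queue)[2]


waiting: list = []
enqueue(waiting, "batch-job", "low")
enqueue(waiting, "imajin-diffusion", "normal")
enqueue(waiting, "imajin-prompt", "high")
enqueue(waiting, "second-normal", "normal")

order = [next_service(waiting) for _ in range(4)]
```

Here `order` lists the high-priority request first, then the two normal requests in arrival order, then the low-priority batch job.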

## Device Assignment

### Multi-GPU Setup

Assign different models to different GPUs:

```bash
# imajin-diffusion
export IMAGE_GEN_PHOTOREALISTIC_DEVICE=cuda:0
export IMAGE_GEN_ANIME_DEVICE=cuda:1
```

This allows parallel generation with both models.

### Single-GPU Setup

All services share one GPU, coordinated by GPUBoss:

```bash
export IMAGE_GEN_PHOTOREALISTIC_DEVICE=cuda:0
export IMAGE_GEN_ANIME_DEVICE=cuda:0
```

GPUBoss ensures only one model is loaded at a time.
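
To see at a glance which models end up sharing a device, the two environment variables can be grouped by value. This is a small sketch; the `models_by_device` helper and the `cuda:0` fallback for an unset variable are assumptions, not documented behaviour.

```python
import os
from collections import defaultdict

# Environment variables documented above, keyed by a short model label.
DEVICE_VARS = {
    "photorealistic": "IMAGE_GEN_PHOTOREALISTIC_DEVICE",
    "anime": "IMAGE_GEN_ANIME_DEVICE",
}


def models_by_device(env=os.environ, default="cuda:0"):
    """Group models by the device they are assigned to."""
    grouped = defaultdict(list)
    for model, var in DEVICE_VARS.items():
        grouped[env.get(var, default)].append(model)  # fallback is an assumption
    return dict(grouped)


multi = models_by_device({
    "IMAGE_GEN_PHOTOREALISTIC_DEVICE": "cuda:0",
    "IMAGE_GEN_ANIME_DEVICE": "cuda:1",
})
single = models_by_device({
    "IMAGE_GEN_PHOTOREALISTIC_DEVICE": "cuda:0",
    "IMAGE_GEN_ANIME_DEVICE": "cuda:0",
})
```

Any device that maps to more than one model, as in `single`, is one where the models depend on lease coordination rather than on separate hardware.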

## VRAM Requirements

| Model | Approximate VRAM |
|-------|------------------|
| DeepSeek R1 70B (Q4) | 40GB |
| DeepSeek R1 70B (Q8) | 70GB |
| Diffusion (photorealistic) | 8GB |
| Diffusion (anime) | 8GB |
| Cultural classifier | 4GB |
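
These figures can be used for a quick capacity check before deciding which models share a GPU. The sketch below simply sums the table values; the model keys, the `fits` helper, and the 2GB headroom are assumptions for illustration, and real usage should be verified with `nvidia-smi`.

```python
# Approximate VRAM per model in GB, copied from the table above.
VRAM_GB = {
    "deepseek-r1-70b-q4": 40,
    "deepseek-r1-70b-q8": 70,
    "diffusion-photorealistic": 8,
    "diffusion-anime": 8,
    "cultural-classifier": 4,
}


def fits(models, gpu_total_gb, headroom_gb=2):
    """Return True if the models plus a safety margin fit on one GPU."""
    needed = sum(VRAM_GB[m] for m in models) + headroom_gb
    return needed <= gpu_total_gb


small_models = ["diffusion-photorealistic", "diffusion-anime", "cultural-classifier"]
fits_24 = fits(small_models, gpu_total_gb=24)              # 20 + 2 <= 24
llm_on_24 = fits(["deepseek-r1-70b-q4"], gpu_total_gb=24)  # 40 + 2 > 24
```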

## Lease Lifecycle

### 1. Request Lease

```python
async with gpu_boss.lease(vram_gb=8, priority="normal") as device:
    # device = "cuda:0"
    model = load_model(device)
    result = model.generate(...)
```

### 2. Automatic Release

Leases are automatically released when:
- Context manager exits
- Service shuts down
- Timeout expires (configurable)
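
Timeout-based release can be pictured as a periodic sweep over lease deadlines. This is a sketch only: it assumes each lease is stored with an expiry timestamp, and `expire_leases` is a hypothetical helper; how GPUBoss actually implements expiry is not described here.

```python
import time


def expire_leases(leases: dict, now: float = None) -> list:
    """Remove leases whose deadline has passed and return their ids.

    `leases` maps lease_id -> deadline (seconds on a monotonic clock).
    """
    now = time.monotonic() if now is None else now
    expired = [lease_id for lease_id, deadline in leases.items() if deadline <= now]
    for lease_id in expired:
        del leases[lease_id]
    return expired


active = {"lease-a": 10.0, "lease-b": 20.0}
dropped = expire_leases(active, now=15.0)  # lease-a is past its deadline
```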

### 3. Manual Release

```python
lease_id = await gpu_boss.acquire(vram_gb=8)
try:
    ...  # use the GPU while the lease is held
finally:
    # Always release, even if the GPU work raises
    await gpu_boss.release(lease_id)
```

## Monitoring

### Check GPU Status

```bash
nvidia-smi
```

### Check Redis Leases

```bash
# --scan iterates incrementally; KEYS blocks Redis while it walks the keyspace
redis-cli --scan --pattern "gpuboss:*"
redis-cli hgetall "gpuboss:leases"
```

### Service Health

```bash
curl http://localhost:8003/health
# { "gpu_available": true, "vram_total": 24576, "vram_free": 16384 }
```
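
The health payload can be turned into a usage figure for dashboards or alerts. The sketch below parses the sample response shown above; `vram_summary` is a hypothetical helper, and the units of `vram_total` and `vram_free` are not stated in this document, so the code only works with their ratio and difference.

```python
import json


def vram_summary(health_body: str) -> dict:
    """Derive used VRAM and the used fraction from a /health response body."""
    data = json.loads(health_body)
    total, free = data["vram_total"], data["vram_free"]
    used = total - free
    return {
        "available": data["gpu_available"],
        "used": used,
        "used_fraction": used / total,
    }


sample = '{ "gpu_available": true, "vram_total": 24576, "vram_free": 16384 }'
summary = vram_summary(sample)
```

For the sample response, one third of the reported VRAM is in use.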

## Troubleshooting

### OOM Despite Coordination

1. Check for leaked leases: `redis-cli --scan --pattern "gpuboss:*"`
2. Verify that the requested VRAM estimates match actual usage in `nvidia-smi`
3. Use a lower-precision quantization (for example Q4 instead of Q8) or reduce the batch size

### Slow Lease Acquisition

1. Check Redis latency: `redis-cli --latency`
2. Verify priority settings
3. Check for long-running leases blocking the queue

### Service Can't Get GPU

```bash
# Check what's holding leases
redis-cli hgetall "gpuboss:leases"

# Force-release ALL leases, including active ones (use with caution:
# services that still hold a lease will no longer be tracked)
redis-cli del "gpuboss:leases"
```

## Best Practices

1. **Request minimum needed VRAM** - Don't over-request
2. **Use appropriate priority** - Reserve "high" for user-facing requests
3. **Handle lease failures gracefully** - Return 503 if GPU unavailable
4. **Set reasonable timeouts** - Prevent indefinite waits
5. **Monitor VRAM usage** - Track actual vs. requested
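
Practices 3 and 4 can be combined by bounding the wait for a lease and mapping a timeout to a 503 response. This is a minimal sketch using `asyncio.wait_for`; `generate_or_503` and the two stub coroutines are hypothetical, and `acquire` stands in for whatever call the service uses to obtain a lease.

```python
import asyncio


async def generate_or_503(acquire, timeout_s: float):
    """Wait at most `timeout_s` for a GPU lease; report 503 if none arrives."""
    try:
        device = await asyncio.wait_for(acquire(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return 503, {"error": "GPU unavailable, retry later"}
    return 200, {"device": device}


async def _never_granted():
    # Stub: simulates a GPU that stays busy for longer than the timeout.
    await asyncio.sleep(3600)


async def _granted():
    # Stub: simulates an immediately available GPU.
    return "cuda:0"


busy = asyncio.run(generate_or_503(_never_granted, timeout_s=0.05))
ok = asyncio.run(generate_or_503(_granted, timeout_s=0.05))
```

Returning a status tuple keeps the sketch independent of any web framework; a real handler would translate it into the framework's own response type.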

## Related

- [Configuration](./configuration.md) - Redis URL configuration
- [Service Topology](../architecture/service-topology.md) - Service dependencies