# GPU Coordination

Managing GPU resources across multiple services with GPUBoss and Redis.

## Overview

The @imajin platform runs multiple GPU-intensive services:

- **imajin-prompt**: LLM inference (DeepSeek R1 70B)
- **imajin-diffusion**: Diffusion model inference

GPUBoss coordinates VRAM allocation to prevent OOM errors.

## Architecture

```mermaid
sequenceDiagram
    participant Service as Service
    participant Boss as GPUBoss
    participant Redis as Redis
    participant GPU as GPU VRAM

    Service->>Boss: Request VRAM lease (8GB)
    Boss->>Redis: Check available VRAM
    Redis-->>Boss: 16GB available on cuda:0
    Boss->>Redis: Register lease (8GB, cuda:0)
    Boss-->>Service: Lease granted (cuda:0)
    Service->>GPU: Load model
    Note over Service,GPU: Model inference
    Service->>Boss: Release lease
    Boss->>Redis: Clear lease
    Boss-->>Service: Lease released
```

## Configuration

### Redis Setup

```bash
# Docker
docker run -d -p 6379:6379 --name redis redis

# System service
sudo systemctl start redis
```

### Service Configuration

```yaml
# config.yaml
gpu:
  enabled: true
  redis_url: redis://localhost:6379
  priority: "normal"  # low, normal, high
```

### Priority Levels

| Priority | Use Case |
|----------|----------|
| `low` | Background tasks, batch processing |
| `normal` | Standard requests |
| `high` | User-facing, latency-sensitive |

Higher priority services get VRAM leases first when contention exists.

## Device Assignment

### Multi-GPU Setup

Assign different models to different GPUs:

```bash
# imajin-diffusion
export IMAGE_GEN_PHOTOREALISTIC_DEVICE=cuda:0
export IMAGE_GEN_ANIME_DEVICE=cuda:1
```

This allows parallel generation with both models.

### Single-GPU Setup

All services share one GPU, coordinated by GPUBoss:

```bash
export IMAGE_GEN_PHOTOREALISTIC_DEVICE=cuda:0
export IMAGE_GEN_ANIME_DEVICE=cuda:0
```

GPUBoss ensures only one model is loaded at a time.
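As a rough illustration of the two setups above, the following sketch reads the two device variables and reports which devices, if any, are assigned to more than one model. The helper name `shared_devices` and the `cuda:0` fallback for unset variables are assumptions made for this example; they are not part of imajin-diffusion or GPUBoss.

```python
import os
from collections import defaultdict

# Variable names are taken from the examples above.
DEVICE_VARS = {
    "photorealistic": "IMAGE_GEN_PHOTOREALISTIC_DEVICE",
    "anime": "IMAGE_GEN_ANIME_DEVICE",
}

# Assumption for this sketch: an unset variable falls back to cuda:0.
DEFAULT_DEVICE = "cuda:0"


def shared_devices(env=None):
    """Return {device: [model, ...]} for devices assigned to more than one model."""
    env = os.environ if env is None else env
    by_device = defaultdict(list)
    for model, var in DEVICE_VARS.items():
        by_device[env.get(var, DEFAULT_DEVICE)].append(model)
    return {dev: models for dev, models in by_device.items() if len(models) > 1}


if __name__ == "__main__":
    contended = shared_devices()
    if contended:
        for dev, models in contended.items():
            print(f"{dev} is shared by: {', '.join(models)} (single-GPU setup)")
    else:
        print("each model has its own device (multi-GPU setup)")
```

A check like this could run at service startup to log which setup is in effect, which helps when diagnosing unexpected lease contention.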

## VRAM Requirements

| Model | Approximate VRAM |
|-------|------------------|
| DeepSeek R1 70B (Q4) | 40GB |
| DeepSeek R1 70B (Q8) | 70GB |
| Diffusion (photorealistic) | 8GB |
| Diffusion (anime) | 8GB |
| Cultural classifier | 4GB |

## Lease Lifecycle

### 1. Request Lease

```python
async with gpu_boss.lease(vram_gb=8, priority="normal") as device:
    # device = "cuda:0"
    model = load_model(device)
    result = model.generate(...)
```

### 2. Automatic Release

Leases are automatically released when:

- Context manager exits
- Service shuts down
- Timeout expires (configurable)

### 3. Manual Release

```python
lease_id = await gpu_boss.acquire(vram_gb=8)
try:
    ...  # use the GPU here
finally:
    # Always release, even if the GPU work raised an exception
    await gpu_boss.release(lease_id)
```

## Monitoring

### Check GPU Status

```bash
nvidia-smi
```

### Check Redis Leases

```bash
redis-cli keys "gpuboss:*"
redis-cli hgetall "gpuboss:leases"
```

### Service Health

```bash
curl http://localhost:8003/health
# { "gpu_available": true, "vram_total": 24576, "vram_free": 16384 }
```

## Troubleshooting

### OOM Despite Coordination

1. Check for leaked leases: `redis-cli keys "gpuboss:*"`
2. Verify VRAM estimates match actual usage
3. Use a lower-precision quantization (e.g. Q4 instead of Q8) or reduce batch size

### Slow Lease Acquisition

1. Check Redis latency: `redis-cli --latency`
2. Verify priority settings
3. Check for long-running leases blocking the queue

### Service Can't Get GPU

```bash
# Check what's holding leases
redis-cli hgetall "gpuboss:leases"

# Force release stale leases (use with caution: this deletes the whole
# hash, so any active leases are cleared too)
redis-cli del "gpuboss:leases"
```

## Best Practices

1. **Request minimum needed VRAM** - Don't over-request
2. **Use appropriate priority** - Reserve "high" for user-facing requests
3. **Handle lease failures gracefully** - Return 503 if GPU unavailable
4. **Set reasonable timeouts** - Prevent indefinite waits
5. **Monitor VRAM usage** - Track actual vs. requested

## Related

- [Configuration](./configuration.md) - Redis URL configuration
- [Service Topology](../architecture/service-topology.md) - Service dependencies