imajin/docs/operations/gpu-coordination.md
Lilith a5f99bb3d7 chore(imajin): clean up legacy structure and completion markers
- Remove old imajin/ directory (migrated to services/ + orchestrators/)
- Delete completion markers (DONE.md, INTEGRATION-COMPLETE.md, TESTING.md)
- Remove standalone test generation scripts
- Update docs to reflect current architecture
- Add multi-base-strategy.md documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-16 17:01:10 -08:00


# GPU Coordination
Managing GPU resources across multiple services with GPUBoss and Redis.
## Overview
The @imajin platform runs multiple GPU-intensive services:
- **imajin-prompt**: LLM inference (DeepSeek R1 70B)
- **imajin-diffusion**: Diffusion model inference

GPUBoss coordinates VRAM allocation across these services so that concurrent model loads do not cause out-of-memory (OOM) errors.
## Architecture
```mermaid
sequenceDiagram
participant Service as Service
participant Boss as GPUBoss
participant Redis as Redis
participant GPU as GPU VRAM
Service->>Boss: Request VRAM lease (8GB)
Boss->>Redis: Check available VRAM
Redis-->>Boss: 16GB available on cuda:0
Boss->>Redis: Register lease (8GB, cuda:0)
Boss-->>Service: Lease granted (cuda:0)
Service->>GPU: Load model
Note over Service,GPU: Model inference
Service->>Boss: Release lease
Boss->>Redis: Clear lease
Boss-->>Service: Lease released
```
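
The request, check, register, and release steps in the diagram can be sketched with an in-memory table. This is a minimal illustration only: `LeaseTable` and its methods are hypothetical names, and a plain dictionary stands in for the state that GPUBoss keeps in Redis.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple
import itertools


@dataclass
class LeaseTable:
    """Hypothetical in-memory stand-in for the lease state kept in Redis."""
    total_gb: Dict[str, float]  # device -> total VRAM in GB
    leases: Dict[int, Tuple[str, float]] = field(default_factory=dict)
    _ids: itertools.count = field(default_factory=lambda: itertools.count(1))

    def available(self, device: str) -> float:
        # Free VRAM = total minus the sum of active leases on that device.
        used = sum(gb for dev, gb in self.leases.values() if dev == device)
        return self.total_gb[device] - used

    def acquire(self, vram_gb: float) -> Optional[Tuple[int, str]]:
        # Grant on the first device with enough room, otherwise refuse.
        for device in self.total_gb:
            if self.available(device) >= vram_gb:
                lease_id = next(self._ids)
                self.leases[lease_id] = (device, vram_gb)
                return lease_id, device
        return None

    def release(self, lease_id: int) -> None:
        self.leases.pop(lease_id, None)


table = LeaseTable(total_gb={"cuda:0": 16})
first = table.acquire(8)    # granted: (1, "cuda:0")
second = table.acquire(8)   # granted: uses the remaining 8 GB
third = table.acquire(8)    # None: no VRAM left until a lease is released
table.release(first[0])
```

A real coordinator would need the check and the register step to be atomic across processes, which this single-process sketch does not address.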
## Configuration
### Redis Setup
```bash
# Docker
docker run -d -p 6379:6379 --name redis redis
# System service
sudo systemctl start redis
```
### Service Configuration
```yaml
# config.yaml
gpu:
  enabled: true
  redis_url: redis://localhost:6379
  priority: "normal"  # low, normal, high
```
### Priority Levels
| Priority | Use Case |
|----------|----------|
| `low` | Background tasks, batch processing |
| `normal` | Standard requests |
| `high` | User-facing, latency-sensitive |

When services contend for VRAM, higher-priority requests are granted leases first.
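
One way to picture this ordering is a priority queue that breaks ties by arrival order. The sketch below is an assumption about how such a queue could behave, not a description of GPUBoss internals; `PRIORITY_RANK`, `enqueue`, and `next_service` are hypothetical names.

```python
import heapq
import itertools

# Lower rank is served first (assumed mapping of the three documented levels).
PRIORITY_RANK = {"high": 0, "normal": 1, "low": 2}
_arrival = itertools.count()  # tie-breaker: earlier requests go first


def enqueue(queue, service, priority):
    heapq.heappush(queue, (PRIORITY_RANK[priority], next(_arrival), service))


def next_service(queue):
    return heapq.heappop(queue)[2]


waiting = []
enqueue(waiting, "batch-job", "low")
enqueue(waiting, "api-request", "high")
enqueue(waiting, "standard-a", "normal")
enqueue(waiting, "standard-b", "normal")

order = [next_service(waiting) for _ in range(4)]
# order == ["api-request", "standard-a", "standard-b", "batch-job"]
```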
## Device Assignment
### Multi-GPU Setup
Assign different models to different GPUs:
```bash
# imajin-diffusion
export IMAGE_GEN_PHOTOREALISTIC_DEVICE=cuda:0
export IMAGE_GEN_ANIME_DEVICE=cuda:1
```
This allows parallel generation with both models.
### Single-GPU Setup
All services share one GPU, coordinated by GPUBoss:
```bash
export IMAGE_GEN_PHOTOREALISTIC_DEVICE=cuda:0
export IMAGE_GEN_ANIME_DEVICE=cuda:0
```
GPUBoss ensures only one model is loaded at a time.
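
A service can tell which of the two setups it is running in by comparing the device variables. The helper below is a hypothetical sketch; the `cuda:0` fallback for an unset variable is an assumption, not documented behaviour.

```python
import os


def device_plan(env=os.environ):
    """Read the two device variables and report whether they share a GPU."""
    photo = env.get("IMAGE_GEN_PHOTOREALISTIC_DEVICE", "cuda:0")  # assumed default
    anime = env.get("IMAGE_GEN_ANIME_DEVICE", "cuda:0")           # assumed default
    return {"photorealistic": photo, "anime": anime, "shared": photo == anime}


multi = device_plan({
    "IMAGE_GEN_PHOTOREALISTIC_DEVICE": "cuda:0",
    "IMAGE_GEN_ANIME_DEVICE": "cuda:1",
})
single = device_plan({
    "IMAGE_GEN_PHOTOREALISTIC_DEVICE": "cuda:0",
    "IMAGE_GEN_ANIME_DEVICE": "cuda:0",
})
```

When `shared` is true, the models depend on lease coordination instead of running in parallel.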
## VRAM Requirements
| Model | Approximate VRAM |
|-------|------------------|
| DeepSeek R1 70B (Q4) | 40GB |
| DeepSeek R1 70B (Q8) | 70GB |
| Diffusion (photorealistic) | 8GB |
| Diffusion (anime) | 8GB |
| Cultural classifier | 4GB |
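
The figures above support a quick capacity check before choosing a device layout. This sketch sums the approximate requirements for a set of models; the dictionary keys and the `fits` helper are hypothetical, and the 24 GB card is only an example size.

```python
# Approximate VRAM in GB, copied from the table above (keys are illustrative).
VRAM_GB = {
    "deepseek-r1-70b-q4": 40,
    "deepseek-r1-70b-q8": 70,
    "diffusion-photorealistic": 8,
    "diffusion-anime": 8,
    "cultural-classifier": 4,
}


def fits(models, gpu_gb):
    """Return (fits, total_needed_gb) for loading all models at once."""
    needed = sum(VRAM_GB[m] for m in models)
    return needed <= gpu_gb, needed


# Both diffusion models plus the classifier on a 24 GB card: 20 GB needed.
small = fits(["diffusion-photorealistic", "diffusion-anime", "cultural-classifier"], 24)
# The Q4 LLM plus one diffusion model on a 40 GB card: 48 GB needed.
large = fits(["deepseek-r1-70b-q4", "diffusion-photorealistic"], 40)
```

Because the table values are approximate, leaving headroom for activations and framework overhead is prudent.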
## Lease Lifecycle
### 1. Request Lease
```python
async with gpu_boss.lease(vram_gb=8, priority="normal") as device:
    # device = "cuda:0"
    model = load_model(device)
    result = model.generate(...)
```
### 2. Automatic Release
Leases are automatically released when:
- Context manager exits
- Service shuts down
- Timeout expires (configurable)
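
The timeout case can be illustrated with a small expiry check. This is a sketch under stated assumptions: the `Lease` fields and the `expired` helper are hypothetical, and how GPUBoss actually stores or enforces timeouts is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    lease_id: str
    device: str
    vram_gb: float
    granted_at: float  # seconds on some monotonic clock
    timeout_s: float   # lease lifetime before it is reclaimed


def expired(leases, now):
    """Return the IDs of leases whose timeout has elapsed at time `now`."""
    return [l.lease_id for l in leases if now - l.granted_at >= l.timeout_s]


leases = [
    Lease("a", "cuda:0", 8, granted_at=0.0, timeout_s=60.0),
    Lease("b", "cuda:0", 8, granted_at=50.0, timeout_s=60.0),
]
stale = expired(leases, now=70.0)  # only "a" has been held for 60 s or more
```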
### 3. Manual Release
```python
lease_id = await gpu_boss.acquire(vram_gb=8)
try:
    ...  # use the GPU
finally:
    # Always release, even if the GPU work raises.
    await gpu_boss.release(lease_id)
```
## Monitoring
### Check GPU Status
```bash
nvidia-smi
```
### Check Redis Leases
```bash
redis-cli keys "gpuboss:*"
redis-cli hgetall "gpuboss:leases"
```
### Service Health
```bash
curl http://localhost:8003/health
# { "gpu_available": true, "vram_total": 24576, "vram_free": 16384 }
```
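
A monitoring script could turn the health response into a usage summary. The sketch below parses the sample body shown above; `vram_summary` is a hypothetical helper, and because the response does not state its units, the code only computes differences and ratios.

```python
import json


def vram_summary(body: str) -> dict:
    """Summarise a /health response body like the sample above."""
    health = json.loads(body)
    total, free = health["vram_total"], health["vram_free"]
    return {
        "available": health["gpu_available"],
        "used": total - free,           # same unit as the response fields
        "free_fraction": free / total,  # unit-independent
    }


sample = '{ "gpu_available": true, "vram_total": 24576, "vram_free": 16384 }'
summary = vram_summary(sample)
# summary["used"] == 8192
```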
## Troubleshooting
### OOM Despite Coordination
1. Check for leaked leases: `redis-cli keys "gpuboss:*"`
2. Verify VRAM estimates match actual usage
3. Use a lower-precision quantization (for example Q4 instead of Q8) or reduce the batch size
### Slow Lease Acquisition
1. Check Redis latency: `redis-cli --latency`
2. Verify priority settings
3. Check for long-running leases blocking the queue
### Service Can't Get GPU
```bash
# Check what's holding leases
redis-cli hgetall "gpuboss:leases"
# Force-clear ALL leases, including active ones (use with caution)
redis-cli del "gpuboss:leases"
```
## Best Practices
1. **Request minimum needed VRAM** - Don't over-request
2. **Use appropriate priority** - Reserve "high" for user-facing requests
3. **Handle lease failures gracefully** - Return 503 if GPU unavailable
4. **Set reasonable timeouts** - Prevent indefinite waits
5. **Monitor VRAM usage** - Track actual vs. requested
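
Practices 3 and 4 can be combined by bounding the wait for a lease and mapping a timeout to a 503 response. This is a minimal sketch: `acquire_or_503` is a hypothetical helper, and the `acquire` argument stands in for whatever coroutine obtains a lease in a given service.

```python
import asyncio


async def acquire_or_503(acquire, timeout_s):
    """Wait at most timeout_s for a lease; return (status, payload)."""
    try:
        device = await asyncio.wait_for(acquire(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return 503, {"error": "GPU unavailable, retry later"}
    return 200, {"device": device}


async def fast_acquire():
    return "cuda:0"


async def stuck_acquire():
    await asyncio.sleep(10)  # simulates a queue that does not drain in time
    return "cuda:0"


ok = asyncio.run(acquire_or_503(fast_acquire, timeout_s=0.1))
busy = asyncio.run(acquire_or_503(stuck_acquire, timeout_s=0.05))
# ok == (200, {"device": "cuda:0"}); busy[0] == 503
```

Returning a 503 promptly lets callers retry or fall back instead of holding a connection open for an unknown time.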
## Related
- [Configuration](./configuration.md) - Redis URL configuration
- [Service Topology](../architecture/service-topology.md) - Service dependencies