# Deployment Guide
Deploy the analytics platform to production.
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────┐
│                          Load Balancer                          │
└─────────────────────────────────────────────────────────────────┘
           ┌─────────────────────┼─────────────────────┐
           │                     │                     │
           ▼                     ▼                     ▼
   ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
   │   Collector   │     │   Collector   │     │   Collector   │
   │    Service    │     │    Service    │     │    Service    │
   └───────────────┘     └───────────────┘     └───────────────┘
           │                     │                     │
           └─────────────────────┼─────────────────────┘
                         ┌───────────────┐
                         │     Redis     │
                         │   (BullMQ)    │
                         └───────────────┘
           ┌─────────────────────┼─────────────────────┐
           │                     │                     │
           ▼                     ▼                     ▼
   ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
   │   Processor   │     │   Processor   │     │   Processor   │
   │    Worker     │     │    Worker     │     │    Worker     │
   └───────────────┘     └───────────────┘     └───────────────┘
           │                     │                     │
           └─────────────────────┼─────────────────────┘
                         ┌───────────────┐
                         │  PostgreSQL   │
                         │ (TimescaleDB) │
                         └───────────────┘
                         ┌───────────────┐
                         │  API Service  │
                         └───────────────┘
```
## Services
| Service | Port | Description |
|---------|------|-------------|
| Collector | 4001 | Event ingestion |
| Processor | - | Queue worker (no HTTP) |
| API | 4002 | Query endpoints |
| Realtime | 4003 | WebSocket server |
## Docker Deployment
### docker-compose.yml
```yaml
version: '3.8'

services:
  collector:
    image: analytics/collector:latest
    ports:
      - "4001:4001"
    environment:
      - NODE_ENV=production
      - REDIS_URL=redis://redis:6379
      - LOG_LEVEL=info
    depends_on:
      - redis
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 512M
          cpus: '0.5'

  processor:
    image: analytics/processor:latest
    environment:
      - NODE_ENV=production
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/analytics
      - CONCURRENCY=10
    depends_on:
      - redis
      - postgres
    deploy:
      replicas: 2
      resources:
        limits:
          memory: 1G
          cpus: '1'

  api:
    image: analytics/api:latest
    ports:
      - "4002:4002"
    environment:
      - NODE_ENV=production
      - DATABASE_URL=postgresql://postgres:password@postgres:5432/analytics
      - REDIS_URL=redis://redis:6379
    depends_on:
      - postgres
      - redis
    deploy:
      replicas: 2
      resources:
        limits:
          memory: 512M
          cpus: '0.5'

  realtime:
    image: analytics/realtime:latest
    ports:
      - "4003:4003"
    environment:
      - NODE_ENV=production
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis
    deploy:
      replicas: 2

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data
    command: redis-server --appendonly yes

  postgres:
    image: timescale/timescaledb:latest-pg15
    environment:
      - POSTGRES_DB=analytics
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=password
    volumes:
      - postgres_data:/var/lib/postgresql/data

volumes:
  redis_data:
  postgres_data:
```
## Kubernetes Deployment
### Collector Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: analytics-collector
spec:
  replicas: 3
  selector:
    matchLabels:
      app: analytics-collector
  template:
    metadata:
      labels:
        app: analytics-collector
    spec:
      containers:
        - name: collector
          image: analytics/collector:latest
          ports:
            - containerPort: 4001
          env:
            - name: NODE_ENV
              value: production
            - name: REDIS_URL
              valueFrom:
                secretKeyRef:
                  name: analytics-secrets
                  key: redis-url
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          readinessProbe:
            httpGet:
              path: /health
              port: 4001
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 4001
            initialDelaySeconds: 15
            periodSeconds: 20
---
apiVersion: v1
kind: Service
metadata:
  name: analytics-collector
spec:
  selector:
    app: analytics-collector
  ports:
    - port: 4001
      targetPort: 4001
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: analytics-collector-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: analytics-collector
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
## Environment Variables
### Collector Service
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `NODE_ENV` | Yes | - | Environment (production/development) |
| `PORT` | No | 4001 | HTTP port |
| `REDIS_URL` | Yes | - | Redis connection URL |
| `LOG_LEVEL` | No | info | Logging level |
| `CORS_ORIGINS` | No | * | Allowed CORS origins |
### Processor Service
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `NODE_ENV` | Yes | - | Environment |
| `REDIS_URL` | Yes | - | Redis connection URL |
| `DATABASE_URL` | Yes | - | PostgreSQL connection URL |
| `CONCURRENCY` | No | 5 | Worker concurrency |
| `BATCH_SIZE` | No | 100 | Events per batch |
### API Service
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `NODE_ENV` | Yes | - | Environment |
| `PORT` | No | 4002 | HTTP port |
| `DATABASE_URL` | Yes | - | PostgreSQL connection URL |
| `REDIS_URL` | Yes | - | Redis for caching |
| `API_KEYS` | Yes | - | Comma-separated API keys |
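
Failing fast on a missing or malformed variable at startup is usually easier to debug than a crash on the first request. The sketch below shows one way the API Service table above could be validated. The names `loadApiConfig`, `requireVar`, and `ApiConfig` are hypothetical and not part of the platform's code; the defaults and the comma-separated `API_KEYS` format come from the table.

```typescript
// Hypothetical startup validation for the API Service variables listed above.
interface ApiConfig {
  nodeEnv: string;
  port: number;
  databaseUrl: string;
  redisUrl: string;
  apiKeys: string[];
}

type EnvSource = Record<string, string | undefined>;

// Return the value of a required variable, or throw a descriptive error.
function requireVar(env: EnvSource, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

function loadApiConfig(env: EnvSource): ApiConfig {
  // PORT is optional and defaults to 4002.
  const port = Number(env["PORT"] ?? "4002");
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new Error(`PORT must be an integer between 1 and 65535, got: ${env["PORT"]}`);
  }

  // API_KEYS is comma-separated; trim whitespace and drop empty entries.
  const apiKeys = requireVar(env, "API_KEYS")
    .split(",")
    .map((key) => key.trim())
    .filter((key) => key.length > 0);
  if (apiKeys.length === 0) {
    throw new Error("API_KEYS must contain at least one key");
  }

  return {
    nodeEnv: requireVar(env, "NODE_ENV"),
    port,
    databaseUrl: requireVar(env, "DATABASE_URL"),
    redisUrl: requireVar(env, "REDIS_URL"),
    apiKeys,
  };
}
```

In a Node.js service this would typically be called once with `process.env` before the HTTP server starts listening.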
## Database Setup
### PostgreSQL with TimescaleDB
```sql
-- Create database
CREATE DATABASE analytics;

-- Connect to the new database (psql meta-command) so that the extension
-- and tables below are created in it
\c analytics

-- Enable TimescaleDB
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- Create tables
CREATE TABLE raw_events (
    id UUID NOT NULL DEFAULT gen_random_uuid(),
    session_id VARCHAR(64) NOT NULL,
    user_id VARCHAR(255),
    event_type VARCHAR(100) NOT NULL,
    event_action VARCHAR(255) NOT NULL,
    metadata JSONB DEFAULT '{}',
    timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    -- Unique constraints on a hypertable must include the partitioning
    -- column, so the primary key is (id, timestamp) rather than id alone
    PRIMARY KEY (id, timestamp)
);

-- Convert to hypertable for time-series optimization
SELECT create_hypertable('raw_events', 'timestamp');

-- Create indexes
CREATE INDEX idx_raw_events_session ON raw_events(session_id);
CREATE INDEX idx_raw_events_user ON raw_events(user_id);
CREATE INDEX idx_raw_events_type ON raw_events(event_type);
CREATE INDEX idx_raw_events_metadata ON raw_events USING GIN(metadata);

-- Aggregated tables
-- Primary-key columns cannot be NULL, so metrics without a dimension
-- store empty strings in dimension_key and dimension_value
CREATE TABLE daily_metrics (
    date DATE NOT NULL,
    metric_name VARCHAR(100) NOT NULL,
    dimension_key VARCHAR(255) NOT NULL DEFAULT '',
    dimension_value VARCHAR(255) NOT NULL DEFAULT '',
    value BIGINT NOT NULL DEFAULT 0,
    PRIMARY KEY (date, metric_name, dimension_key, dimension_value)
);

-- Retention policy: keep raw events for 90 days
SELECT add_retention_policy('raw_events', INTERVAL '90 days');
```
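
Writing events one row at a time would waste round trips, which is presumably why the Processor has a `BATCH_SIZE` setting. As a rough illustration, the sketch below builds a single parameterized multi-row `INSERT` for `raw_events` using PostgreSQL's `$1, $2, ...` placeholders. The function `buildRawEventsInsert` and the `RawEventInput` shape are hypothetical; the column names come from the schema above, and how the Processor actually writes rows is not documented here.

```typescript
// Hypothetical input shape mirroring the raw_events columns above.
interface RawEventInput {
  sessionId: string;
  userId: string | null;
  eventType: string;
  eventAction: string;
  metadata: Record<string, unknown>;
  timestamp: Date;
}

interface ParameterizedQuery {
  text: string;
  values: unknown[];
}

const COLUMNS_PER_ROW = 6;

// Build one multi-row INSERT so a whole batch is written in a single statement.
function buildRawEventsInsert(events: RawEventInput[]): ParameterizedQuery {
  if (events.length === 0) {
    throw new Error("Cannot build an INSERT for an empty batch");
  }

  const values: unknown[] = [];
  const rows = events.map((event) => {
    const base = values.length; // number of parameters emitted so far
    values.push(
      event.sessionId,
      event.userId,
      event.eventType,
      event.eventAction,
      JSON.stringify(event.metadata), // sent as text, stored as JSONB
      event.timestamp.toISOString(),
    );
    const placeholders: string[] = [];
    for (let offset = 1; offset <= COLUMNS_PER_ROW; offset++) {
      placeholders.push(`$${base + offset}`);
    }
    return `(${placeholders.join(", ")})`;
  });

  const text =
    "INSERT INTO raw_events " +
    "(session_id, user_id, event_type, event_action, metadata, timestamp) " +
    `VALUES ${rows.join(", ")}`;
  return { text, values };
}
```

The returned `text` and `values` pair has the shape that PostgreSQL client libraries commonly accept for parameterized queries; `id` and `created_at` are left to their column defaults.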
## Nginx Configuration
```nginx
# Cache zone used by the /api location below. Must be declared in the http
# context; the path and sizes are example values.
proxy_cache_path /var/cache/nginx/api levels=1:2 keys_zone=api_cache:10m max_size=1g inactive=10m;

upstream collector {
    least_conn;
    server collector-1:4001;
    server collector-2:4001;
    server collector-3:4001;
}

upstream api {
    server api-1:4002;
    server api-2:4002;
}

upstream realtime {
    ip_hash;  # Sticky sessions for WebSocket
    server realtime-1:4003;
    server realtime-2:4003;
}

server {
    listen 443 ssl http2;
    server_name analytics.example.com;

    ssl_certificate /etc/ssl/certs/analytics.crt;
    ssl_certificate_key /etc/ssl/private/analytics.key;

    # Collector - high throughput
    location /collect {
        proxy_pass http://collector;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Don't buffer - fast response
        proxy_buffering off;

        # Limit batch payloads to 1 MB
        client_max_body_size 1m;
    }

    # API - standard REST
    location /api {
        proxy_pass http://api;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Cache successful GET responses for one minute.
        # Note: the cache key ignores credentials, so a cached response is
        # served to any caller; extend the key if responses vary per API key.
        proxy_cache api_cache;
        proxy_cache_valid 200 1m;
        proxy_cache_key "$request_method$request_uri";
        add_header X-Cache-Status $upstream_cache_status;
    }

    # WebSocket - realtime
    location /realtime {
        proxy_pass http://realtime;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;

        # Long-lived connections
        proxy_read_timeout 86400s;
        proxy_send_timeout 86400s;
    }
}
```
## Monitoring
### Health Checks
All services expose a `/health` endpoint:
```json
{
  "status": "healthy",
  "version": "1.0.0",
  "uptime": 86400,
  "checks": {
    "redis": "ok",
    "database": "ok"
  }
}
```
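
To make the payload concrete, here is a minimal sketch of how a service might assemble it from the results of its dependency checks. Only the `"healthy"` and `"ok"` values appear in the example above; the `"unhealthy"` and `"error"` values, the 503 status code, and the names `buildHealthReport` and `httpStatusFor` are assumptions made for illustration.

```typescript
type CheckStatus = "ok" | "error"; // "error" is an assumed failure value

interface HealthReport {
  status: "healthy" | "unhealthy"; // "unhealthy" is an assumed failure value
  version: string;
  uptime: number; // whole seconds since the process started
  checks: Record<string, CheckStatus>;
}

// Combine per-dependency results (true = reachable) into the /health payload.
function buildHealthReport(
  version: string,
  startedAtMs: number,
  results: Record<string, boolean>,
  nowMs: number = Date.now(),
): HealthReport {
  const checks: Record<string, CheckStatus> = {};
  let allOk = true;
  for (const [dependency, ok] of Object.entries(results)) {
    checks[dependency] = ok ? "ok" : "error";
    if (!ok) allOk = false;
  }
  return {
    status: allOk ? "healthy" : "unhealthy",
    version,
    uptime: Math.floor((nowMs - startedAtMs) / 1000),
    checks,
  };
}

// Kubernetes httpGet probes treat 200-399 as success, so a failing
// dependency is reported with 503 (an assumed choice) to fail the probe.
function httpStatusFor(report: HealthReport): number {
  return report.status === "healthy" ? 200 : 503;
}
```

Returning a non-2xx status when a dependency is down lets the readiness probe in the Kubernetes manifest take the pod out of rotation without parsing the JSON body.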
### Metrics (Prometheus)
Services expose a `/metrics` endpoint in the Prometheus text format:
```
# Collector metrics
analytics_events_received_total{type="engagement"} 1234567
analytics_events_queued_total 1234500
analytics_batch_size_histogram_bucket{le="10"} 50000
# Processor metrics
analytics_events_processed_total 1234000
analytics_processing_duration_seconds_bucket{le="0.1"} 1200000
analytics_queue_depth 500
# API metrics
analytics_api_requests_total{endpoint="/trends",status="200"} 50000
analytics_api_latency_seconds_bucket{le="0.5"} 49000
```
### Grafana Dashboards
Import pre-built dashboards from `/dashboards/`:
- `collector-metrics.json` - Ingestion throughput
- `processor-metrics.json` - Processing performance
- `api-metrics.json` - Query latency and errors
- `business-metrics.json` - Analytics KPIs
## Scaling Guidelines
### Collector Service
- Scale horizontally based on incoming event rate
- Target: <100ms p99 response time
- Rule of thumb: 1 replica per 10,000 events/minute
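
The rule of thumb above is simple arithmetic. A minimal sketch, assuming the result is clamped to the 3 to 10 replica range used by the HPA in the Kubernetes section (the function name `collectorReplicasFor` is hypothetical):

```typescript
const EVENTS_PER_MINUTE_PER_REPLICA = 10000; // rule of thumb from this guide

// Estimate how many Collector replicas a given ingest rate calls for.
function collectorReplicasFor(
  eventsPerMinute: number,
  minReplicas: number = 3, // HPA minReplicas from the Kubernetes manifest
  maxReplicas: number = 10, // HPA maxReplicas from the Kubernetes manifest
): number {
  if (!Number.isFinite(eventsPerMinute) || eventsPerMinute < 0) {
    throw new RangeError("eventsPerMinute must be a non-negative number");
  }
  // Round up: a partially loaded replica is still a whole replica.
  const needed = Math.ceil(eventsPerMinute / EVENTS_PER_MINUTE_PER_REPLICA);
  return Math.min(maxReplicas, Math.max(minReplicas, needed));
}
```

For example, 45,000 events/minute needs `ceil(4.5) = 5` replicas, while 5,000 events/minute still runs at the 3-replica floor. This is a capacity-planning estimate only; the HPA itself scales on CPU utilization.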
### Processor Service
- Scale based on queue depth
- Target: Queue depth < 1000
- Increase `CONCURRENCY` before adding replicas
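
These three rules can be read as a small decision procedure: do nothing while the queue is under target, raise `CONCURRENCY` first, and add a replica only once concurrency is exhausted. The sketch below encodes that order. The concurrency ceiling of 20, the doubling step, and the names `recommendProcessorScaling` and `ProcessorState` are assumptions for illustration; the right ceiling depends on the memory and CPU limits of each worker.

```typescript
type ScalingAction =
  | { kind: "none" }
  | { kind: "raise-concurrency"; concurrency: number }
  | { kind: "add-replica"; replicas: number };

interface ProcessorState {
  queueDepth: number; // e.g. the analytics_queue_depth metric
  concurrency: number; // current CONCURRENCY per worker
  replicas: number; // current number of Processor workers
}

const QUEUE_DEPTH_TARGET = 1000; // target from this guide

function recommendProcessorScaling(
  state: ProcessorState,
  maxConcurrency: number = 20, // assumed per-worker ceiling
): ScalingAction {
  // Queue is within target: nothing to do.
  if (state.queueDepth < QUEUE_DEPTH_TARGET) {
    return { kind: "none" };
  }
  // Prefer raising concurrency, since it needs no extra containers.
  if (state.concurrency < maxConcurrency) {
    const concurrency = Math.min(maxConcurrency, state.concurrency * 2);
    return { kind: "raise-concurrency", concurrency };
  }
  // Concurrency is maxed out: scale horizontally.
  return { kind: "add-replica", replicas: state.replicas + 1 };
}
```

With the defaults from the environment table (`CONCURRENCY` of 5), a backlog of 1,500 jobs would first suggest a concurrency of 10; a replica is only recommended once the assumed ceiling is reached.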
### API Service
- Scale based on query latency
- Target: <500ms p95 for complex queries
- Add read replicas to PostgreSQL for heavy read load
### Database
- Use TimescaleDB compression for historical data
- Partition by month for large deployments
- Consider ClickHouse for >1B events/day