# Chobit Architecture

## Overview

Chobit is an interactive AI companion: a multi-platform Godot 4 app with a 3D VRM avatar, voice interaction, and a pluggable LLM backend. Godot is the avatar runtime; all ML/GPU inference runs on external services via model-boss.

The project follows the @applications Tier 2 pattern, with shared GDScript symlinked into platform-specific Godot projects:

```
shared/godot/         → Cross-platform source (avatar, conversation, audio, UI)
godot-desktop/src/    → Symlink to shared/godot/ (transparent overlay, tray, window mgmt)
godot-mobile/src/     → Symlink to shared/godot/ (touch input, on-device camera)
services/             → Desktop-only Python sidecars (bridge, tray, vision)
```

## System Diagram

```
┌─────────────────────────────────────────────────────────────┐
│  Godot 4 App (transparent desktop overlay)                  │
│                                                             │
│  ┌────────────────┐  ┌─────────────────┐  ┌──────────────┐  │
│  │ Microphone     │  │ Conversation    │  │ VRM Avatar   │  │
│  │ Input          │  │ Orchestrator    │  │              │  │
│  │                │  │                 │  │ Skeleton     │  │
│  │ VAD            │  │ State Machine   │  │ Blendshapes  │  │
│  │ (Silero/energy)│──│ Sentence Stream │──│ AnimationTree│  │
│  │                │  │ Emotion Extract │  │ IK / LookAt  │  │
│  │ AudioEffectCapt│  │ Interrupt Ctrl  │  │ Lipsync      │  │
│  └────────────────┘  └────────┬────────┘  └──────────────┘  │
│                               │                             │
│  ┌────────────────┐           │                             │
│  │ Camera Input   │           │                             │
│  │                │           │                             │
│  │ Webcam Feed    │           │                             │
│  │ Gesture Classif│───────────┤                             │
│  │ Face Detection │           │                             │
│  └────────────────┘           │                             │
│                               │                             │
│                ┌──────────────┼──────────────┐              │
│                ▼              ▼              ▼              │
│           ┌──────────┐   ┌──────────┐   ┌──────────┐        │
│           │   STT    │   │   LLM    │   │   TTS    │        │
│           │  Client  │   │  Client  │   │  Client  │        │
│           │  (HTTP)  │   │ (HTTP/WS)│   │  (HTTP)  │        │
│           └──────────┘   └──────────┘   └──────────┘        │
│                │              │              │              │
└────────────────┼──────────────┼──────────────┼──────────────┘
                 │              │              │
                 ▼              ▼              ▼
            ┌───────────────────────────────────────┐
            │  Backend Services                     │
            │                                       │
            │  @speech-synthesis    @model-boss     │
            │  ├─ Whisper STT       ├─ GPU leases   │
            │  └─ Chatterbox TTS    └─ LLM routing  │
            │                                       │
            │  Any OpenAI-compatible LLM endpoint   │
            │  or LifeAI companion service          │
            └───────────────────────────────────────┘
```

## Attention System (Dual-Mode Gaze)

Chobit has two attention modes that determine where the avatar looks and how it responds to the user.

### Desktop Gaze (Ambient Mode)

The avatar tracks what the user is doing on screen. The companion is "with you" while you work.

- **Eyes/head follow cursor position**: the LookAt target is the mouse pointer mapped to 3D space
- **Active during idle state**: the default when no conversation is happening
- **Ambient reactions**: occasional glances at notification areas, screen edges, active windows
- **Subtle personality**: random look-away moments, stretches, yawns (not a robotic cursor tracker)

### Face-to-Face (Conversation Mode)

The webcam activates and the avatar looks at the user directly. Mutual eye contact.

- **Gaze target is the user's face**: detected via webcam, avatar maintains eye contact
- **Active during conversation**: listening, processing, speaking states
- **Facial awareness**: can detect the user's general expression for responsive reactions
- **Triggered by VAD**: speech detection switches from Desktop Gaze to Face-to-Face

### Mode Transitions

Transitions map to the ConversationState FSM:

| State | Attention Mode | Behavior |
|-------|---------------|----------|
| `idle` | Desktop Gaze | Tracks cursor, ambient companion |
| `listening` | Face-to-Face | Webcam active, looks at user, attentive posture |
| `processing` | Face-to-Face | Maintains eye contact, thinking pose |
| `speaking` | Face-to-Face | Engaged, gesturing, eye contact |
| `interrupted` | Face-to-Face | Brief surprise, then back to listening |
| Return to `idle` | Desktop Gaze | Gradual drift back to screen tracking |

The transition is a smooth blend, not a snap: the avatar's gaze target interpolates between cursor-space and face-space over ~0.5s.

## Motion Mirroring System

A showcase feature where the avatar mimics the user's gestures detected via webcam.
This is **methodologically distinct** from skeleton-driven tracking.

### Mirroring (what we do) vs Tracking (what we don't)

| Approach | How it works | Result |
|----------|-------------|--------|
| **Mirroring** (ours) | Classify gesture → trigger pre-made animation | Curated, expressive, companion-like |
| **Tracking** (rejected) | Map user skeleton → avatar skeleton in real-time | Puppet-like, jittery, uncanny |

Mirroring means the avatar is a personality that *responds* to what the user does, not a marionette driven by the user's body. The avatar waves back when you wave; it doesn't replicate your exact arm angle.

### Gesture Classification Pipeline

```
Webcam Frame
    │
    ▼
Pose Detection (MediaPipe / lightweight model)
    │
    ▼
Gesture Classifier
    ├── wave          → play wave_back animation
    ├── head_cock     → play head_tilt animation (mirrored)
    ├── nod           → play nod animation
    ├── head_shake    → play head_shake animation
    ├── lean_forward  → play lean_in animation
    ├── hand_raise    → play greeting animation
    ├── thumbs_up     → play happy_react animation
    └── unknown       → no action (ignore)
    │
    ▼
Animation Trigger (via EventBus)
    │
    ▼
AnimationTree plays the corresponding animation
with personality variation (speed, amplitude randomization)
```

### Key Properties

- **Deliberate delay**: a 0.2-0.5s response time feels natural, not robotic
- **Personality variance**: the same gesture doesn't always trigger the exact same animation
- **Selective response**: the avatar doesn't mirror everything; it chooses what to react to
- **Layered on conversation**: mirroring is active in Face-to-Face mode and can overlay on speaking/listening animations
- **Graceful without a camera**: falls back to Desktop Gaze only; the rest of the experience is unaffected

### Gesture Detection Approach

Two viable approaches (decision deferred to implementation):

1. **MediaPipe Holistic**: full pose/hand/face landmarks, classify from landmark positions. Runs in a separate process, sends classified gestures to Godot via local socket.
2. **Lightweight CNN classifier**: trained on gesture classes directly from webcam frames. Simpler pipeline, less accurate, runs in-process.

Either way, the Godot side only receives gesture labels (strings); the detection pipeline is opaque to the animation system.

## Conversation Loop

```
1. VAD detects speech end
   └─▶ AudioEffectCapture buffer captured by Godot audio server

2. Audio sent to STT service
   └─▶ HTTP POST to chatterbox-tts-service /api/stt
   └─▶ Returns transcribed text

3. Text + history sent to LLM backend
   └─▶ HTTP streaming request (SSE or chunked response)
   └─▶ Tokens arrive incrementally

4. SentenceStream buffers tokens into complete sentences
   └─▶ Each sentence immediately sent to TTS
   └─▶ First sentence plays while LLM still generates

5. EmotionExtractor strips [emotion] tags from each sentence
   └─▶ AnimationTree transitions to matching expression
   └─▶ TTS exaggeration parameter adjusted

6. TTS synthesizes speech per-sentence
   └─▶ Audio returned from chatterbox-tts-service
   └─▶ Played via AudioStreamPlayer

7. Lipsync drives mouth blendshape
   └─▶ AudioEffectSpectrumAnalyzer reads playback amplitude
   └─▶ Mapped to 'aa' (mouth open) blendshape per frame

8. On completion, AnimationTree returns to idle state
   └─▶ VAD resumes listening
```

## Voice Interruption

When the user speaks while the AI is talking:

1. VAD detects speech onset during the `speaking` state
2. `interrupt()` is called on the conversation orchestrator
3. The HTTP request to the LLM is aborted (stream cancelled)
4. AudioStreamPlayer is stopped immediately
5. The partial response is saved with an `[interrupted]` marker in history
6. AnimationTree: speaking → interrupted (brief surprise) → listening

## Platform Rendering

### Desktop: Transparent Overlay

The avatar (Miku) floats on the desktop with no window chrome and no background. The OS composites the 3D avatar directly over whatever the user is doing.
```gdscript
extends Node

func _ready() -> void:
	# Needs display/window/per_pixel_transparency/allowed enabled in Project Settings.
	DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_TRANSPARENT, true)
	DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_ALWAYS_ON_TOP, true)
	DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_BORDERLESS, true)
	get_viewport().transparent_bg = true
```

Desktop-specific features: window drag, zoom, edge snap, system tray integration, keyboard shortcuts, gaze halo overlay.

### Mobile: Fullscreen with Background Modes

Mobile OSes don't support transparent overlay windows, so Miku owns the full screen. The background behind the avatar is configurable with four modes:

| Mode | Source | Use case |
|------|--------|----------|
| **Camera feed** | Rear/front `CameraFeed` → viewport background | AR-style, companion in the real world. Front camera doubles as face tracking input. |
| **Rendered environment** | 3D scene (bedroom, park, abstract) | Virtual pet aesthetic, configurable themes |
| **Camera blur** | Camera feed → Gaussian blur shader | Softer AR look, less visual noise |
| **Solid/gradient** | Flat color or gradient | Battery-friendly fallback, clean aesthetic |

The background layer renders behind the avatar in the viewport. The avatar, lighting, and UI are identical to desktop; only the background differs. Desktop has transparency as its implicit "background mode" and doesn't use this system.
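
The four background modes could be modelled as an enum plus a single visibility switch. The sketch below is illustrative only: `BackgroundMode`, `layers`, `register_layer()`, `set_background_mode()`, and `mode_needs_camera()` are hypothetical names that do not come from the Chobit source, and it assumes each mode is drawn by its own node exposing a `visible` property.

```gdscript
extends Node

# Hypothetical sketch: none of these names come from the Chobit source.
enum BackgroundMode { CAMERA_FEED, RENDERED_ENVIRONMENT, CAMERA_BLUR, SOLID_GRADIENT }

# Assumption: one node per mode (e.g. a TextureRect or a 3D scene root),
# each exposing a `visible` property.
var layers: Dictionary = {}
var current_mode: BackgroundMode = BackgroundMode.SOLID_GRADIENT


func register_layer(mode: BackgroundMode, layer: Node) -> void:
	layers[mode] = layer
	layer.set("visible", mode == current_mode)


func set_background_mode(mode: BackgroundMode) -> void:
	current_mode = mode
	# Show only the layer that belongs to the selected mode.
	for key in layers:
		var layer: Node = layers[key]
		layer.set("visible", key == mode)


# Both camera-based modes need a live feed; the other two can leave
# the camera off unless face tracking wants it.
func mode_needs_camera(mode: BackgroundMode) -> bool:
	return mode == BackgroundMode.CAMERA_FEED or mode == BackgroundMode.CAMERA_BLUR
```

Starting in `SOLID_GRADIENT` follows the table's description of that mode as the battery-friendly fallback; a real implementation might instead restore the user's last choice.
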
## Animation Architecture

```
AnimationTree (AnimationNodeStateMachine)
│
├─ idle
│  ├─ Breathing: sine wave on chest/shoulder bones (always active)
│  ├─ Blink: random interval (2-6s), VRM 'blink' blendshape
│  ├─ Sway: subtle Perlin noise on hip/spine rotation
│  └─ LookAt: eyes track cursor via LookAtModifier3D (Desktop Gaze)
│
├─ listening
│  ├─ Head tilt toward user (Face-to-Face gaze)
│  ├─ Attentive posture (slight forward lean)
│  └─ Crossfade from idle (0.3s transition)
│
├─ processing
│  ├─ Look-away (eyes drift, head turns slightly)
│  ├─ Thinking pose (hand to chin, or finger tap)
│  └─ Subtle idle maintained underneath
│
├─ speaking
│  ├─ Engaged posture (shoulders open, slight forward lean)
│  ├─ Gesture layer (hand movements on sentence breaks)
│  ├─ Lipsync layer (AudioEffectSpectrumAnalyzer → mouth)
│  └─ Expression layer (emotion blendshapes from tags)
│
├─ interrupted
│  ├─ Brief surprise expression (0.2s)
│  └─ Transition to listening (0.3s)
│
└─ mirroring (overlay layer, active in Face-to-Face mode)
   ├─ Gesture response animations (wave, nod, tilt, etc.)
   ├─ Blended on top of current state animation
   └─ Priority: mirroring < speaking gestures < lipsync

Expression Blend Layer (runs on top of body animations):
  AnimationNodeBlendTree with 6 emotion inputs
  Smooth weight interpolation (lerp, ~0.3s transition)
  Driven by EmotionExtractor output
```

## Emotion System

The LLM is prompted to embed emotion tags inline:

```
"[joy] That sounds wonderful! [curiosity] Tell me more about your day."
```

28 extended emotions map to 6 VRM blendshapes:

- **happy** ← joy, excitement, love, amusement, admiration, gratitude, pride, optimism
- **sad** ← grief, disappointment, remorse, sadness
- **angry** ← anger, annoyance, disgust, disapproval
- **surprised** ← surprise, confusion, curiosity, realization, fear, nervousness
- **relaxed** ← caring, relief, calm, contentment
- **neutral** ← embarrassment, desire

Emotions also influence:

- **TTS exaggeration**: Chatterbox `exaggeration` parameter (0.0-1.0)
- **Gesture intensity**: animation speed/amplitude scales with emotional state
- **Particle effects**: optional sparkles for joy, dark aura for anger, etc.

## Godot Node Tree

```
CompanionRoot (Node3D)
├── Camera3D (fixed, FOV 30, positioned at face level)
├── DirectionalLight3D
├── AmbientLight (WorldEnvironment)
├── AvatarRoot (Node3D)
│   ├── VRMModel (imported .vrm, Skeleton3D child)
│   │   ├── Skeleton3D (VRM humanoid bones)
│   │   ├── MeshInstance3D (body, hair, clothes)
│   │   └── LookAtModifier3D (gaze tracking)
│   ├── AnimationPlayer (imported VRM animations)
│   └── AnimationTree (state machine + expression blend + mirroring layer)
├── AudioStreamPlayer (TTS playback)
│   └── AudioEffectSpectrumAnalyzer (lipsync source)
├── AudioStreamPlayer (mic capture for VAD)
│   └── AudioEffectCapture
├── CameraFeed (webcam input for Face-to-Face mode)
│   └── GestureClassifier (pose detection → gesture labels)
└── UI (CanvasLayer)
    ├── ChatBubble (appears during conversation)
    ├── MicIndicator (shows VAD state)
    └── SettingsPanel (model/voice/backend config)
```

The tree is conceptual in a few places. In Godot 4, `AudioEffectSpectrumAnalyzer` and `AudioEffectCapture` are effects attached to an audio bus rather than scene nodes; they are listed under the `AudioStreamPlayer` that routes to that bus. Likewise, `CameraFeed` is a resource obtained from `CameraServer`, not a node, and `LookAtModifier3D` has to be parented to the `Skeleton3D` it modifies.

## @model-boss Integration

GPU coordination is handled by @model-boss on the backend.
The Godot app is a pure client; it makes HTTP requests to services that acquire GPU leases internally:

- **Whisper STT**: Lease acquired per transcription request
- **Chatterbox TTS**: Lease acquired per synthesis request
- **LLM inference**: Lease held during streaming response

Concurrent TTS + STT (for interruption handling) is automatically coordinated by @model-boss's priority queue.

## VRM Model Format

Chobit uses VRM models (`.vrm` files) loaded via the VRM4Godot addon:

- **VRoid Studio** (free, Pixiv): create custom models
- **VRoid Hub**: download community models
- **UniVRM**: convert from other 3D formats

Required blendshapes: `happy`, `sad`, `angry`, `surprised`, `relaxed`, `neutral`, `aa` (mouth open), `blink`

## File Formats

| Asset | Format | Location |
|-------|--------|----------|
| VRM models | `.vrm` | `godot-desktop/models/`, `godot-mobile/models/` |
| Audio assets | `.wav`, `.ogg`, `.mp3` | `godot-desktop/audio/` |
| Shared GDScript | `.gd` | `shared/godot/` (symlinked as `src/`) |
| Platform GDScript | `.gd` | `godot-{platform}/platform/` |
| Scenes | `.tscn` | `godot-{platform}/scenes/` |
| Sidecar services | `.py` | `services/{bridge,tray,vision}/` |
| Protocol types | `.ts` | `packages/chobit-core/src/` |
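
Because every `.vrm` in the model directories is expected to expose the required blendshapes listed under VRM Model Format, a load-time check may be useful. A minimal sketch follows; `REQUIRED_BLENDSHAPES` and `missing_blendshapes()` are hypothetical names, and reading the available names from a loaded model depends on the VRM addon and is left out.

```gdscript
extends Node

# The eight names come from the "Required blendshapes" list in this document.
const REQUIRED_BLENDSHAPES := [
	"happy", "sad", "angry", "surprised", "relaxed", "neutral", "aa", "blink",
]


# Returns the required names that are absent from `available`.
# Assumption: the caller has already collected the model's blendshape names.
func missing_blendshapes(available: Array) -> Array:
	var missing := []
	for shape_name in REQUIRED_BLENDSHAPES:
		if not available.has(shape_name):
			missing.append(shape_name)
	return missing
```

An empty result means the model can drive all six expressions plus lipsync (`aa`) and blinking; a non-empty one could be surfaced in the settings UI before the model is accepted.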