chobit/docs/ARCHITECTURE.md
2026-03-28 14:55:34 -07:00

16 KiB

Chobit Architecture

Overview

Chobit is an interactive AI companion — a multi-platform Godot 4 app with a 3D VRM avatar, voice interaction, and pluggable LLM backend. Godot is the avatar runtime; all ML/GPU inference runs on external services via model-boss.

The project follows the @applications Tier 2 pattern with shared GDScript symlinked into platform-specific Godot projects:

shared/godot/           → Cross-platform source (avatar, conversation, audio, UI)
godot-desktop/src/ →    → Symlink to shared/godot/ (transparent overlay, tray, window mgmt)
godot-mobile/src/ →     → Symlink to shared/godot/ (touch input, on-device camera)
services/               → Desktop-only Python sidecars (bridge, tray, vision)

System Diagram

┌──────────────────────────────────────────────────────────────┐
│ Godot 4 App (transparent desktop overlay)                    │
│                                                              │
│  ┌────────────────┐  ┌─────────────────┐  ┌──────────────┐ │
│  │ Microphone      │  │ Conversation    │  │ VRM Avatar   │ │
│  │ Input           │  │ Orchestrator    │  │              │ │
│  │                 │  │                 │  │ Skeleton     │ │
│  │ VAD             │  │ State Machine   │  │ Blendshapes  │ │
│  │ (Silero/energy) │──│ Sentence Stream │──│ AnimationTree│ │
│  │                 │  │ Emotion Extract │  │ IK / LookAt  │ │
│  │ AudioEffectCapt │  │ Interrupt Ctrl  │  │ Lipsync      │ │
│  └────────────────┘  └────────┬────────┘  └──────────────┘ │
│                               │                              │
│  ┌────────────────┐           │                              │
│  │ Camera Input   │           │                              │
│  │                │           │                              │
│  │ Webcam Feed    │           │                              │
│  │ Gesture Classif│───────────┘                              │
│  │ Face Detection │                                          │
│  └────────────────┘                                          │
│                                                              │
│                ┌──────────────┼──────────────┐              │
│                ▼              ▼              ▼              │
│          ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│          │ STT      │  │ LLM      │  │ TTS      │         │
│          │ Client   │  │ Client   │  │ Client   │         │
│          │ (HTTP)   │  │ (HTTP/WS)│  │ (HTTP)   │         │
│          └──────────┘  └──────────┘  └──────────┘         │
│                │              │              │              │
└────────────────┼──────────────┼──────────────┼──────────────┘
                 │              │              │
                 ▼              ▼              ▼
          ┌───────────────────────────────────────┐
          │         Backend Services              │
          │                                       │
          │  @speech-synthesis    @model-boss     │
          │  ├─ Whisper STT       ├─ GPU leases   │
          │  └─ Chatterbox TTS    └─ LLM routing  │
          │                                       │
          │  Any OpenAI-compatible LLM endpoint   │
          │  or LifeAI companion service          │
          └───────────────────────────────────────┘

Attention System (Dual-Mode Gaze)

Chobit has two attention modes that determine where the avatar looks and how it responds to the user:

Desktop Gaze (Ambient Mode)

The avatar tracks what the user is doing on screen. The companion is "with you" while you work.

  • Eyes/head follow cursor position — LookAt target is the mouse pointer mapped to 3D space
  • Active during idle state — the default when no conversation is happening
  • Ambient reactions — occasional glances at notification areas, screen edges, active windows
  • Subtle personality — random look-away moments, stretches, yawns (not a robotic cursor tracker)

Face-to-Face (Conversation Mode)

The webcam activates and the avatar looks at the user directly. Mutual eye contact.

  • Gaze target is the user's face — detected via webcam, avatar maintains eye contact
  • Active during conversation — listening, processing, speaking states
  • Facial awareness — can detect user's general expression for responsive reactions
  • Triggered by VAD — speech detection switches from Desktop Gaze to Face-to-Face

Mode Transitions

Transitions map to the ConversationState FSM:

State Attention Mode Behavior
idle Desktop Gaze Tracks cursor, ambient companion
listening Face-to-Face Webcam active, looks at user, attentive posture
processing Face-to-Face Maintains eye contact, thinking pose
speaking Face-to-Face Engaged, gesturing, eye contact
interrupted Face-to-Face Brief surprise, then back to listening
Return to idle Desktop Gaze Gradual drift back to screen tracking

The transition is a smooth blend, not a snap — the avatar's gaze target interpolates between cursor-space and face-space over ~0.5s.

Motion Mirroring System

A showcase feature where the avatar mimics the user's gestures detected via webcam. This is methodologically distinct from skeleton-driven tracking:

Mirroring (what we do) vs Tracking (what we don't)

Approach How it works Result
Mirroring (ours) Classify gesture → trigger pre-made animation Curated, expressive, companion-like
Tracking (rejected) Map user skeleton → avatar skeleton in real-time Puppet-like, jittery, uncanny

Mirroring means the avatar is a personality that responds to what the user does, not a marionette driven by the user's body. The avatar waves back when you wave — it doesn't replicate your exact arm angle.

Gesture Classification Pipeline

Webcam Frame
  │
  ▼
Pose Detection (MediaPipe / lightweight model)
  │
  ▼
Gesture Classifier
  ├── wave         → play wave_back animation
  ├── head_cock    → play head_tilt animation (mirrored)
  ├── nod          → play nod animation
  ├── head_shake   → play head_shake animation
  ├── lean_forward → play lean_in animation
  ├── hand_raise   → play greeting animation
  ├── thumbs_up    → play happy_react animation
  └── unknown      → no action (ignore)
  │
  ▼
Animation Trigger (via EventBus)
  │
  ▼
AnimationTree plays the corresponding animation
with personality variation (speed, amplitude randomization)

Key Properties

  • Deliberate delay — 0.2-0.5s response time feels natural, not robotic
  • Personality variance — same gesture doesn't always trigger the exact same animation
  • Selective response — avatar doesn't mirror everything; chooses what to react to
  • Layered on conversation — mirroring active in Face-to-Face mode, can overlay on speaking/listening animations
  • Graceful when no camera — falls back to Desktop Gaze only, no degraded experience

Gesture Detection Approach

Two viable approaches (decision deferred to implementation):

  1. MediaPipe Holistic — full pose/hand/face landmarks, classify from landmark positions. Runs in a separate process, sends classified gestures to Godot via local socket.
  2. Lightweight CNN classifier — trained on gesture classes directly from webcam frames. Simpler pipeline, less accurate, runs in-process.

Either way, the Godot side only receives gesture labels (strings) — the detection pipeline is opaque to the animation system.

Conversation Loop

1. VAD detects speech end
   └─▶ AudioEffectCapture buffer captured by Godot audio server

2. Audio sent to STT service
   └─▶ HTTP POST to chatterbox-tts-service /api/stt
   └─▶ Returns transcribed text

3. Text + history sent to LLM backend
   └─▶ HTTP streaming request (SSE or chunked response)
   └─▶ Tokens arrive incrementally

4. SentenceStream buffers tokens into complete sentences
   └─▶ Each sentence immediately sent to TTS
   └─▶ First sentence plays while LLM still generates

5. EmotionExtractor strips [emotion] tags from each sentence
   └─▶ AnimationTree transitions to matching expression
   └─▶ TTS exaggeration parameter adjusted

6. TTS synthesizes speech per-sentence
   └─▶ Audio returned from chatterbox-tts-service
   └─▶ Played via AudioStreamPlayer

7. Lipsync drives mouth blendshape
   └─▶ AudioEffectSpectrumAnalyzer reads playback amplitude
   └─▶ Mapped to 'aa' (mouth open) blendshape per frame

8. On completion, AnimationTree returns to idle state
   └─▶ VAD resumes listening

Voice Interruption

When the user speaks while the AI is talking:

  1. VAD detects speech onset during speaking state
  2. interrupt() called on the conversation orchestrator
  3. HTTP request to LLM aborted (stream cancelled)
  4. AudioStreamPlayer stopped immediately
  5. Partial response saved with [interrupted] marker in history
  6. AnimationTree: speaking → interrupted (brief surprise) → listening

Desktop Overlay

Godot 4 transparent window configuration:

# In project.godot or at runtime:
DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_TRANSPARENT, true)
DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_ALWAYS_ON_TOP, true)
DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_BORDERLESS, true)

# Transparent viewport
get_viewport().transparent_bg = true

# Click-through on transparent pixels (optional)
# Handled via input event detection on the character mesh

The result: the character floats on the desktop with no window chrome, visible above all other windows, with only the character model and minimal UI elements being interactive.

Animation Architecture

AnimationTree (AnimationNodeStateMachine)
│
├─ idle
│  ├─ Breathing: sine wave on chest/shoulder bones (always active)
│  ├─ Blink: random interval (2-6s), VRM 'blink' blendshape
│  ├─ Sway: subtle Perlin noise on hip/spine rotation
│  └─ LookAt: eyes track cursor via LookAtModifier3D (Desktop Gaze)
│
├─ listening
│  ├─ Head tilt toward user (Face-to-Face gaze)
│  ├─ Attentive posture (slight forward lean)
│  └─ Crossfade from idle (0.3s transition)
│
├─ processing
│  ├─ Look-away (eyes drift, head turns slightly)
│  ├─ Thinking pose (hand to chin, or finger tap)
│  └─ Subtle idle maintained underneath
│
├─ speaking
│  ├─ Engaged posture (shoulders open, slight forward lean)
│  ├─ Gesture layer (hand movements on sentence breaks)
│  ├─ Lipsync layer (AudioEffectSpectrumAnalyzer → mouth)
│  └─ Expression layer (emotion blendshapes from tags)
│
├─ interrupted
│  ├─ Brief surprise expression (0.2s)
│  └─ Transition to listening (0.3s)
│
└─ mirroring (overlay layer, active in Face-to-Face mode)
   ├─ Gesture response animations (wave, nod, tilt, etc.)
   ├─ Blended on top of current state animation
   └─ Priority: mirroring < speaking gestures < lipsync

Expression Blend Layer (runs on top of body animations):
  AnimationNodeBlendTree with 6 emotion inputs
  Smooth weight interpolation (lerp, ~0.3s transition)
  Driven by EmotionExtractor output

Emotion System

The LLM is prompted to embed emotion tags inline:

"[joy] That sounds wonderful! [curiosity] Tell me more about your day."

28 extended emotions map to 6 VRM blendshapes:

  • happy ← joy, excitement, love, amusement, admiration, gratitude, pride, optimism
  • sad ← grief, disappointment, remorse, sadness
  • angry ← anger, annoyance, disgust, disapproval
  • surprised ← surprise, confusion, curiosity, realization, fear, nervousness
  • relaxed ← caring, relief, calm, contentment
  • neutral ← embarrassment, desire

Emotions also influence:

  • TTS exaggeration — Chatterbox exaggeration parameter (0.0-1.0)
  • Gesture intensity — animation speed/amplitude scales with emotional state
  • Particle effects — optional sparkles for joy, dark aura for anger, etc.

Godot Node Tree

CompanionRoot (Node3D)
├── Camera3D (fixed, FOV 30, positioned at face level)
├── DirectionalLight3D
├── AmbientLight (WorldEnvironment)
├── AvatarRoot (Node3D)
│   ├── VRMModel (imported .vrm, Skeleton3D child)
│   │   ├── Skeleton3D (VRM humanoid bones)
│   │   ├── MeshInstance3D (body, hair, clothes)
│   │   └── LookAtModifier3D (gaze tracking)
│   ├── AnimationPlayer (imported VRM animations)
│   └── AnimationTree (state machine + expression blend + mirroring layer)
├── AudioStreamPlayer (TTS playback)
│   └── AudioEffectSpectrumAnalyzer (lipsync source)
├── AudioStreamPlayer (mic capture for VAD)
│   └── AudioEffectCapture
├── CameraFeed (webcam input for Face-to-Face mode)
│   └── GestureClassifier (pose detection → gesture labels)
└── UI (CanvasLayer)
    ├── ChatBubble (appears during conversation)
    ├── MicIndicator (shows VAD state)
    └── SettingsPanel (model/voice/backend config)

@model-boss Integration

GPU coordination is handled by @model-boss on the backend. The Godot app is a pure client — it makes HTTP requests to services that internally acquire GPU leases:

  • Whisper STT: Lease acquired per transcription request
  • Chatterbox TTS: Lease acquired per synthesis request
  • LLM inference: Lease held during streaming response

Concurrent TTS + STT (for interruption handling) is automatically coordinated by @model-boss's priority queue.

VRM Model Format

Chobit uses VRM models (.vrm files) loaded via the VRM4Godot addon:

  • VRoid Studio (free, Pixiv) — create custom models
  • VRoid Hub — download community models
  • UniVRM — convert from other 3D formats

Required blendshapes: happy, sad, angry, surprised, relaxed, neutral, aa (mouth open), blink

File Formats

Asset Format Location
VRM models .vrm godot-desktop/models/, godot-mobile/models/
Audio assets .wav, .ogg, .mp3 godot-desktop/audio/
Shared GDScript .gd shared/godot/ (symlinked as src/)
Platform GDScript .gd godot-{platform}/platform/
Scenes .tscn godot-{platform}/scenes/
Sidecar services .py services/{bridge,tray,vision}/
Protocol types .ts packages/chobit-core/src/