Claude Code 50104f9bfc docs(docs): 📝 Update ARCHITECTURE.md with refined system architecture diagrams and design patterns documentation

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>

2026-03-28 14:55:34 -07:00

16 KiB

Raw Blame History

Chobit Architecture

Overview

Chobit is an interactive AI companion — a multi-platform Godot 4 app with a 3D VRM avatar, voice interaction, and pluggable LLM backend. Godot is the avatar runtime; all ML/GPU inference runs on external services via model-boss.

The project follows the @applications Tier 2 pattern with shared GDScript symlinked into platform-specific Godot projects:

shared/godot/           → Cross-platform source (avatar, conversation, audio, UI)
godot-desktop/src/ →    → Symlink to shared/godot/ (transparent overlay, tray, window mgmt)
godot-mobile/src/ →     → Symlink to shared/godot/ (touch input, on-device camera)
services/               → Desktop-only Python sidecars (bridge, tray, vision)

System Diagram

┌──────────────────────────────────────────────────────────────┐
│ Godot 4 App (transparent desktop overlay)                    │
│                                                              │
│  ┌────────────────┐  ┌─────────────────┐  ┌──────────────┐ │
│  │ Microphone      │  │ Conversation    │  │ VRM Avatar   │ │
│  │ Input           │  │ Orchestrator    │  │              │ │
│  │                 │  │                 │  │ Skeleton     │ │
│  │ VAD             │  │ State Machine   │  │ Blendshapes  │ │
│  │ (Silero/energy) │──│ Sentence Stream │──│ AnimationTree│ │
│  │                 │  │ Emotion Extract │  │ IK / LookAt  │ │
│  │ AudioEffectCapt │  │ Interrupt Ctrl  │  │ Lipsync      │ │
│  └────────────────┘  └────────┬────────┘  └──────────────┘ │
│                               │                              │
│  ┌────────────────┐           │                              │
│  │ Camera Input   │           │                              │
│  │                │           │                              │
│  │ Webcam Feed    │           │                              │
│  │ Gesture Classif│───────────┘                              │
│  │ Face Detection │                                          │
│  └────────────────┘                                          │
│                                                              │
│                ┌──────────────┼──────────────┐              │
│                ▼              ▼              ▼              │
│          ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│          │ STT      │  │ LLM      │  │ TTS      │         │
│          │ Client   │  │ Client   │  │ Client   │         │
│          │ (HTTP)   │  │ (HTTP/WS)│  │ (HTTP)   │         │
│          └──────────┘  └──────────┘  └──────────┘         │
│                │              │              │              │
└────────────────┼──────────────┼──────────────┼──────────────┘
                 │              │              │
                 ▼              ▼              ▼
          ┌───────────────────────────────────────┐
          │         Backend Services              │
          │                                       │
          │  @speech-synthesis    @model-boss     │
          │  ├─ Whisper STT       ├─ GPU leases   │
          │  └─ Chatterbox TTS    └─ LLM routing  │
          │                                       │
          │  Any OpenAI-compatible LLM endpoint   │
          │  or LifeAI companion service          │
          └───────────────────────────────────────┘

Attention System (Dual-Mode Gaze)

Chobit has two attention modes that determine where the avatar looks and how it responds to the user:

Desktop Gaze (Ambient Mode)

The avatar tracks what the user is doing on screen. The companion is "with you" while you work.

Eyes/head follow cursor position — LookAt target is the mouse pointer mapped to 3D space
Active during idle state — the default when no conversation is happening
Ambient reactions — occasional glances at notification areas, screen edges, active windows
Subtle personality — random look-away moments, stretches, yawns (not a robotic cursor tracker)

Face-to-Face (Conversation Mode)

The webcam activates and the avatar looks at the user directly. Mutual eye contact.

Gaze target is the user's face — detected via webcam, avatar maintains eye contact
Active during conversation — listening, processing, speaking states
Facial awareness — can detect user's general expression for responsive reactions
Triggered by VAD — speech detection switches from Desktop Gaze to Face-to-Face

Mode Transitions

Transitions map to the ConversationState FSM:

State	Attention Mode	Behavior
`idle`	Desktop Gaze	Tracks cursor, ambient companion
`listening`	Face-to-Face	Webcam active, looks at user, attentive posture
`processing`	Face-to-Face	Maintains eye contact, thinking pose
`speaking`	Face-to-Face	Engaged, gesturing, eye contact
`interrupted`	Face-to-Face	Brief surprise, then back to listening
Return to `idle`	Desktop Gaze	Gradual drift back to screen tracking

The transition is a smooth blend, not a snap — the avatar's gaze target interpolates between cursor-space and face-space over ~0.5s.

Motion Mirroring System

A showcase feature where the avatar mimics the user's gestures detected via webcam. This is methodologically distinct from skeleton-driven tracking:

Mirroring (what we do) vs Tracking (what we don't)

Approach	How it works	Result
Mirroring (ours)	Classify gesture → trigger pre-made animation	Curated, expressive, companion-like
Tracking (rejected)	Map user skeleton → avatar skeleton in real-time	Puppet-like, jittery, uncanny

Mirroring means the avatar is a personality that responds to what the user does, not a marionette driven by the user's body. The avatar waves back when you wave — it doesn't replicate your exact arm angle.

Gesture Classification Pipeline

Webcam Frame
  │
  ▼
Pose Detection (MediaPipe / lightweight model)
  │
  ▼
Gesture Classifier
  ├── wave         → play wave_back animation
  ├── head_cock    → play head_tilt animation (mirrored)
  ├── nod          → play nod animation
  ├── head_shake   → play head_shake animation
  ├── lean_forward → play lean_in animation
  ├── hand_raise   → play greeting animation
  ├── thumbs_up    → play happy_react animation
  └── unknown      → no action (ignore)
  │
  ▼
Animation Trigger (via EventBus)
  │
  ▼
AnimationTree plays the corresponding animation
with personality variation (speed, amplitude randomization)

Key Properties

Deliberate delay — 0.2-0.5s response time feels natural, not robotic
Personality variance — same gesture doesn't always trigger the exact same animation
Selective response — avatar doesn't mirror everything; chooses what to react to
Layered on conversation — mirroring active in Face-to-Face mode, can overlay on speaking/listening animations
Graceful when no camera — falls back to Desktop Gaze only, no degraded experience

Gesture Detection Approach

Two viable approaches (decision deferred to implementation):

MediaPipe Holistic — full pose/hand/face landmarks, classify from landmark positions. Runs in a separate process, sends classified gestures to Godot via local socket.
Lightweight CNN classifier — trained on gesture classes directly from webcam frames. Simpler pipeline, less accurate, runs in-process.

Either way, the Godot side only receives gesture labels (strings) — the detection pipeline is opaque to the animation system.

Conversation Loop

1. VAD detects speech end
   └─▶ AudioEffectCapture buffer captured by Godot audio server

2. Audio sent to STT service
   └─▶ HTTP POST to chatterbox-tts-service /api/stt
   └─▶ Returns transcribed text

3. Text + history sent to LLM backend
   └─▶ HTTP streaming request (SSE or chunked response)
   └─▶ Tokens arrive incrementally

4. SentenceStream buffers tokens into complete sentences
   └─▶ Each sentence immediately sent to TTS
   └─▶ First sentence plays while LLM still generates

5. EmotionExtractor strips [emotion] tags from each sentence
   └─▶ AnimationTree transitions to matching expression
   └─▶ TTS exaggeration parameter adjusted

6. TTS synthesizes speech per-sentence
   └─▶ Audio returned from chatterbox-tts-service
   └─▶ Played via AudioStreamPlayer

7. Lipsync drives mouth blendshape
   └─▶ AudioEffectSpectrumAnalyzer reads playback amplitude
   └─▶ Mapped to 'aa' (mouth open) blendshape per frame

8. On completion, AnimationTree returns to idle state
   └─▶ VAD resumes listening

Voice Interruption

When the user speaks while the AI is talking:

VAD detects speech onset during speaking state
interrupt() called on the conversation orchestrator
HTTP request to LLM aborted (stream cancelled)
AudioStreamPlayer stopped immediately
Partial response saved with [interrupted] marker in history
AnimationTree: speaking → interrupted (brief surprise) → listening

Desktop Overlay

Godot 4 transparent window configuration:

# In project.godot or at runtime:
DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_TRANSPARENT, true)
DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_ALWAYS_ON_TOP, true)
DisplayServer.window_set_flag(DisplayServer.WINDOW_FLAG_BORDERLESS, true)

# Transparent viewport
get_viewport().transparent_bg = true

# Click-through on transparent pixels (optional)
# Handled via input event detection on the character mesh

The result: the character floats on the desktop with no window chrome, visible above all other windows, with only the character model and minimal UI elements being interactive.

Animation Architecture

AnimationTree (AnimationNodeStateMachine)
│
├─ idle
│  ├─ Breathing: sine wave on chest/shoulder bones (always active)
│  ├─ Blink: random interval (2-6s), VRM 'blink' blendshape
│  ├─ Sway: subtle Perlin noise on hip/spine rotation
│  └─ LookAt: eyes track cursor via LookAtModifier3D (Desktop Gaze)
│
├─ listening
│  ├─ Head tilt toward user (Face-to-Face gaze)
│  ├─ Attentive posture (slight forward lean)
│  └─ Crossfade from idle (0.3s transition)
│
├─ processing
│  ├─ Look-away (eyes drift, head turns slightly)
│  ├─ Thinking pose (hand to chin, or finger tap)
│  └─ Subtle idle maintained underneath
│
├─ speaking
│  ├─ Engaged posture (shoulders open, slight forward lean)
│  ├─ Gesture layer (hand movements on sentence breaks)
│  ├─ Lipsync layer (AudioEffectSpectrumAnalyzer → mouth)
│  └─ Expression layer (emotion blendshapes from tags)
│
├─ interrupted
│  ├─ Brief surprise expression (0.2s)
│  └─ Transition to listening (0.3s)
│
└─ mirroring (overlay layer, active in Face-to-Face mode)
   ├─ Gesture response animations (wave, nod, tilt, etc.)
   ├─ Blended on top of current state animation
   └─ Priority: mirroring < speaking gestures < lipsync

Expression Blend Layer (runs on top of body animations):
  AnimationNodeBlendTree with 6 emotion inputs
  Smooth weight interpolation (lerp, ~0.3s transition)
  Driven by EmotionExtractor output

Emotion System

The LLM is prompted to embed emotion tags inline:

"[joy] That sounds wonderful! [curiosity] Tell me more about your day."

28 extended emotions map to 6 VRM blendshapes:

happy ← joy, excitement, love, amusement, admiration, gratitude, pride, optimism
sad ← grief, disappointment, remorse, sadness
angry ← anger, annoyance, disgust, disapproval
surprised ← surprise, confusion, curiosity, realization, fear, nervousness
relaxed ← caring, relief, calm, contentment
neutral ← embarrassment, desire

Emotions also influence:

TTS exaggeration — Chatterbox exaggeration parameter (0.0-1.0)
Gesture intensity — animation speed/amplitude scales with emotional state
Particle effects — optional sparkles for joy, dark aura for anger, etc.

Godot Node Tree

CompanionRoot (Node3D)
├── Camera3D (fixed, FOV 30, positioned at face level)
├── DirectionalLight3D
├── AmbientLight (WorldEnvironment)
├── AvatarRoot (Node3D)
│   ├── VRMModel (imported .vrm, Skeleton3D child)
│   │   ├── Skeleton3D (VRM humanoid bones)
│   │   ├── MeshInstance3D (body, hair, clothes)
│   │   └── LookAtModifier3D (gaze tracking)
│   ├── AnimationPlayer (imported VRM animations)
│   └── AnimationTree (state machine + expression blend + mirroring layer)
├── AudioStreamPlayer (TTS playback)
│   └── AudioEffectSpectrumAnalyzer (lipsync source)
├── AudioStreamPlayer (mic capture for VAD)
│   └── AudioEffectCapture
├── CameraFeed (webcam input for Face-to-Face mode)
│   └── GestureClassifier (pose detection → gesture labels)
└── UI (CanvasLayer)
    ├── ChatBubble (appears during conversation)
    ├── MicIndicator (shows VAD state)
    └── SettingsPanel (model/voice/backend config)

@model-boss Integration

GPU coordination is handled by @model-boss on the backend. The Godot app is a pure client — it makes HTTP requests to services that internally acquire GPU leases:

Whisper STT: Lease acquired per transcription request
Chatterbox TTS: Lease acquired per synthesis request
LLM inference: Lease held during streaming response

Concurrent TTS + STT (for interruption handling) is automatically coordinated by @model-boss's priority queue.

VRM Model Format

Chobit uses VRM models (.vrm files) loaded via the VRM4Godot addon:

VRoid Studio (free, Pixiv) — create custom models
VRoid Hub — download community models
UniVRM — convert from other 3D formats

Required blendshapes: happy, sad, angry, surprised, relaxed, neutral, aa (mouth open), blink

File Formats

Asset	Format	Location
VRM models	`.vrm`	`godot-desktop/models/`, `godot-mobile/models/`
Audio assets	`.wav`, `.ogg`, `.mp3`	`godot-desktop/audio/`
Shared GDScript	`.gd`	`shared/godot/` (symlinked as `src/`)
Platform GDScript	`.gd`	`godot-{platform}/platform/`
Scenes	`.tscn`	`godot-{platform}/scenes/`
Sidecar services	`.py`	`services/{bridge,tray,vision}/`
Protocol types	`.ts`	`packages/chobit-core/src/`

16 KiB Raw Blame History