Chobit Project Management
Stream-based project management for the Chobit interactive AI companion.
Directory Structure
.project/
├── README.md # This file
├── streams/ # Active feature workstreams
│ └── <stream-name>/
│ ├── README.md # Feature overview and architecture
│ ├── STATUS.md # Current progress and blockers
│ ├── HANDOFF.md # Session handoff context
│ └── NOTES.md # Technical decisions and learnings
├── history/ # Completed work records
│ └── YYYYMMDD_description.md
└── templates/ # Stream templates
Active Streams
None active.
Milestone Roadmap
M0: Project Setup ✅
- Godot 4 project initialized with transparent window config
- chobit-core TypeScript package (ConversationState FSM, SentenceStream, EmotionExtractor)
- gdtoolkit-config synced (gdlintrc, gdformatrc)
- EventBus autoload with conversation lifecycle signals
- Architecture docs, .gitignore, project structure
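
The ConversationState FSM mentioned above could look roughly like the sketch below. This is not the chobit-core implementation: the state names are borrowed from the AnimationTree states listed under M2, while the event names, the transition table, and the `ConversationFsm` class are assumptions made for illustration.

```typescript
// Hypothetical sketch of a conversation-state FSM.
// State names follow the M2 AnimationTree list; events and transitions are assumed.
type ConversationState = "idle" | "listening" | "processing" | "speaking" | "interrupted";
type ConversationEvent = "speech_start" | "speech_end" | "reply_start" | "reply_end" | "barge_in";
type TransitionListener = (from: ConversationState, to: ConversationState) => void;

// For each state, the events it accepts and the state each one leads to.
const TRANSITIONS: Record<ConversationState, Partial<Record<ConversationEvent, ConversationState>>> = {
  idle: { speech_start: "listening" },
  listening: { speech_end: "processing" },
  processing: { reply_start: "speaking", barge_in: "interrupted" },
  speaking: { reply_end: "idle", barge_in: "interrupted" },
  interrupted: { speech_start: "listening" },
};

class ConversationFsm {
  private current: ConversationState = "idle";
  private listeners: TransitionListener[] = [];

  get state(): ConversationState {
    return this.current;
  }

  // Registers a callback that fires after every successful transition.
  onTransition(listener: TransitionListener): void {
    this.listeners.push(listener);
  }

  // Returns true if the event caused a transition; events the current
  // state does not accept are ignored and return false.
  dispatch(event: ConversationEvent): boolean {
    const next = TRANSITIONS[this.current][event];
    if (next === undefined) {
      return false;
    }
    const previous = this.current;
    this.current = next;
    for (const listener of this.listeners) {
      listener(previous, next);
    }
    return true;
  }
}
```

Keeping the transitions in a single table makes illegal moves (for example `idle` straight to `speaking`) a no-op rather than a silent state corruption, and the listener hook is one plausible place to emit EventBus-style lifecycle signals.
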
M1: Godot Skeleton ✅
- VRM4Godot addon installed
- VRM models loaded (Miku.vrm, Seed-san.vrm)
- companion.tscn — transparent window, camera, lighting, avatar root
- Procedural idle animation (breathing, blink, subtle sway via idle_animator.gd)
- Desktop overlay verified (transparent, always-on-top, borderless)
M2: Avatar Animation & Attention System ✅
- AnimationTree FSM (idle, listening, processing, speaking, interrupted)
- Expression blendshapes (6 VRM expressions via expression_controller.gd)
- Desktop Gaze — cursor tracking (gaze_controller.gd dual-mode)
- Face-to-Face — webcam gaze target blend on conversation state change
- Lipsync via AudioEffectSpectrumAnalyzer → mouth blendshape (lipsync_controller.gd)
- attention_reactor.gd for event-driven gaze/posture reactions
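
As an illustration of the amplitude-based lipsync idea, the sketch below maps a linear magnitude (such as one read from a spectrum analyzer band) to a 0..1 mouth-open weight and smooths it between frames. It is written in TypeScript for consistency with chobit-core and is not a port of lipsync_controller.gd; the `MIN_DB` floor, the smoothing factor, and all names are assumptions.

```typescript
// Hypothetical tuning constant: magnitudes at or below this level keep the mouth closed.
const MIN_DB = -60;

// Converts a linear magnitude (1.0 = full scale) to a mouth-open weight in [0, 1].
function magnitudeToMouthOpen(magnitude: number): number {
  if (magnitude <= 0) {
    return 0;
  }
  const db = 20 * Math.log10(magnitude);
  // Rescale [MIN_DB, 0] dB to [0, 1], then clamp.
  const normalised = (db - MIN_DB) / -MIN_DB;
  return Math.min(1, Math.max(0, normalised));
}

// Exponential smoothing so the mouth does not flicker between frames.
class MouthSmoother {
  private value = 0;
  private readonly smoothing: number;

  // smoothing is the fraction of the remaining distance covered per update (assumed 0.5).
  constructor(smoothing: number = 0.5) {
    this.smoothing = smoothing;
  }

  update(target: number): number {
    this.value += (target - this.value) * this.smoothing;
    return this.value;
  }
}
```

A per-frame caller would compute `smoother.update(magnitudeToMouthOpen(m))` and write the result to the mouth blendshape; working in decibels rather than raw amplitude tends to give quiet speech a visible mouth movement.
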
M3: Sidecars & Tray Integration ✅
- vision/ sidecar: MediaPipe face tracking → Redis eventbus (chobit.gaze., chobit.face.)
- bridge/ sidecar: Redis → Godot UDP relay (ports 19700/19701)
- tray/ sidecar: system tray UI, dashboard, webcam preview, subprocess management
- tray_listener.gd: receives UDP events from bridge, drives gaze and companion behavior
- ./run script: start/stop/restart/verify/editor/screenshot
M4: Voice Pipeline ✅
- microphone.gd: AudioEffectCapture + energy-based VAD
- stt_client.gd: HTTP client for @speech-synthesis Whisper endpoint
- tts_client.gd: HTTP client for Chatterbox TTS endpoint
- sound_engine.gd + sound_config.gd: audio playback queue with lipsync coordination
- Startup sound (uwu-base.mp3)
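
Energy-based VAD can be sketched as an RMS threshold with a short hangover so brief pauses do not end an utterance. The TypeScript below is a minimal illustration, not the logic in microphone.gd; the threshold, the hangover length, and the `EnergyVad` class are assumed values and names.

```typescript
// Root-mean-square energy of one audio frame (samples in [-1, 1]).
function rmsEnergy(frame: Float32Array): number {
  if (frame.length === 0) {
    return 0;
  }
  const sumOfSquares = frame.reduce((acc, sample) => acc + sample * sample, 0);
  return Math.sqrt(sumOfSquares / frame.length);
}

type VadEvent = "speech_start" | "speech_end" | null;

// Hypothetical energy-based voice activity detector.
class EnergyVad {
  private speaking = false;
  private silentFrames = 0;
  private readonly threshold: number;
  private readonly hangoverFrames: number;

  // threshold: RMS level treated as speech (assumed 0.02).
  // hangoverFrames: consecutive quiet frames required to end speech (assumed 3).
  constructor(threshold: number = 0.02, hangoverFrames: number = 3) {
    this.threshold = threshold;
    this.hangoverFrames = hangoverFrames;
  }

  // Feeds one frame; returns an event only when the speaking state changes.
  process(frame: Float32Array): VadEvent {
    const loud = rmsEnergy(frame) >= this.threshold;
    if (loud) {
      this.silentFrames = 0;
      if (!this.speaking) {
        this.speaking = true;
        return "speech_start";
      }
      return null;
    }
    if (this.speaking) {
      this.silentFrames += 1;
      if (this.silentFrames >= this.hangoverFrames) {
        this.speaking = false;
        this.silentFrames = 0;
        return "speech_end";
      }
    }
    return null;
  }
}
```

A fixed threshold is the simplest choice and is sensitive to microphone gain and room noise; the dual-tier pattern noted under Research References (RealtimeSTT) is one alternative if that becomes a problem.
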
M5: Conversation Loop ✅
- llm_client.gd: HTTP streaming, OpenAI-compatible
- conversation_orchestrator.gd: full VAD→STT→LLM→TTS→avatar loop
- Sentence-level streaming matching chobit-core SentenceStream
- Emotion extraction matching chobit-core EmotionExtractor
- Voice interruption (cancel stream, stop audio, return to listening)
- chat_window.gd: chat bubble UI, context_menu.gd, sound_settings_window.gd
- window_drag.gd, window_zoom.gd, edge_snap.gd: window management
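
Sentence-level streaming and inline emotion tags can be illustrated together: buffer the streamed LLM text, release each sentence as soon as it is complete so TTS can start early, and strip emotion tags from each sentence before speaking it. The sketch below is not the chobit-core SentenceStream or EmotionExtractor; the `[tag]` syntax, the boundary rule, and all names are assumptions.

```typescript
// Assumed boundary rule: a sentence ends at ., ! or ? followed by whitespace.
const SENTENCE_END = /[.!?]\s/;
// Assumed tag syntax: a lowercase word in square brackets, e.g. "[joy]".
const EMOTION_TAG = /\[([a-z]+)\]/g;

interface TaggedSentence {
  text: string;
  emotions: string[];
}

// Hypothetical sentence buffer for streamed text chunks.
class SentenceBuffer {
  private buffer = "";

  // Appends a chunk and returns every sentence it completed.
  push(chunk: string): string[] {
    this.buffer += chunk;
    const sentences: string[] = [];
    let end = this.buffer.search(SENTENCE_END);
    while (end !== -1) {
      sentences.push(this.buffer.slice(0, end + 1).trim());
      this.buffer = this.buffer.slice(end + 1).replace(/^\s+/, "");
      end = this.buffer.search(SENTENCE_END);
    }
    return sentences;
  }

  // Returns whatever text remains once the stream has ended.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? [rest] : [];
  }
}

// Removes emotion tags from a sentence and reports which ones were found.
function extractEmotions(sentence: string): TaggedSentence {
  const emotions: string[] = [];
  const text = sentence
    .replace(EMOTION_TAG, (_match: string, name: string) => {
      emotions.push(name);
      return "";
    })
    .replace(/\s+/g, " ")
    .trim();
  return { text, emotions };
}
```

Requiring whitespace after the terminator means a trailing "3." is held back until the next chunk shows whether it continues as "3.5", at the cost of needing `flush()` once the stream closes.
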
M6: LifeAI Integration 🔲
- Connect to LifeAI companion service endpoint
- Persona and character context from LifeAI
- User life context (habits, goals, schedule)
- Embed as desktop companion for the @life platform
M7: Polish 🔲
- Toon/anime shader for character rendering
- Particle effects for emotional states
- Hair/cloth physics (VRM spring bones)
- Gesture animations on sentence breaks
- Multi-monitor awareness improvements
Key Technical Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Client engine | Godot 4 | Native 3D, AnimationTree, IK, physics, and transparent windows, without the overhead of running WebGL in a webview |
| Avatar format | VRM | Open standard, huge ecosystem (VRoid), standardized blendshapes and bones |
| Voice detection | In-app VAD | Godot audio server provides AudioEffectCapture for mic input |
| Backend protocol | HTTP/WebSocket | Standard, matches existing @speech-synthesis and @model-boss APIs |
| Emotion system | LLM inline tags | Simpler than separate classifier, no extra model/GPU needed |
| Lipsync | Amplitude-based | AudioEffectSpectrumAnalyzer built into Godot, no external tooling |
| Attention system | Dual-mode (Desktop Gaze + Face-to-Face) | Desktop Gaze for ambient companionship, Face-to-Face for conversation engagement |
| Motion response | Gesture mirroring (classify → animate) | The companion has its own personality rather than acting as a puppet; curated animations instead of raw skeleton retargeting |
| Gesture detection | External process → labels over socket | Keeps Godot focused on rendering; ML runs separately |
Research References
Cloned to ~/Code/@forks/ (2026-03-26):
- Open-LLM-VTuber — best modular architecture, sentence streaming, emotion tags
- Soul-of-Waifu — VRM + @pixiv/three-vrm, GoEmotions classifier, Mixamo animations
- GPT-SoVITS — voice cloning comparison to Chatterbox
- RealtimeSTT — dual-tier VAD pattern (WebRTC + Silero)
- speech-to-speech (HuggingFace) — thread-per-handler pipeline, WebSocket streaming
- local-talking-llm — Chatterbox emotion→exaggeration mapping