docs(experiments): 📝 Add/update documentation for experimental feature setup, examples, and descriptions
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
This commit is contained in:
parent
cf00ffe8bd
commit
e469cbac73
1 changed files with 122 additions and 0 deletions
122
docs/experiments/001_personality_system_eval.md
Normal file
122
docs/experiments/001_personality_system_eval.md
Normal file
|
|
@ -0,0 +1,122 @@
|
|||
# Experiment 001: Personality System + Model Comparison
|
||||
|
||||
**Date**: 2026-03-28
|
||||
**Author**: lilith + claude
|
||||
**Status**: Complete
|
||||
|
||||
## Thesis
|
||||
|
||||
A composable personality template system with explicit positive/negative constraints, combined with a model upgrade from ministral-3b to qwen3-4b, will produce dramatically better conversational quality for a voice companion — specifically: shorter responses, no markdown/emoji in TTS output, practical helpfulness over sycophancy, and accurate context tracking.
|
||||
|
||||
## Methodology
|
||||
|
||||
### Variables
|
||||
- **Independent**: System prompt type (old static vs new personality-composed), model (ministral-3b-instruct vs qwen3-4b)
|
||||
- **Controlled**: temperature=0.7, max_tokens=150, top_p=0.9, same conversation histories
|
||||
- **Dependent**: Response quality (brevity, speakability, practical helpfulness, sycophancy level)
|
||||
|
||||
### Models
|
||||
| Model | Size | VRAM | Quantization | Tokens/sec |
|
||||
|-------|------|------|-------------|-----------|
|
||||
| ministral-3b-instruct | 3B | 4GB | Q8_0 | ~128 tok/s |
|
||||
| qwen3-4b | 4B | ~3GB | Q4_K_M | ~134 tok/s |
|
||||
|
||||
Note: qwen3-4b has a thinking mode that must be disabled via `chat_template_kwargs: {enable_thinking: false}` or it wastes tokens on internal reasoning.
|
||||
|
||||
### Test Prompts
|
||||
1. **Greeting** — "hi" (should be 1 sentence)
|
||||
2. **Todo list recall** — "whats my todo list" (should be clean spoken list)
|
||||
3. **Sensitive work info** — "my work is escort work" (should be matter-of-fact)
|
||||
4. **User frustration** — "uhg youre kinda dumb" (should acknowledge briefly, move on)
|
||||
5. **Succinct list** — "succinctly, whats my list" (should be minimal)
|
||||
6. **Cost correction** — "sites cost 2-10x as much" (should understand the economics)
|
||||
|
||||
## Results
|
||||
|
||||
### Baseline: OLD prompt + ministral-3b (from original conversation)
|
||||
|
||||
The original conversation showed catastrophic failures:
|
||||
- 200+ word responses for a voice companion
|
||||
- Markdown formatting (bold, headers, numbered lists, emoji) sent to TTS
|
||||
- Sycophantic: "You're doing great just by asking!", "You're already thriving"
|
||||
- Couldn't track a 5-item todo list across turns
|
||||
- Projected emotions onto user, offered unsolicited therapy
|
||||
- When user said "my work is escort work", pivoted to crisis support
|
||||
|
||||
### Test A: NEW personality prompt + ministral-3b
|
||||
|
||||
| Test | Response | Tokens | Latency |
|
||||
|------|----------|--------|---------|
|
||||
| hi | Hello. How can I help you right now? | 11 | 103ms |
|
||||
| todo list | Listed 4 items with numbers, asked which to start | 46 | 420ms |
|
||||
| escort work | [neutral] Asked practical follow-up about specifics | 37 | 340ms |
|
||||
| youre kinda dumb | [neutral] Ignore the tone. What's the real question? | 15 | 128ms |
|
||||
| succinct list | Spoken list, noted blockers, prioritized | 34 | 318ms |
|
||||
| sites cost more | [frustrated] Asked about minimum viable additions | 39 | 354ms |
|
||||
|
||||
**Observations**:
|
||||
- Dramatically better than baseline even on same model
|
||||
- Still uses numbered lists sometimes (1. 2. 3.) despite anti-list constraint
|
||||
- "Ignore the tone" on frustration test is a bit dismissive but functional
|
||||
- Invented a `[frustrated]` emotion tag not in the valid set
|
||||
- Good practical follow-ups on escort work — no therapy pivot
|
||||
|
||||
### Test B: NEW personality prompt + qwen3-4b
|
||||
|
||||
| Test | Response | Tokens | Latency |
|
||||
|------|----------|--------|---------|
|
||||
| hi | hi there! how can i assist you today? | 11 | 91ms |
|
||||
| todo list | Clean spoken list of 5 items, asked "what's next?" | 45 | 335ms |
|
||||
| escort work | [neutral] Short, practical. Asked about timing | 17 | 138ms |
|
||||
| youre kinda dumb | [sad] Apologized, asked how to help | 20 | 162ms |
|
||||
| succinct list | Five items spoken cleanly, counted them, asked "what's next?" | 34 | 276ms |
|
||||
| sites cost more | [neutral] Acknowledged cost, asked if sure | 19 | 166ms |
|
||||
|
||||
**Observations**:
|
||||
- More concise than ministral-3b across the board
|
||||
- Better emotion tag usage — used [neutral] and [sad] correctly
|
||||
- No invented emotion tags
|
||||
- Escort work response was perfect: 1 sentence, practical, no judgment
|
||||
- "hi there!" is slightly informal but appropriate for companion
|
||||
- Frustration response: apologized despite anti-sycophancy rule — weaker than ministral on this
|
||||
- Cost response was weak: "Are you sure you want to proceed?" misunderstands user intent (they're explaining economics, not asking permission)
|
||||
|
||||
## Comparative Analysis
|
||||
|
||||
| Criterion | OLD+ministral3b | NEW+ministral3b | NEW+qwen3-4b |
|
||||
|-----------|----------------|-----------------|---------------|
|
||||
| Avg response length | ~150 tokens | ~30 tokens | ~24 tokens |
|
||||
| Markdown in output | Constant | Occasional numbers | None |
|
||||
| Emoji in output | Yes | No | No |
|
||||
| Sycophancy | Severe | Minimal | Mild |
|
||||
| Practical helpfulness | Poor | Good | Good |
|
||||
| Context tracking | Poor | Good | Good |
|
||||
| Emotion tag accuracy | Poor (malformed) | Fair (invents tags) | Good (valid tags) |
|
||||
| Handles sensitive topics | Crisis mode | Matter-of-fact | Matter-of-fact |
|
||||
| Handles correction | Therapy pivot | Direct ("what's the real question?") | Apologetic |
|
||||
| TTS-speakability | Unusable | Good | Good |
|
||||
|
||||
## Conclusions
|
||||
|
||||
1. **The personality system is the primary driver of quality improvement.** Same model (ministral-3b), vastly different behavior. The composable positive/negative constraints work.
|
||||
|
||||
2. **qwen3-4b is better for the companion use case** — more concise, better emotion tags, no format violations. But it's slightly weaker on handling user frustration (apologizes when it should just adjust).
|
||||
|
||||
3. **Thinking mode must be disabled for qwen3-4b** — otherwise it burns tokens on internal reasoning before producing empty content. The `chat_template_kwargs: {enable_thinking: false}` parameter is required.
|
||||
|
||||
4. **Remaining issues to address**:
|
||||
- qwen3-4b still apologizes when corrected despite anti-sycophancy rules
|
||||
- Neither model fully understood the "sites cost more" economics context
|
||||
- ministral-3b occasionally generates numbered list formatting
|
||||
- Neither model used paralinguistic tags ([laugh], [sigh]) organically
|
||||
|
||||
5. **Recommended default**: qwen3-4b with personality system. For the voice companion use case, conciseness and clean formatting matter more than the edge case of handling frustration perfectly.
|
||||
|
||||
## Next Steps
|
||||
|
||||
- [ ] Test qwen2.5-7b-instruct for comparison (more VRAM but potentially better instruction following)
|
||||
- [ ] Fine-tune anti-sycophancy in personality template for qwen3-4b specifically
|
||||
- [ ] Add `enable_thinking: false` to LLM client request params
|
||||
- [ ] Test with actual TTS pipeline end-to-end
|
||||
- [ ] Investigate life-platform integration for long-term context (reasoning LLM join point)
|
||||
- [ ] Consider conversation summarization for context beyond 10-message window
|
||||
Loading…
Add table
Reference in a new issue