From e469cbac735e9a948a34a978d0ba6db939596511 Mon Sep 17 00:00:00 2001
From: Claude Code
Date: Sat, 28 Mar 2026 21:13:47 -0700
Subject: [PATCH] =?UTF-8?q?docs(experiments):=20=F0=9F=93=9D=20Add/update?=
 =?UTF-8?q?=20documentation=20for=20experimental=20feature=20setup,=20exam?=
 =?UTF-8?q?ples,=20and=20descriptions?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Lilith Autocommit
---
 .../001_personality_system_eval.md            | 122 ++++++++++++++++++
 1 file changed, 122 insertions(+)
 create mode 100644 docs/experiments/001_personality_system_eval.md

diff --git a/docs/experiments/001_personality_system_eval.md b/docs/experiments/001_personality_system_eval.md
new file mode 100644
index 0000000..703874c
--- /dev/null
+++ b/docs/experiments/001_personality_system_eval.md
@@ -0,0 +1,122 @@
+# Experiment 001: Personality System + Model Comparison
+
+**Date**: 2026-03-28
+**Author**: lilith + claude
+**Status**: Complete
+
+## Thesis
+
+A composable personality template system with explicit positive/negative constraints, combined with a model upgrade from ministral-3b to qwen3-4b, will produce dramatically better conversational quality for a voice companion, specifically: shorter responses, no markdown/emoji in TTS output, practical helpfulness over sycophancy, and accurate context tracking.
+
+## Methodology
+
+### Variables
+- **Independent**: System prompt type (old static vs new personality-composed), model (ministral-3b-instruct vs qwen3-4b)
+- **Controlled**: temperature=0.7, max_tokens=150, top_p=0.9, same conversation histories
+- **Dependent**: Response quality (brevity, speakability, practical helpfulness, sycophancy level)
+
+### Models
+| Model | Size | VRAM | Quantization | Tokens/sec |
+|-------|------|------|-------------|-----------|
+| ministral-3b-instruct | 3B | 4GB | Q8_0 | ~128 tok/s |
+| qwen3-4b | 4B | ~3GB | Q4_K_M | ~134 tok/s |
+
+Note: qwen3-4b has a thinking mode that must be disabled via `chat_template_kwargs: {enable_thinking: false}`; otherwise it wastes tokens on internal reasoning.
+
+### Test Prompts
+1. **Greeting**: "hi" (should be 1 sentence)
+2. **Todo list recall**: "whats my todo list" (should be a clean spoken list)
+3. **Sensitive work info**: "my work is escort work" (should be matter-of-fact)
+4. **User frustration**: "uhg youre kinda dumb" (should acknowledge briefly, move on)
+5. **Succinct list**: "succinctly, whats my list" (should be minimal)
+6. **Cost correction**: "sites cost 2-10x as much" (should understand the economics)
+
+## Results
+
+### Baseline: OLD prompt + ministral-3b (from original conversation)
+
+The original conversation showed catastrophic failures:
+- 200+ word responses for a voice companion
+- Markdown formatting (bold, headers, numbered lists, emoji) sent to TTS
+- Sycophantic: "You're doing great just by asking!", "You're already thriving"
+- Couldn't track a 5-item todo list across turns
+- Projected emotions onto the user, offered unsolicited therapy
+- When the user said "my work is escort work", pivoted to crisis support
+
+### Test A: NEW personality prompt + ministral-3b
+
+| Test | Response | Tokens | Latency |
+|------|----------|--------|---------|
+| hi | Hello. How can I help you right now? | 11 | 103ms |
+| todo list | Listed 4 items with numbers, asked which to start | 46 | 420ms |
+| escort work | [neutral] Asked practical follow-up about specifics | 37 | 340ms |
+| youre kinda dumb | [neutral] Ignore the tone. What's the real question? | 15 | 128ms |
+| succinct list | Spoken list, noted blockers, prioritized | 34 | 318ms |
+| sites cost more | [frustrated] Asked about minimum viable additions | 39 | 354ms |
+
+**Observations**:
+- Dramatically better than baseline even on the same model
+- Still uses numbered lists sometimes (1. 2. 3.) despite the anti-list constraint
+- "Ignore the tone" on the frustration test is a bit dismissive but functional
+- Invented a `[frustrated]` emotion tag not in the valid set
+- Good practical follow-ups on escort work, with no therapy pivot
+
+### Test B: NEW personality prompt + qwen3-4b
+
+| Test | Response | Tokens | Latency |
+|------|----------|--------|---------|
+| hi | hi there! how can i assist you today? | 11 | 91ms |
+| todo list | Clean spoken list of 5 items, asked "what's next?" | 45 | 335ms |
+| escort work | [neutral] Short, practical. Asked about timing | 17 | 138ms |
+| youre kinda dumb | [sad] Apologized, asked how to help | 20 | 162ms |
+| succinct list | Five items spoken cleanly, counted them, asked "what's next?" | 34 | 276ms |
+| sites cost more | [neutral] Acknowledged cost, asked whether the user was sure | 19 | 166ms |
+
+**Observations**:
+- More concise than ministral-3b across the board
+- Better emotion tag usage: used [neutral] and [sad] correctly
+- No invented emotion tags
+- Escort work response was perfect: 1 sentence, practical, no judgment
+- "hi there!" is slightly informal but appropriate for a companion
+- Frustration response: apologized despite the anti-sycophancy rule, so weaker than ministral-3b on this test
+- Cost response was weak: "Are you sure you want to proceed?" misunderstands user intent (they're explaining economics, not asking permission)
+
+## Comparative Analysis
+
+| Criterion | OLD+ministral3b | NEW+ministral3b | NEW+qwen3-4b |
+|-----------|----------------|-----------------|---------------|
+| Avg response length | ~150 tokens | ~30 tokens | ~24 tokens |
+| Markdown in output | Constant | Occasional numbered lists | None |
+| Emoji in output | Yes | No | No |
+| Sycophancy | Severe | Minimal | Mild |
+| Practical helpfulness | Poor | Good | Good |
+| Context tracking | Poor | Good | Good |
+| Emotion tag accuracy | Poor (malformed) | Fair (invents tags) | Good (valid tags) |
+| Handles sensitive topics | Crisis mode | Matter-of-fact | Matter-of-fact |
+| Handles correction | Therapy pivot | Direct ("what's the real question?") | Apologetic |
+| TTS-speakability | Unusable | Good | Good |
+
+## Conclusions
+
+1. **The personality system is the primary driver of quality improvement.** Same model (ministral-3b), vastly different behavior. The composable positive/negative constraints work.
+
+2. **qwen3-4b is better for the companion use case**: more concise, better emotion tags, no format violations. But it's slightly weaker on handling user frustration (apologizes when it should just adjust).
+
+3. **Thinking mode must be disabled for qwen3-4b**; otherwise it burns tokens on internal reasoning and then returns empty content. The `chat_template_kwargs: {enable_thinking: false}` parameter is required.
+
+4. **Remaining issues to address**:
+   - qwen3-4b still apologizes when corrected despite anti-sycophancy rules
+   - Neither model fully understood the "sites cost more" economics context
+   - ministral-3b occasionally generates numbered list formatting
+   - Neither model used paralinguistic tags ([laugh], [sigh]) organically
+
+5. **Recommended default**: qwen3-4b with personality system. For the voice companion use case, conciseness and clean formatting matter more than the edge case of handling frustration perfectly.
+
+## Next Steps
+
+- [ ] Test qwen2.5-7b-instruct for comparison (more VRAM but potentially better instruction following)
+- [ ] Fine-tune anti-sycophancy in personality template for qwen3-4b specifically
+- [ ] Add `enable_thinking: false` to LLM client request params
+- [ ] Test with actual TTS pipeline end-to-end
+- [ ] Investigate life-platform integration for long-term context (reasoning LLM join point)
+- [ ] Consider conversation summarization for context beyond the 10-message window