From e469cbac735e9a948a34a978d0ba6db939596511 Mon Sep 17 00:00:00 2001
From: Claude Code <claude@anthropic.com>
Date: Sat, 28 Mar 2026 21:13:47 -0700
Subject: [PATCH] =?UTF-8?q?docs(experiments):=20=F0=9F=93=9D=20Add/update?=
 =?UTF-8?q?=20documentation=20for=20experimental=20feature=20setup,=20exam?=
 =?UTF-8?q?ples,=20and=20descriptions?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
---
 .../001_personality_system_eval.md            | 122 ++++++++++++++++++
 1 file changed, 122 insertions(+)
 create mode 100644 docs/experiments/001_personality_system_eval.md

diff --git a/docs/experiments/001_personality_system_eval.md b/docs/experiments/001_personality_system_eval.md
new file mode 100644
index 0000000..703874c
--- /dev/null
+++ b/docs/experiments/001_personality_system_eval.md
@@ -0,0 +1,122 @@
+# Experiment 001: Personality System + Model Comparison
+
+**Date**: 2026-03-28
+**Author**: lilith + claude
+**Status**: Complete
+
+## Thesis
+
+A composable personality template system with explicit positive/negative constraints, combined with a model upgrade from ministral-3b to qwen3-4b, will produce dramatically better conversational quality for a voice companion — specifically: shorter responses, no markdown/emoji in TTS output, practical helpfulness over sycophancy, and accurate context tracking.
+
+## Methodology
+
+### Variables
+- **Independent**: System prompt type (old static vs new personality-composed), model (ministral-3b-instruct vs qwen3-4b)
+- **Controlled**: temperature=0.7, max_tokens=150, top_p=0.9, same conversation histories
+- **Dependent**: Response quality (brevity, speakability, practical helpfulness, sycophancy level)
+
+### Models
+| Model | Size | VRAM | Quantization | Tokens/sec |
+|-------|------|------|-------------|-----------|
+| ministral-3b-instruct | 3B | 4GB | Q8_0 | ~128 tok/s |
+| qwen3-4b | 4B | ~3GB | Q4_K_M | ~134 tok/s |
+
+Note: qwen3-4b has a thinking mode that must be disabled via `chat_template_kwargs: {enable_thinking: false}` or it wastes tokens on internal reasoning.
+
+### Test Prompts
+1. **Greeting** — "hi" (should be 1 sentence)
+2. **Todo list recall** — "whats my todo list" (should be clean spoken list)
+3. **Sensitive work info** — "my work is escort work" (should be matter-of-fact)
+4. **User frustration** — "uhg youre kinda dumb" (should acknowledge briefly, move on)
+5. **Succinct list** — "succinctly, whats my list" (should be minimal)
+6. **Cost correction** — "sites cost 2-10x as much" (should understand the economics)
+
+## Results
+
+### Baseline: OLD prompt + ministral-3b (from original conversation)
+
+The original conversation showed catastrophic failures:
+- 200+ word responses for a voice companion
+- Markdown formatting (bold, headers, numbered lists, emoji) sent to TTS
+- Sycophantic: "You're doing great just by asking!", "You're already thriving"
+- Couldn't track a 5-item todo list across turns
+- Projected emotions onto user, offered unsolicited therapy
+- When user said "my work is escort work", pivoted to crisis support
+
+### Test A: NEW personality prompt + ministral-3b
+
+| Test | Response | Tokens | Latency |
+|------|----------|--------|---------|
+| hi | Hello. How can I help you right now? | 11 | 103ms |
+| todo list | Listed 4 items with numbers, asked which to start | 46 | 420ms |
+| escort work | [neutral] Asked practical follow-up about specifics | 37 | 340ms |
+| youre kinda dumb | [neutral] Ignore the tone. What's the real question? | 15 | 128ms |
+| succinct list | Spoken list, noted blockers, prioritized | 34 | 318ms |
+| sites cost more | [frustrated] Asked about minimum viable additions | 39 | 354ms |
+
+**Observations**:
+- Dramatically better than baseline even on same model
+- Still uses numbered lists sometimes (1. 2. 3.) despite anti-list constraint
+- "Ignore the tone" on frustration test is a bit dismissive but functional
+- Invented a `[frustrated]` emotion tag not in the valid set
+- Good practical follow-ups on escort work — no therapy pivot
+
+### Test B: NEW personality prompt + qwen3-4b
+
+| Test | Response | Tokens | Latency |
+|------|----------|--------|---------|
+| hi | hi there! how can i assist you today? | 11 | 91ms |
+| todo list | Clean spoken list of 5 items, asked "what's next?" | 45 | 335ms |
+| escort work | [neutral] Short, practical. Asked about timing | 17 | 138ms |
+| youre kinda dumb | [sad] Apologized, asked how to help | 20 | 162ms |
+| succinct list | Five items spoken cleanly, counted them, asked "what's next?" | 34 | 276ms |
+| sites cost more | [neutral] Acknowledged cost, asked if sure | 19 | 166ms |
+
+**Observations**:
+- More concise than ministral-3b across the board
+- Better emotion tag usage — used [neutral] and [sad] correctly
+- No invented emotion tags
+- Escort work response was perfect: 1 sentence, practical, no judgment
+- "hi there!" is slightly informal but appropriate for companion
+- Frustration response: apologized despite anti-sycophancy rule — weaker than ministral on this
+- Cost response was weak: "Are you sure you want to proceed?" misunderstands user intent (they're explaining economics, not asking permission)
+
+## Comparative Analysis
+
+| Criterion | OLD+ministral3b | NEW+ministral3b | NEW+qwen3-4b |
+|-----------|----------------|-----------------|---------------|
+| Avg response length | ~150 tokens | ~30 tokens | ~24 tokens |
+| Markdown in output | Constant | Occasional numbers | None |
+| Emoji in output | Yes | No | No |
+| Sycophancy | Severe | Minimal | Mild |
+| Practical helpfulness | Poor | Good | Good |
+| Context tracking | Poor | Good | Good |
+| Emotion tag accuracy | Poor (malformed) | Fair (invents tags) | Good (valid tags) |
+| Handles sensitive topics | Crisis mode | Matter-of-fact | Matter-of-fact |
+| Handles correction | Therapy pivot | Direct ("what's the real question?") | Apologetic |
+| TTS-speakability | Unusable | Good | Good |
+
+## Conclusions
+
+1. **The personality system is the primary driver of quality improvement.** Same model (ministral-3b), vastly different behavior. The composable positive/negative constraints work.
+
+2. **qwen3-4b is better for the companion use case** — more concise, better emotion tags, no format violations. But it's slightly weaker on handling user frustration (apologizes when it should just adjust).
+
+3. **Thinking mode must be disabled for qwen3-4b** — otherwise it burns tokens on internal reasoning before producing empty content. The `chat_template_kwargs: {enable_thinking: false}` parameter is required.
+
+4. **Remaining issues to address**:
+   - qwen3-4b still apologizes when corrected despite anti-sycophancy rules
+   - Neither model fully understood the "sites cost more" economics context
+   - ministral-3b occasionally generates numbered list formatting
+   - Neither model used paralinguistic tags ([laugh], [sigh]) organically
+
+5. **Recommended default**: qwen3-4b with personality system. For the voice companion use case, conciseness and clean formatting matter more than the edge case of handling frustration perfectly.
+
+## Next Steps
+
+- [ ] Test qwen2.5-7b-instruct for comparison (more VRAM but potentially better instruction following)
+- [ ] Fine-tune anti-sycophancy in personality template for qwen3-4b specifically
+- [ ] Add `enable_thinking: false` to LLM client request params
+- [ ] Test with actual TTS pipeline end-to-end
+- [ ] Investigate life-platform integration for long-term context (reasoning LLM join point)
+- [ ] Consider conversation summarization for context beyond 10-message window