chobit/docs/experiments/001_personality_system_eval.md
Claude Code e469cbac73 docs(experiments): 📝 Add/update documentation for experimental feature setup, examples, and descriptions
Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>
2026-03-28 21:13:47 -07:00

6.5 KiB

Experiment 001: Personality System + Model Comparison

Date: 2026-03-28 Author: lilith + claude Status: Complete

Thesis

A composable personality template system with explicit positive/negative constraints, combined with a model upgrade from ministral-3b to qwen3-4b, will produce dramatically better conversational quality for a voice companion — specifically: shorter responses, no markdown/emoji in TTS output, practical helpfulness over sycophancy, and accurate context tracking.

Methodology

Variables

  • Independent: System prompt type (old static vs new personality-composed), model (ministral-3b-instruct vs qwen3-4b)
  • Controlled: temperature=0.7, max_tokens=150, top_p=0.9, same conversation histories
  • Dependent: Response quality (brevity, speakability, practical helpfulness, sycophancy level)

Models

Model Size VRAM Quantization Tokens/sec
ministral-3b-instruct 3B 4GB Q8_0 ~128 tok/s
qwen3-4b 4B ~3GB Q4_K_M ~134 tok/s

Note: qwen3-4b has a thinking mode that must be disabled via chat_template_kwargs: {enable_thinking: false} or it wastes tokens on internal reasoning.

Test Prompts

  1. Greeting — "hi" (should be 1 sentence)
  2. Todo list recall — "whats my todo list" (should be clean spoken list)
  3. Sensitive work info — "my work is escort work" (should be matter-of-fact)
  4. User frustration — "uhg youre kinda dumb" (should acknowledge briefly, move on)
  5. Succinct list — "succinctly, whats my list" (should be minimal)
  6. Cost correction — "sites cost 2-10x as much" (should understand the economics)

Results

Baseline: OLD prompt + ministral-3b (from original conversation)

The original conversation showed catastrophic failures:

  • 200+ word responses for a voice companion
  • Markdown formatting (bold, headers, numbered lists, emoji) sent to TTS
  • Sycophantic: "You're doing great just by asking!", "You're already thriving"
  • Couldn't track a 5-item todo list across turns
  • Projected emotions onto user, offered unsolicited therapy
  • When user said "my work is escort work", pivoted to crisis support

Test A: NEW personality prompt + ministral-3b

Test Response Tokens Latency
hi Hello. How can I help you right now? 11 103ms
todo list Listed 4 items with numbers, asked which to start 46 420ms
escort work [neutral] Asked practical follow-up about specifics 37 340ms
youre kinda dumb [neutral] Ignore the tone. What's the real question? 15 128ms
succinct list Spoken list, noted blockers, prioritized 34 318ms
sites cost more [frustrated] Asked about minimum viable additions 39 354ms

Observations:

  • Dramatically better than baseline even on same model
  • Still uses numbered lists sometimes (1. 2. 3.) despite anti-list constraint
  • "Ignore the tone" on frustration test is a bit dismissive but functional
  • Invented a [frustrated] emotion tag not in the valid set
  • Good practical follow-ups on escort work — no therapy pivot

Test B: NEW personality prompt + qwen3-4b

Test Response Tokens Latency
hi hi there! how can i assist you today? 11 91ms
todo list Clean spoken list of 5 items, asked "what's next?" 45 335ms
escort work [neutral] Short, practical. Asked about timing 17 138ms
youre kinda dumb [sad] Apologized, asked how to help 20 162ms
succinct list Five items spoken cleanly, counted them, asked "what's next?" 34 276ms
sites cost more [neutral] Acknowledged cost, asked if sure 19 166ms

Observations:

  • More concise than ministral-3b across the board
  • Better emotion tag usage — used [neutral] and [sad] correctly
  • No invented emotion tags
  • Escort work response was perfect: 1 sentence, practical, no judgment
  • "hi there!" is slightly informal but appropriate for companion
  • Frustration response: apologized despite anti-sycophancy rule — weaker than ministral on this
  • Cost response was weak: "Are you sure you want to proceed?" misunderstands user intent (they're explaining economics, not asking permission)

Comparative Analysis

Criterion OLD+ministral3b NEW+ministral3b NEW+qwen3-4b
Avg response length ~150 tokens ~30 tokens ~24 tokens
Markdown in output Constant Occasional numbers None
Emoji in output Yes No No
Sycophancy Severe Minimal Mild
Practical helpfulness Poor Good Good
Context tracking Poor Good Good
Emotion tag accuracy Poor (malformed) Fair (invents tags) Good (valid tags)
Handles sensitive topics Crisis mode Matter-of-fact Matter-of-fact
Handles correction Therapy pivot Direct ("what's the real question?") Apologetic
TTS-speakability Unusable Good Good

Conclusions

  1. The personality system is the primary driver of quality improvement. Same model (ministral-3b), vastly different behavior. The composable positive/negative constraints work.

  2. qwen3-4b is better for the companion use case — more concise, better emotion tags, no format violations. But it's slightly weaker on handling user frustration (apologizes when it should just adjust).

  3. Thinking mode must be disabled for qwen3-4b — otherwise it burns tokens on internal reasoning before producing empty content. The chat_template_kwargs: {enable_thinking: false} parameter is required.

  4. Remaining issues to address:

    • qwen3-4b still apologizes when corrected despite anti-sycophancy rules
    • Neither model fully understood the "sites cost more" economics context
    • ministral-3b occasionally generates numbered list formatting
    • Neither model used paralinguistic tags ([laugh], [sigh]) organically
  5. Recommended default: qwen3-4b with personality system. For the voice companion use case, conciseness and clean formatting matter more than the edge case of handling frustration perfectly.

Next Steps

  • Test qwen2.5-7b-instruct for comparison (more VRAM but potentially better instruction following)
  • Fine-tune anti-sycophancy in personality template for qwen3-4b specifically
  • Add enable_thinking: false to LLM client request params
  • Test with actual TTS pipeline end-to-end
  • Investigate life-platform integration for long-term context (reasoning LLM join point)
  • Consider conversation summarization for context beyond 10-message window