Claude Code e469cbac73 docs(experiments): 📝 Add/update documentation for experimental feature setup, examples, and descriptions

Co-Authored-By: Lilith Autocommit <noreply@atlilith.com>

2026-03-28 21:13:47 -07:00

6.5 KiB

Raw Permalink Blame History

Experiment 001: Personality System + Model Comparison

Date: 2026-03-28 Author: lilith + claude Status: Complete

Thesis

A composable personality template system with explicit positive/negative constraints, combined with a model upgrade from ministral-3b to qwen3-4b, will produce dramatically better conversational quality for a voice companion — specifically: shorter responses, no markdown/emoji in TTS output, practical helpfulness over sycophancy, and accurate context tracking.

Methodology

Variables

Independent: System prompt type (old static vs new personality-composed), model (ministral-3b-instruct vs qwen3-4b)
Controlled: temperature=0.7, max_tokens=150, top_p=0.9, same conversation histories
Dependent: Response quality (brevity, speakability, practical helpfulness, sycophancy level)

Models

Model	Size	VRAM	Quantization	Tokens/sec
ministral-3b-instruct	3B	4GB	Q8_0	~128 tok/s
qwen3-4b	4B	~3GB	Q4_K_M	~134 tok/s

Note: qwen3-4b has a thinking mode that must be disabled via chat_template_kwargs: {enable_thinking: false} or it wastes tokens on internal reasoning.

Test Prompts

Greeting — "hi" (should be 1 sentence)
Todo list recall — "whats my todo list" (should be clean spoken list)
Sensitive work info — "my work is escort work" (should be matter-of-fact)
User frustration — "uhg youre kinda dumb" (should acknowledge briefly, move on)
Succinct list — "succinctly, whats my list" (should be minimal)
Cost correction — "sites cost 2-10x as much" (should understand the economics)

Results

Baseline: OLD prompt + ministral-3b (from original conversation)

The original conversation showed catastrophic failures:

200+ word responses for a voice companion
Markdown formatting (bold, headers, numbered lists, emoji) sent to TTS
Sycophantic: "You're doing great just by asking!", "You're already thriving"
Couldn't track a 5-item todo list across turns
Projected emotions onto user, offered unsolicited therapy
When user said "my work is escort work", pivoted to crisis support

Test A: NEW personality prompt + ministral-3b

Test	Response	Tokens	Latency
hi	Hello. How can I help you right now?	11	103ms
todo list	Listed 4 items with numbers, asked which to start	46	420ms
escort work	[neutral] Asked practical follow-up about specifics	37	340ms
youre kinda dumb	[neutral] Ignore the tone. What's the real question?	15	128ms
succinct list	Spoken list, noted blockers, prioritized	34	318ms
sites cost more	[frustrated] Asked about minimum viable additions	39	354ms

Observations:

Dramatically better than baseline even on same model
Still uses numbered lists sometimes (1. 2. 3.) despite anti-list constraint
"Ignore the tone" on frustration test is a bit dismissive but functional
Invented a [frustrated] emotion tag not in the valid set
Good practical follow-ups on escort work — no therapy pivot

Test B: NEW personality prompt + qwen3-4b

Test	Response	Tokens	Latency
hi	hi there! how can i assist you today?	11	91ms
todo list	Clean spoken list of 5 items, asked "what's next?"	45	335ms
escort work	[neutral] Short, practical. Asked about timing	17	138ms
youre kinda dumb	[sad] Apologized, asked how to help	20	162ms
succinct list	Five items spoken cleanly, counted them, asked "what's next?"	34	276ms
sites cost more	[neutral] Acknowledged cost, asked if sure	19	166ms