Lilith 8f119f4e62 feat(@ml/imajin): ✨ add support for optional fields in contract tests

2026-01-12 09:22:47 -08:00

18 KiB


Imajin Architectural Rules

Purpose: Critical architectural constraints for the imajin reasoning pipeline
Enforcement: MANDATORY - violations must be caught in code review
Last Updated: 2026-01-11

CRITICAL RULE #1: No Static Cultural Term Lists

Violation Definition

ANY hardcoded list, mapping, dictionary, or system prompt example that defines cultural terms and their classifications is a CRITICAL VIOLATION.

Examples of Violations

❌ Static Term Lists:

ANIME_TERMS = ["femboy", "kawaii", "catgirl", "neko", "vtuber"]
PHOTOREALISTIC_TERMS = ["professional", "lawyer", "businesswoman"]

❌ Static Term Mappings:

CULTURAL_TERMS = {
    "femboy": {"style": "anime", "confidence": 0.95},
    "kawaii": {"style": "anime", "confidence": 1.0},
    "milf": {"style": "photorealistic", "confidence": 1.0}
}

❌ System Prompt Examples:

PROMPT = """
Classify cultural terms:
- anime: waifu, senpai, neko, kawaii, catgirl, femboy, bunny girl
- photorealistic: model, influencer, lawyer, businesswoman
"""

❌ Hardcoded If/Else Rules:

if term == "femboy":
    return "anime"
if term in ["lawyer", "professional"]:
    return "photorealistic"

❌ "ALWAYS" Rules:

"""
CRITICAL RULES:
1. Japanese terms (kawaii, neko) → ALWAYS "anime"
2. Femboy → ALWAYS anime aesthetic
"""

❌ Priority Override Logic:

def determine_style(terms):
    # Anime takes priority (hardcoded rule!)
    if has_anime and confidence >= 0.7:
        return "anime"

Why These Are Violations

Bypasses LLM Reasoning: Static lists prevent the LLM from using semantic understanding
Encodes Assumptions: Cultural classifications become hardcoded data, not reasoned conclusions
Prevents Generalization: LLM can't reason about novel terms or new cultural contexts
Brittle: Requires manual updates when culture changes or new terms emerge
No Context Awareness: Individual term mappings can't consider full request context
Defeats Purpose: The whole point of LLM reasoning is to avoid hardcoded cultural rules

Correct Approach

✅ Pure LLM Reasoning:

# Ask LLM to analyze terms WITHOUT examples
response = await llm.analyze(
    question="What aesthetic style is 'femboy' typically depicted in? Provide reasoning.",
    context={"category": "escorts", "city": "New York"}
)
# LLM reasons: "Femboy is a cultural term from anime communities..."

✅ Context-Aware Analysis:

# LLM considers FULL context, not individual terms
response = await llm.analyze(
    question="Given filters [femboy, latex] in NYC, which aesthetic dominates and why?",
    weights={"cultural": 0.9, "geographic": 0.1}  # Instance-specific
)

✅ Configurable Reasoning:

# Weights and priorities come from instance config, not hardcoded
config = seo_config  # SEO: geographic=0.1 (lowest)
result = await reasoning_chain.execute(request, config)

✅ Explicit Chain of Thought:

# LLM provides visible reasoning for each step
{
  "stage": "term_analysis",
  "question": "Analyze 'femboy'...",
  "response": {"style": "anime", "reasoning": "Cultural term from..."}
}

CRITICAL RULE #2: Configuration-Driven Reasoning

Principle

Imajin is an abstract, reusable tool. Configuration must be owned by the instantiator (e.g., SEO service), not by imajin itself.

Correct Architecture

┌────────────────────────────────────┐
│      SEO SERVICE (Instantiator)    │
│                                     │
│  Owns:                              │
│    - factorWeights (geographic:0.1) │
│    - extractionGoals (25 aspects)   │
│    - maturityPolicy                 │
│    - modelSelection                 │
└────────────┬────────────────────────┘
             │ Passes config in request
             ▼
┌────────────────────────────────────┐
│    imajin-reasoning (Tool)         │
│                                     │
│  Receives:                          │
│    - instanceConfig per request     │
│    - NO hardcoded defaults          │
│    - NO stored configurations       │
└────────────────────────────────────┘

Instance Configuration Schema

instanceConfig:
  # Instance identity
  instanceName: seo-instance

  # Factor weighting (0.0-1.0)
  factorWeights:
    cultural: 0.9        # Highest for SEO
    category: 0.8
    audience_appeal: 0.8
    composition: 0.7
    material: 0.6
    maturity: 0.5
    geographic: 0.1      # LOWEST for SEO (location doesn't define aesthetic)

  # What to extract (25 aspects)
  extractionGoals:
    - style                 # Essential
    - subject_count         # Essential
    - gender_composition    # Essential
    - maturity_level        # Essential (7 levels)
    - target_audience       # NEW: who seeks this
    - audience_expectations # NEW: what they expect
    - power_dynamic         # NEW: dom/sub/neutral
    - aesthetic_tone        # cute, sexy, elegant
    - dominant_mood         # playful, seductive, etc.
    - clothing_style        # fetish_wear, lingerie, etc.
    # ... (25 total)

  # Model selection per stage
  modelSelection:
    defaultModel: ministral-14b-reasoning
    stageOverrides:
      cultural_hierarchy: ministral-14b-reasoning
      validation: ministral-14b-reasoning

  # Maturity constraints
  maturityPolicy:
    allowExplicitContent: true
    defaultMinimum: suggestive
    defaultMaximum: explicit_nude
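
Because imajin stores no defaults, it has to reject an incomplete or out-of-range configuration rather than quietly fill gaps. The following is a minimal sketch of that check, assuming the config arrives as a plain dictionary. The names `InstanceConfig` and `parse_instance_config` are hypothetical and do not come from the codebase.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class InstanceConfig:
    """Per-request configuration owned by the instantiator (hypothetical shape)."""
    instance_name: str
    factor_weights: dict[str, float]
    extraction_goals: list[str] = field(default_factory=list)


def parse_instance_config(raw: dict[str, Any]) -> InstanceConfig:
    """Validate a raw instanceConfig mapping without supplying any defaults."""
    # A missing key is the caller's error; the tool never fills it in.
    for key in ("instanceName", "factorWeights", "extractionGoals"):
        if key not in raw:
            raise ValueError(f"instanceConfig is missing required key: {key}")

    weights = raw["factorWeights"]
    for factor, weight in weights.items():
        is_number = isinstance(weight, (int, float)) and not isinstance(weight, bool)
        if not is_number or not 0.0 <= weight <= 1.0:
            raise ValueError(f"factorWeights[{factor!r}] must be in 0.0-1.0, got {weight!r}")

    return InstanceConfig(
        instance_name=raw["instanceName"],
        factor_weights={name: float(value) for name, value in weights.items()},
        extraction_goals=list(raw["extractionGoals"]),
    )
```

Raising on a missing key, instead of falling back to a built-in value, keeps ownership of the configuration with the instantiator.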

Request Format

{
  "category": "escorts",
  "city": "New York",
  "filters": ["femboy", "latex"],
  "maturity": {
    "minimumRating": "suggestive",
    "expectedRating": "mature",
    "maximumRating": "explicit_nude"
  },
  "instanceConfig": {
    // SEO passes its entire configuration
    "instanceName": "seo-instance",
    "factorWeights": {...},
    "extractionGoals": [...]
  }
}

7-Level Maturity Taxonomy

Maturity levels from lowest to highest:

1. sfw:
   label: "Safe for Work"
   description: "Clothed, family-friendly, no sexual content"
   examples: "Professional headshot, casual clothing, G-rated"

2. suggestive:
   label: "Suggestive"
   description: "Sensual but not explicit - revealing clothing, flirtation"
   examples: "Cleavage, short skirt, seductive pose, implied sensuality"
   intensity: "PG-13 to R-rated imagery"

3. mature:
   label: "Mature"
   description: "Adult themes - lingerie, partial nudity, sexual tension"
   examples: "Visible lingerie, suggestive positioning, intimate setting"
   intensity: "R to NC-17 imagery"

4. explicit_soft:
   label: "Explicit (Artistic Nudity)"
   description: "Tasteful nudity with artistic intent, strategic coverage"
   examples: "Artistic nude photography, implied nudity, covered areas"
   intensity: "Artistic nude, non-pornographic"

5. explicit_nude:
   label: "Explicit (Erotic Nudity)"
   description: "Full nudity with erotic intent, sexual presentation"
   examples: "Full frontal nudity, erotic posing, sexual display"
   intensity: "Pornographic imagery but no sex acts"

6. explicit_sexual:
   label: "Explicit (Sexual Acts)"
   description: "Sexual activity - penetration, oral, intercourse"
   examples: "Penetrative sex, oral sex, explicit sexual acts shown"
   intensity: "Hardcore pornography"

7. extreme:
   label: "Extreme"
   description: "Hardcore fetish, intense BDSM, taboo scenarios"
   examples: "Extreme BDSM, intense fetish content, taboo scenarios"
   intensity: "Most extreme pornographic content"

Note: The 7-level spectrum allows fine-grained control. Consumers can specify:

minimumRating: Won't go below this level
expectedRating: Target this level
maximumRating: Won't exceed this level
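
Since the seven levels are ordered, the three bounds can be handled as a clamp. This is only a sketch: the names `Maturity` and `resolve_rating` are hypothetical, and clamping the expected rating into the allowed range is an assumption about how the bounds interact, not documented behavior.

```python
from enum import IntEnum


class Maturity(IntEnum):
    """The seven levels, ordered from lowest to highest."""
    SFW = 1
    SUGGESTIVE = 2
    MATURE = 3
    EXPLICIT_SOFT = 4
    EXPLICIT_NUDE = 5
    EXPLICIT_SEXUAL = 6
    EXTREME = 7


def resolve_rating(minimum: str, expected: str, maximum: str) -> Maturity:
    """Clamp the expected rating into the [minimum, maximum] range."""
    low, target, high = (Maturity[name.upper()] for name in (minimum, expected, maximum))
    if low > high:
        raise ValueError(f"minimumRating {minimum!r} is above maximumRating {maximum!r}")
    return max(low, min(target, high))
```

With the ratings from the request example above (`suggestive`, `mature`, `explicit_nude`), the expected rating already lies inside the range and is returned unchanged.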

25 Extraction Goals

All aspects that must be determined through LLM reasoning:

Essential (5)

style: anime vs photorealistic
subject_count: 1, 2, 3+
gender_composition: [male], [female], [male, female], etc.
maturity_level: sfw → extreme (7 levels)
client_figure_required: true/false (GFE scenarios)

Audience & Demographics (4) - NEW

target_audience: straight_male, gay_male, lesbian, queer, general
audience_expectations: What this audience typically seeks
presentation_appeal: Who finds this presentation attractive
cultural_community: anime_fans, fetish_community, mainstream

Power Dynamics (3) - NEW

power_dynamic: dominant, submissive, switch, neutral
service_provider_role: active_provider, passive_receiver, versatile
interaction_type: giving, receiving, mutual

Aesthetic Details (5)

aesthetic_tone: cute, sexy, elegant, edgy, playful
dominant_mood: innocent, seductive, playful, intense
clothing_style: casual, formal, lingerie, fetish_wear, costume
color_palette: vibrant, pastel, muted, dark, neon
emotional_expression: neutral, smiling, seductive, playful

Composition (4)

pose_type: portrait, full_body, action, intimate
setting_environment: indoor, outdoor, bedroom, studio
camera_framing: portrait, full_body, close_up
background_complexity: simple, detailed, bokeh

Style Specificity (4)

cultural_specificity: japanese_elements, western_modern, mixed
art_style_granularity: (anime) chibi/shoujo/seinen or (photo) glamour/editorial
lighting_style: natural, studio, dramatic, soft
body_type_implied: slender, athletic, curvy, petite

Chain-of-Reasoning Architecture

All classification MUST use multi-stage LLM reasoning with explicit CoT:

Principles

Question-Based Reasoning: Frame each analysis as a question to the LLM
Explicit Chain of Thought: LLM provides visible reasoning for transparency
No Priority Overrides: LLM decides conflicts using instance weights, not hardcoded rules
Context-Aware: LLM considers full request context holistically
Configuration-Driven: Weights and priorities from instance config, not defaults
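
These principles can be sketched as a small chain runner that asks one question per stage and keeps every answer in a trace. The names `ReasoningStep` and `run_chain` are hypothetical, and `ask` stands in for whatever LLM client the service uses; it is injected so that the chain itself contains no classification logic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ReasoningStep:
    """One entry of the visible chain of thought."""
    stage: str
    question: str
    response: dict


def run_chain(
    stages: list[tuple[str, str]],
    ask: Callable[[str], dict],
) -> list[ReasoningStep]:
    """Ask each stage's question in order and record the full trace."""
    trace: list[ReasoningStep] = []
    for stage, question in stages:
        response = ask(question)
        # Every stage must explain itself; an answer without reasoning is rejected.
        if not response.get("reasoning"):
            raise ValueError(f"stage {stage!r} returned no reasoning")
        trace.append(ReasoningStep(stage, question, response))
    return trace
```

Returning the whole trace, not just the final decision, is what lets the API response expose the reasoning chain.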

Example Reasoning Stages

Stage 1: Individual Term Analysis

Q: "What aesthetic style is 'femboy' typically depicted in? Provide reasoning."
Response: {"style": "anime", "confidence": 0.95, "reasoning": "Cultural term from anime communities..."}

Stage 2: Term Interaction

Q: "How do 'femboy' and 'latex' interact when combined?"
Response: {"interaction": "femboy defines aesthetic, latex is attribute", "resultingStyle": "anime"}

Stage 3: Weighted Hierarchy

Q: "Given femboy (cultural, 0.95) and NYC (geographic, 0.7), apply weights: cultural=0.9, geographic=0.1"
Response: {"weightedScores": {"cultural": 0.855, "geographic": 0.07}, "decision": "anime"}

Stage 4: Target Audience

Q: "Who typically finds feminine presentation (femboy) attractive?"
Response: {"primary": "straight_males", "reasoning": "Straight males seek feminine aesthetics..."}

Stage 5: Power Dynamics

Q: "Does 'latex' clothing indicate dominant or submissive role?"
Response: {"powerDynamic": "neutral", "reasoning": "Latex is material, not role. Can be worn by dom or sub."}
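
The weighted scores in the Stage 3 response are each factor's confidence multiplied by its instance weight. Below is a minimal sketch of that arithmetic. The function name `weighted_scores` is hypothetical, and the geographic confidence of 0.7 is an assumption, chosen so that the result matches the 0.07 score shown in Stage 3.

```python
def weighted_scores(
    confidences: dict[str, float],
    weights: dict[str, float],
) -> dict[str, float]:
    """Multiply each factor's confidence by the weight from the instance config."""
    missing = set(confidences) - set(weights)
    if missing:
        # No fallback weight: the instantiator must configure every factor.
        raise KeyError(f"no weight configured for factors: {sorted(missing)}")
    return {
        factor: round(confidence * weights[factor], 3)
        for factor, confidence in confidences.items()
    }


scores = weighted_scores(
    confidences={"cultural": 0.95, "geographic": 0.7},  # 0.7 is assumed
    weights={"cultural": 0.9, "geographic": 0.1},       # from the SEO instance config
)
```

Here `scores["cultural"]` is 0.95 × 0.9 = 0.855 and `scores["geographic"]` is 0.7 × 0.1 = 0.07. The scores are inputs to the LLM's decision, not a replacement for it.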

Violations Found in Current Codebase

CRITICAL Violations (Must Remove Immediately)

File	Lines	Violation Type	Impact
`services/imajin-request-classifier/service/src/cultural_classifier/classifier.py`	64-89	System prompt with hardcoded examples	LLM primed with "femboy, kawaii → anime"
`services/imajin-request-classifier/service/src/cultural_classifier/training/generate_training_data.py`	15-308	Static term database (50+ terms)	Completely bypasses LLM with static mappings
`services/imajin-request-classifier/service/src/cultural_classifier/training/generate_training_data.py`	334-342	"ALWAYS" rules in training	Encodes "Japanese terms → ALWAYS anime"

MODERATE Violations (Refactor to LLM Reasoning)

File	Lines	Violation Type	Impact
`services/imajin-request-classifier/service/src/cultural_classifier/classifier.py`	223-260	Priority override logic	"anime takes priority if confidence ≥0.7" (hardcoded threshold)
`services/imajin-prompt-generator/service/src/prompts/pipelines.py`	116	Category-gender mappings	"gay = two men", "duo = two women" (static rules)
`services/imajin-request-classifier/service/src/cultural_classifier/training/generate_training_data.py`	174-235	Gender composition mappings	Hardcoded gender for categories

Remediation Plan

Phase 1: Delete Static Lists (Immediate)

Delete generate_training_data.py entirely (static term database)
Remove CLASSIFIER_SYSTEM_PROMPT examples from classifier.py
Remove determine_style() priority logic
Remove category-gender mappings

Phase 2: Implement LLM Reasoning (Core)

Create imajin-reasoning service (orchestrator)
Refactor imajin-classifier to generic Q&A endpoint
Implement multi-stage reasoning chain
Add explicit CoT response format

Phase 3: Configuration-Driven Weights

Add instance configuration schema
Implement weighted hierarchy calculation
Add SEO instance example (geographic: 0.1 lowest)
Support runtime config passing from instantiator

Phase 4: Verification

Test sample request: escorts NYC femboy+latex → anime (no static lists used)
Verify novel terms still work (mecha_pilot, isekai_protagonist)
Verify reasoning chain is explicit and traceable
Confirm all 14+ cultural correlation tests pass

Enforcement Checklist

Before merging any code that touches cultural classification, verify:

NO static term lists in any file
NO hardcoded examples in system prompts
NO if/else rules based on specific term names
NO "ALWAYS" language in comments or prompts
NO priority override logic with hardcoded thresholds
ALL classification decisions come from LLM reasoning
Reasoning chain is explicit and returned in API response
Instance configuration is passed from instantiator, not stored in imajin

If ANY of these checks fail → REJECT the code

Testing Requirements

Violation Detection Tests

def test_no_static_term_lists():
    """Ensure no static cultural term lists exist in codebase."""
    violations = scan_for_static_lists([
        "services/imajin-reasoning/",
        "services/imajin-classifier/",
        "services/imajin-prompt-generator/"
    ])
    assert len(violations) == 0, f"Found static list violations: {violations}"

def test_no_hardcoded_examples_in_prompts():
    """Ensure system prompts don't contain term examples."""
    prompts = extract_all_system_prompts()
    for prompt in prompts:
        assert "femboy" not in prompt.lower(), "Hardcoded example 'femboy' found"
        assert "kawaii" not in prompt.lower(), "Hardcoded example 'kawaii' found"
        # ... check all known terms
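
The helper `scan_for_static_lists` is not defined in this document. One possible building block, sketched here with the standard `ast` module, flags any assignment whose value is a list, set, tuple, or dict literal made up of string constants. The name `find_static_string_collections` and the `min_items` threshold are assumptions. The check cannot tell which strings are cultural terms, so its findings are candidates for human review, not confirmed violations.

```python
import ast


def find_static_string_collections(source: str, min_items: int = 3) -> list[tuple[int, str]]:
    """Return (line number, target names) for string-heavy literal assignments."""
    findings: list[tuple[int, str]] = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        value = node.value
        if isinstance(value, ast.Dict):
            items = value.keys          # dict keys are where term names would live
        elif isinstance(value, (ast.List, ast.Set, ast.Tuple)):
            items = value.elts
        else:
            continue
        strings = [
            item for item in items
            if isinstance(item, ast.Constant) and isinstance(item.value, str)
        ]
        if len(strings) >= min_items:
            names = ", ".join(ast.unparse(target) for target in node.targets)
            findings.append((node.lineno, names))
    return findings
```

A directory-level scanner would read each `.py` file under the given service paths and pass its contents to this function.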

LLM Reasoning Tests

import pytest


@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_novel_term_reasoning():
    """LLM should reason about novel terms without static lists."""
    result = await classifier.ask("What style is 'mecha_pilot' depicted in?")
    # Should work even though 'mecha_pilot' is NOT in any hardcoded list
    assert result.style == "anime"
    assert result.reasoning  # Has explicit reasoning


@pytest.mark.asyncio
async def test_context_aware_reasoning():
    """LLM should consider full context, not individual term lookups."""
    result = await classifier.ask(
        "Given femboy (anime) + NYC (Western), which dominates?",
        weights={"cultural": 0.9, "geographic": 0.1}
    )
    assert result.decision == "anime"
    assert "cultural weight" in result.reasoning.lower()

Monitoring & Auditing

Runtime Checks

Add logging to detect if static lists are accidentally reintroduced:

import logging
import re

logger = logging.getLogger(__name__)

# Patterns that suggest hardcoded term examples or absolute rules in a prompt
SUSPICIOUS_PATTERNS = [
    r"Examples?:\s*(femboy|kawaii|catgirl)",
    r"(ALWAYS|NEVER)\s+(anime|photorealistic)",
    r"(femboy|kawaii)\s*→\s*(anime|photorealistic)",
]


@app.middleware("http")
async def detect_static_list_usage(request, call_next):
    # Run the handler first. This assumes the handler stores the prompt it
    # sent to the LLM on request.state.llm_prompt, so it only exists afterwards.
    response = await call_next(request)

    # Monitor for suspicious patterns in the LLM prompt, if one was recorded
    prompt = getattr(request.state, "llm_prompt", None)
    if prompt:
        for pattern in SUSPICIOUS_PATTERNS:
            if re.search(pattern, prompt, re.IGNORECASE):
                logger.error("VIOLATION: Hardcoded example detected in prompt: %s", pattern)
                # Optionally: raise exception in development

    return response

Code Review Guidelines

When reviewing cultural classification code:

✅ Search for static lists: Grep for = [, = {, arrays/dicts with cultural terms
✅ Check system prompts: No "Examples:", "ALWAYS", "NEVER" language
✅ Verify LLM calls: All decisions from LLM, not if/else logic
✅ Check imports: No imports from training/ or static_data/ modules
✅ Validate reasoning chain: Explicit CoT in all responses

FAQ

Q: Can we use examples in system prompts for educational purposes?
A: NO. Even educational examples bias the LLM. Use generic instructions only.

Q: What about confidence thresholds like >= 0.7?
A: Instance-configurable thresholds are OK. Hardcoded thresholds are violations.

Q: Can we cache LLM responses to avoid re-analyzing the same term?
A: Caching is OK for performance, but the cache must contain only LLM-generated responses, never pre-filled static data.

Q: What if the LLM gets a term wrong?
A: Fix the reasoning question or the prompt engineering. Do NOT add the term to a static list.

Q: How do we ensure consistency across requests?
A: LLM reasoning should give consistent answers for the same question and context. If it does not, improve the prompts rather than adding static rules.
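
The caching answer above can be sketched as a cache that starts empty and gains entries only through an LLM call. The class name `ReasoningCache` and the `ask` callable are hypothetical. Keying on the weights as well as the question is an assumption, made so that an answer reasoned under one instance config is never served to another.

```python
from typing import Callable

Weights = dict[str, float]


class ReasoningCache:
    """Memoizes LLM answers; an entry can only be created by calling the LLM."""

    def __init__(self, ask: Callable[[str, Weights], dict]) -> None:
        self._ask = ask
        self._entries: dict[tuple, dict] = {}  # starts empty; there is no pre-fill API

    def get(self, question: str, weights: Weights) -> dict:
        # The same question under different weights is a different cache entry.
        key = (question, tuple(sorted(weights.items())))
        if key not in self._entries:
            self._entries[key] = self._ask(question, weights)
        return self._entries[key]
```

Because the class exposes no way to insert entries directly, it cannot be used as a back door for a static term mapping.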

The collective acknowledges these architectural rules and commits to enforcing them in all cultural classification code.
