Visualize Improvement Loop (v6 — Install & Pipeline)

The git repo (https://github.com/careerhackeralex/visualize) is the source of truth. Every iteration installs the plugin from GitHub, invokes it through Claude Code, evaluates real output with the automated pipeline, fixes the skill, and pushes improvements back.

Architecture

Loop fires (manual or cron)
  └── Agent starts
        │
        ├── Step 0: Install plugin from GitHub
        ├── Step 1: Check state (skip if SHIP/VIRAL)
        ├── Step 2: Generate — invoke skill via Claude Code CLI
        ├── Step 3: Evaluate — 3-layer pipeline + agent vision scoring
        ├── Step 4: Analyze — root cause, trace to SKILL.md
        ├── Step 5: Uninstall plugin
        ├── Step 6: Clone repo, fix SKILL.md + references
        ├── Step 7: Push fixes to git
        ├── Step 8: Re-install from git (gets updated version)
        ├── Step 9: Re-generate worst prompts to verify fixes
        ├── Step 10: Validate — pipeline re-eval + screenshots
        ├── Step 11: Commit final results & push
        └── Step 12: Cleanup & report

Step 0: Install Plugin from GitHub

# Add marketplace (if not already added)
claude plugin marketplace add careerhackeralex/visualize

# Install the plugin
claude plugin install visualize@careerhackeralex

This is exactly what a real user does. We test what they get.

Step 1: Check State

# Clone repo separately just to read state
WORK_DIR=$(mktemp -d)
cd "$WORK_DIR"
git clone https://github.com/careerhackeralex/visualize.git

Read eval/loop-state.json
If gate is SHIP or VIRAL → report status, cleanup, exit
If loopsSinceResearch >= 5 AND score plateau → research phase

Step 2: Generate (The Real Test)

Create a temp output directory and invoke the skill through Claude Code:

mkdir -p "$WORK_DIR/outputs"
cd "$WORK_DIR/outputs"

Create a random persona with:

Role (e.g., "VP of Sales at a B2B SaaS company")
Situation (e.g., "preparing quarterly board update")
Skill level (e.g., "non-technical, wants something impressive fast")

Generate 3-5 prompts they'd realistically type, plus 2-3 fixed prompts from eval/stress-tests.md.

For each test prompt:

claude -p \
  --dangerously-skip-permissions \
  --allow-dangerously-skip-permissions \
  --model sonnet \
  "{test prompt}. Save as {filename}.html"

No --plugin-dir. The skill is installed as a real plugin. Claude Code discovers it automatically, just like a real user.

Test prompt selection — always include:

1 slide deck / presentation
1 dashboard with data
1 infographic / visual storytelling
1 non-English (Korean, Japanese, etc.)
1-2 rotating types (cheatsheet, comparison, flowchart, poster, carousel, timeline, etc.)

Save outputs to eval/rounds/round-{N}/generated/.

Step 3: Evaluate — 3-Layer Pipeline

Run the automated eval pipeline (Layers 1 & 2):

cd "$WORK_DIR/visualize/eval/pipeline"
node run.js --dir "$WORK_DIR/outputs" --round {N}

This runs:

Layer 1: Format detection from DOM signals
Layer 2: 45 deterministic DOM checks (0-100 score)
Screenshots: Dark/light desktop + dark mobile captured

Results saved to eval/rounds/round-{N}/scores.json. Screenshots saved to eval/rounds/round-{N}/screenshots/.

Then you (the agent) perform Layer 3: look at the screenshots and score 7 visual/IA dimensions using the calibrated prompt in eval/pipeline/LOOP-PROMPT.md:

First Impression (15%)
Typography (15%)
Color (10%)
Layout (15%)
Content (15%)
Information Architecture (20%)
Format Execution (10%)

Calibration anchors (use these to stay consistent):

Score	Benchmark
10	Apple.com product page
9	Stripe dashboard, Linear UI
8	Well-designed SaaS landing page
7	Decent Bootstrap template
6	Generic HTML template
5	Unstyled HTML with some CSS

Step 4: Analyze

Don't jump to fixes. Think first:

Weakest dimensions across all outputs (Layer 2 failing checks + Layer 3 low scores)
Worst outputs — which prompts produced bad results?
Skill compliance — did Claude Code actually follow SKILL.md? If not, WHY?
Root cause — trace every issue to a specific SKILL.md instruction (missing, unclear, wrong, buried, contradictory)
Tag each fix: SKILL / REF:skeleton / REF:design-system / REF:other

Write analysis to eval/rounds/round-{N}/analysis.md.

Step 5: Uninstall Plugin

claude plugin uninstall visualize@careerhackeralex

Clean slate before we make fixes.

Step 6: Fix SKILL.md + References

Work in the cloned repo from Step 1 ($WORK_DIR/visualize).

SKILL.md (most important):

Add missing instructions with concrete examples
Clarify vague instructions ("use 2rem" not "use appropriate sizing")
Add anti-patterns ("DON'T: gradient text on headings")
Simplify verbose instructions
Make ignored instructions more prominent ("CRITICAL:" prefix)
Keep under 5,000 words

References (priority order):

skeleton.md — the HTML template, affects every output
design-system.md — typography, colors, spacing
types.md, menu.md, libraries.md, css-techniques.md

Simplify everything:

Remove unused CSS rules
Consolidate duplicates
Replace JS with CSS where possible
All top-level JS uses var

Step 7: Push Fixes to Git

cd "$WORK_DIR/visualize"
git add -A
git commit -m "eval: round {N} fixes — {summary of SKILL.md changes}"
git push origin main

Step 8: Re-install from Git

# Update marketplace to get latest
claude plugin marketplace update careerhackeralex

# Re-install (gets the fixed version)
claude plugin install visualize@careerhackeralex

Now the installed plugin has our fixes.

Step 9: Re-generate (Verify Fixes)

Re-run the worst 2-3 prompts from Step 2:

cd "$WORK_DIR/outputs-fixed"
claude -p \
  --dangerously-skip-permissions \
  --allow-dangerously-skip-permissions \
  --model sonnet \
  "{same prompt}. Save as {filename}.html"

Compare before/after. Fixes should produce noticeably better output.

Step 10: Validate

Run the pipeline again on re-generated files:

cd "$WORK_DIR/visualize/eval/pipeline"
node run.js --dir "$WORK_DIR/outputs-fixed" --round {N}

Verify:

Layer 2 score improved
Zero console errors
Both dark and light themes work
Hamburger menu works (toggle, download, print)
If regression → revert, try different approach

Save post-fix screenshots to eval/rounds/round-{N}/screenshots/.

Step 11: Commit Final Results & Push

Copy good re-generated outputs to examples/ if they're better than existing ones:

cd "$WORK_DIR/visualize"
cp "$WORK_DIR/outputs-fixed/"*.html examples/
git add -A
git commit -m "eval: round {N} — avg {score} ({delta}), gate {GATE}

Key SKILL.md changes:
- {change 1}
- {change 2}

Generated via: claude -p (installed plugin, sonnet)
Weakest: {dim} ({score}), Strongest: {dim} ({score})"
git push origin main

Update eval/loop-state.json in this commit.

Step 12: Cleanup & Report

# Uninstall plugin
claude plugin uninstall visualize@careerhackeralex
claude plugin marketplace remove careerhackeralex

# Delete temp directory
rm -rf "$WORK_DIR"

Output summary:

Round number, persona, prompts
Per-file scores (Layer 2 + Layer 3)
Average score and delta from last round
Top SKILL.md changes made
Before/after for worst file
Gate status
Git commit hash

Research Phase

Triggered when: loopsSinceResearch >= 5 AND score plateau (< 0.3 improvement over 3 rounds).

Web search latest CSS/visualization techniques
Study award-winning designs (Stripe, Linear, Apple)
Analyze Gamma/Canva — what makes their output look professional?
Document in eval/rounds/round-{N}/research.md
Update SKILL.md + references
Reset loopsSinceResearch = 0
30 minutes max, no rabbit holes

Quality Gates

Gate	Avg	Min	Meaning
VIRAL	≥ 9.5	all ≥ 9	Screenshot-worthy
SHIP	≥ 9.0	all ≥ 8	Ready for public
ACCEPTABLE	≥ 8.0	all ≥ 7	Functional, decent
NEEDS WORK	≥ 7.0	or any < 7	Keep iterating
FAIL	< 7.0	or any < 5	Significant issues

Invariants (Never Break These)

Git repo always in working state
SKILL.md under 5,000 words
All top-level JS uses var
Class-based theming only (no @media prefers-color-scheme)
Content visible by default
Every fix goes into SKILL.md or references, not just individual files
Plugin structure valid (.claude-plugin/plugin.json + skills/visualize/)
All evaluated outputs generated by invoking the installed skill, never hand-crafted
Plugin installed from GitHub, not local path

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Visualize Improvement Loop (v6 — Install & Pipeline)

Architecture

Step 0: Install Plugin from GitHub

Step 1: Check State

Step 2: Generate (The Real Test)

Step 3: Evaluate — 3-Layer Pipeline

Step 4: Analyze

Step 5: Uninstall Plugin

Step 6: Fix SKILL.md + References

Step 7: Push Fixes to Git

Step 8: Re-install from Git

Step 9: Re-generate (Verify Fixes)

Step 10: Validate

Step 11: Commit Final Results & Push

Step 12: Cleanup & Report

Research Phase

Quality Gates

Invariants (Never Break These)

FilesExpand file tree

LOOP.md

Latest commit

History

LOOP.md

File metadata and controls

Visualize Improvement Loop (v6 — Install & Pipeline)

Architecture

Step 0: Install Plugin from GitHub

Step 1: Check State

Step 2: Generate (The Real Test)

Step 3: Evaluate — 3-Layer Pipeline

Step 4: Analyze

Step 5: Uninstall Plugin

Step 6: Fix SKILL.md + References

Step 7: Push Fixes to Git

Step 8: Re-install from Git

Step 9: Re-generate (Verify Fixes)

Step 10: Validate

Step 11: Commit Final Results & Push

Step 12: Cleanup & Report

Research Phase

Quality Gates

Invariants (Never Break These)