The git repo (https://github.com/careerhackeralex/visualize) is the source of truth.
Every iteration installs the plugin from GitHub, invokes it through Claude Code, evaluates real output with the automated pipeline, fixes the skill, and pushes improvements back.
Loop fires (manual or cron)
└── Agent starts
│
├── Step 0: Install plugin from GitHub
├── Step 1: Check state (skip if SHIP/VIRAL)
├── Step 2: Generate — invoke skill via Claude Code CLI
├── Step 3: Evaluate — 3-layer pipeline + agent vision scoring
├── Step 4: Analyze — root cause, trace to SKILL.md
├── Step 5: Uninstall plugin
├── Step 6: Clone repo, fix SKILL.md + references
├── Step 7: Push fixes to git
├── Step 8: Re-install from git (gets updated version)
├── Step 9: Re-generate worst prompts to verify fixes
├── Step 10: Validate — pipeline re-eval + screenshots
├── Step 11: Commit final results & push
└── Step 12: Cleanup & report
# Add marketplace (if not already added)
claude plugin marketplace add careerhackeralex/visualize
# Install the plugin
claude plugin install visualize@careerhackeralexThis is exactly what a real user does. We test what they get.
# Clone repo separately just to read state
WORK_DIR=$(mktemp -d)
cd "$WORK_DIR"
git clone https://github.com/careerhackeralex/visualize.git- Read
eval/loop-state.json - If gate is SHIP or VIRAL → report status, cleanup, exit
- If
loopsSinceResearch >= 5AND score plateau → research phase
Create a temp output directory and invoke the skill through Claude Code:
mkdir -p "$WORK_DIR/outputs"
cd "$WORK_DIR/outputs"Create a random persona with:
- Role (e.g., "VP of Sales at a B2B SaaS company")
- Situation (e.g., "preparing quarterly board update")
- Skill level (e.g., "non-technical, wants something impressive fast")
Generate 3-5 prompts they'd realistically type, plus 2-3 fixed prompts from eval/stress-tests.md.
For each test prompt:
claude -p \
--dangerously-skip-permissions \
--allow-dangerously-skip-permissions \
--model sonnet \
"{test prompt}. Save as {filename}.html"No --plugin-dir. The skill is installed as a real plugin. Claude Code discovers it automatically, just like a real user.
Test prompt selection — always include:
- 1 slide deck / presentation
- 1 dashboard with data
- 1 infographic / visual storytelling
- 1 non-English (Korean, Japanese, etc.)
- 1-2 rotating types (cheatsheet, comparison, flowchart, poster, carousel, timeline, etc.)
Save outputs to eval/rounds/round-{N}/generated/.
Run the automated eval pipeline (Layers 1 & 2):
cd "$WORK_DIR/visualize/eval/pipeline"
node run.js --dir "$WORK_DIR/outputs" --round {N}This runs:
- Layer 1: Format detection from DOM signals
- Layer 2: 45 deterministic DOM checks (0-100 score)
- Screenshots: Dark/light desktop + dark mobile captured
Results saved to eval/rounds/round-{N}/scores.json.
Screenshots saved to eval/rounds/round-{N}/screenshots/.
Then you (the agent) perform Layer 3: look at the screenshots and score 7 visual/IA dimensions using the calibrated prompt in eval/pipeline/LOOP-PROMPT.md:
- First Impression (15%)
- Typography (15%)
- Color (10%)
- Layout (15%)
- Content (15%)
- Information Architecture (20%)
- Format Execution (10%)
Calibration anchors (use these to stay consistent):
| Score | Benchmark |
|---|---|
| 10 | Apple.com product page |
| 9 | Stripe dashboard, Linear UI |
| 8 | Well-designed SaaS landing page |
| 7 | Decent Bootstrap template |
| 6 | Generic HTML template |
| 5 | Unstyled HTML with some CSS |
Don't jump to fixes. Think first:
- Weakest dimensions across all outputs (Layer 2 failing checks + Layer 3 low scores)
- Worst outputs — which prompts produced bad results?
- Skill compliance — did Claude Code actually follow SKILL.md? If not, WHY?
- Root cause — trace every issue to a specific SKILL.md instruction (missing, unclear, wrong, buried, contradictory)
- Tag each fix:
SKILL/REF:skeleton/REF:design-system/REF:other
Write analysis to eval/rounds/round-{N}/analysis.md.
claude plugin uninstall visualize@careerhackeralexClean slate before we make fixes.
Work in the cloned repo from Step 1 ($WORK_DIR/visualize).
SKILL.md (most important):
- Add missing instructions with concrete examples
- Clarify vague instructions ("use 2rem" not "use appropriate sizing")
- Add anti-patterns ("DON'T: gradient text on headings")
- Simplify verbose instructions
- Make ignored instructions more prominent ("CRITICAL:" prefix)
- Keep under 5,000 words
References (priority order):
skeleton.md— the HTML template, affects every outputdesign-system.md— typography, colors, spacingtypes.md,menu.md,libraries.md,css-techniques.md
Simplify everything:
- Remove unused CSS rules
- Consolidate duplicates
- Replace JS with CSS where possible
- All top-level JS uses
var
cd "$WORK_DIR/visualize"
git add -A
git commit -m "eval: round {N} fixes — {summary of SKILL.md changes}"
git push origin main# Update marketplace to get latest
claude plugin marketplace update careerhackeralex
# Re-install (gets the fixed version)
claude plugin install visualize@careerhackeralexNow the installed plugin has our fixes.
Re-run the worst 2-3 prompts from Step 2:
cd "$WORK_DIR/outputs-fixed"
claude -p \
--dangerously-skip-permissions \
--allow-dangerously-skip-permissions \
--model sonnet \
"{same prompt}. Save as {filename}.html"Compare before/after. Fixes should produce noticeably better output.
Run the pipeline again on re-generated files:
cd "$WORK_DIR/visualize/eval/pipeline"
node run.js --dir "$WORK_DIR/outputs-fixed" --round {N}Verify:
- Layer 2 score improved
- Zero console errors
- Both dark and light themes work
- Hamburger menu works (toggle, download, print)
- If regression → revert, try different approach
Save post-fix screenshots to eval/rounds/round-{N}/screenshots/.
Copy good re-generated outputs to examples/ if they're better than existing ones:
cd "$WORK_DIR/visualize"
cp "$WORK_DIR/outputs-fixed/"*.html examples/
git add -A
git commit -m "eval: round {N} — avg {score} ({delta}), gate {GATE}
Key SKILL.md changes:
- {change 1}
- {change 2}
Generated via: claude -p (installed plugin, sonnet)
Weakest: {dim} ({score}), Strongest: {dim} ({score})"
git push origin mainUpdate eval/loop-state.json in this commit.
# Uninstall plugin
claude plugin uninstall visualize@careerhackeralex
claude plugin marketplace remove careerhackeralex
# Delete temp directory
rm -rf "$WORK_DIR"Output summary:
- Round number, persona, prompts
- Per-file scores (Layer 2 + Layer 3)
- Average score and delta from last round
- Top SKILL.md changes made
- Before/after for worst file
- Gate status
- Git commit hash
Triggered when: loopsSinceResearch >= 5 AND score plateau (< 0.3 improvement over 3 rounds).
- Web search latest CSS/visualization techniques
- Study award-winning designs (Stripe, Linear, Apple)
- Analyze Gamma/Canva — what makes their output look professional?
- Document in
eval/rounds/round-{N}/research.md - Update SKILL.md + references
- Reset
loopsSinceResearch = 0 - 30 minutes max, no rabbit holes
| Gate | Avg | Min | Meaning |
|---|---|---|---|
| VIRAL | ≥ 9.5 | all ≥ 9 | Screenshot-worthy |
| SHIP | ≥ 9.0 | all ≥ 8 | Ready for public |
| ACCEPTABLE | ≥ 8.0 | all ≥ 7 | Functional, decent |
| NEEDS WORK | ≥ 7.0 | or any < 7 | Keep iterating |
| FAIL | < 7.0 | or any < 5 | Significant issues |
- Git repo always in working state
- SKILL.md under 5,000 words
- All top-level JS uses
var - Class-based theming only (no @media prefers-color-scheme)
- Content visible by default
- Every fix goes into SKILL.md or references, not just individual files
- Plugin structure valid (
.claude-plugin/plugin.json+skills/visualize/) - All evaluated outputs generated by invoking the installed skill, never hand-crafted
- Plugin installed from GitHub, not local path