Replatform pipeline architecture
Status: canonical reference for the v3 pipeline as of 2026-05-05. Goal of the system: rebuild ~1800 Wix small-business sites onto Astro + Cloudflare Pages with high visual fidelity, deterministically, at near-zero per-site LLM cost.
This document describes how the pipeline is organised today, why each piece exists, and the principles we landed on after a hard session of trying both an LLM-driven convergence loop and a deterministic verifier perimeter. Read retro-2026-05-05.md for the operational narrative; this file is the structural reference.
1. Pipeline at a glance
LIVE WIX SITE ──► PHASE 1 EXTRACT (Playwright + cheerio rewrites)
│
▼
PHASE 2 CAPTURE (Playwright layout snapshots)
│
▼
PHASE 3 THEME (palette + font extraction)
│
▼
PHASE 4 ASSEMBLE (matchers → translators → emit)
│
▼
PHASE 5 BUILD (npm install + astro build)
│
▼
PHASE 6 DEPLOY (wrangler pages deploy)
│
▼
PHASE 7 QA (Gemini critic, optional)
Phases 1-3 produce the inputs for the assembler. Phase 4 is where most logic lives. Phases 5-6 are vendor commands. Phase 7 is the final verification step.
The crucial change over what was here at the start of this session: each phase now has a deterministic verifier gate sitting between it and the next phase. The LLM only sees what the deterministic gates can't catch.
2. Phase responsibilities and code map
Phase 1 — DOM extraction (lib/dom-pipeline.js)
Walks every page of the live Wix site via Playwright. For each page emits a clean <slug>.body.html and <slug>.head.html under builds/<domain>/src/data/. Strips Wix runtime, rewrites Wix CDN URLs to local /assets/images/<hash> paths, downloads referenced media into builds/<domain>/public/assets/.
CLI flags worth knowing:
- --no-build — skip the npm install + astro build at the end; without it, dom-pipeline.js also runs the Phase 5 build itself
- --max-pages N — cap the crawl at N pages; useful for batch discovery runs that only need the homepage
Important footgun fixed this session: the cleanup regex that strips leftover /v1/fill/... Wix transformation suffixes used [^"\s<>]* to consume the suffix. On split-layout sections that meant it ate the closing ") and ; of CSS url(...) rules, leaving malformed CSS no parser could recover. The exclusion class now also includes ) and ;. The bug affected every site with split-layout bgMedia images.
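The failure mode is easy to reproduce in isolation. The sketch below is an assumed reconstruction of the two character classes, not the actual pattern in lib/dom-pipeline.js; the sample style string and the `abc123` asset name are made up for illustration.

```javascript
// Assumed reconstruction of the two patterns; the real regex in
// lib/dom-pipeline.js may be anchored or grouped differently.
const stripBefore = /\/v1\/fill\/[^"\s<>]*/g;  // old: stops only at ", whitespace, < or >
const stripAfter = /\/v1\/fill\/[^"\s<>);]*/g; // fixed: also stops at ) and ;

// Hypothetical inline style as it might appear on a split-layout bgMedia element.
const css =
  'background-image:url(/assets/images/abc123.jpg/v1/fill/w_980,h_400/photo.jpg);color:red';

const broken = css.replace(stripBefore, '');
const fixed = css.replace(stripAfter, '');

console.log(broken); // url( is never closed and the color declaration is gone
console.log(fixed);  // url(...) closes and the following declaration survives
```

With the old class the match runs to the end of the declaration list, so one bad image reference takes every later declaration with it.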
Phase 2 — Layout capture (lib/capture-layout.js)
Runs Playwright against the live URL and captures bounding-rect + computed-color information into layout.json. Used by matchers that need x/y/w/h (e.g. About's media-side detection, Hero's slideshow enumeration) and by capture of Wix Pro Gallery items / nav hover colours / slideshow images.
Output paths are still hardcoded to /home/admin/replatform/builds/<domain>/ — see known-issues.md for the build-dir-mismatch workaround the runner uses.
Phase 3 — Theme extraction (lib/theme-extractor.js, lib/theme-generator.js)
Walks the live page once for palette + fonts. Writes theme.json. Theme generator emits scoped theme.css (with !important rules to beat Astro-scoped component styles).
Known limitation: only extracts the dominant palette. Brand accents that appear only in CTAs (e.g. garvanbay's yellow, FCR-specific brand-tertiary colours) are not currently captured. This is a real signal gap surfaced by the visual critic.
Phase 4 — Assemble (lib/assembler-fulldev.js + helpers)
This is the heart of the pipeline. Five sub-stages run in order:
SECTION ENUMERATION builds the list of sections to process
↓
MATCHERS (lib/matchers/*) return { name, score, props } per section
↓
DEDUPE collapse outer+inner ancestor pairs of same matcher
↓
VARIANT-PICKER + TRANSLATORS route matched section → fulldev component + props
↓
EMIT write content.ts + index.astro + scaffold + global.css
Section enumeration is in lib/assembler-fulldev.js directly. It walks:
- the site <header> (a single element, tag: 'header')
- top-level [id^="pinned"] overlay layers (tag: 'pinned')
- top-level <section> elements with content (tag: 'section')
- the site <footer> (a single element, tag: 'footer')
Position is computed only over <section> elements so adding pinned layers doesn't displace Hero's position-0 expectation.
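A minimal sketch of that position rule, assuming the enumerated entries are plain objects in document order. The function name `assignPositions` and the `id` values are hypothetical; only the `tag` values come from the list above.

```javascript
// Sketch: position counts <section> entries only, so header, pinned and
// footer entries never shift the index that Hero expects to be 0.
function assignPositions(entries) {
  let nextSectionIndex = 0;
  return entries.map((entry) => ({
    ...entry,
    position: entry.tag === 'section' ? nextSectionIndex++ : null,
  }));
}

const enumerated = assignPositions([
  { tag: 'header', id: 'site-header' },
  { tag: 'pinned', id: 'pinned-top' },
  { tag: 'section', id: 'hero' },
  { tag: 'section', id: 'about' },
  { tag: 'footer', id: 'site-footer' },
]);

console.log(enumerated.map((e) => `${e.tag}:${e.position}`).join(' '));
```

Here the first `section` keeps position 0 even though two other entries precede it in the list.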
Phase 5/6 — Build + deploy
npx astro build then npx wrangler pages deploy dist --project-name <slug>-fulldev. Cloudflare Pages keeps every deployment; the master alias points to the latest. Each per-site deploy URL is captured in the converge / batch run logs.
Phase 7 — Final QA
lib/visual-critic.js runs Gemini against full-page screenshots and emits a structured-JSON report (per-region scores + atomic-decomposed issue list). Today this is the optional final pass — the goal in the new architecture is for the deterministic gates to catch most issues, leaving Gemini to see only cross-region / design-detail issues.
3. Matchers (lib/matchers/*.js)
Each matcher is a self-contained module with this shape:
module.exports = {
  name: 'About',
  component: 'About.astro',
  priority: 50, // higher wins ties
  match(ctx) { /* returns 0..1 */ },
  extract(ctx) { /* returns props */ },
};
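To make the "higher wins ties" comment concrete, here is a hedged sketch of how a dispatcher could choose between matchers for one section. It assumes the highest score wins and priority only breaks exact ties; the real dispatcher lives in lib/structure-matcher.js and may weigh things differently. `pickBestMatcher` and the stub matchers are hypothetical.

```javascript
// Hypothetical dispatcher sketch. Assumption: the highest score wins and
// priority is consulted only when two scores are exactly equal.
function pickBestMatcher(matchers, ctx) {
  let best = null;
  for (const matcher of matchers) {
    const score = matcher.match(ctx);
    if (!(score > 0)) continue; // skips 0, undefined and NaN
    const wins =
      best === null ||
      score > best.score ||
      (score === best.score && matcher.priority > best.matcher.priority);
    if (wins) best = { matcher, score };
  }
  if (best === null) return null;
  return { name: best.matcher.name, score: best.score, props: best.matcher.extract(ctx) };
}

// Two stub matchers that tie on score for the same section.
const stubMatchers = [
  { name: 'About', priority: 50, match: () => 0.6, extract: () => ({ heading: 'Get in touch' }) },
  { name: 'Contact', priority: 90, match: () => 0.6, extract: () => ({ heading: 'Get in touch' }) },
];
const winner = pickBestMatcher(stubMatchers, {});
console.log(winner.name); // Contact
```

The return value uses the { name, score, props } shape that the Phase 4 diagram attributes to matchers.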
Current matchers + responsibilities:
| Matcher | Priority | Fires on | Emits |
|---|---|---|---|
| TopBar | 900 | pinned layer with phone/email/social | phone, email, location, cta, social |
| Hero | 100 | first section with bg-media or text-only hero | heading, subtext, cta, phone, backgroundImage/Video, slideshowImages |
| Header | 100 | <header> tag | logo, nav, cta |
| Footer | 100 | <footer> tag | columns, social, companyName, logo |
| Contact | 90 | section with <textarea> or email-typed input | heading, fields, phone, email |
| FAQ | 85 | section with accordion or "frequently asked" heading | heading, items[] |
| ServiceGrid | 70 | section with 2+ images and 2+ internal hrefs | services[] (title, href, image, description) |
| InfoCards | 65 | similar to ServiceGrid but icons-only | cards[] |
| Testimonials | 60 | section with review structure | reviews[] |
| About | 50 | fallback for heading + paragraph + image sections | heading, text, image, images[], video, cta[] |
| ImageStrip | 55 | heading-less section with 2+ decorative images, low text | images[], heading? |
| ... |
Matchers must respect a minimal signal contract — see lib/verify-matcher.js for the assertions. Any extracted prop that uses a non-standard shape (scalar instead of array, label instead of text) shows up as a translator gap.
4. Variant-picker + translators (lib/assembler-fulldev/)
Files:
- variant-picker.js — small dispatcher: matcher name → translator function.
- translate.js — one translator per matcher. Maps matcher props → fulldev component props + slot HTML.
- dedupe.js — collapses ancestor/descendant pairs that match the same matcher.
- theme.js — converts theme.json palette to oklch CSS vars + font imports.
- scaffold.js — emits package.json, astro.config.mjs, tsconfig.json, components.json.
- emit.js — writes content.ts, index.astro, global.css. Copies Phase-1 public/assets/ into assembled-fulldev/public/.
The translator can return either a single { component, props, slot } block or an array of blocks. When About has 3+ images, translateAbout returns [content-1 (text), gallery-wcp (3 images)] — splits one source section into two rendered blocks. pickAll flattens arrays into the layoutMap.
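The single-block-or-array contract can be sketched as follows. This is an illustration under assumptions, not the code in translate.js: the block and component names come from the paragraph above, the prop names from the About row of the matcher table, and the shape of `pickAll` is a guess at how the flattening works.

```javascript
// Hypothetical translator: returns one block, or two when there are 3+ images.
function translateAbout(props) {
  const images = props.images || [];
  const textBlock = {
    component: 'content-1',
    props: { heading: props.heading, text: props.text, image: props.image },
    slot: '',
  };
  if (images.length < 3) return textBlock;
  return [textBlock, { component: 'gallery-wcp', props: { images }, slot: '' }];
}

// Assumed shape of pickAll: translate every match, then flatten one level.
// flatMap keeps a single returned block as-is and spreads a returned array.
function pickAll(matches, translators) {
  return matches.flatMap((match) => {
    const translate = translators[match.name];
    return translate ? translate(match.props) : [];
  });
}

const layoutMap = pickAll(
  [
    { name: 'About', props: { heading: 'Our work', text: 'Sample text', images: ['a.jpg', 'b.jpg', 'c.jpg'] } },
    { name: 'About', props: { heading: 'Our story', text: 'Sample text', image: 'd.jpg', images: ['d.jpg'] } },
  ],
  { About: translateAbout }
);
console.log(layoutMap.map((block) => block.component));
```

Two source sections become three rendered blocks, in source order.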
A subtle but real contract: the translator's job is to coerce drift between a matcher's emitted shape and the component's prop interface. When the bespoke Header.astro reads nav[].label and the assembler's header-wcp reads menus[].text, the translator does the rename. Verify-translator catches the case where the translator silently drops a signal (e.g. cta-wcp had no image prop, so props.backgroundImage from CTAStrip never made it through).
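The rename described here is small enough to show directly. The sketch assumes a hypothetical `translateHeader` and an `href` field on each nav item; only `nav[].label`, `menus[].text` and `header-wcp` are taken from the text.

```javascript
// Hypothetical translator for the Header matcher. The href field and the
// pass-through of logo and cta are assumptions for illustration.
function translateHeader(props) {
  return {
    component: 'header-wcp',
    props: {
      logo: props.logo,
      cta: props.cta,
      // Shape coercion: matcher emits nav[].label, component reads menus[].text.
      menus: (props.nav || []).map((item) => ({ text: item.label, href: item.href })),
    },
    slot: '',
  };
}

const headerBlock = translateHeader({
  logo: '/assets/images/logo.png',
  nav: [
    { label: 'Home', href: '/' },
    { label: 'Services', href: '/services' },
  ],
});
console.log(headerBlock.props.menus);
```

Keeping the rename in the translator means neither the matcher nor the component needs to know the other's naming.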
5. Components (lib/components-v3/)
lib/components-v3/
├── blocks/ 35 files — fulldev/ui blocks + 4 custom *-wcp variants
│ ├── header-1.astro (fulldev, inline logo-left/nav-right)
│ ├── header-wcp.astro (custom, stacked-centered logo+nav-below)
│ ├── hero-{1..7}.astro (fulldev hero variants)
│ ├── content-{1..4}.astro (fulldev content variants)
│ ├── features-{1..5}.astro
│ ├── faqs-1.astro
│ ├── footer-1.astro
│ ├── reviews-1.astro
│ ├── contact-1.astro
│ ├── cta-wcp.astro (custom — fulldev's cta-1 forces a testimonial)
│ ├── gallery-wcp.astro (custom — image grid + lightbox)
│ ├── topbar-wcp.astro (custom — fulldev's banner-1 too narrow)
│ └── ...
├── ui/ 30 files — fulldev primitives (Button, Card, NavigationMenu, ...)
└── lib/utils.ts cn() helper
Custom *-wcp.astro variants (4 total) cover patterns fulldev doesn't ship: contact-strip topbar, image grid with lightbox, plain CTA without testimonial, stacked-centered header. Each has a clear "reuse target" comment naming what % of FCR portfolio it covers — see ADR-0003 for the breakdown.
Components consume props with no awareness of which matcher emitted them. This is the seam the translator must respect.
6. The verifier perimeter (lib/verify-*.js)
The session's primary architectural learning. Each verifier is a deterministic gate between two pipeline phases. Failure routes to a specific layer's owner.
Phase-1 extracted body.html
↓
┌────────────────────────┐
│ lib/verify-matcher.js │ (matchers gap report)
└────────────────────────┘
↓
Phase-4 matched layoutMap (post-dedupe)
↓
┌──────────────────────────┐
│ lib/verify-translator.js │ (translator gap report)
└──────────────────────────┘
↓
Phase-4 emitted content.ts + index.astro
↓
Phase-5/6 build + deploy
↓
┌──────────────────────────┐
│ lib/verify-content.js │ (rendered DOM vs live, structural)
└──────────────────────────┘
↓
deployed pages.dev URL
↓
┌──────────────────────────────────┐
│ lib/verify-design.js │ (Gemini, one region per call)
└──────────────────────────────────┘
↓
┌──────────────────────────────────┐
│ TODO lib/verify-interactions.js │ (Playwright assertions)
└──────────────────────────────────┘
↓
┌────────────────────────┐
│ lib/visual-critic.js │ (Gemini, full-page QA)
└────────────────────────┘
What each verifier checks
| Verifier | Failure means | Routes to |
|---|---|---|
| verify-matcher | DOM signal not captured by the matching matcher | lib/matchers/<name>.js |
| verify-translator | matcher prop didn't survive translation OR variant doesn't accept it | lib/assembler-fulldev/translate.js or lib/components-v3/blocks/*.astro |
| verify-content | rendered HTML doesn't match live's section count / element types | usually a component that doesn't render a prop, sometimes a missing matcher |
| verify-design | colour / typography / spacing / hierarchy drift in one region | single owner field on each issue: lib/components-v3/blocks/<block>.astro or lib/assembler-fulldev/theme.js |
| verify-interactions (TODO) | dropdown/lightbox/scroll-reveal broken | component JS |
| visual-critic (final QA) | cross-region / hierarchy / flow issues | escalate, may require new matcher or block variant |
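As an illustration of the verify-translator row, a prop-flow check can be approximated by asking whether each non-empty string the matcher emitted still appears somewhere in the translated block. This is a sketch under assumptions and does not mirror the internals of lib/verify-translator.js; `findPropFlowLoss` and `stringLeaves` are hypothetical names.

```javascript
// Collect every string value nested anywhere inside props.
function stringLeaves(value, out = []) {
  if (typeof value === 'string') out.push(value);
  else if (Array.isArray(value)) value.forEach((v) => stringLeaves(v, out));
  else if (value && typeof value === 'object') Object.values(value).forEach((v) => stringLeaves(v, out));
  return out;
}

// Assumption: a scalar string signal "survived" if it shows up in the block's
// props or slot HTML. Arrays and objects are ignored to keep the sketch short.
function findPropFlowLoss(matcherProps, block) {
  const emitted = stringLeaves(block.props).concat(block.slot || '');
  return Object.entries(matcherProps)
    .filter(([, value]) => typeof value === 'string' && value !== '')
    .filter(([, value]) => !emitted.some((text) => text.includes(value)))
    .map(([prop]) => ({ type: 'prop-flow-loss', prop, component: block.component }));
}

// The cta-wcp case from section 4: backgroundImage never reaches the block.
const gaps = findPropFlowLoss(
  { heading: 'Get a quote', backgroundImage: '/assets/images/a1b2.jpg' },
  { component: 'cta-wcp', props: { heading: 'Get a quote' }, slot: '' }
);
console.log(gaps);
```

The check is value-based rather than name-based, so a legitimate rename in the translator does not register as a loss.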
Why the perimeter matters more than the convergence loop
Each gate's failure has one owner. When verify-matcher reports an image-extraction-gap, you know the bug is in the matcher's extract() function, not in CSS, theme, or rendering. When verify-translator reports a prop-flow-loss, you know it's the translator or component contract.
This contrasts with the LLM convergence loop, which had Gemini observe a symptom (e.g. "service section image missing") and dispatch the agent to fix something somewhere — frequently the wrong layer (CSS), causing regressions while the real bug (a regex) was untouched.
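Single-owner routing can be expressed as a lookup. The owner paths below are copied from the "Routes to" column; the gap-report fields (`gate`, `matcher`, `component`, `variantRejects`, `owner`) are assumed for illustration and are not the verifiers' actual output format.

```javascript
// Each gate resolves a gap to exactly one file path, or the route fails loudly.
const OWNER_BY_GATE = {
  'verify-matcher': (gap) => `lib/matchers/${gap.matcher}.js`,
  'verify-translator': (gap) =>
    gap.variantRejects
      ? `lib/components-v3/blocks/${gap.component}.astro`
      : 'lib/assembler-fulldev/translate.js',
  'verify-design': (gap) => gap.owner,
};

function routeGap(gap) {
  const resolve = OWNER_BY_GATE[gap.gate];
  if (!resolve) throw new Error(`no single owner for gate: ${gap.gate}`);
  return resolve(gap);
}

console.log(routeGap({ gate: 'verify-matcher', matcher: 'About' }));
console.log(routeGap({ gate: 'verify-translator', component: 'cta-wcp', variantRejects: true }));
```

A gap that cannot be resolved to one path is treated as an error rather than handed to a fixer with an open-ended scope.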
7. The LLM agent loop (built but secondary now)
Architecture committed in lib/visual-critic.js, lib/visual-fix-agent.js, lib/converge.js. End-to-end works; the convergence runner closes the loop critic → fix-agent → assemble → build → deploy → critic. The retro documents why this is not the primary intervention path.
When the LLM loop is the right tool:
- Final QA (visual-critic, scoped or full-page): finds drift the deterministic gates cannot encode.
- Catching brand colour mismatches the theme extractor missed.
- Cross-region hierarchy / flow issues.
When the LLM loop is the wrong tool:
- Anywhere a deterministic check can express the rule. "About should emit an image when the section has a bgMedia child with a background-image url" is a deterministic check, not a Gemini critique.
- Site-by-site CSS patches that don't generalise. Fix the matcher / library component once instead.
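The About rule quoted above is a good example of something that needs no model call. The sketch below uses a regex over the section HTML as a stand-in for real DOM traversal; the `id="bgMedia..."` convention and the report shape are assumptions, and `checkAboutImage` is a hypothetical name.

```javascript
// Sketch only: a real check would walk the parsed DOM instead of using a regex.
// Assumes the bgMedia element carries id="bgMedia..." before its style attribute.
const BG_MEDIA_WITH_URL =
  /<[^>]*\bid="bgMedia[^"]*"[^>]*\bstyle="[^"]*background-image\s*:\s*url\(([^)]+)\)/i;

function checkAboutImage(sectionHtml, aboutProps) {
  const found = BG_MEDIA_WITH_URL.exec(sectionHtml);
  if (!found) return { ok: true };          // no bgMedia signal, nothing to assert
  if (aboutProps.image) return { ok: true }; // signal present and captured
  return {
    ok: false,
    gap: 'image-extraction-gap',
    owner: 'lib/matchers/About.js',
    url: found[1],
  };
}

const sectionHtml =
  '<section><div id="bgMedia_comp-1" style="background-image:url(/assets/images/a1b2.jpg)"></div>' +
  '<h2>About us</h2><p>Sample text</p></section>';

console.log(checkAboutImage(sectionHtml, { heading: 'About us' }));
console.log(checkAboutImage(sectionHtml, { heading: 'About us', image: '/assets/images/a1b2.jpg' }));
```

The failing case names both the gap and the file that owns it, which is the property the LLM critique could not guarantee.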
8. The runner (scripts/run-fulldev-batch.sh)
The orchestration shell for batch operations. Reads a host list, runs:
quick-crawl → dom-pipeline (--no-build --max-pages 1) → capture-layout → theme-extractor → assembler-fulldev (--slug index)
per host, captures stdout/stderr/status.json, is resumable (skips hosts whose status records ok=true). Caches Phase-1 outputs aggressively — re-running with code changes only re-runs the assembler (and downstream).
Cache flag: pass --no-cache to force a full Playwright re-extract.
Designed to run on EC2 (cathals-demo) where the live-site fetches happen.
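The resume and cache rules can be summarised in a few lines. The real runner is a shell script; the JavaScript below only illustrates the decision logic, and the status shape ({ ok: boolean } per host), the `cached` flag and both function names are assumptions.

```javascript
// Stage names as listed in the pipeline line above.
const STAGES = ['quick-crawl', 'dom-pipeline', 'capture-layout', 'theme-extractor', 'assembler-fulldev'];

// Resumability: skip any host whose recorded status says ok=true.
function hostsToRun(hosts, statusByHost) {
  return hosts.filter((host) => !(statusByHost[host] && statusByHost[host].ok === true));
}

// Caching: with cached extraction outputs only the assembler re-runs,
// unless --no-cache forces the full chain.
function stagesFor({ cached, noCache }) {
  return cached && !noCache ? ['assembler-fulldev'] : STAGES;
}

const pending = hostsToRun(['a.example', 'b.example', 'c.example'], {
  'a.example': { ok: true },
  'b.example': { ok: false },
});
console.log(pending);
console.log(stagesFor({ cached: true, noCache: false }));
```

A host with no status file is treated the same as a failed one, so a first run and a resumed run share one code path.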
9. Operating principles (the rules we landed on)
- Build a verify gate before touching the LLM. If the rule can be expressed deterministically (DOM count, signal completeness, prop flow), encode it.
- Each gate's failure must route to one owner. "Fix something somewhere" is the failure mode the LLM loop kept hitting. Single-owner gates make blast radius small.
- Prefer fixing the matcher / library component before chasing CSS. Foundation fixes pay forward across the portfolio. Per-site CSS patches do not. The bgMedia-regex fix from this session likely helped 30-50% of FCR sites.
- Pre-decompose multi-file work into atomic single-file changes. When a fix legitimately needs a matcher signal AND a translator route AND a component prop, emit three issues (with shared parent_id so they're traceable to the same observation), not one.
- Don't trust the LLM's "looks fine" signal. Trust the deterministic gates' green/red. The convergence run hit "0 open issues" while the summary said "missing images, buttons, entire layout structures."
- The matcher's job is signal capture; the translator's job is shape coercion; the component's job is rendering. When in doubt about where to put logic, pick the layer that makes the contract obvious to the next developer.
- Treat the gate output as the source of truth, not the deploy. A clean verify-matcher.js + verify-translator.js is a stronger signal than a deploy that "looks OK": the deploy can hide regressions the verifiers catch.
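The pre-decomposition rule can be sketched as a function that turns one observation into one issue per file. Only `parent_id` is named in the text; the rest of the issue shape, the id scheme and the example file list are hypothetical.

```javascript
// One observation in, N single-file issues out, all sharing parent_id.
function decompose(observation, changes) {
  return changes.map((change, index) => ({
    id: `${observation.id}.${index + 1}`,
    parent_id: observation.id,
    file: change.file,
    summary: change.summary,
  }));
}

const issues = decompose(
  { id: 'obs-12', text: 'CTA strip background image missing' },
  [
    { file: 'lib/matchers/CTAStrip.js', summary: 'emit backgroundImage' },
    { file: 'lib/assembler-fulldev/translate.js', summary: 'route backgroundImage to the block' },
    { file: 'lib/components-v3/blocks/cta-wcp.astro', summary: 'accept and render an image prop' },
  ]
);
console.log(issues.map((issue) => `${issue.id} -> ${issue.file}`));
```

Each issue touches exactly one file, so a failed fix can be reverted without disturbing the other two.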
10. File-by-file reference
lib/
├── dom-pipeline.js Phase 1 — DOM extraction + image self-host
├── capture-layout.js Phase 2 — Playwright layout / palette snapshot
├── theme-extractor.js Phase 3 — palette + fonts → theme.json
├── theme-generator.js Phase 3 — theme.json → scoped theme.css
├── assembler-v3.js v3 (bespoke) assembler — for components/ output
├── assembler-fulldev.js Phase 4 — fulldev assembler entry point
├── assembler-fulldev/
│ ├── variant-picker.js Dispatch matcher.name → translator
│ ├── translate.js Per-matcher translators (Header, Hero, About, ...)
│ ├── dedupe.js Ancestor + signature dedupe
│ ├── scaffold.js Astro project files
│ ├── theme.js Palette + font CSS var emission
│ └── emit.js content.ts + index.astro + assets copy
│
├── matchers/ Per-section recogniser modules
│ ├── _helpers.js cleanText, isPhoneNumber, ...
│ ├── About.js Generic heading + paragraph + image fallback
│ ├── Hero.js First-section with bg-media or text-only hero
│ ├── Header.js <header> nav extraction
│ ├── Footer.js <footer> column extraction
│ ├── TopBar.js pinned overlay phone/email/social strip
│ ├── ServiceGrid.js Multi-image / multi-href service tiles
│ ├── InfoCards.js Icon-only service tiles
│ ├── FAQ.js Accordion / Q&A
│ ├── Testimonials.js Review cards
│ ├── CTAStrip.js Standalone heading + button banner
│ ├── Gallery.js Image grid (Pro Gallery)
│ ├── Contact.js Form-bearing contact section
│ ├── USPBar.js Pinned strip with bullet trust signals
│ ├── BookingWidget.js Wix booking embed
│ ├── LocationMap.js Map-bearing "find us" section
│ ├── VideoGallery.js Video carousel
│ └── FloatingSocial.js Pinned social-icon overlay
│
├── components-v3/
│ ├── blocks/ 35 fulldev blocks + 4 custom *-wcp
│ ├── ui/ 30 fulldev primitives
│ └── lib/utils.ts cn() helper
│
├── verify-matcher.js Gate 1 — matcher signal completeness
├── verify-translator.js Gate 2 — translator prop-flow + variant fit
├── verify-content.js Gate 3 — rendered DOM vs live structural diff
├── verify-design.js Gate 4 — per-region Gemini design diff
│
├── visual-critic.js LLM full-page diff with structured JSON
├── visual-fix-agent.js Per-issue Claude tool-use loop
├── converge.js LLM-loop orchestrator (deterministic gates supersede)
├── placement-check.js Predecessor of visual-critic (placement-only)
└── structure-matcher.js loadMatchers + matchAllSections — the matcher dispatcher
scripts/
├── run-fulldev-batch.sh Phase 1-4 orchestrator for batch + single-site runs
└── pull-from-ec2.sh Pulls Phase 1-3 artifacts from EC2 for local work
docs/pipeline/
├── README.md Original pipeline overview
├── known-issues.md Standing bug list, fixed items deleted not archived
├── retro-2026-05-05.md The architecture-pivot retro
├── architecture.md This file
├── phase3-batch-2026-05-04.md Sample-run artefact
└── decisions/ ADRs
├── 0001-component-library.md fulldev as foundation
├── 0002-primitive-neutrality.md primitive aesthetic fingerprint
└── 0003-assembler-fulldev.md assembler-fulldev design
11. Open architectural items (in priority order for next session)
- ~~Reduce the 35 unmatched-section count across the 20-site sample.~~ DONE 2026-05-05: verifier on EC2 confirms unmatched=0 across all 20 sites; total matcher gaps 67 → 51 (residual is image/cta extraction-completeness, the next layer of work).
  - Contact.js accepts a heading-led panel with tel/mailto OR a /contact* page link (score 0.7).
  - CTAStrip.js allows ONE content image (single-card service tile pattern), with alt="bgImage" treated as background; nested-bgMedia threshold raised to 2 (single-tile photos as CSS bg pass through, true split-layout still routes elsewhere).
  - About.js adds a last-resort fallback (0.15) for short text-only paragraph blocks (taglines, bios) that have no image / button / heading.
  - ImageStrip.js (priority 55) catches 2+ image decorative rows AND 1-image full-bleed banners (including section-bg-only payloads).
  - New FloatingCTA.js (priority 35) matches single-anchor pinned overlays; the translator no-ops scroll-to-top FABs and emits a deferred floating-cta-wcp block for real CTAs (ENQUIRE / Call us).
  - Translator updates: Contact surfaces props.phone / props.email in the contact-1 slot; ImageStrip routes to gallery-wcp via the Gallery translator.
- ~~Build verify-design.js~~ Built 2026-05-05. lib/verify-design.js takes a region label + per-side CSS selectors, screenshots that one region from LIVE and OURS, and asks Gemini for a per-region score (0..10) and atomic issues categorised as colour | typography | spacing | hierarchy. Each issue carries a single owner repo path (block component or theme.js) so the fix-agent's blast radius stays small. Exits 1 if score < 8 or any issues found.
- Build verify-interactions.js with Playwright assertions: hover dropdowns open, click images open lightbox, scroll-reveal triggers, video plays. Pass/fail per interaction.
- Section splitter so nested widget containers (Wix Blog feed inside FAQ section, USP strip inside hero section) become enumerated sections in their own right and pick up matchers other than the parent's.
- Theme accent extraction: capture brand-secondary / brand-yellow colours used on CTAs but not in section backgrounds. Gemini critic flagged this on garvanbay; theme-extractor doesn't.
- Build-dir mismatch resolution: dom-pipeline.js writes to ~/replatform-dashboard/builds/ but capture-layout.js and theme-extractor.js write to ~/replatform/builds/. The runner cp's between them. Pick one canonical BUILD_DIR and have every phase honour it.
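The verify-design exit rule in the list above (exit 1 if score < 8 or any issues found) can be written down as a small pure function. The report shape ({ score, issues }) and both function names are assumptions; the four categories and the single-owner requirement come from the item itself.

```javascript
const CATEGORIES = new Set(['colour', 'typography', 'spacing', 'hierarchy']);

// An issue is usable only if it has a known category and exactly one owner path.
function isAtomicIssue(issue) {
  return CATEGORIES.has(issue.category) && typeof issue.owner === 'string' && issue.owner.length > 0;
}

// Exit 1 if the region scores under the bar or if any issue was reported.
function designExitCode(report, minScore = 8) {
  const malformed = report.issues.some((issue) => !isAtomicIssue(issue));
  if (malformed) throw new Error('issue without a valid category or single owner');
  return report.score < minScore || report.issues.length > 0 ? 1 : 0;
}

console.log(designExitCode({ score: 9, issues: [] }));
console.log(
  designExitCode({
    score: 9,
    issues: [{ category: 'spacing', owner: 'lib/components-v3/blocks/hero-1.astro' }],
  })
);
```

Rejecting malformed issues up front keeps an unowned finding from reaching the fix-agent.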
12. What "done" looks like
For a single site:
- verify-matcher and verify-translator both report 0 gaps.
- verify-content reports only counting-artifact gaps (logical-image dedupe, no-heading sections).
- verify-design gives 8+/10 per region.
- verify-interactions (when built) all-pass.
- Final visual-critic full-page pass surfaces only minor cross-region issues.
For the portfolio:
- Same threshold across the 20-site sample (or a chosen statistical bar, e.g. 80% of sites at 0 matcher gaps, 90% at < 3 translator gaps).
- The batch runner can rebuild any site in ~30 seconds (cache hits) end-to-end.
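The example statistical bar can be checked mechanically once per-site gap counts exist. The sketch assumes a per-site report of the form { matcherGaps, translatorGaps }; the 80% and 90% thresholds are the example figures from the list above, not a settled target, and the sample data is invented.

```javascript
// Share of sites at 0 matcher gaps and share at fewer than 3 translator gaps.
function portfolioDone(sites) {
  const share = (predicate) => sites.filter(predicate).length / sites.length;
  const matcherClean = share((site) => site.matcherGaps === 0);
  const translatorNearClean = share((site) => site.translatorGaps < 3);
  return {
    matcherClean,
    translatorNearClean,
    done: matcherClean >= 0.8 && translatorNearClean >= 0.9,
  };
}

// Ten made-up sites: 8 with no matcher gaps, 9 with fewer than 3 translator gaps.
const sample = [
  ...Array.from({ length: 8 }, () => ({ matcherGaps: 0, translatorGaps: 0 })),
  { matcherGaps: 2, translatorGaps: 1 },
  { matcherGaps: 1, translatorGaps: 4 },
];
console.log(portfolioDone(sample));
```

An empty sample yields NaN shares and `done: false`, so the check cannot pass before any site has been verified.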
We are at "garvanbay clean at 0/0" today; the same fixes likely move multiple other sites toward that threshold without further work, but the verifiers will tell us exactly which.