Replatform pipeline architecture

Status: canonical reference for the v3 pipeline as of 2026-05-05. Goal of the system: rebuild ~1800 Wix small-business sites onto Astro + Cloudflare Pages with high visual fidelity, deterministically, at near-zero per-site LLM cost.

This document describes how the pipeline is organised today, why each piece exists, and the principles we landed on after a hard session of trying both an LLM-driven convergence loop and a deterministic verifier perimeter. Read retro-2026-05-05.md for the operational narrative; this file is the structural reference.

1. Pipeline at a glance

LIVE WIX SITE  ──►  PHASE 1  EXTRACT       (Playwright + cheerio rewrites)
                           │
                           ▼
                    PHASE 2  CAPTURE       (Playwright layout snapshots)
                           │
                           ▼
                    PHASE 3  THEME         (palette + font extraction)
                           │
                           ▼
                    PHASE 4  ASSEMBLE      (matchers → translators → emit)
                           │
                           ▼
                    PHASE 5  BUILD         (npm install + astro build)
                           │
                           ▼
                    PHASE 6  DEPLOY        (wrangler pages deploy)
                           │
                           ▼
                    PHASE 7  QA            (Gemini critic, optional)

Phases 1-3 produce the inputs for the assembler. Phase 4 is where most logic lives. Phases 5-6 are vendor commands. Phase 7 is the final verification step.

The crucial change over what was here at the start of this session: each phase now has a deterministic verifier gate sitting between it and the next phase. The LLM only sees what the deterministic gates can't catch.

2. Phase responsibilities and code map

Phase 1 — DOM extraction (`lib/dom-pipeline.js`)

Walks every page of the live Wix site via Playwright. For each page emits a clean <slug>.body.html and <slug>.head.html under builds/<domain>/src/data/. Strips Wix runtime, rewrites Wix CDN URLs to local /assets/images/<hash> paths, downloads referenced media into builds/<domain>/public/assets/.

CLI flags worth knowing:
- --no-build: skip the npm install + astro build at the end (Phase 5 happens here too if you want)
- --max-pages N: for batch discovery runs that only need the homepage

Important footgun fixed this session: the cleanup regex that strips leftover /v1/fill/... Wix transformation suffixes used [^"\s<>]* for its consumption. On split-layout sections that meant it ate the closing ") and ; of CSS url(...) rules, leaving malformed CSS no parser could recover. The exclusion class now includes ) and ;. Affected every site with split-layout bgMedia images.
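The failure mode is easy to reproduce in isolation. The sketch below is a hypothetical reconstruction: only the two character classes come from the description above, while the `/v1/fill/` prefix handling, the function name, and the sample markup are assumptions made for illustration.

```javascript
// Hypothetical reconstruction of the cleanup step. Only the character
// classes are taken from the text; everything else is illustrative.
const OLD_SUFFIX = /\/v1\/fill\/[^"\s<>]*/g;   // also consumes ")" and ";"
const NEW_SUFFIX = /\/v1\/fill\/[^"\s<>);]*/g; // stops at ")" and ";"

function stripFillSuffix(html, pattern) {
  return html.replace(pattern, '');
}

// Assumed sample: an unquoted url(...) inside a style attribute.
const sample =
  '<div style="background-image:url(/assets/images/abc.jpg/v1/fill/w_980,h_540/abc.jpg);background-size:cover"></div>';

const before = stripFillSuffix(sample, OLD_SUFFIX);
const after = stripFillSuffix(sample, NEW_SUFFIX);

console.log(before); // url( is never closed, background-size is lost
console.log(after);  // url(/assets/images/abc.jpg);background-size:cover
```

With the old class the match runs on until the attribute's closing quote, so the `)` and every declaration after it disappear. With the new class the match stops at the first `)`.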

Phase 2 — Layout capture (`lib/capture-layout.js`)

Runs Playwright against the live URL and captures bounding-rect + computed-colour information into layout.json. Matchers that need x/y/w/h read it (e.g. About's media-side detection, Hero's slideshow enumeration), and the same pass also records Wix Pro Gallery items, nav hover colours, and slideshow images.

Output paths are still hardcoded to /home/admin/replatform/builds/<domain>/ — see known-issues.md for the build-dir-mismatch workaround the runner uses.

Phase 3 — Theme extraction (`lib/theme-extractor.js`, `lib/theme-generator.js`)

Walks the live page once for palette + fonts. Writes theme.json. Theme generator emits scoped theme.css (with !important rules to beat Astro-scoped component styles).

Known limitation: only extracts the dominant palette. Brand accents that appear only in CTAs (e.g. garvanbay's yellow, FCR-specific brand-tertiary colours) are not currently captured. This is a real signal gap surfaced by the visual critic.

Phase 4 — Assemble (`lib/assembler-fulldev.js` + helpers)

This is the heart of the pipeline. Five sub-stages:

SECTION ENUMERATION          builds the list of sections to process
        ↓
MATCHERS (lib/matchers/*)    return { name, score, props } per section
        ↓
DEDUPE                       collapse outer+inner ancestor pairs of same matcher
        ↓
VARIANT-PICKER + TRANSLATORS route matched section → fulldev component + props
        ↓
EMIT                         write content.ts + index.astro + scaffold + global.css

Section enumeration is in lib/assembler-fulldev.js directly. It walks:
- the site <header> (one tag, tag: 'header')
- top-level [id^="pinned"] overlay layers (tag: 'pinned')
- top-level <section> elements with content (tag: 'section')
- the site <footer> (one tag, tag: 'footer')

Position is computed only over <section> elements so adding pinned layers doesn't displace Hero's position-0 expectation.
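A minimal sketch of that position rule, assuming each enumerated entry is a plain object with a `tag` field. The `id` values, the `position` field name, and the use of `null` for non-section entries are illustrative choices, not the assembler's actual shape.

```javascript
// Sketch only: assigns a position counter to <section> entries and
// leaves header / pinned / footer entries without one.
function assignPositions(enumerated) {
  let next = 0;
  return enumerated.map((entry) =>
    entry.tag === 'section'
      ? { ...entry, position: next++ }
      : { ...entry, position: null }
  );
}

const enumerated = [
  { tag: 'header', id: 'site-header' },
  { tag: 'pinned', id: 'pinned-topbar' },
  { tag: 'section', id: 'hero' },
  { tag: 'section', id: 'about' },
  { tag: 'footer', id: 'site-footer' },
];

const positioned = assignPositions(enumerated);
console.log(positioned.map((e) => `${e.tag}:${e.position}`).join(' '));
```

The first `<section>` keeps position 0 no matter how many pinned layers precede it, which is what a Hero matcher keyed on position 0 needs.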

Phase 5/6 — Build + deploy

npx astro build then npx wrangler pages deploy dist --project-name <slug>-fulldev. Cloudflare Pages keeps every deployment; the master alias points to the latest. Each per-site deploy URL is captured in the converge / batch run logs.

Phase 7 — Final QA

lib/visual-critic.js runs Gemini against full-page screenshots and emits a structured-JSON report (per-region scores + atomic-decomposed issue list). Today this is the optional final pass — the goal in the new architecture is for the deterministic gates to catch most issues, leaving Gemini to see only cross-region / design-detail issues.

3. Matchers (`lib/matchers/*.js`)

Each matcher is a self-contained module with this shape:

module.exports = {
  name: 'About',
  component: 'About.astro',
  priority: 50,                  // higher wins ties
  match(ctx) { /* returns 0..1 */ },
  extract(ctx) { /* returns props */ },
};
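To make the shape concrete, here is a hedged sketch of two toy matchers and a selection step that takes the highest score and uses `priority` to break ties, as the comment in the shape suggests. The `ctx` fields (`hasTextarea`, `heading`, `paragraphs`), the scores, and the `pickBestMatch` name are all assumptions; the real dispatcher lives in lib/structure-matcher.js and may differ.

```javascript
// Toy matchers following the module shape above. ctx fields are assumed.
const Contact = {
  name: 'Contact',
  component: 'Contact.astro',
  priority: 90,
  match(ctx) { return ctx.hasTextarea ? 0.9 : 0; },
  extract(ctx) { return { heading: ctx.heading }; },
};

const About = {
  name: 'About',
  component: 'About.astro',
  priority: 50,
  match(ctx) { return ctx.heading && ctx.paragraphs > 0 ? 0.9 : 0; },
  extract(ctx) { return { heading: ctx.heading }; },
};

// Hypothetical selection: best score wins, priority breaks exact ties.
function pickBestMatch(matchers, ctx) {
  let best = null;
  for (const matcher of matchers) {
    const score = matcher.match(ctx);
    if (score <= 0) continue;
    const wins =
      best === null ||
      score > best.score ||
      (score === best.score && matcher.priority > best.priority);
    if (wins) {
      best = {
        name: matcher.name,
        score,
        priority: matcher.priority,
        props: matcher.extract(ctx),
      };
    }
  }
  return best; // null means the section is unmatched
}

const formSection = { heading: 'Get in touch', paragraphs: 1, hasTextarea: true };
console.log(pickBestMatch([About, Contact], formSection).name); // Contact
```

Both toy matchers score 0.9 on the form section, so the result is decided by priority alone.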

Current matchers + responsibilities:

Matcher	Priority	Fires on	Emits
TopBar	900	pinned layer with phone/email/social	phone, email, location, cta, social
Hero	100	first section with bg-media or text-only hero	heading, subtext, cta, phone, backgroundImage/Video, slideshowImages
Header	100	`<header>` tag	logo, nav, cta
Footer	100	`<footer>` tag	columns, social, companyName, logo
Contact	90	section with `<textarea>` or email-typed input	heading, fields, phone, email
FAQ	85	section with accordion or "frequently asked" heading	heading, items[]
ServiceGrid	70	section with 2+ images and 2+ internal hrefs	services[] (title, href, image, description)
InfoCards	65	similar to ServiceGrid but icons-only	cards[]
Testimonials	60	section with review structure	reviews[]
About	50	fallback for heading + paragraph + image sections	heading, text, image, images[], video, cta[]
ImageStrip	55	heading-less section with 2+ decorative images, low text	images[], heading?
...

Matchers must respect a minimal signal contract — see lib/verify-matcher.js for the assertions. Any extracted prop that uses a non-standard shape (scalar instead of array, label instead of text) shows up as a translator gap.

4. Variant-picker + translators (`lib/assembler-fulldev/`)

Files:
- variant-picker.js: small dispatcher, matcher name → translator function.
- translate.js: one translator per matcher. Maps matcher props → fulldev component props + slot HTML.
- dedupe.js: collapses ancestor/descendant pairs that match the same matcher.
- theme.js: converts theme.json palette to oklch CSS vars + font imports.
- scaffold.js: emits package.json, astro.config.mjs, tsconfig.json, components.json.
- emit.js: writes content.ts, index.astro, global.css. Copies Phase-1 public/assets/ into assembled-fulldev/public/.

The translator can return either a single { component, props, slot } block or an array of blocks. When About has 3+ images, translateAbout returns [content-1 (text), gallery-wcp (3 images)] — splits one source section into two rendered blocks. pickAll flattens arrays into the layoutMap.
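The one-or-many return contract can be sketched as follows. The function names match the ones mentioned above, but their signatures, the prop names passed to content-1 and gallery-wcp, and the handling of sections with fewer than three images are assumptions for illustration.

```javascript
// Hypothetical translator: one block normally, two blocks at 3+ images.
function translateAbout(props) {
  const text = {
    component: 'content-1',
    props: { heading: props.heading, text: props.text },
    slot: '',
  };
  const images = props.images || [];
  if (images.length >= 3) {
    const gallery = { component: 'gallery-wcp', props: { images }, slot: '' };
    return [text, gallery];
  }
  return text;
}

// Hypothetical pickAll: dispatch by matcher name, then flatten so the
// layoutMap is always a flat list of blocks.
function pickAll(matched, translators) {
  return matched.flatMap((section) => {
    const translate = translators[section.name];
    if (!translate) return [];
    const result = translate(section.props);
    return Array.isArray(result) ? result : [result];
  });
}

const layoutMap = pickAll(
  [{ name: 'About', props: { heading: 'Our story', text: 'Since 1998.', images: ['a.jpg', 'b.jpg', 'c.jpg'] } }],
  { About: translateAbout }
);
console.log(layoutMap.map((block) => block.component)); // [ 'content-1', 'gallery-wcp' ]
```

Downstream stages then never need to know that two blocks came from one source section.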

A subtle but real contract: the translator's job is to coerce drift between a matcher's emitted shape and the component's prop interface. When the bespoke Header.astro reads nav[].label and the assembler's header-wcp reads menus[].text, the translator does the rename. Verify-translator catches the case where the translator silently drops a signal (e.g. cta-wcp had no image prop, so props.backgroundImage from CTAStrip never made it through).
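A hedged sketch of both halves of that contract: the rename itself, and a check for signals that never reach the component. The `nav[].label` and `menus[].text` names come from the paragraph above; `href`, the `findDroppedProps` helper, and its "consumed keys" input are hypothetical and not how verify-translator is necessarily implemented.

```javascript
// Hypothetical Header translator: coerces nav[].label into menus[].text.
function translateHeader(props) {
  return {
    component: 'header-wcp',
    props: {
      logo: props.logo,
      menus: (props.nav || []).map((item) => ({ text: item.label, href: item.href })),
    },
    slot: '',
  };
}

// Assumed rule: a matcher prop that is set but not consumed by the
// translator counts as a silently dropped signal.
function findDroppedProps(matcherProps, consumedKeys) {
  return Object.keys(matcherProps).filter(
    (key) => matcherProps[key] != null && !consumedKeys.includes(key)
  );
}

const header = translateHeader({
  logo: '/assets/images/logo.png',
  nav: [{ label: 'Home', href: '/' }],
});
console.log(header.props.menus); // [ { text: 'Home', href: '/' } ]

const dropped = findDroppedProps(
  { heading: 'Call us', backgroundImage: '/assets/images/bg.jpg' },
  ['heading']
);
console.log(dropped); // [ 'backgroundImage' ]
```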

5. Components (`lib/components-v3/`)

lib/components-v3/
├── blocks/                  35 files — fulldev/ui blocks + 4 custom *-wcp variants
│   ├── header-1.astro       (fulldev, inline logo-left/nav-right)
│   ├── header-wcp.astro     (custom, stacked-centered logo+nav-below)
│   ├── hero-{1..7}.astro    (fulldev hero variants)
│   ├── content-{1..4}.astro (fulldev content variants)
│   ├── features-{1..5}.astro
│   ├── faqs-1.astro
│   ├── footer-1.astro
│   ├── reviews-1.astro
│   ├── contact-1.astro
│   ├── cta-wcp.astro        (custom — fulldev's cta-1 forces a testimonial)
│   ├── gallery-wcp.astro    (custom — image grid + lightbox)
│   ├── topbar-wcp.astro     (custom — fulldev's banner-1 too narrow)
│   └── ...
├── ui/                      30 files — fulldev primitives (Button, Card, NavigationMenu, ...)
└── lib/utils.ts             cn() helper

Custom *-wcp.astro variants (4 total) cover patterns fulldev doesn't ship: contact-strip topbar, image grid with lightbox, plain CTA without testimonial, stacked-centered header. Each has a clear "reuse target" comment naming what % of FCR portfolio it covers — see ADR-0003 for the breakdown.

Components consume props with no awareness of which matcher emitted them. This is the seam the translator must respect.

6. The verifier perimeter (`lib/verify-*.js`)

The session's primary architectural learning. Each verifier is a deterministic gate between two pipeline phases. Failure routes to a specific layer's owner.

        Phase-1 extracted body.html
                ↓
            ┌────────────────────────┐
            │  lib/verify-matcher.js │   (matchers gap report)
            └────────────────────────┘
                ↓
        Phase-4 matched layoutMap (post-dedupe)
                ↓
            ┌──────────────────────────┐
            │  lib/verify-translator.js │  (translator gap report)
            └──────────────────────────┘
                ↓
        Phase-4 emitted content.ts + index.astro
                ↓
            Phase-5/6 build + deploy
                ↓
            ┌──────────────────────────┐
            │  lib/verify-content.js   │  (rendered DOM vs live, structural)
            └──────────────────────────┘
                ↓
        deployed pages.dev URL
                ↓
            ┌──────────────────────────────────┐
            │  lib/verify-design.js            │  (Gemini, one region per call)
            └──────────────────────────────────┘
                ↓
            ┌──────────────────────────────────┐
            │  TODO  lib/verify-interactions.js │ (Playwright assertions)
            └──────────────────────────────────┘
                ↓
            ┌────────────────────────┐
            │  lib/visual-critic.js  │  (Gemini, full-page QA)
            └────────────────────────┘

What each verifier checks

Verifier	Failure means	Routes to
verify-matcher	DOM signal not captured by the matching matcher	`lib/matchers/<name>.js`
verify-translator	matcher prop didn't survive translation OR variant doesn't accept it	`lib/assembler-fulldev/translate.js` or `lib/components-v3/blocks/*.astro`
verify-content	rendered HTML doesn't match live's section count / element types	usually a component that doesn't render a prop, sometimes a missing matcher
verify-design	colour / typography / spacing / hierarchy drift in one region	single `owner` field on each issue: `lib/components-v3/blocks/<block>.astro` or `lib/assembler-fulldev/theme.js`
verify-interactions (TODO)	dropdown/lightbox/scroll-reveal broken	component JS
visual-critic (final QA)	cross-region / hierarchy / flow issues	escalate, may require new matcher or block variant

Why the perimeter matters more than the convergence loop

Each gate's failure has one owner. When verify-matcher reports an image-extraction-gap, you know the bug is in the matcher's extract() function, not in CSS, theme, or rendering. When verify-translator reports a prop-flow-loss, you know it's the translator or component contract.
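Single-owner routing can be expressed as a lookup from verifier to file path. This is an illustrative sketch: the owner paths follow the table above, but the gap record fields (`verifier`, `matcher`, `block`, `kind`) and the `routeGap` name are assumptions.

```javascript
// Illustrative routing table; gap record fields are assumed.
const ROUTES = {
  'verify-matcher': (gap) => `lib/matchers/${gap.matcher}.js`,
  'verify-translator': (gap) =>
    gap.block
      ? `lib/components-v3/blocks/${gap.block}.astro`
      : 'lib/assembler-fulldev/translate.js',
};

function routeGap(gap) {
  const resolveOwner = ROUTES[gap.verifier];
  if (!resolveOwner) {
    // An unroutable failure is itself a bug: every gate needs one owner.
    throw new Error(`No owner route for verifier: ${gap.verifier}`);
  }
  return { ...gap, owner: resolveOwner(gap) };
}

const routed = routeGap({
  verifier: 'verify-matcher',
  matcher: 'About',
  kind: 'image-extraction-gap',
});
console.log(routed.owner); // lib/matchers/About.js
```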

This contrasts with the LLM convergence loop, which had Gemini observe a symptom (e.g. "service section image missing") and dispatch the agent to fix something somewhere — frequently the wrong layer (CSS), causing regressions while the real bug (a regex) was untouched.

7. The LLM agent loop (built but secondary now)

Architecture committed in lib/visual-critic.js, lib/visual-fix-agent.js, lib/converge.js. End-to-end works; the convergence runner closes the loop critic → fix-agent → assemble → build → deploy → critic. The retro documents why this is not the primary intervention path.

When the LLM loop is the right tool:
- Final QA (visual-critic, scoped or full-page): finds drift the deterministic gates cannot encode.
- Catching brand colour mismatches the theme extractor missed.
- Cross-region hierarchy / flow issues.

When the LLM loop is the wrong tool:
- Anywhere a deterministic check can express the rule. "About should emit an image when the section has a bgMedia child with a background-image url" is a deterministic check, not a Gemini critique.
- Site-by-site CSS patches that don't generalise. Fix the matcher / library component once instead.

8. The runner (`scripts/run-fulldev-batch.sh`)

The orchestration shell for batch operations. Reads a host list, runs:

quick-crawl → dom-pipeline (--no-build --max-pages 1) → capture-layout → theme-extractor → assembler-fulldev (--slug index)

The runner does this per host, captures stdout, stderr, and status.json, and is resumable: it skips hosts whose status records ok=true. It caches Phase-1 outputs aggressively, so re-running after code changes only re-runs the assembler (and downstream).
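The resume and cache rules reduce to two small decisions. The runner itself is a shell script, so this JavaScript sketch only models the logic; the status record shape (`{ ok: boolean }`), the function names, and the assumption that a cache hit skips every stage before the assembler are inferred from the description, not taken from the script.

```javascript
// Stage names as listed in the runner pipeline above.
const STAGES = [
  'quick-crawl',
  'dom-pipeline',
  'capture-layout',
  'theme-extractor',
  'assembler-fulldev',
];

// Resume rule: skip hosts whose recorded status has ok === true.
function hostsToRun(hosts, statusByHost) {
  return hosts.filter((host) => !(statusByHost[host] && statusByHost[host].ok === true));
}

// Cache rule (assumed): with cached extraction outputs and no --no-cache,
// only the assembler stage needs to run again.
function stagesToRun({ cached, noCache }) {
  if (noCache || !cached) return STAGES;
  return ['assembler-fulldev'];
}

const pending = hostsToRun(
  ['a.example', 'b.example', 'c.example'],
  { 'a.example': { ok: true }, 'b.example': { ok: false } }
);
console.log(pending); // [ 'b.example', 'c.example' ]
console.log(stagesToRun({ cached: true, noCache: false })); // [ 'assembler-fulldev' ]
```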

Cache flag: pass --no-cache to force a full Playwright re-extract.

Designed to run on EC2 (cathals-demo) where the live-site fetches happen.

9. Operating principles (the rules we landed on)

Build a verify gate before touching the LLM. If the rule can be expressed deterministically — DOM count, signal completeness, prop flow — encode it.
Each gate's failure must route to one owner. "Fix something somewhere" is the failure mode the LLM loop kept hitting. Single-owner gates make blast radius small.
Prefer fixing the matcher / library component before chasing CSS. Foundation fixes pay forward across the portfolio. Per-site CSS patches do not. The bgMedia-regex fix from this session likely helped 30-50% of FCR sites.
Pre-decompose multi-file work into atomic single-file changes. When a fix legitimately needs a matcher signal AND a translator route AND a component prop, emit three issues (with shared parent_id so they're traceable to the same observation), not one.
Don't trust the LLM's "looks fine" signal. Trust the deterministic gates' green/red. The convergence run hit "0 open issues" while the summary said "missing images, buttons, entire layout structures."
The matcher's job is signal capture; the translator's job is shape coercion; the component's job is rendering. When in doubt about where to put logic, pick the layer that makes the contract obvious to the next developer.
Treat the gate output as the source of truth, not the deploy. A clean verify-matcher.js + verify-translator.js is a stronger signal than a deploy that "looks OK" — the deploy can hide regressions the verifiers catch.
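The pre-decomposition rule above (three atomic issues with a shared parent_id) can be sketched like this. Only `parent_id` is named in the text; the other issue fields, the id format, and the example file paths are hypothetical.

```javascript
// Hypothetical issue shape: one issue per file, all pointing back to the
// same observation through parent_id.
function decompose(observation, changes) {
  return changes.map((change, index) => ({
    id: `${observation.id}-${index + 1}`,
    parent_id: observation.id,
    file: change.file,
    summary: change.summary,
  }));
}

const issues = decompose(
  { id: 'obs-12', summary: 'service section image missing' },
  [
    { file: 'lib/matchers/ServiceGrid.js', summary: 'capture the image signal' },
    { file: 'lib/assembler-fulldev/translate.js', summary: 'route the image prop' },
    { file: 'lib/components-v3/blocks/features-1.astro', summary: 'render the image prop' },
  ]
);
console.log(issues.map((issue) => `${issue.id} -> ${issue.file}`));
```

Each issue touches exactly one file, so a failed fix can be reverted without disturbing the other two.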

10. File-by-file reference

lib/
├── dom-pipeline.js              Phase 1 — DOM extraction + image self-host
├── capture-layout.js            Phase 2 — Playwright layout / palette snapshot
├── theme-extractor.js           Phase 3 — palette + fonts → theme.json
├── theme-generator.js           Phase 3 — theme.json → scoped theme.css
├── assembler-v3.js              v3 (bespoke) assembler — for components/ output
├── assembler-fulldev.js         Phase 4 — fulldev assembler entry point
├── assembler-fulldev/
│   ├── variant-picker.js        Dispatch matcher.name → translator
│   ├── translate.js             Per-matcher translators (Header, Hero, About, ...)
│   ├── dedupe.js                Ancestor + signature dedupe
│   ├── scaffold.js              Astro project files
│   ├── theme.js                 Palette + font CSS var emission
│   └── emit.js                  content.ts + index.astro + assets copy
│
├── matchers/                    Per-section recogniser modules
│   ├── _helpers.js              cleanText, isPhoneNumber, ...
│   ├── About.js                 Generic heading + paragraph + image fallback
│   ├── Hero.js                  First-section with bg-media or text-only hero
│   ├── Header.js                <header> nav extraction
│   ├── Footer.js                <footer> column extraction
│   ├── TopBar.js                pinned overlay phone/email/social strip
│   ├── ServiceGrid.js           Multi-image / multi-href service tiles
│   ├── InfoCards.js             Icon-only service tiles
│   ├── FAQ.js                   Accordion / Q&A
│   ├── Testimonials.js          Review cards
│   ├── CTAStrip.js              Standalone heading + button banner
│   ├── Gallery.js               Image grid (Pro Gallery)
│   ├── Contact.js               Form-bearing contact section
│   ├── USPBar.js                Pinned strip with bullet trust signals
│   ├── BookingWidget.js         Wix booking embed
│   ├── LocationMap.js           Map-bearing "find us" section
│   ├── VideoGallery.js          Video carousel
│   └── FloatingSocial.js        Pinned social-icon overlay
│
├── components-v3/
│   ├── blocks/                  35 fulldev blocks + 4 custom *-wcp
│   ├── ui/                      30 fulldev primitives
│   └── lib/utils.ts             cn() helper
│
├── verify-matcher.js            Gate 1 — matcher signal completeness
├── verify-translator.js         Gate 2 — translator prop-flow + variant fit
├── verify-content.js            Gate 3 — rendered DOM vs live structural diff
├── verify-design.js             Gate 4 — per-region Gemini design diff
│
├── visual-critic.js             LLM full-page diff with structured JSON
├── visual-fix-agent.js          Per-issue Claude tool-use loop
├── converge.js                  LLM-loop orchestrator (deterministic gates supersede)
├── placement-check.js           Predecessor of visual-critic (placement-only)
└── structure-matcher.js         loadMatchers + matchAllSections — the matcher dispatcher

scripts/
├── run-fulldev-batch.sh         Phase 1-4 orchestrator for batch + single-site runs
└── pull-from-ec2.sh             Pulls Phase 1-3 artifacts from EC2 for local work

docs/pipeline/
├── README.md                    Original pipeline overview
├── known-issues.md              Standing bug list, fixed items deleted not archived
├── retro-2026-05-05.md          The architecture-pivot retro
├── architecture.md              This file
├── phase3-batch-2026-05-04.md   Sample-run artefact
└── decisions/                   ADRs
    ├── 0001-component-library.md       fulldev as foundation
    ├── 0002-primitive-neutrality.md    primitive aesthetic fingerprint
    └── 0003-assembler-fulldev.md       assembler-fulldev design

11. Open architectural items (in priority order for next session)

~~Reduce the 35 unmatched-section count across the 20-site sample.~~ DONE 2026-05-05 — verifier on EC2 confirms unmatched=0 across all 20 sites; total matcher gaps 67 → 51 (residual is image/cta extraction-completeness, the next layer of work).
    - Contact.js accepts a heading-led panel with tel/mailto OR a /contact* page link (score 0.7).
    - CTAStrip.js allows ONE content image (single-card service tile pattern), with alt="bgImage" treated as background; nested-bgMedia threshold raised to 2 (single-tile photos as CSS bg pass through, true split-layout still routes elsewhere).
    - About.js adds a last-resort fallback (0.15) for short text-only paragraph blocks (taglines, bios) that have no image / button / heading.
    - ImageStrip.js (priority 55) catches 2+ image decorative rows AND 1-image full-bleed banners (including section-bg-only payloads).
    - New FloatingCTA.js (priority 35) matches single-anchor pinned overlays; the translator no-ops scroll-to-top FABs and emits a deferred floating-cta-wcp block for real CTAs (ENQUIRE / Call us).
    - Translator updates: Contact surfaces props.phone / props.email in the contact-1 slot; ImageStrip routes to gallery-wcp via the Gallery translator.
~~Build verify-design.js~~ Built 2026-05-05. lib/verify-design.js takes a region label + per-side CSS selectors, screenshots that one region from LIVE and OURS, and asks Gemini for a per-region score (0..10) and atomic issues categorised as colour | typography | spacing | hierarchy. Each issue carries a single owner repo path (block component or theme.js) so the fix-agent's blast radius stays small. Exits 1 if score < 8 or any issues found.
Build verify-interactions.js — Playwright assertions: hover dropdowns open, click images open lightbox, scroll-reveal triggers, video plays. Pass/fail per interaction.
Section splitter so nested widget containers (Wix Blog feed inside FAQ section, USP strip inside hero section) become enumerated sections in their own right and pick up matchers other than the parent's.
Theme accent extraction — capture brand-secondary / brand-yellow colours used on CTAs but not in section backgrounds. Gemini critic flagged this on garvanbay; theme-extractor doesn't.
Build-dir mismatch resolution — dom-pipeline.js writes to ~/replatform-dashboard/builds/ but capture-layout.js and theme-extractor.js write to ~/replatform/builds/. The runner cp's between them. Pick one canonical BUILD_DIR and have every phase honour it.
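One possible shape for the canonical BUILD_DIR mentioned in the last item, offered as a sketch rather than a decision. The `REPLATFORM_BUILD_DIR` variable name and the `resolveBuildDir` helper are hypothetical; the default mirrors the ~/replatform/builds location named above.

```javascript
const path = require('node:path');
const os = require('node:os');

// Hypothetical helper every phase could import so they agree on one root.
// REPLATFORM_BUILD_DIR is an assumed variable name.
function resolveBuildDir(domain, env = process.env) {
  const root =
    env.REPLATFORM_BUILD_DIR || path.join(os.homedir(), 'replatform', 'builds');
  return path.join(root, domain);
}

console.log(resolveBuildDir('example.com', { REPLATFORM_BUILD_DIR: '/tmp/builds' }));
```

If every phase resolved its paths this way, the runner would no longer need to copy between two build trees.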

12. What "done" looks like

For a single site:
- verify-matcher and verify-translator both report 0 gaps.
- verify-content reports only counting-artifact gaps (logical-image dedupe, no-heading sections).
- verify-design gives 8+/10 per region.
- verify-interactions (when built) all-pass.
- Final visual-critic full-page pass surfaces only minor cross-region issues.

For the portfolio:
- Same threshold across the 20-site sample (or a chosen statistical bar, e.g. 80% of sites at 0 matcher gaps, 90% at < 3 translator gaps).
- The batch runner can rebuild any site in ~30 seconds (cache hits) end-to-end.
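The example statistical bar can be checked mechanically. This sketch assumes each site's verifier output has been summarised into `matcherGaps` and `translatorGaps` counts; those field names and the `portfolioMeetsBar` function are illustrative, and the default thresholds are the example figures from the list above.

```javascript
// Default bar: 80% of sites at 0 matcher gaps, 90% at < 3 translator gaps.
const DEFAULT_BAR = { matcherShare: 0.8, translatorShare: 0.9, translatorMax: 3 };

function portfolioMeetsBar(sites, bar = DEFAULT_BAR) {
  if (sites.length === 0) return false; // nothing measured, nothing proven
  const share = (predicate) => sites.filter(predicate).length / sites.length;
  return (
    share((site) => site.matcherGaps === 0) >= bar.matcherShare &&
    share((site) => site.translatorGaps < bar.translatorMax) >= bar.translatorShare
  );
}

// Ten assumed sites: 8 have zero matcher gaps, 9 have fewer than 3 translator gaps.
const sample = [
  ...Array.from({ length: 8 }, () => ({ matcherGaps: 0, translatorGaps: 0 })),
  { matcherGaps: 2, translatorGaps: 1 },
  { matcherGaps: 1, translatorGaps: 5 },
];
console.log(portfolioMeetsBar(sample)); // true
```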

We are at "garvanbay clean at 0/0" today; the same fixes likely move multiple other sites toward that threshold without further work, but the verifiers will tell us exactly which.