
Replatform pipeline architecture

Status: canonical reference for the v3 pipeline as of 2026-05-05. Goal of the system: rebuild ~1800 Wix small-business sites onto Astro + Cloudflare Pages with high visual fidelity, deterministically, at near-zero per-site LLM cost.

This document describes how the pipeline is organised today, why each piece exists, and the principles we landed on after a hard session of trying both an LLM-driven convergence loop and a deterministic verifier perimeter. Read retro-2026-05-05.md for the operational narrative; this file is the structural reference.


1. Pipeline at a glance

LIVE WIX SITE  ──►  PHASE 1  EXTRACT       (Playwright + cheerio rewrites)
                    PHASE 2  CAPTURE       (Playwright layout snapshots)
                    PHASE 3  THEME         (palette + font extraction)
                    PHASE 4  ASSEMBLE      (matchers → translators → emit)
                    PHASE 5  BUILD         (npm install + astro build)
                    PHASE 6  DEPLOY        (wrangler pages deploy)
                    PHASE 7  QA            (Gemini critic, optional)

Phases 1-3 produce the inputs for the assembler. Phase 4 is where most logic lives. Phases 5-6 are vendor commands. Phase 7 is the final, optional QA pass.

The crucial change since the start of this session: a deterministic verifier gate now sits between each phase and the next. The LLM only sees what the deterministic gates can't catch.


2. Phase responsibilities and code map

Phase 1 — DOM extraction (lib/dom-pipeline.js)

Walks every page of the live Wix site via Playwright. For each page emits a clean <slug>.body.html and <slug>.head.html under builds/<domain>/src/data/. Strips Wix runtime, rewrites Wix CDN URLs to local /assets/images/<hash> paths, downloads referenced media into builds/<domain>/public/assets/.

CLI flags worth knowing:
- --no-build: skip the npm install + astro build at the end (without this flag, Phase 5 runs here as well)
- --max-pages N: limit the walk to N pages, for batch discovery runs that only need the homepage

Important footgun fixed this session: the cleanup regex that strips leftover /v1/fill/... Wix transformation suffixes consumed characters with [^"\s<>]*. On split-layout sections that class also matched the closing ) and ; of CSS url(...) rules, so the regex ate them and left malformed CSS no parser could recover. The exclusion class now includes ) and ;. The bug affected every site with split-layout bgMedia images.
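
A minimal sketch of that failure mode. Both patterns are reconstructions that assume the cleanup is anchored on /v1/fill/, and the sample style string is made up; the actual regex in lib/dom-pipeline.js may differ.

```javascript
// Hypothetical reconstruction of the suffix-stripping patterns; the real
// regex in lib/dom-pipeline.js may be anchored differently.
const before = /\/v1\/fill\/[^"\s<>]*/g;   // old exclusion class
const after = /\/v1\/fill\/[^"\s<>);]*/g;  // now also stops at ) and ;

// Made-up inline style: an unquoted url(...) followed by another declaration.
const style =
  'background-image:url(/assets/images/abc123.jpg/v1/fill/w_980,h_600/photo.jpg);color:red';

const brokenCss = style.replace(before, '');
const cleanCss = style.replace(after, '');

console.log(brokenCss); // background-image:url(/assets/images/abc123.jpg
console.log(cleanCss);  // background-image:url(/assets/images/abc123.jpg);color:red
```

With the old class the match runs to the end of the attribute, so the closing parenthesis and every following declaration are lost.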

Phase 2 — Layout capture (lib/capture-layout.js)

Runs Playwright against the live URL and captures bounding-rect + computed-color information into layout.json. Used by matchers that need x/y/w/h (e.g. About's media-side detection, Hero's slideshow enumeration) and by capture of Wix Pro Gallery items / nav hover colours / slideshow images.

Output paths are still hardcoded to /home/admin/replatform/builds/<domain>/ — see known-issues.md for the build-dir-mismatch workaround the runner uses.

Phase 3 — Theme extraction (lib/theme-extractor.js, lib/theme-generator.js)

Walks the live page once for palette + fonts. Writes theme.json. Theme generator emits scoped theme.css (with !important rules to beat Astro-scoped component styles).

Known limitation: only extracts the dominant palette. Brand accents that appear only in CTAs (e.g. garvanbay's yellow, FCR-specific brand-tertiary colours) are not currently captured. This is a real signal gap surfaced by the visual critic.

Phase 4 — Assemble (lib/assembler-fulldev.js + helpers)

This is the heart of the pipeline. Five sub-stages:

SECTION ENUMERATION          builds the list of sections to process
MATCHERS (lib/matchers/*)    return { name, score, props } per section
DEDUPE                       collapse outer+inner ancestor pairs of same matcher
VARIANT-PICKER + TRANSLATORS route matched section → fulldev component + props
EMIT                         write content.ts + index.astro + scaffold + global.css

Section enumeration is in lib/assembler-fulldev.js directly. It walks:
- the site <header> (one tag, tag: 'header')
- top-level [id^="pinned"] overlay layers (tag: 'pinned')
- top-level <section> elements with content (tag: 'section')
- the site <footer> (one tag, tag: 'footer')

Position is computed only over <section> elements so adding pinned layers doesn't displace Hero's position-0 expectation.
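
The position rule can be sketched as follows. The entry shape ({ tag, id }) and the withPositions helper are assumptions for illustration, not the code in lib/assembler-fulldev.js.

```javascript
// Entries mirror the four tags the enumeration emits; the ids are made up.
const enumerated = [
  { tag: 'header', id: 'site-header' },
  { tag: 'pinned', id: 'pinned-topbar' },
  { tag: 'section', id: 'hero' },
  { tag: 'section', id: 'about' },
  { tag: 'footer', id: 'site-footer' },
];

// Hypothetical helper: count <section> entries only, so header, pinned and
// footer entries never shift a section's position.
function withPositions(entries) {
  let next = 0;
  return entries.map((entry) => ({
    ...entry,
    position: entry.tag === 'section' ? next++ : null,
  }));
}

const positioned = withPositions(enumerated);
console.log(positioned.find((e) => e.id === 'hero').position); // 0
```

The pinned layer sits before the hero in document order, yet the hero still lands at position 0.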

Phase 5/6 — Build + deploy

npx astro build then npx wrangler pages deploy dist --project-name <slug>-fulldev. Cloudflare Pages keeps every deployment; the master alias points to the latest. Each per-site deploy URL is captured in the converge / batch run logs.

Phase 7 — Final QA

lib/visual-critic.js runs Gemini against full-page screenshots and emits a structured-JSON report (per-region scores + atomic-decomposed issue list). Today this is the optional final pass — the goal in the new architecture is for the deterministic gates to catch most issues, leaving Gemini to see only cross-region / design-detail issues.


3. Matchers (lib/matchers/*.js)

Each matcher is a self-contained module with this shape:

module.exports = {
  name: 'About',
  component: 'About.astro',
  priority: 50,                  // higher wins ties
  match(ctx) { /* returns 0..1 */ },
  extract(ctx) { /* returns props */ },
};

Current matchers + responsibilities:

| Matcher | Priority | Fires on | Emits |
|---|---|---|---|
| TopBar | 900 | pinned layer with phone/email/social | phone, email, location, cta, social |
| Hero | 100 | first section with bg-media or text-only hero | heading, subtext, cta, phone, backgroundImage/Video, slideshowImages |
| Header | 100 | <header> tag | logo, nav, cta |
| Footer | 100 | <footer> tag | columns, social, companyName, logo |
| Contact | 90 | section with <textarea> or email-typed input | heading, fields, phone, email |
| FAQ | 85 | section with accordion or "frequently asked" heading | heading, items[] |
| ServiceGrid | 70 | section with 2+ images and 2+ internal hrefs | services[] (title, href, image, description) |
| InfoCards | 65 | similar to ServiceGrid but icons-only | cards[] |
| Testimonials | 60 | section with review structure | reviews[] |
| About | 50 | fallback for heading + paragraph + image sections | heading, text, image, images[], video, cta[] |
| ImageStrip | 55 | heading-less section with 2+ decorative images, low text | images[], heading? |
| ... | | | |

Matchers must respect a minimal signal contract — see lib/verify-matcher.js for the assertions. Any extracted prop that uses a non-standard shape (scalar instead of array, label instead of text) shows up as a translator gap.
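
A sketch of how the module shape and the priority column combine when two matchers score the same section equally. The scoring rules, the ctx fields and the pickBest helper are invented for the example; the real dispatcher lives in lib/structure-matcher.js and may select differently.

```javascript
// Two toy matchers following the module shape above. The scoring rules are
// invented for this example and are not the real heuristics.
const About = {
  name: 'About',
  priority: 50,
  match: (ctx) => (ctx.imageCount >= 1 ? 0.5 : 0),
  extract: (ctx) => ({ heading: ctx.heading, images: ctx.images }),
};
const ImageStrip = {
  name: 'ImageStrip',
  priority: 55,
  match: (ctx) => (ctx.imageCount >= 2 && ctx.textLength < 40 ? 0.5 : 0),
  extract: (ctx) => ({ images: ctx.images }),
};

// Hypothetical selection: best score wins, higher priority breaks ties.
function pickBest(matchers, ctx) {
  const scored = matchers
    .map((matcher) => ({ matcher, score: matcher.match(ctx) }))
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score || b.matcher.priority - a.matcher.priority);
  if (scored.length === 0) return null;
  const { matcher, score } = scored[0];
  return { name: matcher.name, score, props: matcher.extract(ctx) };
}

const ctx = { heading: '', textLength: 10, imageCount: 3, images: ['a.jpg', 'b.jpg', 'c.jpg'] };
const best = pickBest([About, ImageStrip], ctx);
console.log(best.name); // ImageStrip
```

Both matchers score 0.5 here, so ImageStrip's priority of 55 beats About's 50.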


4. Variant-picker + translators (lib/assembler-fulldev/)

Files:
- variant-picker.js: small dispatcher, matcher name → translator function.
- translate.js: one translator per matcher. Maps matcher props → fulldev component props + slot HTML.
- dedupe.js: collapses ancestor/descendant pairs that match the same matcher.
- theme.js: converts theme.json palette to oklch CSS vars + font imports.
- scaffold.js: emits package.json, astro.config.mjs, tsconfig.json, components.json.
- emit.js: writes content.ts, index.astro, global.css. Copies Phase-1 public/assets/ into assembled-fulldev/public/.

The translator can return either a single { component, props, slot } block or an array of blocks. When About has 3+ images, translateAbout returns [content-1 (text), gallery-wcp (3 images)] — splits one source section into two rendered blocks. pickAll flattens arrays into the layoutMap.
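
A sketch of the one-or-many return and the flattening step. The names translateAbout and pickAll come from the text above, but the bodies, the prop names and the translators lookup table are assumptions.

```javascript
// Hypothetical translator: splits an About section with 3+ images into a
// text block and a gallery block. Prop names are assumed.
function translateAbout(props) {
  const text = {
    component: 'content-1',
    props: { heading: props.heading, text: props.text },
    slot: '',
  };
  const images = props.images || [];
  if (images.length < 3) return text;
  return [text, { component: 'gallery-wcp', props: { images }, slot: '' }];
}

// Sketch of the flattening step: a translator may return one block or many.
function pickAll(matched, translators) {
  return matched.flatMap((m) => {
    const out = translators[m.name](m.props);
    return Array.isArray(out) ? out : [out];
  });
}

const layoutMap = pickAll(
  [{ name: 'About', props: { heading: 'Our story', text: 'Hello', images: ['a.jpg', 'b.jpg', 'c.jpg'] } }],
  { About: translateAbout },
);
console.log(layoutMap.map((b) => b.component)); // [ 'content-1', 'gallery-wcp' ]
```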

A subtle but real contract: the translator's job is to coerce drift between a matcher's emitted shape and the component's prop interface. When the bespoke Header.astro reads nav[].label and the assembler's header-wcp reads menus[].text, the translator does the rename. verify-translator catches the case where the translator silently drops a signal (e.g. cta-wcp had no image prop, so props.backgroundImage from CTAStrip never made it through).
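
A sketch of both halves of that contract: the rename, and a toy version of the prop-flow check. translateHeader's body, the href field, and the droppedProps helper are assumptions; lib/verify-translator.js may work quite differently.

```javascript
// Hypothetical Header translator: renames the matcher's nav[].label shape
// to the menus[].text shape that header-wcp is described as reading.
function translateHeader(props) {
  return {
    component: 'header-wcp',
    props: {
      logo: props.logo,
      menus: (props.nav || []).map((item) => ({ text: item.label, href: item.href })),
    },
    slot: '',
  };
}

// Toy prop-flow check: list matcher props that carry a value but have no
// counterpart on the translated block. `renames` holds intentional key changes.
function droppedProps(matcherProps, block, renames = {}) {
  return Object.keys(matcherProps).filter((key) => {
    const target = renames[key] || key;
    return matcherProps[key] != null && block.props[target] == null;
  });
}

const matcherProps = {
  logo: '/assets/images/logo.png',
  nav: [{ label: 'Home', href: '/' }],
  cta: { label: 'Call us' },
};
const block = translateHeader(matcherProps);
console.log(droppedProps(matcherProps, block, { nav: 'menus' })); // [ 'cta' ]
```

The rename for nav is accounted for, so the only reported loss is cta, which this translator never forwards.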


5. Components (lib/components-v3/)

lib/components-v3/
├── blocks/                  35 files — fulldev/ui blocks + 4 custom *-wcp variants
│   ├── header-1.astro       (fulldev, inline logo-left/nav-right)
│   ├── header-wcp.astro     (custom, stacked-centered logo+nav-below)
│   ├── hero-{1..7}.astro    (fulldev hero variants)
│   ├── content-{1..4}.astro (fulldev content variants)
│   ├── features-{1..5}.astro
│   ├── faqs-1.astro
│   ├── footer-1.astro
│   ├── reviews-1.astro
│   ├── contact-1.astro
│   ├── cta-wcp.astro        (custom — fulldev's cta-1 forces a testimonial)
│   ├── gallery-wcp.astro    (custom — image grid + lightbox)
│   ├── topbar-wcp.astro     (custom — fulldev's banner-1 too narrow)
│   └── ...
├── ui/                      30 files — fulldev primitives (Button, Card, NavigationMenu, ...)
└── lib/utils.ts             cn() helper

Custom *-wcp.astro variants (4 total) cover patterns fulldev doesn't ship: contact-strip topbar, image grid with lightbox, plain CTA without testimonial, stacked-centered header. Each has a clear "reuse target" comment naming what percentage of the FCR portfolio it covers; see ADR-0003 for the breakdown.

Components consume props with no awareness of which matcher emitted them. This is the seam the translator must respect.


6. The verifier perimeter (lib/verify-*.js)

The session's primary architectural learning. Each verifier is a deterministic gate between two pipeline phases. Failure routes to a specific layer's owner.

        Phase-1 extracted body.html
            ┌────────────────────────┐
            │  lib/verify-matcher.js │   (matchers gap report)
            └────────────────────────┘
        Phase-4 matched layoutMap (post-dedupe)
            ┌──────────────────────────┐
            │  lib/verify-translator.js │  (translator gap report)
            └──────────────────────────┘
        Phase-4 emitted content.ts + index.astro
            Phase-5/6 build + deploy
            ┌──────────────────────────┐
            │  lib/verify-content.js   │  (rendered DOM vs live, structural)
            └──────────────────────────┘
        deployed pages.dev URL
            ┌──────────────────────────────────┐
            │  lib/verify-design.js            │  (Gemini, one region per call)
            └──────────────────────────────────┘
            ┌──────────────────────────────────┐
            │  TODO  lib/verify-interactions.js │ (Playwright assertions)
            └──────────────────────────────────┘
            ┌────────────────────────┐
            │  lib/visual-critic.js  │  (Gemini, full-page QA)
            └────────────────────────┘

What each verifier checks

| Verifier | Failure means | Routes to |
|---|---|---|
| verify-matcher | DOM signal not captured by the matching matcher | lib/matchers/<name>.js |
| verify-translator | matcher prop didn't survive translation OR variant doesn't accept it | lib/assembler-fulldev/translate.js or lib/components-v3/blocks/*.astro |
| verify-content | rendered HTML doesn't match live's section count / element types | usually a component that doesn't render a prop, sometimes a missing matcher |
| verify-design | colour / typography / spacing / hierarchy drift in one region | single owner field on each issue: lib/components-v3/blocks/<block>.astro or lib/assembler-fulldev/theme.js |
| verify-interactions (TODO) | dropdown/lightbox/scroll-reveal broken | component JS |
| visual-critic (final QA) | cross-region / hierarchy / flow issues | escalate, may require new matcher or block variant |

Why the perimeter matters more than the convergence loop

Each gate's failure has one owner. When verify-matcher reports an image-extraction-gap, you know the bug is in the matcher's extract() function, not in CSS, theme, or rendering. When verify-translator reports a prop-flow-loss, you know it's the translator or component contract.

This contrasts with the LLM convergence loop, which had Gemini observe a symptom (e.g. "service section image missing") and dispatch the agent to fix something somewhere — frequently the wrong layer (CSS), causing regressions while the real bug (a regex) was untouched.
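
A sketch of single-owner routing. The owners are copied from the table above; the report shape ({ gate, gaps }) and the stop-at-first-failure policy are assumptions for illustration.

```javascript
// Gates in pipeline order, each mapped to the layer that owns its failures.
const OWNERS = {
  'verify-matcher': 'lib/matchers/<name>.js',
  'verify-translator': 'lib/assembler-fulldev/translate.js or lib/components-v3/blocks/*.astro',
  'verify-content': 'component (usually) or missing matcher',
  'verify-design': 'lib/components-v3/blocks/<block>.astro or lib/assembler-fulldev/theme.js',
};

// Walk the gates in order and stop at the first one with gaps, so a failure
// is attributed to the earliest layer that can own it.
function firstFailure(reports) {
  for (const gate of Object.keys(OWNERS)) {
    const report = reports.find((r) => r.gate === gate);
    if (report && report.gaps.length > 0) {
      return { gate, owner: OWNERS[gate], gaps: report.gaps };
    }
  }
  return null;
}

const failure = firstFailure([
  { gate: 'verify-translator', gaps: ['prop-flow-loss: backgroundImage'] },
  { gate: 'verify-matcher', gaps: ['image-extraction-gap: About'] },
]);
console.log(failure.gate);  // verify-matcher
console.log(failure.owner); // lib/matchers/<name>.js
```

Even though the translator gate also reports a gap, the matcher gap is surfaced first because a missing signal upstream can cause the downstream loss.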


7. The LLM agent loop (built but secondary now)

Architecture committed in lib/visual-critic.js, lib/visual-fix-agent.js, lib/converge.js. End-to-end works; the convergence runner closes the loop critic → fix-agent → assemble → build → deploy → critic. The retro documents why this is not the primary intervention path.

When the LLM loop is the right tool:
- Final QA (visual-critic, scoped or full-page): finds drift the deterministic gates cannot encode.
- Catching brand colour mismatches the theme extractor missed.
- Cross-region hierarchy / flow issues.

When the LLM loop is the wrong tool:
- Anywhere a deterministic check can express the rule. "About should emit an image when the section has a bgMedia child with a background-image url" is a deterministic check, not a Gemini critique.
- Site-by-site CSS patches that don't generalise. Fix the matcher / library component once instead.
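
The quoted About rule can be written as a few lines of deterministic code. This is a sketch under stated assumptions: the inputs (section HTML as a string, extracted props as an object) and the aboutImageGap helper are hypothetical, and a real gate would more likely parse the DOM than use regexes.

```javascript
// Returns a gap record when the section carries a bgMedia background image
// but the About matcher emitted no image, otherwise null.
function aboutImageGap(sectionHtml, aboutProps) {
  const hasBgMediaUrl =
    /bgMedia/.test(sectionHtml) && /background-image\s*:\s*url\(/.test(sectionHtml);
  const emittedImage =
    Boolean(aboutProps.image) || (aboutProps.images || []).length > 0;
  return hasBgMediaUrl && !emittedImage
    ? { kind: 'image-extraction-gap', owner: 'lib/matchers/About.js' }
    : null;
}

// Made-up section markup for the example.
const html =
  '<section><div id="bgMedia_1" style="background-image:url(/assets/images/abc123.jpg)"></div>' +
  '<h2>About us</h2></section>';

const gap = aboutImageGap(html, { heading: 'About us' });
const noGap = aboutImageGap(html, { heading: 'About us', image: '/assets/images/abc123.jpg' });
console.log(gap.kind); // image-extraction-gap
console.log(noGap);    // null
```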


8. The runner (scripts/run-fulldev-batch.sh)

The orchestration shell for batch operations. Reads a host list, runs:

quick-crawl → dom-pipeline (--no-build --max-pages 1) → capture-layout → theme-extractor → assembler-fulldev (--slug index)

The chain runs once per host. The runner captures stdout, stderr and status.json, and is resumable: it skips hosts whose status records ok=true. Phase-1 outputs are cached aggressively, so re-running after code changes only re-runs the assembler (and downstream).
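
The resume rule, expressed in JavaScript for illustration only. The real runner is a shell script, and the status.json shape ({ ok: boolean }) and the hostsToRun helper are assumptions.

```javascript
// Hosts with no recorded status, or with ok !== true, are queued again.
function hostsToRun(hosts, statusByHost) {
  return hosts.filter((host) => {
    const status = statusByHost[host];
    return !(status && status.ok === true);
  });
}

const pending = hostsToRun(
  ['a.example', 'b.example', 'c.example'],
  { 'a.example': { ok: true }, 'b.example': { ok: false } },
);
console.log(pending); // [ 'b.example', 'c.example' ]
```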

Cache flag: pass --no-cache to force a full Playwright re-extract.

Designed to run on EC2 (cathals-demo) where the live-site fetches happen.


9. Operating principles (the rules we landed on)

  1. Build a verify gate before touching the LLM. If the rule can be expressed deterministically — DOM count, signal completeness, prop flow — encode it.

  2. Each gate's failure must route to one owner. "Fix something somewhere" is the failure mode the LLM loop kept hitting. Single-owner gates make blast radius small.

  3. Prefer fixing the matcher / library component before chasing CSS. Foundation fixes pay forward across the portfolio. Per-site CSS patches do not. The bgMedia-regex fix from this session likely helped 30-50% of FCR sites.

  4. Pre-decompose multi-file work into atomic single-file changes. When a fix legitimately needs a matcher signal AND a translator route AND a component prop, emit three issues (with shared parent_id so they're traceable to the same observation), not one.

  5. Don't trust the LLM's "looks fine" signal. Trust the deterministic gates' green/red. The convergence run hit "0 open issues" while the summary said "missing images, buttons, entire layout structures."

  6. The matcher's job is signal capture; the translator's job is shape coercion; the component's job is rendering. When in doubt about where to put logic, pick the layer that makes the contract obvious to the next developer.

  7. Treat the gate output as the source of truth, not the deploy. A clean verify-matcher.js + verify-translator.js is a stronger signal than a deploy that "looks OK" — the deploy can hide regressions the verifiers catch.
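
Principle 4 can be sketched as a small decomposition step. Only parent_id is named above; the other field names, the observation id and the decompose helper are assumptions.

```javascript
// Split one observation into single-file issues that share a parent_id,
// so each issue can be fixed and verified on its own.
function decompose(observation, files) {
  return files.map((file, index) => ({
    id: `${observation.id}.${index + 1}`,
    parent_id: observation.id,
    file,
    summary: observation.summary,
  }));
}

const issues = decompose(
  { id: 'obs-12', summary: 'CTA strip background image missing' },
  [
    'lib/matchers/CTAStrip.js',
    'lib/assembler-fulldev/translate.js',
    'lib/components-v3/blocks/cta-wcp.astro',
  ],
);
console.log(issues.length);                                   // 3
console.log(issues.every((i) => i.parent_id === 'obs-12'));   // true
```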


10. File-by-file reference

lib/
├── dom-pipeline.js              Phase 1 — DOM extraction + image self-host
├── capture-layout.js            Phase 2 — Playwright layout / palette snapshot
├── theme-extractor.js           Phase 3 — palette + fonts → theme.json
├── theme-generator.js           Phase 3 — theme.json → scoped theme.css
├── assembler-v3.js              v3 (bespoke) assembler — for components/ output
├── assembler-fulldev.js         Phase 4 — fulldev assembler entry point
├── assembler-fulldev/
│   ├── variant-picker.js        Dispatch matcher.name → translator
│   ├── translate.js             Per-matcher translators (Header, Hero, About, ...)
│   ├── dedupe.js                Ancestor + signature dedupe
│   ├── scaffold.js              Astro project files
│   ├── theme.js                 Palette + font CSS var emission
│   └── emit.js                  content.ts + index.astro + assets copy
├── matchers/                    Per-section recogniser modules
│   ├── _helpers.js              cleanText, isPhoneNumber, ...
│   ├── About.js                 Generic heading + paragraph + image fallback
│   ├── Hero.js                  First-section with bg-media or text-only hero
│   ├── Header.js                <header> nav extraction
│   ├── Footer.js                <footer> column extraction
│   ├── TopBar.js                pinned overlay phone/email/social strip
│   ├── ServiceGrid.js           Multi-image / multi-href service tiles
│   ├── InfoCards.js             Icon-only service tiles
│   ├── FAQ.js                   Accordion / Q&A
│   ├── Testimonials.js          Review cards
│   ├── CTAStrip.js              Standalone heading + button banner
│   ├── Gallery.js               Image grid (Pro Gallery)
│   ├── Contact.js               Form-bearing contact section
│   ├── USPBar.js                Pinned strip with bullet trust signals
│   ├── BookingWidget.js         Wix booking embed
│   ├── LocationMap.js           Map-bearing "find us" section
│   ├── VideoGallery.js          Video carousel
│   └── FloatingSocial.js        Pinned social-icon overlay
├── components-v3/
│   ├── blocks/                  35 fulldev blocks + 4 custom *-wcp
│   ├── ui/                      30 fulldev primitives
│   └── lib/utils.ts             cn() helper
├── verify-matcher.js            Gate 1 — matcher signal completeness
├── verify-translator.js         Gate 2 — translator prop-flow + variant fit
├── verify-content.js            Gate 3 — rendered DOM vs live structural diff
├── verify-design.js             Gate 4 — per-region Gemini design diff
├── visual-critic.js             LLM full-page diff with structured JSON
├── visual-fix-agent.js          Per-issue Claude tool-use loop
├── converge.js                  LLM-loop orchestrator (deterministic gates supersede)
├── placement-check.js           Predecessor of visual-critic (placement-only)
└── structure-matcher.js         loadMatchers + matchAllSections — the matcher dispatcher

scripts/
├── run-fulldev-batch.sh         Phase 1-4 orchestrator for batch + single-site runs
└── pull-from-ec2.sh             Pulls Phase 1-3 artifacts from EC2 for local work

docs/pipeline/
├── README.md                    Original pipeline overview
├── known-issues.md              Standing bug list, fixed items deleted not archived
├── retro-2026-05-05.md          The architecture-pivot retro
├── architecture.md              This file
├── phase3-batch-2026-05-04.md   Sample-run artefact
└── decisions/                   ADRs
    ├── 0001-component-library.md       fulldev as foundation
    ├── 0002-primitive-neutrality.md    primitive aesthetic fingerprint
    └── 0003-assembler-fulldev.md       assembler-fulldev design

11. Open architectural items (in priority order for next session)

  1. ~~Reduce the 35 unmatched-section count across the 20-site sample.~~ DONE 2026-05-05 — verifier on EC2 confirms unmatched=0 across all 20 sites; total matcher gaps 67 → 51 (residual is image/cta extraction-completeness, the next layer of work).
  2. Contact.js accepts a heading-led panel with tel/mailto OR a /contact* page link (score 0.7).
  3. CTAStrip.js allows ONE content image (single-card service tile pattern), with alt="bgImage" treated as background; nested-bgMedia threshold raised to 2 (single-tile photos as CSS bg pass through, true split-layout still routes elsewhere).
  4. About.js adds a last-resort fallback (0.15) for short text-only paragraph blocks (taglines, bios) that have no image / button / heading.
  5. ImageStrip.js (priority 55) catches 2+ image decorative rows AND 1-image full-bleed banners (including section-bg-only payloads).
  6. New FloatingCTA.js (priority 35) matches single-anchor pinned overlays; the translator no-ops scroll-to-top FABs and emits a deferred floating-cta-wcp block for real CTAs (ENQUIRE / Call us).
  7. Translator updates: Contact surfaces props.phone / props.email in the contact-1 slot; ImageStrip routes to gallery-wcp via the Gallery translator.

  8. ~~Build verify-design.js~~ Built 2026-05-05. lib/verify-design.js takes a region label + per-side CSS selectors, screenshots that one region from LIVE and OURS, and asks Gemini for a per-region score (0..10) and atomic issues categorised as colour | typography | spacing | hierarchy. Each issue carries a single owner repo path (block component or theme.js) so the fix-agent's blast radius stays small. Exits 1 if score < 8 or any issues found.

  9. Build verify-interactions.js — Playwright assertions: hover dropdowns open, click images open lightbox, scroll-reveal triggers, video plays. Pass/fail per interaction.

  10. Section splitter so nested widget containers (Wix Blog feed inside FAQ section, USP strip inside hero section) become enumerated sections in their own right and pick up matchers other than the parent's.

  11. Theme accent extraction — capture brand-secondary / brand-yellow colours used on CTAs but not in section backgrounds. Gemini critic flagged this on garvanbay; theme-extractor doesn't.

  12. Build-dir mismatch resolution: dom-pipeline.js writes to ~/replatform-dashboard/builds/ but capture-layout.js and theme-extractor.js write to ~/replatform/builds/. The runner cp's between them. Pick one canonical BUILD_DIR and have every phase honour it.
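
One possible shape for the BUILD_DIR fix in item 12, as a sketch only. Reading BUILD_DIR from the environment, the fallback under HOME, and the buildDirFor helper are all assumptions; none of this exists in the repo yet.

```javascript
// Hypothetical helper that every phase would call instead of hardcoding a
// path. Assumes POSIX-style paths, as on the EC2 host.
function buildDirFor(domain, env = process.env) {
  const root = (env.BUILD_DIR || `${env.HOME}/replatform/builds`).replace(/\/+$/, '');
  return `${root}/${domain}`;
}

console.log(buildDirFor('example.com', { BUILD_DIR: '/srv/builds/' }));
// /srv/builds/example.com
console.log(buildDirFor('example.com', { HOME: '/home/admin' }));
// /home/admin/replatform/builds/example.com
```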


12. What "done" looks like

For a single site:
- verify-matcher and verify-translator both report 0 gaps.
- verify-content reports only counting-artifact gaps (logical-image dedupe, no-heading sections).
- verify-design gives 8+/10 per region.
- verify-interactions (when built) all-pass.
- Final visual-critic full-page pass surfaces only minor cross-region issues.

For the portfolio:
- Same threshold across the 20-site sample (or a chosen statistical bar, e.g. 80% of sites at 0 matcher gaps, 90% at < 3 translator gaps).
- The batch runner can rebuild any site end-to-end in ~30 seconds (cache hits).
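
The example statistical bar can be checked mechanically. A sketch, assuming a per-site summary of the form { site, matcherGaps, translatorGaps }; the portfolioBar helper and the sample numbers are made up.

```javascript
// Share of sites at 0 matcher gaps must reach 80%, and share of sites with
// fewer than 3 translator gaps must reach 90%.
function portfolioBar(sites) {
  const share = (pred) => sites.filter(pred).length / sites.length;
  const matcherClean = share((s) => s.matcherGaps === 0);
  const translatorNearClean = share((s) => s.translatorGaps < 3);
  return {
    matcherClean,
    translatorNearClean,
    pass: matcherClean >= 0.8 && translatorNearClean >= 0.9,
  };
}

// Made-up sample: 4 of 5 sites have no matcher gaps, all have < 3 translator gaps.
const bar = portfolioBar([
  { site: 'site-a', matcherGaps: 0, translatorGaps: 0 },
  { site: 'site-b', matcherGaps: 0, translatorGaps: 1 },
  { site: 'site-c', matcherGaps: 0, translatorGaps: 2 },
  { site: 'site-d', matcherGaps: 0, translatorGaps: 0 },
  { site: 'site-e', matcherGaps: 2, translatorGaps: 1 },
]);
console.log(bar.pass); // true
```

An empty sample yields NaN shares and pass: false, so callers should guard against running this before any site has been verified.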

We are at "garvanbay clean at 0/0" today; the same fixes likely move multiple other sites toward that threshold without further work, but the verifiers will tell us exactly which.