Replatform Pipeline v3

Migrates Wix sites to Cloudflare Pages using a deterministic, IP-safe assembler. Replaces v2 (Claude-in-the-hot-path).

1. DOM EXTRACTION      lib/dom-pipeline.js       Playwright → raw body.html per page + assets
2. LAYOUT CAPTURE      lib/capture-layout.js     Playwright → layout.json (comp x/y/w/h, slideshow imgs, Pro Gallery items)
3. THEME EXTRACTION    lib/theme-extractor.js    Playwright → theme.json (fonts, colors, dims)
4. ASSEMBLE            lib/assembler-v3.js       cheerio + matchers → *.astro page (deterministic)
   ├─ structure-matcher    lib/structure-matcher.js   picks best matcher per section
   ├─ matchers/*.js        lib/matchers/              17 hand-authored recognisers
   └─ theme-generator      lib/theme-generator.js     emits scoped theme.css from theme.json
5. BUILD + DEPLOY      npm run build + wrangler  CF Pages
6. GEMINI QA           lib/placement-check.js    placement-focused visual audit (report only)

Why v3

v2 had Claude generate the .astro page from sections.json. Problems:

- $0.02/page × thousands of pages added up
- Claude occasionally emitted invalid JSX; needed post-processing
- Not deterministic: same input, different output between runs
- Claude invented content (SEO meta, colours) instead of using live values

v3 pivot (Path A, 2026-04-12 session): hand-authored matchers + deterministic template stitch. Claude stays available as fallback for unmatched sections but the hot path is $0/page and fully reproducible.

Strategy: match first to prove visual fidelity, then re-engineer. The component library is hand-authored and will be modernised over time (vertical tabs, AI chatbots, etc.) — matching existing Wix structures is only the proof step to confirm we're not losing content.

Matcher pattern

Each matcher in lib/matchers/ is a single module:

module.exports = {
  name: 'About',
  component: 'About.astro',
  priority: 50,              // higher wins ties
  match(ctx) {               // returns 0..1 confidence
    // ctx: { $, $el, tag, position, compLayout, computedLayout, siteLayout, ... }
    return 0.4;
  },
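  extract(ctx) {             // returns props for the Astro component
    // Illustrative only: assumes ctx.$el is the section's cheerio wrapper.
    const heading = ctx.$el.find('h1, h2, h3').first().text().trim();
    const text = ctx.$el.find('p').first().text().trim();
    return { heading, text, imagePosition: 'left' };
  }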
};

Current matchers: Hero, TopBar, Header, Footer, About, ServiceGrid, InfoCards, FAQ, CTAStrip, Testimonials, Gallery, VideoGallery, LocationMap, USPBar, FloatingSocial, ContactForm, BookingWidget.
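
A minimal sketch of how structure-matcher.js might choose a matcher for one section, assuming it takes the highest match() confidence and lets priority break ties, as the comments in the module shape suggest. The pickMatcher name and the minConfidence cut-off are hypothetical and not taken from the pipeline source.

```javascript
// Hypothetical selection step: highest confidence wins, priority breaks ties.
// minConfidence is an assumed cut-off below which a section counts as unmatched.
function pickMatcher(matchers, ctx, minConfidence = 0.3) {
  let best = null;
  for (const matcher of matchers) {
    const confidence = matcher.match(ctx);
    if (confidence < minConfidence) continue;
    const better =
      best === null ||
      confidence > best.confidence ||
      (confidence === best.confidence && matcher.priority > best.matcher.priority);
    if (better) best = { matcher, confidence };
  }
  return best; // null means no matcher claimed the section
}

// Usage with two stub matchers that report the same confidence.
const stubs = [
  { name: 'About', priority: 50, match: () => 0.4 },
  { name: 'Hero', priority: 80, match: () => 0.4 },
];
const picked = pickMatcher(stubs, {});
console.log(picked.matcher.name); // Hero
```

A null result is where the Claude fallback for unmatched sections would plausibly take over.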

Key techniques learned

Layout-driven decisions beat DOM-only heuristics. Use computedLayout.comps[cid].x (from Playwright) to decide image/text side instead of guessing from document order. See About.js detectedPosition logic — counts image and paragraph comp x-centers vs section centre, picks majority side.
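
The majority vote could be sketched as follows. This is an illustration under assumptions, not the code in About.js: comps are assumed to carry x and w in pixels, a paragraph on one side is assumed to count as a vote for media on the opposite side, and detectMediaSide and its argument shapes are hypothetical.

```javascript
// Hypothetical helper: compare each comp's x-centre with the section centre
// and let images and paragraphs vote on which side the media sits.
function detectMediaSide(imageComps, textComps, sectionBox) {
  const centre = sectionBox.x + sectionBox.w / 2;
  const xCentre = (comp) => comp.x + comp.w / 2;
  const votes = { left: 0, right: 0 };
  for (const comp of imageComps) {
    // An image left of centre votes for media-left.
    votes[xCentre(comp) < centre ? 'left' : 'right'] += 1;
  }
  for (const comp of textComps) {
    // A paragraph left of centre implies the media is on the right.
    votes[xCentre(comp) < centre ? 'right' : 'left'] += 1;
  }
  if (votes.left === votes.right) return null; // no majority
  return votes.left > votes.right ? 'left' : 'right';
}

const side = detectMediaSide(
  [{ x: 0, w: 400 }],
  [{ x: 520, w: 420 }, { x: 520, w: 420 }],
  { x: 0, w: 980 }
);
console.log(side); // left
```

Returning null on a tie leaves the caller free to fall back to a default such as imagePosition: 'left'.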

Slideshow support. capture-layout.js enumerates slideshow slides by clicking the next button in a loop and records each slide's image URL. Hero.js maps those URLs to the local /assets/images/<filename> that Phase 1 downloaded, sets backgroundImages: string[] on the Hero component, and forces contentPosition='center' + overlay for a proper hero banner.
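
As a rough sketch of the URL mapping, assuming Phase 1 saves each image under the final path segment of its remote URL (the real naming rule may differ). The toLocalAsset and buildHeroProps names, and overlay being a boolean, are hypothetical.

```javascript
// Assumption: the local filename equals the last path segment of the remote URL.
function toLocalAsset(remoteUrl) {
  const segments = new URL(remoteUrl).pathname.split('/').filter(Boolean);
  const filename = decodeURIComponent(segments[segments.length - 1]);
  return `/assets/images/${filename}`;
}

// Builds the slideshow-related Hero props from the captured slide URLs.
function buildHeroProps(slideUrls) {
  // Drop duplicates in case the click loop wraps around to the first slide.
  const backgroundImages = [...new Set(slideUrls.map(toLocalAsset))];
  return { backgroundImages, contentPosition: 'center', overlay: true };
}

const heroProps = buildHeroProps([
  'https://example.com/media/a.jpg',
  'https://example.com/media/b.jpg?x=1',
  'https://example.com/media/a.jpg',
]);
console.log(heroProps.backgroundImages);
// [ '/assets/images/a.jpg', '/assets/images/b.jpg' ]
```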

Ancestor dedupe only when same matcher. Wix sections nest arbitrarily. If both outer and inner match the same matcher (e.g. both About with the same heading), drop the ancestor and keep the deeper/more-specific child. If outer matches Hero and inner matches About, keep both — they're semantically distinct. See assembler-v3.js dedupe block.
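
One way to express that rule, sketched over plain records rather than the real assembler-v3.js data. Each entry is assumed to look like { id, parentId, matcher }, where parentId points at the nearest matched ancestor (or null); these field names and dedupeAncestors itself are hypothetical.

```javascript
// Drop an ancestor only when a descendant matched the same matcher.
function dedupeAncestors(entries) {
  const byId = new Map(entries.map((e) => [e.id, e]));
  const dropped = new Set();
  for (const entry of entries) {
    // Walk up the chain of matched ancestors for this entry.
    let parent = byId.get(entry.parentId);
    while (parent) {
      if (parent.matcher === entry.matcher) dropped.add(parent.id);
      parent = byId.get(parent.parentId);
    }
  }
  return entries.filter((e) => !dropped.has(e.id));
}

const nested = [
  { id: 'outer', parentId: null, matcher: 'About' },
  { id: 'inner', parentId: 'outer', matcher: 'About' },
  { id: 'hero', parentId: null, matcher: 'Hero' },
  { id: 'heroAbout', parentId: 'hero', matcher: 'About' },
];
console.log(dedupeAncestors(nested).map((e) => e.id).join(', '));
// inner, hero, heroAbout
```

The outer About is dropped in favour of its inner About, while the Hero and the About nested inside it both survive.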

Signature dedupe as a safety net. After ancestor dedupe, collapse any remaining entries with the same matcher|heading|first-80-chars-of-text to a single representative (deepest wins).
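
A sketch of the signature pass, assuming each surviving entry carries matcher, heading, text and a numeric depth (the depth field and the dedupeBySignature name are hypothetical).

```javascript
// Collapse entries that share matcher|heading|first 80 chars of text,
// keeping the deepest entry for each signature.
function dedupeBySignature(entries) {
  const bySignature = new Map();
  for (const entry of entries) {
    const signature = [
      entry.matcher,
      entry.heading || '',
      (entry.text || '').slice(0, 80),
    ].join('|');
    const kept = bySignature.get(signature);
    if (!kept || entry.depth > kept.depth) bySignature.set(signature, entry);
  }
  // A Map keeps first-insertion order, so page order is preserved.
  return [...bySignature.values()];
}

const collapsed = dedupeBySignature([
  { matcher: 'About', heading: 'Who we are', text: 'Family run since 1990.', depth: 2 },
  { matcher: 'About', heading: 'Who we are', text: 'Family run since 1990.', depth: 5 },
  { matcher: 'FAQ', heading: 'Questions', text: '', depth: 3 },
]);
console.log(collapsed.map((e) => `${e.matcher}@${e.depth}`).join(', '));
// About@5, FAQ@3
```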

FAQ guard against partner sections. Wix reuses the accordion widget for logo carousels. Any section whose h1/h2 contains "partner|trust|logo|brand" is excluded from FAQ matching regardless of accordion count.
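
The guard amounts to a heading test that runs before any accordion counting. A minimal sketch, assuming case-insensitive matching and a caller that has already collected the h1/h2 text; isFaqCandidate and the "at least one accordion item" rule are hypothetical.

```javascript
// Headings that signal a partner/logo carousel rather than a real FAQ.
const PARTNER_HEADING = /partner|trust|logo|brand/i;

// headings: text of the section's h1/h2 elements.
// accordionCount: number of accordion items found in the section.
function isFaqCandidate(headings, accordionCount) {
  // The heading veto applies no matter how many accordion items exist.
  if (headings.some((h) => PARTNER_HEADING.test(h))) return false;
  return accordionCount > 0;
}

console.log(isFaqCandidate(['Our Trusted Partners'], 6));       // false
console.log(isFaqCandidate(['Frequently Asked Questions'], 4)); // true
```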

Image filtering. About matcher filters out images whose alt contains "logo|wordmark|brand" when a video is present in the same section (avoids using a logo as video poster). When media side is known, keeps only images on that side (cuts decorative right-column imagery out of a media-left About).
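
Both filters can be sketched over plain image records. This assumes each image looks like { src, alt, x, w } with layout values in pixels; filterAboutImages and its options object are hypothetical, not the About.js signature.

```javascript
// Alt text that marks an image as a logo rather than content.
const LOGO_ALT = /logo|wordmark|brand/i;

function filterAboutImages(images, { hasVideo, mediaSide, sectionCentre }) {
  let kept = images;
  // With a video present, a logo must never end up as the poster image.
  if (hasVideo) kept = kept.filter((img) => !LOGO_ALT.test(img.alt || ''));
  // When the media side is known, keep only images on that side.
  if (mediaSide === 'left' || mediaSide === 'right') {
    kept = kept.filter((img) => {
      const imgSide = img.x + img.w / 2 < sectionCentre ? 'left' : 'right';
      return imgSide === mediaSide;
    });
  }
  return kept;
}

const sampleImages = [
  { src: 'logo.png', alt: 'Company logo', x: 0, w: 100 },
  { src: 'team.jpg', alt: 'Team', x: 50, w: 300 },
  { src: 'deco.png', alt: '', x: 700, w: 200 },
];
const filtered = filterAboutImages(sampleImages, {
  hasVideo: true,
  mediaSide: 'left',
  sectionCentre: 490,
});
console.log(filtered.map((img) => img.src)); // [ 'team.jpg' ]
```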

Theme extraction recovers real Google Fonts from Wix's orig_<name> aliases. theme-extractor.js parses the font-family stack and pulls names from aliases such as orig_bai_jamjuree_bold, which maps to Bai Jamjuree. theme-generator.js emits a scoped theme.css with !important rules to beat Astro-scoped component styles. Never ship parastorage CSS or Wix private webfonts (IP concern).
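
A sketch of the alias recovery, assuming the alias is the family name in snake_case with an optional trailing weight or style word. The STYLE_TOKENS list and the recoverFontName name are hypothetical, and theme-extractor.js may handle more cases than this.

```javascript
// Assumed set of trailing weight/style words that are not part of the family name.
const STYLE_TOKENS = new Set([
  'thin', 'extralight', 'light', 'regular', 'medium',
  'semibold', 'bold', 'extrabold', 'black', 'italic',
]);

function recoverFontName(fontFamilyStack) {
  // Split the CSS stack and strip surrounding quotes from each family.
  const families = fontFamilyStack
    .split(',')
    .map((f) => f.trim().replace(/^['"]|['"]$/g, ''));
  const alias = families.find((f) => f.startsWith('orig_'));
  if (!alias) return null;
  const words = alias.slice('orig_'.length).split('_');
  // Drop trailing weight/style words, but always keep at least one word.
  while (words.length > 1 && STYLE_TOKENS.has(words[words.length - 1])) {
    words.pop();
  }
  return words.map((w) => w.charAt(0).toUpperCase() + w.slice(1)).join(' ');
}

console.log(recoverFontName('orig_bai_jamjuree_bold, sans-serif')); // Bai Jamjuree
```

The recovered name still needs checking against the Google Fonts catalogue before it goes into theme.json, since title-casing alone cannot confirm the font exists there.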

Source of truth

Pipeline source (lib/, components/) lives in this repo. On cathals-demo, ~/replatform/lib and ~/replatform/components/src/components are symlinks into ~/replatform-dashboard/, so git pull on EC2 picks up any edits made from this repo. To edit on EC2 directly, cd ~/replatform-dashboard && git commit && git push — the pre-migration originals are preserved at ~/replatform/lib.premigration and ~/replatform/components/src.premigration if you ever need to diff.

Runbook (per site)

# Phase 1 — DOM extraction
node lib/dom-pipeline.js <domain>

# Phase 2 — layout capture (must run on live site)
node lib/capture-layout.js <domain>

# Phase 3 — theme extraction
node lib/theme-extractor.js <domain>
node lib/theme-generator.js <domain>

# Phase 4 — assemble each page
for f in builds/<domain>/src/data/*.body.html; do
  node lib/assembler-v3.js <domain> "$(basename "$f" .body.html)"
done

# Sync to assembled-site
cp builds/<domain>/assembled/*.astro builds/<domain>/assembled-site/src/pages/

# Phase 5 — build + deploy
cd builds/<domain>/assembled-site
npm run build
find dist -type f -size +24M -delete    # CF Pages 25 MiB limit
npx wrangler pages deploy dist --project-name <project>-assembled --commit-dirty=true

# Phase 6 — placement audit (report only, no auto-fix)
node lib/placement-check.js https://www.<domain> https://<project>-assembled.pages.dev

Placement check is a report, not a loop

placement-check.js takes full-page scrolling screenshots of live + ours and asks Gemini to compare placement only — column order, grid layout, image/text side, element alignment. Ignores colours/fonts/spacing.

Output is a ranked list of placement issues. No CSS is written. Fixes go back to the matchers by hand, then assemble + deploy + re-run placement check. This is deliberate — chasing pixel-perfect via LLM-generated CSS overrides drifts away from the source-of-truth matchers and accumulates hack CSS. Gemini is the critic, matchers are the source of truth.

Cost

v3 assembler: $0/page (deterministic). Placement check: ~$0.02/site/run. 1800 sites × ~$0.02 ≈ $36, so under $50 total if QA runs once per site.

State (2026-04-12)

Site Extract Capture Theme Assemble Deploy Placement QA
waterfordcountypainters.ie ✓ wcp-assembled.pages.dev
medicalsupplies.ie ✓ medical-assembled.pages.dev
trimtech.ie ✓ (partial Pro Gallery) ✓ 19/19 pages ✓ trimtech-assembled.pages.dev ongoing — see known-issues

See known-issues.md for open gaps.