Replatform Pipeline v3
Migrates Wix sites to Cloudflare Pages using a deterministic, IP-safe assembler. Replaces v2 (Claude-in-the-hot-path).
1. DOM EXTRACTION        lib/dom-pipeline.js         Playwright → raw body.html per page + assets
2. LAYOUT CAPTURE        lib/capture-layout.js       Playwright → layout.json (comp x/y/w/h, slideshow imgs, Pro Gallery items)
3. THEME EXTRACTION      lib/theme-extractor.js      Playwright → theme.json (fonts, colors, dims)
4. ASSEMBLE              lib/assembler-v3.js         cheerio + matchers → *.astro page (deterministic)
   ├─ structure-matcher  lib/structure-matcher.js    picks best matcher per section
   ├─ matchers/*.js      lib/matchers/               17 hand-authored recognisers
   └─ theme-generator    lib/theme-generator.js      emits scoped theme.css from theme.json
5. BUILD + DEPLOY        npm run build + wrangler    CF Pages
6. GEMINI QA             lib/placement-check.js      placement-focused visual audit (report only)
Why v3
v2 had Claude generate the .astro page from sections.json. Problems:
- $0.02/page × thousands of pages added up
- Claude occasionally emitted invalid JSX; needed post-processing
- Not deterministic: same input, different output between runs
- Claude invented content (SEO meta, colours) instead of using live values
v3 pivot (Path A, 2026-04-12 session): hand-authored matchers + deterministic template stitch. Claude stays available as fallback for unmatched sections but the hot path is $0/page and fully reproducible.
Strategy: match first to prove visual fidelity, then re-engineer. The component library is hand-authored and will be modernised over time (vertical tabs, AI chatbots, etc.) — matching existing Wix structures is only the proof step to confirm we're not losing content.
Matcher pattern
Each matcher in lib/matchers/ is a single module:
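module.exports = {
  name: 'About',
  component: 'About.astro',
  priority: 50,          // higher wins ties
  match(ctx) {           // returns 0..1 confidence
    // ctx: { $, $el, tag, position, compLayout, computedLayout, siteLayout, ... }
    return 0.4;
  },
  extract(ctx) {         // returns props for the Astro component
    // Illustrative selectors; the real About matcher may read different elements.
    const heading = ctx.$el.find('h1, h2, h3').first().text().trim();
    const text = ctx.$el.find('p').first().text().trim();
    return { heading, text, imagePosition: 'left' };
  }
};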
Current matchers: Hero, TopBar, Header, Footer, About, ServiceGrid, InfoCards, FAQ, CTAStrip, Testimonials, Gallery, VideoGallery, LocationMap, USPBar, FloatingSocial, ContactForm, BookingWidget.
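To make the selection step concrete, here is a minimal sketch of how a best-matcher pick could work with the module shape above. It assumes the highest confidence wins and that priority only breaks ties, as the comment in the matcher suggests. The function name pickMatcher and the minConfidence cut-off are hypothetical, and the real logic in lib/structure-matcher.js may differ.

```javascript
// Hypothetical sketch: choose the best matcher for one section.
// `minConfidence` is an assumed cut-off, not a value taken from the pipeline.
function pickMatcher(matchers, ctx, minConfidence = 0.3) {
  let best = null;
  for (const matcher of matchers) {
    const confidence = matcher.match(ctx);
    if (confidence < minConfidence) continue;
    const wins =
      best === null ||
      confidence > best.confidence ||
      (confidence === best.confidence && matcher.priority > best.matcher.priority);
    if (wins) best = { matcher, confidence };
  }
  return best; // null means "unmatched": a candidate for the fallback path
}

// Usage with two stub matchers that report the same confidence.
const stubs = [
  { name: 'InfoCards', priority: 40, match: () => 0.6 },
  { name: 'About', priority: 50, match: () => 0.6 },
];
const picked = pickMatcher(stubs, {});
console.log(picked.matcher.name); // About
```

Returning null rather than a default component keeps unmatched sections visible, so they can be routed to the fallback or reported.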
Key techniques learned
Layout-driven decisions beat DOM-only heuristics. Use computedLayout.comps[cid].x (from Playwright) to decide image/text side instead of guessing from document order. See the detectedPosition logic in About.js: it counts image and paragraph comp x-centres against the section centre and picks the majority side.
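The majority vote can be sketched roughly as follows. This is an illustration, not the code in About.js: the box shape { x, w }, the function name detectImageSide, and the rule that a paragraph on one side counts as a vote for the image being on the other side are all assumptions.

```javascript
// Sketch: decide which side the media sits on from component x-centres.
// Box shape { x, w } follows the layout.json description; names are assumed.
function detectImageSide(sectionBox, imageBoxes, textBoxes) {
  const sectionCentre = sectionBox.x + sectionBox.w / 2;
  const centreOf = (box) => box.x + box.w / 2;
  let left = 0;
  let right = 0;
  for (const box of imageBoxes) {
    if (centreOf(box) < sectionCentre) left += 1; else right += 1;
  }
  // Assumption: a paragraph on the right is evidence the image is on the left.
  for (const box of textBoxes) {
    if (centreOf(box) < sectionCentre) right += 1; else left += 1;
  }
  if (left === right) return null; // no majority: caller keeps its default
  return left > right ? 'left' : 'right';
}

const side = detectImageSide(
  { x: 0, w: 980 },
  [{ x: 40, w: 400 }],
  [{ x: 520, w: 420 }, { x: 520, w: 420 }]
);
console.log(side); // left
```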
Slideshow support. capture-layout.js enumerates slideshow slides by clicking the next button in a loop and records each slide's image URL. Hero.js maps those URLs to the local /assets/images/<filename> that Phase 1 downloaded, sets backgroundImages: string[] on the Hero component, and forces contentPosition='center' + overlay for a proper hero banner.
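A rough sketch of the URL-to-local-path mapping follows. It assumes the local filename equals the last path segment of the captured URL, which may not match how Phase 1 actually names downloads. The helper names and the boolean overlay prop are hypothetical; only backgroundImages and contentPosition come from the description above.

```javascript
// Sketch: map captured slide URLs to local asset paths for the Hero props.
// Assumes local filename == last path segment of the remote URL.
function toLocalAsset(remoteUrl) {
  const { pathname } = new URL(remoteUrl);
  const filename = decodeURIComponent(pathname.split('/').filter(Boolean).pop());
  return `/assets/images/${filename}`;
}

function heroPropsFromSlides(slideUrls) {
  // De-duplicate while preserving slide order (a looping slideshow may repeat).
  const backgroundImages = [...new Set(slideUrls.map(toLocalAsset))];
  // `overlay: true` is an assumed prop shape for "force an overlay".
  return { backgroundImages, contentPosition: 'center', overlay: true };
}

const props = heroPropsFromSlides([
  'https://example.com/media/slide-one.jpg',
  'https://example.com/media/slide-two.jpg',
  'https://example.com/media/slide-one.jpg',
]);
console.log(props.backgroundImages);
```

De-duplicating matters because a click-through loop can wrap around and record the first slide again.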
Ancestor dedupe only when same matcher. Wix sections nest arbitrarily. If both outer and inner match the same matcher (e.g. both About with the same heading), drop the ancestor and keep the deeper/more-specific child. If outer matches Hero and inner matches About, keep both — they're semantically distinct. See assembler-v3.js dedupe block.
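The rule above might look like this in isolation. The entry shape { id, ancestors, matcher, heading } is hypothetical, and comparing headings as the "same content" test is an assumption; the actual dedupe block in assembler-v3.js may use different data.

```javascript
// Sketch: drop an ancestor only when a descendant matched the same matcher.
// `ancestors` lists the ids of every matched section that contains the entry.
function dedupeAncestors(entries) {
  const byId = new Map(entries.map((entry) => [entry.id, entry]));
  const redundant = new Set();
  for (const child of entries) {
    for (const ancestorId of child.ancestors) {
      const ancestor = byId.get(ancestorId);
      if (ancestor && ancestor.matcher === child.matcher && ancestor.heading === child.heading) {
        redundant.add(ancestor.id); // keep the deeper, more specific child
      }
    }
  }
  return entries.filter((entry) => !redundant.has(entry.id));
}

const kept = dedupeAncestors([
  { id: 'outer', ancestors: [], matcher: 'Hero', heading: 'Welcome' },
  { id: 'mid', ancestors: ['outer'], matcher: 'About', heading: 'Who we are' },
  { id: 'inner', ancestors: ['outer', 'mid'], matcher: 'About', heading: 'Who we are' },
]);
console.log(kept.map((entry) => entry.id)); // [ 'outer', 'inner' ]
```

The Hero wrapper survives because its matcher differs from its descendants, while the middle About is dropped in favour of the deeper one.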
Signature dedupe as a safety net. After ancestor dedupe, collapse any remaining entries with the same matcher|heading|first-80-chars-of-text to a single representative (deepest wins).
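A minimal sketch of the signature pass, assuming each entry carries a depth field (nesting level) alongside matcher, heading, and text. The field names and the function name are illustrative.

```javascript
// Sketch: collapse entries that share matcher|heading|first-80-chars-of-text.
// `depth` is an assumed field; the deepest entry wins for each signature.
function dedupeBySignature(entries) {
  const bySignature = new Map();
  for (const entry of entries) {
    const signature = [entry.matcher, entry.heading, (entry.text || '').slice(0, 80)].join('|');
    const current = bySignature.get(signature);
    if (!current || entry.depth > current.depth) bySignature.set(signature, entry);
  }
  // Filter the original array so survivors stay in document order.
  const survivors = new Set(bySignature.values());
  return entries.filter((entry) => survivors.has(entry));
}

const unique = dedupeBySignature([
  { id: 'a', matcher: 'About', heading: 'Our story', text: 'Founded in 2001.', depth: 2 },
  { id: 'b', matcher: 'About', heading: 'Our story', text: 'Founded in 2001.', depth: 4 },
  { id: 'c', matcher: 'FAQ', heading: 'Questions', text: 'How do I book?', depth: 3 },
]);
console.log(unique.map((entry) => entry.id)); // [ 'b', 'c' ]
```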
FAQ guard against partner sections. Wix reuses the accordion widget for logo carousels. Any section whose h1/h2 contains "partner|trust|logo|brand" is excluded from FAQ matching regardless of accordion count.
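The guard could be expressed as a small predicate like the one below. The case-insensitive flag, the function name, and the accordion threshold of 2 are assumptions made for the sketch; only the keyword list comes from the description above.

```javascript
// Sketch: exclude partner/logo sections from FAQ matching.
const PARTNER_HEADING = /partner|trust|logo|brand/i; // `i` flag is an assumption

function isFaqCandidate(headings, accordionCount, minItems = 2) {
  // The heading check runs first, so it wins regardless of accordion count.
  if (headings.some((heading) => PARTNER_HEADING.test(heading))) return false;
  return accordionCount >= minItems; // `minItems` is a hypothetical threshold
}

console.log(isFaqCandidate(['Our Trusted Partners'], 6)); // false
console.log(isFaqCandidate(['Frequently asked questions'], 4)); // true
```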
Image filtering. About matcher filters out images whose alt contains "logo|wordmark|brand" when a video is present in the same section (avoids using a logo as video poster). When media side is known, keeps only images on that side (cuts decorative right-column imagery out of a media-left About).
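Both filters can be sketched as one pass over the candidate images. The image shape { alt, x, w } and the option names are hypothetical, and the side test reuses the x-centre comparison described earlier.

```javascript
// Sketch: filter About images by alt text and by layout side.
const LOGO_ALT = /logo|wordmark|brand/i;

function filterAboutImages(images, { hasVideo, mediaSide, sectionCentre }) {
  return images
    // Logos are only excluded when a video is present (poster candidates).
    .filter((img) => !(hasVideo && LOGO_ALT.test(img.alt || '')))
    // When the media side is known, keep only images on that side.
    .filter((img) => {
      if (!mediaSide) return true;
      const side = img.x + img.w / 2 < sectionCentre ? 'left' : 'right';
      return side === mediaSide;
    });
}

const keptImages = filterAboutImages(
  [
    { alt: 'Company logo', x: 50, w: 100 },
    { alt: 'Team at work', x: 60, w: 300 },
    { alt: 'Decorative pattern', x: 700, w: 200 },
  ],
  { hasVideo: true, mediaSide: 'left', sectionCentre: 490 }
);
console.log(keptImages.map((img) => img.alt)); // [ 'Team at work' ]
```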
Theme extraction recovers real Google Fonts from Wix's orig_<name> aliases. theme-extractor.js parses the font-family stack and pulls names from strings like orig_bai_jamjuree_bold → Bai Jamjuree. theme-generator.js emits a scoped theme.css with !important rules to beat Astro-scoped component styles. Never ship parastorage CSS or Wix private webfonts (IP concern).
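As an illustration of the alias parsing, the sketch below turns an orig_<name> entry into a display name by dropping trailing weight or style tokens and title-casing the rest. The token list and function names are assumptions; theme-extractor.js may recognise a different set.

```javascript
// Sketch: recover a font name from a Wix orig_<name> alias.
// Assumed list of trailing weight/style tokens; the real extractor may differ.
const STYLE_TOKENS = new Set([
  'thin', 'light', 'regular', 'medium', 'semibold', 'bold', 'extrabold', 'black', 'italic',
]);

function fontNameFromAlias(alias) {
  const cleaned = alias.trim().replace(/^["']|["']$/g, '');
  const match = /^orig_(.+)$/.exec(cleaned);
  if (!match) return null; // not an orig_ alias
  const tokens = match[1].split('_');
  while (tokens.length > 1 && STYLE_TOKENS.has(tokens[tokens.length - 1])) tokens.pop();
  return tokens.map((token) => token.charAt(0).toUpperCase() + token.slice(1)).join(' ');
}

function fontsFromStack(fontFamily) {
  return fontFamily.split(',').map(fontNameFromAlias).filter(Boolean);
}

console.log(fontsFromStack('orig_bai_jamjuree_bold, "helvetica-w01-bold", sans-serif'));
// [ 'Bai Jamjuree' ]
```

Entries that are not orig_ aliases are skipped rather than guessed at, which fits the rule against shipping Wix private webfonts.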
Source of truth
Pipeline source (lib/, components/) lives in this repo. On cathals-demo,
~/replatform/lib and ~/replatform/components/src/components are symlinks
into ~/replatform-dashboard/, so git pull on EC2 picks up any edits made
from this repo. To edit on EC2 directly, commit and push from the checkout there
(cd ~/replatform-dashboard && git commit && git push). The pre-migration originals
are preserved at ~/replatform/lib.premigration and
~/replatform/components/src.premigration if you ever need to diff.
Runbook (per site)
# Phase 1 — DOM extraction
node lib/dom-pipeline.js <domain>
# Phase 2 — layout capture (must run on live site)
node lib/capture-layout.js <domain>
# Phase 3 — theme extraction
node lib/theme-extractor.js <domain>
node lib/theme-generator.js <domain>
# Phase 4 — assemble each page
for f in builds/<domain>/src/data/*.body.html; do
  p=$(basename "$f" .body.html)   # page name without the .body.html suffix
  node lib/assembler-v3.js <domain> "$p"
done
# Sync to assembled-site
cp builds/<domain>/assembled/*.astro builds/<domain>/assembled-site/src/pages/
# Phase 5 — build + deploy
cd builds/<domain>/assembled-site
npm run build
find dist -type f -size +24M -delete # CF Pages 25 MiB limit
npx wrangler pages deploy dist --project-name <project>-assembled --commit-dirty=true
# Phase 6 — placement audit (report only, no auto-fix)
node lib/placement-check.js https://www.<domain> https://<project>-assembled.pages.dev
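Because the find step in Phase 5 deletes oversize files silently, it can help to list them first and see which pages would lose an asset. The following is a hypothetical dry-run helper, not part of the pipeline. It uses the 25 MiB per-file limit quoted in the runbook and only Node's built-in fs and path modules.

```javascript
// Sketch: list files in a build directory that exceed a per-file size limit.
const fs = require('node:fs');
const path = require('node:path');

const LIMIT_BYTES = 25 * 1024 * 1024; // per-file limit quoted in the runbook

function findOversizeFiles(dir, limitBytes = LIMIT_BYTES) {
  const oversize = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      oversize.push(...findOversizeFiles(fullPath, limitBytes)); // recurse into subfolders
    } else if (entry.isFile() && fs.statSync(fullPath).size > limitBytes) {
      oversize.push(fullPath);
    }
  }
  return oversize;
}

// Example call, run from builds/<domain>/assembled-site after `npm run build`:
//   findOversizeFiles('dist').forEach((file) => console.log(file));
```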
Placement check is a report, not a loop
placement-check.js takes full-page scrolling screenshots of the live site and our build, then asks Gemini to compare placement only: column order, grid layout, image/text side, element alignment. It ignores colours, fonts, and spacing.
Output is a ranked list of placement issues. No CSS is written. Fixes go back into the matchers by hand, followed by assemble, deploy, and another placement check. This is deliberate: chasing pixel-perfect output with LLM-generated CSS overrides drifts away from the source-of-truth matchers and accumulates hack CSS. Gemini is the critic; the matchers are the source of truth.
Cost
v3 assembler: $0/page (deterministic). Placement check: ~$0.02/site/run. At 1800 sites that is about $36 in total (under $50) if QA runs once per site.
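The estimate is simple enough to check by hand. The sketch below works in integer cents to avoid floating-point noise and uses only the two figures quoted above; the function name is illustrative.

```javascript
// Worked arithmetic for the QA cost estimate, in integer cents.
const SITES = 1800;
const CENTS_PER_QA_RUN = 2; // ~$0.02 per site per run, from the figure above

function qaCostDollars(sites, runsPerSite) {
  return (sites * runsPerSite * CENTS_PER_QA_RUN) / 100;
}

console.log(qaCostDollars(SITES, 1)); // 36
console.log(qaCostDollars(SITES, 2)); // 72
```

The total scales linearly with re-runs, so a second QA pass per site already exceeds $50.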
State (2026-04-12)
| Site | Extract | Capture | Theme | Assemble | Deploy | Placement QA |
|---|---|---|---|---|---|---|
| waterfordcountypainters.ie | ✓ | ✓ | ✓ | ✓ | ✓ wcp-assembled.pages.dev | — |
| medicalsupplies.ie | ✓ | ✓ | ✓ | ✓ | ✓ medical-assembled.pages.dev | — |
| trimtech.ie | ✓ | ✓ (partial Pro Gallery) | ✓ | ✓ 19/19 pages | ✓ trimtech-assembled.pages.dev | ongoing — see known-issues |
See known-issues.md for open gaps.