CMS Migration — Monad Studio

# CMS-to-Astro Migration Playbook

Production-tested methodology from the ActiveWizards migration: 937 MODx pages reduced to 147 Astro pages + 804 redirects. Zero traffic loss.

1. Export Strategy

JSONL Export from Source CMS

Export each content type as a separate JSONL file (one JSON object per line). JSONL is superior to monolithic JSON for large exports: it is streamable, diffable, and survives partial corruption.

MODx export (SQL to JSONL):

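```sql
-- Blog posts: JSON_OBJECT (MySQL 5.7.8+) emits one JSON object per row
SELECT JSON_OBJECT(
         'id', c.id, 'pagetitle', c.pagetitle, 'longtitle', c.longtitle,
         'description', c.description, 'alias', c.alias, 'content', c.content,
         'publishedon', c.publishedon, 'parent', c.parent, 'template', c.template)
FROM modx_site_content c
WHERE c.template = 4 AND c.published = 1
INTO OUTFILE '/tmp/aw-blog-export.jsonl'
FIELDS ESCAPED BY ''          -- keep the JSON backslash escapes untouched
LINES TERMINATED BY '\n';
```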

In practice, a Python script with pymysql gives better control:

```python
import json
from pathlib import Path

def export_to_jsonl(rows: list[dict], output: Path) -> None:
    """Write one JSON object per line; default=str serializes dates and Decimals."""
    with open(output, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False, default=str) + "\n")
```



WordPress export: Use WP-CLI `wp export` to get WXR XML, then convert to JSONL with a script that parses the `<item>` elements. Alternatively, query `wp_posts` directly via MySQL.

Ghost export: The Ghost Admin API returns JSON. Paginate with `?limit=15&page=N` and write each post as a JSONL line. Ghost uses Lexical internally; request HTML via `?formats=html`.
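
The pagination loop for an API export can be sketched as follows. This is a minimal illustration, not the script used in the AW migration: `export_paginated` and the `fetch_page` callable are hypothetical names, and `fetch_page(n)` is assumed to return the decoded JSON body for page `n` (authentication and the HTTP call are left to the caller). It relies on Ghost reporting the next page number under `meta.pagination.next`.

```python
import json
from pathlib import Path
from typing import Callable

def export_paginated(fetch_page: Callable[[int], dict], output: Path) -> int:
    """Write every post from a paginated API to JSONL and return the count.

    fetch_page is a hypothetical callable supplied by the caller, e.g. a
    wrapper around GET .../posts/?limit=15&page=N&formats=html.
    """
    count = 0
    page = 1
    with open(output, "w", encoding="utf-8") as f:
        while page:
            body = fetch_page(page)
            for post in body.get("posts", []):
                f.write(json.dumps(post, ensure_ascii=False) + "\n")
                count += 1
            # next is null (None) on the last page, which ends the loop
            page = body.get("meta", {}).get("pagination", {}).get("next")
    return count
```

Passing the fetch function in keeps the loop testable offline with canned responses.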

Always export separately:

- `aw-blog-export.jsonl` -- core content fields
- `aw-cases-export.jsonl` -- case studies
- `aw-tv-export.jsonl` -- template variables / custom fields (tags, images, related posts)



JSONL Loader Pattern

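```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

def load_jsonl(path: Path) -> list[dict]:
    """Load a JSONL file, skipping (and logging) lines that fail to parse."""
    items = []
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if line:
                try:
                    items.append(json.loads(line))
                except json.JSONDecodeError:
                    logger.warning(f"Skipping unparseable line {lineno} in {path.name}")
    return items
```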





2. Content Audit: Traffic-Based Priority

Pull 90 days of Google Search Console (GSC) data to make keep/redirect/kill decisions based on actual traffic, not gut feel.

GSC Data Pull

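```python
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

# Assumption: token.json holds a previously authorized OAuth user token
creds = Credentials.from_authorized_user_file(
    "token.json",
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)

service = build("searchconsole", "v1", credentials=creds)
response = service.searchanalytics().query(
    siteUrl="sc-domain:yourdomain.com",
    body={
        "startDate": "2025-12-01",
        "endDate": "2026-02-28",  # 90 days, inclusive
        "dimensions": ["page"],
        "rowLimit": 25000,  # API maximum per request
    }
).execute()
```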
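
To feed the decision rules, the response has to be reduced to clicks per URL path. The sketch below is one way to do that; `clicks_by_path` is a hypothetical helper, and it assumes each row carries the page URL in `keys[0]` and a `clicks` value, which is the shape returned when `dimensions` is `["page"]`.

```python
from urllib.parse import urlparse

def clicks_by_path(response: dict) -> dict[str, int]:
    """Sum clicks per URL path, merging variants that differ only by query string."""
    totals: dict[str, int] = {}
    for row in response.get("rows", []):
        # keys[0] is the full page URL; keep only the path, defaulting to "/"
        path = urlparse(row["keys"][0]).path or "/"
        totals[path] = totals.get(path, 0) + int(row["clicks"])
    return totals
```

Keying on the path rather than the full URL makes the result directly comparable with the `alias`-based paths in the CMS export.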



Decision Matrix

| Condition | Action |
| --- | --- |
| Blog post (any traffic) | Migrate |
| Case study (any traffic) | Migrate |
| Homepage | Migrate |
| Service root page | Migrate |
| Other page with >30 clicks/90 days (~10/mo) | Migrate |
| Everything else | 301 redirect to nearest parent |

The AW migration used these exact rules in scripts/migration_decision.py:

```python
ALWAYS_MIGRATE_TEMPLATES = {"Home", "Blog", "BlogTest", "CaseStudy"}
ALWAYS_MIGRATE_PATTERNS = ["blog/", "case-stud"]
MIN_CLICKS_90D = 30  # ~10 clicks/month threshold
```
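
A minimal sketch of how these constants could drive the keep/redirect decision follows. The `decide` function is hypothetical and is not taken from `scripts/migration_decision.py`. It covers the template, pattern, and click rules only; service root pages from the matrix would need their own rule.

```python
ALWAYS_MIGRATE_TEMPLATES = {"Home", "Blog", "BlogTest", "CaseStudy"}
ALWAYS_MIGRATE_PATTERNS = ["blog/", "case-stud"]
MIN_CLICKS_90D = 30  # ~10 clicks/month threshold

def decide(path: str, template: str, clicks_90d: int) -> str:
    """Return "migrate" or "redirect" for a single page (illustrative only)."""
    if template in ALWAYS_MIGRATE_TEMPLATES:
        return "migrate"
    if any(pattern in path for pattern in ALWAYS_MIGRATE_PATTERNS):
        return "migrate"
    # The matrix says "more than 30 clicks", so the comparison is strict
    if clicks_90d > MIN_CLICKS_90D:
        return "migrate"
    return "redirect"
```

For example, `decide("/about/", "Page", 31)` returns `"migrate"`, while the same page with exactly 30 clicks is redirected.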



Output

Two CSV files:

- `pages-to-migrate.csv` -- slug, title, traffic, content type
- `redirects.csv` -- old URL, target URL, status code





3. URL Mapping

Old URLs to New Slugs

Create a mapping CSV that connects every old URL to its new destination:

```csv
old_url,new_url,status
/blog/old-post-title/,/blog/old-post-title,301
/case-study/old-case/,/case-studies/old-case,301
/services/ai-ml/,/services/data-science,301
```



Key decisions:

- Pick one trailing-slash policy and apply it to every target URL. Astro's `trailingSlash` option defaults to `'ignore'`, so set it explicitly (for example `'never'`) if the new site drops trailing slashes.
- Flatten nested URLs where possible (`/blog/category/post/` becomes `/blog/post`).
- Rename URL segments if the old CMS used bad patterns (`/case-study/` to `/case-studies/`).
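
These three decisions can be expressed as one mapping function. The sketch below is an illustration under stated assumptions: `map_url`, `SEGMENT_RENAMES`, and the `flatten_under` parameter are hypothetical names, the rename table only contains the example from this section, and a hand-edited override (such as `/services/ai-ml/` to `/services/data-science`) would still live in the mapping CSV.

```python
# Hypothetical rename table; extend it with the segments your old CMS used
SEGMENT_RENAMES = {"case-study": "case-studies"}

def map_url(old_url: str, flatten_under: tuple[str, ...] = ("blog",)) -> str:
    """Map an old CMS URL to its new path: rename, flatten, drop trailing slash."""
    parts = [p for p in old_url.split("/") if p]
    if not parts:
        return "/"
    # Rename the leading segment if the old pattern was replaced
    parts[0] = SEGMENT_RENAMES.get(parts[0], parts[0])
    # Flatten /blog/category/post to /blog/post
    if parts[0] in flatten_under and len(parts) > 2:
        parts = [parts[0], parts[-1]]
    # Joining without a trailing "/" applies the no-trailing-slash policy
    return "/" + "/".join(parts)
```

Running every exported URL through a function like this produces a first draft of the mapping CSV that can then be reviewed by hand.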



Redirect File Generation

Cloudflare Pages uses a _redirects file in public/:


```
# Format: old-path  new-path  status-code
/contact/  /contact-us/  301
/old-blog-post/  /blog/new-slug  301
/services/ai-ml/  /services/data-science  301
```



Generate it programmatically from your mapping CSV:

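```python
import csv
from pathlib import Path

def generate_redirects(mapping_csv: Path, output: Path) -> None:
    """Write one Cloudflare Pages redirect rule per row of the mapping CSV."""
    lines = ["# Redirects generated by migration script"]
    with open(mapping_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            lines.append(f"{row['old_url']}  {row['new_url']}  {row['status']}")

    # End with a newline so the last rule is a complete line
    output.write_text("\n".join(lines) + "\n", encoding="utf-8")
```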



AW result: 804 redirect rules in `site/aw/public/_redirects`. Internal fragments (CTAs, sliders, chunks) all redirect to `/` with a 301.



4. Content Collection Schema Design

Astro content collections use Zod schemas to validate frontmatter at build time. Design schemas that are strict enough to catch errors but flexible enough to handle legacy content with missing fields.

Blog Schema (from AW production)

```typescript
// src/content.config.ts
import { defineCollection, z } from 'astro:content';
import { glob } from 'astro/loaders';

const blog = defineCollection({
  loader: glob({ pattern: '**/*.md', base: './src/content/blog' }),
  schema: z.object({
    title: z.string(),
    description: z.string(),
    publishedAt: z.coerce.date(),
    updatedAt: z.coerce.date().optional(),
    tags: z.array(z.string()).default([]),
    // GEO fields (Generative Engine Optimization)
    problem: z.string().optional(),
    technology: z.string().optional(),
    technologyVersion: z.string().optional(),
    persona: z.string().optional(),
    // SEO
    metaTitle: z.string().optional(),
    metaDescription: z.string().optional(),
    ogImage: z.string().optional(),
    // Content
    tldr: z.string().optional(),
    relatedPosts: z.array(z.string()).default([]),
    readingTime: z.number().optional(),
  }),
});
```



Case Study Schema

```typescript
// Continues src/content.config.ts
const cases = defineCollection({
  loader: glob({ pattern: '**/*.md', base: './src/content/cases' }),
  schema: z.object({
    title: z.string(),
    description: z.string(),
    publishedAt: z.coerce.date(),
    client: z.string().optional(),
    industry: z.string().optional(),
    technologies: z.array(z.string()).default([]),
    metrics: z.array(z.object({
      label: z.string(),
      value: z.string(),
    })).default([]),
    image: z.string().optional(),
    metaTitle: z.string().optional(),
    metaDescription: z.string().optional(),
  }),
});

// Astro only registers collections that are exported here
export const collections = { blog, cases };
```



Schema Design Rules

- Required fields for content that must exist: `title`, `description`, `publishedAt`
- Optional fields for legacy content that may lack them: `metaTitle`, `ogImage`, `tldr`
- Defaults for arrays and booleans: `.default([])`, `.default(false)`
- Coerce dates: `z.coerce.date()` handles both ISO strings and Date objects
- GEO fields are optional per-post but strongly encouraged: `problem`, `technology`, `persona`





5. Conversion Script Pattern

The conversion script reads JSONL exports and generates .md files with YAML frontmatter.



HTML to Markdown Conversion

```python
import re
from html import unescape

def strip_tags(html: str) -> str:
    """Minimal stand-in for the helper used below: drop tags, keep their text."""
    return re.sub(r'<[^>]+>', '', html)

def html_to_markdown(html: str) -> str:
    text = html

    # Remove CMS-specific tags (MODx snippets, WP shortcodes)
    text = re.sub(r'\[\[.*?\]\]', '', text)       # MODx
    text = re.sub(r'\[/?[a-z_]+.*?\]', '', text)  # WordPress shortcodes

    # Remove inline style blocks
    text = re.sub(r'<style[^>]*>.*?</style>', '', text, flags=re.DOTALL)

    # Headings
    for level in range(6, 0, -1):
        prefix = '#' * level
        text = re.sub(
            rf'<h{level}[^>]*>(.*?)</h{level}>',
            lambda m: f'\n\n{prefix} {strip_tags(m.group(1)).strip()}\n\n',
            text, flags=re.DOTALL
        )

    # Bold, italic, links, code blocks, lists...
    text = re.sub(r'<strong[^>]*>(.*?)</strong>', r'**\1**', text, flags=re.DOTALL)
    text = re.sub(r'<em[^>]*>(.*?)</em>', r'*\1*', text, flags=re.DOTALL)
    text = re.sub(
        r'<pre[^>]*><code[^>]*>(.*?)</code></pre>',
        lambda m: f'\n\n```\n{unescape(strip_tags(m.group(1)))}\n```\n\n',
        text, flags=re.DOTALL
    )

    # Clean up whitespace
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()
```



Frontmatter Generation

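```python
import json
from datetime import datetime, timezone

def build_frontmatter(page: dict, tv_values: dict) -> str:
    tvs = tv_values.get(page["id"], {})
    tags = [t.strip() for t in (tvs.get("tags") or "").split(",") if t.strip()]

    # MODx stores publishedon as a Unix timestamp; emit an ISO date for z.coerce.date()
    published = page.get("publishedon") or "2025-01-01"
    if isinstance(published, (int, float)):
        published = datetime.fromtimestamp(published, tz=timezone.utc).date().isoformat()

    fm = {
        "title": page["pagetitle"],
        "description": page.get("description", ""),
        "publishedAt": published,
        "tags": tags,
        "metaDescription": page.get("description", ""),
    }

    lines = ["---"]
    for key, value in fm.items():
        if isinstance(value, list):
            # An empty list must be written as [] rather than a bare key (YAML null)
            lines.append(f"{key}:" if value else f"{key}: []")
            for item in value:
                lines.append(f"  - {json.dumps(item, ensure_ascii=False)}")
        else:
            # A JSON string is a valid YAML double-quoted scalar, so quotes and colons stay safe
            lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines)
```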



Full Conversion Pipeline

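```python
# Paths are illustrative. load_jsonl, build_frontmatter and html_to_markdown are the
# functions shown earlier; load_tv_values is assumed to return {page_id: {tv_name: value}}.
BLOG_EXPORT = Path("aw-blog-export.jsonl")
TV_EXPORT = Path("aw-tv-export.jsonl")
BLOG_OUTPUT = Path("src/content/blog")

def convert_all():
    pages = load_jsonl(BLOG_EXPORT)
    tv_values = load_tv_values(TV_EXPORT)

    BLOG_OUTPUT.mkdir(parents=True, exist_ok=True)

    for page in pages:
        slug = page["alias"]
        frontmatter = build_frontmatter(page, tv_values)
        body = html_to_markdown(page.get("content") or "")  # content may be NULL

        output_path = BLOG_OUTPUT / f"{slug}.md"
        output_path.write_text(f"{frontmatter}\n\n{body}", encoding="utf-8")
        logger.info(f"Converted: {slug}")
```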
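
The pipeline calls `load_tv_values`, which is not shown in this playbook. The following is a sketch of what it might look like, assuming each line of the TV export is an object with `contentid`, `name`, and `value` fields; those field names are assumptions about the export format, not something the source confirms.

```python
import json
from collections import defaultdict
from pathlib import Path

def load_tv_values(path: Path) -> dict[int, dict[str, str]]:
    """Group template-variable rows by page id: {page_id: {tv_name: value}}.

    Assumes rows shaped like {"contentid": 7, "name": "tags", "value": "a, b"}.
    """
    values: dict[int, dict[str, str]] = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            # Cast the id to int so lookups by page["id"] match
            values[int(row["contentid"])][row["name"]] = row["value"]
    return dict(values)
```

Keying by integer page id matters because `build_frontmatter` looks values up with `page["id"]`.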





6. Image Migration

Download and Organize

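```python
import asyncio
from pathlib import Path
from urllib.parse import urlparse

import httpx

async def download_images(image_urls: list[str], output_dir: Path) -> None:
    output_dir.mkdir(parents=True, exist_ok=True)
    async with httpx.AsyncClient(timeout=30.0, follow_redirects=True) as client:
        for url in image_urls:
            # Take the filename from the URL path so query strings are ignored
            filename = Path(urlparse(url).path).name
            response = await client.get(url)
            if response.status_code == 200 and filename:
                (output_dir / filename).write_bytes(response.content)
            else:
                print(f"Skipped {url} (HTTP {response.status_code})")

# Usage: asyncio.run(download_images(urls, Path("public/images/blog/2025")))
```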



Directory Structure


```
public/
  images/
    blog/
      2025/
        post-slug-banner.webp
        post-slug-diagram-1.svg
      2026/
        ...
    cases/
      case-slug-hero.webp
    services/
      service-icon.svg
```



Optimization

- Convert PNG/JPG to WebP using `sharp` or `squoosh-cli`
- Keep SVG diagrams as-is (they are already small)
- Target: hero images <200KB, thumbnails <50KB
- Always set `width` and `height` attributes to prevent Cumulative Layout Shift (CLS)
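
The size targets can be checked automatically before deploy. The sketch below assumes images are classified by filename suffix (`-hero`, `-banner`, `-thumb`); that naming scheme, the `BUDGETS` table, and `find_oversized` are illustrative assumptions, and KB is taken as 1024 bytes.

```python
from fnmatch import fnmatch
from pathlib import Path

# Hypothetical filename patterns mapped to byte budgets
BUDGETS = {
    "*-hero.*": 200 * 1024,
    "*-banner.*": 200 * 1024,
    "*-thumb.*": 50 * 1024,
}

def find_oversized(root: Path, budgets: dict[str, int] = BUDGETS) -> list[tuple[Path, int, int]]:
    """Return (path, size, limit) for every image at or above its budget."""
    oversized = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        for pattern, limit in budgets.items():
            if fnmatch(path.name, pattern):
                size = path.stat().st_size
                if size >= limit:
                    oversized.append((path, size, limit))
                break  # first matching pattern wins
    return oversized
```

Calling `find_oversized(Path("public/images"))` in CI and failing on a non-empty result keeps the budgets from drifting.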





7. Redirect File

Cloudflare Pages Format


```
# old-path  new-path  status-code
/old-page/  /new-page  301
/blog/old-slug/  /blog/new-slug  301
```



Bulk Generation from Migration Decision Output

```python
import csv
from pathlib import Path

def generate_redirect_file(redirects_csv: Path, output: Path) -> None:
    lines = ["# Redirects for old CMS pages -> new Astro site",
             "# Generated by migration_decision.py\n"]

    with open(redirects_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            old = row["old_url"].rstrip("/") + "/"  # ensure trailing slash
            new = row["new_url"]
            lines.append(f"{old}  {new}  301")

    output.write_text("\n".join(lines) + "\n", encoding="utf-8")
```



Testing Redirects

```bash
# Test all redirects resolve (no 404s)
while IFS= read -r line; do
    [[ "$line" =~ ^# ]] && continue
    [[ -z "$line" ]] && continue
    old=$(echo "$line" | awk '{print $1}')
    expected=$(echo "$line" | awk '{print $2}')
    # Anchor the match so headers such as Content-Location are not picked up
    actual=$(curl -sI "https://staging.example.com$old" | grep -i '^location:' | awk '{print $2}' | tr -d '\r')
    if [[ "$actual" != *"$expected"* ]]; then
        echo "FAIL: $old -> $actual (expected $expected)"
    fi
done < public/_redirects
```





8. Validation Checklist

Before launch, verify:

- [ ] All migrated pages render without errors (`astro build` exits 0)
- [ ] All old URLs return 301 (not 404)
- [ ] No redirect chains (A->B->C -- should be A->C)
- [ ] No redirect loops
- [ ] Sitemap includes all new pages and excludes redirected ones
- [ ] `robots.txt` allows crawling
- [ ] JSON-LD validates on 5 sample pages (Google Rich Results Test)
- [ ] OG tags present and correct (Facebook Sharing Debugger)
- [ ] Images load on 10 random pages
- [ ] Code blocks render with syntax highlighting
- [ ] Contact form works on staging
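
The chain and loop checks can be run offline against the `_redirects` file before anything is deployed. The following is a minimal sketch, assuming static rules only (no splats or placeholders) and assuming paths that differ only by a trailing slash refer to the same page; `parse_redirects` and `find_chains_and_loops` are hypothetical helpers.

```python
def parse_redirects(text: str) -> dict[str, str]:
    """Parse 'old  new  status' lines into {old: new}, skipping comments and blanks."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            rules[parts[0]] = parts[1]
    return rules

def _norm(path: str) -> str:
    # Assumption: /a and /a/ are treated as the same page
    return path.rstrip("/") or "/"

def find_chains_and_loops(rules: dict[str, str]) -> tuple[list[list[str]], list[list[str]]]:
    """Return (chains, loops); each entry lists the hops starting at the source."""
    lookup = {_norm(src): dst for src, dst in rules.items()}
    chains, loops = [], []
    for src, dst in rules.items():
        seen = {_norm(src)}
        hops = [src, dst]
        current = _norm(dst)
        while current in lookup:
            if current in seen:
                loops.append(hops)
                break
            seen.add(current)
            hops.append(lookup[current])
            current = _norm(lookup[current])
        else:
            # No loop found; more than one hop means the target is itself redirected
            if len(hops) > 2:
                chains.append(hops)
    return chains, loops
```

Each reported chain can be fixed by pointing its first path directly at its last one.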

Automated Validation

```bash
# Broken link check
npx linkinator https://staging.example.com --recurse --format json > broken-links.json

# Lighthouse audit
npx unlighthouse --site https://staging.example.com --reporter json

# Accessibility
npx pa11y-ci --sitemap https://staging.example.com/sitemap-index.xml
```





9. Traffic Preservation

Pre-Launch

- Submit new sitemap to GSC
- Request indexing for top 20 pages by traffic
- Set up GSC property for new domain (if domain changes)

Post-Launch Monitoring

- Day 1: Check GSC for crawl errors. Fix any 404s immediately.
- Week 1: Compare impressions to pre-migration baseline. Drops of 10-20% are normal and recover within 2-4 weeks.
- Week 2-4: Monitor position changes for top 50 keywords. Flag anything that drops more than 5 positions.
- Month 2: Full traffic comparison. If traffic has not recovered, investigate: missing redirects, changed canonical URLs, or missing structured data.
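
The baseline comparison is simple arithmetic, sketched below. The function names are hypothetical. The 20% boundary comes from the week 1 guidance above and the 30% boundary from the rollback plan; labelling the band between them "investigate" is an assumption made for illustration.

```python
def impression_change(baseline: int, current: int) -> float:
    """Percent change against the pre-migration baseline; negative means a drop."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) * 100 / baseline

def classify_change(change_pct: float) -> str:
    """Bucket a percent change using the thresholds in this playbook."""
    if change_pct >= -20:
        return "normal"
    if change_pct >= -30:
        return "investigate"  # assumed middle band, not from the source
    return "rollback candidate"
```

For example, going from 1,000 to 850 weekly impressions is a change of -15%, which falls in the normal band.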

IndexNow

Ping search engines immediately after deploying new content:

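```python
import httpx

def ping_indexnow(urls: list[str], key: str) -> None:
    payload = {
        "host": "yourdomain.com",
        "key": key,        # must match the key file hosted on the site
        "urlList": urls,   # up to 10,000 URLs per request
    }
    response = httpx.post("https://api.indexnow.org/indexnow", json=payload, timeout=30.0)
    response.raise_for_status()  # surface rejected keys or malformed payloads
```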





10. Rollback Plan

If traffic drops exceed 30% after 4 weeks:

- DNS rollback: Point domain back to old CMS server (if still running)
- Redirect reversal: Remove `_redirects` file, restore old URL structure
- Investigate: Compare old vs new canonical URLs, structured data, internal linking

Keep the old CMS server running (read-only) for at least 60 days post-migration. The AW migration kept the MODx server at 138.68.156.149 as a read-only fallback for 90 days.

AW Migration Summary

| Metric | Value |
| --- | --- |
| Source CMS | MODx Revolution |
| Source pages | 937 published |
| Migrated pages | 147 (115 blog + 12 cases + 5 services + index pages) |
| Redirects | 804 rules |
| Traffic impact | Zero loss after 4-week stabilization |
| Timeline | 3 weeks (export, convert, audit, deploy) |
| Framework | Astro 5.x + Tailwind v4 |
| Hosting | Cloudflare Pages |