# CMS-to-Astro Migration Playbook

Production-tested methodology from the ActiveWizards migration: 937 MODx pages reduced to 147 Astro pages + 804 redirects. Zero traffic loss.


1. Export Strategy

JSONL Export from Source CMS

Export each content type as a separate JSONL file (one JSON object per line). JSONL is superior to monolithic JSON for large exports: it is streamable, diffable, and survives partial corruption.

MODx export (SQL to JSONL):

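```sql
-- Blog posts: JSON_OBJECT (MySQL 5.7.8+) emits one JSON object per row
SELECT JSON_OBJECT(
  'id', c.id, 'pagetitle', c.pagetitle, 'longtitle', c.longtitle,
  'description', c.description, 'alias', c.alias, 'content', c.content,
  'publishedon', c.publishedon, 'parent', c.parent, 'template', c.template
)
FROM modx_site_content c
WHERE c.template = 4 AND c.published = 1
INTO OUTFILE '/tmp/aw-blog-export.jsonl'
FIELDS ESCAPED BY ''          -- leave JSON backslash escapes untouched
LINES TERMINATED BY '\n';
```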

In practice, a Python script with pymysql gives better control:

```python
import json
from pathlib import Path


def export_to_jsonl(rows: list[dict], output: Path):
    with open(output, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False, default=str) + "\n")
```

WordPress export: Use WP-CLI (wp export) to get WXR XML, then convert it to JSONL with a script that parses the `<item>` elements. Alternatively, query wp_posts directly via MySQL.

Ghost export: The Ghost Admin API returns JSON. Paginate with ?limit=15&page=N and write each post as one JSONL line. Ghost stores content as Lexical internally; request HTML via ?formats=html.

Always export separately:
  • aw-blog-export.jsonl -- core content fields
  • aw-cases-export.jsonl -- case studies
  • aw-tv-export.jsonl -- template variables / custom fields (tags, images, related posts)
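
The Ghost pagination described above can be sketched as a small loop. This is a minimal sketch, not a definitive client: `fetch_page` is a hypothetical caller-supplied function that performs the authenticated request for one page and returns the decoded JSON body, and `iter_ghost_posts` and `write_posts_jsonl` are illustrative names. It assumes the response carries the next page number under `meta.pagination.next`.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, Iterator


def iter_ghost_posts(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield posts page by page until the API reports no next page.

    fetch_page(n) is a hypothetical helper: it should GET page n
    (with ?formats=html) and return the decoded JSON body.
    """
    page = 1
    while page:
        body = fetch_page(page)
        yield from body.get("posts", [])
        # Assumed response shape: meta.pagination.next is null on the last page
        page = body.get("meta", {}).get("pagination", {}).get("next")


def write_posts_jsonl(posts: Iterable[dict], output: Path) -> int:
    """Write each post as one JSONL line and return the number written."""
    count = 0
    with open(output, "w", encoding="utf-8") as f:
        for post in posts:
            f.write(json.dumps(post, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Keeping the HTTP call behind `fetch_page` lets the loop be exercised with canned pages before it is pointed at a live site.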

JSONL Loader Pattern

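```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_jsonl(path: Path) -> list[dict]:
    items = []
    # errors="replace" keeps the load going when a line contains invalid UTF-8
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.strip()
            if line:
                try:
                    items.append(json.loads(line))
                except json.JSONDecodeError:
                    logger.warning(f"Skipping unparseable line in {path.name}")
    return items
```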


2. Content Audit: Traffic-Based Priority

Pull 90 days of GSC data to make keep/redirect/kill decisions based on actual traffic, not gut feel.

GSC Data Pull

```python
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

# Assumption: an OAuth token was saved earlier as token.json; adapt to your auth flow
creds = Credentials.from_authorized_user_file("token.json")
service = build("searchconsole", "v1", credentials=creds)

response = service.searchanalytics().query(
    siteUrl="sc-domain:yourdomain.com",
    body={
        "startDate": "2025-12-01",
        "endDate": "2026-02-28",
        "dimensions": ["page"],
        "rowLimit": 25000,
    }
).execute()
```
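
To feed the audit, the query response has to be reduced to clicks per URL path. The following is a minimal sketch, assuming the response rows carry the page URL in `keys[0]` and a numeric `clicks` field; `clicks_by_path` is an illustrative name, not part of the AW scripts.

```python
from urllib.parse import urlsplit


def clicks_by_path(response: dict) -> dict[str, int]:
    """Sum clicks per URL path, ignoring host and query string."""
    totals: dict[str, int] = {}
    for row in response.get("rows", []):
        # With dimensions=["page"], keys[0] is assumed to be the full page URL
        path = urlsplit(row["keys"][0]).path or "/"
        totals[path] = totals.get(path, 0) + int(row.get("clicks", 0))
    return totals
```

Grouping by path merges variants of the same page (for example, URLs that differ only by tracking parameters) before the click threshold is applied.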

Decision Matrix

| Condition | Action |
|---|---|
| Blog post (any traffic) | Migrate |
| Case study (any traffic) | Migrate |
| Homepage | Migrate |
| Service root page | Migrate |
| Other page with >30 clicks/90 days (~10/mo) | Migrate |
| Everything else | 301 redirect to nearest parent |

The AW migration used these exact rules in scripts/migration_decision.py:

```python
ALWAYS_MIGRATE_TEMPLATES = {"Home", "Blog", "BlogTest", "CaseStudy"}
ALWAYS_MIGRATE_PATTERNS = ["blog/", "case-stud"]
MIN_CLICKS_90D = 30  # ~10 clicks/month threshold
```
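
As a rough illustration of how the constants above could drive the decision matrix, here is a minimal sketch. The `decide` function and its arguments are hypothetical and not taken from scripts/migration_decision.py, and the "service root page" rule is not modeled because none of the three constants covers it.

```python
ALWAYS_MIGRATE_TEMPLATES = {"Home", "Blog", "BlogTest", "CaseStudy"}
ALWAYS_MIGRATE_PATTERNS = ["blog/", "case-stud"]
MIN_CLICKS_90D = 30  # ~10 clicks/month threshold


def decide(url_path: str, template: str, clicks_90d: int) -> str:
    """Return "migrate" or "redirect" for one page (hypothetical helper)."""
    if template in ALWAYS_MIGRATE_TEMPLATES:
        return "migrate"
    if any(pattern in url_path for pattern in ALWAYS_MIGRATE_PATTERNS):
        return "migrate"
    # The matrix says "more than 30 clicks", so exactly 30 still redirects
    if clicks_90d > MIN_CLICKS_90D:
        return "migrate"
    return "redirect"
```

For example, `decide("about-us", "Page", 31)` returns `"migrate"`, while the same page with 30 clicks returns `"redirect"`.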

Output

Two CSV files:

  • pages-to-migrate.csv -- slug, title, traffic, content type
  • redirects.csv -- old URL, target URL, status code

3. URL Mapping

Old URLs to New Slugs

Create a mapping CSV that connects every old URL to its new destination:

```csv
old_url,new_url,status
/blog/old-post-title/,/blog/old-post-title,301
/case-study/old-case/,/case-studies/old-case,301
/services/ai-ml/,/services/data-science,301
```

Key decisions:

  • Drop trailing slashes if your new site does not use them (Astro's trailingSlash option defaults to 'ignore'; set it to 'never' to enforce slash-free URLs)
  • Flatten nested URLs where possible (/blog/category/post/ becomes /blog/post)
  • Rename URL segments if the old CMS used bad patterns (/case-study/ to /case-studies/)
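
The three decisions above can be sketched as one normalization function. This is a minimal sketch under stated assumptions: `SEGMENT_RENAMES` and `new_url_for` are illustrative names, the only rename shown is the one from the mapping example, and "flatten" is taken to mean keeping the first segment and the final slug.

```python
# Assumption: renames apply to the first path segment only
SEGMENT_RENAMES = {"case-study": "case-studies"}


def new_url_for(old_url: str) -> str:
    """Map an old CMS URL to a slash-free, flattened new URL."""
    segments = [s for s in old_url.strip("/").split("/") if s]
    if not segments:
        return "/"
    segments[0] = SEGMENT_RENAMES.get(segments[0], segments[0])
    # Flatten nested URLs: keep the section and the final slug
    if len(segments) > 2:
        segments = [segments[0], segments[-1]]
    return "/" + "/".join(segments)
```

Flattening can map two old URLs to the same new slug, so it is worth checking the generated mapping for duplicate targets before writing content files.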

Redirect File Generation

Cloudflare Pages uses a _redirects file in public/:

```
# Format: old-path new-path status-code
/contact/ /contact-us/ 301
/old-blog-post/ /blog/new-slug 301
/services/ai-ml/ /services/data-science 301
```

Generate it programmatically from your mapping CSV:

```python
import csv
from pathlib import Path


def generate_redirects(mapping_csv: Path, output: Path):
    lines = ["# Redirects generated by migration script\n"]
    with open(mapping_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            lines.append(f"{row['old_url']} {row['new_url']} {row['status']}")
    with open(output, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```

AW result: 804 redirect rules in site/aw/public/_redirects. Internal fragments (CTAs, sliders, chunks) all redirect to / with 301.

4. Content Collection Schema Design

Astro content collections use Zod schemas to validate frontmatter at build time. Design schemas that are strict enough to catch errors but flexible enough to handle legacy content with missing fields.

Blog Schema (from AW production)

```typescript
// src/content.config.ts
import { defineCollection, z } from 'astro:content';
import { glob } from 'astro/loaders';

const blog = defineCollection({
  loader: glob({ pattern: '**/*.md', base: './src/content/blog' }),
  schema: z.object({
    title: z.string(),
    description: z.string(),
    publishedAt: z.coerce.date(),
    updatedAt: z.coerce.date().optional(),
    tags: z.array(z.string()).default([]),
    // GEO fields (Generative Engine Optimization)
    problem: z.string().optional(),
    technology: z.string().optional(),
    technologyVersion: z.string().optional(),
    persona: z.string().optional(),
    // SEO
    metaTitle: z.string().optional(),
    metaDescription: z.string().optional(),
    ogImage: z.string().optional(),
    // Content
    tldr: z.string().optional(),
    relatedPosts: z.array(z.string()).default([]),
    readingTime: z.number().optional(),
  }),
});
```

Case Study Schema

```typescript
const cases = defineCollection({
  loader: glob({ pattern: '**/*.md', base: './src/content/cases' }),
  schema: z.object({
    title: z.string(),
    description: z.string(),
    publishedAt: z.coerce.date(),
    client: z.string().optional(),
    industry: z.string().optional(),
    technologies: z.array(z.string()).default([]),
    metrics: z.array(z.object({
      label: z.string(),
      value: z.string(),
    })).default([]),
    image: z.string().optional(),
    metaTitle: z.string().optional(),
    metaDescription: z.string().optional(),
  }),
});

// Astro only registers collections that are exported from this file
export const collections = { blog, cases };
```

Schema Design Rules

  • Required fields for content that must exist: title, description, publishedAt
  • Optional fields for legacy content that may lack them: metaTitle, ogImage, tldr
  • Defaults for arrays and booleans: .default([]), .default(false)
  • Coerce dates: z.coerce.date() handles both ISO strings and Date objects
  • GEO fields are optional per-post but strongly encouraged: problem, technology, persona

5. Conversion Script Pattern

The conversion script reads JSONL exports and generates .md files with YAML frontmatter.

HTML to Markdown Conversion

```python
import re
from html import unescape


def strip_tags(html: str) -> str:
    """Drop any remaining HTML tags, keeping their inner text."""
    return re.sub(r'<[^>]+>', '', html)


def html_to_markdown(html: str) -> str:
    text = html
    # Remove CMS-specific tags (MODx snippets, WP shortcodes)
    text = re.sub(r'\[\[.*?\]\]', '', text)  # MODx
    text = re.sub(r'\[/?[a-z_]+.*?\]', '', text)  # WordPress shortcodes
    # Remove inline style blocks
    text = re.sub(r'<style[^>]*>.*?</style>', '', text, flags=re.DOTALL)
    # Headings
    for level in range(6, 0, -1):
        prefix = '#' * level
        text = re.sub(
            rf'<h{level}[^>]*>(.*?)</h{level}>',
            lambda m: f'\n\n{prefix} {strip_tags(m.group(1)).strip()}\n\n',
            text, flags=re.DOTALL
        )
    # Bold, italic, links, code blocks, lists...
    text = re.sub(r'<strong[^>]*>(.*?)</strong>', r'**\1**', text, flags=re.DOTALL)
    text = re.sub(r'<em[^>]*>(.*?)</em>', r'*\1*', text, flags=re.DOTALL)
    fence = '`' * 3  # Markdown code fence
    text = re.sub(
        r'<pre[^>]*><code[^>]*>(.*?)</code></pre>',
        lambda m: f'\n\n{fence}\n{unescape(strip_tags(m.group(1)))}\n{fence}\n\n',
        text, flags=re.DOTALL
    )
    # Clean up whitespace
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()
```

Frontmatter Generation

```python
import json


def build_frontmatter(page: dict, tv_values: dict) -> str:
    tvs = tv_values.get(page["id"], {})
    tags = [t.strip() for t in (tvs.get("tags") or "").split(",") if t.strip()]
    fm = {
        "title": page["pagetitle"],
        "description": page.get("description", ""),
        "publishedAt": page.get("publishedon", "2025-01-01"),
        "tags": tags,
        "metaDescription": page.get("description", ""),
    }
    lines = ["---"]
    for key, value in fm.items():
        if isinstance(value, list):
            if not value:
                lines.append(f"{key}: []")  # a bare "key:" would parse as null
                continue
            lines.append(f"{key}:")
            for item in value:
                lines.append(f"  - {json.dumps(item, ensure_ascii=False)}")
        elif isinstance(value, str):
            # A JSON string is a valid YAML double-quoted scalar, so quotes,
            # colons, and empty values are all written safely
            lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines)
```

Full Conversion Pipeline

```python
# BLOG_EXPORT, TV_EXPORT, and BLOG_OUTPUT are module-level Path constants;
# load_tv_values() and logger are defined elsewhere in the script.
def convert_all():
    pages = load_jsonl(BLOG_EXPORT)
    tv_values = load_tv_values(TV_EXPORT)
    BLOG_OUTPUT.mkdir(parents=True, exist_ok=True)
    for page in pages:
        slug = page["alias"]
        frontmatter = build_frontmatter(page, tv_values)
        body = html_to_markdown(page.get("content", ""))
        output_path = BLOG_OUTPUT / f"{slug}.md"
        output_path.write_text(f"{frontmatter}\n\n{body}", encoding="utf-8")
        logger.info(f"Converted: {slug}")
```
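
The pipeline calls `load_tv_values`, which is not shown. One possible shape is sketched below, assuming each line of the TV export is a JSON object with `contentid`, `name`, and `value` keys. Those key names are an assumption about the export format, not something the AW export is confirmed to use.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_tv_values(path: Path) -> dict[int, dict[str, str]]:
    """Group template-variable rows by page id: {page_id: {tv_name: value}}."""
    values: dict[int, dict[str, str]] = defaultdict(dict)
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            # Assumed row keys: contentid, name, value
            values[int(row["contentid"])][row["name"]] = row.get("value") or ""
    return dict(values)
```

The result matches how `build_frontmatter` reads it: a lookup by page id that yields a dict of TV names to string values.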


6. Image Migration

Download and Organize

```python
import httpx
from pathlib import Path
from urllib.parse import urlsplit


async def download_images(image_urls: list[str], output_dir: Path):
    output_dir.mkdir(parents=True, exist_ok=True)
    async with httpx.AsyncClient(timeout=30.0) as client:
        for url in image_urls:
            # Take the filename from the URL path so query strings are excluded
            filename = Path(urlsplit(url).path).name
            response = await client.get(url)
            if response.status_code == 200:
                (output_dir / filename).write_bytes(response.content)
```

Directory Structure

```
public/
  images/
    blog/
      2025/
        post-slug-banner.webp
        post-slug-diagram-1.svg
      2026/
        ...
    cases/
      case-slug-hero.webp
    services/
      service-icon.svg
```

Optimization

  • Convert PNG/JPG to WebP using sharp or squoosh-cli
  • Keep SVG diagrams as-is (they are already small)
  • Target: hero images <200KB, thumbnails <50KB
  • Always set width and height attributes to prevent CLS

7. Redirect File

Cloudflare Pages Format

```
# old-path new-path status-code
/old-page/ /new-page 301
/blog/old-slug/ /blog/new-slug 301
```

Bulk Generation from Migration Decision Output

```python
import csv
from pathlib import Path


def generate_redirect_file(redirects_csv: Path, output: Path):
    lines = ["# Redirects for old CMS pages -> new Astro site",
             "# Generated by migration_decision.py\n"]
    with open(redirects_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            old = row["old_url"].rstrip("/") + "/"  # ensure trailing slash
            new = row["new_url"]
            lines.append(f"{old} {new} 301")
    output.write_text("\n".join(lines), encoding="utf-8")
```

Testing Redirects

```bash
# Test all redirects resolve (no 404s)
while IFS= read -r line; do
  [[ "$line" =~ ^# ]] && continue
  [[ -z "$line" ]] && continue
  old=$(echo "$line" | awk '{print $1}')
  expected=$(echo "$line" | awk '{print $2}')
  # Anchor on the header name so other headers containing "location" are ignored
  actual=$(curl -sI "https://staging.example.com$old" | grep -i '^location:' | awk '{print $2}' | tr -d '\r')
  if [[ "$actual" != *"$expected"* ]]; then
    echo "FAIL: $old -> $actual (expected $expected)"
  fi
done < public/_redirects
```


8. Validation Checklist

Before launch, verify:

  • [ ] All migrated pages render without errors (astro build exits 0)
  • [ ] All old URLs return 301 (not 404)
  • [ ] No redirect chains (A->B->C -- should be A->C)
  • [ ] No redirect loops
  • [ ] Sitemap includes all new pages and excludes redirected ones
  • [ ] robots.txt allows crawling
  • [ ] JSON-LD validates on 5 sample pages (Google Rich Results Test)
  • [ ] OG tags present and correct (Facebook Sharing Debugger)
  • [ ] Images load on 10 random pages
  • [ ] Code blocks render with syntax highlighting
  • [ ] Contact form works on staging
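
The chain and loop items in the checklist can be checked offline against the _redirects file before anything is deployed. The following is a minimal sketch: `parse_redirects` and `find_chains_and_loops` are illustrative names, and it assumes that a path with and without a trailing slash should count as the same page, which is a conservative choice rather than documented hosting behavior.

```python
def parse_redirects(text: str) -> dict[str, str]:
    """Parse "old new status" lines, skipping comments and blank lines."""
    rules: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            rules[parts[0]] = parts[1]
    return rules


def _norm(path: str) -> str:
    # Assumption: /a and /a/ refer to the same page
    return path.rstrip("/") or "/"


def find_chains_and_loops(rules: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (sources that need more than one hop, sources caught in a loop)."""
    normalized = {_norm(src): _norm(dst) for src, dst in rules.items()}
    chains: list[str] = []
    loops: list[str] = []
    for src in normalized:
        seen = [src]
        current = normalized[src]
        while current in normalized:
            if current in seen:
                loops.append(src)
                break
            seen.append(current)
            current = normalized[current]
        else:
            # Finished on a non-redirected target; more than one hop is a chain
            if len(seen) > 1:
                chains.append(src)
    return chains, loops
```

A typical use is `find_chains_and_loops(parse_redirects(Path("public/_redirects").read_text()))`, failing the build if either list is non-empty.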

Automated Validation

```bash
# Broken link check
npx linkinator https://staging.example.com --recurse --format json > broken-links.json

# Lighthouse audit
npx unlighthouse --site https://staging.example.com --reporter json

# Accessibility
npx pa11y-ci --sitemap https://staging.example.com/sitemap-index.xml
```


9. Traffic Preservation

Pre-Launch

  • Submit new sitemap to GSC
  • Request indexing for top 20 pages by traffic
  • Set up GSC property for new domain (if domain changes)

Post-Launch Monitoring

  • Day 1: Check GSC for crawl errors. Fix any 404s immediately.
  • Week 1: Compare impressions to pre-migration baseline. Drops of 10-20% are normal and recover within 2-4 weeks.
  • Week 2-4: Monitor position changes for top 50 keywords. Flag anything that drops more than 5 positions.
  • Month 2: Full traffic comparison. If traffic has not recovered, investigate: missing redirects, changed canonical URLs, or missing structured data.
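
The week-1 and week-2-4 checks above reduce to two small calculations. This is a minimal sketch, assuming impressions and average positions have already been pulled into plain numbers and dicts; `percent_change` and `flag_position_drops` are illustrative names.

```python
def percent_change(baseline: float, current: float) -> float:
    """Percentage change from baseline; negative values are drops."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) * 100.0 / baseline


def flag_position_drops(
    before: dict[str, float],
    after: dict[str, float],
    max_drop: float = 5.0,
) -> list[str]:
    """Return keywords whose average position worsened by more than max_drop."""
    flagged = []
    for keyword, old_pos in before.items():
        new_pos = after.get(keyword)
        if new_pos is None:
            continue  # no post-migration data for this keyword
        # Position 1 is the top result, so a larger number is a worse ranking
        if new_pos - old_pos > max_drop:
            flagged.append(keyword)
    return flagged
```

Keywords that disappear from the post-migration data are skipped here; in practice they probably deserve their own review list, since a missing keyword can mean the page dropped out of the results entirely.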

IndexNow

Ping search engines immediately after deploying new content:

```python
import httpx


def ping_indexnow(urls: list[str], key: str):
    payload = {
        "host": "yourdomain.com",
        "key": key,
        "urlList": urls,
    }
    httpx.post("https://api.indexnow.org/indexnow", json=payload)
```


10. Rollback Plan

If traffic is still down more than 30% after 4 weeks:

  • DNS rollback: Point the domain back to the old CMS server (if still running)
  • Redirect reversal: Remove the _redirects file and restore the old URL structure
  • Investigate: Compare old vs new canonical URLs, structured data, internal linking

Keep the old CMS server running (read-only) for at least 60 days post-migration. The AW migration kept the MODx server at 138.68.156.149 as a read-only fallback for 90 days.


AW Migration Summary

| Metric | Value |
|---|---|
| Source CMS | MODx Revolution |
| Source pages | 937 published |
| Migrated pages | 147 (115 blog + 12 cases + 5 services + index pages) |
| Redirects | 804 rules |
| Traffic impact | Zero loss after 4-week stabilization |
| Timeline | 3 weeks (export, convert, audit, deploy) |
| Framework | Astro 5.x + Tailwind v4 |
| Hosting | Cloudflare Pages |