# CMS-to-Astro Migration Playbook
Production-tested methodology from the ActiveWizards migration: 937 MODx pages reduced to 147 Astro pages + 804 redirects. Zero traffic loss.
## 1. Export Strategy

### JSONL Export from Source CMS
Export each content type as a separate JSONL file (one JSON object per line). JSONL is superior to monolithic JSON for large exports: it is streamable, diffable, and survives partial corruption.
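MODx export (SQL to JSONL):

```sql
-- Blog posts: one JSON object per row (JSON_OBJECT requires MySQL 5.7.8 or later)
SELECT JSON_OBJECT(
  'id', c.id, 'pagetitle', c.pagetitle, 'longtitle', c.longtitle,
  'description', c.description, 'alias', c.alias, 'content', c.content,
  'publishedon', c.publishedon, 'parent', c.parent, 'template', c.template
)
FROM modx_site_content c
WHERE c.template = 4 AND c.published = 1
INTO OUTFILE '/tmp/aw-blog-export.jsonl'
FIELDS ESCAPED BY ''      -- leave the JSON backslash escapes untouched
LINES TERMINATED BY '\n';
```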
In practice, a Python script with pymysql gives better control:
```python
import json
from pathlib import Path


def export_to_jsonl(rows: list[dict], output: Path) -> None:
    """Write one JSON object per line; default=str serializes dates and Decimals."""
    with open(output, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False, default=str) + "\n")
```
WordPress export: use the built-in exporter to get WXR XML, then convert to JSONL with a script that parses the `<item>` elements. Alternatively, query `wp_posts` directly via MySQL.
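A WXR file is an RSS document with one `<item>` per post, so the conversion can be sketched with the standard library alone. This is a minimal sketch, not the script used in the AW migration: the function name is hypothetical, the output keys (`alias`, `publishedon`) are chosen here only to mirror the MODx export, and the `wp:` namespace URI assumes a WXR 1.2 export.

```python
import json
import xml.etree.ElementTree as ET

# Namespaces declared by WXR 1.2 exports (assumption: adjust the wp: URI
# if your export file declares a different version).
NS = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "wp": "http://wordpress.org/export/1.2/",
}


def wxr_to_jsonl_lines(wxr_xml: str) -> list[str]:
    """Return one JSON line per published post found in a WXR document."""
    root = ET.fromstring(wxr_xml)
    lines = []
    for item in root.iter("item"):
        # Skip attachments, pages, nav items, and unpublished posts
        if item.findtext("wp:post_type", default="", namespaces=NS) != "post":
            continue
        if item.findtext("wp:status", default="", namespaces=NS) != "publish":
            continue
        record = {
            "title": item.findtext("title", default=""),
            "alias": item.findtext("wp:post_name", default="", namespaces=NS),
            "content": item.findtext("content:encoded", default="", namespaces=NS),
            "publishedon": item.findtext("wp:post_date", default="", namespaces=NS),
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines
```

Writing the result is then `Path("wp-blog-export.jsonl").write_text("\n".join(lines) + "\n", encoding="utf-8")`, where the file name is again only an example.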
Ghost export: the Ghost Admin API returns JSON. Paginate with `?limit=15&page=N` and write each post as a JSONL line. Ghost uses Lexical internally; request HTML via `?formats=html`.
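The pagination loop can be sketched independently of the HTTP client. In the sketch below, `fetch_page` is a hypothetical callable that performs the authenticated Admin API request for one page (with `limit=15` and `formats=html`) and returns the decoded JSON; the loop relies on the `meta.pagination.next` field of the response, which is `null` on the last page.

```python
import json
from typing import Callable, Iterable, Iterator, TextIO


def iter_ghost_posts(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every post, following meta.pagination.next until it is null."""
    page = 1
    while page:
        data = fetch_page(page)
        yield from data.get("posts", [])
        page = data.get("meta", {}).get("pagination", {}).get("next")


def write_posts_jsonl(posts: Iterable[dict], out_file: TextIO) -> int:
    """Write one post per line and return how many were written."""
    count = 0
    for post in posts:
        out_file.write(json.dumps(post, ensure_ascii=False) + "\n")
        count += 1
    return count
```

Keeping the request itself behind `fetch_page` means the loop can be exercised with canned responses before it touches a live Ghost instance.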
Always export separately:

- `aw-blog-export.jsonl` -- core content fields
- `aw-cases-export.jsonl` -- case studies
- `aw-tv-export.jsonl` -- template variables / custom fields (tags, images, related posts)
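### JSONL Loader Pattern

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_jsonl(path: Path) -> list[dict]:
    """Load a JSONL file, skipping blank and unparseable lines."""
    items = []
    # errors="replace" keeps the loader going on invalid UTF-8 bytes
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                items.append(json.loads(line))
            except json.JSONDecodeError:
                logger.warning(f"Skipping unparseable line {lineno} in {path.name}")
    return items
```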
## 2. Content Audit: Traffic-Based Priority

Pull 90 days of Google Search Console (GSC) data to make keep/redirect/kill decisions based on actual traffic, not gut feel.
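### GSC Data Pull

```python
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

# Assumption: token.json holds a previously authorized OAuth token
# with the webmasters.readonly scope.
creds = Credentials.from_authorized_user_file("token.json")
service = build("searchconsole", "v1", credentials=creds)

response = service.searchanalytics().query(
    siteUrl="sc-domain:yourdomain.com",
    body={
        "startDate": "2025-12-01",
        "endDate": "2026-02-28",
        "dimensions": ["page"],
        "rowLimit": 25000,  # API maximum per request
    },
).execute()
```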
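The response carries one entry per page under `rows`, with the URL in `keys[0]` and the metrics alongside it. A minimal sketch of turning that into a clicks-per-path lookup for the audit follows; the function name is hypothetical, and summing by path is a choice made here so that protocol or host variants of the same page are counted together.

```python
from urllib.parse import urlparse


def clicks_by_path(response: dict) -> dict[str, int]:
    """Map each URL path to its total clicks from a searchanalytics.query response."""
    totals: dict[str, int] = {}
    for row in response.get("rows", []):
        # keys[0] is the page URL because "page" is the only dimension requested
        path = urlparse(row["keys"][0]).path or "/"
        totals[path] = totals.get(path, 0) + int(row.get("clicks", 0))
    return totals
```

If a site has more than 25,000 pages with impressions, the request would need to be repeated with the `startRow` body parameter; that case is not handled in this sketch.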
### Decision Matrix

| Condition | Action |
| --- | --- |
| Blog post (any traffic) | Migrate |
| Case study (any traffic) | Migrate |
| Homepage | Migrate |
| Service root page | Migrate |
| Other page with >30 clicks/90 days (~10/mo) | Migrate |
| Everything else | 301 redirect to nearest parent |
The AW migration used these exact rules in `scripts/migration_decision.py`:

```python
ALWAYS_MIGRATE_TEMPLATES = {"Home", "Blog", "BlogTest", "CaseStudy"}
ALWAYS_MIGRATE_PATTERNS = ["blog/", "case-stud"]
MIN_CLICKS_90D = 30  # ~10 clicks/month threshold
```
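How these constants combine into a per-page decision is not shown, so the following is only a sketch of one way to apply the decision matrix. The `decide` and `nearest_parent` functions and the `is_service_root` flag are hypothetical names; the constants are repeated so the block runs on its own.

```python
ALWAYS_MIGRATE_TEMPLATES = {"Home", "Blog", "BlogTest", "CaseStudy"}
ALWAYS_MIGRATE_PATTERNS = ["blog/", "case-stud"]
MIN_CLICKS_90D = 30  # ~10 clicks/month threshold


def nearest_parent(path: str) -> str:
    """Strip the last path segment: /services/ai/nlp/ -> /services/ai/."""
    parts = [p for p in path.strip("/").split("/") if p]
    return "/" + "/".join(parts[:-1]) + "/" if len(parts) > 1 else "/"


def decide(path: str, template: str, clicks_90d: int,
           is_service_root: bool = False) -> tuple[str, str]:
    """Return ("migrate", path) or ("redirect", target) for one page."""
    if template in ALWAYS_MIGRATE_TEMPLATES or is_service_root:
        return ("migrate", path)
    if any(pattern in path for pattern in ALWAYS_MIGRATE_PATTERNS):
        return ("migrate", path)
    if clicks_90d > MIN_CLICKS_90D:  # the matrix says "more than 30 clicks"
        return ("migrate", path)
    return ("redirect", nearest_parent(path))
```

For example, `decide("/about/team/", "Page", 12)` returns `("redirect", "/about/")`. Note that the nearest parent may itself be a redirected page, so the generated rules still need a chain check before launch.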
### Output

Two CSV files:

- `pages-to-migrate.csv` -- slug, title, traffic, content type
- `redirects.csv` -- old URL, target URL, status code
## 3. URL Mapping

### Old URLs to New Slugs
Create a mapping CSV that connects every old URL to its new destination:
```csv
old_url,new_url,status
/blog/old-post-title/,/blog/old-post-title,301
/case-study/old-case/,/case-studies/old-case,301
/services/ai-ml/,/services/data-science,301
```
Key decisions:

- Drop trailing slashes if your new site does not use them (Astro's `trailingSlash` option defaults to `'ignore'`, so set it explicitly and confirm how your host serves both forms)
- Flatten nested URLs where possible (`/blog/category/post/` becomes `/blog/post`)
- Rename URL segments if the old CMS used bad patterns (`/case-study/` to `/case-studies/`)
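The three decisions above can be applied mechanically to produce a first draft of the mapping CSV. The sketch below is illustrative only: `new_url_for`, `SEGMENT_RENAMES`, and `FLATTEN_PREFIXES` are hypothetical names, and it assumes that only blog URLs carry an extra category level.

```python
SEGMENT_RENAMES = {"case-study": "case-studies"}  # old first segment -> new one
FLATTEN_PREFIXES = {"blog"}  # assumption: only blog URLs have a category level


def new_url_for(old_url: str) -> str:
    """Drop the trailing slash, rename the first segment, flatten nested blog URLs."""
    parts = [p for p in old_url.strip("/").split("/") if p]
    if not parts:
        return "/"
    parts[0] = SEGMENT_RENAMES.get(parts[0], parts[0])
    if parts[0] in FLATTEN_PREFIXES and len(parts) > 2:
        parts = [parts[0], parts[-1]]  # keep the section and the final slug
    return "/" + "/".join(parts)
```

Flattening can make two posts from different categories collide on the same slug, so it is worth checking the generated `new_url` column for duplicates. Pages whose content is merged into another page (such as `/services/ai-ml/` in the example) still need a manual entry.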
### Redirect File Generation

Cloudflare Pages uses a `_redirects` file in `public/`:

```
# Format: old-path new-path status-code
/contact/ /contact-us/ 301
/old-blog-post/ /blog/new-slug 301
/services/ai-ml/ /services/data-science 301
```
Generate it programmatically from your mapping CSV:
```python
import csv
from pathlib import Path


def generate_redirects(mapping_csv: Path, output: Path) -> None:
    """Write one `old new status` rule per row of the mapping CSV."""
    lines = ["# Redirects generated by migration script", ""]
    with open(mapping_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            lines.append(f"{row['old_url']} {row['new_url']} {row['status']}")
    output.write_text("\n".join(lines) + "\n", encoding="utf-8")
```
AW result: 804 redirect rules in `site/aw/public/_redirects`. Internal fragments (CTAs, sliders, chunks) all redirect to `/` with 301.
## 4. Content Collection Schema Design
Astro content collections use Zod schemas to validate frontmatter at build time. Design schemas that are strict enough to catch errors but flexible enough to handle legacy content with missing fields.
### Blog Schema (from AW production)

```typescript
// src/content.config.ts
import { defineCollection, z } from 'astro:content';
import { glob } from 'astro/loaders';

const blog = defineCollection({
  loader: glob({ pattern: '**/*.md', base: './src/content/blog' }),
  schema: z.object({
    title: z.string(),
    description: z.string(),
    publishedAt: z.coerce.date(),
    updatedAt: z.coerce.date().optional(),
    tags: z.array(z.string()).default([]),
    // GEO fields (Generative Engine Optimization)
    problem: z.string().optional(),
    technology: z.string().optional(),
    technologyVersion: z.string().optional(),
    persona: z.string().optional(),
    // SEO
    metaTitle: z.string().optional(),
    metaDescription: z.string().optional(),
    ogImage: z.string().optional(),
    // Content
    tldr: z.string().optional(),
    relatedPosts: z.array(z.string()).default([]),
    readingTime: z.number().optional(),
  }),
});
```
### Case Study Schema

```typescript
const cases = defineCollection({
  loader: glob({ pattern: '**/*.md', base: './src/content/cases' }),
  schema: z.object({
    title: z.string(),
    description: z.string(),
    publishedAt: z.coerce.date(),
    client: z.string().optional(),
    industry: z.string().optional(),
    technologies: z.array(z.string()).default([]),
    metrics: z.array(z.object({
      label: z.string(),
      value: z.string(),
    })).default([]),
    image: z.string().optional(),
    metaTitle: z.string().optional(),
    metaDescription: z.string().optional(),
  }),
});

// Astro only registers collections that are exported under this name
export const collections = { blog, cases };
```
### Schema Design Rules

- Required fields for content that must exist: `title`, `description`, `publishedAt`
- Optional fields for legacy content that may lack them: `metaTitle`, `ogImage`, `tldr`
- Defaults for arrays and booleans: `.default([])`, `.default(false)`
- Coerce dates: `z.coerce.date()` handles both ISO strings and Date objects
- GEO fields are optional per-post but strongly encouraged: `problem`, `technology`, `persona`
## 5. Conversion Script Pattern

The conversion script reads JSONL exports and generates `.md` files with YAML frontmatter.
### HTML to Markdown Conversion

```python
import re
from html import unescape


def strip_tags(fragment: str) -> str:
    """Minimal helper assumed here: drop any remaining HTML tags."""
    return re.sub(r'<[^>]+>', '', fragment)


def html_to_markdown(html: str) -> str:
    text = html
    fence = '`' * 3  # built at runtime so this listing can sit inside a fenced block

    # Remove CMS-specific tags (MODx snippets, WP shortcodes)
    text = re.sub(r'\[\[.*?\]\]', '', text)  # MODx
    text = re.sub(r'\[/?[a-z_]+.*?\]', '', text)  # WordPress shortcodes

    # Remove inline style blocks
    text = re.sub(r'<style[^>]*>.*?</style>', '', text, flags=re.DOTALL)

    # Headings
    for level in range(6, 0, -1):
        prefix = '#' * level
        text = re.sub(
            rf'<h{level}[^>]*>(.*?)</h{level}>',
            lambda m: f'\n\n{prefix} {strip_tags(m.group(1)).strip()}\n\n',
            text, flags=re.DOTALL
        )

    # Bold, italic, links, code blocks, lists...
    text = re.sub(r'<strong\b[^>]*>(.*?)</strong>', r'**\1**', text, flags=re.DOTALL)
    text = re.sub(r'<em\b[^>]*>(.*?)</em>', r'*\1*', text, flags=re.DOTALL)
    text = re.sub(
        r'<pre[^>]*>\s*<code[^>]*>(.*?)</code>\s*</pre>',
        lambda m: f'\n\n{fence}\n{unescape(strip_tags(m.group(1)))}\n{fence}\n\n',
        text, flags=re.DOTALL
    )

    # Clean up whitespace
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()
```
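The listing above mentions links and lists without showing them. A minimal sketch of those two rules follows, under the assumption that the source HTML uses double-quoted `href` attributes and flat (non-nested) lists; `convert_links_and_lists` is a hypothetical name, and ordered lists are flattened to bullets for simplicity.

```python
import re


def convert_links_and_lists(html: str) -> str:
    """Convert <a> tags to Markdown links and <li> items to "- " bullets."""
    text = html
    # <a href="url">label</a> -> [label](url)
    text = re.sub(
        r'<a\b[^>]*?href="([^"]*)"[^>]*>(.*?)</a>',
        lambda m: f'[{m.group(2).strip()}]({m.group(1)})',
        text, flags=re.DOTALL | re.IGNORECASE,
    )
    # <li>item</li> -> "- item" on its own line
    text = re.sub(
        r'<li\b[^>]*>(.*?)</li>',
        lambda m: f'- {m.group(1).strip()}\n',
        text, flags=re.DOTALL | re.IGNORECASE,
    )
    # Drop the <ul>/<ol> wrappers, then collapse the extra blank lines
    text = re.sub(r'</?(?:ul|ol)\b[^>]*>', '\n', text, flags=re.IGNORECASE)
    return re.sub(r'\n{3,}', '\n\n', text).strip()
```

If this step is added to the pipeline, it has to run after the shortcode-stripping regexes: a Markdown link label such as `[read more]` looks like a WordPress shortcode to that pattern and would be deleted.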
### Frontmatter Generation

```python
import json


def build_frontmatter(page: dict, tv_values: dict) -> str:
    tvs = tv_values.get(page["id"], {})
    tags = [t.strip() for t in tvs.get("tags", "").split(",") if t.strip()]
    fm = {
        "title": page["pagetitle"],
        "description": page.get("description", ""),
        "publishedAt": page.get("publishedon", "2025-01-01"),
        "tags": tags,
        "metaDescription": page.get("description", ""),
    }
    lines = ["---"]
    for key, value in fm.items():
        if isinstance(value, list):
            if not value:
                lines.append(f"{key}: []")  # a bare "key:" would parse as null
                continue
            lines.append(f"{key}:")
            for item in value:
                lines.append(f"  - {json.dumps(item, ensure_ascii=False)}")
        elif isinstance(value, str):
            # A JSON string is a valid YAML double-quoted scalar, so json.dumps
            # escapes embedded quotes, colons, and backslashes safely
            lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines)
```
### Full Conversion Pipeline

```python
def convert_all():
    pages = load_jsonl(BLOG_EXPORT)
    tv_values = load_tv_values(TV_EXPORT)
    BLOG_OUTPUT.mkdir(parents=True, exist_ok=True)
    for page in pages:
        slug = page["alias"]
        frontmatter = build_frontmatter(page, tv_values)
        body = html_to_markdown(page.get("content", ""))
        output_path = BLOG_OUTPUT / f"{slug}.md"
        output_path.write_text(f"{frontmatter}\n\n{body}", encoding="utf-8")
        logger.info(f"Converted: {slug}")
```
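The pipeline calls `load_tv_values`, which is not shown. One possible shape is sketched below, assuming each line of the TV export is a row like `{"contentid": 42, "name": "tags", "value": "ai, ml"}`; those field names are an assumption about the export and should be adjusted to match the actual query.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_tv_values(path: Path) -> dict[int, dict[str, str]]:
    """Group template-variable rows by page id: {page_id: {tv_name: value}}."""
    grouped: dict[int, dict[str, str]] = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            row = json.loads(line)
            # Assumed row shape: {"contentid": 42, "name": "tags", "value": "ai, ml"}
            grouped[int(row["contentid"])][row["name"]] = row["value"]
    return dict(grouped)
```

The keys are integers, so the lookup in `build_frontmatter` only finds them if `page["id"]` is also an integer in the blog export.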
## 6. Image Migration

### Download and Organize

```python
import httpx
from pathlib import Path


async def download_images(image_urls: list[str], output_dir: Path) -> None:
    output_dir.mkdir(parents=True, exist_ok=True)
    # httpx does not follow redirects unless asked to
    async with httpx.AsyncClient(timeout=30.0, follow_redirects=True) as client:
        for url in image_urls:
            filename = url.split("/")[-1].split("?")[0]  # drop any query string
            response = await client.get(url)
            if response.status_code == 200:
                (output_dir / filename).write_bytes(response.content)
```
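The downloader needs a list of image URLs, which has to come from the exported HTML. A minimal sketch of that extraction step is shown below; `extract_image_urls` and `local_filename` are hypothetical helpers, and the regex assumes quoted `src` attributes rather than handling every HTML edge case.

```python
import re
from urllib.parse import urljoin, urlparse

# Matches the src attribute of <img> tags; \s keeps data-src attributes out
IMG_SRC = re.compile(r'<img\b[^>]*?\ssrc=["\']([^"\']+)["\']', re.IGNORECASE)


def extract_image_urls(html: str, base_url: str) -> list[str]:
    """Return absolute, de-duplicated image URLs in document order."""
    seen: dict[str, None] = {}
    for src in IMG_SRC.findall(html):
        if src.startswith("data:"):
            continue  # inline images need no download
        seen.setdefault(urljoin(base_url, src), None)
    return list(seen)


def local_filename(url: str) -> str:
    """File name without query string or fragment."""
    return urlparse(url).path.rsplit("/", 1)[-1]
```

Running it over `page["content"]` for every exported page, with the old site's origin as `base_url`, yields the input for `download_images`.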
### Directory Structure

```
public/
  images/
    blog/
      2025/
        post-slug-banner.webp
        post-slug-diagram-1.svg
      2026/
        ...
    cases/
      case-slug-hero.webp
    services/
      service-icon.svg
```
### Optimization

- Convert PNG/JPG to WebP using `sharp` or `squoosh-cli`
- Keep SVG diagrams as-is (they are already small)
- Target: hero images <200KB, thumbnails <50KB
- Always set `width` and `height` attributes to prevent layout shift (CLS)
## 7. Redirect File

### Cloudflare Pages Format

```
# old-path new-path status-code
/old-page/ /new-page 301
/blog/old-slug/ /blog/new-slug 301
```
### Bulk Generation from Migration Decision Output

```python
import csv
from pathlib import Path


def generate_redirect_file(redirects_csv: Path, output: Path) -> None:
    lines = ["# Redirects for old CMS pages -> new Astro site",
             "# Generated by migration_decision.py\n"]
    with open(redirects_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            old = row["old_url"].rstrip("/") + "/"  # ensure trailing slash
            new = row["new_url"]
            lines.append(f"{old} {new} 301")
    output.write_text("\n".join(lines) + "\n", encoding="utf-8")
```
### Testing Redirects

```bash
# Test all redirects resolve (no 404s)
while IFS= read -r line; do
  [[ "$line" =~ ^# ]] && continue
  [[ -z "$line" ]] && continue
  old=$(echo "$line" | awk '{print $1}')
  expected=$(echo "$line" | awk '{print $2}')
  actual=$(curl -sI "https://staging.example.com$old" | grep -i location | awk '{print $2}' | tr -d '\r')
  if [[ "$actual" != *"$expected"* ]]; then
    echo "FAIL: $old -> $actual (expected $expected)"
  fi
done < public/_redirects
```
## 8. Validation Checklist

Before launch, verify:

- [ ] All migrated pages render without errors (`astro build` exits 0)
- [ ] All old URLs return 301 (not 404)
- [ ] No redirect chains (A->B->C -- should be A->C)
- [ ] No redirect loops
- [ ] Sitemap includes all new pages and excludes redirected ones
- [ ] `robots.txt` allows crawling
- [ ] JSON-LD validates on 5 sample pages (Google Rich Results Test)
- [ ] OG tags present and correct (Facebook Sharing Debugger)
- [ ] Images load on 10 random pages
- [ ] Code blocks render with syntax highlighting
- [ ] Contact form works on staging
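The chain and loop items in this checklist can be checked offline against the `_redirects` file before anything is deployed. The sketch below is one way to do it, with hypothetical function names. It compares paths with trailing slashes removed, and it deliberately ignores rules whose source and target differ only by a trailing slash, since those are slash normalizations rather than real hops.

```python
def parse_redirects(text: str) -> dict[str, str]:
    """Parse `_redirects` content into {source: target}, ignoring comments."""
    rules: dict[str, str] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and not fields[0].startswith("#"):
            rules[fields[0]] = fields[1]
    return rules


def _norm(path: str) -> str:
    return path.rstrip("/") or "/"


def find_chains_and_loops(rules: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (chained_sources, looping_sources) as slash-normalized paths."""
    table = {
        _norm(src): _norm(dst)
        for src, dst in rules.items()
        if _norm(src) != _norm(dst)  # slash-only rules are not real hops
    }
    chains, loops = [], []
    for src in table:
        seen = {src}
        current = table[src]
        hops = 1
        while current in table:  # the target is itself redirected
            if current in seen:
                loops.append(src)
                break
            seen.add(current)
            current = table[current]
            hops += 1
        else:
            if hops > 1:
                chains.append(src)
    return chains, loops
```

Typical use is `find_chains_and_loops(parse_redirects(Path("public/_redirects").read_text()))`; each chained source should then be rewritten to point straight at its final target.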
### Automated Validation

```bash
# Broken link check
npx linkinator https://staging.example.com --recurse --format json > broken-links.json

# Lighthouse audit
npx unlighthouse --site https://staging.example.com --reporter json

# Accessibility
npx pa11y-ci --sitemap https://staging.example.com/sitemap-index.xml
```
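The checklist item about the sitemap excluding redirected URLs is not covered by these tools, but it can be checked with a short script. The sketch below assumes a standard `<urlset>` sitemap file (not the index file that points to it) and a `{source: target}` dict of redirect rules; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def _norm(path: str) -> str:
    return path.rstrip("/") or "/"


def sitemap_paths(sitemap_xml: str) -> set[str]:
    """Slash-normalized paths of every <loc> in a urlset sitemap."""
    root = ET.fromstring(sitemap_xml)
    return {
        _norm(urlparse(loc.text.strip()).path)
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def redirected_urls_in_sitemap(sitemap_xml: str, rules: dict[str, str]) -> set[str]:
    """Sitemap paths that are also redirect sources (these should not be listed)."""
    # Rules that only add or remove a trailing slash do not move the page
    moved = {_norm(src) for src, dst in rules.items() if _norm(src) != _norm(dst)}
    return sitemap_paths(sitemap_xml) & moved
```

An empty result means no redirected URL is advertised in the sitemap.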
## 9. Traffic Preservation

### Pre-Launch

- Submit new sitemap to GSC
- Request indexing for top 20 pages by traffic
- Set up GSC property for new domain (if domain changes)
### Post-Launch Monitoring
- Day 1: Check GSC for crawl errors. Fix any 404s immediately.
- Week 1: Compare impressions to pre-migration baseline. Drops of 10-20% are normal and recover within 2-4 weeks.
- Week 2-4: Monitor position changes for top 50 keywords. Flag anything that drops more than 5 positions.
- Month 2: Full traffic comparison. If traffic has not recovered, investigate: missing redirects, changed canonical URLs, or missing structured data.
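The thresholds in this schedule are simple arithmetic, so they can be turned into automated checks on the weekly GSC numbers. The following is a sketch with hypothetical function names; it treats any week-1 impression drop beyond 20% as worth investigating and flags keywords that lose more than 5 positions.

```python
def impression_change(baseline: int, current: int) -> float:
    """Relative change vs the pre-migration baseline, e.g. -0.15 for a 15% drop."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline


def week1_status(baseline: int, current: int) -> str:
    """Drops of up to 20% are treated as normal; anything larger is flagged."""
    change = impression_change(baseline, current)
    return "investigate" if change < -0.20 else "within normal range"


def flag_position_drops(before: dict[str, float], after: dict[str, float],
                        max_drop: float = 5.0) -> list[str]:
    """Keywords whose average position worsened by more than max_drop.

    A higher position number is worse, so a drop is after - before > 0.
    """
    return sorted(
        kw for kw, old in before.items()
        if kw in after and after[kw] - old > max_drop
    )
```

For example, a baseline of 1,000 weekly impressions and a current value of 850 gives a change of -0.15, which falls inside the normal range.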
### IndexNow

Ping search engines immediately after deploying new content:

```python
import httpx


def ping_indexnow(urls: list[str], key: str) -> None:
    payload = {
        "host": "yourdomain.com",
        "key": key,
        "urlList": urls,
    }
    response = httpx.post("https://api.indexnow.org/indexnow", json=payload)
    response.raise_for_status()  # surface rejected keys or malformed payloads
```
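IndexNow accepts at most 10,000 URLs per request, and every URL in a request has to belong to the declared host. For larger sites, the payloads can be prepared in batches before sending; the sketch below only builds the payloads, and `build_indexnow_payloads` is a hypothetical helper name.

```python
from urllib.parse import urlparse

MAX_URLS_PER_REQUEST = 10_000  # per-request limit in the IndexNow protocol


def build_indexnow_payloads(urls: list[str], host: str, key: str) -> list[dict]:
    """Split the URLs belonging to `host` into payloads of at most 10,000 each."""
    own = [u for u in urls if urlparse(u).netloc == host]  # drop foreign hosts
    return [
        {"host": host, "key": key, "urlList": own[i:i + MAX_URLS_PER_REQUEST]}
        for i in range(0, len(own), MAX_URLS_PER_REQUEST)
    ]
```

Each returned dict can be posted to the IndexNow endpoint as a separate request.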
## 10. Rollback Plan

If traffic drops exceed 30% after 4 weeks:

- DNS rollback: Point domain back to old CMS server (if still running)
- Redirect reversal: Remove `_redirects` file, restore old URL structure
Keep the old CMS server running (read-only) for at least 60 days post-migration. The AW migration kept the MODx server at 138.68.156.149 as a read-only fallback for 90 days.
## AW Migration Summary

| Metric | Value |
| --- | --- |
| Source CMS | MODx Revolution |
| Source pages | 937 published |
| Migrated pages | 147 (115 blog + 12 cases + 5 services + index pages) |
| Redirects | 804 rules |
| Traffic impact | Zero loss after 4-week stabilization |
| Timeline | 3 weeks (export, convert, audit, deploy) |
| Framework | Astro 5.x + Tailwind v4 |
| Hosting | Cloudflare Pages |