Production-tested content generation and quality system from Pagezilla. Covers research, generation, validation, diagrams, publishing, and ongoing maintenance for a programmatic SEO blog.
1. Content Calendar: Topic Research via GSC Semantic Silence
Striking Distance Keywords
Pull keywords from GSC where you rank positions 4-20 with high impressions. These are the easiest wins: you are already visible but not yet winning clicks.
class PagezillaResearchEngine:
    async def ingest_gsc_data(self) -> list[dict]:
        from src.tools.gsc_connector import GSCConnector

        gsc = GSCConnector()
        # Positions 4-20 with high impressions = "striking distance"
        self.gsc_data = gsc.get_striking_distance(days=90)
        return self.gsc_data
Semantic Silence Discovery
"Semantic silence" is the gap between what users search for and what your blog covers. Use an LLM to compare GSC demand against your existing content inventory:
async def discover_semantic_silence(self) -> list[dict]:
    """Compare GSC demand vs. existing content to find uncovered topics."""
    existing = await self.ingest_existing_posts()
    gsc_keywords = await self.ingest_gsc_data()
    prompt = f"""
    Here are our existing blog posts: {json.dumps(existing)}
    Here are high-impression keywords we rank for but have no dedicated content:
    {json.dumps(gsc_keywords)}
    Identify 10 topics where we have search demand but no matching article.
    For each: title, primary keyword, intent, rationale.
    """
    opportunities = await llm.reason(prompt)
    return opportunities
Anti-Cannibalization
Before creating new content, check if an existing post already targets the same keyword. Cannibalization splits ranking signals and hurts both pages.
def is_cannibalized(new_keyword: str, existing_posts: list[dict]) -> bool:
    """Check if any existing post already targets this keyword."""
    for post in existing_posts:
        if new_keyword.lower() in post.get("title", "").lower():
            return True
        if new_keyword.lower() in post.get("description", "").lower():
            return True
    return False
Content Calendar Format
CSV with tiered priorities. Tier 1 = highest impact, publish first.
tier,priority,pillar,title,slug,primary_keyword,status,scheduled_date,posted_url,notes
1,1,AI Agents,Mastering LangGraph,mastering-langgraph,langgraph stateful workflows,published,,https://...
1,2,RAG,Pinecone Tuning Guide,pinecone-tuning,pinecone performance,published,,https://...
2,1,MLOps,Kubernetes for LLMs,k8s-llm-deploy,kubernetes llm deployment,to_do,,,
The ActiveWizards calendar: 75 topics across 5 pillars (AI Agents, RAG & Knowledge, MLOps, Multi-Agent, Real-Time Data), tiered 1-5.
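Selecting what to write next from this CSV is a filter-and-sort. A minimal sketch, assuming rows are already parsed into dicts keyed by the column names above:

```python
def next_topics(rows: list[dict], limit: int = 5) -> list[dict]:
    """Return the highest-priority unwritten topics: lowest tier first, then priority."""
    todo = [r for r in rows if r["status"] == "to_do"]
    todo.sort(key=lambda r: (int(r["tier"]), int(r["priority"])))
    return todo[:limit]
```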
2. Article Generation: LLM Prompt System
Model Routing
Different models for different tasks. Use the best tool for each job:
| Task | Model | Reason |
| Topic research / reasoning | Gemini 2.5 Pro | Deep analytical thinking |
| Article writing | Claude Sonnet 4 | Best prose quality, follows structured output |
| Validation / fast checks | Gemini Flash | Fast, cheap, good for classification |
| Image generation | Gemini 2.0 Flash Image | Best API-accessible image generation |
| D2 syntax fixing | Gemini Flash | Fast turnaround on simple code fixes |
All models accessed through a single OpenRouter key (except Google image API).
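In code, the routing can be a plain lookup table keyed by task. A sketch; the OpenRouter model IDs shown are illustrative assumptions, not verified identifiers:

```python
# Task -> OpenRouter model ID. The IDs below are illustrative assumptions.
MODEL_ROUTES = {
    "research": "google/gemini-2.5-pro",
    "writing": "anthropic/claude-sonnet-4",
    "validation": "google/gemini-2.5-flash",
    "d2_fix": "google/gemini-2.5-flash",
}

def pick_model(task: str) -> str:
    """Resolve a pipeline task to its model, defaulting to the cheap validator."""
    return MODEL_ROUTES.get(task, MODEL_ROUTES["validation"])
```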
System Prompt
The system prompt defines voice, rules, and structure. Key elements:
WRITER_SYSTEM_PROMPT = """
You are a Principal AI/Data Engineer writing technical articles for ActiveWizards.
WRITING RULES (non-negotiable):
1. Tone: Expert, specific, opinionated. Written by a Senior Staff Engineer.
2. BANNED words (instant rejection): delve, robust, seamlessly, demystify,
landscape, unleash, transformative, revolutionize, game-changer.
3. Depth: Assume the reader is a Senior Engineer or technical CTO.
4. Code: All code snippets must be complete, correct, and well-commented.
5. Diagrams: D2 code provided separately -- embed as <img> tags.
6. Tables: EVERY <table> MUST be wrapped in <div class="table-responsive-wrapper">.
7. CTA: Every article ends with <div class="article-cta-section">.
GEO RULES (Generative Engine Optimization):
8. TL;DR BLOCK: 5-8 specific, numeric takeaways for AI citation.
9. FAQ SECTION: 3-5 Q&A pairs matching real search queries.
10. CITATION ANCHORS: At least 3 bold, opinionated one-liners.
11. SPECIFICITY MANDATE: Every claim must be grounded with numbers.
"""
Article Prompt Structure
1. Problem-first intro (2-3 paragraphs, no "In this article we will...")
2. TL;DR block (immediately after intro)
3. 4-6 H2 sections: concepts, architecture, implementation, trade-offs
4. Minimum 2 code snippets
5. At least 1 comparison table
6. At least 1 pro-tip callout
7. FAQ section
8. Further reading section
9. CTA section (topic-specific, not generic)
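The structure above can be assembled into the writer's user prompt from a calendar row. A sketch; the dict keys match the CSV columns, and the helper name is an assumption:

```python
def build_article_prompt(topic: dict) -> str:
    """Turn a content-calendar row into the writer's user prompt."""
    return (
        f"Write a technical article titled '{topic['title']}'.\n"
        f"Primary keyword: {topic['primary_keyword']}. Pillar: {topic['pillar']}.\n"
        "Structure: problem-first intro (2-3 paragraphs), TL;DR block, "
        "4-6 H2 sections, >=2 code snippets, >=1 comparison table, "
        ">=1 pro-tip callout, FAQ, further reading, topic-specific CTA."
    )
```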
3. Quality Benchmark: 10-Point Rubric
Score every article against this rubric before publishing. Minimum passing score: 8/10.
| # | Criterion | Pass Condition |
| 1 | Word count | >= 1,800 words of substantive content |
| 2 | Diagrams | >= 2 D2 diagrams rendered as SVG |
| 3 | Code snippets | >= 2 complete, runnable code blocks |
| 4 | TL;DR block | 5-8 specific, numeric bullet points |
| 5 | FAQ section | 3-5 Q&A pairs matching real search queries |
| 6 | Citation anchors | >= 3 bold, opinionated claims |
| 7 | Banned words | Zero instances of banned words list |
| 8 | CTA section | Present, topic-specific (not generic) |
| 9 | Table | At least 1 comparison table, wrapped in responsive div |
| 10 | Pro-tip callout | At least 1 callout with practical advice |
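The pass threshold is mechanical once per-criterion 0/1 scores exist (the automated scorer in section 9 produces a dict of exactly this shape). A sketch:

```python
def passes_rubric(scores: dict[str, int], minimum: int = 8) -> bool:
    """True when the summed 10-point rubric score meets the publishing bar."""
    return sum(scores.values()) >= minimum
```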
4. Pydantic Validation
Every generated article is validated through a Pydantic schema before it can proceed to review. This catches LLM output quality issues at generation time, not after human review.
Article Schema (Key Validators)
import re

from pydantic import BaseModel, field_validator, model_validator

class PagezillaArticle(BaseModel):
    pagetitle: str               # Max 60 chars
    description: str             # Max 155 chars
    summary: str                 # Max 120 chars
    alias: str                   # URL slug
    html_body: str               # Complete article HTML
    tags: list[str]              # 5-10 relevant tags
    diagrams: list[DiagramSpec]  # Min 2 diagrams
    tl_dr_bullets: list[str]     # 5-8 specific takeaways
    faq_items: list[dict]        # Min 3 Q&A pairs
    schema_json_ld: str          # TechArticle + FAQPage JSON-LD

    @field_validator("pagetitle")
    @classmethod
    def check_title_length(cls, v: str) -> str:
        if len(v) > 60:
            return v[:60].rsplit(" ", 1)[0].rstrip(".,: ")
        return v

    @field_validator("html_body")
    @classmethod
    def check_no_llm_speak(cls, v: str) -> str:
        found = [w for w in BANNED_WORDS if w.lower() in v.lower()]
        if found:
            raise ValueError(f"LLM-speak detected: {found}")
        return v

    @field_validator("diagrams")
    @classmethod
    def check_diagrams(cls, v: list[DiagramSpec]) -> list[DiagramSpec]:
        if len(v) < 2:
            raise ValueError(f"Need at least 2 diagrams, got {len(v)}")
        return v

    @model_validator(mode="after")
    def check_word_count(self) -> "PagezillaArticle":
        text = re.sub(r"<[^>]+>", " ", self.html_body)
        word_count = len(text.split())
        if word_count < 1200:
            raise ValueError(f"Word count ({word_count}) too low. Target 1,800+.")
        return self
Banned Words List
BANNED_WORDS = [
    "delve", "robust", "seamlessly", "demystify", "landscape",
    "unleash", "transformative", "revolutionize", "game-changer",
    "it's worth noting", "it is worth noting", "having said that",
    "in conclusion", "in summary",
]
Validation Chain
The schema enforces 12 validators in sequence. If any validator fails, the article is rejected and the LLM is re-prompted with the specific error message.
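The reject-and-re-prompt loop can be sketched generically. The `generate` and `validate` callables are injected (an assumption, since the real wiring isn't shown here); `validate` raises `ValueError` with the specific failure:

```python
def generate_with_retries(generate, validate, prompt: str, max_attempts: int = 3):
    """Regenerate until validation passes, feeding each error back into the prompt."""
    last_error = None
    for _ in range(max_attempts):
        draft = generate(prompt)
        try:
            validate(draft)
            return draft
        except ValueError as e:
            # Append the validator's message so the LLM can self-correct
            last_error = str(e)
            prompt = f"{prompt}\n\nFix this validation error: {last_error}"
    raise RuntimeError(f"Rejected after {max_attempts} attempts: {last_error}")
```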
5. D2 Diagram Pipeline
Source to SVG
D2 code in DiagramSpec
|
v
Kroki API (POST https://kroki.io/d2/svg)
|
v
SVG file saved to artifact directory
|
v
Referenced in HTML as <img src="assets/blog_visuals/slug_diagram_N.svg">
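The pipeline above POSTs raw D2 to Kroki; for quick manual checks, Kroki also serves diagrams from a GET URL where the source is deflate-compressed and base64url-encoded. A minimal encoder (padding handling may differ from Kroki's examples, but the decode round-trips):

```python
import base64
import zlib

def kroki_get_url(d2_code: str, fmt: str = "svg") -> str:
    """Encode D2 source for Kroki's GET endpoint: deflate, then base64url."""
    payload = base64.urlsafe_b64encode(zlib.compress(d2_code.encode("utf-8"), 9))
    return f"https://kroki.io/d2/{fmt}/{payload.decode('ascii')}"
```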
D2 Rules for Kroki Compatibility
# Validated at schema level
@field_validator("d2_code")
@classmethod
def no_unsupported_blocks(cls, v: str) -> str:
    for blocked in ("vars:", "d2-config:"):
        if blocked in v:
            raise ValueError(f"D2 contains unsupported block '{blocked}'")
    return v
D2 best practices for Kroki:
- Use direction: right or direction: down
- Quote strings with spaces: "API Gateway"
- Arrow syntax: A -> B: "label"
- Containers: container { child1; child2 }
- NO vars: blocks, NO d2-config: blocks
- Sanitize $ characters (Kroki treats them as substitution)
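The last two rules can be enforced before rendering. A sketch; dropping the `$` outright (rather than escaping it) and filtering only the block header lines are both assumptions about what is good enough in practice:

```python
def sanitize_d2(code: str) -> str:
    """Strip constructs Kroki rejects before sending D2 source."""
    # Drop vars:/d2-config: header lines (full block bodies would need a real parser)
    code = "\n".join(
        line for line in code.splitlines()
        if not line.lstrip().startswith(("vars:", "d2-config:"))
    )
    # Remove $ so Kroki's substitution engine never fires
    return code.replace("$", "")
```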
Auto-Fix on Failure
async def render_with_fallback(self, renderer, diagram, filename, artifact_dir):
    # Attempt 1: render as-is
    path = await renderer.render(diagram.d2_code, filename)
    if path:
        return Path(path)
    # Attempt 2: ask LLM to fix syntax
    fixed_code = await llm.fix_d2_syntax(diagram.d2_code, error_hint="Kroki 400")
    diagram.d2_code = fixed_code
    path = await renderer.render(fixed_code, filename)
    if path:
        return Path(path)
    # Fallback: log failure, leave placeholder
    _log_diagram_failure(diagram)
    return None
6. Lead Magnets
Catalog
Maintain a JSON catalog of lead magnets, each mapped to content pillars:
{
  "lead_magnets": [
    {
      "id": "rag-checklist",
      "title": "Production RAG Engineering Checklist",
      "pillars": ["RAG & Knowledge", "MLOps"],
      "position_in_article": "after_section_3",
      "embed_html": "<div class='lead-magnet-cta'>...</div>"
    }
  ]
}
Automatic Selection
The content factory selects the best lead magnet based on article pillar and tags:
catalog = get_catalog()
magnet = catalog.select_for_article(
    pillar=topic.get("pillar", ""),
    tags=article.tags,
)
if magnet:
    article.lead_magnet_id = magnet.lead_magnet_id
Inline CTA Injection
Lead magnet CTAs are injected at specific positions in the article HTML:
| Position | Description |
| before_cta | Before the article CTA section (bottom) |
| after_section_3 | After the 3rd H2 section (mid-article) |
| after_intro | After the first paragraph (top) |
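after_section_3 can be implemented with a simple H2 scan: insert the embed just before the (n+1)th `<h2>`, falling back to appending when the article has fewer sections. A sketch; the function name is an assumption:

```python
import re

def inject_after_section(html: str, embed: str, n: int = 3) -> str:
    """Insert `embed` just before the (n+1)th <h2>, i.e. after the nth section."""
    starts = [m.start() for m in re.finditer(r"<h2\b", html)]
    if len(starts) <= n:
        return html + embed  # fewer sections than expected: append at the end
    pos = starts[n]
    return html[:pos] + embed + html[pos:]
```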
7. Publishing Workflow
Folder-Based Pipeline
data/content/
  to_review/            # Generated, awaiting human review
    post-slug/
      article.json      # Complete structured data
      article.html      # Raw HTML body
      meta.json         # Metadata summary
      diagram_1.svg     # Rendered diagrams
      diagram_2.svg
      banner.png        # Generated banner image
  to_publish/           # Approved by reviewer
    post-slug/
      ... (same files)
  published/            # Published to CMS
    post-slug/
      ... (same files + publish receipt)
HITL Review Process
1. Generate: python main.py generate --topic "Topic Title"
2. Review data/content/to_review/post-slug/article.html in a browser
3. Move the approved folder from to_review/ to to_publish/
4. Publish: python main.py publish --slug post-slug
5. The folder moves to published/ after a successful publish
Publish Action
Publishing creates the page in the CMS as a draft, then optionally publishes:
def to_modx_payload(self) -> dict:
    return {
        "pagetitle": self.pagetitle,
        "description": self.description,
        "content": self.html_body,
        "alias": self.alias,
        "published": 0,  # always draft first -- HITL approves
        "tvs": {
            "summary": self.summary,
            "image": self.image_path,
            "tags": ",".join(self.tags),
            "related_posts": self.related_posts_ids,
            "service_list": self.service_list_ids,
        },
    }
8. Post-Publish Actions
IndexNow Ping
Immediately notify search engines of new content:
import httpx

async def ping_indexnow(urls: list[str], key: str):
    payload = {
        "host": "yourdomain.com",
        "key": key,
        "urlList": urls,
    }
    async with httpx.AsyncClient() as client:
        await client.post("https://api.indexnow.org/indexnow", json=payload)
GSC Inspection
Check the new URL's index status via the GSC URL Inspection API:
service.urlInspection().index().inspect(body={
    "inspectionUrl": f"https://yourdomain.com/blog/{slug}",
    "siteUrl": "sc-domain:yourdomain.com",
}).execute()
Social Sharing
For each new post, generate a social media snippet:
- LinkedIn: 3-5 line hook + link (see Arizen LinkedIn voice guide for format)
- Twitter/X: Key insight in <280 chars + link
Post-Publish Checklist
- [ ] Page loads correctly on staging
- [ ] Meta title and description render in browser tab
- [ ] OG image appears in social sharing preview
- [ ] Diagrams render (all SVGs load)
- [ ] Code blocks have syntax highlighting
- [ ] IndexNow ping sent
- [ ] GSC indexing requested
- [ ] Social post drafted/published
9. Audit Existing Posts
Scoring Process
Score every existing post against the 10-point rubric. Export scores to CSV:
slug,word_count,diagrams,code_blocks,tldr,faq,citations,banned_words,cta,table,protip,score,action
old-post-slug,1200,0,1,no,no,0,yes,no,0,no,2,rewrite
good-post-slug,2400,3,4,yes,yes,5,no,yes,2,yes,10,keep
Action Tiers
| Score | Action |
| 8-10 | Keep as-is |
| 5-7 | Enhance: add TL;DR, FAQ, diagrams |
| 1-4 | Full rewrite with Pagezilla pipeline |
| 0 | Kill: redirect to closest relevant page |
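The tier boundaries map directly to a dispatch function:

```python
def action_for_score(score: int) -> str:
    """Map a 10-point rubric score to its maintenance action."""
    if score >= 8:
        return "keep"
    if score >= 5:
        return "enhance"
    if score >= 1:
        return "rewrite"
    return "kill"
```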
Automated Scoring Script
import re

def score_post(html: str) -> dict:
    text = re.sub(r"<[^>]+>", " ", html)
    words = len(text.split())
    scores = {
        "word_count": 1 if words >= 1800 else 0,
        "diagrams": 1 if html.count("<img") >= 2 else 0,
        "code_blocks": 1 if html.count("<pre") >= 2 else 0,
        "tldr": 1 if "tl-dr-block" in html else 0,
        "faq": 1 if "faq-section" in html else 0,
        "citations": 1 if html.count("citation-anchor") >= 3 else 0,
        "banned_words": 1 if not any(w in text.lower() for w in BANNED_WORDS) else 0,
        "cta": 1 if "article-cta-section" in html else 0,
        "table": 1 if "<table" in html else 0,
        "protip": 1 if "pro-tip" in html else 0,
    }
    scores["total"] = sum(scores.values())
    return scores
10. Content Freshness: Quarterly Review Cycle
Quarter Cadence
| Month | Action |
| Month 1 | Pull GSC data, identify declining posts (>20% traffic drop) |
| Month 2 | Update declining posts: refresh code examples, update versions, add new sections |
| Month 3 | Generate new posts for newly discovered semantic silence topics |
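Month 1's declining-post check is a quarter-over-quarter click comparison. A sketch, assuming per-slug click totals pulled from GSC for each period:

```python
def find_declining(current: dict, previous: dict, threshold: float = 0.2) -> list[str]:
    """Flag slugs whose clicks dropped more than `threshold` vs. the prior period."""
    declining = []
    for slug, prev_clicks in previous.items():
        if prev_clicks == 0:
            continue  # no baseline to compare against
        drop = (prev_clicks - current.get(slug, 0)) / prev_clicks
        if drop > threshold:
            declining.append(slug)
    return declining
```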
Freshness Signals to Update
Automated Staleness Detection
from datetime import datetime, timedelta

def find_stale_posts(posts: list[dict], months: int = 6) -> list[dict]:
    cutoff = datetime.now() - timedelta(days=months * 30)
    return [
        p for p in posts
        if datetime.fromisoformat(p["publishedAt"]) < cutoff
        and p.get("updatedAt") is None
    ]
Update vs Rewrite Decision
| Signal | Action |
| Traffic stable, content outdated | Update: refresh examples and versions |
| Traffic declining, content thin | Rewrite: full pipeline with new research |
| Traffic gone, topic irrelevant | Kill: redirect to closest relevant page |
| Traffic growing, content strong | Leave it alone |
Pipeline Summary
GSC Data Pull
|
v
Semantic Silence Discovery (LLM analysis)
|
v
Content Calendar (prioritized CSV)
|
v
Article Generation (LLM + structured output)
|
v
Pydantic Validation (12 validators)
|
v
D2 Diagram Rendering (Kroki API)
|
v
Banner Image Generation (Google API)
|
v
Lead Magnet Selection + Injection
|
v
Human Review (to_review/ folder)
|
v
Publish to CMS (draft -> publish)
|
v
Post-Publish (IndexNow, GSC, social)
|
v
Quarterly Audit + Freshness Cycle