Blog Pipeline

Production-tested content generation and quality system from Pagezilla. Covers research, generation, validation, diagrams, publishing, and ongoing maintenance for a programmatic SEO blog.


1. Content Calendar: Topic Research via GSC Semantic Silence

Striking Distance Keywords

Pull keywords from GSC where you rank positions 4-20 with high impressions. These are the easiest wins: your pages are already visible in search results but are not yet earning clicks.

class PagezillaResearchEngine:
    async def ingest_gsc_data(self) -> list[dict]:
        from src.tools.gsc_connector import GSCConnector
        gsc = GSCConnector()
        # Positions 4-20 with high impressions = "striking distance"
        self.gsc_data = gsc.get_striking_distance(days=90)
        return self.gsc_data

Semantic Silence Discovery

"Semantic silence" is the gap between what users search for and what your blog covers. Use an LLM to compare GSC demand against your existing content inventory:

async def discover_semantic_silence(self) -> list[dict]:
    """Compare GSC demand vs. existing content to find uncovered topics."""
    existing = await self.ingest_existing_posts()
    gsc_keywords = await self.ingest_gsc_data()

    prompt = f"""
    Here are our existing blog posts: {json.dumps(existing)}
    Here are high-impression keywords we rank for but have no dedicated content:
    {json.dumps(gsc_keywords)}

    Identify 10 topics where we have search demand but no matching article.
    For each: title, primary keyword, intent, rationale.
    """
    opportunities = await llm.reason(prompt)
    return opportunities

Anti-Cannibalization

Before creating new content, check if an existing post already targets the same keyword. Cannibalization splits ranking signals and hurts both pages.

def is_cannibalized(new_keyword: str, existing_posts: list[dict]) -> bool:
    """Check if any existing post already targets this keyword."""
    for post in existing_posts:
        if new_keyword.lower() in post.get("title", "").lower():
            return True
        if new_keyword.lower() in post.get("description", "").lower():
            return True
    return False

Content Calendar Format

CSV with tiered priorities. Tier 1 = highest impact, publish first.

tier,priority,pillar,title,slug,primary_keyword,status,scheduled_date,posted_url,notes
1,1,AI Agents,Mastering LangGraph,mastering-langgraph,langgraph stateful workflows,published,,https://...,
1,2,RAG,Pinecone Tuning Guide,pinecone-tuning,pinecone performance,published,,https://...,
2,1,MLOps,Kubernetes for LLMs,k8s-llm-deploy,kubernetes llm deployment,to_do,,,

The AW calendar covers 75 topics across 5 pillars (AI Agents, RAG & Knowledge, MLOps, Multi-Agent, Real-Time Data), tiered 1-5.
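To make the calendar machine-readable, a small loader can pick the next topics to write. This is a sketch, not the pipeline's actual code: `next_topics` is a hypothetical helper that assumes the CSV header and the `to_do` status value shown above.

```python
import csv
import io

def next_topics(calendar_csv: str, limit: int = 5) -> list[dict]:
    """Return the highest-priority unwritten topics from the calendar CSV.

    Sorts by (tier, priority) ascending, so tier 1 / priority 1 comes first.
    """
    rows = csv.DictReader(io.StringIO(calendar_csv))
    todo = [r for r in rows if r["status"] == "to_do"]
    todo.sort(key=lambda r: (int(r["tier"]), int(r["priority"])))
    return todo[:limit]
```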


2. Article Generation: LLM Prompt System

Model Routing

Different models for different tasks. Use the best tool for each job:

Task | Model | Reason
Topic research / reasoning | Gemini 2.5 Pro | Deep analytical thinking
Article writing | Claude Sonnet 4 | Best prose quality, follows structured output
Validation / fast checks | Gemini Flash | Fast, cheap, good for classification
Image generation | Gemini 2.0 Flash Image | Best API-accessible image generation
D2 syntax fixing | Gemini Flash | Fast turnaround on simple code fixes

All models accessed through a single OpenRouter key (except Google image API).
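The routing table can live in code as a plain dict. The OpenRouter model slugs below are illustrative assumptions, not confirmed identifiers; substitute whatever your account actually exposes.

```python
# Model slugs are assumptions for illustration -- check your OpenRouter catalog.
MODEL_ROUTES = {
    "research": "google/gemini-2.5-pro",
    "writing": "anthropic/claude-sonnet-4",
    "validation": "google/gemini-2.5-flash",
    "image": "google/gemini-2.0-flash-image",
    "d2_fix": "google/gemini-2.5-flash",
}

def route_model(task: str) -> str:
    """Resolve a pipeline task to a model, defaulting to the cheap validator."""
    return MODEL_ROUTES.get(task, MODEL_ROUTES["validation"])
```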

System Prompt

The system prompt defines voice, rules, and structure. Key elements:

WRITER_SYSTEM_PROMPT = """
You are a Principal AI/Data Engineer writing technical articles for ActiveWizards.

WRITING RULES (non-negotiable):
1. Tone: Expert, specific, opinionated. Written by a Senior Staff Engineer.
2. BANNED words (instant rejection): delve, robust, seamlessly, demystify,
   landscape, unleash, transformative, revolutionize, game-changer.
3. Depth: Assume the reader is a Senior Engineer or technical CTO.
4. Code: All code snippets must be complete, correct, and well-commented.
5. Diagrams: D2 code provided separately -- embed as <img> tags.
6. Tables: EVERY <table> MUST be wrapped in <div class="table-responsive-wrapper">.
7. CTA: Every article ends with <div class="article-cta-section">.

GEO RULES (Generative Engine Optimization):
8. TL;DR BLOCK: 5-8 specific, numeric takeaways for AI citation.
9. FAQ SECTION: 3-5 Q&A pairs matching real search queries.
10. CITATION ANCHORS: At least 3 bold, opinionated one-liners.
11. SPECIFICITY MANDATE: Every claim must be grounded with numbers.
"""

Article Prompt Structure

• Problem-first intro (2-3 paragraphs, no "In this article we will...")
• TL;DR block (immediately after the intro)
• 4-6 H2 sections: concepts, architecture, implementation, trade-offs
• Minimum 2 code snippets
• At least 1 comparison table
• At least 1 pro-tip callout
• FAQ section
• Further reading section
• CTA section (topic-specific, not generic)


3. Quality Benchmark: 10-Point Rubric

    Score every article against this rubric before publishing. Minimum passing score: 8/10.

    # | Criterion | Pass Condition
    1 | Word count | >= 1,800 words of substantive content
    2 | Diagrams | >= 2 D2 diagrams rendered as SVG
    3 | Code snippets | >= 2 complete, runnable code blocks
    4 | TL;DR block | 5-8 specific, numeric bullet points
    5 | FAQ section | 3-5 Q&A pairs matching real search queries
    6 | Citation anchors | >= 3 bold, opinionated claims
    7 | Banned words | Zero instances from the banned words list
    8 | CTA section | Present, topic-specific (not generic)
    9 | Table | At least 1 comparison table, wrapped in responsive div
    10 | Pro-tip callout | At least 1 callout with practical advice


    4. Pydantic Validation

    Every generated article is validated through a Pydantic schema before it can proceed to review. This catches LLM output quality issues at generation time, not after human review.

    Article Schema (Key Validators)

    import re
    from pydantic import BaseModel, field_validator, model_validator
    # DiagramSpec is defined alongside the D2 pipeline (section 5)

    class PagezillaArticle(BaseModel):
        pagetitle: str          # Max 60 chars
        description: str        # Max 155 chars
        summary: str            # Max 120 chars
        alias: str              # URL slug
        html_body: str          # Complete article HTML
        tags: list[str]         # 5-10 relevant tags
        diagrams: list[DiagramSpec]  # Min 2 diagrams
        tl_dr_bullets: list[str]     # 5-8 specific takeaways
        faq_items: list[dict]        # Min 3 Q&A pairs
        schema_json_ld: str          # TechArticle + FAQPage JSON-LD
    
        @field_validator("pagetitle")
        @classmethod
        def check_title_length(cls, v: str) -> str:
            if len(v) > 60:
                return v[:60].rsplit(" ", 1)[0].rstrip(".,: ")
            return v
    
        @field_validator("html_body")
        @classmethod
        def check_no_llm_speak(cls, v: str) -> str:
            found = [w for w in BANNED_WORDS if w.lower() in v.lower()]
            if found:
                raise ValueError(f"LLM-speak detected: {found}")
            return v
    
        @field_validator("diagrams")
        @classmethod
        def check_diagrams(cls, v: list[DiagramSpec]) -> list[DiagramSpec]:
            if len(v) < 2:
                raise ValueError(f"Need at least 2 diagrams, got {len(v)}")
            return v
    
        @model_validator(mode="after")
        def check_word_count(self) -> "PagezillaArticle":
            text = re.sub(r"<[^>]+>", " ", self.html_body)
            word_count = len(text.split())
            if word_count < 1200:
                raise ValueError(f"Word count ({word_count}) too low. Target 1,800+.")
            return self

    Banned Words List

    BANNED_WORDS = [
        "delve", "robust", "seamlessly", "demystify", "landscape",
        "unleash", "transformative", "revolutionize", "game-changer",
        "it's worth noting", "it is worth noting", "having said that",
        "in conclusion", "in summary",
    ]

    Validation Chain

    The schema enforces 13 validators in sequence:

  • Title length (<= 60 chars, auto-truncate)
  • Description length (<= 155 chars, auto-truncate)
  • Summary length (<= 120 chars, auto-truncate)
  • Tag count (5-10 tags)
  • Diagram count (>= 2)
  • Banned words check (reject on detection)
  • CTA section presence
  • Table wrapper check (all tables wrapped)
  • TL;DR bullet quality (>= 5, each >= 20 chars)
  • FAQ quality (>= 3 Q&A, answers >= 40 chars)
  • TL;DR block in HTML
  • FAQ section in HTML
  • Word count (>= 1,200 after HTML stripping)
    If any validator fails, the article is rejected and the LLM is re-prompted with the specific error message.
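The reject-and-re-prompt loop can be sketched generically. Here `generate` and `validate` are stand-ins for the LLM call and `PagezillaArticle.model_validate`; the retry budget of 3 is an assumption, not a documented setting.

```python
import asyncio

async def generate_validated(prompt: str, generate, validate, max_retries: int = 3):
    """Run generate -> validate, feeding validator errors back into the prompt."""
    last_error = ""
    for _ in range(max_retries):
        hint = f"\n\nYour previous draft failed validation: {last_error}" if last_error else ""
        raw = await generate(prompt + hint)
        try:
            return validate(raw)
        except ValueError as exc:  # pydantic's ValidationError subclasses ValueError
            last_error = str(exc)
    raise RuntimeError(f"Failed validation after {max_retries} attempts: {last_error}")
```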


    5. D2 Diagram Pipeline

    Source to SVG

    D2 code in DiagramSpec
        |
        v
    Kroki API (POST https://kroki.io/d2/svg)
        |
        v
    SVG file saved to artifact directory
        |
        v
    Referenced in HTML as <img src="assets/blog_visuals/slug_diagram_N.svg">
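A minimal Kroki client, assuming the public kroki.io endpoint: POST the raw D2 source for rendering, or build Kroki's deflate+base64 GET-style URL. The helper names are mine, not the pipeline's.

```python
import base64
import zlib

KROKI = "https://kroki.io"

def kroki_get_url(d2_code: str) -> str:
    """Kroki's GET form: deflate-compress the source, then URL-safe base64."""
    payload = base64.urlsafe_b64encode(zlib.compress(d2_code.encode(), 9)).decode()
    return f"{KROKI}/d2/svg/{payload}"

async def render_d2(d2_code: str) -> bytes:
    """POST raw D2 source to Kroki and return the rendered SVG bytes."""
    import httpx  # third-party; already used elsewhere in this pipeline
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post(f"{KROKI}/d2/svg", content=d2_code)
        resp.raise_for_status()  # 400 here usually means a D2 syntax error
        return resp.content
```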

    D2 Rules for Kroki Compatibility

    # Validated at schema level
    @field_validator("d2_code")
    @classmethod
    def no_unsupported_blocks(cls, v: str) -> str:
        for blocked in ("vars:", "d2-config:"):
            if blocked in v:
                raise ValueError(f"D2 contains unsupported block '{blocked}'")
        return v

    D2 best practices for Kroki:

    • Use direction: right or direction: down
    • Quote strings with spaces: "API Gateway"
    • Arrow syntax: A -> B: "label"
    • Containers: container { child1; child2 }
    • NO vars: blocks, NO d2-config: blocks
    • Sanitize $ characters (Kroki treats them as substitution)
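A tiny sanitizer for the `$` rule. Whether `\$` is the right escape depends on your D2 and Kroki versions; treat the replacement below as an assumption to verify, not a documented fix.

```python
def sanitize_d2(code: str) -> str:
    """Escape $ so Kroki does not treat it as a substitution marker.

    The backslash escape is an assumption -- verify against your Kroki version.
    """
    return code.replace("$", r"\$")
```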

    Auto-Fix on Failure

    async def render_with_fallback(self, renderer, diagram, filename, artifact_dir):
        # Attempt 1: render as-is
        path = await renderer.render(diagram.d2_code, filename)
        if path:
            return Path(path)
    
        # Attempt 2: ask LLM to fix syntax
        fixed_code = await llm.fix_d2_syntax(diagram.d2_code, error_hint="Kroki 400")
        diagram.d2_code = fixed_code
        path = await renderer.render(fixed_code, filename)
        if path:
            return Path(path)
    
        # Fallback: log failure, leave placeholder
        _log_diagram_failure(diagram)
        return None

    6. Lead Magnets

    Catalog

    Maintain a JSON catalog of lead magnets, each mapped to content pillars:

    {
      "lead_magnets": [
        {
          "id": "rag-checklist",
          "title": "Production RAG Engineering Checklist",
          "pillars": ["RAG & Knowledge", "MLOps"],
          "position_in_article": "after_section_3",
          "embed_html": "<div class='lead-magnet-cta'>...</div>"
        }
      ]
    }

    Automatic Selection

    The content factory selects the best lead magnet based on article pillar and tags:

    catalog = get_catalog()
    magnet = catalog.select_for_article(
        pillar=topic.get("pillar", ""),
        tags=article.tags,
    )
    if magnet:
        article.lead_magnet_id = magnet.id  # the "id" field from the catalog JSON

    Inline CTA Injection

    Lead magnet CTAs are injected at specific positions in the article HTML:

    Position | Description
    before_cta | Before the article CTA section (bottom)
    after_section_3 | After the 3rd H2 section (mid-article)
    after_intro | After the first paragraph (top)
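Mid-article injection can be done with plain string surgery: find the opening tag of H2 number n+1 and splice the embed just before it. `inject_after_section` is a hypothetical helper, not the factory's real one.

```python
import re

def inject_after_section(html: str, embed_html: str, n: int) -> str:
    """Insert embed_html at the end of the nth H2 section.

    Splices just before H2 number n+1; if the article has fewer than n+1
    H2 headings, appends at the end of the body instead.
    """
    starts = [m.start() for m in re.finditer(r"<h2[\s>]", html)]
    if len(starts) > n:
        pos = starts[n]  # opening tag of H2 number n+1
        return html[:pos] + embed_html + html[pos:]
    return html + embed_html
```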


    7. Publishing Workflow

    Folder-Based Pipeline

    data/content/
    ├── to_review/                 # Generated, awaiting human review
    │   └── post-slug/
    │       ├── article.json       # Complete structured data
    │       ├── article.html       # Raw HTML body
    │       ├── meta.json          # Metadata summary
    │       ├── diagram_1.svg      # Rendered diagrams
    │       ├── diagram_2.svg
    │       └── banner.png         # Generated banner image
    ├── to_publish/                # Approved by reviewer
    │   └── post-slug/
    │       └── ... (same files)
    └── published/                 # Published to CMS
        └── post-slug/
            └── ... (same files + publish receipt)

    HITL Review Process

  1. Generate: python main.py generate --topic "Topic Title"
  2. Review: a human opens data/content/to_review/post-slug/article.html in a browser
  3. Approve: move the folder from to_review/ to to_publish/
  4. Publish: python main.py publish --slug post-slug
  5. Archive: the folder automatically moves to published/ after a successful publish
    Publish Action

    Publishing creates the page in the CMS as a draft, then optionally publishes:

    def to_modx_payload(self) -> dict:
        return {
            "pagetitle": self.pagetitle,
            "description": self.description,
            "content": self.html_body,
            "alias": self.alias,
            "published": 0,  # always draft first -- HITL approves
            "tvs": {
                "summary": self.summary,
                "image": self.image_path,
                "tags": ",".join(self.tags),
                "related_posts": self.related_posts_ids,
                "service_list": self.service_list_ids,
            },
        }

    8. Post-Publish Actions

    IndexNow Ping

    Immediately notify search engines of new content:

    import httpx

    async def ping_indexnow(urls: list[str], key: str):
        payload = {
            "host": "yourdomain.com",
            "key": key,
            "urlList": urls,
        }
        async with httpx.AsyncClient() as client:
            await client.post("https://api.indexnow.org/indexnow", json=payload)

    GSC Inspection

    Request indexing for the new URL via GSC API:

    service.urlInspection().index().inspect(body={
        "inspectionUrl": f"https://yourdomain.com/blog/{slug}",
        "siteUrl": "sc-domain:yourdomain.com",
    }).execute()

    Social Sharing

    For each new post, generate a social media snippet:

    • LinkedIn: 3-5 line hook + link (see Arizen LinkedIn voice guide for format)
    • Twitter/X: Key insight in <280 chars + link
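For the X post, links are wrapped by t.co and count as a fixed 23 characters regardless of their actual length, so the insight budget is 280 minus 24. A sketch (the `x_snippet` function name is mine):

```python
def x_snippet(insight: str, url: str, limit: int = 280) -> str:
    """Compose insight + link for X, trimming the insight to fit the limit.

    Assumes the URL counts as t.co's fixed 23 characters.
    """
    budget = limit - 23 - 1  # 23 for the wrapped link, 1 for the separating space
    if len(insight) > budget:
        insight = insight[: budget - 1].rstrip() + "…"
    return f"{insight} {url}"
```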

    Post-Publish Checklist

    • [ ] Page loads correctly on staging
    • [ ] Meta title and description render in browser tab
    • [ ] OG image appears in social sharing preview
    • [ ] Diagrams render (all SVGs load)
    • [ ] Code blocks have syntax highlighting
    • [ ] IndexNow ping sent
    • [ ] GSC indexing requested
    • [ ] Social post drafted/published


    9. Audit Existing Posts

    Scoring Process

    Score every existing post against the 10-point rubric. Export scores to CSV:

    slug,word_count,diagrams,code_blocks,tldr,faq,citations,banned_words,cta,table,protip,score,action
    old-post-slug,1200,0,1,no,no,0,yes,no,0,no,2,rewrite
    good-post-slug,2400,3,4,yes,yes,5,no,yes,2,yes,10,keep

    Action Tiers

    Score | Action
    8-10 | Keep as-is
    5-7 | Enhance: add TL;DR, FAQ, diagrams
    1-4 | Full rewrite with the Pagezilla pipeline
    0 | Kill: redirect to the closest relevant page
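The score tiers above map directly to a dispatch function. A sketch; the action strings are mine, not the pipeline's:

```python
def action_for_score(score: int) -> str:
    """Map a 10-point rubric score to its maintenance action."""
    if score >= 8:
        return "keep"
    if score >= 5:
        return "enhance"
    if score >= 1:
        return "rewrite"
    return "kill"
```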

    Automated Scoring Script

    import re  # BANNED_WORDS comes from the validation module above

    def score_post(html: str) -> dict:
        text = re.sub(r"<[^>]+>", " ", html)
        words = len(text.split())
        scores = {
            "word_count": 1 if words >= 1800 else 0,
            "diagrams": 1 if html.count("<img") >= 2 else 0,
            "code_blocks": 1 if html.count("<pre") >= 2 else 0,
            "tldr": 1 if "tl-dr-block" in html else 0,
            "faq": 1 if "faq-section" in html else 0,
            "citations": 1 if html.count("citation-anchor") >= 3 else 0,
            "banned_words": 1 if not any(w in text.lower() for w in BANNED_WORDS) else 0,
            "cta": 1 if "article-cta-section" in html else 0,
            "table": 1 if "<table" in html else 0,
            "protip": 1 if "pro-tip" in html else 0,
        }
        scores["total"] = sum(scores.values())
        return scores

    10. Content Freshness: Quarterly Review Cycle

    Quarter Cadence

    Month | Action
    Month 1 | Pull GSC data, identify declining posts (>20% traffic drop)
    Month 2 | Update declining posts: refresh code examples, update versions, add new sections
    Month 3 | Generate new posts for newly discovered semantic silence topics
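Month 1's ">20% drop" check can be computed from two GSC click windows. `find_declining` is a hypothetical helper over page -> clicks dicts, not the pipeline's real code:

```python
def find_declining(
    current: dict[str, int],
    previous: dict[str, int],
    threshold: float = 0.20,
) -> list[tuple[str, float]]:
    """Pages whose clicks dropped more than threshold quarter-over-quarter."""
    declining = []
    for url, prev_clicks in previous.items():
        if prev_clicks == 0:
            continue  # avoid dividing by zero for pages with no prior traffic
        drop = (prev_clicks - current.get(url, 0)) / prev_clicks
        if drop > threshold:
            declining.append((url, round(drop, 2)))
    return sorted(declining, key=lambda pair: -pair[1])
```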

    Freshness Signals to Update

  • Version numbers: Update framework/library versions in code examples
  • Benchmarks: Refresh performance numbers if newer data available
  • Links: Fix broken external links, add new relevant references
  • TL;DR: Update bullet points with current metrics
  • FAQ: Add new questions based on GSC query data
    Automated Staleness Detection

    from datetime import datetime, timedelta
    
    def find_stale_posts(posts: list[dict], months: int = 6) -> list[dict]:
        cutoff = datetime.now() - timedelta(days=months * 30)
        return [
            p for p in posts
            if datetime.fromisoformat(p["publishedAt"]) < cutoff
            and p.get("updatedAt") is None
        ]

    Update vs Rewrite Decision

    Signal | Action
    Traffic stable, content outdated | Update: refresh examples and versions
    Traffic declining, content thin | Rewrite: full pipeline with new research
    Traffic gone, topic irrelevant | Kill: redirect to the closest relevant page
    Traffic growing, content strong | Leave it alone


    Pipeline Summary

    GSC Data Pull
        |
        v
    Semantic Silence Discovery (LLM analysis)
        |
        v
    Content Calendar (prioritized CSV)
        |
        v
    Article Generation (LLM + structured output)
        |
        v
    Pydantic Validation (13 validators)
        |
        v
    D2 Diagram Rendering (Kroki API)
        |
        v
    Banner Image Generation (Google API)
        |
        v
    Lead Magnet Selection + Injection
        |
        v
    Human Review (to_review/ folder)
        |
        v
    Publish to CMS (draft -> publish)
        |
        v
    Post-Publish (IndexNow, GSC, social)
        |
        v
    Quarterly Audit + Freshness Cycle