Blog Pipeline

Production-tested content generation and quality system from Pagezilla. Covers research, generation, validation, diagrams, publishing, and ongoing maintenance for a programmatic SEO blog.


1. Content Calendar: Topic Research via GSC Semantic Silence

Striking Distance Keywords

Pull keywords from GSC where you rank positions 4-20 with high impressions. These are the easiest wins: your pages are already visible in search results but are not yet earning clicks.

class PagezillaResearchEngine:
    async def ingest_gsc_data(self) -> list[dict]:
        from src.tools.gsc_connector import GSCConnector
        gsc = GSCConnector()
        # Positions 4-20 with high impressions = "striking distance"
        self.gsc_data = gsc.get_striking_distance(days=90)
        return self.gsc_data

Semantic Silence Discovery

"Semantic silence" is the gap between what users search for and what your blog covers. Use an LLM to compare GSC demand against your existing content inventory:

async def discover_semantic_silence(self) -> list[dict]:
    """Compare GSC demand vs. existing content to find uncovered topics."""
    existing = await self.ingest_existing_posts()
    gsc_keywords = await self.ingest_gsc_data()

    prompt = f"""
    Here are our existing blog posts: {json.dumps(existing)}
    Here are high-impression keywords we rank for but have no dedicated content:
    {json.dumps(gsc_keywords)}

    Identify 10 topics where we have search demand but no matching article.
    For each: title, primary keyword, intent, rationale.
    """
    opportunities = await llm.reason(prompt)
    return opportunities

Anti-Cannibalization

Before creating new content, check if an existing post already targets the same keyword. Cannibalization splits ranking signals and hurts both pages.

def is_cannibalized(new_keyword: str, existing_posts: list[dict]) -> bool:
    """Check if any existing post already targets this keyword."""
    for post in existing_posts:
        if new_keyword.lower() in post.get("title", "").lower():
            return True
        if new_keyword.lower() in post.get("description", "").lower():
            return True
    return False

Content Calendar Format

CSV with tiered priorities. Tier 1 = highest impact, publish first.

tier,priority,pillar,title,slug,primary_keyword,status,scheduled_date,posted_url,notes
1,1,AI Agents,Mastering LangGraph,mastering-langgraph,langgraph stateful workflows,published,,https://...,
1,2,RAG,Pinecone Tuning Guide,pinecone-tuning,pinecone performance,published,,https://...,
2,1,MLOps,Kubernetes for LLMs,k8s-llm-deploy,kubernetes llm deployment,to_do,,,

The AW calendar covers 75 topics across 5 pillars (AI Agents, RAG & Knowledge, MLOps, Multi-Agent, Real-Time Data), tiered 1-5.
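To make the calendar machine-readable, a small loader can pick the next topics to write. This is a sketch, not the pipeline's actual code: `next_topics` is a hypothetical helper that assumes the CSV header and the `to_do` status value shown above.

```python
import csv
import io

def next_topics(calendar_csv: str, limit: int = 5) -> list[dict]:
    """Return the highest-priority unwritten topics from the calendar CSV.

    Sorts by (tier, priority) ascending, so tier 1 / priority 1 comes first.
    """
    rows = csv.DictReader(io.StringIO(calendar_csv))
    todo = [r for r in rows if r["status"] == "to_do"]
    todo.sort(key=lambda r: (int(r["tier"]), int(r["priority"])))
    return todo[:limit]
```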


2. Article Generation: LLM Prompt System

Model Routing

Different models for different tasks. Use the best tool for each job:

Task | Model | Reason
Topic research / reasoning | Gemini 2.5 Pro | Deep analytical thinking
Article writing | Claude Sonnet 4 | Best prose quality, follows structured output
Validation / fast checks | Gemini Flash | Fast, cheap, good for classification
Image generation | Gemini 2.0 Flash Image | Best API-accessible image generation
D2 syntax fixing | Gemini Flash | Fast turnaround on simple code fixes

All models accessed through a single OpenRouter key (except Google image API).
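The routing table can live in code as a plain dict. The OpenRouter model slugs below are illustrative assumptions, not confirmed identifiers; substitute whatever your account actually exposes.

```python
# Model slugs are assumptions for illustration -- check your OpenRouter catalog.
MODEL_ROUTES = {
    "research": "google/gemini-2.5-pro",
    "writing": "anthropic/claude-sonnet-4",
    "validation": "google/gemini-2.5-flash",
    "image": "google/gemini-2.0-flash-image",
    "d2_fix": "google/gemini-2.5-flash",
}

def route_model(task: str) -> str:
    """Resolve a pipeline task to a model, defaulting to the cheap validator."""
    return MODEL_ROUTES.get(task, MODEL_ROUTES["validation"])
```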

System Prompt

The system prompt defines voice, rules, and structure. Key elements:

WRITER_SYSTEM_PROMPT = """
You are a Principal AI/Data Engineer writing technical articles for ActiveWizards.

WRITING RULES (non-negotiable):
1. Tone: Expert, specific, opinionated. Written by a Senior Staff Engineer.
2. BANNED words (instant rejection): delve, robust, seamlessly, demystify,
   landscape, unleash, transformative, revolutionize, game-changer.
3. Depth: Assume the reader is a Senior Engineer or technical CTO.
4. Code: All code snippets must be complete, correct, and well-commented.
5. Diagrams: D2 code provided separately -- embed as <img> tags.
6. Tables: EVERY <table> MUST be wrapped in <div class="table-responsive-wrapper">.
7. CTA: Every article ends with <div class="article-cta-section">.

GEO RULES (Generative Engine Optimization):
8. TL;DR BLOCK: 5-8 specific, numeric takeaways for AI citation.
9. FAQ SECTION: 3-5 Q&A pairs matching real search queries.
10. CITATION ANCHORS: At least 3 bold, opinionated one-liners.
11. SPECIFICITY MANDATE: Every claim must be grounded with numbers.
"""

Article Prompt Structure

• Problem-first intro (2-3 paragraphs, no "In this article we will...")
• TL;DR block (immediately after the intro)
• 4-6 H2 sections: concepts, architecture, implementation, trade-offs
• Minimum 2 code snippets
• At least 1 comparison table
• At least 1 pro-tip callout
• FAQ section
• Further reading section
• CTA section (topic-specific, not generic)


3. Quality Benchmark: 10-Point Rubric

    Score every article against this rubric before publishing. Minimum passing score: 8/10.

    # | Criterion | Pass Condition
    1 | Word count | >= 1,800 words of substantive content
    2 | Diagrams | >= 2 D2 diagrams rendered as SVG
    3 | Code snippets | >= 2 complete, runnable code blocks
    4 | TL;DR block | 5-8 specific, numeric bullet points
    5 | FAQ section | 3-5 Q&A pairs matching real search queries
    6 | Citation anchors | >= 3 bold, opinionated claims
    7 | Banned words | Zero instances from the banned words list
    8 | CTA section | Present, topic-specific (not generic)
    9 | Table | At least 1 comparison table, wrapped in responsive div
    10 | Pro-tip callout | At least 1 callout with practical advice


    4. Pydantic Validation

    Every generated article is validated through a Pydantic schema before it can proceed to review. This catches LLM output quality issues at generation time, not after human review.

    Article Schema (Key Validators)

    import re
    from pydantic import BaseModel, field_validator, model_validator
    # DiagramSpec is defined alongside the D2 pipeline (section 5)

    class PagezillaArticle(BaseModel):
        pagetitle: str          # Max 60 chars
        description: str        # Max 155 chars
        summary: str            # Max 120 chars
        alias: str              # URL slug
        html_body: str          # Complete article HTML
        tags: list[str]         # 5-10 relevant tags
        diagrams: list[DiagramSpec]  # Min 2 diagrams
        tl_dr_bullets: list[str]     # 5-8 specific takeaways
        faq_items: list[dict]        # Min 3 Q&A pairs
        schema_json_ld: str          # TechArticle + FAQPage JSON-LD
    
        @field_validator("pagetitle")
        @classmethod
        def check_title_length(cls, v: str) -> str:
            if len(v) > 60:
                return v[:60].rsplit(" ", 1)[0].rstrip(".,: ")
            return v
    
        @field_validator("html_body")
        @classmethod
        def check_no_llm_speak(cls, v: str) -> str:
            found = [w for w in BANNED_WORDS if w.lower() in v.lower()]
            if found:
                raise ValueError(f"LLM-speak detected: {found}")
            return v
    
        @field_validator("diagrams")
        @classmethod
        def check_diagrams(cls, v: list[DiagramSpec]) -> list[DiagramSpec]:
            if len(v) < 2:
                raise ValueError(f"Need at least 2 diagrams, got {len(v)}")
            return v
    
        @model_validator(mode="after")
        def check_word_count(self) -> "PagezillaArticle":
            text = re.sub(r"<[^>]+>", " ", self.html_body)
            word_count = len(text.split())
            if word_count < 1200:
                raise ValueError(f"Word count ({word_count}) too low. Target 1,800+.")
            return self

    Banned Words List

    BANNED_WORDS = [
        "delve", "robust", "seamlessly", "demystify", "landscape",
        "unleash", "transformative", "revolutionize", "game-changer",
        "it's worth noting", "it is worth noting", "having said that",
        "in conclusion", "in summary",
    ]

    Validation Chain

    The schema enforces 13 validators in sequence:

  • Title length (<= 60 chars, auto-truncate)
  • Description length (<= 155 chars, auto-truncate)
  • Summary length (<= 120 chars, auto-truncate)
  • Tag count (5-10 tags)
  • Diagram count (>= 2)
  • Banned words check (reject on detection)
  • CTA section presence
  • Table wrapper check (all tables wrapped)
  • TL;DR bullet quality (>= 5, each >= 20 chars)
  • FAQ quality (>= 3 Q&A, answers >= 40 chars)
  • TL;DR block in HTML
  • FAQ section in HTML
  • Word count (>= 1,200 after HTML stripping)
    If any validator fails, the article is rejected and the LLM is re-prompted with the specific error message.
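The reject-and-re-prompt loop can be sketched generically. Here `generate` and `validate` are stand-ins for the LLM call and `PagezillaArticle.model_validate`; the retry budget of 3 is an assumption, not a documented setting.

```python
import asyncio

async def generate_validated(prompt: str, generate, validate, max_retries: int = 3):
    """Run generate -> validate, feeding validator errors back into the prompt."""
    last_error = ""
    for _ in range(max_retries):
        hint = f"\n\nYour previous draft failed validation: {last_error}" if last_error else ""
        raw = await generate(prompt + hint)
        try:
            return validate(raw)
        except ValueError as exc:  # pydantic's ValidationError subclasses ValueError
            last_error = str(exc)
    raise RuntimeError(f"Failed validation after {max_retries} attempts: {last_error}")
```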


    5. D2 Diagram Pipeline

    Source to SVG

    D2 code in DiagramSpec
        |
        v
    Kroki API (POST https://kroki.io/d2/svg)
        |
        v
    SVG file saved to artifact directory
        |
        v
    Referenced in HTML as <img src="assets/blog_visuals/slug_diagram_N.svg">
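A minimal Kroki client, assuming the public kroki.io endpoint: POST the raw D2 source for rendering, or build Kroki's deflate+base64 GET-style URL. The helper names are mine, not the pipeline's.

```python
import base64
import zlib

KROKI = "https://kroki.io"

def kroki_get_url(d2_code: str) -> str:
    """Kroki's GET form: deflate-compress the source, then URL-safe base64."""
    payload = base64.urlsafe_b64encode(zlib.compress(d2_code.encode(), 9)).decode()
    return f"{KROKI}/d2/svg/{payload}"

async def render_d2(d2_code: str) -> bytes:
    """POST raw D2 source to Kroki and return the rendered SVG bytes."""
    import httpx  # third-party; already used elsewhere in this pipeline
    async with httpx.AsyncClient(timeout=30) as client:
        resp = await client.post(f"{KROKI}/d2/svg", content=d2_code)
        resp.raise_for_status()  # 400 here usually means a D2 syntax error
        return resp.content
```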

    D2 Rules for Kroki Compatibility

    # Validated at schema level
    @field_validator("d2_code")
    @classmethod
    def no_unsupported_blocks(cls, v: str) -> str:
        for blocked in ("vars:", "d2-config:"):
            if blocked in v:
                raise ValueError(f"D2 contains unsupported block '{blocked}'")
        return v

    D2 best practices for Kroki:

    • Use direction: right or direction: down
    • Quote strings with spaces: "API Gateway"
    • Arrow syntax: A -> B: "label"
    • Containers: container { child1; child2 }
    • NO vars: blocks, NO d2-config: blocks
    • Sanitize $ characters (Kroki treats them as substitution)
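A tiny sanitizer for the `$` rule. Whether `\$` is the right escape depends on your D2 and Kroki versions; treat the replacement below as an assumption to verify, not a documented fix.

```python
def sanitize_d2(code: str) -> str:
    """Escape $ so Kroki does not treat it as a substitution marker.

    The backslash escape is an assumption -- verify against your Kroki version.
    """
    return code.replace("$", r"\$")
```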

    Auto-Fix on Failure

    async def render_with_fallback(self, renderer, diagram, filename, artifact_dir):
        # Attempt 1: render as-is
        path = await renderer.render(diagram.d2_code, filename)
        if path:
            return Path(path)
    
        # Attempt 2: ask LLM to fix syntax
        fixed_code = await llm.fix_d2_syntax(diagram.d2_code, error_hint="Kroki 400")
        diagram.d2_code = fixed_code
        path = await renderer.render(fixed_code, filename)
        if path:
            return Path(path)
    
        # Fallback: log failure, leave placeholder
        _log_diagram_failure(diagram)
        return None

    6. Lead Magnets

    Catalog

    Maintain a JSON catalog of lead magnets, each mapped to content pillars:

    {
      "lead_magnets": [
        {
          "id": "rag-checklist",
          "title": "Production RAG Engineering Checklist",
          "pillars": ["RAG & Knowledge", "MLOps"],
          "position_in_article": "after_section_3",
          "embed_html": "<div class='lead-magnet-cta'>...</div>"
        }
      ]
    }

    Automatic Selection

    The content factory selects the best lead magnet based on article pillar and tags:

    catalog = get_catalog()
    magnet = catalog.select_for_article(
        pillar=topic.get("pillar", ""),
        tags=article.tags,
    )
    if magnet:
        article.lead_magnet_id = magnet.id  # the "id" field from the catalog JSON

    Inline CTA Injection

    Lead magnet CTAs are injected at specific positions in the article HTML:

    Position | Description
    before_cta | Before the article CTA section (bottom)
    after_section_3 | After the 3rd H2 section (mid-article)
    after_intro | After the first paragraph (top)
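Mid-article injection can be done with plain string surgery: find the opening tag of H2 number n+1 and splice the embed just before it. `inject_after_section` is a hypothetical helper, not the factory's real one.

```python
import re

def inject_after_section(html: str, embed_html: str, n: int) -> str:
    """Insert embed_html at the end of the nth H2 section.

    Splices just before H2 number n+1; if the article has fewer than n+1
    H2 headings, appends at the end of the body instead.
    """
    starts = [m.start() for m in re.finditer(r"<h2[\s>]", html)]
    if len(starts) > n:
        pos = starts[n]  # opening tag of H2 number n+1
        return html[:pos] + embed_html + html[pos:]
    return html + embed_html
```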


    7. Publishing Workflow

    Folder-Based Pipeline

    data/content/
    ├── to_review/                 # Generated, awaiting human review
    │   └── post-slug/
    │       ├── article.json       # Complete structured data
    │       ├── article.html       # Raw HTML body
    │       ├── meta.json          # Metadata summary
    │       ├── diagram_1.svg      # Rendered diagrams
    │       ├── diagram_2.svg
    │       └── banner.png         # Generated banner image
    ├── to_publish/                # Approved by reviewer
    │   └── post-slug/
    │       └── ... (same files)
    └── published/                 # Published to CMS
        └── post-slug/
            └── ... (same files + publish receipt)

    HITL Review Process

  1. Generate: python main.py generate --topic "Topic Title"
  2. Review: a human opens data/content/to_review/post-slug/article.html in a browser
  3. Approve: move the folder from to_review/ to to_publish/
  4. Publish: python main.py publish --slug post-slug
  5. Archive: the folder automatically moves to published/ after a successful publish
    Publish Action

    Publishing creates the page in the CMS as a draft, then optionally publishes:

    def to_modx_payload(self) -> dict:
        return {
            "pagetitle": self.pagetitle,
            "description": self.description,
            "content": self.html_body,
            "alias": self.alias,
            "published": 0,  # always draft first -- HITL approves
            "tvs": {
                "summary": self.summary,
                "image": self.image_path,
                "tags": ",".join(self.tags),
                "related_posts": self.related_posts_ids,
                "service_list": self.service_list_ids,
            },
        }

    8. Post-Publish Actions

    IndexNow Ping

    Immediately notify search engines of new content:

    import httpx

    async def ping_indexnow(urls: list[str], key: str):
        payload = {
            "host": "yourdomain.com",
            "key": key,
            "urlList": urls,
        }
        async with httpx.AsyncClient() as client:
            await client.post("https://api.indexnow.org/indexnow", json=payload)

    GSC Inspection

    Request indexing for the new URL via GSC API:

    service.urlInspection().index().inspect(body={
        "inspectionUrl": f"https://yourdomain.com/blog/{slug}",
        "siteUrl": "sc-domain:yourdomain.com",
    }).execute()

    Social Sharing

    For each new post, generate a social media snippet:

    • LinkedIn: 3-5 line hook + link (see Arizen LinkedIn voice guide for format)
    • Twitter/X: Key insight in <280 chars + link
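For the X post, links are wrapped by t.co and count as a fixed 23 characters regardless of their actual length, so the insight budget is 280 minus 24. A sketch (the `x_snippet` function name is mine):

```python
def x_snippet(insight: str, url: str, limit: int = 280) -> str:
    """Compose insight + link for X, trimming the insight to fit the limit.

    Assumes the URL counts as t.co's fixed 23 characters.
    """
    budget = limit - 23 - 1  # 23 for the wrapped link, 1 for the separating space
    if len(insight) > budget:
        insight = insight[: budget - 1].rstrip() + "…"
    return f"{insight} {url}"
```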

    Post-Publish Checklist

    • [ ] Page loads correctly on staging
    • [ ] Meta title and description render in browser tab
    • [ ] OG image appears in social sharing preview
    • [ ] Diagrams render (all SVGs load)
    • [ ] Code blocks have syntax highlighting
    • [ ] IndexNow ping sent
    • [ ] GSC indexing requested
    • [ ] Social post drafted/published


    9. Audit Existing Posts

    Scoring Process

    Score every existing post against the 10-point rubric. Export scores to CSV:

    slug,word_count,diagrams,code_blocks,tldr,faq,citations,banned_words,cta,table,protip,score,action
    old-post-slug,1200,0,1,no,no,0,yes,no,0,no,2,rewrite
    good-post-slug,2400,3,4,yes,yes,5,no,yes,2,yes,10,keep

    Action Tiers

    Score | Action
    8-10 | Keep as-is
    5-7 | Enhance: add TL;DR, FAQ, diagrams
    1-4 | Full rewrite with the Pagezilla pipeline
    0 | Kill: redirect to the closest relevant page
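The score tiers above map directly to a dispatch function. A sketch; the action strings are mine, not the pipeline's:

```python
def action_for_score(score: int) -> str:
    """Map a 10-point rubric score to its maintenance action."""
    if score >= 8:
        return "keep"
    if score >= 5:
        return "enhance"
    if score >= 1:
        return "rewrite"
    return "kill"
```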

    Automated Scoring Script

    import re  # BANNED_WORDS comes from the validation module above

    def score_post(html: str) -> dict:
        text = re.sub(r"<[^>]+>", " ", html)
        words = len(text.split())
        scores = {
            "word_count": 1 if words >= 1800 else 0,
            "diagrams": 1 if html.count("<img") >= 2 else 0,
            "code_blocks": 1 if html.count("<pre") >= 2 else 0,
            "tldr": 1 if "tl-dr-block" in html else 0,
            "faq": 1 if "faq-section" in html else 0,
            "citations": 1 if html.count("citation-anchor") >= 3 else 0,
            "banned_words": 1 if not any(w in text.lower() for w in BANNED_WORDS) else 0,
            "cta": 1 if "article-cta-section" in html else 0,
            "table": 1 if "<table" in html else 0,
            "protip": 1 if "pro-tip" in html else 0,
        }
        scores["total"] = sum(scores.values())
        return scores

    10. Content Freshness: Quarterly Review Cycle

    Quarter Cadence

    Month | Action
    Month 1 | Pull GSC data, identify declining posts (>20% traffic drop)
    Month 2 | Update declining posts: refresh code examples, update versions, add new sections
    Month 3 | Generate new posts for newly discovered semantic silence topics
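Month 1's ">20% drop" check can be computed from two GSC click windows. `find_declining` is a hypothetical helper over page -> clicks dicts, not the pipeline's real code:

```python
def find_declining(
    current: dict[str, int],
    previous: dict[str, int],
    threshold: float = 0.20,
) -> list[tuple[str, float]]:
    """Pages whose clicks dropped more than threshold quarter-over-quarter."""
    declining = []
    for url, prev_clicks in previous.items():
        if prev_clicks == 0:
            continue  # avoid dividing by zero for pages with no prior traffic
        drop = (prev_clicks - current.get(url, 0)) / prev_clicks
        if drop > threshold:
            declining.append((url, round(drop, 2)))
    return sorted(declining, key=lambda pair: -pair[1])
```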

    Freshness Signals to Update

  • Version numbers: Update framework/library versions in code examples
  • Benchmarks: Refresh performance numbers if newer data available
  • Links: Fix broken external links, add new relevant references
  • TL;DR: Update bullet points with current metrics
  • FAQ: Add new questions based on GSC query data
    Automated Staleness Detection

    from datetime import datetime, timedelta
    
    def find_stale_posts(posts: list[dict], months: int = 6) -> list[dict]:
        cutoff = datetime.now() - timedelta(days=months * 30)
        return [
            p for p in posts
            if datetime.fromisoformat(p["publishedAt"]) < cutoff
            and p.get("updatedAt") is None
        ]

    Update vs Rewrite Decision

    Signal | Action
    Traffic stable, content outdated | Update: refresh examples and versions
    Traffic declining, content thin | Rewrite: full pipeline with new research
    Traffic gone, topic irrelevant | Kill: redirect to the closest relevant page
    Traffic growing, content strong | Leave it alone


    Pipeline Summary

    GSC Data Pull
        |
        v
    Semantic Silence Discovery (LLM analysis)
        |
        v
    Content Calendar (prioritized CSV)
        |
        v
    Article Generation (LLM + structured output)
        |
        v
    Pydantic Validation (13 validators)
        |
        v
    D2 Diagram Rendering (Kroki API)
        |
        v
    Banner Image Generation (Google API)
        |
        v
    Lead Magnet Selection + Injection
        |
        v
    Human Review (to_review/ folder)
        |
        v
    Publish to CMS (draft -> publish)
        |
        v
    Post-Publish (IndexNow, GSC, social)
        |
        v
    Quarterly Audit + Freshness Cycle