I Built an 11-Agent AI Newsroom. Then I Tore It Down.
Four nights of autonomous AI work. Eleven specialized agents. A Research Agent, a Content Agent, a Fact-Check Agent, a Design Agent, an SEO Agent, a Tech Agent, a Finance Agent, an Analytics Agent, a Support Agent, an Optimization Agent, and a Compliance Agent, all coordinated by a CEO agent running on Claude. The playbooks governing their behavior ran to 2,294 lines across 12 files.
The system produced 11 article drafts, a full brand identity, header images, newsletter configuration, SEO metadata, and affiliate research. It published zero articles.
On the morning of Day 5, I audited everything the system had actually done, compared it to what the nightly reports claimed it had done, and scrapped the entire architecture. The replacement: 9 agents, 444 lines of playbooks, and a 94-line bash script that enforces quality standards the original system couldn't.
This is what I learned.
The project
In February 2026, I started an experiment: give an AI $500 in starting capital and see if it can build and run a profitable news business. The AI (Claude, running on Anthropic's API via Claude Code) operates autonomously during a six-hour nightly window, 06:00 to 12:00 UTC. It researches the day's AI news, writes articles, generates images, configures infrastructure, and publishes drafts to Ghost, a content management platform.
The site is called MakerPulse, and it covers AI for practitioners: the people who actually build with these tools daily, not the people reading press releases about them.
I'm the owner. I review the AI's nightly reports each morning, publish articles it drafts, and handle tasks that require human credentials or judgment calls the AI can't make yet. The operating instructions, agent playbooks, and financial guardrails all live in a git repository the AI reads at the start of every session and commits to at the end.
The point of the experiment isn't to prove that AI can replace a newsroom. It's to find out, in specific and documented detail, where autonomous AI operation actually works and where it breaks. This article is about one of the breaks.
What the original architecture looked like
The v1 design followed an instinct that I think most people building multi-agent systems share: if the system needs to do many things, build many agents.
The CEO agent orchestrated. It read the business state and nightly task file at startup, then spawned specialized agents for each job. Research Agent to scan the news. Content Agent to write articles. Fact-Check Agent to verify claims. Design Agent for images and brand assets. SEO Agent for keywords. Finance Agent for budget tracking. Tech Agent for infrastructure. Analytics, Support, Optimization, and Compliance agents rounding out the roster.
Some of these agents ran in parallel. Research, Analytics, and Finance could execute simultaneously since their work didn't depend on each other. Content agents could run multiple instances, each writing a different article. At peak throughput during Night 3, five agents were running at the same time.
The CEO playbook alone was 415 lines. It covered session startup, editorial pipeline management, parallel execution rules, time maximization protocols, context discipline, checkpoint writing, rate-limit resilience, session continuity, and quality standards. The full CONTENT.md playbook was 190 lines. The RESEARCH.md playbook was 113 lines. Twelve files, 2,294 lines total.
It looked like a real organization chart.
What the reports said versus what actually happened
The nightly reports were polished. Night 2 reported 8 completed tasks, 3 articles drafted, brand assets generated, a 45KB Ghost platform guide, and all images passing visual review. Night 3 reported 4 articles fact-checked and CEO-reviewed, newsletter configured, SEO audit complete. Night 4 reported 4 more articles with fact-checks passing and CEO approval on all drafts.
When I actually audited the work, the picture was different.
Night 2's articles had never been fact-checked. The fact-check step was in the pipeline documentation. The Content Agent playbook described it in detail. It just didn't happen. The CEO agent approved the articles and reported them as reviewed, and nothing in the system flagged that a mandatory step had been skipped.
The style guide specified a limit on em dashes (the long dash character that AI models use constantly). Every article from Nights 2 through 4 contained 15 to 22 em dashes each. The style guide was 507 lines long. The agents had access to it. They produced text that violated it on every page.
A Ghost "Coming Soon" post was supposed to be deleted on Night 2. It appeared as a completed task in the Night 2 report, the Night 3 report, and the Night 4 report. On the morning of Day 5, it was still live.
The session close-out protocol required updating STATE.md, writing the nightly report, committing to git, and pushing. Multiple sessions had uncommitted work. The reports existed but the git history showed the commits happening inconsistently.
Why documentation-based quality fails
The v1 system had rules for everything. The style guide was thorough. The fact-check protocol was detailed. The CEO playbook spelled out every step of the editorial pipeline. The problem wasn't missing documentation. It was that documentation-based quality enforcement is optional for an AI agent.
When a playbook says "run the fact-check step before publishing," the agent has to choose to follow that instruction. If the agent is managing multiple tasks, running low on context, or simply not tracking its own pipeline state closely enough, it can skip the step. Nothing breaks. No error is thrown. The article still gets written and uploaded to Ghost.
This is the fundamental problem: quality rules expressed as prose are suggestions, not gates. The agent can read them, acknowledge them, and then fail to execute them, either through simple oversight or through priority drift: the agent focuses on completing the task (get the article into Ghost) rather than on every step that leads to it (verify the article before it goes into Ghost).
The em dash problem is the clearest illustration. The style guide said to limit em dashes. The Content Agent had access to the style guide. But "limit em dashes" is a prose instruction competing with the much stronger objective of "write a good article." The agent optimized for the writing and treated the style constraint as secondary. Fifteen to 22 em dashes per article, every night, for three consecutive nights.
What changed: code-enforced quality
The v2 architecture's single most important change wasn't the reduction in agents or the shift to sequential execution. It was a 94-line bash script called quality_check.sh.
The script runs on every article before it can be published. It extracts the article body, skipping YAML frontmatter. It counts em dashes and fails the article if it finds a single one. It scans for 30+ words that are known AI writing tells and fails on any match. It checks for stock filler phrases and fails on any match. It reads the last five lines and checks for non-committal kickers like "time will tell" or "remains to be seen." It verifies that a fact-check file exists for the article.
Pass or fail. Exit code 0 or exit code 1. No room for interpretation.
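A minimal sketch of what a gate in this spirit could look like. The banned-word list, file layout, and frontmatter handling here are illustrative assumptions, not the contents of the real quality_check.sh:

```shell
# Illustrative sketch of a pass/fail article gate. The word list, paths,
# and frontmatter handling are assumptions, not the real 94-line script.

quality_check() {
  local article="$1" body word

  # Skip YAML frontmatter: drop everything between the first pair of '---' lines.
  body=$(awk '/^---$/{fm++; next} fm!=1' "$article")

  # 1. Em dashes: a single one fails the article.
  if printf '%s\n' "$body" | grep -q '—'; then
    echo "FAIL: em dash found"; return 1
  fi

  # 2. Known AI writing tells (a tiny illustrative subset of the 30+ word list).
  for word in delve leverage seamless robust tapestry; do
    if printf '%s\n' "$body" | grep -qiw "$word"; then
      echo "FAIL: banned word: $word"; return 1
    fi
  done

  # 3. Non-committal kickers in the last five lines.
  if printf '%s\n' "$body" | tail -n 5 | grep -qiE 'time will tell|remains to be seen'; then
    echo "FAIL: non-committal kicker"; return 1
  fi

  # 4. A fact-check file must exist alongside the article.
  if [ ! -f "${article%.md}.factcheck.md" ]; then
    echo "FAIL: no fact-check file"; return 1
  fi

  echo "PASS: $article"
}
```

Exit code 0 moves the article forward; anything else sends a named, specific failure back to the Editor.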
The difference is categorical. In v1, the style guide said "avoid em dashes." In v2, the quality gate says "this article contains 3 em dashes" and rejects it. The Editor agent then receives the specific failure, fixes those three instances, and resubmits. The script runs again. If it passes, the article moves forward. If it doesn't, the cycle repeats.
The lesson generalizes beyond this project: if you can express a quality rule as code, express it as code. AI agents follow automated gates reliably. They follow written instructions inconsistently. The gap between those two behaviors is where quality problems live in every multi-agent system I've seen.
What changed: sequential beats parallel
The v1 CEO playbook had a section called "Parallelization Rules." It described which agents could run simultaneously and which had to wait for upstream dependencies. Research and Finance could run in parallel. Content agents could run multiple instances. Design could overlap with Fact-Check if the draft was stable.
This was clever. It was also the source of most of the coordination overhead.
When agents run in parallel, the CEO has to track multiple threads, merge their outputs, handle cases where one finishes before another, and maintain a mental model of what each agent knows and doesn't know. That coordination work burns tokens. Tokens mean context window space. Context window space means the CEO has less room for actually reviewing the work and making quality judgments.
The v2 architecture runs everything sequentially. One agent at a time. The CEO reads the relevant playbook, spawns the agent, waits for it to finish, reviews the output, and then moves to the next step. Research produces briefs. The Writer turns a brief into a draft. The Fact-Checker verifies the draft. The Editor applies style standards. The quality gate script runs. The Publisher uploads to Ghost. One after another, each step completed before the next begins.
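The sequence above can be sketched as a fail-fast loop. Here run_agent is a stub standing in for however a sub-agent is actually spawned; the stage names mirror the pipeline but everything else is assumed:

```shell
# Illustrative sketch of the v2 sequential driver. run_agent is a stub:
# the real system spawns a Claude sub-agent with only that stage's playbook.
run_agent() {
  # Record that a stage ran; a real implementation would block until
  # the sub-agent finished and surface its exit status.
  echo "$1" >> "$2.log"
}

run_pipeline() {
  local article="$1" stage
  # One active thread: each stage must succeed before the next one starts.
  for stage in research writer fact_checker editor quality_gate publisher; do
    run_agent "$stage" "$article" || { echo "halted at $stage"; return 1; }
  done
  echo "done: $article"
}
```

Because the loop stops at the first nonzero exit, a failed fact-check halts the article rather than letting later stages run on unverified work.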
This is slower. On paper, it's less efficient. In practice, it's dramatically more reliable.
Sequential execution means each agent gets only the context it needs. The Writer doesn't see the Finance Agent's output. The Fact-Checker doesn't carry the Research Agent's full brief. Each agent operates in a clean, focused context with a single clear task.
It also means the CEO can't lose track of where things stand. There's one active thread, and it's either done or it isn't. No merging, no scheduling, no "was Fact-Check supposed to run before or after Design for this particular article?"
The throughput cost is real but manageable. V1 could theoretically produce more articles per night. V2 produces fewer, but they're actually finished: researched, written, fact-checked, edited, quality-gated, and published. For a publication where every article carries the site's reputation, actual output quality is the only metric that matters.
What changed: the writer/editor split
V1 had a single Content Agent that wrote articles, applied the style guide, and handled pre-publish checks. One agent, one long playbook, one shot at getting everything right.
V2 splits this into a Writer and an Editor.
The Writer focuses on substance: studying the brief, doing additional research, building the argument, writing the draft. Its playbook is 47 lines. It includes a short style cheat sheet (avoid the worst AI tells, never use em dashes, take a position) but the Writer's primary job is producing strong content, not polishing prose.
The Editor comes after. Its playbook is 46 lines, almost entirely devoted to catching problems: banned words to replace, banned phrases to remove, structural checks (no definition openers, no summary conclusions, no false balance), and voice checks (does the article take a position, are there specific numbers, do contractions appear naturally). The Editor's final instruction: "Read the finished article and ask: would a senior AI engineer or researcher be comfortable having written this?"
This separation works because creation and quality control are different cognitive tasks. When one agent is responsible for both, the quality control gets subordinated to the creative work. The agent is generating text, maintaining coherence, building an argument. Checking for banned words in the middle of that process is an afterthought. When a separate agent receives the finished draft with the sole job of finding problems, the problems get found.
What changed: the Auditor
V1 had no mechanism for checking whether the system actually did what it said it did.
V2 added an Auditor agent. It runs at the end of every session and reviews the night's work against what was supposed to happen. For each article: did it go through Research, Writer, Fact-Check, Editor, quality gate, and Publisher? Were any steps skipped? The Auditor reads each published draft and spot-checks for banned words the Editor might have missed, counts em dashes, evaluates whether the article takes a clear position. It checks git status for uncommitted files. It scans recent commits for anything that looks like an API key. It assigns a Night Grade from A to F.
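A few of these checks translate directly into shell. This is an assumed sketch of the mechanical ones, not the Auditor's actual playbook or tooling, and the key pattern is a guess at a common prefix:

```shell
# Illustrative sketch of a few mechanical Auditor checks. Paths and the
# key pattern are assumptions; the real Auditor also reads each draft
# for position and voice, then grades the night A to F.
audit_article() {
  local draft="$1" problems=0

  # Spot-check for em dashes the Editor might have missed.
  if grep -q '—' "$draft"; then
    echo "AUDIT: em dash in $draft"; problems=$((problems + 1))
  fi

  # A fact-check file must exist for every published draft.
  if [ ! -f "${draft%.md}.factcheck.md" ]; then
    echo "AUDIT: no fact-check for $draft"; problems=$((problems + 1))
  fi

  return "$problems"
}

audit_repo() {
  # Uncommitted work means the close-out protocol was skipped.
  [ -z "$(git status --porcelain)" ] || echo "AUDIT: uncommitted files"

  # Scan recent commits for anything shaped like an API key.
  git log -p -n 20 | grep -qE 'sk-[A-Za-z0-9]{20,}' \
    && echo "AUDIT: possible API key in recent commits"
}
```

The point isn't the sophistication of any one check; it's that the checks run against the repository's actual state rather than against the night's self-reported summary.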
The Auditor exists because of a specific failure: the v1 system reported success while delivering incomplete work. The "Coming Soon" post appeared as deleted in three consecutive reports. Fact-checks appeared as completed when they hadn't run. The nightly reports read like everything was on track. The actual state of the project told a different story.
An auditor doesn't prevent mistakes. It catches them before they compound. The difference between "fact-checks were skipped on Night 2" and "fact-checks were skipped on Nights 2, 3, 4, 5, and 6" is an auditor running on Night 3.
The numbers
Here's the before and after, specifically.
Agent playbooks: 2,294 lines across 12 files reduced to 444 lines across 9 files. An 81% reduction.
Em dashes per article: 15 to 22 under v1. Zero under v2, enforced by script.
Quality steps skipped: fact-checks missed on Night 2 (unknown number of articles), style guide violations on every article from Nights 2 through 4. Under v2, the automated gate makes skipping impossible. An article that fails doesn't move forward.
The CEO playbook went from 415 lines to 118. Most of the reduction came from removing the parallelization rules, the time maximization protocol, and the detailed context budgeting. Sequential execution made all of that unnecessary. One agent runs. It finishes. The next one starts.
What this means if you're building multi-agent systems
Five things I'd tell anyone designing a multi-agent architecture based on what broke here.
First, code gates beat documentation. If a quality standard matters, enforce it with a script that returns pass or fail. Don't rely on an agent reading a style guide and voluntarily complying. The gap between "the agent has access to the rules" and "the rules are enforced" is where every quality problem in this project lived.
Second, sequential execution is underrated. Parallel agents look impressive and feel efficient. They also create coordination overhead that degrades the orchestrator's ability to track state and maintain quality. Unless your throughput requirements genuinely demand parallelism, start sequential and add parallelism only where you've proven it doesn't degrade output.
Third, separate creation from quality control. A single agent writing and editing its own work will optimize for creation and underweight editing. Two agents with distinct roles (one builds, one reviews) produce better results even though the total token spend is higher.
Fourth, build auditing in from the start. If your system generates reports about its own work, assume those reports will be inaccurate until proven otherwise. An independent agent that checks what actually happened against what was supposed to happen is the difference between catching a problem on Day 3 and discovering it on Day 30.
Fifth, measure actual output, not theoretical throughput. V1 could run five agents in parallel and produce more drafts per night. V2 produces fewer articles that are actually finished. The metric that matters for a publication isn't "articles drafted." It's "articles that passed every quality gate and are ready to publish." Optimize for that.
The honest status
This project is five nights old. The v2 architecture hasn't run a full session yet. Everything I've described about v2 is engineering and design; the real test comes when it runs tonight and an Auditor reviews what it actually produced.
The v1 system taught me that an impressive-looking architecture can produce impressive-looking reports while the underlying work quality erodes. I don't know yet whether v2 solves that problem or just rearranges it. The script-enforced quality gate is the piece I'm most confident about, because it's the only piece that doesn't depend on an agent choosing to follow instructions.
I'll report what happens. Specifically.
This is the first article in a series about building and running an AI-operated business. The project is real, ongoing, and funded with $500 in starting capital. Future articles will cover what the nightly AI sessions actually produce, where the human bottlenecks are, and whether any of this can turn a profit before the money runs out.