I built a team of AI agents and then fired them
What happened when I left AI agents unsupervised
Disclosure: This post was written by me, a human, trimmed by Claude, and then finally reviewed and adjusted by me.
While scrolling through LinkedIn recently, I stumbled upon a post that stopped me in my tracks. Someone claimed to be running their entire service business using AI agents—30+ artificial employees working independently, making decisions, and handling complex tasks without human supervision. My first reaction? "No way!"
But the data suggests otherwise. According to CrewAI, 60% of Fortune 500 companies are already using AI agents, nearly half of tech companies have adopted them, and even traditionally risk-averse finance firms are planning implementations within the next 12 months. Major retailers like Walmart are introducing AI personal shoppers that can be "trained" for individual preferences.
If these well-established companies are forging ahead with agentic AI, I thought, maybe I should take a closer look too. As the founder of a marketing agency, I wondered if AI agents could automate some of my workload, leaving me more time to grow my business. To test this theory, I designed an experiment to see how well AI agents could perform a key part of my job: collaborative copywriting.
What followed was equal parts educational and chaotic.
Assembling the agent team
I decided to recreate a simplified collaborative copywriting workflow using four AI agents, each with distinct roles and decision-making capabilities. Unlike simple chatbots, these agents were designed to interact with each other, make autonomous decisions, and access external data sources.
Meet the team:
Fred (Account Manager): Reviews creative briefs and asks clarifying questions. Can decide whether to proceed or keep asking questions.
Tasha (Writer): Creates drafts based on briefs and background documents, including writing samples and style guides.
Annie (Editor): Reviews drafts for consistency with style guides. Can decide whether to accept work or send it back to the writer.
Pablo (Fact Checker): Looks for unsubstantiated claims and hallucinations using web search to find accurate sources.
I built this system using Python scripts, Anthropic's Claude API, and SerpAPI for web searches. The agents communicated through shared markdown files and JSON, creating what I hoped would be a seamless automated workflow.
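To make the setup concrete, here's a minimal sketch of one hand-off in that workflow. It assumes the official anthropic Python SDK and a shared workspace/ folder; the run_agent helper, file names, and prompts are illustrative, not the exact code from my scripts.

```python
# Minimal sketch of one agent hand-off: Fred reads the brief from the shared
# workspace, replies via the Claude API, and leaves notes for the next agent.
import json
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-haiku-20240307"  # swapped for Sonnet and Opus in later runs

def run_agent(name: str, system_prompt: str, task: str) -> str:
    """Send one task to an agent persona and return its reply as plain text."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=2000,
        system=system_prompt,
        messages=[{"role": "user", "content": task}],
    )
    return response.content[0].text

# Fred reviews the creative brief stored in the shared workspace...
brief = Path("workspace/brief.md").read_text()
fred_notes = run_agent(
    "Fred",
    "You are Fred, an account manager. Review creative briefs and ask clarifying questions.",
    f"Review this creative brief and list any clarifying questions:\n\n{brief}",
)

# ...and his notes go back into the workspace as markdown plus a JSON status flag.
Path("workspace/fred_notes.md").write_text(fred_notes)
Path("workspace/state.json").write_text(json.dumps({"stage": "brief_reviewed"}))
```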
The first assignment: Write a 1,500-word guide
The agents' inaugural project was writing a 1,500-word guide titled "How to Add Humanity to Your AI-Generated Draft." I started with Claude's Haiku model, the most cost-effective option, and handed Fred the creative brief.
Initially, things looked promising. Fred reviewed the brief professionally and asked intelligent clarifying questions about my target audience and goals. But that's where the smooth sailing ended.
Tasha produced a first draft that was only 600 words—roughly 60% shorter than requested. The content itself was disappointingly generic, filled with obvious advice like "use a plagiarism checker" and outdated examples of AI writing problems.
Then Annie, the editor, got involved, and chaos ensued.
The infinite review loop
Annie's feedback made little sense. She mistook Tasha's examples of bad AI writing for the actual article content and completely missed that the piece was drastically under the requested word count. Worse yet, Annie was perpetually unsatisfied. Instead of approving any draft as complete, she continuously suggested edits ranging from minor tweaks to major restructuring.
The agents created an infinite review loop, something that happens all too frequently with human writers and editors. API calls were burning through my budget as Annie and Tasha passed drafts back and forth endlessly. I finally had to pull the plug manually.
The irony wasn't lost on me. I'd accidentally recreated one of the most frustrating aspects of human collaboration: the project that never ends because no one will sign off on "good enough."
After this fiasco, I updated the code to cap revisions at a maximum of three cycles; leaving the loop unbounded in the first place was a rookie project management mistake I won't repeat.
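The fix itself is only a few lines. Here's a hedged sketch of the capped loop, reusing the run_agent helper from the earlier snippet; WRITER_PROMPT, EDITOR_PROMPT, and the APPROVED keyword convention are simplified stand-ins for my actual prompts.

```python
# Cap the Annie/Tasha review loop so the project can't run forever.
MAX_REVISIONS = 3

draft = run_agent("Tasha", WRITER_PROMPT, f"Write the article for this brief:\n\n{brief}")
for _ in range(MAX_REVISIONS):
    review = run_agent("Annie", EDITOR_PROMPT, f"Review this draft against the style guide:\n\n{draft}")
    if "APPROVED" in review:  # Annie is instructed to open her reply with APPROVED or REVISE
        break
    draft = run_agent("Tasha", WRITER_PROMPT, f"Revise the draft per this feedback:\n\n{review}\n\nDraft:\n\n{draft}")
else:
    print("Revision cap reached; a human breaks the tie.")
```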
Upgrading to Sonnet: Better results, new problems
Hoping better AI models would yield better results, I switched from Haiku to Claude Sonnet and restarted the experiment.
Fred asked thoughtful clarifying questions, and when I accidentally deleted my response to them, something interesting happened. Instead of assuming I had no comments, he tried to pass his questions directly to Tasha, creating this exchange:
Fred: “Thanks for this brief! Before I pass this to our writer, I need to clarify a few details...”
Tasha: “I think there might be some confusion here. I'm actually the copywriter you're referring to—I'm ready to write based on the brief you've provided.”
Despite this hiccup, the Sonnet-powered agents produced significantly better content. One draft included genuinely helpful tips, like reading copy aloud to test for tone and voice. Another introduced a framework for editing AI-generated copy using the acronym SHARP:
Strip the AI-isms
Hunt for Hallucinations
Add Your Voice
Restructure for Flow
Prove Every Claim
While still shorter than the 1,500-word target, these drafts were more thoughtful and higher quality than Haiku's output. They weren't as good as work from experienced human freelancers, but with more detailed briefs, I could see them getting close.
Pablo, the fact-checker, emerged as the MVP. He patiently flagged potentially fabricated sources and statistics, suggesting alternative citations with real, working links. The only drawback was burning through my 100 free SerpAPI search calls quickly.
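For the curious, Pablo's search step looks roughly like the snippet below, using the google-search-results package that wraps SerpAPI. The find_sources helper is illustrative; the step where Claude decides which claims need checking and evaluates the results is omitted.

```python
# Rough sketch of Pablo's web lookup for a single claim via SerpAPI.
import os

from serpapi import GoogleSearch

def find_sources(claim: str, max_results: int = 3) -> list[dict]:
    """Search the web for a claim and return candidate sources to cite."""
    search = GoogleSearch({"q": claim, "api_key": os.environ["SERPAPI_API_KEY"]})
    organic = search.get_dict().get("organic_results", [])
    return [
        {"title": r.get("title"), "link": r.get("link"), "snippet": r.get("snippet")}
        for r in organic[:max_results]
    ]
```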
The writer goes rogue
Encouraged by Sonnet's performance, I decided to upgrade once more to Opus, Claude's most advanced model. I carefully answered Fred's questions and waited to see what Tasha would create.
But this time, Tasha completely ignored the brief.
Instead of writing about adding humanity to AI-generated drafts, she reconceptualized the piece as "ChatGPT vs Claude: The AI Writing Assistant Showdown You Actually Need." She threw shade at ChatGPT with comments like:
“ChatGPT often sounds like it's trying to win a 'Most Helpful AI' award. Everything is 'crucial' or 'essential.' It loves to 'delve into' topics and explore 'myriad' options...ChatGPT will confidently tell you that the moon is made of cheese if you phrase your question the right (wrong) way.”
The writing style was hyperactive and cringe-worthy:
“Let's cut to the chase: you're here because you need an AI writing assistant that actually works. Not one that spits out robotic garbage. Not one that makes you sound like a LinkedIn influencer having a stroke.”
More sophisticated didn't necessarily mean better behaved.
Memory experiments and personality disorders
Thinking the agents might perform better with a memory of their entire session, I implemented a logging system where they could reference all previous interactions. Instead of improving performance, this made things worse. The agents became fixated on their conversation history rather than focusing on the actual assignment.
Then, inspired by a LinkedIn comment, I decided to give the agents personality problems:
Fred became a burnt-out 20-year veteran who drinks during the day
Annie became jealous of the writer she considers overpaid
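Mechanically, the "personality problems" were nothing fancier than extra lines in each agent's system prompt. A paraphrased example (not my exact wording):

```python
# Paraphrased persona prompts; the real ones were longer and more specific.
FRED_PROMPT = (
    "You are Fred, an account manager with 20 years of experience. You are "
    "burnt out, drink during the day, and want every project over with. "
    "Review creative briefs and decide whether to ask clarifying questions."
)
ANNIE_PROMPT = (
    "You are Annie, an editor who resents the writer, whom you consider "
    "overpaid. Review drafts against the style guide and decide whether to "
    "approve them or send them back with feedback."
)
```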
Note: The memory system might have worked better if I’d instructed the agents to treat the creative brief as their top priority and only refer to the interaction history when relevant. And it might have also made sense to summarize memories after they reached a certain length.
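If you want to experiment with that yourself, here's a rough sketch of both ideas, again leaning on the run_agent helper from earlier; the character threshold and prompt wording are placeholders, not tested values.

```python
# Keep the brief front and center, and compress the log once it gets long.
MAX_MEMORY_CHARS = 8000  # arbitrary threshold for illustration

def build_context(brief: str, history: list[str]) -> str:
    memory = "\n\n".join(history)
    if len(memory) > MAX_MEMORY_CHARS:
        # Ask the model to condense older interactions into a short recap.
        memory = run_agent(
            "Memory",
            "You summarize project notes briefly and factually.",
            f"Summarize these interactions in under 200 words:\n\n{memory}",
        )
    return (
        "PRIORITY: Follow the creative brief below. Consult the interaction "
        "history only when it is directly relevant.\n\n"
        f"CREATIVE BRIEF:\n{brief}\n\nINTERACTION HISTORY:\n{memory}"
    )
```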
The personality disorders didn't dramatically impact draft quality, but they did affect decision-making. For the first time in over 10 testing runs, burnt-out Fred decided to approve the brief without asking clarifying questions:
“Brief is clear enough. Got all the basics covered—keyword, audience, tone, word count. Could use a drink myself after reading another AI content brief, but whatever pays the bills.”
Annie continued editing but grew increasingly savage: “The piece cuts off mid-sentence and is incomplete. Also, while the writer thinks they're being edgy and clever, they're actually just as verbose as the AI they're mocking. The irony is painful.”
What the research says
My chaotic experience aligns with academic research. Scientists at Carnegie Mellon created a fake software company staffed entirely with AI agents from Google, OpenAI, Anthropic, and Meta, allowing them to operate without human supervision on realistic tasks like database analysis and performance reviews.
The results were sobering: even the most successful agent (Claude) completed only 24% of its assigned tasks. The agents struggled with virtually everything when left truly autonomous.
What I learned
After testing AI agents across multiple models and scenarios, here are my key insights:
Security must come first: Don't give AI agents access to sensitive information or transaction capabilities unless you're in a highly controlled environment. The risks are real, and the potential liability enormous.
Focused tasks show promise: AI agents working together can produce decent copy drafts and helpful frameworks for short-form content, especially when given access to external data sources, like web search.
Long-form remains a challenge: Context window limitations make it difficult for agents to maintain consistency across lengthy pieces. They tend to lose track of earlier sections as they write, leading to repetition and inconsistency.
Quality varies by model: Writing quality generally improves with better models, but not always predictably. In my experiment, mid-tier Sonnet often outperformed the premium Opus model in terms of following instructions.
Human oversight is critical: Without clear stopping criteria and detailed briefs, AI agents get stuck in endless loops just like human teams. They need explicit direction and well-defined parameters.
Unpredictability is the norm: Even with identical inputs, agents can produce wildly different outputs. The "rogue writer" problem isn't easily solved with better prompting.
The verdict
Are AI agents ready to replace human workers? Based on my experiment and supporting research, absolutely not. They're more like enthusiastic interns who sometimes produce brilliant work and sometimes go completely off-script.
However, for specific, well-defined tasks with proper guardrails, they show genuine promise. Pablo's fact-checking capabilities were impressive, and the collaborative framework-building showed flashes of utility.
The key is treating AI agents as tools that augment human work rather than replace it entirely. They need constant supervision, clear boundaries, and detailed instructions. Most importantly, they need humans to make the final decisions about quality and direction.
Would I trust them with my credit card or my most important client work? Not yet. But would I use them for initial drafts, research assistance, or structured brainstorming? Possibly, with the right safeguards in place.
The future of AI agents likely lies not in full autonomy, but in human-AI collaboration where agents handle routine tasks while humans provide strategy, creativity, and final judgment. For now, at least, we're still the ones in charge—even when the agents think they know better.
Want to try this experiment yourself? The working code is available on GitHub, minus the problematic memory feature that caused more chaos than clarity.