How to Set Up a Multi-Agent Team

Setting up multi-agent orchestration for real work? Here is the role-based pattern (coder, design, test, supervisor) and where a human stays in the loop.


Kimmo Nurmisto

Founder, Grolea · 6 min read

Last week someone on Hacker News asked one of the questions that comes up most often about running agents. They were about to start a bigger project with subprojects, wanted to stay a human in the loop of an otherwise automated loop, and wanted different agents for different roles: coding, design, testing, supervision. Then the real question: "is there already some tooling that facilitates that? Or do you simply use some specific model as a coordinator?"

That is the right question, and the order it arrives in tells you something. Operators don't start by asking which model is best. They start by asking how to divide the work. Here is the pattern I run in production, why the roles matter, and where I keep a human hand on the wheel.

Why one agent doing everything breaks

The instinct is to point one capable model at the whole task and let it run. It writes the code, checks its own work, decides it's done. For a throwaway script that's fine. For anything you'll maintain, it fails in a specific way: the agent that wrote the code is the worst possible reviewer of it.

A single agent carries one context and one set of assumptions from the first token to the last. If it misreads the requirement at step two, every later step inherits the mistake, and the same agent that made it is now grading it. There is no second opinion in the loop, because there is no second agent. You get confident output that is wrong in a way nobody caught.

Role separation fixes this by making the handoff the unit of work. The reviewer reads the coder's diff cold, without the coder's justifications. The tester runs against the spec, not against what the coder intended. Each boundary is a place for a mistake to surface instead of compounding. That is the whole argument for a team over a soloist: not more horsepower, but more places to catch a problem before it ships.

The four-role pattern: coder, design, test, supervisor

The poster already named the roles: coding, design, testing, supervision. That instinct is correct. Here is what each one owns and, more importantly, what it hands to the next.

Design / spec. Turns the request into a concrete spec before anyone writes code: the interface, the constraints, the acceptance criteria. This is the role most solo setups skip, and skipping it is why agents "do the wrong thing perfectly." Output: a spec the other roles can be held against.

Coder. Implements against the spec and nothing else. It does not get to redefine the task when the task gets hard. Output: a diff plus a short note on what it changed and why.

Test. Runs the code against the spec's acceptance criteria, not against the coder's description of its own work. Reproduces, exercises edge cases, reports pass or fail with evidence. Output: a verdict the supervisor can trust without re-checking.

Supervisor / coordinator. The role the poster was reaching for with "some specific model as a coordinator." It owns sequencing and the handoffs: who runs next, whether a failed test goes back to the coder or back to the spec, and when the work is actually done. It does not write the code or grade the tests. It routes.
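As a rough illustration of "it routes," here is a minimal sketch of the supervisor's decision as a pure function over the tester's verdict. The names `Role`, `FailureKind`, `Verdict`, and `route` are hypothetical, and the split between an implementation failure and a spec failure is an assumption about how a tester might label its result, not part of any particular tool.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    DESIGN = "design"
    CODER = "coder"
    HUMAN = "human"


class FailureKind(Enum):
    NONE = "none"            # every acceptance criterion passed
    IMPLEMENTATION = "impl"  # the code misses a clearly stated criterion
    SPEC = "spec"            # the criterion itself is ambiguous or contradictory


@dataclass(frozen=True)
class Verdict:
    """The tester's output: pass or fail, with evidence (hypothetical shape)."""
    passed: bool
    failure: FailureKind
    evidence: str


def route(verdict: Verdict) -> Role:
    """Decide who runs next. The supervisor never edits code or grades tests."""
    if verdict.passed:
        # Marking the whole job done is a human gate, so hand off for sign-off.
        return Role.HUMAN
    if verdict.failure is FailureKind.SPEC:
        # The spec is the problem, so the work goes back to design.
        return Role.DESIGN
    # Otherwise the code is the problem, so the work goes back to the coder.
    return Role.CODER
```

Keeping the routing decision free of side effects makes it easy to test on its own, separately from whichever models fill the roles.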

The coordinator is where most homegrown setups get stuck, because coordination isn't a prompt. It's state that has to live somewhere: who's waiting on whom, what happens when the tester fails twice, and where the work stops for a human. In practice, that state belongs in the system, not in one model's head. This is how Paperclip models it: agents are dormant by default and wake on assigned Issues, mentions, and routines, each acting on the piece of shared state in front of it, rather than a single "coordinator model" spinning in a live loop. The supervisor is itself an agent with reporting lines and defined escalation paths. It routes work back to the coder or back to the spec on top of that shared tracker, so delegation mirrors human org design instead of collapsing into one model in a loop.
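To make "state that lives in the system" concrete, here is a small sketch of a shared tracker that agents wake against. It is not Paperclip's API: `Issue`, `record_test_result`, `wake`, and the `MAX_TEST_FAILURES` threshold are all hypothetical, and escalating to a human after two failed test runs is one assumed answer to "what happens when the tester fails twice."

```python
from dataclasses import dataclass, field

# Assumption: after this many failed test runs, stop looping and ask a human.
MAX_TEST_FAILURES = 2


@dataclass
class Issue:
    """One unit of work in the shared tracker (hypothetical shape)."""
    id: int
    assignee: str  # the role that should wake next
    test_failures: int = 0
    history: list[str] = field(default_factory=list)


def record_test_result(issue: Issue, passed: bool) -> Issue:
    """Write the test outcome into shared state and pick the next assignee."""
    if passed:
        issue.history.append("test passed")
        issue.assignee = "human"  # approval gate before the job is marked done
        return issue
    issue.test_failures += 1
    issue.history.append(f"test failed ({issue.test_failures})")
    if issue.test_failures >= MAX_TEST_FAILURES:
        issue.assignee = "human"  # escalate instead of retrying forever
    else:
        issue.assignee = "coder"  # send the failure back for another attempt
    return issue


def wake(issues: list[Issue], role: str) -> list[Issue]:
    """Return the issues a dormant agent in this role should act on."""
    return [issue for issue in issues if issue.assignee == role]
```

Because the failure count and the assignee live on the issue, any agent (or a human) can pick the work up after a restart without relying on one model's conversation history.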

One more thing the roles buy you: each agent runs in a clean context scoped to its job. The coder isn't carrying the design debate; the tester isn't carrying the coder's rationalisations. Narrow context per role is part of why the handoff catches things a single long-running context would have buried.
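One way to sketch that scoping is an allow-list of handoff artifacts per role, applied before an agent's context is built. The artifact names and the `VISIBLE` mapping below are assumptions chosen to match the roles above, not a fixed schema.

```python
# Which handoff artifacts each role is allowed to see (hypothetical names).
VISIBLE: dict[str, set[str]] = {
    "design": {"request"},
    "coder": {"spec"},
    "test": {"spec", "diff"},  # the diff, but not the coder's note about it
    "supervisor": {"verdict"},
}


def context_for(role: str, artifacts: dict[str, str]) -> dict[str, str]:
    """Return only the artifacts this role should carry into its context."""
    allowed = VISIBLE[role]
    return {name: body for name, body in artifacts.items() if name in allowed}
```

For example, `context_for("test", artifacts)` drops both the design debate and the coder's note, so the tester judges the diff against the spec alone.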

Where to keep a human in the loop

The poster wanted to be "a human in the loop of an otherwise fairly automated loop." That phrasing is exactly right, and the design question hiding inside it is: which gates are worth a human, and which are just friction?

The test that has held up for me: gate on the decisions that are expensive to reverse; automate the rest.

Run autonomously:

Routine implementation against an approved spec
Re-running tests and feeding failures back to the coder
Internal refactors with no interface change

Gate on a human:

Anything that touches production, spends money, or sends something to a customer
A change to the spec itself, not just the code under it
The point where the supervisor would otherwise mark the whole job done

The mistake in both directions is real. Gate everything, and you've rebuilt a slow manual process with extra steps and an LLM bill. Gate nothing, and you've handed an automated loop your production database and your reputation. The skill is placing the gates at the few points where a human's judgment actually changes the outcome, and letting the agents run between them. A good kill switch and a clear approval point beat a human babysitting every token.
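The two lists above can be sketched as a single predicate that the coordination layer checks before letting an action run. The `Action` fields are hypothetical flags that whatever proposes the action would need to set honestly. This is a sketch of the rule, not of an enforcement mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A proposed step, tagged with the properties that make it hard to reverse."""
    description: str
    touches_production: bool = False
    spends_money: bool = False
    contacts_customer: bool = False
    changes_spec: bool = False
    marks_job_done: bool = False


def needs_human(action: Action) -> bool:
    """Gate on decisions that are expensive to reverse; let everything else run."""
    return any(
        (
            action.touches_production,
            action.spends_money,
            action.contacts_customer,
            action.changes_spec,
            action.marks_job_done,
        )
    )
```

Routine work such as `Action("re-run tests")` passes straight through, while `Action("deploy", touches_production=True)` stops for approval.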

If you're starting from the same spot, start with more gates than you think you need, watch where the human always just says "yes," and remove those gates. The ones where you hesitate are the ones worth keeping.
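That pruning step can be sketched as a small audit over a log of gate decisions. The log format (pairs of gate name and approval) and the `min_samples` threshold are assumptions. The function only suggests candidates, and the choice to remove a gate stays with you.

```python
from collections import Counter
from collections.abc import Iterable


def removable_gates(
    decisions: Iterable[tuple[str, bool]], min_samples: int = 20
) -> list[str]:
    """List gates where the human approved every time, given enough samples.

    `decisions` is an iterable of (gate_name, approved) pairs.
    """
    seen: Counter[str] = Counter()
    approved: Counter[str] = Counter()
    for gate, ok in decisions:
        seen[gate] += 1
        if ok:
            approved[gate] += 1
    # A single rejection keeps the gate: that is a gate where someone hesitated.
    return sorted(
        gate
        for gate, count in seen.items()
        if count >= min_samples and approved[gate] == count
    )
```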

Getting the structure without trial and error

You can wire this yourself. The roles aren't a secret, and the pattern above is enough to start. What takes the time is the coordination layer: the state machine for handoffs, the retry logic when a role fails, the gates wired to the right decision points. That is the part the thread kept circling without anyone offering a clean answer.

I'm building Paperclip Blueprints to ship that structure pre-configured: the role separation, the handoffs between roles, and the human-in-the-loop gates, generated as a ready-to-import setup so you start from a structured multi-agent team instead of a blank coordinator. It configures the team and its handoffs on top of the open-source runtime. It doesn't replace the runtime; it saves you the weeks of wiring the coordination and gates by hand.

If you want the role-based setup and the human-in-the-loop pattern the day it's ready, get on the list and I'll send it when it's live. In the meantime, Paperclip is open source on GitHub, and you can build the pattern above on top of it today. If you want the after-import setup in detail, see the full wiring walkthrough.
