unsaved159 2 days ago [-]
I am curious what are you using multiple agents for? In my experience, without supervision even the most advanced models degrade quickly.
mschwarz 2 days ago [-]
Day to day I run a dev pod (implementor + QA + frontend design), a review pod doing adversarial review with one Claude and one Codex, and an orchestrator pair. To illustrate real work being done: the longest single rig I've kept running continuously was about 4 days. That meant a large implementation spec being executed with a test-driven dev approach from obra's superpowers, plus independent deep contextual code reviews at milestones (my own skill pack), plus automated Vercel agent-browser testing along the way. So currently it's a closed SDLC loop that is only limited by the amount of work I give it. The "babysitting agents" part moves me up a layer: watching for spec drift and handling weird edge cases that come up. So it's not set-and-forget, but you can definitely have it work on something real overnight and get that "my agents shipped code while I slept" kind of outcome. I watch a demo video in the morning to see what they built, then do my own code-review spot checks of the PRs.
The original motivation for making OpenRig: this pattern works well (I've been doing this for months now, and I'm sure many people have gotten something similar to work), but the topology is fragile. Sessions die, your laptop needs a reboot, and you lose a setup that took weeks to perfect. OpenRig makes the topology itself a first-class thing: like docker-compose, but for the topology of Claude Code / Codex instances on your machine and all the specific context and configs you've fine-tuned.
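To make the docker-compose analogy concrete: the thread doesn't show OpenRig's actual file format, so every key and filename below is invented purely for illustration, but a declarative "rig file" in that spirit might look something like:

```yaml
# Hypothetical rig definition. Field names and paths are made up for
# illustration; this is NOT OpenRig's real schema.
rig: dev-pod
agents:
  implementor:
    runtime: claude-code          # which CLI to launch in this tmux pane
    context: ./context/implementor.md
  qa:
    runtime: codex
    context: ./context/qa.md
orchestrators:
  lead: orch-lead                 # the one agent you actually talk to
  standby: orch-peer              # shadows the lead; takes over on failure
```

The point of the analogy is that, as with docker-compose, the arrangement of processes and their configuration survives as a file even when the sessions themselves die.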
Regarding supervision (that is the key question, for sure): I can't really babysit more than 4-5 agents without feeling like I've lost the plot a bit. So the demo pod in the onboarding includes a pattern I use where two orchestrators run as a "high availability" pair, and I really only interact with one agent for the workstream, the orch-lead. The peer is there to monitor and absorb the lead's mental model in real time, and can take over the rig if the lead hits its context limit or something else goes wrong.
unsaved159 1 day ago [-]
What use cases did you find this approach works for, and what doesn't? Any observations on what topologies work better?
I tried doing the same for maintaining OSS projects. So far, the best I could manage is getting the agents to do ~80% of the work autonomously. But then I have to manually review each potential PR, and in almost every case work with an agent live, giving it guidance to fix things. This takes about as much time as working without the swarm. So far I've found the swarm most useful for initial scouting: mapping out what work needs to be done in the first place and storing it in a nice JSON file.
From my observations, all it takes is one mistake by an agent; from there, the architecture snowballs into chaos as future work builds on top of the incorrect initial approach.
mschwarz 1 day ago [-]
Yeah, I can definitely relate to the snowballing. I am mostly building web apps (Python/TypeScript), so YMMV. Have you tried pairing Codex with Claude? It's the gateway drug for agent topologies and definitely worth trying. Claude is better at understanding your intent, but at the expense of making lots of mistakes; Codex makes fewer mistakes, but at the expense of over-engineering. Together they are not perfect, but significantly more accurate; they complement each other well. So Codex reviews Claude, and using TDD makes it even better because Codex will gate each change Claude makes. You can apply this pattern to implementation, reviews, PM, even research. OpenRig has a spec called implementation-pair which lets you try this pretty easily. There is another called adversarial-review, which is the same topology with different starter context/instructions to make the agents less constructive and more combative. You'll get a feel for which one a task needs pretty quickly. Lots of people have made this pattern into skills, but I think OpenRig is probably the easiest happy path to try it, because the two agents can literally type into each other's terminals using "rig send" and "rig capture" and see each other's screens through tmux, as if you were the one typing the commands. But now you just sit back and watch them find and fix bugs. You don't need OpenRig to do this, just tmux, but raw tmux is a little fiddly to get working, which is why I made the rig send command as a tmux wrapper.
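For anyone who wants to try the raw-tmux version of this before reaching for a wrapper: the two primitives described above map onto tmux's `send-keys` and `capture-pane`. This is a sketch of the mechanism only (the session name and echoed text are made up, and this is not OpenRig's actual implementation of `rig send` / `rig capture`):

```shell
# Start a detached session standing in for the peer agent's terminal.
tmux new-session -d -s peer

# "Type into the other agent's terminal" -- what a rig-send-style
# wrapper does under the hood.
tmux send-keys -t peer 'echo reviewed: looks good' Enter

# Give the command a moment to run, then "read the peer's screen" --
# what a rig-capture-style wrapper does. -p prints the pane to stdout.
sleep 1
screen="$(tmux capture-pane -p -t peer)"
printf '%s\n' "$screen"

# Clean up the demo session.
tmux kill-session -t peer
```

Because `capture-pane -p` returns the literal pane contents, the capturing agent sees exactly what a human watching that terminal would see, prompt and all.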
mschwarz 2 days ago [-]
OP here. Happy to answer questions or go deep on specifics.
Some topics I've been asked about: tmux as a transport primitive (actually a pretty nice unlock), how snapshot/restore actually works in practice, how this differs from a harness framework, why I didn't just build this into Claude Code, why I think the topology layer needs to stay independent from any one vendor's platform, etc.
(14 years lurking on HN. First post.)