Craft · 9 min read · June 8, 2026

What we learned building AI workflows.

Durable, multi-step work that runs without a human triggering each step is mostly an exercise in runtime, restraint, and refusing to trust your own demo. Here is what held up.

The Winsen Team

There is a version of building AI workflows that demos beautifully and dies on contact with a Tuesday. You wire up a clever prompt, it does the thing once on stage, and then it runs unattended for a week and you discover all the ways the world is messier than your test case. We have shipped the version that demos great and falls over by Wednesday more than once. This post is about what we learned building the version that survives the week.

By workflow we mean the hard kind: durable, multi-step work that runs without a human pressing go on each step. An AI employee works a review, waits for new signal, updates a record, escalates the one thing that needs you. Not a chat turn. A process that has to be alive whether or not anyone is watching. Almost everything interesting about building these lives below the model, in the parts nobody screenshots.

The runtime matters more than the model

The instinct when something goes wrong is to reach for a smarter model. Usually that is the wrong knob. A better model writes a better draft. It does not make the work cheap enough to leave running, and it does not give the work the right shape. Those two problems are runtime problems, and a runtime is the thing we kept under-investing in until it hurt.

Take cost. An always-on workflow is not one expensive call. It is a thousand cheap ones, most of which find nothing worth doing. Most wake-ups should never touch a frontier model. A cheap check on a small model decides whether anything actually happened; only the rare wake-up that finds real work pays for real reasoning. Do it the naive way and a single worker that wakes every fifteen minutes can burn a few dollars a day finding nothing. Do it the tiered way and that same idle day costs a few cents. Our runtime budgets work the way a body budgets calories: trivial steps cost almost nothing, so the rare expensive step is affordable. The point is not frugality for its own sake. The point is that autonomy is only useful if it is affordable to leave on, and most autonomy is priced to be turned off.
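A minimal sketch of the tiered wake-up described above, with the arithmetic behind "a few dollars" versus "a few cents". The names (`handleWakeUp`, `dailyCost`) and both per-call prices are illustrative assumptions, not Winsen's runtime or real model pricing.

```typescript
// Hypothetical per-call costs in dollars. Real prices depend on model and token counts.
const CHEAP_CHECK_COST = 0.0005;
const FRONTIER_COST = 0.04;

type WakeResult = { escalated: boolean; cost: number };

// One wake-up: a cheap check decides whether the expensive model runs at all.
function handleWakeUp(
  hasRealWork: () => boolean, // stands in for a small-model check
  deepReason: () => void,     // stands in for the frontier-model call
): WakeResult {
  if (!hasRealWork()) {
    return { escalated: false, cost: CHEAP_CHECK_COST };
  }
  deepReason();
  return { escalated: true, cost: CHEAP_CHECK_COST + FRONTIER_COST };
}

// Cost of one day of wake-ups, given how many of them find real work.
function dailyCost(wakeUpsPerDay: number, withWork: number, tiered: boolean): number {
  if (!tiered) {
    return wakeUpsPerDay * FRONTIER_COST; // naive: every wake-up pays full price
  }
  return wakeUpsPerDay * CHEAP_CHECK_COST + withWork * FRONTIER_COST;
}

const wakeUpsPerDay = (24 * 60) / 15;                    // 96 wake-ups, one every fifteen minutes
const naiveIdleDay = dailyCost(wakeUpsPerDay, 0, false); // about 3.84 dollars
const tieredIdleDay = dailyCost(wakeUpsPerDay, 0, true); // about 0.048 dollars
```

Under these assumed prices, an idle day costs roughly eighty times less when tiered. The exact ratio is simply the ratio of the two per-call prices, so it moves with whatever models fill each tier.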

Then there is structure. The honest truth about a multi-step task is that you cannot draw the org chart in advance. A static pipeline of step one, step two, step three is great until the input that does not fit, and then it either crashes or does the wrong thing with great confidence. So the runtime plans the steps when it sees the actual input, not three months ahead. Depending on what the task turns out to need, it can fan out to specialists, collapse to a single call, or loop. The shape is an output, not a constraint. That sounds abstract until you have watched a rigid pipeline route a refund request into the onboarding flow because that was step two.
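One way to read "the shape is an output" is that the plan is a value computed from the input rather than a pipeline fixed in code. The sketch below assumes a hypothetical `IncomingTask` that some upstream classifier has already filled in. The `Plan` type, the handler names, and the iteration cap are all invented for illustration.

```typescript
// The plan is data produced at run time, so its shape can differ per input.
type Plan =
  | { kind: "single"; step: string }
  | { kind: "fanOut"; specialists: string[] }
  | { kind: "loop"; step: string; maxIterations: number };

// Hypothetical description of what the input turned out to need.
interface IncomingTask {
  topics: string[];         // e.g. ["refund"] or ["billing", "shipping"]
  needsRefinement: boolean; // true when a draft must be iterated on
}

function planFor(task: IncomingTask): Plan {
  if (task.needsRefinement) {
    // Loop: repeat a draft-and-review step, with a hard cap on iterations.
    return { kind: "loop", step: "draft-and-review", maxIterations: 3 };
  }
  if (task.topics.length > 1) {
    // Fan out: one specialist per topic found in the input.
    return { kind: "fanOut", specialists: task.topics.map((t) => `${t}-specialist`) };
  }
  // Collapse: a single call routed by what the input is actually about.
  return { kind: "single", step: `${task.topics[0] ?? "general"}-handler` };
}
```

With this shape, a refund request yields `{ kind: "single", step: "refund-handler" }` because the route comes from the input, not from its position in a fixed sequence.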

You cannot prompt your way out of a runtime problem.

Durable, observable, retryable beats a prompt and a prayer

The most expensive lesson was the cheapest to state. A long-running workflow is a distributed system, and you do not get to opt out of that just because the steps are written in English. The model call that times out at step seven, the API that rate-limits at the worst moment, the process that dies because the instance got recycled mid-deploy: these are not edge cases you will harden later. They are Tuesday.

Early on we ran workflows as scripts that held state in memory and hoped. When a run failed at step six of nine, it failed all the way back to step one, redoing the five steps that worked, sometimes with side effects, occasionally emailing someone twice. The fix was not a smarter prompt. It was treating execution as the real product. We run durable workflows on infrastructure where every step is checkpointed, retries are first-class, and a run that dies resumes from where it stopped instead of from the beginning.

Three properties paid for themselves, and we would not ship a workflow without them now:

→Durable: state lives outside the process, so a crash, a deploy, or a recycled instance is a pause, not a loss. The run picks up where it was, not where it started.
→Observable: you can see every step a workflow took, what it received, what it decided, and where it stalled. A workflow you cannot inspect is a workflow you cannot trust, and you will eventually turn it off out of fear.
→Retryable: a failed step retries on its own with backoff, and the steps that already succeeded are not re-run. Idempotency stops being a nice-to-have and becomes the difference between a glitch and a customer getting charged twice.

None of this is glamorous. None of it shows up in a demo. All of it is the difference between a workflow you trust to run while you sleep and one you babysit until you kill it.
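The durable and retryable properties above can be sketched in a few lines. This is a toy under stated assumptions, not the infrastructure the post describes: a `Map` stands in for a store that outlives the process, steps are synchronous to keep the example short, and `wait` is injected so the backoff can be observed without sleeping. All names are hypothetical.

```typescript
// runId -> names of steps that have already succeeded.
// In a real system this would live in a database or a workflow engine, not in memory.
type CheckpointStore = Map<string, Set<string>>;

interface WorkflowStep {
  name: string;
  run: () => void; // throws on failure
}

// Exponential backoff with a cap: 100 ms, 200 ms, 400 ms, ... up to capMs.
function backoffMs(attempt: number, baseMs = 100, capMs = 5000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

function runWorkflow(
  runId: string,
  steps: WorkflowStep[],
  store: CheckpointStore,
  wait: (ms: number) => void,
  maxAttempts = 3,
): void {
  const done = store.get(runId) ?? new Set<string>();
  store.set(runId, done);

  for (const step of steps) {
    if (done.has(step.name)) continue; // succeeded in an earlier run: never re-run

    for (let attempt = 0; ; attempt++) {
      try {
        step.run();
        done.add(step.name); // checkpoint only after the step succeeds
        break;
      } catch (err) {
        if (attempt + 1 >= maxAttempts) throw err; // give up, earlier checkpoints survive
        wait(backoffMs(attempt));
      }
    }
  }
}
```

Calling `runWorkflow` again with the same `runId` skips every checkpointed step and resumes at the one that failed. The sketch does not cover a crash that lands between a step's side effect and its checkpoint, which is exactly the window where idempotent steps matter.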

Figure: workflow proposal-from-rfp, run #2291. Trigger (new RFP hits the inbox) → Research (pull the 3 closest matters) → Draft (SOW against real rates) → Human, waiting on you (approve the pricing) → Send (out the door).

A run that dies at step seven resumes at step seven, then stops for you before anything leaves the building.

Approval gates are a feature, not a limitation

The loudest version of autonomy is the one that does everything by itself and tells you afterward. We build the other one on purpose. AI drafts and proposes; the consequential calls wait for a human. People assume this is a phase we will outgrow once the models get good enough. They have the asymmetry backwards.

The cost of pausing for approval is a few seconds of a human's attention. The cost of not pausing is the outbound email that should never have been sent, the record updated wrong across forty downstream places, the thing you cannot unsend. A smarter model makes a more convincing mistake, not a smaller one. So in our workflows the boring, reversible steps just happen, and anything outward-facing or irreversible stops and shows you exactly what it is about to do. The gate is not friction bolted onto autonomy. It is the part that makes autonomy something a real company will actually turn on.
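A minimal sketch of that split, assuming each action can be labeled reversible or not up front. The `ApprovalGate` class and its method names are hypothetical. The point is only that an irreversible action is held, shown, and executed on approval, never before.

```typescript
interface GatedAction {
  description: string; // what the human sees before deciding
  reversible: boolean; // false for anything outward-facing or that cannot be undone
  execute: () => void;
}

type Outcome = "executed" | "awaiting_approval";

class ApprovalGate {
  private pending: GatedAction[] = [];

  // Reversible actions run immediately. Everything else waits for a person.
  submit(action: GatedAction): Outcome {
    if (action.reversible) {
      action.execute();
      return "executed";
    }
    this.pending.push(action);
    return "awaiting_approval";
  }

  // Exactly what is about to happen, in order, for the approver to read.
  preview(): string[] {
    return this.pending.map((a) => a.description);
  }

  approve(index: number): void {
    const [action] = this.pending.splice(index, 1);
    if (!action) throw new Error(`no pending action at index ${index}`);
    action.execute();
  }

  reject(index: number): void {
    const [action] = this.pending.splice(index, 1);
    if (!action) throw new Error(`no pending action at index ${index}`);
    // Dropped without executing.
  }
}
```

In a durable workflow the pending list would itself be checkpointed state, so a run paused at the gate survives a restart the same way any other step does.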

The surprise was how the gates teach the system. Every approval is a labeled example: this draft was good, that escalation was noise, this one you would have wanted handled silently. Run a workflow for a month and the gates tell you precisely which steps have earned the right to stop asking. Trust gets promoted to autopilot one boring confirmation at a time, on the operator's terms, which is the only way anyone sane hands a process the keys.
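If every approval is a labeled example, the promotion decision can be computed from the gate's own history. The sketch below is one possible rule, not the one the post's authors use. The verdict labels, the 50-decision window, and the 98% threshold are assumptions chosen for illustration, and the function only recommends, leaving the switch to the operator.

```typescript
type Verdict = "approved" | "edited" | "rejected";

interface StepHistory {
  step: string;
  verdicts: Verdict[]; // oldest first, one entry per human decision at the gate
}

// True when the most recent `minSamples` decisions were approved, untouched,
// at or above `minApprovalRate`. Edits count against the step: an edited draft
// is one the human would not have wanted sent as-is.
function readyForAutopilot(
  history: StepHistory,
  minSamples = 50,
  minApprovalRate = 0.98,
): boolean {
  const recent = history.verdicts.slice(-minSamples);
  if (recent.length < minSamples) return false; // not enough evidence yet
  const approved = recent.filter((v) => v === "approved").length;
  return approved / recent.length >= minApprovalRate;
}
```

Only the most recent window counts, so a step that was noisy in its first week can still earn promotion later, and a step that regresses stops qualifying.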

DECISIONS.log

14:02 · Phased cutover, not big-bang · thread #eng-sso
13:30 · Redis over in-memory sessions · ADR-12
11:18 · Hold the docs PR until the fix lands · PR #1286

Every approval is a labeled example, and the paper trail is how trust gets promoted to autopilot one boring confirmation at a time.

The lessons that cost us the most

Three things we learned the hard way, in the order they hurt.

Relevance over recall. The first instinct with a retrieval layer feeding a workflow is to stuff the context window with everything that might be related. More is more. It is not. A workflow drowning in forty loosely-relevant documents makes worse decisions than one handed the three that matter, and it costs more to do it. When we cut the context from forty documents to three, decision quality went up and the per-run cost dropped by roughly an order of magnitude. The skill that moved those numbers was not retrieving more. It was retrieving less, better, with a source on every fact so the workflow could tell what it actually knew from what it was pattern-matching. Recall is easy and feels like progress. Relevance is hard and is the actual job.
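"Retrieving less, better, with a source on every fact" can be sketched as a filter in front of the prompt. This assumes the retriever already returns a relevance score between 0 and 1 per document. The `RetrievedDoc` shape, the cutoff of 0.75, and the bracketed source tag are illustrative choices.

```typescript
interface RetrievedDoc {
  source: string; // where the fact came from, e.g. a file path or record id
  text: string;
  score: number;  // relevance from the retriever, assumed to be in [0, 1]
}

// Keep at most k documents, and only those above a relevance floor.
// Returning fewer than k is fine: a weak match is worse than no match.
function selectContext(candidates: RetrievedDoc[], k = 3, minScore = 0.75): RetrievedDoc[] {
  return candidates
    .filter((d) => d.score >= minScore)  // filter copies, so the input is not mutated
    .sort((a, b) => b.score - a.score)   // most relevant first
    .slice(0, k);
}

// Every fact enters the prompt with its source attached.
function renderContext(docs: RetrievedDoc[]): string {
  return docs.map((d) => `[source: ${d.source}] ${d.text}`).join("\n");
}
```

The floor matters as much as the cap. With only a top-k cut, a query with no good matches still fills the window with the three least-bad documents.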

Scope before autonomy. The failed workflows were almost always the over-scoped ones: handle all of support, own the whole pipeline, do everything a person in this role does. Ambition with no edges does not survive a week. The ones that worked had a job small enough to define, succeed at, and verify. An employee with a tight scope that nails it beats one with a vague scope that is sometimes brilliant and sometimes inexplicable. Widen the scope after the narrow version proves it out, never before. We have never once regretted scoping too tight, and we have a graveyard of workflows we scoped too wide.

Evals over vibes. The most dangerous phrase in this work is "it seems to be working well." It seems fine because you watched the three runs that went well and inferred the other three hundred did too. They did not. The only thing that told us the truth was a real evaluation set: known inputs, known good outputs, a score that moves when we change a prompt or a model, run before anything ships. Evals are tedious to build and unglamorous to maintain, and they are the single thing that separates a workflow you can defend from one you are guessing about. Vibes scale to a demo. Evals scale to production.
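The smallest useful version of such an evaluation set fits in a few lines. The sketch below assumes the workflow under test can be called as a function from input to output, and it scores by exact match, the crudest possible grader. `EvalCase`, `runEvals`, and `safeToShip` are hypothetical names.

```typescript
interface EvalCase {
  input: string;
  expected: string; // the known good output for this input
}

type WorkflowUnderTest = (input: string) => string;

interface EvalReport {
  score: number;        // fraction of cases passed, 0 to 1
  failures: EvalCase[]; // the cases to read when the score drops
}

function runEvals(cases: EvalCase[], workflow: WorkflowUnderTest): EvalReport {
  const failures = cases.filter((c) => workflow(c.input).trim() !== c.expected);
  const score = cases.length === 0 ? 0 : (cases.length - failures.length) / cases.length;
  return { score, failures };
}

// Gate a change on the score not regressing against the last shipped baseline.
function safeToShip(report: EvalReport, baselineScore: number): boolean {
  return report.score >= baselineScore;
}
```

Exact match suits outputs like routing labels. Free-text outputs need a rubric or a grader, but the loop is the same: fixed cases, one number, checked before the change ships.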

We did this to ourselves first

None of this is theory we sell and do not eat. We learned most of it the same way we tell customers to: by making AI-built software hold up under its own runs. Rocketman, our open-source project hub, exists because we got tired of AI coding work that demoed great and fell over on the second click. The fix is to spend the tokens, but with a plan, a memory, and a paper trail, so the output ships instead of just looking good in a video. The hub itself is one self-contained offline HTML file in the repo, zero dependencies, because context should live where the work happens. An AI coding teammate reads a file. No API token, no OAuth wall, no rate limiter, no "the integration is down."

We leaned on what Thariq Shihipar called the unreasonable effectiveness of HTML: agents are remarkably good at producing it, and it turns a doc you would skim into one you would actually read. A kanban, a diff, a decision tree, visible at a glance. The board is a real work queue. A parent agent allocates ready tasks to a fleet of sub-agents, every item is attributed to a human or a specific model, and the whole thing version-controls alongside the code so project state never drifts away from the code the way a board in a separate tool does. Same lessons, smaller blast radius: durable, observable, scoped, and never trusted on vibes. If you want to take the runtime ideas for a walk before betting a company on them, it is one command, npx @winsendotai/rocketman, and it dogfoods itself, so its own roadmap lives in its own hub.
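The allocation step described above, where a parent agent hands ready tasks to sub-agents and every item carries an owner, might look like the sketch below. This is not Rocketman's schema or code. `BoardTask`, `Owner`, and `allocate` are invented here to show the idea under the assumption that tasks declare their dependencies.

```typescript
// Every item is attributed to a human or to a specific model.
type Owner =
  | { kind: "human"; name: string }
  | { kind: "model"; model: string };

interface BoardTask {
  id: string;
  status: "ready" | "in_progress" | "done";
  dependsOn: string[]; // ids that must be done before this task can start
  owner?: Owner;
}

// Hand each unblocked ready task to the next free sub-agent.
// Returns a new board and leaves the input untouched.
function allocate(tasks: BoardTask[], freeAgents: string[]): BoardTask[] {
  const done = new Set(tasks.filter((t) => t.status === "done").map((t) => t.id));
  const free = freeAgents.slice();

  return tasks.map((t): BoardTask => {
    const unblocked = t.dependsOn.every((id) => done.has(id));
    const model = t.status === "ready" && unblocked ? free.shift() : undefined;
    if (model === undefined) return t; // not ready, blocked, or no agent left
    return { ...t, status: "in_progress", owner: { kind: "model", model } };
  });
}
```

Because the board is plain data, the result can be written back to a file in the repository and versioned alongside the code it describes.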

PR #1284 · session-pool refactor · ✓ 142 checks passed

src/auth/session.ts
-const pool = new Map()
+const pool = new RedisPool(env.REDIS_URL)
 return pool.acquire(sessionId)

ENG · Dev: Matches your pool conventions and the tests are green. One flag: line 42 drops the retry on a cold connection. I left a comment. Not merging, that call is yours.

Merge · Request changes · waiting on a human

A diff you would actually read, with tests green and a human on the approve row, because the consequential call still waits for a person.

The short version, for the founder deciding where to spend the next month of engineering: do not start with the model. Start with the runtime that makes the work cheap enough to leave on and durable enough to survive a recycled instance. Put a human on the irreversible calls and call it a feature, because it is one. Retrieve less and better. Scope tight, then widen. And build the boring eval set before you trust a single run you did not personally watch. The model gets the applause. The runtime, the gates, the evals, the parts nobody screenshots, are what decide whether your workflow is still running a month from now. That is where the engineering goes.
