AI Ops · On the bench

Meet Ops. The quiet engine. Runs the recurring reports, watches the dashboards, files the things that have to be filed, on schedule, without a nudge.

Ops

AI Ops · Operations

All green

↳ getting ready

Nightly health-check done. 3 services at 99.98% uptime, 4 auth tokens rotated, backups verified at 2.3TB. One prod DB resize flagged, outside the runbook, so it paged you.

99.98% · Uptime

4 · Tokens rotated

2.3TB · Backups

TL;DR: Ops runs the recurring work nobody owns, on schedule, so it stops slipping when someone's out.

On schedule

The nightly checks ran. Everything is green. Nobody had to remember.

Ops runs the recurring checks every night while you sleep, the ones that only get noticed when they're skipped. By the time you're up, the report is filed and the all-clear is in your channel.

Nightly health-check

02:00 · done 02:11

API latency p95 · 142ms · ok
Queue depth · 0 stuck · ok
Cert expiry sweep · none under 30d · ok
Disk + error budget · within limits · ok

All checks green · posted to #ops · filed
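A minimal sketch of how a nightly run could roll individual checks up into that one-line all-clear. The `Check` class, the `summarize` helper, and every limit below are hypothetical and chosen for illustration; only the measured values come from the card above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    value: float
    limit: float  # hypothetical threshold; the check passes when value <= limit

    @property
    def ok(self) -> bool:
        return self.value <= self.limit


def summarize(checks: list[Check]) -> str:
    """Return the one-line status a nightly run would post."""
    failed = [c.name for c in checks if not c.ok]
    if not failed:
        return "All checks green"
    return "Needs attention: " + ", ".join(failed)


# Values mirror the card above; the limits are made up for illustration.
nightly = [
    Check("API latency p95 (ms)", value=142, limit=300),
    Check("Queue depth (stuck jobs)", value=0, limit=0),
    Check("Certs expiring within 30d", value=0, limit=0),
]
print(summarize(nightly))  # All checks green
```

The point of the roll-up is that a single failing check changes the headline, so a skipped or degraded check can't hide behind a green summary.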

Inside the runbook

Tokens rotated, backups verified. Only the steps you scoped.

Ops runs the maintenance you put in the runbook: rotate the tokens on the cadence you set, take the backup, then actually restore-test it so a green checkmark means something. The actions it can take are the actions you signed off on, and nothing else.

Token rotation · weekly

4 rotated · old keys revoked · scoped to the 4 in your runbook

done

Backup verified · 2.3TB · restore test passed

Snapshot taken 01:40, restored to staging, row counts matched.

Wants to resize the prod database

⚑ outside the runbook · waiting for a human

Disk is at 71% and climbing. This action isn't in the scoped list, so Ops won't run it. It drafted the change and is holding for your call.

Send it · Edit
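One way to picture the scoped list is as an allowlist gate. This is a sketch under assumptions, not how Ops is actually built: the action names, `SCOPED_ACTIONS`, `Outcome`, and `attempt` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical scoped list: the only actions the runbook owner signed off on.
SCOPED_ACTIONS = frozenset({"rotate_tokens", "verify_backup"})


@dataclass
class Outcome:
    action: str
    status: str  # "done" or "waiting for a human"


def attempt(
    action: str,
    run: Callable[[], None],
    scoped: frozenset[str] = SCOPED_ACTIONS,
) -> Outcome:
    """Run the action only if it is in the scoped list; otherwise hold it."""
    if action not in scoped:
        # Out of scope: never call run(); the decision goes to a person.
        return Outcome(action, "waiting for a human")
    run()
    return Outcome(action, "done")


executed: list[str] = []
print(attempt("rotate_tokens", lambda: executed.append("rotate_tokens")))
print(attempt("resize_prod_db", lambda: executed.append("resize_prod_db")))
print(executed)  # only the scoped action actually ran
```

The gate defaults to holding: an action that isn't named in the list never executes, which matches "the actions you signed off on, and nothing else."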

Before it lapses

The renewals that always sneak up are already on your radar.

Ops watches the vendor contracts and sends the reminder before the auto-renew clock runs out, with the amount, the date, and the cancel-by window attached. The thing that lapses because nobody was tracking it stops lapsing.

Datadog · renews Jun 14

cancel-by Jun 11 · 3 days

$2,400/mo

Vercel · renews Jun 15

seats reconciled · 2 unused flagged

$960/mo

2 renewals next week · reminder drafted for you · ⌘⏎ send
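The cancel-by window is simple date arithmetic. A small sketch, assuming a hypothetical `Renewal` record and `reminder` helper; the year and the value of `today` are assumptions picked so the window comes out to the 3 days shown on the card.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Renewal:
    vendor: str
    renews_on: date
    cancel_by: date
    monthly_usd: int


def reminder(r: Renewal, today: date) -> str:
    """Format a reminder with the amount, the date, and the cancel-by window."""
    days_left = (r.cancel_by - today).days
    return (
        f"{r.vendor} · renews {r.renews_on:%b} {r.renews_on.day} · "
        f"cancel-by {r.cancel_by:%b} {r.cancel_by.day} · {days_left} days · "
        f"${r.monthly_usd:,}/mo"
    )


# Assumed year and "today"; the vendor, dates, and amount come from the card.
datadog = Renewal("Datadog", date(2025, 6, 14), date(2025, 6, 11), 2400)
print(reminder(datadog, today=date(2025, 6, 8)))
```

Counting down to the cancel-by date rather than the renewal date is the design choice that matters here: the reminder is only useful while there is still time to act.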

Every Friday

The weekly status, compiled. Nobody had to assemble it.

The report that gets skipped the week things are busy, the week it matters most, writes itself. Ops pulls the uptime, the deploys, the open incidents, and the cost trend into one status and posts it on time, every week.

Weekly ops status · wk 23

· Uptime 99.98% · 0 SEV-1, 1 SEV-3 closed
· 14 deploys · 0 rollbacks
· Backups 7/7 verified · 4 tokens rotated
· Cloud spend $18.4K · flat wk/wk

Posted to #leadership · Fri 09:00 · on time

Every line sourced. Click a number, see where it came from.

When it breaks at 3am

In an incident, Ops runs the allowed steps and pages a person for the call.

Ops doesn't guess its way through production. When an alert fires, it runs the runbook steps you've allowed, gathers the context from your past incidents, and pages a human for the decision. The judgment, and the call, stay with a person.

SEV-2 · checkout error rate spiking

03:14 · matches incident #218 (Mar)

Allowed runbook steps · run

Pulled the relevant logs + traces · done
Drained the bad node from the pool · done · scoped
Roll back the 02:58 deploy? · needs a human

Paged on-call · context attached

likely cause + the rollback decision, ready for your yes

The rollback is outside what Ops runs on its own. It's teed up with the diff and the blast radius. You make the call.

Send it · Edit

The benchmark

Against the standalone AI SRE tools, honestly.

Cleric, Resolve.ai, and incident.io are strong at autonomous incident investigation, and on raw root-cause speed they are the bar. The market's own conclusion is that the safe pattern is graded autonomy: scoped permissions, approval before anything touches production, a human on the call. That is where Ops is built to live, plus it's an employee on your team, not a separate console.

Dimension by dimension: Ops · Winsen vs. the standard (Cleric · Resolve.ai · incident.io)

Knows your runbook + company · Grounded in your docs and past incidents
Ops: Reads your runbook and past incidents from the brain.
The standard: Learn a graph from telemetry and your stack, runbook-free.

Approval-first / scoped to production · What it can touch without a human
Ops: Only scoped actions run; everything else waits.
The standard: Range from read-only to auto-remediation; it varies by tool.

Part of a team · Works alongside your other employees
Ops: One of a roster; hands off to the rest.
The standard: A standalone SRE tool or a console, not a colleague.

Owns the data · Where your operational knowledge lives
Ops: The brain is yours: portable and sourced.
The standard: Operational memory is vendor-held.

Deep autonomous root-cause · Where the dedicated tools lead
Ops: Runs the allowed steps, then pages a human.
The standard: Purpose-built for deep autonomous root-cause.

Hire vs build · How you bring it on
Ops: Hire an employee; it onboards on your runbook.
The standard: Buy and integrate a tool or platform.

Honest read: for deep autonomous root-cause on a sprawling microservice estate, a dedicated AI SRE built only for that will go deeper, and that's a fair reason to run one. Ops wins when you want the recurring ops work owned end to end, the production guardrails on by default, and one teammate who reads your runbook and your past incidents from a brain that's yours, not a vendor's.

Operations keeps

—Incidents
—Vendor escalations
—Runbook decisions

Ops takes

→Recurring reports
→Backups, rotations, checks
→Filing and reminders

The line Ops won't cross

Only touches the actions you've scoped. Anything outside the runbook waits for a human.

How it earns trust.

Nobody gets the keys on day one. Not even the AI.

Week 1 · Shadow

Watches and drafts. It learns your domain from the brain and drafts everything for your approval. You see exactly what it would do.

Week 2-4 · Supervised

Acts, you approve. It proposes real actions. You approve, edit, or kill, and every edit teaches it. Approval rates climb as it dials in.

Ongoing · Trusted

Routine on autopilot. You hand over the low-risk, repetitive work. The consequential calls still wait for you, by design.

Learned from

your runbook · your past incidents · your on-call patterns

Tools

Notion · Slack · PagerDuty · AWS · Ramp · Linear

The hand-off.

How Ops pings a human when it's your call.

“Ops: nightly checks are green, backups verified, two vendor renewals come up next week. Anything you want me to hold?”

FAQ

The honest answers.

No dodging, no contact-sales-to-find-out.

What happens in an incident?

It runs the runbook steps it's allowed to and pages a human for the call. The judgment stays with a person.

Is Ops available now?

On the bench. Waitlist teams get first access when the role ships.

Can it touch production?

Only the actions you've explicitly scoped. Everything else waits.

Ops is on the way.

AI Employees are sold separately. Waitlist folks get first dibs when the roster opens.

Work is better with Winsen.