Humans Stay in the Loop

I run a small AI chat app. Part of what users do is roleplay with AI characters, sometimes in dark or adult themes. As with any platform like this, some people try to push into places they shouldn't — attempts to sexualize minors, real-world political incitement, jailbreak attempts against the AI. We built a layered moderation pipeline to catch that. The final step is always a human. Never an automated ban. This post is about why.

How We Moderate (Short Version)

We used OpenAI's classifier as a first gate — it returned raw scores for things like "sexual" or "violent." It's a blunt tool; on its own it can't tell a bar-fight scene from real-world harm. So when it tripped, we sent the flagged conversation to a second, smarter model with a carefully written system prompt. That prompt began, literally:

"You are a content safety reviewer for an AI roleplay chat platform. A lightweight screening model flagged a conversation for potential abuse. Your job is to VERIFY whether the flag is correct or a false positive."

We asked it to return a structured decision: was this pedo, incest, non-consensual assault, a jailbreak attempt, gore outside fiction framing, or real-world geopolitics? If the answer was "yes, it's a real violation," the case went to a human reviewer. Only the human can decide to ban.
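The routing described above can be sketched roughly like this. This is an illustrative reconstruction, not our production code: the threshold values, the `geopolitics` field name, and the function names are assumptions I'm making for the example.

```python
import json

# Hypothetical first-gate thresholds -- the real values aren't stated in this post.
THRESHOLDS = {"sexual": 0.5, "violence": 0.5}

# The (quoted) opening of the second-pass system prompt.
REVIEW_SYSTEM_PROMPT = (
    "You are a content safety reviewer for an AI roleplay chat platform. "
    "A lightweight screening model flagged a conversation for potential abuse. "
    "Your job is to VERIFY whether the flag is correct or a false positive."
)

def needs_second_pass(scores: dict) -> bool:
    """Stage 1: blunt classifier. Raw category scores trip the first gate."""
    return any(scores.get(cat, 0.0) >= t for cat, t in THRESHOLDS.items())

def route(review: dict) -> str:
    """Stage 3: act on the second model's structured verdict.
    Only confirmed violations reach the human queue; the human,
    never the pipeline, decides whether to ban."""
    flags = ("pedo", "incest", "rape", "jailbreak", "gore", "geopolitics")
    if any(review.get(f, False) for f in flags):
        return "human_review_queue"
    return "false_positive_dismissed"
```

The point of the shape is that no branch of `route` ends in a ban: the worst outcome the automation can produce on its own is a ticket in a queue that a person reads.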

We built it this way because automated moderation reliably gets nuance wrong. You've seen the pattern — Google, YouTube, Meta, Reddit. A classifier sees a pattern in isolation, takes a consequential action against a real person, and the person can't get anyone on the other side to look.

What Just Happened To Us

OpenAI terminated our API access, citing "Child Sexualization Activity." The traffic they're citing is the traffic we sent to be reviewed and rejected. We were literally their customer for the purpose of keeping this content off the internet.

Here's the punchline. Every request we sent them started with a system prompt that explicitly identified the request as moderation:

"You are a content safety reviewer..."

"You are a content validator for a roleplay chat application..."

And every request asked for a structured response back — a JSON object with fields named "pedo," "incest," "rape," "jailbreak," "gore." That was the output of a safety team. It was not the output of a generation pipeline. A human reviewer reading a single one of our calls for 30 seconds would understand the entire use case.
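To make that contract concrete: here's roughly how a reply with those fields gets parsed on our side. The function name and the escalation statuses are illustrative assumptions; the defensive stance — malformed model output escalates to a human instead of being silently dropped — is the actual design principle.

```python
import json

# The boolean fields named in the request, per the post.
EXPECTED_FLAGS = ["pedo", "incest", "rape", "jailbreak", "gore"]

def parse_review_reply(raw: str) -> dict:
    """Parse the reviewer model's JSON reply defensively (illustrative sketch).
    Anything malformed or incomplete is escalated to a human rather than
    treated as a clean 'no violation'."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"status": "escalate_malformed"}
    if not all(isinstance(data.get(f), bool) for f in EXPECTED_FLAGS):
        return {"status": "escalate_incomplete"}
    return {"status": "parsed", "flags": data}
```

A request/response pair built this way is self-describing: the system prompt says "reviewer," and the schema says "verdict." That's what makes the misclassification so hard to excuse.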

We wrote an appeal after their first warning last week explaining all of this. Based on the timing and the form of the termination notice, I do not believe any human has read it yet. An automated system banned us, based on flags from another automated system that never read the system prompts on the very API calls it was judging.

The kicker: this is the failure mode our own pipeline was built to prevent. A blunt classifier flags content out of context; we don't act on it until a human reviews. We treat that principle as non-negotiable even for a free user of our tiny app. OpenAI did not extend the same courtesy to a paying customer.

Why This Matters Beyond Us

Go read the 2022 NYT story about the dad who took medical photos of his sick toddler at the pediatrician's request and lost his Google account forever — the algorithm flagged it as CSAM, Google reported him to police, and no appeal reached a human being.

That story is not an edge case. It's the template:

  • A machine sees something that looks bad in isolation.
  • It takes a serious action against a human.
  • The human cannot get a human on the other side.

The corporate defense is always the same — "we can't manually review everything, we operate at scale." That's true. It's also an argument for building appeals pipelines that route to humans for the decisions that matter most, not for skipping humans entirely.

The Principle

I work with AI every day, and I'll say this plainly: AI is astonishingly good at narrowing the haystack. It is not trustworthy as the final arbiter. A classifier doesn't understand that a fictional kidnapping isn't a kidnapping. A toxicity filter cannot tell friends joking apart from harassment. A sexual-content detector cannot distinguish "a platform moderating for abuse" from "a platform generating abuse" — even when the system prompt on the request literally says "moderating for abuse." The difference in every case is context, and context is where humans still clearly beat AI.

Our app will survive this. We'll migrate, we'll ship, we'll keep moderating. This isn't a complaint post — it's a warning. Because the pattern of "automated decision, no human review" is already spreading well past AI companies banning each other. It's in insurance claims, benefits approvals, hiring filters, visa screening, medical flags, background checks. In every one of those systems, the cost of a false positive lands on a person who has no one to call.

So here's where we stand, publicly:

When an automated decision can seriously harm a human, a human must be in the loop before the harm lands. Not as a theoretical appeal form that routes to /dev/null. Before.

If you build AI-driven systems: fund the appeals path. Staff it. Don't treat the appeal form as PR. If you use them: push back on every company that skips this step. And the companies building the underlying models — the ones who, more than anyone, should know what these systems can and can't do — owe it to everyone else to lead by example, not set the worst example.

Humans stay in the loop. That's the line.

— Rudolf, AICHIKI