Production reliability — a war story (NDA-abstracted)
The bug that made our alerts lie
On a production AI-visibility platform I operate, error alerting had been silently broken for months — a log-prefix bug from the process manager broke level extraction in the ingestion pipeline, so level="error" matched nothing. Found by querying the log store directly; fixed with one pipeline stage; verified via the Loki API.
Honest outcomes
Why
On a production AI-visibility (GEO) analytics platform I help run — abstracted here per NDA — the error alerts had been quiet for a long time. Green dashboards, no pages, no "error rate spiking" in chat. If you had asked me, I would have said things were healthy.
They were not healthy. They were silent, which is a different and much worse thing. A query that returns zero errors does not mean there are zero errors — it can also mean your ability to see them is broken, and every alert built on top of "error rate" has been evaluating to false the whole time. The quiet was a blind spot the size of the entire service.
My role on this platform is operational, and I am precise about that: I operate, validate, and stabilize it as the LLM-ops / observability / reliability engineer. I did not build the product and I am not a feature committer. This is the most credible operator-grade proof in my portfolio precisely because it is the kind of bug only someone running the system in anger would find.
"No alerts" and "no problems" are not the same signal — a healthy service and a broken alerting pipeline produce the identical observable: silence.
What
I only noticed because I went looking for a specific known error. The application had definitely logged errors that week — I had reproduced one from a user report — so I opened the logs, filtered to level="error", and got nothing. Not "a few." Zero. Across the entire API log stream, as far back as I scrolled.
I stopped trusting the dashboard and went one layer down, querying the Loki API directly instead of through the Grafana UI. Same result: error-level queries came back empty, even though the raw log lines containing the word "error" were right there in the unfiltered stream. So the logs were arriving — they just were not being labelled with a level the queries could filter on.
That narrowed it to the ingestion pipeline: the stage that reads each line and extracts a level so you can later query and alert on it. The process manager (PM2) was prefixing every line with its own bookkeeping before it reached the log shipper, and that prefix was enough to break level extraction. Almost every entry ended up effectively unlabelled, so level="error" matched nothing — not because nothing was an error, but because nothing was labelled one.
How
The fix itself was small: correct the shipping pipeline so it strips the prefix and parses the actual line, restoring real levels — a regex-and-output stage inserted before the JSON parse stage. Within minutes, level="error" lit up with the backlog it should have been showing all along, and level-based alerting started working again. I provided runnable Loki-API commands so the fix could be verified independently, not taken on faith.
The fix was ten minutes; the lesson was the whole point. A healthy service and a broken alerting pipeline produce the identical observable — silence — and your monitoring usually cannot tell you which one you are in. The only way to tell them apart is to deliberately check that your alerting can still fire.
What I do now: treat a zero as a question, not an answer, and cross-check it against a signal from a different path; test the alert path end to end on a schedule by emitting a synthetic error and confirming it trips the alert; and watch the seams between tools, because this bug lived in the boring handoff between the process manager and the log shipper — exactly the place no one owns and no test covers.
Where it stands
A silent severity-1 observability blind spot — level-based alerting broken across the entire API log stream for months by a log-prefix bug — diagnosed by querying the log store directly and fixed with a single ingestion-pipeline stage, with the fix verified via the Loki API.
Kept honest: the employer and product are abstracted per NDA, so this is told at the technique level with no proprietary internals. The debugging is mine; my role on the platform is operating and stabilizing it, not authoring its features. Team-reported cost figures elsewhere on that platform are the team’s own claims and are not part of this story.