oblangatas 2 days ago [-]
"AI agents don't crash, they just quietly give wrong answers" is painfully accurate. Quality RCA is a good direction.
Curious how well the generated hypotheses generalize beyond obvious prompt issues. IMHO lots of failures come from interactions between retrieval, tool selection, and state, more than from a single not-so-good description.
Do you find the clustering surfaces such multi-factor issues, or does it tend to collapse them into prompt-level fixes?
tomerhamam 2 days ago [-]
Clustering angle makes sense — individual failures look like noise until you see 50 of them together. Curious tho: can Kelet bootstrap from just user feedback signals, or does it need the SDK instrumentation in place first?
liranka 2 days ago [-]
I've been building agent-based stuff recently, and debugging is indeed pretty painful. It's hard to even understand where things break.
Must say, this looks interesting, I like the idea of clustering failures and trying to find patterns. Feels like something that should exist.
Going to give it a try :)
jldugger 3 days ago [-]
Every six months or so, someone at work does a hackathon project to automate outage analysis work SRE would likely perform. And every one of them I've seen has been underwhelming and wrong.
There's like three reasons for this disconnect.
1. The agents aren't expert at your proprietary code. They can read logs and traces and make educated guesses, but there's no world model of your code in there.
2. The people building these apps are unqualified to review the output. I used to mock narcissists evaluating ChatGPT quality by asking it for their own biography, but they're at least using a domain they are an expert in. Your average MLE has no profound truths about Kubernetes or the app. At best, they're using some toy "known broken" app to demonstrate under what are basically ideal conditions, but part of the holdout set should be new outages in your app.
3. SREs themselves are not so great at causal analysis. Many junior SREs take the "it worked last time" approach, but this embeds a presumption that whatever went wrong "last time" hasn't been fixed in code. Your typical senior SRE takes a "what changed?" approach, which is depressingly effective (as it indicates most outages are caused by coworkers). At the highest echelons, I've seen research papers examining metastability and Granger causality networks, but I'm pretty sure nobody in SRE or these RCA agents can explain what they mean.
> The key insight: individual session failures look random. But when you cluster the hypotheses, failure patterns emerge.
My own insight is mostly Bayesian. Typical applications have redundancy of some kind, and you can extract useful signals by separating "good" from "bad". A simple Bayesian score of (100+bad)/(100+good) does a relatively good job of removing the "oh that error log always happens" signals. There's also likely a path using ClickHouse-level data and Bayesian causal networks, but the problem is that traditional Bayesian networks are hand-crafted by humans.
So yea, you can ask an LLM for 100 guesses and do some kind of k-means clustering on them, but you can probably do a better job doing dimensional analysis first and passing that on to the agent.
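The (100+bad)/(100+good) score above can be sketched in a few lines. This is only an illustration under assumptions: the function name `smoothed_scores`, the session layout (each session as a set of log signals), and the sample data are all hypothetical, and "bad" and "good" are read as the number of bad and good sessions in which a signal appears.

```python
from collections import Counter

def smoothed_scores(bad_sessions, good_sessions, prior=100):
    """Score each signal as (prior + bad_count) / (prior + good_count).

    A signal that shows up equally often in good and bad sessions scores
    close to 1.0; a signal seen mostly in bad sessions scores well above 1.0.
    """
    bad = Counter(sig for s in bad_sessions for sig in set(s))
    good = Counter(sig for s in good_sessions for sig in set(s))
    signals = set(bad) | set(good)
    return {sig: (prior + bad[sig]) / (prior + good[sig]) for sig in signals}

# Hypothetical data: each session is the set of log signals seen in it.
bad_sessions = [{"cache_miss_warning", "db_timeout"}] * 300
good_sessions = [{"cache_miss_warning"}] * 300

scores = smoothed_scores(bad_sessions, good_sessions)
for sig, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{sig}: {score:.2f}")
```

The constant 100 acts as a prior: a signal seen only a handful of times cannot move far from 1.0, so rare one-off errors do not dominate the ranking.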
almogbaku 3 days ago [-]
Hi @jldugger
Great points, but I think there's a domain confusion here. You're describing infra/code RCA. Kelet does AI agent quality RCA: the agent returns a 200 OK, but gives the wrong answer.
The signal space is different. We're working with structured LLM traces + explicit quality signals (thumbs down, edits, eval scores), not distributed system logs. Much more tractable.
Your Bayesian point actually resonates — separating good from bad sessions and looking for structural differences is close to what we do. But the hypotheses aren't "100 LLM guesses + k-means." Each one is grounded in actual session data: what the user asked, what the agent did, what came back, and what the signal was.
Curious about the dimensional analysis point — are you thinking about reducing the feature space before hypothesis generation?
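As a rough illustration of "separating good from bad sessions and looking for structural differences", here is a minimal sketch. It is not Kelet's actual method: the session layout, the feature labels, and the function `failure_rate_by_feature` are hypothetical and only show the general idea of comparing failure rates per structural feature.

```python
from collections import defaultdict

def failure_rate_by_feature(sessions):
    """Return {feature: (bad_count, total_count, bad_rate)} across sessions."""
    counts = defaultdict(lambda: [0, 0])  # feature -> [bad, total]
    for s in sessions:
        for feat in set(s["features"]):
            counts[feat][1] += 1
            if s["signal"] == "thumbs_down":
                counts[feat][0] += 1
    return {f: (b, t, b / t) for f, (b, t) in counts.items()}

# Hypothetical sessions: structural features plus an explicit quality signal.
sessions = [
    {"features": ["tool:search", "retrieval:empty"], "signal": "thumbs_down"},
    {"features": ["tool:search", "retrieval:empty"], "signal": "thumbs_down"},
    {"features": ["tool:search"], "signal": "thumbs_up"},
    {"features": ["tool:calculator"], "signal": "thumbs_up"},
]

rates = failure_rate_by_feature(sessions)
for feat, (bad, total, rate) in sorted(rates.items(), key=lambda kv: -kv[1][2]):
    print(f"{feat}: {bad}/{total} bad ({rate:.0%})")
```

With so few sessions the rates are noisy, so in practice a smoothed score is likely a safer ranking than the raw rate.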
yanovskishai 3 days ago [-]
I imagine it's hard to create a very generic tool for this use case. What are the supported frameworks/libs, and what does this tool assume about my implementation?
RoiTabach 3 days ago [-]
This looks Amazing
Do you have a LiteLLM integration?
almogbaku 3 days ago [-]
Hi @RoiTabach, OP here
Yep. We can integrate with every solution that supports OpenTelemetry :) so it's pretty native, just use the integration skill
npx skills add kelet-ai/skills
oblangatas 2 days ago [-]
Regardless, note that LiteLLM has recently been heavily compromised, so pay attention to which version you're downloading.
BlueHotDog2 3 days ago [-]
nice. what a crazy space.
how is this different vs other telemetry/analysis platforms such as langchain/braintrust etc?
almogbaku 3 days ago [-]
Hi @BlueHotDog2, OP here
langsmith/langfuse/braintrust collect traces, and then YOU need to look at them and analyze them (error analysis/RCA).
Kelet does that for you :)
Does that make any sense? If not, please tell me, I'm still trying to figure out how to explain that, lol.
dwb 3 days ago [-]
> The key insight
I'm so tired
almogbaku 3 days ago [-]
hey @dwb, OP here
Yes. I definitely used an LLM to help write it. Yeah, I should have edited it better.
Yet it's f*ing painful to do error analysis and go through thousands of traces. Hope you can live with my human mistakes.
whythismatters 3 days ago [-]
Sadly, they forgot to mention why this matters
hmokiguess 3 days ago [-]
Hahahahahahahahahhaa ngl, your comment killed me, some LLM tells are so funny
system16 3 days ago [-]
Also the obligatory “It’s not A. It’s B.”
peter_parker 3 days ago [-]
> They just quietly give wrong answers.
It's not only about wrong answers. Sometimes they just get stuck in a loop.
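One simple way to flag the "stuck in a loop" case is to look for the same action repeated several times in a row within a trace. The sketch below is an assumption-laden illustration: the function `find_loop`, the action strings, and the threshold of three repeats are all hypothetical, and it only catches exact consecutive repeats, not longer cycles.

```python
def find_loop(actions, min_repeats=3):
    """Return (start_index, action) for the first action repeated at least
    min_repeats times in a row, or None if there is no such run."""
    run_start = 0
    for i in range(1, len(actions) + 1):
        # A run ends at the end of the list or when the action changes.
        if i == len(actions) or actions[i] != actions[run_start]:
            if i - run_start >= min_repeats:
                return run_start, actions[run_start]
            run_start = i
    return None

# Hypothetical trace: the agent issues the same search three times.
trace = ["plan", "search:q1", "search:q1", "search:q1", "answer"]
print(find_loop(trace))
```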
halflife 3 days ago [-]
Kelet as in קלט as in input?
almogbaku 3 days ago [-]
Hi @halflife, OP here
YEP, Good catch! Kelet as input/prompt in Hebrew :)
xb6bd1sf 2 days ago [-]
Looks promising!
hadifrt20 3 days ago [-]
In the quickstart, the suggested fixes are called "Prompt Patches". Does that mean Kelet only surfaces root causes that are fixable in the prompt? What happens when the real bug is in tool selection or retrieval ranking, for example?
almogbaku 3 days ago [-]
Hi @hadifrt20, OP here
From what we discovered analyzing ~33K+ sessions, most of the time when the agent selects the wrong tool, it's because the tool's description (i.e., prompt) was not good enough, or there's a missing nuance here.
That goes exactly under Kelet's scope :)