satisfice 2 minutes ago [-]
It’s called testing. And from the reports and comments, there doesn’t seem to be much of it happening. The reason is: it’s quite expensive to do well.
I find that for every hypothesis I might have to run a thousand prompts to collect enough data for a conclusion. For instance, to discover how reliably different models can extract noun phrases from a text: hours of grinding. Even so, that was for a small text. I haven't yet run the process on a large text.
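For concreteness, here's roughly what that grind looks like as a harness: run the same extraction task many times and report the exact-match rate. `call_model` is a made-up stand-in that simulates a noisy extractor, since the real thing would be an API call:

```python
import random

def call_model(model: str, prompt: str) -> set[str]:
    """Hypothetical stand-in for a real model API call.
    Simulates noisy extraction: each expected phrase is dropped ~20% of the time."""
    expected = {"the quick fox", "the lazy dog"}
    return {p for p in expected if random.random() > 0.2}

def extraction_reliability(model: str, text: str, expected: set[str],
                           trials: int = 100) -> float:
    """Fraction of trials in which the model returns exactly the expected noun phrases."""
    prompt = f"List every noun phrase in this text:\n{text}"
    hits = sum(call_model(model, prompt) == expected for _ in range(trials))
    return hits / trials

rate = extraction_reliability("model-a",
                              "The quick fox jumped over the lazy dog.",
                              {"the quick fox", "the lazy dog"})
print(f"exact-match rate over 100 trials: {rate:.2f}")
```

With a real API behind `call_model`, those 100 trials per (model, text) pair are exactly where the hours and the cost go.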
alexhans 19 hours ago [-]
Very, very heterogeneous and fast-moving space.
Depending on how they're made up, different teams do vastly different things: no evals at all, integration tests with no tooling, observability tools like LangFuse wired into their CI/CD, or tools like Arize Phoenix, DeepEval, Braintrust, promptfoo, and PydanticAI used throughout development.
It's definitely an afterthought for most teams although we are starting to see increased interest.
My hope is that we can start thinking about evals as a common language for "product" across role families, so I'm doing some advocacy [1], trying to keep it very simple, including wrapping coding agents like Claude. Sandboxing and observability "for the masses" is still quite a hard concept, but the UX is getting better with time.
What are you doing for yourself/your teams? If not much yet, I'd recommend just starting and figuring out where the friction/value is for you.
One thing I’ve been noticing while building AI tooling is that most “agents” focus on doing work for the user — writing code, sending emails, managing tasks, etc.
But there’s another category that might become just as important: agents that simulate other humans instead of automating tasks.
For example, before shipping a landing page change or pricing update, it’s surprisingly useful to simulate how different types of visitors might react psychologically — where they hesitate, what signals reduce trust, what makes them bounce.
Traditional analytics only shows what happened after users interact. A lot of decisions happen earlier, in the first few seconds, before anything measurable occurs.
I wouldn’t be surprised if we start seeing “human simulation agents” alongside task agents, especially for product, marketing, and UX decisions.
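A minimal sketch of what such a simulation agent's input might look like: persona descriptions turned into role-play prompts for an LLM. The personas and the `persona_prompt` helper are invented for illustration, not any particular product's API:

```python
# Sketch: prompt an LLM to role-play visitor personas before shipping a page change.
# The personas below are illustrative; a real set would come from your own user research.
PERSONAS = {
    "skeptical enterprise buyer": "cares about security pages, SSO, and named customers",
    "indie hacker": "scans pricing first and bounces on anything over $20/mo",
    "non-technical founder": "needs plain-language copy and a visible demo",
}

def persona_prompt(persona: str, traits: str, page_copy: str) -> str:
    """Build a role-play prompt asking one persona to react to the page copy."""
    return (
        f"You are a {persona} who {traits}. You just landed on this page:\n\n"
        f"{page_copy}\n\n"
        "In 2-3 sentences: where do you hesitate, what reduces your trust, "
        "and do you bounce? Answer honestly from this persona's perspective."
    )

prompts = [persona_prompt(p, t, "Acme: AI-powered invoicing. $49/mo.")
           for p, t in PERSONAS.items()]
print(len(prompts), "persona prompts ready to send to a model")
```

The interesting part is aggregating the replies across personas into something a PM can act on, which is where these agents would earn their keep.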
kelseyfrog 4 hours ago [-]
Automated benchmarking.
We were lucky enough to have PMs create a set of questions; we did a round of generation and labeled each response with a pass/fail annotation.
From there we bootstrapped an AI-as-judge that approximately replicated those results. Then we could plug in new models and change prompts and pipelines while still approximating the original feedback signal. It's not an exact match, but it's wildly better than one-off testing and the regressions it brings.
We're able to confidently make changes without accidentally breaking something else. Overall win, but it can get costly if the iteration count is high.
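A lightweight way to check how well a bootstrapped judge replicates the human labels is raw agreement plus a chance-corrected score like Cohen's kappa. A minimal sketch with made-up labels:

```python
def agreement(human: list[bool], judge: list[bool]) -> float:
    """Raw agreement between human pass/fail labels and the LLM judge's verdicts."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Chance-corrected agreement; rule of thumb, >0.6 is decent for an AI judge."""
    po = agreement(human, judge)
    h_yes = sum(human) / len(human)
    j_yes = sum(judge) / len(judge)
    pe = h_yes * j_yes + (1 - h_yes) * (1 - j_yes)  # expected chance agreement
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

# Illustrative labels only; real ones come from the PM-annotated round.
human = [True, True, False, True, False, False, True, True]
judge = [True, True, False, False, False, True, True, True]
print(f"agreement={agreement(human, judge):.2f}, kappa={cohens_kappa(human, judge):.2f}")
```

Tracking kappa as you swap models or prompts tells you when the judge has drifted too far from the original human signal to be trusted.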
maxalbarello 4 hours ago [-]
Also wondering how to eval agentic pipelines. For instance, I generated memories from my ChatGPT conversation history - how do I know whether they are accurate or not?
I would like a single number I could use to optimize the pipeline with, but I find it hard to figure out what that number should be measuring.
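One candidate for that single number, assuming you can hand-label a small sample: treat memory generation as set retrieval and report F1 against facts you verified by rereading the chats. The fact strings and helper name below are purely illustrative:

```python
def memory_f1(generated: set[str], reference: set[str]) -> float:
    """One number for a memory pipeline: F1 over a hand-labeled sample.
    `reference` is a small set of facts verified against the source conversations."""
    if not generated or not reference:
        return 0.0
    tp = len(generated & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(generated)  # how many generated memories are accurate
    recall = tp / len(reference)     # how many true facts were captured
    return 2 * precision * recall / (precision + recall)

gen = {"prefers Python", "works at Acme", "lives in Paris"}
ref = {"prefers Python", "works at Acme", "has two kids"}
print(f"F1 = {memory_f1(gen, ref):.2f}")
```

In practice exact string matching is too brittle for memories, so the set-membership test would likely itself be an LLM judge ("does this memory match this verified fact?"), with the F1 framing unchanged.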
bisonbear 8 hours ago [-]
I assume you're referencing coding agents - I don't think people are. If they are, it's likely using:
- AI to evaluate itself (eg ask claude to test out its own skill)
- custom built platform (I see interest in this space)
I've actually been thinking about this problem a lot and am working on a custom eval runner for your codebase. What would your use case be for this?
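The first bullet, AI evaluating itself, often boils down to a rubric-grading call. `ask_model` here is a fake, deterministic stand-in so the sketch runs offline; in practice it would be a real API call, ideally to a different model than the one being graded:

```python
def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model API call.
    Fake deterministic reply so the sketch runs offline."""
    return "PASS" if "def add" in prompt else "FAIL"

def grade(task: str, output: str) -> bool:
    """Ask a model to grade an output against the task with a pass/fail rubric."""
    rubric = (
        f"Task: {task}\nOutput:\n{output}\n"
        "Reply PASS if the output fully solves the task, otherwise FAIL."
    )
    return ask_model(rubric).strip().upper().startswith("PASS")

print(grade("write an add function", "def add(a, b): return a + b"))
```

The known weakness of this pattern is self-preference: a model grading its own output tends to be lenient, which is one reason people reach for the custom platforms in the second bullet.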
celestialcheese 4 hours ago [-]
A mix of promptfoo and ad-hoc Python scripts, with LangFuse observability.
Definitely not happy with it, but everything is moving too fast for it to feel worth investing in.
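For reference, the "ad-hoc Python scripts" approach often amounts to something like this: a list of (prompt, checker) pairs and a pass-rate printout. `run_model` is a canned stand-in for a real API call:

```python
# Throwaway eval script of the kind many teams run before committing to tooling.
def run_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model API call, with canned replies."""
    canned = {"Capital of France?": "Paris", "2+2?": "4"}
    return canned.get(prompt, "")

CASES = [
    ("Capital of France?", lambda out: "Paris" in out),
    ("2+2?", lambda out: "4" in out),
]

passed = sum(check(run_model(prompt)) for prompt, check in CASES)
print(f"{passed}/{len(CASES)} cases passed")
```

The appeal is zero setup cost; the downside is exactly what the thread describes - no history, no observability, and every script rots as the models change.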
- [1] https://ai-evals.io/ (practical examples https://github.com/Alexhans/eval-ception)