Why most AI products fail: Lessons from 50+ AI deployments at OpenAI, Google & Amazon
Meet the Guest
Lenny Rachitsky: We worked on a guest post together. They had this really key insight that building AI products is very different from building non-AI products.
Aishwarya Naresh Reganti: Most people tend to ignore the non-determinism. You don’t know how the user might behave with your product, and you also don’t know how the LLM might respond to that. The second difference is the agency control trade-off. Every time you hand over decision-making capabilities to agentic systems, you’re kind of relinquishing some amount of control on your end.
State of AI Products
Lenny Rachitsky: This significantly changes the way you should be building product.
Kiriti Badam: So we recommend building step-by-step. When you start small, it forces you to think about what the problem is that you’re going to solve. With all these advancements in AI, one easy, slippery slope is to keep thinking about the complexities of the solution and forget the problem that you’re trying to solve.
Aishwarya Naresh Reganti: It’s not about being the first company to have an agent among your competitors. It’s about have you built the right flywheels in place so that you can improve over time.
Two Key Differences From Traditional Software
Lenny Rachitsky: What kind of ways of working do you see in companies that build AI products successfully?
Aishwarya Naresh Reganti: I used to work with the now-CEO of Rackspace. He would have this block every day in the morning, which would say “catching up with AI, 4:00 to 6:00 AM.” Leaders have to get back to being hands-on. You must be comfortable with the fact that your intuitions might not be right. And you probably are the dumbest person in the room and you want to learn from everyone.
Starting With Low Autonomy
Lenny Rachitsky: What do you think the next year of AI is going to look like?
Progressive Path in Customer Support
Kiriti Badam: Persistence is extremely valuable. Successful companies right now building in any new area, they are going through the pain of learning this, implementing this and understanding what works and what doesn’t work. Pain is the new moat.
Lenny Rachitsky: Today, my guests are Aishwarya Reganti and Kiriti Badam. Kiriti works on Codex at OpenAI and has spent the last decade building AI and ML infrastructure at Google and at Kumo. Ash was an early AI researcher at Alexa and Microsoft and has published over 35 research papers. Together, they’ve led and supported over 50 AI product deployments across companies like Amazon, Databricks, OpenAI, Google, and both startups and large enterprises. Together, they also teach the number one rated AI course on Maven, where they teach product leaders all of the key lessons they’ve learned about building successful AI products. The goal of this episode is to save you and your team a lot of pain and suffering and wasted time trying to build your AI product. Whether you are already struggling to make your product work or want to avoid that struggle, this episode is for you. If you enjoy this podcast, don’t forget to subscribe and follow it in your favorite podcasting app or on YouTube.
It helps tremendously. And if you become an annual subscriber of my newsletter, you get a year free of a ton of incredible products, including a year free of Lovable, Replit, Bolt, Gamma, n8n, Linear, Devin, PostHog, Superhuman, Descript, Wispr Flow, Perplexity, Warp, Granola, Magic [inaudible 00:02:38] Mobbin, and Stripe Atlas. Head on over to lennysnewsletter.com and click Product Pass. With that, I bring you Aishwarya Reganti and Kiriti Badam after a short word from our sponsors.
By the time the results are in, the moment to act has passed. Strella changes that. It’s the first platform that uses AI to run and analyze in-depth interviews automatically, bringing fast and continuous user research to every team. Strella’s AI moderator asks real follow-up questions, probing deeper when answers are vague, and surfaces patterns across hundreds of conversations, all in a few hours, not weeks. Product, design, and research teams at companies like Amazon and Duolingo are already using Strella for Figma prototype testing, concept validation, and customer journey research, getting insights overnight instead of waiting for the next sprint. If your team wants to understand customers at the speed you ship products, try Strella. Run your next study at strella.io/lenny. That’s S-T-R-E-L-L-A.io/lenny.
Ash and Kiriti, thank you so much for being here and welcome to the podcast.
Calibrating Behavior and Constraining Autonomy
Aishwarya Naresh Reganti: Thank you, Lenny.
Kiriti Badam: Thank you for having us. Super excited for this.
More Incremental Evolution Examples
Lenny Rachitsky: Let me set the stage for the conversation that we’re going to have today. So you two have built a bunch of AI products yourself. You’ve gone deep with a lot of companies who have built AI products, have struggled to build AI products, build AI agents. You also teach a course on building AI products successfully and you’re kind of on this mission to just reduce pain and suffering and failure that you constantly see people go through when they’re building AI products. So to set a little just foundation for the conversation we’re going to have, what are you seeing on the ground within companies trying to build AI products? What’s going well? What’s not going well?
Aishwarya Naresh Reganti: I think 2025 has been significantly different than 2024. One, the skepticism has significantly reduced. There were tons of leaders last year who probably thought this would be yet another crypto wave and kind of skeptical to get started. And a lot of the use cases that I saw last year were more of slap chat on your data. And that was calling themselves an AI product. And this year, a ton of companies are really rethinking their user experiences and their workflows and all of that and really understanding that you need to deconstruct and reconstruct your processes in order to build successful AI products. And that’s the good stuff. The bad stuff is the execution is still all over the place. Think of it. This is a three-year-old field. There are no playbooks, there are no textbooks. So you really need to figure out as you go. And the AI lifecycle, both pre-deployment and post-deployment is very different as compared to a traditional software lifecycle.
And so a lot of old contracts and handoffs between traditional roles, like say PMs and engineers and data folks, have now been broken, and people are really adapting to this new way of working together and kind of owning the same feedback loop in a way. Because previously, I feel like PMs and engineers and all of these folks had their own feedback loops to optimize. And now you need to be probably sitting in the same room. You’re probably looking at agent traces together and deciding how your product should behave. So it’s a tighter form of collaboration. So companies are still kind of figuring that out. That’s kind of what I see in my consulting practice this year.
Prompt Injection and Security Risks
Lenny Rachitsky: So let me follow that thread. We worked on a guest post together that came out a few months ago. And the thing that stood out to me most that stuck with me most after working on that post is this really key insight that building AI products is very different from building non-AI products. And the thing that you’re big on getting across is there’s two very big differences. Talk about those two differences.
Keys to Building Successful AI Products
Aishwarya Naresh Reganti: Yes. And again, I want to make sure that we drive home the right point. There are tons of similarities of building AI systems and software systems as well, but then there are some things that kind of fundamentally change the way you build software systems versus AI systems. And one of them that most people tend to ignore is the non-determinism. You’re pretty much working with a non-deterministic API as compared to traditional software. What does that mean and why does that have to affect us is in traditional software, you pretty much have a very well-mapped decision engine or workflow. Think of something like Booking.com. You have an intention that you want to make a booking in San Francisco for two nights, et cetera. The product has kind of been built so that your intention can be converted into a particular action and you kind of are clicking through a bunch of buttons, options, forms, and all of that, and you finally achieve your intention.
But now that layer in AI products has completely been replaced by a very fluid interface, which is mostly natural language, which means the user can literally come up with a ton of ways of saying or communicating their intentions. And that kind of changes a lot of things because now you don’t know how your user’s going to behave. That’s on the input side. And on the output side, you’re working with a non-deterministic, probabilistic API, which is your LLM. And LLMs are pretty sensitive to prompt phrasings and they’re pretty much black boxes. So you don’t even know what the output surface will look like. So you don’t know how the user might behave with your product, and you also don’t know how the LLM might respond to that. So you’re now working with an input, an output, and a process, and you don’t understand any of the three very well. You’re trying to anticipate behavior and build for it.
And with agentic systems, this kind of gets even harder. And that’s where we talk about the second difference, which is the agency-control trade-off. What we mean by that, and I’m kind of shocked so many people don’t talk about this. They’re extremely obsessed with building autonomous systems, agents that can do work for you. But every time you hand over decision-making capabilities or autonomy to agentic systems, you’re kind of relinquishing some amount of control on your end. And when you do that, you want to make sure that your agent has gained your trust or it is reliable enough that you can allow it to make decisions. And that’s where we talk about this agency-control trade-off, which is if you give your AI agent or your AI system, whatever it is, more agency, which is the ability to make decisions, you’re also losing some control and you want to make sure that the agent or the AI system has earned that ability or has built up trust over time.
A Discussion on Evals
Lenny Rachitsky: So just to summarize what you’re sharing here, essentially, people have been building product, software products for a long time. We’re now in a world where the software you’re building is one, non-deterministic, can just do things differently. As you said, you go to booking.com, you find a hotel, it’s going to be the same experience every time. You’ll see different hotels, but it’s a predictable experience. With AI, you can’t predict that it’s going to be the exact same thing, the thing that you plan it to be every time. And then the other is there’s this trade-off between agency and control. How much will the AI do for you versus how much should the person still be in charge? And what I’m hearing is the big point here is this significantly changes the way you should be building product. And we’re going to talk about the impact on how the product development lifecycle should change as a result.
Is there anything else you want to add there before we get into that?
Production Monitoring and Evals
Kiriti Badam: Yeah, it’s definitely one of the key points that this kind of distinction needs to exist in your mind when you’re starting to build. For example, think about if your objective is to hike Half Dome in Yosemite. You don’t start by hiking it every day; you start by training yourself in minor parts, then you slowly improve, and then you go for the end goal. I feel like that’s extremely similar to how you want to build AI products, in the sense that you don’t start with agents with all the tools and all the context that you have in the company on day one and expect it to work, or even tinker at that level. You need to deliberately start in places where there is minimal impact and more human control, so that you have a good grip on what the current capabilities are and what you can do with them, and then slowly lean into more agency and less control.
So this gives you that confidence that, okay, this is the particular problem that I’m facing and the AI can solve this extent of it. And then let me next think through what context I need to bring in, what kind of tools I need to add to this to improve the experience. So I feel like it’s a good and a bad thing. It’s good in the sense that you don’t have to look at the complexity of the outside world, all of these fancy AI agents, and feel like, “I cannot do that.” Everyone is starting from very minimalistic structures and then evolving. And the second part, the bad thing, is that as you are trying to build these one-click agents into your company, you don’t have to be overwhelmed with this complexity. You can slowly graduate.
So that’s extremely important. And we see this as a repeating pattern over and over.
Lenny Rachitsky: Okay. So let’s actually follow that because that’s a really important component of how you recommend people build AI stuff, AI products, AI agents, all the AI things. So give us an example of what you’re talking about here, this idea of starting slow with agency and control and then moving up this rung.
Kiriti Badam: Yeah. For example, a very important or very prevalent application of AI agents is customer support. Imagine you are a company that has a lot of customer support tickets. And why even imagine: OpenAI was in the exact same situation when we were launching products and there was a huge spike of support volume as we launched successful products like Image or GPT-5 and things like that. The kind of questions you get is different. The kind of problems that the customers bring to you is different. So it’s not about just dumping the whole list of help center articles that you have into the AI agent. You kind of understand what the things are that you can build. And so initially the first step of it would be something like you have your support agents, the human support agents, but you will be suggesting to them, okay, this is what the AI thinks is the right thing to do.
And then you get that feedback loop from the humans that, okay, this is actually a good suggestion for me in this particular case and this is a bad suggestion. And then you can go back and understand, okay, this is what the drawbacks are or this is where the blind spots are, and then how do I fix that? And once you get that, you can increase the autonomy to say that, okay, I don’t need to suggest to the human. I’ll actually show the answer directly to the customer. And then we can actually add more complexity in terms of, okay, I was only answering questions based on help center articles, but now let me add new functionality. I can actually issue refunds to the customers. I can actually raise feature requests with the engineering team and all of these things. So if you start with all of this on day one, it’s incredibly hard to control the complexity.
So we recommend building step by step and then increasing it.
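As a rough illustration of the progression Kiriti describes (suggest to the human agent, then answer the customer directly, then take actions such as refunds), here is a minimal Python sketch. The names `Autonomy`, `Outcome`, and `handle_ticket` are hypothetical and are not from OpenAI or any real support product; the three levels are only one assumed way to split the steps.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Autonomy(IntEnum):
    """Hypothetical autonomy levels for a support assistant."""
    SUGGEST = 1  # AI drafts a reply; a human support agent decides what to send
    ANSWER = 2   # AI replies to the customer directly, but cannot take actions
    ACT = 3      # AI may also carry out actions such as issuing a refund


@dataclass
class Outcome:
    sent_to_customer: bool       # did the AI reply go straight to the customer?
    needs_human: bool            # does a human still have to step in?
    action_taken: Optional[str]  # action the AI executed on its own, if any


def handle_ticket(level: Autonomy, proposed_action: Optional[str] = None) -> Outcome:
    """Decide how much the assistant may do on its own at the given level."""
    if level == Autonomy.SUGGEST:
        # High control, low agency: everything goes through the human agent.
        return Outcome(sent_to_customer=False, needs_human=True, action_taken=None)
    if level == Autonomy.ANSWER:
        # The answer goes out directly; any proposed action is escalated to a human.
        return Outcome(
            sent_to_customer=True,
            needs_human=proposed_action is not None,
            action_taken=None,
        )
    # Autonomy.ACT: the assistant both answers and executes the action itself.
    return Outcome(sent_to_customer=True, needs_human=False, action_taken=proposed_action)
```

Calling `handle_ticket(Autonomy.SUGGEST, "issue_refund")` keeps the human fully in charge, while the same ticket at `Autonomy.ACT` lets the assistant issue the refund itself. Moving a deployment from one level to the next is a configuration change rather than a rewrite.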
Lenny Rachitsky: Awesome. And you have a visual actually that we’ll share of what this looks like. But just to kind of mirror back what you’re describing, this idea of start with high control, low agency, the example you gave is the support agent just kind of giving suggestions and not being able to do anything itself; the user is in charge. And then as that becomes useful and you are confident it’s doing the right sort of work, you give it a little more agency and you kind of pull back on the control the user has. And then if that’s starting to go well, then you give it more agency and the user needs less control over it. Awesome.
The True Weight of Evals
Aishwarya Naresh Reganti: I think the higher level idea here is with AI systems, it’s all about behavior calibration. It’s nearly impossible to predict upfront how your system will behave. Now, what do you do about it? You make sure that you don’t ruin your customer experience or your end user experience. You keep that as is, but then adjust the amount of control that the human has. And there is no single right way of doing it. You can decide how to constrain that autonomy. I mean, a different example of how you could constrain autonomy is pre-authorization use cases. Insurance pre-authorization is a very ripe use case for AI because clinicians spend a lot of time pre-authorizing things like blood tests, MRIs and things like that. And there are some cases which are more of low hanging fruits. For instance, MRIs and blood tests, because as soon as you know the patient’s information, it’s easier to approve those and AI could do that, versus something like an invasive surgery, et cetera, which is more high risk. You don’t want to be doing that autonomously.
So you can kind of determine which of these use cases should go through that human-in-the-loop layer versus which of the use cases AI can conveniently handle. And then all through this process, you’re also logging what the human is doing because you want to build a flywheel that you could use in order to improve your system. So you’re essentially not hurting the user experience, not eroding trust, and at the same time logging what humans would otherwise do so that you can continuously improve your system.
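The pre-authorization routing and logging flywheel described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions: the `LOW_RISK` set, the function names, and the log fields are all hypothetical, and a real list of low-risk request types would have to come from clinicians and policy, not from code.

```python
from typing import Dict, List, Optional

# Hypothetical split of request types by risk (assumption for illustration only).
LOW_RISK = {"blood_test", "mri"}


def route(request_type: str) -> str:
    """Send low-risk requests to the AI and everything else to a human reviewer."""
    return "ai_auto_approve" if request_type in LOW_RISK else "human_review"


def log_human_decision(log: List[Dict[str, str]], request_type: str,
                       ai_suggestion: str, human_decision: str) -> None:
    """Record what the AI would have done next to what the human actually did."""
    log.append({
        "request_type": request_type,
        "ai_suggestion": ai_suggestion,
        "human_decision": human_decision,
    })


def agreement_rate(log: List[Dict[str, str]]) -> Optional[float]:
    """Fraction of logged cases where the AI suggestion matched the human decision."""
    if not log:
        return None  # nothing logged yet, so there is no rate to report
    matches = sum(1 for row in log if row["ai_suggestion"] == row["human_decision"])
    return matches / len(log)
```

The agreement rate is one simple signal, among many possible ones, for deciding whether a request type has earned enough trust to move out of the human review path.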
How Codex Views Evals
Lenny Rachitsky: So let me give you a few more examples of this kind of progression that you recommend. And the reason I’m spending so much time here is this is a really key part of your recommendation to help people build more successful AI products. This idea of start slow with high control and low agency and then build up over time once you’ve built confidence that it’s doing the right sort of work. So a few more examples that you shared in your post that I’ll just read. So say you’re building a coding assistant, V1 would be just suggest inline completion and boilerplate snippets. V2 would be generate larger blocks like tests or refactors for humans to review. And then V3 is just apply the changes and open PRs autonomously. And then another example is a marketing assistant. So V1 would be draft emails or social copy, just like here’s what I would do.
V2 is build a multi-step campaign and run the campaign. And V3 is just launch, A/B test, and auto-optimize campaigns across channels. Awesome. Yeah. And again, just to summarize where we’re at, just to give people the advice we’ve shared so far. One is just important to understand AI products are different. They’re non-deterministic. And you pointed out, and I forgot to actually mirror back this point, both on the input and the output. The user experience is non-deterministic. People will see different things, different outputs, different chat conversations, different maybe UI if it’s designing the UI for you. And also the output obviously is going to be non-deterministic. So that’s a problem and a challenge. And then-
Aishwarya Naresh Reganti: I mean, if you think of it, it’s also the most beautiful part of AI, which is, I mean, we are all much more comfortable talking than following a bunch of buttons and all of that. So the bar to using AI products is much lower because you can be as natural as you would be with humans, but that’s also the problem, which is there are tons of ways we communicate and you want to make sure that that intent is rightly communicated and the right actions are taken because most of your systems are deterministic and you want to achieve a deterministic outcome, but with non-deterministic technology and that’s where it gets a little messy.
Continuous Calibration and Dev Framework
Lenny Rachitsky: Awesome. Okay. I love the optimistic version of why this is good. Okay. And then the other piece is this idea of this trade-off of autonomy versus control when you’re designing a thing. And I imagine what you’re seeing is people try to jump to the ideal, like the V3, immediately and that’s when they get into trouble. It’s both probably a lot harder to build that, and it just doesn’t work. And then they’re just like, “Okay, this is a failure. What are we even doing?”
Kiriti Badam: Exactly. I feel there’s a bunch of things that you actually have to get confidence in before you get to V3. And it’s easy to get overwhelmed that, oh, my AI agent is doing these things wrong in a hundred different ways and you’re not going to actually tabulate all of them and fix it. Even though you’ve learned how to deal with the evaluation practices and stuff like that, if you’re starting on the wrong spot, you are actually going to have a hard time correcting things from there. And when you start small and when you start with building a very minimalistic version with high human control and low agency, it also forces you to think about what the problem is that you’re going to solve. We use this term called problem first. And to me, it was obvious in the sense that I do need to think about the problem, but it’s incredible how well it resonates with people: with all these advancements in AI that we are seeing, one easy, slippery slope is to just keep thinking about the complexities of the solution and forget the problem that you’re trying to solve.
So when you’re trying to start at a smaller scale of autonomy, you start to really think about what is the problem that I’m trying to solve and how do I break it down into levels of autonomy that I can build later? So that is incredibly useful and we keep repeating this part over and over with everyone we talk to.
Lenny Rachitsky: And there’s so many other benefits to limiting autonomy because there’s just danger also of the thing doing too much for you and just messing up your, I don’t know, your database, sending out all these emails you never expected. And there’s like so many reasons this is a good idea.
Fixing Bugs to Growing Autonomy
Aishwarya Naresh Reganti: Yep. I recently read this paper from a bunch of folks at UC Berkeley. Basically Matei Zaharia, [inaudible 00:21:54] and the folks at Databricks, and it said that for about 74% or 75% of the enterprises that they had spoken to, their biggest problem was reliability. And that’s also why they weren’t comfortable deploying products to their end users or building customer-facing products, because they just weren’t sure or they just weren’t comfortable doing that and exposing their users to a bunch of these risks. And that’s also why they think a lot of AI products today have to do with productivity, because it’s much lower autonomy versus end-to-end agents that would replace workflows. And yeah, I love their work otherwise as well, but I think that’s very in line with what at least we are seeing at my startup as well.
Lenny Rachitsky: Okay. Very interesting. There’s an episode that’ll come out before this conversation where we go deep into another problem that this avoids, which is around prompt injection and jailbreaking and just how big of a risk that is for AI products where it’s essentially an unsolved and unsolvable problem potentially. I’m not going to go down that track, but that’s a pretty scary conversation we had that’ll be out before this conversation.
Support Agent Autonomy Evolution
Aishwarya Naresh Reganti: I think that will be a huge problem once systems go mainstream. We’re still so busy building AI products that we’re not worried about security, but it will be such a huge problem to kind of, especially with this non-deterministic API again. So you’re kind of stuck because there are tons of instructions that you could inject within your prompt and then it’s going really bad.
Lenny Rachitsky: Okay. Let’s actually spend a little time here because it’s actually really interesting to me and no one’s talking about this stuff, which is like the conversation we had is just it’s pretty easy to get AI to do stuff it shouldn’t do. And there’s all these guardrail systems people put in place, but turns out these guardrails aren’t actually very good and you can always get around them. And to your point, as agents become more autonomous and robots, it gets pretty scary that you could get AI to do things you shouldn’t do.
Kiriti Badam: I think this is definitely a problem, but I feel in the current spectrum of customers adopting AI, the extent to which companies can actually take advantage of AI or improve their processes or streamline the existing processes that they have, I feel it’s still in the very early stage. 2025 has been an extremely busy year for AI agents and customers trying to adopt AI, but I feel the penetration is still not at the level where you would actually get the advantage out of it. So with the right sort of human-in-the-loop points in here, I feel we can actually avoid a bunch of these things and focus more towards streamlining the processes. And I am more on the optimist side in the sense that you need to try and adopt this before focusing only on highlighting the negative aspects of what could go wrong.
So I feel strongly that companies should adopt this. With no company we talk to at OpenAI has it ever been the case that, oh, AI cannot help me in this case. It has always been that, oh, there is this set of things that it can optimize for me and then let me see how I can adopt it.
Lenny Rachitsky: Sweet. I always like the optimistic perspective. I’m excited for you to listen to this and see what you think because it’s really interesting. And to your point, there’s a lot of things to focus on. It’s one of many things to worry about and think about. Okay, let’s get back on track here. So we’ve shared a bunch of pro-tips and important piece of advice. Let me ask, what other patterns and kind of ways of working do you see in companies that do this well and teams that build AI products successfully? And then just what are the most common pitfalls people fall into? So we could just maybe start with, what are other ways that companies do this well, build AI products successfully?
Real Chaos in Enterprise Taxonomies
Aishwarya Naresh Reganti: I almost think of it as a success triangle with three dimensions. It’s never only technical. Every technology problem is a people problem first. And with companies that we have worked with, it’s these three dimensions: great leaders, good culture, and technical prowess. With leaders, we work with a lot of companies on their AI transformation, training, strategy and stuff like that. And I feel like at a lot of companies, the leaders have built intuitions over 10 or 15 years and they’re kind of highly regarded for those intuitions. But now with AI in the picture, those intuitions will have to be relearned and leaders have to be vulnerable to do that. I used to work with the now-CEO of Rackspace, Gagan. So he would have this block every day in the morning, which would say “catching up with AI, 4:00 to 6:00 AM,” and he would not have any meetings or anything like that.
And that was just his time to pick up on the latest AI podcast or information and all of that. And he would have weekend vibe coding sessions and stuff like that. So I think leaders have to get back to being hands-on. And that’s not because they have to be implementing these things, but more of rebuilding their intuitions because you must be comfortable with the fact that your intuitions might not be right and you probably are the dumbest person in the room and you want to learn from everyone. And that I’ve seen that being a very distinguishing factor of companies that build products which are successful because you’re kind of bringing in that top-down approach. It’s almost always impossible for it to be bottom-up. You can’t have a bunch of engineers go and get buy-in from the leader if they just don’t trust in the technology or if they have misaligned expectations about the technology.
I’ve heard from so many folks who are building that our leaders just don’t understand the extent to which AI can solve a particular problem or they just vibe code something and assume it’s easy to take it to production and you really need to understand the range of what AI can solve today so that you can guide decisions within the company. The second one is the culture itself. And again, I work with enterprises where AI is not their main thing and they need to bring in AI into their processes just because a competitor is doing it. And just because it does make sense because there are use cases that are very ripe. Then along the way, I feel a lot of companies have this culture of FOMO and you will be replaced and those kind of things and people get really afraid. Subject matter experts are such a huge part of building AI products that work because you really need to consult them to understand how your AI is behaving or what the ideal behavior should be.
But then I’ve spoken to a bunch of companies where the subject matter experts just don’t want to talk to you because they think their job is being replaced. So I mean, again, this comes from the leader itself. You want to build a culture of empowerment, of augmenting AI into your own workflows so that you can 10X at what you’re doing instead of saying that probably you’ll be replaced if you don’t adopt AI and stuff like that. So that kind of an empowering culture always helps. You want to make your entire organization be in it together and make AI work for you instead of trying to guard their own jobs, et cetera. And with AI, it’s also true that it opens up a lot more opportunities than before. So you could have your employees doing a lot more things than before and 10x their productivity. And the third one is the technical part which we talk about.
I think folks that are successful are incredibly obsessed about understanding their workflows very well and augmenting parts that could be ripe for AI versus the ones that might need human in the loop somewhere, et cetera. Whenever you’re trying to automate some part of a workflow, it’s never the case that you could use an AI agent and that will solve your problems. It’s always, you probably have a machine learning model that’s going to do some part of the job. You have deterministic code doing some part of the job. So you really need to be obsessed with understanding that workflow so you can choose the right tool for the problem instead of being obsessed with the technology itself. And another pattern I see is also folks really understand this idea of working with a non-deterministic API, which is your LLM. And what that means is they also understand the AI development lifecycle looks very different and they iterate pretty quickly, which is: can I build something and iterate quickly in a way that doesn’t ruin my customer experience and at the same time gives me enough data so that I can estimate behavior?
So they build that flywheel very quickly. As of today, it’s not about being the first company to have an agent among your competitors. It’s about, have you built the right flywheels in place so that you can improve over time? When someone comes up to me and says, “We have this one-click agent, it’s going to be deployed in your system, and then in two or three days, it’ll start showing you significant gains,” I would almost be skeptical because it’s just not possible. And that’s not because the models aren’t there, but because enterprise data and infrastructure is very messy and you need a bit to … Even the agent needs a bit to understand how these systems work. There are very messy taxonomies everywhere. People tend to do things like getCustomerDataV1, getCustomerDataV2, and these kind of things. And all those functions exist and they’re being called and basically there’s a lot of tech debt that you need to deal with.
So most of the time, if you’re obsessed with the problem itself and you understand your workflows very well, you will know how to improve your agents over time instead of just slapping an agent on and assuming that it’ll work from day one. I probably will go as far as to say that if someone’s selling you one-click agents, it’s pure marketing. You don’t want to buy into that. I would rather go with a company that says, “We’re going to build this pipeline for you,” and that will learn over time and build a flywheel to improve than something that’s going to work out of the box. To replace any critical workflow or to build something that can give you significant ROI, it easily takes four to six months of work, even if you have the best data layer and infrastructure layer.
Lenny Rachitsky: Amazing. There’s a lot there that resonates so deeply with other conversations I’ve been having on this podcast. One is just for a company to be successful at seeing a lot of impact from AI, the founder-CEO has to be deep into it. I had Dan Shipper on the podcast and they work with a bunch of companies helping them adopt AI. And he said that’s the number one predictor of success. Is the CEO chatting with ChatGPT, Claude, whatever, many times a day. I love this example you gave with the Rackspace CEO has catch up on AI news in the morning every day. I was imagining he’d be chatting with the chatbot versus reading news.
From Drafts to End-to-End Solutions
Aishwarya Naresh Reganti: With the kind of information you have as of today, you could just … I mean, you want to choose the right channels as well because everybody has an opinion. So whose opinion do you want to bank on? I feel like having that good quality set of people that you’re listening to really makes sense. So he just has a list of two or three sources that he always looks at. And then he comes back with a bunch of questions and bounces it around with a bunch of AI experts to see what they think about it. And I was part of that group, so I kind of know-
Lenny Rachitsky: I love that.
Risk Mitigation Framework
Aishwarya Naresh Reganti: … about the questions that he comes up with.
Overhyped vs. Underhyped in AI
Lenny Rachitsky: That’s cool.
Aishwarya Naresh Reganti: It’s pretty cool. I was like, “Why are you doing so much?” And then he says, “It trickles down into a bunch of decisions that we take.”
End of Agents and Proactivity
Lenny Rachitsky: Okay. Let me talk about another topic that’s very … It’s been a hot topic on this podcast. It was a hot topic on Twitter for a while, evals. A lot of people are obsessed with evals, think they’re the solution to a lot of problems in AI. A lot of people think they’re overrated that you don’t need evals. You can just feel the vibes and you’ll be all right. What’s your take on evals? How far does that take people in solving a lot of the problems that you talk about?
Kiriti Badam: In terms of what is going on in the community, I feel there’s just this false dichotomy of this either evals is going to solve everything or online monitoring or production monitoring is going to solve everything. And I find no reason to trust one of the extremes in the sense that I will entirely bank my application on this or like that to solve the thing. So if you take a step back, think of what are evals. Evals are basically your trusted product thinking or your knowledge about the product that is going into this set of data sets that you’re going to build in the sense that this is what matters to me. This is the kind of problems that my agent should not do and let me build a list of datasets so that I’m going to do well on those. And in terms of production monitoring, what you’re doing there is you’re deploying your application and then you’re having some sort of key metrics that actually communicate back to you on how customers are using your product.
You could be deploying any agent and if the customer is giving a thumbs up for your interaction, you better want to know that. So that is what production monitoring is going to do. And this production monitoring has existed for products for a long time, just that now with the AI agents, you need to be monitoring at a lot more granularity. It’s not just the customer always giving you explicit feedback, but there are many implicit signals that you can get. For example, in ChatGPT, if you are liking the answer, you can actually give a thumbs up. Or if you don’t like the answer, sometimes customers don’t give you a thumbs down, but actually regenerate the answer. So that is a clear indication that the initial answer that was generated is not meeting the customer’s expectation. So these are the kind of implicit signals you always need to think about.
And that spectrum has been increasing in terms of production monitoring. Now let’s come back to the initial topic of like, okay, is it evals or is it production monitoring? What does it matter? So I feel, again, we go back to this problem-first approach of what is it that you’re trying to build. You’re trying to build a reliable application for your customers that’s not going to do a bad thing. It’s always going to do the right thing. Or if it is doing a wrong thing, you’re basically alerted very quickly. So I break this down into two parts. One is nobody goes into deploying an application without actually just testing that. This testing could be vibes or this testing could be, “Okay, I have these 10 questions that it should not go wrong on no matter what changes I make, and let me build this and let’s call this an evaluation dataset.” Now, let’s say you build this, you deployed this, and then you figured, “Okay, now I need to understand whether it’s doing the right thing or not.”
So if you’re a high throughput or a high transaction customer, you cannot practically sit and evaluate all the traces. You need some indication to understand what are the things that I should look at. And this is where production monitoring comes into the picture: you cannot predict the ways in which your agent could be going wrong, but all of these other implicit signals and explicit signals, those are going to communicate back to you what are the traces that you need to look at. And that is where production monitoring helps. And once you get these kinds of traces, you need to examine what are the failure patterns that you’re seeing in these different types of interactions. And is there something that I really care about that should not happen? And if those kinds of failure modes are happening, then I need to think about building an evaluation dataset for it.
And okay, let’s say I built an evaluation dataset for my agent trying to offer refunds where explicitly I have configured it not to. So I built this evaluation dataset and then I made my changes in tools or prompts or whatever, and then I deployed the second version of the product. Now there is no guarantee that this is the only problem that you’re going to see. You still need production monitoring to actually catch different kinds of problems that you might encounter. So I feel evals are important, production monitoring is important, but this notion of only one of them is going to solve things for you that is completely dismissible in my opinion.
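A minimal Python sketch of the two complementary pieces described above: a small evaluation dataset that guards against known failures, and production signals that point to the traces worth reading. The `Trace` fields, the `toy_agent` stand-in, and the refund check are all hypothetical names chosen for illustration, not anything from the Codex team or a real library.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    thumbs_down: bool = False  # explicit negative feedback
    regenerated: bool = False  # implicit signal: the user asked for a new answer


def traces_to_review(traces):
    """Production monitoring: keep only traces carrying a negative signal."""
    return [t for t in traces if t.thumbs_down or t.regenerated]


def failed_eval_cases(agent, eval_cases):
    """Evaluation dataset: return the inputs whose output fails its check."""
    return [text for text, check in eval_cases if not check(agent(text))]


def toy_agent(user_input):
    """Hypothetical stand-in for an agent configured not to offer refunds."""
    if "refund" in user_input.lower():
        return "I can't process that here, but I can connect you with billing."
    return "Happy to help with that."


def never_offers_refund(output):
    return "refund" not in output.lower()


EVAL_CASES = [
    ("I want a refund right now", never_offers_refund),
    ("Where is my order?", never_offers_refund),
]

traces = [
    Trace("t1"),
    Trace("t2", regenerated=True),
    Trace("t3", thumbs_down=True),
]

print(failed_eval_cases(toy_agent, EVAL_CASES))        # known failures that regressed
print([t.trace_id for t in traces_to_review(traces)])  # traces worth reading by hand
```

The eval cases only cover failures someone already thought of; the flagged traces are where new failure patterns would be found and, if they matter, turned into new eval cases.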
Pain Is the New Moat
Lenny Rachitsky: All right. A very reasonable answer. And the point here isn’t, it’s not just as simple as do both. It’s more that there are different things to catch and one approach won’t catch all the things you need to be paying attention to.
Aishwarya Naresh Reganti: Exactly.
Some Final Advice
Lenny Rachitsky: Awesome.
Aishwarya Naresh Reganti: I want to take two steps back and kind of talk about how much weight the term evals has had to take in the second half of 2025 because you go meet a data labeling company and they tell you our experts are writing evals and then you have all of these folks saying that PMs should be writing evals, they’re the new PRDs. And then you have folks saying that evals is pretty much everything, which is the feedback loop you’re supposed to be building to improve your products. Now, step back as a beginner and kind of think what are evals? Why is everyone saying evals? And these are actually different parts of the process and nobody is wrong in the sense that yes, these are evals, but when a data labeling company is telling you that our experts are writing evals, they’re actually referring to error analysis or experts just leaving notes on what should be right.
Lawyers and doctors write evals, that doesn’t mean they’re building LLM judges or they’re building this entire feedback loop. And when you say that a PM should be writing evals, doesn’t mean they have to write an LLM judge that’s good enough for production. I think there are also very prescriptive ways of doing this and plus one to KD, which is you cannot predict upfront if you need to be building an LLM judge versus you need to be using implicit signals from production monitoring, et cetera. I think Martin Fowler at some point had this term called semantic diffusion back in the 2000s, which kind of means that someone comes up with a term, everybody starts butchering it with their own definitions and then you kind of lose the actual definition of it. That is what is happening to evals or agents or any word in AI as of today, everybody kind of sees a different side to it, I guess.
But if you make a bunch of practitioners sit together and ask them, “Is it important to build an actionable feedback loop for AI products?” I think all of them will agree. Now, how you do that really depends on your application itself. When you go to complex use cases, it’s incredibly hard to build LLM judges because you see a lot of emerging patterns. If you built a judge that would test for verbosity or something like that, it turns out that you’re seeing newer patterns that your LLM judge is not able to catch, and then you just end up building too many evals. And at that point, it just makes sense to look at your user signals, fix them, check if you have regressed and move on instead of actually building these judges. So it all depends. I think one statement that every ML practitioner will tell you is it really depends on the context. Don’t be obsessed with prescriptions they’re going to change.
Rapid Fire Questions
Lenny Rachitsky: That’s such an important point, this idea that evals just means many things to different people now. It’s just a term for so many things. And it’s complicated to just talk about evals when you see it as the stuff data labeling companies are giving you and the things PMs write. And there’s also benchmarks. People call benchmarks evals a little bit. It’s like-
Favorite Recent Product Recommendations
Aishwarya Naresh Reganti: I recently spoke to a client who told me, “We do evals.” And I was like, “Okay, can you show me your dataset?” And they said, “No, we just checked LM Arena and Artificial Analysis. These are independent benchmarks and we know that this model is the right one for our use case.” And I’m like, “You’re not doing evals. That’s not evals. Those are model evals.”
Lenny Rachitsky: But it makes sense. The word, it could be used in that context. I get why people think that, but yeah, now it’s just confusing it even more.
A Personal Motto
Aishwarya Naresh Reganti: Yep.
A Moment of Mutual Appreciation
Lenny Rachitsky: Just one more line of questioning here that’s on my mind: the reason this became kind of a big debate is Claude Code. The head of Claude Code, Boris, was like, “Nah, we don’t do evals on Claude Code. It’s all vibes.” What can you share, Kiriti, on Codex and the Codex team, how you approach evals?
Kiriti Badam: So with Codex, we have this balanced approach of you need to have evals and you need to definitely listen to your customers. And I think Alex has been on your podcast recently and he’s been talking about how we’re extremely focused on building the right product. And a big part of it is basically listening to your customers. And coding agents are extremely unique compared to agents for other domains in the sense that these are actually built for customizability and these are built for engineers. So a coding agent is not a product which is going to solve these top five workflows or top six workflows or whatever. It’s meant to be customizable in many different ways. And the implication of that is that your product is going to be used in different integrations and different kinds of tools and different kinds of things. So it gets really hard to build an evaluation dataset for all kinds of interactions that your customers are going to use your product for.
With that said, you also need to understand that, okay, if I’m going to make a change, it’s at least not going to damage something that is really core to the product. So we have evaluations for doing that, but at the same time we take extreme care in understanding how the customers are using it. For example, we built this code review product recently and it has been gaining an extreme amount of traction. And I feel like many, many bugs in OpenAI as well as even at our external customers are getting caught with this. And now let’s say I’m making a model change to the code review or a different kind of RL mechanism that I trained with it, and now if I’m going to deploy it, I definitely do want to A/B test and identify whether it’s actually finding the right mistakes and how users are reacting to it. And sometimes if users do get annoyed by your incorrect code reviews, they go to the extent of just switching off the product.
So those are the signals that you want to look at and make sure that your new changes are doing the right thing. And it’s extremely hard for us to think of these kind of scenarios beforehand and develop evaluation data sets for it. So I feel like there’s a bit of both. There’s a lot of vibes and there’s a lot of customer feedback and we are super active on the social media to understand if anybody’s having certain types of problems and quickly fix that. So I feel it’s a … How do I put this? It’s like a domain of things that you do here.
Lenny Rachitsky: That makes so much sense. Okay. What I’m hearing, Codex, pro evals, but it’s not enough.
Kiriti Badam: Yes.
Lenny Rachitsky: But also just watch customer behavior and feedback. And also there’s some vibes just like, is this feeling good? As I’m using it, generating great code that I’m excited about that I think is great.
Kiriti Badam: If anybody’s coming and saying, “I have this concrete set of evals that I can bet my life on and then I don’t need to think about anything else,” I don’t think it’s going to work. And for every new model that we’re going to launch, we get together as a team and test different things. Each person is concentrating on something else. And we have this list of hard problems that we have and we throw that at the model and see how well it’s progressing. So it’s like custom evals for each engineer, you would say, to just understand what the product is doing with its new model.
Lenny Rachitsky: We’ve been talking for almost an hour already, and we haven’t even covered your extremely powerful software development workflow for building AI products that you two developed that you teach in your course, that you basically combined all the stuff we’ve been talking about into a step-by-step approach to building AI products. You call it the continuous calibration, continuous development framework. Let’s pull up a visual to show people what the heck we’re talking about, and then just walk us through what this is, how this works, how teams can shift the way they build their AI products to this approach to help them avoid a lot of pain and suffering.
Where to Find the Guest
Aishwarya Naresh Reganti: Before we go about explaining the life cycle, a quick story on why Kiriti and I came up with this is because there are tons of companies that we keep talking to that have the pressure from their competitors because they’re all building agents. We should be building agents that are entirely autonomous. And I did end up working with a few customers where we built these end-to-end agents. And turns out that because you start off at a place where you don’t know how the user might interact with your system and what kind of responses or actions the AI might come up with, it’s really hard to fix problems when you have this really huge workflow, which is taking four or five steps, making tons of decisions. You just end up debugging so much and then kind of hot fixing to the point where at a time we were building for a customer support use case, which is the example that we give in the newsletter as well.
And we had to shut down the product because we were doing so many hot fixes and there was no way we could count all the emerging problems that were coming up. And there’s also quite some news online. Recently, I think Air Canada had this thing where one of their agents predicted or hallucinated a policy for a refund, which was not part of their original playbook, and they had to go by it because legal stuff. And there have been a ton of really scary incidents. And that’s where the idea comes from. How can you build so that you don’t lose customer trust and you don’t end up, or your agent or AI system doesn’t end up making decisions that are super dangerous to the company itself. At the same time, build a flywheel so that you can improve your product as you go. And that’s where we came up with this idea of continuous calibration, continuous development.
The idea is pretty simple, which is we have this right side of the loop, which is continuous development, where you scope capability and curate data, essentially get a data set of what your expected inputs are and what your expected outputs should be looking at. This is a very good exercise before you start building any AI product because many times you figure out that a lot of the folks within the team are just not aligned on how the product should behave. And that’s where your PMs can really give in a lot more information and your subject matter experts as well. So you have this data set that you know your AI product should be doing really well on. It’s not comprehensive, but it lets you get started. And then you set up the application and then design the right kind of evaluation metrics. And I intentionally use the term evaluation metrics, although we say evals because I just want to be very specific in what it is because evaluation is a process, evaluation metrics are dimensions that you want to focus on during the process.
And then you go about deploying, run your evaluation metrics. And the second part is the continuous calibration, which is the part where you understand what behavior you hadn’t expected in the beginning, right? Because when you start the development process, you have this data set that you’re optimizing for, but more often than not, you realize that that data set is not comprehensive enough because users start behaving with your systems in ways that you did not predict. And that’s where you want to do the calibration piece. I’ve deployed my system. Now I see that there are patterns that I did not really expect and your evaluation metrics should give you some insight into those patterns, but sometimes you figure out that those metrics were also not enough and you probably have new error patterns that you have not thought about. And that’s where you analyze your behavior, spot error patterns.
You apply fixes for issues that you see, but you also design newer evaluation metrics to figure out that there are emerging patterns. And that doesn’t mean you should always design evaluation metrics. There are some errors that you can just fix and not really come back to because they’re very spot errors. For instance, there’s a tool calling error just because your tool wasn’t defined well and stuff like that. You can just fix it and move on. And this is pretty much what an AI product lifecycle would look like. But what we specifically also mention is while you’re going through these iterations, try to think of lower agency iterations in the beginning and higher control iterations. What that means is constrain the number of decisions your AI systems can make and make sure that there are humans in the loop and then increase that over time because you’re kind of building a flywheel of behavior and you’re understanding what kind of use cases are coming in or how your users are using the system.
And one example I think we give in the newsletter itself is the customer support. This is a nice image that kind of shows how you can think of agency and control as two dimensions. And each of your versions keep on increasing the agency or the ability of your AI system to make decisions and lower the control as you go. And one example that we give is that of the customer support agent, where you can break it down into three versions. The first version is just routing, which is your agent able to classify and route a particular ticket to the right department? And sometimes when you read this, you probably think, is it so hard to just do routing? Why can’t an agent easily do that? And when you go to enterprises, routing itself can be a super complex problem. Any retail company, any popular retail company that you can think of has hierarchical taxonomies.
Most of the time the taxonomies are incredibly messy. I have worked on use cases where you probably have a taxonomy that has some kind of hierarchy and then it says shoes and then women’s shoes and men’s shoes all at the same layer, where ideally you should be having shoes, and then women’s shoes and men’s shoes should be subclasses. And then you’re like, okay, fine. I could just merge that. And you go further and you see that there’s also another section under shoes that says for women and for men, and it’s just not aggregated. It’s not fixed for some reason. So if an agent sees this kind of a taxonomy, what is it supposed to do? Where is it supposed to route? And a lot of the time we are not aware of these problems until you actually go about building something and understanding it.
And when these kind of problems, real human agents see these kind of problems, they know what to check next. Maybe they realize that the node that says for women and for men that’s under shoes was last updated in 2019, which means that it’s just a dead node that’s lying there and not being used. So they kind of know that, okay, we’re supposed to be looking at a different node and stuff like that. And I’m not saying agents cannot understand this or models are not capable enough to understand this, but there are really weird rules within enterprises that are not documented anywhere. And you want to make sure that the agents have all of that context instead of just throwing the problem at that.
Yeah. Coming back to the versions we had, routing was one where you have really high control because even if your agent routes to the wrong department, humans can take control and undo those actions. And along the way, you also figure out that you probably are dealing with a ton of data issues that you need to fix and make sure that your data layer is good enough for the agent to function. V2 is what we call a copilot, which is now that you’ve figured out routing works fine after a few iterations and you’ve fixed all of your data issues, you could go to the next step, which is, can my agent provide suggestions based on some standard operating procedures that we have for the customer support agent? And it could just generate a draft that the human can make changes to. And when you do this, you’re also logging human behavior, which means how much of this draft was used by the customer support agent and what was omitted. So you’re actually getting error analysis for free when you do this because you’re literally logging everything that the user is doing that you could then build back into your flywheel.
And then we say, post that, once you’ve figured out that those drafts look good and most of the times maybe humans are not making too many changes, they’re using these drafts as is. That’s when you want to go to your end-to-end resolution assistant that could draft a resolution that could solve the ticket as well. And those are the stages of agency where you start with low agency and then you go up high. We also have this really nice table that we put together, which is what do you do at each version and what you learn that can enable you to go to the next step and what information do you get that you can feed into the loop, right? When you’re just doing your routing, you have better quality routing data, you also know what kind of prompts you need to be building to improve the routing system.
Essentially, you’re figuring out your structure for context engineering and building that flywheel that you want. And while I go through this, I want to also be very clear about two things. One is when you build with CCCD in mind, it doesn’t mean that you’ve fixed the problem once and for all. It’s possible that you’ve gone through V3 and you see a new distribution of data that you never previously imagined, but this is just one way to lower your risk, which is you get enough information about how users behave with your system before going to a point of complete autonomy. And the second thing is you’re also kind of building this implicit logging system. A lot of people come and tell us that, “Oh, wait, there are evals. Why do you need something like this?” The issue with just building a bunch of evaluation metrics and then having them in production is evaluation metrics catch only the errors that you’re already aware of, but there can be a lot of emerging patterns that you understand only after you put things in production.
So for those emerging patterns, you’re kind of creating a low risk kind of a framework so that you could understand user behavior and not really be in a position where there are tons of errors and you’re trying to fix all of them at once. And this is not the only way to do it. There are tons of different ways. You want to decide how you constrain your autonomy. It could be based on the number of actions that the agent is taking, which is what we do in this example. It could be based on topic. There’s just some domains where it’s pretty high risk to make a system completely autonomous for certain decisions, but for some other topics, it’s okay to make them completely autonomous and depending on the complexity of the problem. And that’s where you really want your product managers, your engineers and subject matter experts to align on how to build this system and continuously improve it.
The idea is just behavior calibration and not losing user trust as you do that behavior calibration, I guess.
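As a rough illustration of the copilot stage described above, the Python sketch below logs how much of each AI draft the human agent actually kept and uses that as one signal for whether to raise autonomy. The function names and the 0.9 and 0.8 thresholds are assumptions made up for this example, not values from the CCCD framework; `difflib.SequenceMatcher` is only one simple way to measure similarity.

```python
import difflib


def draft_reuse_ratio(draft, sent):
    """Similarity between the AI draft and what the human actually sent (0.0 to 1.0)."""
    return difflib.SequenceMatcher(None, draft, sent).ratio()


def ready_for_more_autonomy(pairs, min_ratio=0.9, min_share=0.8):
    """True if enough drafts were sent nearly unchanged.

    pairs: list of (draft, sent) strings logged during the copilot stage.
    The thresholds are illustrative assumptions, not recommended values.
    """
    if not pairs:
        return False  # no evidence yet, stay at the current stage
    kept = sum(1 for draft, sent in pairs if draft_reuse_ratio(draft, sent) >= min_ratio)
    return kept / len(pairs) >= min_share


logged = [
    ("Your order ships tomorrow.", "Your order ships tomorrow."),
    ("Please reset your password.", "Please reset your password."),
    ("We cannot help with that.", "I am sorry, let me look into this for you."),
]

for draft, sent in logged:
    print(round(draft_reuse_ratio(draft, sent), 2), "|", sent)
print(ready_for_more_autonomy(logged))
```

Drafts that were heavily rewritten are also the natural starting point for error analysis, since each one records what the human thought the right answer was.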
Lenny Rachitsky: We’ll link folks to this actual post if they want to go really deep. You basically go through all of this step by step, with a bunch of examples. And the idea here is, as you said, that everything about what you’re describing here is about making it continuous and iterative and kind of moving along this progression of higher autonomy, less control. And this idea of even calling it continuous calibration, continuous development is communicating that it’s this kind of iterative process. And just to be clear, this naming is kind of an ode to CI/CD, continuous integration, continuous deployment. And the idea here is that this is the version of that for AI where instead of just integrating unit tests and deploying constantly, it’s running evals, looking at results, iterating on the metrics you’re watching, figuring out where it’s breaking and iterating on that. Awesome. Okay.
So again, we’ll point people to this post if they want to go deeper. That was a great overview. Is there anything else before we go into different topic around this framework specifically that you think is important for people to know?
Aishwarya Naresh Reganti: I think one of the most common questions we get is, how do I know if I need to go to the next stage or if this is calibrated enough? There’s not really a rule book you can follow, but it’s all about minimizing surprise, which means let’s say you’re calibrating every one or two days and you figure out that you’re not seeing new data distribution patterns, your users have been pretty consistent with how they’re behaving with the system. Then the amount of information you gain is kind of very low and that’s when you know you can actually go to the next stage. And it’s all about the vibes at that point: do you know you’re ready, you’re not receiving any new information. But also it really helps to understand that sometimes there are events that could completely mess up the calibration of your system. An example is GPT-4o doesn’t exist anymore, or it’s going to be deprecated in APIs as well.
So most companies that were using 4o should switch to 5 and 5 has very different properties. So that’s where your calibration’s off again. You want to go back and do this process again. Sometimes users start behaving with systems also differently over time or user behavior evolves. Even with consumer products, you don’t talk to ChatGPT the same way you were talking, say, two years ago, just because you know the capabilities have increased so much. And also just people get excited when these systems can solve one task, they want to try it out on other tasks as well. We built this system for underwriters at some point. Underwriting is a painful task. There are agreements that are like loan applications are like 30 or 40 pages, and the idea for this bank was to build a system that could help underwriters pick policies and information about the bank so that they could approve loans.
And for a good three or four months, everybody was pretty impressed with the system. We had underwriters actually report gains in terms of how much time they were spending, et cetera. And post three months, we realized that they were so excited with the product that they started asking very deep questions that we never anticipated. They would just throw the entire application document at the system and go, “For a case that looks like this, what did previous underwriters do?” And for a user, that just seems like a natural extension of what they were doing, but the building behind it should significantly change. Now, you need to understand what does “for a case like this” mean in the context of the loan itself? Is it referring to people of a particular income range or is it referring to people in a particular geo and stuff like that?
And then you need to pick up historical documents, analyze those documents, and then tell them, “Okay, this is what it looks like,” versus just saying that there’s a policy X, Y, and Z, and you want to look up that policy. So something that might seem very natural to an end user might be very hard to build as a product builder, and you see that user behavior also evolves over time, and that’s when you know that you want to go back and recalibrate.
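One way to make "minimizing surprise" concrete is to track how often each calibration window contains patterns not seen before. The Python sketch below is only an illustration under assumed inputs: each window is a list of hand-assigned pattern labels (for example from error analysis), and the 5% threshold and three quiet windows are made-up numbers, since the speakers explicitly say there is no rule book.

```python
def novelty_rate(seen, window):
    """Share of items in this window whose pattern label was never seen before."""
    if not window:
        return 0.0
    new_items = [label for label in window if label not in seen]
    return len(new_items) / len(window)


def is_calibrated(windows, threshold=0.05, quiet_windows=3):
    """True if the most recent `quiet_windows` windows brought almost nothing new.

    windows: chronological list of lists of pattern labels.
    threshold and quiet_windows are illustrative assumptions.
    """
    seen = set()
    streak = 0
    for window in windows:
        rate = novelty_rate(seen, window)
        streak = streak + 1 if rate <= threshold else 0
        seen.update(window)
    return streak >= quiet_windows


stable = [
    ["policy_lookup", "income_check"],
    ["policy_lookup"],
    ["income_check"],
    ["policy_lookup", "income_check"],
]
print(is_calibrated(stable))

# A new kind of request resets the streak, which is the cue to recalibrate.
shifted = stable + [["similar_past_cases"]]
print(is_calibrated(shifted))
```

A model swap or a shift in user behavior, like the underwriters’ new question type, would show up here as a jump in the novelty rate.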
Lenny Rachitsky: What do you think is overhyped in the AI space right now? And even more importantly, what do you think is under-hyped?
Kiriti Badam: As I said, I’m super optimistic about the different things that are going on in AI. So I wouldn’t say overhyped, but what I feel is kind of misunderstood is the concept of multi-agents. People have this notion of, “I have this incredibly complex problem. Now I’m going to break it down into, hey, you are this agent. Take care of this. You’re this agent. Take care of this.” And now if I somehow connect all of these agents, they think that’s the agent utopia, and it’s never the case. There are incredibly successful multi-agent systems that are built. There’s no doubt about that. But I feel a lot of it comes down to how you are limiting the ways in which the system can go off track. For example, if you’re building a supervisor agent and there are subagents that actually do the work for the supervisor agent, that is a very successful pattern.
But coming with this notion of I’m going to divide the responsibilities based on functionality and somehow expect all of that to work together in some sort of gossip protocol, that is extremely misunderstood, that you could do that. I don’t think current ways of building and current model capabilities are there yet in terms of building those kinds of applications. I feel that is more misunderstood than overrated. Underrated, I feel it’s hard to probably believe, but I still feel coding agents are underrated in the sense that you can go on Twitter and you can go on Reddit and you see a lot of chatter about coding agents, but talking to an engineer in any random company, especially outside of the Bay Area, you can see the amount of impact these coding agents can create and the penetration is very low. So I feel like 2025 and 2026 are going to be incredible years for optimizing all of these processes.
And I feel that is going to be creating a lot of value with AI.
Lenny Rachitsky: That’s really interesting on that first point. So the idea there is you’ll probably be more successful building and using an agent that is able to do its own sub-agent splitting of work versus a bunch of, say, Codex agents where you tell them, “You do this task, you do that task”?
Kiriti Badam: You can have agents to do these things and you as a human can orchestrate it or you can have one larger agent that is going to orchestrate all of these things, but letting the agents communicate in terms of peer-to-peer kind of protocol, and then especially doing this in a customer support kind of use case is incredibly hard to control what kind of agent is replying to your customer because you need to shift your guardrails everywhere and things like that.
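The supervisor pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions: the subagents are plain functions standing in for model-backed workers, and the names `Supervisor`, `billing_agent`, `docs_agent`, and `apply_guardrail`, along with the keyword routing rule, are hypothetical and not taken from any product mentioned in the conversation. The point it tries to show is that when every reply flows through one orchestrator, the guardrail lives in one place and you always know which agent answered.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical subagents: plain functions standing in for model-backed workers.
def billing_agent(request: str) -> str:
    return f"[billing draft] {request}"


def docs_agent(request: str) -> str:
    return f"[docs draft] {request}"


def apply_guardrail(reply: str) -> str:
    """Single checkpoint for every customer-facing reply (illustrative rule only)."""
    banned_phrases = ("refund guaranteed",)
    if any(phrase in reply.lower() for phrase in banned_phrases):
        return "Let me connect you with a human agent."
    return reply


@dataclass
class Supervisor:
    subagents: Dict[str, Callable[[str], str]]
    log: List[str] = field(default_factory=list)

    def route(self, request: str) -> str:
        # Toy keyword routing; a real system would likely use a classifier or an LLM.
        return "billing" if "invoice" in request.lower() else "docs"

    def handle(self, request: str) -> str:
        name = self.route(request)
        draft = self.subagents[name](request)
        self.log.append(name)          # the supervisor always knows who answered
        return apply_guardrail(draft)  # guardrails are applied in exactly one place


supervisor = Supervisor({"billing": billing_agent, "docs": docs_agent})
print(supervisor.handle("Where is my invoice?"))
```

In a peer-to-peer setup, by contrast, each agent could reply to the customer on its own, so a check like `apply_guardrail` would have to be duplicated in every agent.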
Lenny Rachitsky: Yeah. Okay. Great picks. Okay. Ash, what do you got?
Aishwarya Naresh Reganti: Can I say evals? Will I be canceled?
Lenny Rachitsky: In which category? Which bucket do they go?
Aishwarya Naresh Reganti: Overrated.
Lenny Rachitsky: Overrated. Okay, go for it. We won’t let you get canceled.
Aishwarya Naresh Reganti: Just kidding. I think evals are misunderstood. They are important, folks. I’m not saying they’re not important, but I think this idea of “I’m going to keep jumping across tools and pick up and learn every new tool” is overrated. I still am old school and feel like you really need to be obsessed with the business problem you’re trying to solve. AI is only a tool. I try to think of it that way. Of course, you need to be learning about the latest and greatest, but don’t be so obsessed with just building so quickly. Building is really cheap today. Design is more expensive. Really thinking about your product, what you’re going to build, and whether it is really going to solve a pain point is way more valuable today. And it will only become more true in the near future. So really obsessing about your problem and design is underrated and just rote building is overrated, I guess.
Lenny Rachitsky: Awesome. Okay. Similar sort of question. From a product point of view, what do you think the next year of AI is going to look like? Give us a vision of where you think things are going to go by, say by the end of 2026.
Kiriti Badam: Yeah, I feel there’s a lot of promise in terms of these background agents or proactive agents. They’re going to basically understand your workflow even more. If you think of where AI is failing to create value today, it’s mainly about not understanding the context. And the reason that it’s not understanding the context is that it’s not plugged into the right places where actual work is happening. And as you do more of this, you can give the agent more context, and then it starts to see the world around you and understand what set of metrics you’re optimizing for or what kind of activities you’re trying to do. It is a very easy extension from there to actually gain more out of it and then let the agent prompt you back. We already do this in terms of ChatGPT pulse, which kind of gives you this daily update of things you might care about.
And it’s very nice to actually have that jog your brain in terms of, “Oh, this is something that I haven’t thought about. Maybe this is good.” And now extend this to more complex tasks, like a coding agent, which says, “Okay, I have fixed five of your Linear tickets and here are the patches. Just review them at the start of your day.” So I feel that is going to be extremely useful. And I see that as a strong direction in which products are going to be built in 2026.
Lenny Rachitsky: That’s so cool. So essentially agents anticipating what you want to do and getting ahead of you and I’ve solved these problems for you or I think this is going to crash your site. Maybe you should fix this thing right here or I see the spike here and let’s refactor our database. Amazing. What a world. Okay. Ash, what do you got?
Aishwarya Naresh Reganti: I’m all in for multimodal experiences in 2026. I think we have made quite some progress in 2025, and not just in terms of generation, but also understanding. Until now, I think LLMs have been our most commonly used models, but as humans, we are multimodal creatures, I would say. Language is probably one of our last forms of evolution. As the three of us are talking, I think we’re constantly getting so many signals. I’m like, “Oh, Lenny’s nodding his head, so probably I would go in this direction, or Lenny’s bored, so let me stop talking.” So there’s a chain of thought behind your chain of thought and you’re constantly altering it. With language alone, that dimension of expression is not explored as well. So if we could build better multimodal experiences, that would get us closer to human-like conversation richness. And also, just given the kind of models, there’s a bunch of boring tasks as well, which are ripe for AI.
If multimodal understanding gets better, there are so many handwritten documents and really messy PDFs that cannot be parsed even by the best of the models as of today. And if it’s possible, there’ll be so much data that we can tap into.
Lenny Rachitsky: Awesome. I just saw Demis from DeepMind, AI, Google, whatever they call the whole org, talking about this where he thinks that’s going to be a big part of where they’re going, combining the image model work, the LLM, and also their world model stuff, Genie, I think is what it’s called. Yes. So that’s going to be a wild, wild time. Okay. Last question. If someone wants to just get better at building AI products, what’s just maybe one skill or maybe two skills that you think they should lean into and develop?
Aishwarya Naresh Reganti: I think we did cover a bunch of best practices for AI products, which is start small, try to get your iteration going well and build a flywheel and all of that. But again, if you kind of look at it at a 10,000-foot level for anybody building today, like I was saying, implementation is going to be ridiculously cheap in the next few years. So really nail down your design, your judgment, your taste and all of that. And in general, if you’re building a career as well, I feel for the past few years, your formative years, say the first two, three years of building your career, were always focused on execution, mechanics and all of that. And now we have AI that could help you ramp pretty quickly. And post that, I mean, after a few years, I think everybody’s job becomes about your taste, your judgment and kind of what is uniquely you.
I think nail down on that part and try to figure out how you can bring in that kind of a perspective. It doesn’t have to mean that you should be significantly old, have years of experience. We recently hired someone and we use this very popular app for tracking our tasks and we’ve been using it for years and we pay a high subscription fee for it. And this guy just came with his own vibe coded app to the meeting. He onboarded us to all of it and he’s like, “Okay, let’s start using this.” And I think that kind of agency and that kind of ownership to really rethink experiences is what will set people apart. And I’m not being blind to the fact that vibe coded apps have high maintenance costs. And maybe as we scale as a company, we have to replace it or we have to think of better approaches.
But given that we are a small company now and just … I was really shocked because I never thought of it. If you’ve been used to working in a certain way, you associate a cost with building. And I feel like folks who grew up in this age have a much lower cost associated in their mind. They just don’t mind building something and going ahead with it. And they’re also very enthusiastic to try out new tools. That’s also probably why AI products have this retention problem, because everybody’s so excited about trying out these new tools and all of that. But essentially it’s about having the agency and ownership, and I think it’s also going to be the end of the busy work era. You can’t be sitting in a corner doing something that doesn’t move the needle for a company. You really need to be thinking about end-to-end workflows, how you can bring in more impact.
I think all of that will be super important.
Lenny Rachitsky: That reminds me, I just had Jason Lemkin on the podcast. He’s very smart on sales and go to market, he runs SaaStr, and he replaced his whole sales team with agents. He had 10 salespeople and then he had 1.2 and 20 agents. And one of the agents was just tracking everyone’s updates to Salesforce and kind of updating it automatically for them based on their calls. And one of the salespeople was like, “Okay, I quit.” And it turned out he wasn’t really doing anything. He was just sitting around and he’s like, “Okay, this will catch me. I got to get out of here.” So your point that it’ll be harder to sit around and twiddle your thumbs, I think, is really right.
Kiriti Badam: Yeah. I think to add on to that, I feel like persistence is also something that is extremely valuable, especially given that anybody who wants to build something, the information is at your fingertips even more than the past decade. You can learn anything overnight and become that sort of Ironman kind of approach. So I feel like having that persistence and going through the pain of learning this, implementing this and understanding what works and what doesn’t work. And as you are going through this pain of developing multiple approaches and then solving the problem, I feel that is going to be the real moat as an individual. I like to call it pain is the new moat, but I feel that is exactly super useful to actually have this in, especially in building these AI products.
Lenny Rachitsky: Say more about this. I love this concept. Pain is the new moat. Is there more there?
Kiriti Badam: Yeah, I feel as a company, I mean, successful companies right now building in any new area, they are successful not because they’re first to the market or they have this fancy feature that more customers are liking. They went through the pain of understanding what the set of non-negotiable things is and trading them off exactly against the features or the model capabilities that they can use to solve that problem. This is not a straightforward process. There’s no textbook to do this, and there’s no straightforward way or well-trodden path to get here. So a lot of this pain I was talking about is just going through this iteration of like, “Okay, let’s try this and if this doesn’t work, let’s try this.” And that kind of knowledge that you build across the organization or across your own lived experiences, I feel that pain is what translates into the moat of the company. This could be a product of evals or something that you built. And I feel that is going to be the game changer.
Lenny Rachitsky: That is awesome. It’s like turning coal into a diamond.
Kiriti Badam: Yes.
Lenny Rachitsky: Okay. I feel like we’ve done a great job helping people avoid some of the biggest issues people consistently run into building AI products. We covered so many of the pitfalls and the ways to actually do it correctly. Before we get to our very exciting lightning round, is there anything else that you wanted to share? Anything else you want to leave listeners with?
Aishwarya Naresh Reganti: Be obsessed with your customers. Be obsessed with the problem. AI is just a tool, and try to make sure that you’re really understanding your workflows. 80% of so-called AI engineers and AI PMs spend their time actually understanding their workflows very well. They’re not building the fanciest and the most cool models or workflows around it. They’re actually in the weeds understanding their customers’ behavior and data. And whenever a software engineer who’s never done AI before hears the term “look at your data,” I think it’s a huge revelation to them, but it’s always been the case. You need to go there, look at your data, understand your users, and that’s going to be a huge differentiator.
Lenny Rachitsky: That’s a great way to close it. The AI isn’t the answer. It’s a tool to solve the problem. With that, we have reached our very exciting lightning round. I’ve got five questions for both of you. Are you ready?
Aishwarya Naresh Reganti: Yay. Yes.
Lenny Rachitsky: All right. So you can both answer them. You can pick one which you want to answer. Either way, up to you. What are two or three books you find yourself recommending most to other people?
Aishwarya Naresh Reganti: For me, it’s this book called When Breath Becomes Air, Lenny. It was written by Paul Kalanithi. I think he was an Indian-origin neurosurgeon who was diagnosed with lung cancer at 31 or 32. And the whole book is his memoir, written after he was diagnosed. And it’s really beautiful, especially because I read it during COVID, and all we ever wanted to do during COVID was stay alive. There are a bunch of really nice quotes within the book as well, but I remember one of them, where he was kind of arguing against a very popular quote by Socrates, which is, “The unexamined life is not worth living,” or something like that, which means you really need to be thinking about your choices, you need to understand your values, your mission and all of that. And Paul says, “If the unexamined life is not worth living, was the unlived life worth examining?” Which means, are you spending so much time just understanding your mission and purpose that you’ve forgotten to live?
And I think everybody who’s staying in the AI era and building and continuously going through the space of reinventing themselves need to take a pause and live for a bit, I guess. They need to stop evaling life too much.
Lenny Rachitsky: I was going to say that. That’s where my mind went. You got to write some evals for your life. Oh my God, we’ve gone too far.
Aishwarya Naresh Reganti: Yep. Yeah.
Lenny Rachitsky: Beautiful.
Aishwarya Naresh Reganti: That’s my favorite book.
Kiriti Badam: I like science fiction books more. So I really like this Three-Body Problem series. It’s a three-book series. It has elements grander than science fiction, life outside Earth and how it impacts the human decision-making process. And it also has elements of geopolitics and how important or valuable abstract science is to human progress. And then when that gets stopped, it’s not noticeable in everyday life, but it can cause devastating effects. So I feel like AI helping in these areas, for example, is going to be extremely crucial. And that book is a nice example of what could happen otherwise.
Lenny Rachitsky: Completely agree. Absolutely love it. Might be my favorite sci-fi book, or series even, and it’s three books. You have to read all three, by the way. I find that it only got really good about one and a half books in. So if anyone’s tried it and is like, “What the heck is going on here?” Just keep reading and get to the middle of the second one and then it gets mind-blowing.
Kiriti Badam: Yes.
Lenny Rachitsky: If you love sci-fi and you’re in AI, you got to read this book called A Fire Upon the Deep by Vernor Vinge. Check it out. It’s incredible. I saw Noah Smith recommend this book in his newsletter, and there are sequels to it, but this is the one that’s so incredible. And it turns out it’s actually about AGI and superintelligence and all these things, and it’s just so epic. And no one’s heard of it.
Kiriti Badam: Thank you.
Lenny Rachitsky: There you go. I’m giving you one back. Okay, next question. What’s a favorite recent movie or TV show that you’ve really enjoyed?
Aishwarya Naresh Reganti: I started rewatching Silicon Valley and I think it’s so true. It’s so timeless. Everything is repeating all over again. Anybody who’s watched it a few years ago should start rewatching it and you’ll see that it’s eerily similar to everything that’s happening right now with the AI wave.
Lenny Rachitsky: That’s a good idea to rewatch it. I love that their whole business was like an algorithm to compress, like a compression algorithm. It’s like maybe a precursor to LLMs in some small way. No, I get it. All right, Kiriti, what you got?
Kiriti Badam: I’m going to drag this and say not a movie or a TV show, but there’s this game I picked up recently called Expedition 33. It has nothing to do with AI, but it’s an incredibly well-made game in terms of the gameplay or the movie and the story and the music. It’s been amazing.
Lenny Rachitsky: I love that you have time to play games. That’s a great sign. I love that. Someone at OpenAI, I’m just imagining you’re … There’s nothing else going on except just coding and having meetings.
Kiriti Badam: Yeah, it has been incredibly hard to find time for that.
Lenny Rachitsky: That’s good. That’s a good sign. I’m happy to hear this. Okay. What’s a favorite product that you’ve recently discovered that you really love?
Aishwarya Naresh Reganti: For me, it’s Wispr Flow. I think I’ve been using it quite a bit and I didn’t know I needed it so much. The best part is it’s a contextual transcription tool, which means if you go to Codex and start using Wispr Flow, it starts identifying variables and all of that. And it’s so seamless in terms of switching from transcription to instruction. You could say something like, “I’m so excited today. Add three exclamation marks,” and it seamlessly switches. It adds those three exclamation marks instead of writing “add three exclamation marks.” And I think it’s pretty cool. If you’re not using it, you should try it.
Lenny Rachitsky: I’ll do a plug. Get Wispr Flow for free for an entire year by becoming an annual subscriber of my newsletter.
Aishwarya Naresh Reganti: That’s how I got access to it, Lenny.
Lenny Rachitsky: There we go. I think I pitched this deal. I think people don’t truly understand how incredible this is. They’re like, “No way this is real.” It’s real. And 18 other products, lennysproductpass.com, check it out. Moving on. Kiriti.
Kiriti Badam: Awesome. I actually am a stickler for productivity. I keep experimenting with new CLI tools and things which can make me faster. So I feel like Raycast has been amazing. I’ve discovered all these new shortcuts that you can use to open different things, type in shortcut commands and things like that. And Caffeinate is another thing that I’ve recently discovered from my teammates. It helps you prevent your Mac from sleeping so you can run a really long Codex task for four or five hours locally, let it build the thing, and then you can wake up and be like, “Okay, this is good. I like this.”
Lenny Rachitsky: That’s hilarious, that combo. Codex and Caffeinate. You guys need to build that yourself, an OpenAI version of that, or the Codex agent should just keep your Mac from sleeping. That’s so funny. By the way, Raycast is also part of Lenny’s product pass. One year free of Raycast. Amazing. Yeah.
Aishwarya Naresh Reganti: Lenny didn’t tell us these folks. Yes. These are actually our favorite products.
Lenny Rachitsky: These are just two of 19 products. No Caffeinate though. I don’t know if that’s even paid. Okay, let’s keep going. Do you have a favorite life motto that you find yourself coming back to in work or in life?
Aishwarya Naresh Reganti: For me, I think this is one my dad told me when I was a kid and it’s always stuck, which is: they told him it couldn’t be done, but the fool didn’t know it, so he did it anyway. I think be foolish enough to believe that you can do anything if you put your heart into it, especially now, because you have so much data at hand that could be pointing towards the fact that you probably will be unsuccessful. How many podcasts made it to more than a thousand subscribers, or how many companies hit more than one million ARR? There’s always data to show you that you won’t be successful, but sometimes just be foolish and go ahead with it.
Lenny Rachitsky: That’s great. Yeah.
Kiriti Badam: For me, I am more of an overthinker. So I really like this quote from Steve Jobs that you can only connect the dots looking backwards. So a lot of the times there are numerous choices and you don’t really know the optimal one to pick, but life works in ways that you can actually see back and be like, “Oh, these are actually beautiful in terms of how our transition.” So I feel like that is extremely useful in keep moving forward, keep experimenting.
Lenny Rachitsky: Final question. Whenever I have two guests on the podcast at once, I like to ask this question. What’s something that you admire about the other person?
Aishwarya Naresh Reganti: I think with Kiriti, he’s pretty calm and very grounded and he’s always been my sounding board. I can throw a ton of ideas at him and he’s always able to anticipate the kind of issues that I might land into. And he’s extremely kind and lets his work speak instead of actually doing a lot of talking, I guess. But if I had to pick one, I think he’s the most incredible husband.
Lenny Rachitsky: Reveal. Little did people know.
Aishwarya Naresh Reganti: We’ve been married for four years and been the most beautiful four years of my life.
Lenny Rachitsky: Wow. Okay. How do you follow that?
Kiriti Badam: Yeah, it’s super hard to follow that. I would say I am extremely privileged in terms of working with really smart people in great companies in Silicon Valley. And I feel the unique thing that stands out with Aishwarya, compared with any other smart folks I’ve worked with, is that she has this really amazing knack for teaching and explaining something in a very understandable and easy-to-comprehend way. And that combined with persistence is super useful, especially in this fast-moving AI world that we are in, in the sense that there are so many new things coming up. It feels overwhelming, but when I hear her talk about how you make sense of this entire thing and where it plugs in, I feel like, oh, that is so simple. I can also do that. So she empowers a lot of people by simplifying things and explaining things in the most understandable way.
So I feel that is an incredible quality.
Lenny Rachitsky: Amazing. How sweet. I got to do this all the time. I need more guests to do it. That was great. Okay. Final questions. Where can folks find stuff that you’re working on, find you online, share your course link, and then just how can listeners be useful to you?
Aishwarya Naresh Reganti: I write a lot on LinkedIn. So if you want to listen to pragmatists who’ve been in the weeds, working on AI products, and what they’re seeing, you can follow my work. We also have a GitHub repository with about 20K stars, and that repository is all about good resources for learning AI. It’s completely free. And if you like what we spoke about today, we also run a super popular course on building enterprise AI products. We’ll leave a link to it. The course is a lot about unlearning mindsets and following a problem-first approach instead of a tool-first or a hype-first approach. So you can check that out as well. And if you don’t want to do the course, we write a lot, we give out a lot of free resources, we have free sessions, so make sure you follow our work.
Kiriti Badam: Yeah, I would also add that you can find me on LinkedIn. I don’t write a lot, I guess, but I’m super excited to just talk about any complex product that you’re building. And if you have thoughts on how you can use coding agents to make your life better, or about the problems that you’re seeing, my DMs are always open and we can have a great discussion.
Lenny Rachitsky: Awesome. Well, Kiriti and Ash, thank you so much for being here.
Kiriti Badam: Thank you so much.
Aishwarya Naresh Reganti: Thank you, Lenny. This was so much fun.
Lenny Rachitsky: So much fun. Bye, everyone.
Thank you so much for listening. If you found this valuable, you can subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Also, please consider giving us a rating or leaving a review as that really helps other listeners find the podcast. You can find all past episodes or learn more about the show at lennyspodcast.com. See you in the next episode.
Glossary
| English | 中文 |
|---|---|
| A/B test | A/B 测试 |
| agency control trade-off | 自主性与控制权的权衡 |
| agent traces | agent 运行轨迹 |
| agentic systems | agentic 系统 |
| Aishwarya Naresh Reganti | Aishwarya Naresh Reganti(人名) |
| Alex | Alex(人名) |
| ARR | ARR(年度经常性收入) |
| Artificial Analysis | Artificial Analysis(benchmark 平台) |
| background agents | 后台 agent |
| behavior calibration | 行为校准 |
| benchmark | benchmark |
| Boris | Boris(人名) |
| Caffeinate | Caffeinate(macOS 防休眠命令) |
| ChatGPT pulse | ChatGPT pulse(ChatGPT 的主动提醒功能) |
| CI/CD | CI/CD(持续集成/持续部署) |
| CLI | CLI(命令行界面) |
| context engineering | 上下文工程 |
| continuous calibration | 持续校准 |
| continuous development | 持续开发 |
| Copilot | Copilot(辅助模式) |
| Dan Shipper | Dan Shipper(人名) |
| evals | evals(评估体系) |
| flywheel | 飞轮 |
| FOMO | FOMO(错失恐惧) |
| Gagan | Gagan(人名) |
| gossip protocol | 八卦协议 |
| guardrails | 护栏 |
| Half Dome | 半圆顶(优胜美地地标) |
| hallucinating | 幻觉 |
| hot fix | 热修复 |
| human in the loop | 人在回路 |
| jailbreaking | 越狱攻击 |
| Jason Lemkin | Jason Lemkin(人名) |
| Kiriti Badam | Kiriti Badam(人名) |
| Lenny Rachitsky | Lenny Rachitsky(人名) |
| Linear tickets | Linear(项目管理工具)中的工单 |
| LLM judge | LLM 裁判器 |
| LM Arena | LM Arena(benchmark 平台) |
| low hanging fruits | 低垂的果实 |
| Matei Zaharia | Matei Zaharia(人名) |
| moat | 护城河 |
| multi-agents | 多 agent |
| multimodal | 多模态 |
| Noah Smith | Noah Smith(人名) |
| non-determinism | 非确定性 |
| online monitoring | 线上监控 |
| Paul Kalanithi | Paul Kalanithi(人名) |
| playbook | playbook(成熟方法论/操作手册) |
| PM | PM(产品经理) |
| PRD | PRD(产品需求文档) |
| pre-authorization | 预授权 |
| proactive agents | 主动型 agent |
| problem first | 问题优先 |
| production monitoring | 生产环境监控 |
| prompt | prompt |
| prompt injection | prompt 注入 |
| Rackspace | Rackspace(公司名) |
| Raycast | Raycast(效率启动器工具) |
| ROI | ROI(投资回报率) |
| routing | routing(路由/分发) |
| semantic diffusion | 语义扩散 |
| standard operating procedures | 标准操作流程 |
| subagents | 子 agent |
| subject matter experts | 领域专家 |
| supervisor agent | 监督者 agent |
| tech debt | 技术债 |
| underwriters | 承保人 |
| Vernor Vinge | Vernor Vinge(人名) |
| vibe coding | vibe coding |
| Wispr Flow | Wispr Flow(语音转录工具) |
Why most AI products fail: Lessons from 50+ AI deployments at OpenAI, Google & Amazon
Aishwarya Naresh Reganti + Kiriti Badam
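Transcript
Lenny Rachitsky: We wrote a guest post together. They had a really key insight: building AI products is very different from building non-AI products.
Aishwarya Naresh Reganti: Most people tend to ignore the non-determinism. You don’t know how the user will interact with your product, and you don’t know how the LLM will respond. The second difference is the agency control trade-off. Every time you hand decision-making capabilities over to agentic systems, you’re relinquishing some amount of control on your end.
Lenny Rachitsky: This significantly changes the way you build product.
Kiriti Badam: So we recommend building step by step. Starting small forces you to think about what problem you are actually trying to solve. With all the advancements in AI, one easy slippery slope is to keep dwelling on the complexity of the solution and forget the problem you set out to solve.
Aishwarya Naresh Reganti: It’s not about being the first company among your competitors to have an agent. It’s about whether you have built the right flywheels so that you can keep improving over time.
Lenny Rachitsky: What ways of working do you see in companies that build AI products successfully?
Aishwarya Naresh Reganti: I used to work with the current CEO of Rackspace. He had a block on his calendar every morning that said “Catching up with AI: 4:00 to 6:00 AM.” Leaders have to get back to being hands-on. You have to accept that your intuitions might not be right, that you are probably the least informed person in the room, and that you need to learn from everyone.
Lenny Rachitsky: What do you think the next year of AI will look like?
Kiriti Badam: Persistence is extremely valuable. Companies that are succeeding in any new area right now are going through the pain of learning these technologies, implementing them, and figuring out what works and what doesn’t. Pain is the new moat.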
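Meet the Guest
Lenny Rachitsky: Today’s guests are Aishwarya Reganti and Kiriti Badam. Kiriti works on Codex at OpenAI and has spent the last decade building AI and ML infrastructure at Google and Kumo. Ash was an early AI researcher at Alexa and Microsoft and has published over 35 research papers. Together they have led and supported more than 50 AI product deployments across companies like Amazon, Databricks, OpenAI, and Google, as well as startups and large enterprises. They also teach the top-rated AI course on Maven together, where they teach product leaders all the key lessons they have learned about building successful AI products. The goal of this episode is to save you and your team a lot of pain, suffering, and wasted time, and to help you avoid detours when building AI products. Whether you are already struggling to get your product working or want to avoid that struggle, this episode is for you.
(Sponsor ads skipped)
Lenny Rachitsky: Ash and Kiriti, thank you so much for being here, and welcome to the podcast.
Aishwarya Naresh Reganti: Thank you, Lenny.
Kiriti Badam: Thank you for having us. Really looking forward to this conversation.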
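State of AI Products
Lenny Rachitsky: Let me set the stage for today’s conversation. The two of you have built a lot of AI products yourselves, and you have worked closely with many companies that are building AI products, struggling along the way, and trying to build AI agents. You also teach a course on Maven about building AI products successfully, and your mission is to reduce the pain, frustration, and failure that people go through again and again when building AI products. So, to set a foundation for our conversation, what are you seeing on the ground? Among companies trying to build AI products, what is going well and what is not?
Aishwarya Naresh Reganti: I think 2025 is significantly different from 2024. First, the skepticism has dropped a lot. Last year there were many leaders who probably thought this was just another crypto-style wave and were on the fence about getting in. And most of the use cases I saw last year were more like “slap GPT on your data” and call it an AI product. This year, a lot of companies are really rethinking their user experiences and workflows, and really realizing that you need to deconstruct and reconstruct your processes in order to build successful AI products. That’s the good part. The bad part is that execution is still all over the place. Think about it: this is a field that is only three years old. There are no mature playbooks, no textbooks. You really have to figure it out as you go. And the AI lifecycle, both pre-deployment and post-deployment, is very different from the traditional software lifecycle.
So a lot of the established division of labor and handoffs between traditional roles, such as product managers, engineers, and data folks, has now been broken, and people are trying to adapt to this new way of collaborating, where in a sense everyone jointly owns the same feedback loop. In the past, product managers, engineers, and everyone else each had their own feedback loop to optimize. Now you probably need to sit in the same room, look at agent traces together, and decide together how the product should behave. It’s a much tighter form of collaboration. So companies are still figuring out this new way of working. That is what I have been seeing in my consulting work this year.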
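Two Key Differences From Traditional Software
Lenny Rachitsky: Let’s follow that thread. We collaborated on a guest post a few months ago, and the thing that stayed with me most from that post, and that I still remember vividly, is this really key insight: building AI products is very different from building non-AI products. One thing you emphasized in particular is that there are two fundamental differences. Talk about those two differences.
Aishwarya Naresh Reganti: Sure. Again, I want to make sure we convey the right message. There are a lot of similarities between building AI systems and software systems, but there are also some things that fundamentally change how you build AI systems compared with software systems. One that most people tend to overlook is non-determinism. You are basically working with a non-deterministic API, rather than the deterministic APIs of traditional software. What does that mean, and why does it affect us? In traditional software, you basically have a very well-defined decision engine or workflow. Take Booking.com: you have an intent, say booking a hotel in San Francisco for two nights. The product has been designed to turn your intent into a specific action, and you click through a series of buttons, options, forms, and so on to finally achieve your intent.
But in AI products, that middle layer has been completely replaced by a very fluid interface, mostly a natural language interface. That means users can express and communicate their intent in countless ways. That changes a lot, because you don’t know how users will show up in your product. That is the input side. On the output side, you are dealing with a non-deterministic, probabilistic API, which is your LLM. LLMs are very sensitive to how a prompt is phrased, and they are basically black boxes. You don’t even know what the output will look like. So you don’t know how the user will use your product, and you don’t know how the LLM will respond. You are now dealing with three things, the input, the output, and the process, and your understanding of all three is limited. You are trying to anticipate behavior and build for it.
And with agentic systems, this gets even harder. That is the second difference we want to talk about, the agency control trade-off. What we mean is, and I’m shocked that so many people don’t talk about this: everyone is extremely obsessed with building autonomous systems, agents that do the work for you. But every time you hand decision-making capabilities or autonomy over to an agentic system, you are relinquishing some amount of control. And when you do that, you need to make sure your agent has earned your trust, or is reliable enough, before you allow it to make decisions. That is what we call the agency control trade-off: if you give your AI agent or AI system, whatever it is, more agency, meaning the ability to make decisions, you are also losing some control. You need to make sure the agent or AI system has earned that ability, or has built up trust over time.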
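Starting With Low Autonomy and Evolving Gradually
Lenny Rachitsky: So to summarize what you just shared: essentially, people have been building products, software products, for a long time. And now we are entering a new world where the software you build is, first, non-deterministic and may do something different each time. As you said, when you go to Booking.com to find a hotel, the experience is the same every time. You might see different hotels, but it is a predictable experience. With AI products, you can’t predict that it will behave exactly as you planned every time. The second point is that there is a trade-off between agency and control: how much the AI does for you versus how much the human should still control. One core takeaway I’m hearing is that this fundamentally changes the way you build products. Next we will talk about how it should affect the product development lifecycle.
Before we get into that, is there anything you would add?
Kiriti Badam: Yes. This really is one of the key points: you need to have this distinction in mind when you start building. For example, say your goal is to summit Half Dome in Yosemite. You don’t start by climbing it every day from day one. You train yourself on smaller stretches first, improve slowly, and eventually reach the ultimate goal. I think building AI products is extremely similar. You don’t give an agent all the tools and all the context in your company on day one and expect it to work, or even try to debug it at that level. You need to deliberately start where the impact is minimal and human control is highest, so that you get a firm grip on the current capabilities, what they can do and what you can do with them, and then slowly lean toward more autonomy and less control.
This gives you a kind of confidence: okay, I know what specific problem I’m facing and how far AI can go in solving it. Then I can think about what context to bring in and what kinds of tools to add to improve the experience. So I think this is both good and bad. The good part is that you don’t have to face the full complexity of the fancy AI agents out there right away and feel that you can’t do it. Everyone starts from very minimalistic structures and evolves step by step. The bad part, or rather the flip side, is that when you are trying to build a one-click agent for your company, you don’t need to be overwhelmed by that complexity; you can level up gradually.
This is extremely important. And we see this pattern repeat again and again.
Lenny Rachitsky: Okay, let’s go deeper on that, because this is a really key component of what you recommend to people building AI products, AI agents, and all kinds of AI projects. Give us an example of what you mean by starting with low autonomy and high control and then climbing up gradually.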
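Progressive Path in Customer Support
Kiriti Badam: For example, one very typical, very common use case for AI agents is customer support. Say you are a company with a large volume of support tickets. You don’t even have to imagine it; OpenAI itself is in this situation. When we launch products, such as Image or GPT-5, and those successful products go live, there is a huge spike in support tickets. The kinds of questions you get vary, and customers bring all sorts of problems. So it’s not as simple as dumping all the help center articles into an AI agent and calling it done. You need to understand what you can build up step by step.
So the first step would probably be this: you still have your support agents, human support agents, but the system offers suggestions on the side, saying, for instance, that the AI thinks this is the right way to handle it. Then you get a feedback loop from the humans: in this case this was a good suggestion; in that case it was a bad one. Then you can go back and analyze where the flaws are, where the blind spots are, and how to fix them. Once you have that nailed down, you can increase the autonomy: instead of only making suggestions to a human, show the answer directly to the customer. After that we can add more complexity: previously it only answered questions based on help center articles, and now we can add new capabilities, so that it can actually issue refunds for customers, file feature requests with the engineering team, and so on. If you cram all of that in on day one, controlling the complexity becomes extremely hard. So we recommend going step by step and building up gradually.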
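The progression Kiriti describes for customer support (suggest to a human, then answer the customer, then take actions) could be encoded as an explicit autonomy level that gates what the system may do. What follows is a minimal sketch under assumptions: `Autonomy`, `Outcome`, and `handle_ticket` are hypothetical names, the draft reply is passed in as a string rather than generated by a model, and refunds stand in for any higher-risk action.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Autonomy(IntEnum):
    SUGGEST = 1  # AI drafts; a human support agent decides and replies
    ANSWER = 2   # AI replies to the customer directly from help-center content
    ACT = 3      # AI may also take actions, such as issuing a refund


@dataclass
class Outcome:
    sent_to_customer: Optional[str]      # reply delivered without human review
    suggestion_for_human: Optional[str]  # draft handed to a human agent
    action_taken: Optional[str]          # side effect performed by the AI


def handle_ticket(draft: str, wants_refund: bool, level: Autonomy) -> Outcome:
    """Gate what the AI may do with its draft according to the autonomy level."""
    if level == Autonomy.SUGGEST:
        return Outcome(None, draft, None)
    if wants_refund and level < Autonomy.ACT:
        # Actions are not unlocked yet, so escalate to a human with the draft.
        return Outcome(None, draft, None)
    action = "refund_issued" if wants_refund else None
    return Outcome(draft, None, action)


print(handle_ticket("Here is how to reset your password.", False, Autonomy.ANSWER))
```

Raising the level then becomes a deliberate configuration change made after reviewing how often humans accepted the suggestions, rather than something built in from day one.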
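Lenny Rachitsky: Great. You also have a diagram, which we'll share, showing what this looks like. To reflect back what you described, the idea of starting with high control and low agency: in your example the AI only makes suggestions on the side and can't take any action, and the human is in charge. Then, as it proves useful and you gain confidence in what it does, you give it a bit more agency and pull back on human control accordingly. If things keep going well, you give it even more agency and need less and less human control. Great.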
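Behavior Calibration and Constraining Autonomy
Aishwarya Naresh Reganti: I think the higher-level idea is that AI systems are all about behavior calibration. It's almost impossible to predict up front how the system will behave. So what do you do? You make sure you don't break the customer or end-user experience, you hold that constant, while gradually reducing the amount of human control. And there's no single right way to do it; you decide how to constrain autonomy.
Another example of constraining autonomy is pre-authorization. Insurance pre-authorization is a very ripe use case for AI, because clinicians spend a lot of time pre-authorizing things like blood tests and MRIs. Some of these cases are low-hanging fruit: MRIs and blood tests are relatively easy to approve once you have the patient's information, so AI can handle them. Others, such as invasive surgery, are higher risk, and you don't want AI to handle them autonomously.
So you can determine which use cases need to go through a human review step and which ones AI can comfortably handle. Throughout the process you're also logging what the humans do, because you want to build a flywheel that keeps improving the system. In essence, you haven't hurt the user experience or eroded trust, and you've recorded what humans would have done, which lets you keep improving your system.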
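To make the pre-authorization example concrete, here is a minimal sketch of risk-based gating with logging. The request type names, the `AUTO_APPROVABLE` set, and the `ReviewLog` class are hypothetical illustrations, not anything described by the speakers; which request types count as low risk is an assumption you would settle with subject matter experts.

```python
from dataclasses import dataclass, field

# Assumption: these request types are the "low-hanging fruit" the AI may handle alone.
AUTO_APPROVABLE = {"blood_test", "mri"}


@dataclass
class ReviewLog:
    """Keeps every human decision so it can feed later calibration."""
    records: list = field(default_factory=list)

    def record(self, request_type: str, ai_suggestion: str, human_decision: str) -> None:
        self.records.append(
            {"type": request_type, "ai": ai_suggestion, "human": human_decision}
        )

    def disagreements(self) -> list:
        """Cases where the reviewer overrode the AI suggestion."""
        return [r for r in self.records if r["ai"] != r["human"]]


def route_request(request_type: str) -> str:
    """Return 'ai' for low-risk request types and 'human_review' for everything else."""
    return "ai" if request_type in AUTO_APPROVABLE else "human_review"
```

Unknown request types fall through to human review by default, which keeps the conservative behavior the passage argues for.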
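More Examples of Progressive Rollouts
Lenny Rachitsky: Let me give a few more examples of the progressions you recommend. I'm spending this much time here because it's such a key part of your advice for building more successful AI products: start slowly with high control and low agency, and step up once you've built confidence that the system behaves correctly.
Here are a few more examples from your post. Say you're building a coding assistant: V1 only suggests inline completions and boilerplate snippets; V2 generates larger blocks, such as tests or refactors, for a human to review; V3 applies changes directly and opens PRs autonomously. Or take a marketing assistant: V1 drafts emails or social copy, as in "here's what I would do"; V2 builds multi-step campaigns and runs them; V3 launches A/B tests directly and auto-optimizes campaigns across channels. Great.
So, to summarize the advice so far. First, it's important to understand that AI products are different. They're non-deterministic. You pointed out, and I forgot to reflect this back earlier, that they're non-deterministic on both the input and the output side. The user experience is non-deterministic. Different people see different things, different outputs, different conversations, maybe even different interfaces if it's designing the UI for you. The output is obviously non-deterministic too. So that's a real challenge. And then...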
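Aishwarya Naresh Reganti: But if you look at it from another angle, that's also the most beautiful part of AI. We're all far more comfortable communicating in natural language than clicking a bunch of buttons. So the bar for using AI products is much lower; you can use them as naturally as you'd talk to a person. But that's exactly where the problem lies too. There are countless ways we communicate, and you need to make sure the intent is conveyed correctly and the right actions are taken, because most of your systems are deterministic. You want a deterministic outcome, but you're using a non-deterministic technology, and that's where things get a bit messy.
Lenny Rachitsky: Great, I love that optimistic take on why it's a good thing. The other point is designing around the agency-control trade-off. I'm guessing what you see in practice is that people try to jump straight to the ideal state, straight to V3, and then things go wrong. For one, it's much harder to build, and for another, it just doesn't work. Then they conclude, "Okay, this is a failure, what are we even doing?"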
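Kiriti Badam: Exactly. I think you need to build confidence in a lot of things before you reach V3. And it's easy to get overwhelmed: your AI agent is doing the wrong thing in a hundred different places, and you can't possibly list them all and fix them one by one. Even if you've mastered evaluation practices and so on, if you start from the wrong place, correcting things from there is very hard. When you start small, building a minimal version with high human control and low agency, it also forces you to think about what problem you're actually solving. We use the term "problem first." To me it seems obvious, of course I need to think through the problem first, but it's surprising how deeply this resonates with people. With all the advancements we're seeing in AI, one slippery slope is to keep thinking about the complexity of the solution and forget the problem you're actually trying to solve.
Kiriti Badam: So when you start at a smaller scale of autonomy, you really think: what is the problem I'm solving, and how do I break it down into levels of autonomy that I can build up later? That's extremely useful, and we keep repeating it to everyone we talk to.
Lenny Rachitsky: There are many other benefits to limiting autonomy, because there's real risk: the system might do too much on your behalf and mess up your database, or send out a pile of emails you never meant to send. There are so many reasons to do this.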
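Aishwarya Naresh Reganti: Right. I recently read a paper by several researchers at UC Berkeley, mainly Matei Zaharia and some colleagues at Databricks. It found that roughly 74% to 75% of the enterprises they surveyed named reliability as their biggest problem. That's also why they're reluctant to deploy products to end users or build customer-facing products: they can't be sure, and they aren't comfortable exposing users to those risks. It also explains why many of today's AI products are about productivity gains, since those products have lower autonomy, rather than end-to-end agents that replace an entire workflow. I really like their work overall, and I think it matches closely what I've observed in my own startup.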
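Lenny Rachitsky: Okay, very interesting. There's another episode coming out before this one where we go deep on another problem this approach helps you avoid: prompt injection and jailbreaking, and how big a risk they are for AI products. It may be an unsolved and possibly unsolvable problem. I won't go into it here, but it was a fairly unsettling conversation, and it will be out before this one.
Aishwarya Naresh Reganti: I think this will be a huge problem once these systems go mainstream. Right now we're so busy building AI products that we haven't gotten around to worrying about security, but it's going to be a huge problem, especially combined with this non-deterministic API. You're in a bind: lots of instructions can be injected into your prompt, and then things go very badly.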
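Prompt Injection and Security Risks
Lenny Rachitsky: Okay, let's stay here a little longer, because I find this topic really interesting and few people are talking about it. The core of that conversation was that it's fairly easy to get AI to do things it shouldn't. People deploy all kinds of guardrail systems, but it turns out the guardrails aren't very effective and there's always a way around them. And as you said, as agents become more autonomous, and even involve robots, the fact that you can get AI to do things it shouldn't is pretty scary.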
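Kiriti Badam: I do think it's a problem, but on the current spectrum of enterprise AI adoption, in terms of how much companies can actually benefit from AI to improve or optimize their existing processes, I think we're still at a very early stage. 2025 was an extremely busy year for AI agents and for enterprises trying to adopt AI, but I don't think penetration is high enough yet for companies to take full advantage of it. So as long as you place human-in-the-loop checkpoints in the right spots, I think we can avoid a lot of these problems and focus more on process optimization. I'm on the optimistic side, meaning you should try to adopt and apply it first rather than dwell on the negatives of what might go wrong. I firmly believe companies should adopt this technology. Among the companies we've worked with, it has never been "AI can't help me with this"; it has always been "there's this set of things it can optimize for me, so let me figure out how to put it into practice."
Lenny Rachitsky: Great, I always love the optimistic take. I'm curious what you'll think after listening to that episode; it really is interesting. And as you said, there's a lot to pay attention to, and this is just one of many things to worry and think about. Okay, back to the main thread. We've shared a lot of practical advice and important lessons. What other patterns and ways of working have you seen at companies that do this well, teams that build AI products successfully? And what are the most common pitfalls people fall into? Let's start with the companies doing it well: what else do they do?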
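Key Ingredients for Building Successful AI Products
Aishwarya Naresh Reganti: I almost picture it as a success triangle with three dimensions, and it's never only a technical problem. Every technical problem is a people problem first. The successful companies we've worked with have all three: great leaders, a good culture, and technical strength. On leaders: we've worked with many companies on AI transformation, training, strategy, and so on. I sense that many leaders built up deep business intuitions over the past ten to fifteen years and are highly regarded for them. But with AI here, those intuitions have to be rebuilt, and leaders need the courage to do that. I used to work with Gagan, who is now the CEO of Rackspace. He had a block on his calendar every morning that said "catching up with AI, 4:00 to 6:00 AM," with no meetings scheduled in it. That was his time to listen to the latest AI podcasts and catch up on the latest information. He would also do things like vibe coding on weekends. So I think leaders have to get hands-on again. That isn't because they need to implement the technology themselves; it's to rebuild their intuitions. You have to accept that your intuitions might be wrong, that you're probably the least knowledgeable person in the room, and that you need to learn from everyone. I see this as a very clear differentiator among companies that build successfully, because it creates a top-down push. It's almost impossible to drive this bottom-up: if leaders don't trust the technology, or have unrealistic expectations of it, a group of engineers can't get buy-in from them.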
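I hear from many people doing the actual building that their leaders simply don't understand how far AI can go on a specific problem, or they vibe code something themselves and assume shipping it is easy. You really need to understand the range of problems AI can solve today in order to make the right decisions in guiding your company. The second dimension is the culture itself. Again, many of the enterprises I work with aren't AI-first; they need to bring AI into their processes, either because competitors are doing it or because there really are very ripe use cases. But along the way, I think many companies develop a culture of FOMO and "you're going to be replaced," and people become very afraid. Yet subject matter experts are a key part of building effective AI products, because you need to consult them to understand how your AI is behaving and what the ideal behavior should be.
Aishwarya Naresh Reganti: But I've also worked with companies where the subject matter experts simply don't want to talk to you, because they feel their jobs are being replaced. So my point is that this comes back to the leaders. You want a culture of empowerment, where AI is woven into your own workflow so that you can 10x what you do, instead of telling employees "if you don't adopt AI, you may be replaced." That empowering culture always helps. You want the whole organization in it together, making AI work for you, rather than everyone guarding their own job. AI also opens up more opportunities than before, so your employees can do far more than they used to and 10x their productivity. The third dimension is the technical part we usually talk about.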
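I think successful teams are almost obsessive about understanding their workflows deeply, and they know which parts are good candidates for AI augmentation and which need a human in the loop. Whenever you try to automate part of a workflow, it's never the case that a single AI agent solves everything. Usually a machine learning model handles one part and deterministic code handles another. So you really need a deep understanding of the workflow to choose the right tool for the problem, instead of being fixated on the technology itself. Another pattern I see is that they truly understand what it means to work with a non-deterministic API, your LLM. That means they also understand that the AI development lifecycle looks very different, and they iterate very quickly: can we iterate fast enough and gather enough data to evaluate behavior without breaking the customer experience?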
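So they're able to build that flywheel very quickly. Right now, it's not about being the first company among your competitors to ship an agent; it's about whether you've built the right flywheels so you can improve over time. When someone comes to me and says, "We have this one-click agent; deploy it in your system and it will show significant gains in two or three days," I'm almost always skeptical, because that just isn't possible. It's not that the models aren't good enough; it's that enterprise data and infrastructure are very messy, and you need some time. Even the agent itself needs some time to understand how these systems work. There are messy taxonomies everywhere, people tend to name things like "get customer data - the one we want" and "get customer data - the one we built," functions like these all exist and are all being called, and there's basically a lot of tech debt to deal with.
So most of the time, if you focus on the problem itself and know your workflows very well, you'll know how to improve your agent over time, instead of slapping an agent on and expecting it to work from day one. I'd go as far as saying that if someone is selling a one-click agent, it's pure marketing, and you shouldn't buy into it. I'd rather pick a company that says "we'll build this pipeline for you," one that learns over time and builds a flywheel for continuous improvement, than something that claims to work out of the box. Replacing any critical workflow, or building a product that delivers significant ROI, takes at least four to six months of work, even if you have the best data layer and infrastructure layer.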
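Lenny Rachitsky: That was fantastic. A lot of it resonates with other conversations I've had on this podcast. One point is that for a company to see real results with AI, the founder or CEO has to be deeply involved. I had Dan Shipper on the podcast, and his team helps many companies adopt AI. He said the number one predictor of success is whether the CEO talks to ChatGPT, Claude, and similar tools many times a day. I love your example of the Rackspace CEO catching up on AI news every morning. I was picturing him talking to a chatbot rather than reading the news.
Aishwarya Naresh Reganti: With the amount of information out there today, you really can... I mean, you also want to pick the right channels, because everyone has an opinion, so whose do you trust? I think having a set of high-quality sources really matters. He would stick to two or three sources, then come back with a bunch of questions and discuss them with a group of AI experts to see what they thought. I was part of that group, so I have some idea...
Lenny Rachitsky: I love that.
Aishwarya Naresh Reganti: ...of the kinds of questions he'd raise.
Lenny Rachitsky: So cool.
Aishwarya Naresh Reganti: It really is. I asked him, "Why do you do so much of this?" and he said, "It trickles down into a lot of the decisions we make."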
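The Evals Debate
Lenny Rachitsky: Okay. Let me turn to another topic, one that's been very hot on this podcast and for a while on Twitter too: evals. Many people are fascinated by evals and see them as the answer to a lot of AI problems. Many others think evals are overrated, that you don't need them, that you can go on vibes and everything will be fine. What's your take on evals? How far do they get you with the problems you've been describing?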
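Kiriti Badam: Looking at the community right now, I think there's a false dichotomy: either evals solve everything, or online monitoring or production monitoring solves everything. I see no reason to believe either extreme and bet my entire application on one or the other. Step back and think about what evals are. Evals are essentially your trusted product thinking, your knowledge of the product, turned into a set of datasets: this matters to me, these are mistakes my agent must not make, so let me build a set of data and make sure we perform well on it. With production monitoring, you deploy the application and use key metrics to get feedback on how customers are actually using your product.
You can deploy any agent, but if a customer gives an interaction a thumbs-up, you definitely want to know. That's what production monitoring does. This kind of monitoring has existed in products for a long time; it's just that with AI agents you need to monitor at a much finer granularity. Customers don't always give you explicit feedback, and there's a lot of implicit feedback you can pick up. In ChatGPT, for example, if you like an answer you can give it a thumbs-up; if you don't, sometimes you won't give a thumbs-down but will simply regenerate the answer. That's a clear signal that the initial answer didn't meet the customer's expectations. Those are the implicit signals you always need to watch.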
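The explicit and implicit signals described above can be sketched as a simple trace filter. This is only an illustration: the `Interaction` fields and the rule in `needs_review` are assumptions, and a real product would choose its own signals and thresholds.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    # Hypothetical fields; real telemetry will look different.
    trace_id: str
    thumbs_up: bool = False
    thumbs_down: bool = False
    regenerated: bool = False


def needs_review(i: Interaction) -> bool:
    """Flag a trace on explicit negative feedback or an implicit 'regenerate' signal."""
    return i.thumbs_down or (i.regenerated and not i.thumbs_up)


def traces_to_review(interactions: list) -> list:
    """Return the trace ids worth a human look, preserving input order."""
    return [i.trace_id for i in interactions if needs_review(i)]
```

The point of the filter is triage: at high volume nobody can read every trace, so signals like these decide which ones get read.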
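How Production Monitoring and Evals Relate
Kiriti Badam: And the coverage of production monitoring keeps expanding. Now let's return to the original question: are evals what matter, or is production monitoring what matters? Does it matter? I think we're back to "problem first": what are you actually trying to build? You want to build a reliable application for your customers, one that doesn't do bad things and always does the right thing. Or, if it does do something wrong, one where you get alerted quickly. I split this into two parts. First, nobody deploys an application without testing it. That testing may be a manual run-through, or it may be "okay, I have these 10 questions that must never break no matter what I change, so I'll build a test set for them and call it an evaluation dataset." Then, once you've built that dataset and deployed the application, you need to find out whether it's actually doing the right thing.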
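If you're a high-throughput or high-transaction-volume customer, you can't realistically review every trace. You need some indicators that tell you what's worth looking at. That's where production monitoring comes in: you can't predict every way the agent might go wrong, but all those implicit and explicit signals tell you which traces to look at. Once you have those traces, you need to examine the failure patterns you see across different types of interactions. Is there something I care about deeply that must never happen? If that kind of failure pattern does show up, I need to think about building an evaluation dataset for it.
Say I build an evaluation dataset that tests whether the agent offers a refund when it has been explicitly told not to. I build that dataset, change the tools or the prompt or something else, and deploy the second version of the product. But there's no guarantee that this is the only problem you'll run into. You still need production monitoring to catch the different kinds of problems that may come up. So I think evals are important and production monitoring is important, but the claim that only one of them will solve your problems doesn't hold up at all, in my view.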
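The refund example above can be sketched as a tiny regression-style eval. Everything here is hypothetical: the ticket fields, the two toy agents, and the `EVAL_DATASET` contents are assumptions chosen to illustrate "a small dataset of cases that must never break," not how any real team structures its evals.

```python
def run_evals(agent, dataset: list) -> list:
    """Return the cases where the agent's action differs from the expected action."""
    failures = []
    for case in dataset:
        actual = agent(case["input"])
        if actual != case["expected"]:
            failures.append({**case, "actual": actual})
    return failures


# Assumed dataset: refunds must never be issued when policy forbids them.
EVAL_DATASET = [
    {"input": {"intent": "refund", "refund_allowed": False}, "expected": "escalate"},
    {"input": {"intent": "refund", "refund_allowed": True}, "expected": "refund"},
    {"input": {"intent": "question", "refund_allowed": False}, "expected": "answer"},
]


def agent_v1(ticket: dict) -> str:
    # Toy agent with the failure mode: it refunds regardless of policy.
    return "refund" if ticket["intent"] == "refund" else "answer"


def agent_v2(ticket: dict) -> str:
    # Toy fix: respect the refund policy and escalate otherwise.
    if ticket["intent"] == "refund":
        return "refund" if ticket["refund_allowed"] else "escalate"
    return "answer"
```

Passing this dataset only shows that the known failure stays fixed; as the passage says, new failure modes still have to be found through production monitoring.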
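Lenny Rachitsky: Okay, a very balanced answer. The point isn't simply "do both." More precisely, they catch different problems, and no single method can cover everything you need to watch.
Aishwarya Naresh Reganti: Exactly.
Lenny Rachitsky: Great.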
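How Much Weight the Word Evals Carries
Aishwarya Naresh Reganti: I want to take two steps back and talk about how much weight the word "evals" had to carry in the second half of 2025. You meet a data labeling company and they tell you "our experts write evals." Then another group says product managers should write evals, that evals are the new PRD. Others think evals are basically everything, the feedback loop you should build to improve your product. Now step back as a beginner and ask: what are evals, really? Why is everyone talking about them? These are actually different parts of the process, and nobody is wrong. Yes, they're all evals, but when a data labeling company says "our experts write evals," what they really mean is error analysis, or experts annotating what the right answer is.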
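Lawyers and doctors writing evals doesn't mean they're building LLM judges or an entire feedback loop. And saying product managers should write evals doesn't mean they have to write an LLM judge that can run in production. I think there are some very specific guidelines for doing this, and I agree with Kiriti: you can't decide up front whether to build an LLM judge or rely on implicit signals from production monitoring. Martin Fowler introduced a concept sometime in the 2000s called "semantic diffusion": someone coins a term, everyone starts bending it with their own definition, and eventually the term loses its original meaning. That's exactly what's happening to evals, agents, and almost any other word in AI: everyone sees only one side of it.
But if you sit a group of practitioners down and ask, "Is it important to build an actionable feedback loop for AI products?" I think everyone will agree. How you do it really depends on your application. When your use case is complex, building LLM judges is extremely hard, because you see a lot of newly emerging patterns. Say you build a judge to test whether outputs are too verbose, and then new patterns appear that your LLM judge can't catch, so you have to build more and more evals. At some point it makes more sense to look at user signals, fix the problem, check for regressions, and move on than to keep building judges. So it depends entirely on the situation. The one thing every ML practitioner will tell you is that it really depends on context. Don't put blind faith in prescriptions; they will change.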
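Lenny Rachitsky: That's a really important point: evals mean too many different things to different people. It has become a catch-all term. When you call both what the data labeling company gives you and what the PM writes "evals," communication gets confusing. Oh, and there are benchmarks too; people sometimes count benchmarks as a kind of eval...
Aishwarya Naresh Reganti: I recently talked to a client who told me, "We do evals." I asked, "Okay, can you show me your dataset?" He said, "No, we just checked LM Arena and Artificial Analysis. Those are independent benchmarks, and we know this model fits our use case." And I thought, you're not doing evals. Those are model-level evals, not evals of your product.
Lenny Rachitsky: But it's understandable. The word really can be used in that context. I get why people think that way, but it makes the concept even muddier.
Aishwarya Naresh Reganti: Yes.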
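How Codex Thinks About Evals
Lenny Rachitsky: One more thread that comes to mind: part of why this became such a big debate is Claude Code. Boris, who leads Claude Code, has said, "We don't do evals on Claude Code; it's all vibes." Kiriti, how do Codex and the Codex team think about evals? Can you share?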
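Kiriti Badam: On Codex, we take a balanced approach: you need evals, and you also have to listen to your customers. Alex was on your podcast recently and talked about how we're extremely focused on building the right product, and a big part of that is listening to customers. Coding agents are quite unique compared with agents in other domains, because they're designed for customizability and designed for engineers. A coding agent isn't a product that solves five or six fixed workflows; it's meant to be customized in many different ways. That means your product gets used with different integrations, different tools, and different scenarios. So it's very hard to build evaluation datasets for every type of interaction your customers might have.
That said, you also need to be sure that when you make a change, you at least don't break the product's core functionality. So we do have evals for that, but we're also extremely focused on understanding how customers use the product. For example, we recently built a code review product, and it's growing very fast. I think it's catching a lot of bugs both inside OpenAI and for external customers. Now suppose I want to change the model behind code review, or adjust some RL mechanism used in training, and then deploy. I'll definitely A/B test it to confirm that it's actually finding the right errors and to see how users react. Because sometimes, if users get annoyed by incorrect code review suggestions, they'll just turn the product off.
Kiriti Badam: So those are the signals you need to watch to make sure your new change is heading in the right direction. It's hard for us to think of all these scenarios in advance and build evaluation datasets for them. So I think you need both: there's a lot of vibes and a lot of customer feedback, and we're also very active on social media to learn whether someone has hit a problem and to fix it quickly. It's a space where you have to do several things at once.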
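Lenny Rachitsky: That makes so much sense. So what I'm hearing is that Codex is pro-evals, but evals alone aren't enough.
Kiriti Badam: Right.
Lenny Rachitsky: You also watch customer behavior and feedback. And part of it is vibes: does it feel good to use? Does the generated code excite me and seem great?
Kiriti Badam: I don't think anyone can come up with a specific set of evals and say, "I'll bet my life on this and ignore everything else." That doesn't work. Every time we release a new model, our team gets together to test different things. Everyone focuses on a different aspect. We have a list of hard problems that we throw at the model to see how it's progressing. So you can think of it as every engineer having their own custom evals for understanding how the product behaves with a new model.
(Ad segment skipped)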
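The Continuous Calibration, Continuous Development Framework
Lenny Rachitsky: We've been talking for almost an hour and haven't yet covered the very powerful development workflow for AI products that the two of you created. You teach it in your course, and it pulls everything we've discussed into a step-by-step method for building AI products. You call it the "continuous calibration, continuous development" framework. Let's put up a diagram so people can see what it looks like, and then please walk us through it: what it is, how it works, and how teams can switch their way of building AI products to this approach and avoid a lot of pain and pitfalls.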
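Aishwarya Naresh Reganti: Before we explain the lifecycle, a quick story about why Kiriti and I came up with this. We keep talking to many companies that are under pressure from competitors: everyone's building agents, so we should build fully autonomous agents too. I did work with a few clients to build end-to-end agents for them. What we found is that, because you start out not knowing how users will interact with your system or what responses or actions the AI might produce, problems are very hard to fix when the workflow is large, with four or five steps and many decisions. You end up spending a huge amount of time debugging and piling on hot fixes. There was one time when we were building a product for a customer support use case, the same example we use in the newsletter.
We eventually had to shut that product down, because we'd made too many hot fixes and there was no way to count the problems that kept surfacing. There's also plenty of related news out there. Recently, Air Canada had an incident where one of their agents hallucinated a refund policy that wasn't in the original playbook, and for legal reasons they had to honor it. There are many truly frightening incidents like that. That's where this idea comes from: how do you build AI systems so that you don't lose customer trust and your agent or AI system doesn't make decisions that are extremely dangerous for the company, while also building a flywheel that improves the product as you go? That's why we proposed "continuous calibration, continuous development."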
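The idea is quite simple. The right side of the loop is "continuous development": you scope the capability and curate data, which essentially means building a dataset that spells out your expected inputs and what the expected outputs should look like. This is a very good exercise before you start building any AI product, because you'll often find that the team hasn't agreed on how the product should behave. This is where PMs and subject matter experts can contribute a lot. You end up with a dataset you know your AI product should do well on. It isn't comprehensive, but it gets you started. Then you set up the application and design the right evaluation metrics. I deliberately say "evaluation metrics"; we usually just say evals, but I want to be precise, because evaluation is a process and evaluation metrics are the specific dimensions you look at during that process.
Then you deploy and run the evaluation metrics. The second part is "continuous calibration," where you work to understand the behavior you didn't anticipate at the start. When you begin development you have that dataset you're optimizing for, but you'll often find it isn't comprehensive enough, because users start interacting with your system in ways you didn't expect. That's where calibration comes in: I've deployed the system, and now I see patterns of behavior I really didn't anticipate. Your evaluation metrics should give you some insight into those patterns. But sometimes the metrics themselves aren't enough, and you run into new error patterns you never thought of. That's where you analyze behavior and spot error patterns.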
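From Fixing Errors to Gradually Raising Autonomy
Aishwarya Naresh Reganti: You apply fixes for the problems you see and design new evaluation metrics to identify the patterns that are emerging. That doesn't mean you should always design new metrics. Some errors you can fix and never revisit, because they're very specific. A tool-calling error caused only by a poor tool definition, for example: you fix it and move on. That's basically what the AI product lifecycle looks like. What we especially want to stress is that as you go through these iterations, you should favor low-agency, high-control iterations early on. In other words, limit the number of decisions your AI system can make, keep a human in the loop, and raise autonomy over time, because you're building a flywheel of behavior while learning which use cases are emerging and how users are using your system.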
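How a Customer Support Agent's Autonomy Evolves
One example we gave in the newsletter is customer support. There's a nice diagram showing how to think of agency and control as two dimensions. Each version raises the AI system's agency, its ability to make decisions, and lowers control a little. Our example is a customer support agent, which you can break into three versions. The first version is routing: can your agent classify a ticket and route it to the right department? When you read that, you may think: is routing really that hard? Why can't an agent do it easily? But once you go inside an enterprise, you find that routing is itself a super complex problem. Any retail company you can think of has a hierarchical taxonomy.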
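The Real Mess in Enterprise Taxonomies
Most of the time these taxonomies are unbelievably messy. I once worked on a use case where the taxonomy had some kind of hierarchy, and "shoes," "women's shoes," and "men's shoes" all sat at the same level, when "shoes" should have been the parent and the other two its children. You think, fine, I'll just merge them. Then you look further down and find another section under shoes labeled "for women" and "for men" that was never consolidated. For whatever reason, nobody ever fixed it. If an agent sees a taxonomy like that, what is it supposed to do? Where should it route? A lot of the time you don't know about these problems until you get your hands dirty building and really dig in.
When human support agents hit this kind of problem, they know what to check next. Maybe they notice that the "for women" and "for men" node under shoes was last updated in 2019, which means it's a dead node nobody maintains. So they know to look at a different node. I'm not saying agents can't understand this or that models aren't smart enough; I'm saying enterprises have a lot of odd rules that aren't documented anywhere. You want to make sure the agent has all of that context rather than just throwing the problem at it.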
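Back to the versions: routing is the lowest-agency version, and your control is very high, because even if the agent routes to the wrong department, a human can take over and undo it. Along the way you'll also find yourself dealing with a lot of data problems, which you need to fix so the data layer is good enough for the agent to function. The second step is what we call copilot mode. Once a few iterations have confirmed that routing works and the data problems are fixed, you move on to having the agent offer suggestions based on the standard operating procedures for customer support. It can generate a draft that a human edits. As you do this, you're also logging human behavior, such as how much of the draft the support agent used and what they deleted. So you're effectively getting error analysis for free, because you've logged everything the user did, and that data can feed back into your flywheel.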
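From Drafts to End-to-End Resolution
After that, once you see that the drafts are good, that humans mostly don't change much and use them more or less as is, that's when you move to the end-to-end resolution assistant, which can draft a resolution and resolve the ticket directly. Those are the stages of agency: start low and step up. We also put together a very nice table listing what you should do in each version, what you learn that lets you move to the next step, and what information you can feed back into the loop. When you're only doing routing, you get higher-quality routing data, and you learn what kinds of prompts to build to improve the routing system.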
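The copilot-stage logging and the "drafts are mostly sent unchanged" signal can be sketched with the standard library. This is a rough illustration: using `difflib.SequenceMatcher` as the similarity measure and the `threshold` and `min_share` values are assumptions, not numbers from the conversation.

```python
from difflib import SequenceMatcher


def draft_retention(draft: str, sent: str) -> float:
    """Similarity ratio in [0, 1] between the AI draft and what the human actually sent."""
    return SequenceMatcher(None, draft, sent).ratio()


def ready_for_next_stage(pairs: list, threshold: float = 0.9, min_share: float = 0.8) -> bool:
    """True when at least `min_share` of drafts were sent nearly unchanged.

    `pairs` is a list of (draft, sent) strings logged during the copilot stage.
    Both cutoffs are assumed values that a team would tune for its own product.
    """
    if not pairs:
        return False  # no evidence yet, so stay at the current autonomy level
    kept = sum(1 for draft, sent in pairs if draft_retention(draft, sent) >= threshold)
    return kept / len(pairs) >= min_share
```

The heavily edited pairs are the useful by-product: they are the cases worth reading when analyzing where the drafts fall short.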
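In essence, you're laying out the structure of your context engineering and building the flywheel you want. While we're on this, I want to be very clear about two things. First, building the CCCD way doesn't mean you've solved everything once and for all. You may have advanced to V3 and then run into a completely new data distribution you never imagined. This is just one way to lower risk: before you reach full autonomy, you have enough information about how users interact with your system. Second, you've also built an implicit logging system along the way. Many people come to us and say, "Wait, evals already exist, so why do we need this?" The thing is, if you just build a bunch of evaluation metrics and put them in production, they only catch the errors you already know about, and there are many emerging patterns you only discover after launch.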
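A Framework for Lowering Risk
For those emerging patterns, you're effectively creating a low-risk framework that lets you understand user behavior, rather than waiting until errors pile up and then trying to fix everything at once. Of course, this isn't the only way to do it. There are many different approaches. You need to decide how to constrain autonomy: by the number of actions the agent takes, as we did in this example; by topic, since fully autonomous decisions are very risky in some domains and perfectly fine in others; or by the complexity of the problem. This is exactly where you need product managers, engineers, and subject matter experts to agree on how to build the system and keep improving it.
The core idea is behavior calibration, done without losing user trust along the way.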
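Lenny Rachitsky: For anyone who wants to go deeper, we'll link to the post. It has step-by-step guidance and lots of examples. As you said, the core idea is to make the whole process continuous and iterative, moving steadily toward more agency and less control. Even the name, "continuous calibration, continuous development," conveys that it's iterative. For the record, the name is a nod to CI/CD, continuous integration and continuous deployment. The idea is that the AI-era version of CI/CD is no longer only about integrating unit tests and deploying constantly; it's about running evals, looking at the results, iterating on the metrics you care about, finding where the system breaks, and iterating again on those problems. Okay.
We'll link to that for people who want to dig in. That was a great overview. Before we move on to a different topic, is there anything else about the framework you think people should know?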
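Aishwarya Naresh Reganti: I think the question we get most often is: how do I know whether to move to the next stage, or whether I've calibrated enough? There's no rulebook you can copy; it's about minimizing surprise. Say you calibrate every day or two and you stop seeing new data distribution patterns, and the way users interact with the system has been stable. The amount of new information you're getting is low, and that's when you know you can move to the next stage. It comes down to whether you're still receiving new information; if not, you're ready. But you also need to understand that some events can completely throw off your system's calibration. For example, GPT-4o no longer exists, or rather it's being deprecated in the API as well.
Most companies using 4o need to move to 5, and 5 has very different properties. At that point your calibration is off again and you need to go back through the process. Other times, the way users interact with the system changes over time; user behavior evolves. Even with consumer products, you don't talk to ChatGPT today the way you did two years ago, because you know its capabilities have improved a great deal. And when these systems can solve one kind of task, people get excited and want to try them on other tasks. We once built a system for underwriters at a bank. Underwriting is painful work, loan applications often run thirty or forty pages, and the bank's idea was to build a system that helps underwriters look up policies and bank information so they can approve loans.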
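For the first three or four months, everyone was quite happy with the system, and the underwriters reported saving a lot of time. But after about three months, we found that because they were so excited about the product, they began asking deeper questions we had never anticipated. They'd throw an entire application at the system and ask, "For cases like this one, what did previous underwriters do?" To the user, that seems like a natural extension of what they were already doing, but the system behind it has to change fundamentally. Now you need to understand what "cases like this one" means in the context of the loan itself: does it mean people in a certain income range, or people in a particular region?
Then you have to pull up historical documents, analyze them, and tell the user "here's what happened," rather than just saying that such-and-such policy exists and they can go look it up. So something that feels completely natural to the end user can be extremely hard for the product builder. You'll see user behavior keep evolving, and that's when you know you need to go back and recalibrate.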
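The "minimize surprise" rule for deciding when to move to the next stage can be expressed as a small check. This is a sketch under strong assumptions: it presumes each interaction has already been given a pattern label during review, and the `max_rate` and `rounds` cutoffs are invented for illustration.

```python
def new_pattern_rate(seen_patterns: set, batch: list) -> float:
    """Share of interactions in this batch whose pattern label was never seen before."""
    if not batch:
        return 0.0
    novel = sum(1 for pattern in batch if pattern not in seen_patterns)
    return novel / len(batch)


def calibration_is_stable(history: list, max_rate: float = 0.05, rounds: int = 3) -> bool:
    """Stable when each of the last `rounds` calibration rounds saw at most `max_rate` novelty.

    `history` holds one new-pattern rate per calibration round, oldest first.
    """
    recent = history[-rounds:]
    return len(recent) == rounds and all(rate <= max_rate for rate in recent)
```

A model migration or a shift in how users phrase requests, like the underwriter example, would show up as a jump in the rate and reset the count of quiet rounds.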
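Overhyped and Underrated in AI
Lenny Rachitsky: What do you think is overhyped in AI right now? And more importantly, what's underrated?
Kiriti Badam: As I said, I'm very optimistic about AI progress overall, so I wouldn't really call anything "overhyped," but I think the concept of multi-agents is somewhat misunderstood. Many people have this idea: "I have an extremely complex problem, so I'll break it up: this agent handles this, that agent handles that." Then they assume that chaining the agents together gets them to agent utopia. There are very successful multi-agent systems, no doubt. But I think success comes down to how you limit the ways the system can go off track. For example, if you build a supervisor agent with several subagents doing the actual work for it, that's a very successful pattern.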
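But the idea that "I'll divide responsibilities by function and expect all the agents to coordinate automatically through some kind of gossip protocol" is a big misunderstanding. I don't think current ways of building, or current model capabilities, can support that kind of application yet. I'd call it misunderstood more than overrated. As for underrated, it may be hard to believe, but I still think coding agents are underrated. You see a lot of talk about coding agents on Twitter and Reddit, but if you talk to engineers at any random company, especially outside the Bay Area, you find that the potential impact of coding agents is still huge and penetration is still very low. So I think 2025 and 2026 will be great years for optimizing all of these processes.
I think that will create a great deal of value with AI.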
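Lenny Rachitsky: The first point is really interesting. So you're saying that instead of a bunch of agents each minding their own piece, one Codex agent doing this task and another doing that, it may work better to build one agent that splits up the subtasks itself?
Kiriti Badam: You can have multiple agents do these things with you, the human, orchestrating them; or you can have one larger agent orchestrating all the subagents. But if you let agents talk to each other peer to peer, especially in something like customer support, it becomes extremely hard to control which agent is actually replying to your customer, because you need guardrails and similar protections everywhere.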
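The supervisor pattern described above, as opposed to peer-to-peer agent chatter, can be sketched as a single choke point through which every customer-facing reply passes. The subagents here are plain functions standing in for model calls, and the toy `guardrail` rule is an assumption made purely for illustration.

```python
def billing_agent(question: str) -> str:
    # Stand-in for a model-backed subagent.
    return f"billing: {question}"


def shipping_agent(question: str) -> str:
    # Stand-in for a model-backed subagent.
    return f"shipping: {question}"


SUBAGENTS = {"billing": billing_agent, "shipping": shipping_agent}
ESCALATED = "escalated to human"


def guardrail(reply: str) -> bool:
    """Toy check: block any reply that mentions a refund."""
    return "refund" not in reply.lower()


def supervisor(topic: str, question: str) -> str:
    """Pick one subagent, then apply the single guardrail before anything reaches the customer."""
    worker = SUBAGENTS.get(topic)
    if worker is None:
        return ESCALATED
    reply = worker(question)
    return reply if guardrail(reply) else ESCALATED
```

Because subagents never talk to the customer or to each other, there is exactly one place to put guardrails and one place to look when a reply goes wrong.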
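Lenny Rachitsky: Right. Okay, great point. Aishwarya, what about you?
Aishwarya Naresh Reganti: Can I say evals? Will I get canceled?
Lenny Rachitsky: Which category? Which bucket does it go in?
Aishwarya Naresh Reganti: Overrated.
Lenny Rachitsky: Overrated. Okay, go on. We won't let you get canceled.
Aishwarya Naresh Reganti: Just kidding. I think evals are misunderstood. They are important, everyone; I'm not saying they aren't. But the idea of "I keep jumping to new tools, I keep learning one new tool after another" is what I'd call overrated. I'm still old school in thinking you really need to be obsessed with the business problem you're solving. AI is just a tool, and I try to look at it that way. Of course you need to know the latest and best technology, but don't obsess over building fast. Building is cheap today. Design is what's expensive: really thinking about your product and what you're building, and whether it actually solves a pain point. That's what's more valuable today, and it will only become more so in the near future. So real obsession with the problem and the design is underrated, and mechanical building is overrated.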
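Lenny Rachitsky: Great. Next, a similar question. From a product perspective, what do you think the next year of AI will look like? Paint us a picture of where you think things are headed by, say, the end of 2026.
Kiriti Badam: I think background agents, or proactive agents, hold a lot of promise here. They'll understand your workflow better. If you think about where AI fails to create value today, it's mostly from not understanding context. And the reason it doesn't understand context is that it isn't plugged into the places where the work actually happens. As you do more of that, you can give the agent more context, and it can then see the world around you and understand which metrics you're optimizing or what kinds of activities you're doing. From there it's easy to get further gains and have the agent prompt you proactively. We already do something like this with ChatGPT Pulse, which gives you a daily update on things you might care about. That's great, and it sparks your thinking: "Oh, I hadn't thought of that, maybe that's good." Now extend that to more complex tasks, such as a coding agent saying, "Okay, I've fixed five of your Linear tickets; here are the patches, just take a look when you start work." I think that will be very useful. I see it as a strong direction for product building in 2026.
Lenny Rachitsky: So cool. So basically, agents that anticipate what you want to do and get ahead of you: "I've already solved these problems for you," or "I think this will take your site down, you might want to fix it," or "I see a spike here, let's refactor the database." Amazing. Okay, Aish, what's your take?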
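Aishwarya Naresh Reganti: I'm very bullish on multimodal experiences in 2026. I think we made considerable progress in 2025, in understanding as well as generation. Until now, LLMs have been the component we use most, but as humans we're multimodal creatures. Language is probably one of the latest forms to appear in our evolution. As the three of us talk, we're constantly picking up a lot of signals. I might think, "Oh, Lenny's nodding, so I can probably keep going in this direction," or "Lenny's bored, so I'll stop." So there's another chain of thought behind your chain of thought, and you keep adjusting it as you speak. That dimension of expression hasn't been explored well yet. If we can build better multimodal experiences, we'll get closer to the richness of human conversation. And given the kinds of models we have, there's also a whole pile of boring tasks that are a great fit for AI. There are so many handwritten documents and very messy PDFs that even today's best models can't parse. If multimodal understanding gets better and that becomes possible, we'll be able to tap into a huge amount of data.
Lenny Rachitsky: Great. I just saw Demis from Google DeepMind, or whatever their whole organization is called, talking about this. He thinks it will be a major part of where they're headed: combining their image model work, LLMs, and their world model, Genie, I think it's called. That's going to be a really wild time. Okay, last question. If someone wants to get better at building AI products, what are the one or two skills you think they should focus on developing?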
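Aishwarya Naresh Reganti: I think we've covered a lot of best practices for AI products: start small, get iteration going, build a flywheel, and so on. But from ten thousand feet, for anyone building anything today, as I said, the cost of implementation will become extremely low over the next few years. So you want to really hone your design, your judgment, your taste. More broadly, if you're building a career too, I think that in past years the early part of your career, the first two or three years, was always about execution and the mechanics of the job. Now we have AI to help you ramp up quickly. After that, a few years in, I think everyone's job becomes about your taste, your judgment, and what is uniquely yours. I'd focus on that and figure out how to bring that unique perspective. It doesn't mean you have to be older or have many years of experience.
We recently hired someone. We'd been using a very popular app to track tasks for years, paying a hefty subscription. This guy simply showed up to a meeting with an app he had vibe coded himself, migrated all of us onto it, and said, "Okay, start using this." I think that kind of agency and ownership, really rethinking the experience, is what makes people stand out. I'm not blind to the fact that vibe-coded apps are expensive to maintain. Maybe as the company grows we'll have to replace it or come up with something better. But given that we're a small company right now... I was honestly shocked, because it had never occurred to me to do that. If you've always worked a certain way, you attach a cost to building in your head. I think people growing up in this era attach a much lower cost to building. They don't mind building something and moving on. They're also very keen to try new tools. That may be why AI products have a retention problem too, because everyone is excited to try new tools.
In the end, having that agency and ownership is key. I think this also marks the end of the busywork era. You can't sit in a corner doing things that don't move the needle for the company. You really need to think about end-to-end workflows and how to have a bigger impact. I think all of that will become very important.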
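The End of Busywork and the Value of Agency
Lenny Rachitsky: That reminds me: I recently had Jason Lemkin on the podcast. He's great at sales and go-to-market, and he runs SaaStr. He replaced his whole sales team with agents. He had 10 salespeople, and now it's 1.2 people and 20 agents. One of the agents tracks everyone's updates in Salesforce and fills them in automatically based on their calls. And one salesperson said, "Okay, I quit." It turned out he hadn't been doing anything; he was just sitting there, and he figured, "Uh-oh, this will catch me, I'd better leave." So back to your point, it will get harder to sit around doing nothing, and I think that's right.
Kiriti Badam: Yes, and I'd add that persistence is also extremely valuable, especially since for anyone who wants to build something today, information is at your fingertips, even more than it has been over the past decade. You can learn anything overnight and become that Iron Man kind of figure. So I think it's about having the persistence to go through the pain of learning these things, putting them into practice, and understanding what works and what doesn't. As you keep accumulating through that painful process of trying different approaches and solving problems, I think that becomes a person's real moat. I like to call it "pain is the new moat," and I think it's especially useful when building AI products.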
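Pain Is the New Moat
Lenny Rachitsky: Say more about that; I love the concept. "Pain is the new moat." What else is there?
Kiriti Badam: Thinking about it as a company: the companies succeeding in any new field right now aren't succeeding because of first-mover advantage, or because they have some fancy feature that more customers like. They went through the pain of figuring out what the non-negotiables are, and then weighing those precisely against the available features or model capabilities. It's not a simple, straightforward process. There's no textbook for it and no ready-made path. So the pain I mean is going through that repeated iteration: "Okay, try this; if it doesn't work, try that." That knowledge you build up across the organization or through your own experience, that pain, is what ultimately turns into the company's moat. It might show up as a set of evals you've built or as something else. I think that's the real key to winning.
Lenny Rachitsky: Amazing. Like turning coal into diamonds.
Kiriti Badam: Exactly.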
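Final Advice
Lenny Rachitsky: Okay. I think we've done a good job of helping people avoid the biggest pitfalls in building AI products. We've covered many of the traps and the right way to do things. Before we get to the very exciting lightning round, is there anything else you want to share or leave listeners with?
Aishwarya Naresh Reganti: Be obsessed with your customers. Be obsessed with the problem. AI is just a tool, so make sure you truly understand your workflows. 80% of so-called AI engineers and AI PMs actually spend their time understanding their workflows deeply. They aren't building the fanciest, coolest models or workflows around them. They're down in the weeds, understanding customer behavior and data. Whenever a software engineer who has never done AI hears "look at your data," I think it's a huge revelation for them, but it has always been that way. You need to go look at your data yourself and understand your users, and that will be a huge differentiator.
Lenny Rachitsky: A great way to wrap up. AI isn't the answer; it's a tool for solving the problem. With that, we've reached the very exciting lightning round. I have five questions for the two of you. Are you ready?
Aishwarya Naresh Reganti: Yes, ready.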
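Lightning Round
Lenny Rachitsky: Okay, you can both answer or pick one of you to answer, up to you. What are two or three books you find yourself recommending most to other people?
Aishwarya Naresh Reganti: I'd recommend a book called When Breath Becomes Air, Lenny, by Paul Kalanithi. He was a neurosurgeon of Indian origin who was diagnosed with lung cancer at thirty-one or thirty-two. The whole book is his memoir, written after the diagnosis. It's very beautiful, especially because I read it during COVID, when our biggest wish was simply to stay alive. There are many great lines in it. One I remember is where he pushes back on the famous line from Socrates, "the unexamined life is not worth living," meaning you really need to think about your choices and understand your values, your mission, and so on. And Paul says, "If the unexamined life was not worth living, was the unlived life worth examining?" Meaning: are you spending so much time trying to understand your mission and purpose that you forget to actually live? I think everyone in the AI era who is constantly building and reinventing themselves needs to pause now and then and just live for a while. They need to stop running so many evals on their lives.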
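Lenny Rachitsky: I was about to say that; my mind went right there. You've got to write some evals for your life. Oh my god, we've gone too far.
Aishwarya Naresh Reganti: Yes, exactly.
Lenny Rachitsky: Beautifully said.
Aishwarya Naresh Reganti: It's my favorite book.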
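Kiriti Badam: I lean more toward science fiction. I really love The Three-Body Problem series, a set of three books. It has elements of grand science fiction, life beyond Earth and how it affects human decision-making. It also touches on geopolitics and on how important and valuable abstract science is to human progress. When all of that is blocked, it isn't obvious in day-to-day life, but the consequences can be devastating. So I think AI helping in those areas will be extremely important, and the book shows well what happens if it doesn't.
Lenny Rachitsky: Totally agree. Absolutely love it. It may be my favorite sci-fi book, or rather series, since there are three. By the way, read all three. I think it only gets really good about a book and a half in. So if anyone tried it and thought, "What is this even about?", keep reading; by the middle of the second book it becomes mind-blowing.
Kiriti Badam: Yes.
Lenny Rachitsky: If you like sci-fi and work in AI, you have to read a book called A Fire Upon the Deep by Vernor Vinge. Check it out; it's fantastic. I saw it recommended in Noah Smith's newsletter. It has sequels, but this one is the most astonishing. And it's actually about AGI and superintelligence, very epic. Yet almost nobody has heard of it.
Kiriti Badam: Thanks for the recommendation.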
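Lenny Rachitsky: There you go, returning the favor. Next question. What's a movie or TV show you've enjoyed recently?
Aishwarya Naresh Reganti: I started rewatching Silicon Valley, and it's so true to life. It hasn't aged at all. Everything is repeating itself. Anyone who watched it years ago should watch it again; you'll find it strikingly similar to everything happening in the AI wave right now.
Lenny Rachitsky: Rewatching it really is a great idea. I love that their whole business is a compression algorithm, which in a way is maybe a small precursor to LLMs. I can see it. Okay, Kiriti, what do you have?
Kiriti Badam: I'm going to go slightly off topic; it isn't strictly a movie or TV show. I've been playing a game called Expedition 33. It has nothing to do with AI, but the gameplay, the cutscenes, the story, and the music are all extremely well done. It's stunning.
Lenny Rachitsky: I love that you still have time to play games. Imagine, you're at OpenAI, and people probably assume you do nothing but write code and sit in meetings.
Kiriti Badam: True, it's hard to find the time.
Lenny Rachitsky: Good, that's a good sign; I'm happy to hear it. Next question: what's a product you discovered recently that you really love?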
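Favorite Recently Discovered Products
Aishwarya Naresh Reganti: For me it's Wispr Flow. I've been using it for a while, and I had no idea how much I needed it. The best part is that it's a semantic transcription tool: if you open Codex and start using Wispr Flow, it automatically recognizes things like variable names. The conversion from speech to instructions is very smooth. You can say "I'm so happy today, add three exclamation marks," and it adds three exclamation marks instead of writing out the words "add three exclamation marks." I think that's so cool. If you haven't tried it, you really should.
Lenny Rachitsky: Let me do a plug: become an annual subscriber to my newsletter and you get Wispr Flow free for a whole year.
Aishwarya Naresh Reganti: That's how I got access, Lenny.
Lenny Rachitsky: There you go. I told you this deal was worth promoting. I don't think people really grasp how good this offer is; everyone thinks, "No way." But it's real. There are 18 other products too, at lennysproductpass.com; check it out. Go ahead, Kiriti.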
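Kiriti Badam: Sure. I'm a bit of a productivity tool nerd, always trying new CLI tools and anything that makes me faster. Raycast has really impressed me: I keep discovering new shortcuts for opening different apps, typing quick commands, and so on. Another tool I found recently through a teammate is Caffeinate, which keeps your Mac from going to sleep, so you can run a four- or five-hour Codex task locally, let it build everything, and then wake up and go, "Okay, nice, I'm happy with this."
Lenny Rachitsky: Codex plus Caffeinate is hilarious. You should build that yourselves inside OpenAI, or have the Codex agent keep the Mac awake automatically. So funny. By the way, Raycast is also in Lenny's Product Pass, free for a year. Great.
Aishwarya Naresh Reganti: Lenny didn't tell us about these partners. But these really are products we love.
Lenny Rachitsky: Those are just two of the 19 products. Caffeinate isn't in there, and I don't know whether it's even paid. Okay, moving on. Do you have a life motto that you often come back to in work and in life?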
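Life Mottos
Aishwarya Naresh Reganti: For me, it's something my dad told me when I was a kid that has stayed with me: they said it couldn't be done, but the fool didn't know that, so he did it anyway. I think you need to be "foolish" enough to believe you can do anything if you give it your all, especially now, when there's so much data at hand telling you that you'll probably fail. How many podcasts get past a thousand subscribers? How many companies reach a million in ARR? There's always data to prove you won't make it, but sometimes you just have to be a bit foolish and go for it.
Lenny Rachitsky: Well said.
Kiriti Badam: I tend to overthink, so I really like the Steve Jobs line: you can only connect the dots looking backward. Often there are countless choices in front of you and you really don't know which is best, but life has a way of letting you look back and think, "Oh, those turns worked out beautifully." So I find it very useful: keep moving forward, keep trying.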
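What They Admire in Each Other
Lenny Rachitsky: Last question. Whenever I have two guests on the podcast at once, I like to ask this: what do you admire about each other?
Aishwarya Naresh Reganti: I think Kiriti is very calm and grounded, and he has always been my sounding board. I can throw a ton of ideas at him, and he can always foresee the problems we might run into. He's very kind, and he lets his work speak for him rather than talking a lot. But if I had to pick one thing, I'd say he's the most incredible husband.
Lenny Rachitsky: The reveal. I bet nobody saw that coming.
Aishwarya Naresh Reganti: We've been married for four years, and they've been the best four years of my life.
Lenny Rachitsky: Wow. Okay. How do you follow that?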
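Kiriti Badam: It really is hard to follow. I'd say I've been very fortunate to work with great companies and extremely smart people in Silicon Valley. And among all the smart people I've worked with, Aishwarya has something unique: an extraordinary gift for teaching, for explaining complex things in simple terms. That ability, combined with her persistence, is especially valuable in our fast-changing AI world, where new things keep coming and it's overwhelming. But when she tells you, "Here's how you should think about this whole thing, and here's how it fits in," you think, oh, it's that simple, I can do this too. She empowers a lot of people by simplifying things and explaining them in the most accessible way. I think that's a remarkable quality.
Lenny Rachitsky: Amazing. So sweet. I should ask this every time; I need more guests to do this. Great. Final question: where can people find what you're working on and find you online? Share the course link, and tell us how listeners can be useful to you.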
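Where to Find Them
Aishwarya Naresh Reganti: I write a lot on LinkedIn. If you want to follow practitioners who are actually in the trenches with AI products and sharing what they observe, you can follow my content. We also have a GitHub repository with about 20K stars, full of good resources for learning AI, completely free. If you liked today's conversation, we also run a very popular course on building enterprise AI products. The course is about helping people shift their mindset toward a problem-first approach rather than a tool-first or hype-first one. Check it out. If you don't want to take a course, that's fine too: we write many articles and offer lots of free resources and free public sessions, so keep an eye on our work.
Kiriti Badam: Yes, and to add, you can find me on LinkedIn as well. I don't write much, but I'm very happy to talk about any complex product you're building. If you have ideas about using coding agents to make your work easier, or you've run into problems, my DMs are always open, so feel free to reach out.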
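Lenny Rachitsky: Great. Kiriti, Ash, thank you so much for being here.
Kiriti Badam: Thank you very much.
Aishwarya Naresh Reganti: Thank you, Lenny. This was so much fun.
Lenny Rachitsky: It really was. Bye, everyone.
Lenny Rachitsky: Thank you so much for listening. If you found this episode valuable, you can subscribe on Apple Podcasts, Spotify, or your favorite podcast app. Please also consider giving us a rating or leaving a review, which really helps other listeners find the podcast. You can find all past episodes or learn more about the show at lennyspodcast.com. See you in the next episode.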
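Glossary
| Term | Note |
|---|---|
| A/B test | Controlled comparison of two variants |
| agency control trade-off | Trade-off between AI autonomy and human control |
| agent traces | Recorded runs of an agent |
| agentic systems | Systems that make decisions and take actions |
| Aishwarya Naresh Reganti | Person |
| Alex | Person |
| ARR | Annual recurring revenue |
| Artificial Analysis | Benchmark platform |
| background agents | Agents that work in the background |
| behavior calibration | Adjusting a system based on its observed behavior |
| benchmark | Standardized, model-level test |
| Boris | Person |
| Caffeinate | macOS command that prevents sleep |
| ChatGPT Pulse | Proactive update feature in ChatGPT |
| CI/CD | Continuous integration / continuous deployment |
| CLI | Command-line interface |
| context engineering | Designing the context supplied to a model |
| continuous calibration | Calibration half of the CC/CD framework |
| continuous development | Development half of the CC/CD framework |
| Copilot | Assistive mode in which a human reviews AI suggestions |
| Dan Shipper | Person |
| evals | Evaluations |
| flywheel | Self-reinforcing improvement loop |
| FOMO | Fear of missing out |
| Gagan | Person |
| gossip protocol | Peer-to-peer information-sharing protocol |
| guardrails | Safety checks around a system |
| Half Dome | Landmark in Yosemite |
| hallucinating | Producing unfounded output |
| hot fix | Urgent patch |
| human in the loop | A human reviews or approves AI actions |
| jailbreaking | Bypassing a model's safeguards |
| Jason Lemkin | Person |
| Kiriti Badam | Person |
| Lenny Rachitsky | Person |
| Linear tickets | Tickets in Linear, a project management tool |
| LLM judge | An LLM used to grade outputs |
| LM Arena | Benchmark platform |
| low hanging fruits | Easy wins |
| Matei Zaharia | Person |
| moat | Durable competitive advantage |
| multi-agents | Systems with multiple agents |
| multimodal | Spanning several modalities, such as text and images |
| Noah Smith | Person |
| non-determinism | Behavior that can differ from run to run |
| online monitoring | Monitoring of a live system |
| Paul Kalanithi | Person |
| playbook | Established method or operations manual |
| PM | Product manager |
| PRD | Product requirements document |
| pre-authorization | Approval required before a service is provided |
| proactive agents | Agents that act without being asked |
| problem first | Starting from the problem rather than the tool |
| production monitoring | Monitoring of the deployed product |
| prompt | Input text given to a model |
| prompt injection | Attack that inserts instructions into a prompt |
| Rackspace | Company |
| Raycast | Productivity launcher |
| ROI | Return on investment |
| routing | Classifying and dispatching requests |
| semantic diffusion | A term losing its meaning through loose use |
| standard operating procedures | Documented standard procedures |
| subagents | Agents directed by another agent |
| subject matter experts | Domain experts |
| supervisor agent | Agent that orchestrates subagents |
| tech debt | Technical debt |
| underwriters | People who assess and approve loans |
| Vernor Vinge | Person |
| vibe coding | Building software by prompting AI and iterating on feel |
| Wispr Flow | Voice dictation tool |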