Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)
About AI Evals
Lenny Rachitsky: To build great AI products, you need to be really good at building evals. It’s the highest ROI activity you can engage in.
Hamel Husain: This process is a lot of fun. Everyone that does this immediately gets addicted to it. When you’re building an AI application, you just learn a lot.
Today’s Guests: Hamel & Shreya
Lenny Rachitsky: What’s cool about this is you don’t need to do this many, many times. For most products, you do this process once and then you build on it.
Shreya Shankar: The goal is not to do evals perfectly, it’s to actionably improve your product.
Intro to AI Evals
Lenny Rachitsky: I did not realize how much controversy and drama there is around evals. There’s a lot of people with very strong opinions.
Shreya Shankar: People have been burned by evals in the past. People have done evals badly, so then they didn’t trust it anymore, and then they’re like, “Oh, I’m anti evals.”
Lenny Rachitsky: What are a couple of the most common misconceptions people have with evals?
Beyond Vibe Checks
Hamel Husain: The top one is, “We live in the age of AI. Can’t the AI just eval it?” But it doesn’t work.
Real-World Eval Demo
Lenny Rachitsky: A term that you used in your posts that I love is this idea of a benevolent dictator.
Error Analysis: Start with Data
Hamel Husain: When you’re doing this open coding, a lot of teams get bogged down in having a committee do this. For a lot of situations, that’s wholly unnecessary. You don’t want to make this process so expensive that you can’t do it. You can appoint one person whose taste you trust. It should be the person with domain expertise. Oftentimes, it is the product manager.
Lenny Rachitsky: Today, my guests are Hamel Husain and Shreya Shankar. One of the most trending topics on this podcast over the past year has been the rise of evals. Both the chief product officers of Anthropic and OpenAI shared that evals are becoming the most important new skill for product builders. And since then, this has been a recurring theme across many of the top AI builders I’ve had on. Two years ago, I had never heard the term evals. Now it’s coming up constantly. When was the last time that a new skill emerged that product builders had to get good at to be successful?
Hamel and Shreya have played a major role in shifting evals from being an obscure, mysterious subject to one of the most necessary skills for AI product builders. They teach the definitive online course on evals, which happens to be the number one course on Maven. They’ve now taught over 2,000 PMs and engineers across 500 companies, including large swaths of the OpenAI and Anthropic teams along with every other major AI lab.
In this conversation, we do a lot of show versus tell. We walk through the process of developing an effective eval, explain what the heck evals are and what they look like, address many of the major misconceptions with evals, give you the first few steps you can take to start building evals for your product, and also share just a ton of best practices that Hamel and Shreya have developed over the past few years. This episode is the deepest yet most understandable primer you’ll find on the world of evals. And honestly, it got me excited to write evals, even though I have nothing to write evals for. I think you’ll feel the same way as you watch this.
If this conversation gets you excited, definitely check out Hamel and Shreya’s course on Maven. We’ll link to it in the show notes. If you use the code LENNYSLIST when you purchase the course, you’ll get 35% off the price of the course. With that, I bring you Hamel Husain and Shreya Shankar.
Real Trace Example
Hamel Husain: Thank you for having us.
Shreya Shankar: Yeah, super excited.
Lenny Rachitsky: I’m even more excited. Okay, so a couple years ago, I had never heard the term evals. Now it’s one of the most trending topics on my podcast, essentially, that to build great AI products, you need to be really good at building evals. Also, it turns out some of the fastest-growing companies in the world are basically building and selling and creating evals for AI labs. I just had the CEO of Mercor on the podcast. So there’s something really big happening here. I want to use this conversation to basically help people understand this space deeply, but let’s start with the basics. Just what the heck are evals? For folks that have no idea what we’re talking about, give us just a quick understanding of what an eval is, and let’s start with Hamel.
Error Analysis Methods
Hamel Husain: Sure. Evals is a way to systematically measure and improve an AI application, and it really doesn’t have to be scary or unapproachable at all. It really is, at its core, data analytics on your LLM application and a systematic way of looking at that data, and where necessary, creating metrics around things so you can measure what’s happening, and then so you can iterate and do experiments and improve.
Lenny Rachitsky: So that’s a really good broad way of thinking about it. Going one level deeper, to give people an even more concrete way of imagining and visualizing what we’re talking about (an example to show would be even better), what’s a deeper way of understanding what an eval is?
SMS Channel Issues
Hamel Husain: Let’s say you have a real estate assistant application and it’s not working the way you want. It’s not writing emails to customers the way you want, or it’s not calling the right tools, or any number of errors. And before evals, you would be left with guessing. You would maybe fix a prompt and hope that you’re not breaking anything else with that prompt, and you might rely on vibe checks, which is totally fine.
And vibe checks are good and you should do vibe checks initially, but it can become very unmanageable very fast because as your application grows, it’s really hard to rely on vibe checks. You just feel lost. And so evals help you create metrics that you can use to measure how your application is doing, and they give you a way to improve your application with confidence, because you have a feedback signal to iterate against.
Lenny Rachitsky: So just to make very real, so imagining this real estate agent, maybe they’re helping you book a listing or go see an open house. The idea here is you have this agent talking to people, it’s answering questions, pointing them to things. As a builder of that agent, how do you know if it’s giving them good advice, good answers? Is it telling them things that are completely wrong?
So the idea of evals, essentially, is to build a set of tests that tell you, how often is this agent doing something wrong that you don’t want it to do? And there’s a bunch of ways you could define wrong. It could be just making up stuff. It could be just answering in a really strange way. The way I think about evals, and tell me if this is wrong, just simply is like unit tests for code. You’re smiling. You’re like, “No, you idiot.”
Shreya Shankar: No, that’s not what I was thinking.
Lenny Rachitsky: Okay. Okay, okay, tell me. Tell me, how does that feel as a metaphor?
Shreya Shankar: Okay. I like what you said first, which is we had a very broad definition. Evals is a big spectrum of ways to measure application quality. Now, unit tests are one way of doing this. Maybe there are some non-negotiable functionalities that you want your AI assistant to have, and unit tests are going to be able to check that. Now, maybe you also, because these AI assistants are doing such open-ended tasks, you kind of also want to measure how good are they at very vague or ambiguous things like responding to new types of user requests or figuring out if there’s new distributions of data like new users are coming and using your real estate agent that you didn’t even know would use your product. And then all of a sudden, you think, “Oh, there’s a different way you want to kind of accommodate this new group of people.”
So evals could also be a way of looking at your data regularly to find these new cohorts of people. Evals could also be like metrics that you just want to track over time, like you want to track people saying, “Yes. Thumbs up. I liked your message.” You want very, very basic things that are not necessarily AI-related but can go back into this flywheel of improving your product. So I would say, overall, unit tests are a very small part of that very big puzzle.
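To make the spectrum Shreya describes a little more concrete, here is a minimal Python sketch with one unit-test-style check and one metric tracked over time. The `Interaction` record, the `never_empty_reply` rule, and the sample logs are all hypothetical illustrations, not anything from Nurture Boss or the course.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    user_message: str
    assistant_reply: str
    thumbs_up: Optional[bool] = None  # None when the user left no feedback


def never_empty_reply(interaction: Interaction) -> bool:
    """Unit-test-style check for a non-negotiable behavior (hypothetical rule)."""
    return bool(interaction.assistant_reply.strip())


def thumbs_up_rate(interactions: list[Interaction]) -> Optional[float]:
    """Tracked metric: share of rated interactions that received a thumbs up."""
    rated = [i for i in interactions if i.thumbs_up is not None]
    if not rated:
        return None  # nothing rated yet, so no rate to report
    return sum(i.thumbs_up for i in rated) / len(rated)


# Made-up sample logs, for illustration only.
logs = [
    Interaction("Do you have a one-bedroom?", "Yes, several are available.", True),
    Interaction("Any specials?", "We have a 5% military discount.", False),
    Interaction("Thank you", "You're welcome.", None),
]

print(all(never_empty_reply(i) for i in logs))
print(thumbs_up_rate(logs))
```

The hard check covers only the narrow, non-negotiable slice; the rate is the kind of basic signal that feeds back into the improvement flywheel.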
Lenny Rachitsky: Awesome. You guys actually brought an example of an eval just to show us exactly what the hell we’re talking about. We’re talking in these big ideas. So how about let’s pull one up and show people, “Here’s what an eval is.”
The Hallucination Problem
Hamel Husain: Yeah, let me just set the stage for it a little bit. So to echo what Shreya said, it’s really important that we don’t think of evals as just tests. There’s a common trap that a lot of people fall into because they jump straight to the test like, “Let me write some tests,” and usually that’s not what you want to do. You should start with some kind of data analysis to ground what you should even test, and that’s a little bit different than software engineering where you have a lot more expectations of how the system is going to work. With LLMs, it’s a lot more surface area. It’s very stochastic, so you kind of have a different flavor here.
And so the example I’m going to show you today, it’s actually a real estate example. It’s a different kind of real estate example. It’s from a company called Nurture Boss. I can share my screen to show you their website just to help you understand this use case a little bit, so let me share my screen. So this is a company that I worked with. It’s called Nurture Boss, and it is an AI assistant for property managers who are managing apartments, and it helps with various tasks such as inbound leads, customer service, booking appointments, so on and so forth. It’s like all the different sort of operations you might be doing as a property manager, it helps you with that. And so you can see kind of what they do. It’s a very good example because it has a lot of the complexities of a modern AI application.
So there’s lots of different channels through which you can interact with the AI, like chat, text, voice, but also, there’s tool calls, lots of tool calls for booking appointments, getting information about availability, so on and so forth. There’s also RAG retrieval, getting information about customers and properties and things like that. So it’s pretty fully fleshed out in terms of an AI application. And so they have been really generous with me in allowing me to use their data as a teaching example. And so we have anonymized it, but what I’m going to walk through today is, okay, let’s do the first part of how we would start to build evals for Nurture Boss. Why would we even want to do that?
So let’s go through the very beginning stage, what we call error analysis, which is, let’s look at the data of their application and first start with what’s going wrong. So I’m going to jump to that next, and I’m going to open an observability tool. And you can use whatever you want here. I just happen to have this data loaded in a tool called Braintrust, but you can load it in anything. We don’t have a favorite tool or anything. In the blog post that we wrote with you, we had the same example but in Phoenix Arize, and I think Aman, on your blog post, used Phoenix Arize as well. And there’s also LangSmith. So these are kind of like different tools that you can use.
So what you see here on the screen, this is logs from the application, and let me just show you how it looks. So what you see here is, and let me make it full screen, this is one particular interaction that a customer had with the Nurture Boss application, and what it is is a detailed log of everything that happened. So it’s called a trace, and it’s just the engineering term for logs of a sequence of events. The concept of a trace has been around for a really long time, but it’s especially really important when it comes to AI applications.
And so we have all the different components and pieces and information that the AI needs to do its job, and we have logged all of it and we’re looking at a view of that. And so you see here a system prompt. The system prompt says, “You are an AI assistant working as a leasing team member at Retreat at Acme Apartments.” Remember, I said this is anonymized, so that’s why the name is Acme Apartments. “Your primary role is to respond to text messages from both current residents and prospective residents. Your goal is to provide accurate, helpful information,” yada, yada, yada. And then there’s a lot of detail around guidelines of how we want this thing to behave.
Lenny Rachitsky: Is this their actual system prompt, by the way, for this company?
LLM Limits in Open Coding
Hamel Husain: It is. Yes, it is.
The Benevolent Dictator
Lenny Rachitsky: Amazing. That’s so cool.
Hamel Husain: It’s a real system prompt.
More Open Coding Examples
Lenny Rachitsky: That’s amazing because it’s rare you see an actual company product’s system prompt. That’s like their crown jewels a lot of times, so this is actually very cool on its own.
Hamel Husain: Yeah. Yeah, it’s really cool. And you see all of these different sort of features that are different use cases, so things about tour scheduling, handling applications, guidance on how to talk to different personas, so on and so forth. And you can see the user just kind of jumps in here and asks, “Okay, do you have a one-bedroom with study available? I saw it on virtual tours.” And then you can see that the LLM calls some tools. It calls this get individual’s information tool, and it pulls back that person’s information. And then it gets the community’s availability. So it’s querying a database with the availability for that apartment complex.
And then finally, the AI responds, “Hey, we have several one-bedroom apartments available, but none specifically listed with a study. Here are a few options.”
And then it says, “Can you let me know when one with a study is available?”
And then it says, “I currently don’t have specific information on the availability of a one-bedroom apartment.”
User says, “Thank you.”
And the AI says, “You’re welcome. If you have any more questions, feel free to reach out.” Now, this is an example of a trace, and we’re looking at one specific data point. And so one thing that’s really important to do when you’re doing data analysis of your LLM application is to look at data. Now, you might wonder, “There’s a lot of these logs. It’s kind of messy. There’s a lot of things going on here. How in the hell are you supposed to look at this data? Do you want to just drown in this data? How do you even analyze this data?”
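One way to picture the trace just described is as an ordered list of events. The sketch below is only an illustration: the `Event` and `Trace` classes, the `kind` labels, and the snake_case tool names are assumptions, and real observability tools each have their own schema.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    kind: str     # "system", "user", "tool_call", or "assistant" (hypothetical labels)
    content: str


@dataclass
class Trace:
    trace_id: str
    events: list[Event] = field(default_factory=list)

    def transcript(self) -> str:
        """Render the events in order, one per line, for a human reviewer."""
        return "\n".join(f"[{e.kind}] {e.content}" for e in self.events)


# Abbreviated version of the interaction above; tool names are guesses.
trace = Trace("trace-001", [
    Event("system", "You are an AI assistant working as a leasing team member at Retreat at Acme Apartments."),
    Event("user", "Do you have a one-bedroom with study available?"),
    Event("tool_call", "get_individuals_information"),
    Event("tool_call", "get_communities_availability"),
    Event("assistant", "We have several one-bedroom apartments available, but none specifically listed with a study."),
])

print(trace.transcript())
```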
So it turns out there is a way to do it that is completely manageable, and it’s not something that we invented. It’s been around in machine learning and data science for a really long time, and it’s called error analysis. And what you do is, the first step in conquering data like this is just to write notes. Okay? So you got to put your product hat on, which is why we’re talking to you, because product people have to be in the room and they have to be involved in sort of doing this. Usually a developer is not suited to do this, especially if it’s not a coding application.
Understanding Theoretical Saturation
Lenny Rachitsky: And just to mirror back, why I think you’re saying that is because this is the user experience of your product. People talking to this agent is the entire product essentially, and so it makes sense for the product person to be super involved in this.
Hamel Husain: Yeah. So let’s reflect on this conversation. Okay, a user asked about availability. The AI said, “Oh, we don’t really have that. Have a nice day.” Now, for a product that is helping you with lead management, is that good? Do you feel like this is the way we want it to go?
Lenny Rachitsky: Not ideal.
Hamel Husain: Yes, not ideal, and I’m glad you said that. A lot of people would say, “Oh, it’s great. The AI did the right thing. It looked, it said, ‘We didn’t have it available,’ and it’s not available.” But with your product hat on, you know that’s not correct. And so what you would do is you would just write a quick note here. You would say, “Okay.” You might pop in here, and you can write a note. So every observability application has the ability to write notes, and you wouldn’t try to figure out if something is wrong. In this case, it’s kind of not doing the right thing, but you just write a quick note, “Should have handed off to a human.”
Lenny Rachitsky: And as we watch this happening, it’s like you mention this and you’ll explain more. You’re doing this, this feels very manual and unscalable, but as you said, this is just one step of the process and there’s a system to this. That was just the first one.
Hamel Husain: Yeah, and you don’t have to do it for all of your data. You sample your data and just take a look, and it’s surprising how much you learn when you do this. Everyone that does this immediately gets addicted to it and they say, “This is the greatest thing that you can do when you’re building an AI application.” You just learn a lot and you’re like, “Hmm, this is not how I want it to work. Okay.” And so that’s just an example.
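A rough sketch of the sampling step, assuming the traces can be referred to by ID. The function name, the ID format, and the default sample size are illustrative choices, not a recommendation from the speakers' tooling.

```python
import random


def sample_traces(trace_ids: list[str], k: int = 100, seed: int = 0) -> list[str]:
    """Pick up to k trace IDs uniformly at random, reproducibly via the seed."""
    rng = random.Random(seed)
    k = min(k, len(trace_ids))  # never ask for more traces than exist
    return rng.sample(trace_ids, k)


# Hypothetical IDs standing in for whatever the logging tool exports.
all_ids = [f"trace-{n:04d}" for n in range(2500)]
to_review = sample_traces(all_ids, k=100)
print(len(to_review))
```

Fixing the seed means a second reviewer can pull exactly the same sample later.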
So you write this note, and then we can go on to the next trace. So this is the next trace. I just pushed a hot key on my keyboard. Let me go back to looking at it.
Lenny Rachitsky: And these tools make it easy to go through a bunch and add these notes quickly.
Hamel Husain: Yes. And so this is another one. Similar system prompt. We don’t need to go through all of it again. We’ll just jump right into the user question. “Okay, I’ve been texting you all day.” Isn’t that funny? And the user says, “Please.” Okay, yeah, this one is just like an error in the application where this is a text message application, sorry, the channel through which the customer is communicating is through text message, and you’re just getting really garbled. And you can see here that it kind of doesn’t make sense. The words are being cut off like, “In the meantime,” and then the system doesn’t know how to respond, because you know how people text message, they write short phrases. They split their sentence across four or five different turns. So in this case-
Lenny Rachitsky: Yeah, so what do you do with something like that?
Hamel Husain: Yeah, so this is a different kind of error.
Lenny Rachitsky: Mm.
Hamel Husain: This is more of, “Hey, we’re not handling this interaction correctly. This is more of a technical problem,” rather than, “Hey, the AI is not doing exactly what we want.” So we would write that down too.
Lenny Rachitsky: Which is still really cool.
Hamel Husain: Yeah.
Lenny Rachitsky: It’s amazing you’re catching that, too, here. Otherwise, you’d have no idea this was happening.
Hamel Husain: Yeah, you might not know this is happening, right? And so you would just say, “Okay.” You would write a note like, “Oh, conversation flow is janky because of text message.”
Lenny Rachitsky: And I like that, I like that you’re using the word janky. It shows you just how informal this can be at this stage.
Hamel Husain: Yeah, it’s supposed to be chill. Just don’t overthink it. And there’s a way to do this. So the question always comes up, how do you do this? Do you try to find all the different problems in this trace? What do you write a note about? And the answer is, just write down the first thing that you see that’s wrong, the most upstream error. Don’t worry about all the errors, just capture the first thing that you see that’s wrong, and stop, and move on. And you can get really good at this. The first two or three can be very painful, but you can do a bunch of them really fast.
So here’s another one, and let’s skip the system prompt again. And the user asks, “Hey, I’m looking for a two- to three-bedroom with either one or two baths. Do you provide virtual tours?”
And a bunch of tools are called and it says, “Hi Sarah. Currently, we have three-bedroom, two-and-a-half-bathroom apartment available for $2,175. Unfortunately, we don’t have any two-bedroom options at the moment. We do offer virtual tours. You can schedule a tour,” blah, blah. It just so happens that there is no virtual tour, right?
Lenny Rachitsky: Mm-hmm. Nice.
Hamel Husain: So it is hallucinating something that doesn’t exist. Then you kind of have to bring your context as an engineer, or even product context, and say, “Hey, this is kind of weird. We shouldn’t be telling a person about a virtual tour when it’s not offered.”
So you would say, “Okay, offered virtual tour,” and you just write the note. So you can see there’s a diversity of different kinds of errors that we’re seeing, and we’re actually learning a lot about your application in a very short amount of time.
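The notes written so far could be kept as one free-form line per trace and exported as a CSV for later steps. This is a minimal sketch; the `OpenCode` class, the trace IDs, and the column names are assumptions, and most observability tools provide their own note and export features.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class OpenCode:
    trace_id: str
    note: str  # free-form: the first (most upstream) problem spotted, nothing more


def to_csv(codes: list[OpenCode]) -> str:
    """Serialize the notes as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["trace_id", "note"])
    for code in codes:
        writer.writerow([code.trace_id, code.note])
    return buf.getvalue()


# The three notes from the walkthrough; the trace IDs are made up.
notes = [
    OpenCode("trace-001", "Should have handed off to a human"),
    OpenCode("trace-002", "Conversation flow is janky because of text message"),
    OpenCode("trace-003", "Offered virtual tour"),
]

print(to_csv(notes))
```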
Building Intuition for Saturation
Shreya Shankar: One common question that we get from people at this stage is, “Okay, I understand what’s going on. Can I ask an LLM to do this process for me?”
Lenny Rachitsky: Mm, great question.
Shreya Shankar: And I loved Hamel’s most recent example because what we usually find when we try to ask an LLM to do this error analysis is it just says the trace looks good because it doesn’t have the context needed to understand whether something might be bad product smell or not. For example, the hallucination about scheduling the tour, right? I can guarantee you, I would bet money on this, if I put that into ChatGPT and asked, “Is there an error?” it would say, “No, did a great job.”
But Hamel had the context of knowing, “Oh, we don’t actually have this virtual tour functionality,” right? So I think, in these cases, it’s so important to make sure you are manually doing this yourself. And we can talk a little bit more about when to use LLMs in the process later, but number one pitfall right here is people are like, “Let me automate this with an LLM.”
Lenny Rachitsky: Do you think we’ll get to a place where an agent can do this, where it has that context?
Shreya Shankar: Oh, no. No, no, no. Sorry. There are parts of error analysis that an LLM is suited for, which we could talk about later in this podcast. But right now, this stage of free-form note-taking is not the place for an LLM.
Lenny Rachitsky: Got it. And this is something you call open coding, this step?
Shreya Shankar: Yes, absolutely.
Lenny Rachitsky: Cool. Another term that you used in your posts that I love and that fits into this step is this idea of a benevolent dictator. Maybe just talk about what that is, and maybe, Shreya, cover that.
Shreya Shankar: Yeah, so Hamel actually came up with this term.
Lenny Rachitsky: Okay, maybe Hamel cover that, actually.
From Open to Axial Coding
Hamel Husain: No problem. And we’ll actually show the LLM automation in this example, because we’re going to take this example, we’re going to go all the way through.
Lenny Rachitsky: Amazing.
Longstanding ML Traditions
Hamel Husain: And so benevolent dictator is just a catchy term for the fact that when you’re doing this open coding, a lot of teams get bogged down in having a committee do this. And for a lot of situations, that’s wholly unnecessary. People get really uncomfortable with, “Okay, we want everybody on board. We want everybody involved,” so on and so forth. You need to cut through the noise. And in a lot of organizations, if you look really deeply, especially small, medium-sized companies, you can appoint one person whose taste you trust. And you can do this with a small number of people and often one person, and it’s really important to make this tractable. You don’t want to make this process so expensive that you can’t do it. You’re going to lose out.
So that’s the idea behind benevolent dictator, is, “Hey, you need to simplify this across as many dimensions as you can.” Another thing that we’ll talk about later is when it comes to building an LLM as a judge, you need a binary score. You don’t want to think about, “Is this like a 1, 2, 3, 4, 5?” Like, assign a score to it. You can’t. That’s going to slow it down.
Lenny Rachitsky: Just to make sure this benevolent dictator point is really clear, basically, this is the person that does this note-taking, and ideally they’re the expert on the stuff. So if it’s law stuff, maybe there’s a legal person that owns this, it could be a product manager. Give us advice on who this person should be?
AI Tools for Coding & Categorization
Hamel Husain: Yeah. It should be the person with domain expertise. So in this case, it would be the person who understands the business of leasing, apartment leasing, and has context to understand if this makes sense. It’s always a domain expert, like you said. Okay. For legal, it would be a law person. For mental health, it would be the mental health expert, whether that’s a psychiatrist or someone else.
Rhythm of Continuous Improvement
Lenny Rachitsky: Cool.
Pivot Tables to Issue Prioritization
Hamel Husain: Though oftentimes, it is the product manager.
Lenny Rachitsky: Cool. So the advice here is pick that person. It may not feel so super fair that they’re the one in charge and they’re the dictator, but they’re benevolent. It’s going to be okay.
When to Write Evals
Hamel Husain: Yeah. It’s going to be okay. It’s not perfection. You’re just trying to make progress and get signal quickly so you have an idea of what to work on because it can become infinitely expensive if you’re not careful.
Lenny Rachitsky: Yeah. Okay, cool. Let’s go back to your examples.
Building LLM as a Judge
Hamel Husain: Yeah, no problem. So this is another example where we have someone saying, “Okay. Do you have any specials?” And the assistant or the AI responds, “Hey, we have a 5% military discount.” User responds, and it switches the subject, “Can you tell me how many floors there are? Do you have any one-bedrooms available or one-bedrooms on the first floor?” And the AI responds, “Yeah, okay. We have several one-bedroom apartments available.” And then the user wants to confirm, “Any of those on the first floor and how much are the one-bedrooms?” And then also, it’s a current resident, so they’re also asking, “I need a maintenance request.”
You could see the messiness of the real world in here, and the assistant just calls a tool that says transfer call, but it doesn’t say anything. It just abruptly does transfer call, so it’s pretty jank, I would say. It’s just not-
Lenny Rachitsky: Another jank.
LLM as a Judge Example
Hamel Husain: Another kind of jank, a different kind of jank. So when you write the open note, you don’t want to say jank, because what we want to do is we want to understand, and when we look at the notes later on, we want to understand what happened.
So you just want to say, “Did not confirm call transfer with user.” And it doesn’t have to be perfect. You just have to have a general idea of what’s going on.
Lenny Rachitsky: Cool.
Judge Prompts & Human Alignment
Hamel Husain: So, okay. So let’s say we do, and Shreya and I, we recommend doing at least 100 of these. The question is always, “How many of these do you do?” And so there’s not a magic number. We say 100 just because we know that as soon as you start doing this, once you do 20 of these, you will automatically find it so useful that you will continue doing it.
So we just say 100 to mentally unblock you, so it’s not intimidating. It’s like, “Don’t worry, you’re only going to do 100.” And there is a term for that, so the right answer is, “Keep looking at traces until you feel like you’re not learning anything new.” Maybe Shreya should talk about-
Shreya Shankar: Yeah. So there’s actually a term-
Hamel Husain: … that.
Shreya Shankar: … in data analysis and qualitative analysis called theoretical saturation. So what this means is when you do all of these processes of looking at your data, when do you stop? It’s when you are theoretically saturating or you’re not uncovering any new types of notes, new types of concepts, or nothing that will materially change the next part of your process.
And this kind of takes a little bit of intuition to develop, so typically, people don’t really know when they’ve reached theoretical saturation yet. That’s totally fine. When you do two or three examples or rounds of this, you will develop the intuition. A lot of people realize, “Oh, okay. I only need to do 40, I only need to do 60. Actually, I only need to do 15.” I don’t know. Depends on the application and depends on how savvy you are with error analysis for sure.
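Saturation is a judgment call, but a crude proxy can help build the intuition: count how many traces in a row added no new kind of problem. The sketch below assumes the reviewer jots a rough label next to each note; the function name and the `patience` threshold are hypothetical, and this is a heuristic, not a formal definition of theoretical saturation.

```python
from typing import Iterable, Optional


def traces_until_saturation(labels: Iterable[str], patience: int = 20) -> Optional[int]:
    """Return how many traces had been reviewed when `patience` consecutive
    traces surfaced no new label, or None if that point was never reached."""
    seen: set[str] = set()
    since_new = 0
    for reviewed, label in enumerate(labels, start=1):
        if label in seen:
            since_new += 1
        else:
            seen.add(label)
            since_new = 0  # a new kind of problem resets the streak
        if since_new >= patience:
            return reviewed
    return None


# Made-up labels in review order, with a small patience for illustration.
rough_labels = ["handoff", "sms", "hallucination", "handoff", "sms",
                "handoff", "handoff", "handoff"]
print(traces_until_saturation(rough_labels, patience=3))
```

With these toy labels the streak of three repeats completes at the sixth trace.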
Evals as Requirement Docs
Lenny Rachitsky: And your point about you’re going to want to do a bunch. I imagine it’s because you’re just like, “Oh, I’m discovering all these problems. I got to see what else is going on here.”
Shreya Shankar: Exactly.
Lenny Rachitsky: Is that right?
Shreya Shankar: And promise, at some point, you’re not going to discover new types of problems.
Lenny Rachitsky: Yeah. Awesome. So let’s say you did 100 of these, what’s the next step?
Advanced Data Analysis Tips
Hamel Husain: Yeah. Okay. So you did 100 of these. Now you have all these notes. So this is where you can start using AI to help you. So the part where you looked at this data is important, like we discussed. You don’t want to automate this part too much.
Lenny Rachitsky: Humans will still have jobs. This is the takeaway here. That’s great.
Hamel Husain: Yes.
Lenny Rachitsky: Just reviewing traces. At least there’s one job left for now. Great.
Hamel Husain: So, yeah. Exactly. And so, okay. You have all these notes. Now, to turn this into something useful, you can do basic counting. So basic counting is the most powerful analytical technique in data science because it’s so simple and it’s kind of undervalued in many cases, and so it’s very approachable for people.
And so the first thing you want to do is take these notes, and you can categorize them with an LLM, and so there’s a lot of different ways to do that. Right before this podcast, I tried three different coding agents or AI tools to categorize these notes. So one is, okay, I uploaded a CSV of these notes into a Claude project, and I just exported them directly from this interface. There’s a lot of different ways to do this, but I’m showing you the simple, stupid way, the most basic way of doing things.
And so I dumped the CSV in here and I said, “Please analyze the following CSV file.” And I told it there’s a metadata field that has a note in it, but what I said is I used the word open codes, and I said, “Hey, I have different open codes,” and that’s a term of art. LLMs know what open codes are and they know what axial codes are because it is a concept that’s been around for a really long time, so those words help me shortcut what I’m trying to do.
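As a sketch of the same idea in code, the function below reads the notes out of a CSV and assembles a prompt that uses the terms "open codes" and "axial codes". The prompt wording is an illustrative guess, not the actual prompt Hamel used, and the column name `note` is an assumption. The resulting string could be pasted into whichever LLM tool is at hand.

```python
import csv
import io


def build_axial_coding_prompt(csv_text: str, note_column: str = "note") -> str:
    """Collect non-empty notes from CSV text and wrap them in a prompt."""
    reader = csv.DictReader(io.StringIO(csv_text))
    notes = []
    for row in reader:
        note = (row.get(note_column) or "").strip()
        if note:  # skip traces that were reviewed but had no note
            notes.append(note)
    numbered = "\n".join(f"{i}. {n}" for i, n in enumerate(notes, start=1))
    return (
        "Please analyze the following open codes from an error analysis of an "
        "LLM application. Group them into a small set of axial codes "
        "(failure-mode categories) and list which open codes fall under each.\n\n"
        f"Open codes:\n{numbered}"
    )


# Tiny made-up export; the second row has no note and is skipped.
sample_csv = (
    "trace_id,note\n"
    "t1,Should have handed off to a human\n"
    "t2,\n"
    "t3,Offered virtual tour\n"
)
prompt = build_axial_coding_prompt(sample_csv)
print(prompt)
```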
Lenny Rachitsky: That’s awesome. And the end of the prompt is telling it to create axial codes?
Hamel Husain: Yes. Creating axial codes, so what it does is-
Downstream Uses of LLM Judges
Shreya Shankar: So maybe it’s worth talking about what axial codes are and what the point is here. You have a mess of open codes, and you don’t have 100 distinct problems. Actually, many of them are repeats that you just phrased differently, and that’s expected, because you shouldn’t have tried to create your taxonomy of failures as you’re open coding. You just want to get down what’s wrong and then organize, “Okay, what’s the most common failure mode?”
So the purpose, axial code basically is just a failure mode. It’s the label or category. And what our goal is, is to get to this clusters of failure modes and figure out what is the most prevalent, so then you can go and run and attack that problem.
The Evals Debate
Lenny Rachitsky: That is really helpful. Basically, just synthesizing all these-
Shreya Shankar: Absolutely.
Lenny Rachitsky: … into categories and themes. Super cool. And we’ll include this prompt in our show notes for folks so they don’t have to sit there and screenshot it and try to type it up themselves.
Limits of Dogfooding
Hamel Husain: Yeah. Great idea. And so Claude went ahead and analyzed the CSV file and decided how to parse it, blah, blah, blah. We don’t need to worry about all that stuff, but it came up with a bunch of axial codes. Basically, axial codes are categories, like Shreya said. So one is, okay, capability limitations, misrepresentation, process and protocol violations, human handoff issues, communication, quality. It created these categories.
Now, do I like all the categories? Not really. I like some of them. It’s a good first stab at it. I would probably rename it a little bit because some of them are a bit too generic. Like what is capability limitation? That’s a little bit too broad. It’s not actionable. I want to get a little bit more actionable with it so that if I do decide it’s a problem, I know what to do with it, but we’ll discuss that in a little bit. So you can do this with anything, and this is the dumbest way to do it, but dumb sometimes is a good way to get started, so-
Evals vs. A/B Testing
Lenny Rachitsky: And this is what LLMs are really good at, taking a bunch of information and synthesizing it.
Shreya Shankar: Absolutely. Synthesizing for us to make sense of, right? Note that it’s not automatically proposing fixes or anything, that’s our job, but now, we can wade through this mess of open codes a lot easier.
Another thing that’s interesting here in this prompt to generate the axial codes is you can be very detailed if you want, right? You can say, “I want each axial code to actually be some actionable failure mode,” and maybe the LLM will understand that and propose it, or, “I want you to group these open codes by what stage of the user story that it’s in.” So this is where you can be creative or do what’s best for you as a product manager or engineer working on this, and that will help you do the improvement later.
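As a rough illustration of how that kind of prompt can be parameterized, here is a minimal Python sketch. It only assembles the prompt text and does not call any model. The function name, the wording of the prompt, and the sample notes are all hypothetical, not the prompt shown in the episode.

```python
def build_axial_coding_prompt(open_codes, guidance=""):
    """Assemble a prompt asking an LLM to group open codes into axial codes.

    `guidance` is optional extra instruction text, e.g. asking for
    actionable failure modes or grouping by stage of the user story.
    """
    numbered = "\n".join(
        f"{i}. {note}" for i, note in enumerate(open_codes, start=1)
    )
    prompt = (
        "Below are open codes (free-form notes) written while reviewing "
        "traces of an AI assistant.\n"
        "Group them into a small set of axial codes (failure-mode categories).\n"
    )
    if guidance:
        prompt += f"Additional guidance: {guidance}\n"
    prompt += f"\nOpen codes:\n{numbered}\n"
    return prompt


# Hypothetical notes, for illustration only.
example_prompt = build_axial_coding_prompt(
    ["Did not transfer to a human when asked", "Reply cut off mid-sentence"],
    guidance="Each axial code should be an actionable failure mode.",
)
```

Changing only the `guidance` string is one way to try the variations described above without rewriting the rest of the prompt.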
OpenAI Acquires Statsig
Lenny Rachitsky: So there’s no definitive prompt of, “Here’s the one way to do it”?
Shreya Shankar: Absolutely.
Lenny Rachitsky: You’re saying you can iterate, see what works for you?
Shreya Shankar: Absolutely.
Lenny Rachitsky: It’s interesting the tools don’t do this, or do they try and they just don’t do a great job?
Shreya Shankar: No, I don’t think they do it. We’ve been screaming from the rooftops, “Please, please-”
Lenny Rachitsky: Oh, wow.
Shreya Shankar: “… do this.” I do think it’s a little bit hard, right? Part of this whole experience with the evals course Hamel and I are teaching is that a lot of people don’t actually know this, so maybe it’s that people don’t know this and they don’t know how to build tools for it. And hopefully, we can demystify some of this magic.
Lenny Rachitsky: And just to double-click on this point, this is not a thing everyone does or knows. This is something you two developed based on your experience doing data analysis and data science at other companies?
Shreya Shankar: Well, I want to caveat that we didn’t invent error analysis. We don’t actually want to invent things. That’s bad signal. If somebody is coming to you with a way to do something that’s entirely new and not grounded in hundreds of years of theory and literature, then you should, I don’t know, be a little bit wary of that.
But what we tried to do was distill, “Okay, what are the new tools and techniques that you need to make sense of LLM error analysis?” And then we created a curriculum or structured way of doing this. So this is all very tailored to LLMs, but the terms open coding and axial coding are grounded in social science.
Lenny Rachitsky: Amazing. Okay. What’s funny about you guys doing this is I just want to go do this somewhere. I don’t have any AI product to do this on, but it’s just like, “Oh, this would be so fun.” Just sit there and find all the problems I’m running into and categorize them and then try to fix them.
Shreya Shankar: I love that.
Lenny Rachitsky: Hamel pulled up a video. What do you got going on here?
The Future of Eval Tools
Hamel Husain: Yeah. So I pulled up a video just to drive home Shreya’s point. We are not inventing anything, so what you see on the screen here is Andrew Ng, one of the most famous machine learning researchers in the world, who has taught a lot of people, frankly, machine learning. And you can see this is an eight-year-old video, and he’s talking about error analysis.
And so this is a technique that’s been used to analyze stochastic systems for ages, and we’re just using the same machine learning ideas and principles, just bringing them into here, because again, these are stochastic systems.
Market Demand for Evals
Lenny Rachitsky: Awesome. Well, one thing, we’re working on getting Andrew on the podcast, we’re chatting, so that will-
Shreya Shankar: Nice.
Lenny Rachitsky: … be really fun. Two, I love that my podcast episode that just came out today is in your feed there, and it’s standing out really well in that feed, so I’m really happy about that [inaudible 00:39:13].
Common Evals Misconceptions
Hamel Husain: Very nice. Yeah. The recommendation algorithm is quite good.
Practical Evals Advice
Lenny Rachitsky: Yes. Here we go. Hope you click on that. Don’t screw my algorithm. Okay, cool. So we’ve done some synthesis. I know we’re not going to go through every step. You have a whole course that takes many days to learn this whole process. What else do you want to share about how to go about this process?
Building Data Viewing Tools
Hamel Husain: Okay. So you can do this through anything, and the same thing works just fine in ChatGPT, the same exact prompt. You can see it made axial codes. I really like using Julius AI. It’s one of my favorite tools.
Julius is kind of this third-party tool that uses notebooks. I personally like Jupyter notebooks a lot, and so it’s more of a data science thing, but a lot of product managers are kind of learning notebooks nowadays, and it’s kind of cool. It’s like a fun playground where you can write code and look at data. But we don’t have to go deeply into that. Just wanted to mention, you can use a lot of tools. AI is really good at this.
So let’s go to the fun part. Here we go. So now we have these axial codes. So the first thing I like to do, I have these open codes, and I have the axial codes, let’s say, that we assigned from the Claude project or ChatGPT. And so what I do is I collect them first and I take a look, like, “Do these axial codes make sense?” And I look at the correspondence between the different axial codes and the open codes, and I go through an exercise and I say, “Hmm. Do I like these codes? Can I make them better? Can I refine them? Can I make them more specific?” Instead of being generic, I make them very specific and actionable.
So you see the ones that I came up with here are tour scheduling, rescheduling issues, human handoff or transfer issue, formatting error with an output, conversational flow. We saw the conversational flow issue with the text messages. Making follow-up promises not kept.
And so basically, what you can do now is you have these axial codes, and so I just collect them into a list, so this is an Excel formula. Just collect these codes into a list, and now we have a comma-separated list of these codes. And then what you can simply do is take the notes that you have, those open codes, and tell an AI, and this is using Gemini AI just for simplicity, again, we’re trying to keep it simple: “Categorize the following note into one of the following categories.”
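The same spreadsheet step can be sketched in Python. This is an illustrative version only: it joins the refined axial codes into a comma-separated list and builds one categorization prompt per note, without calling any model. The category wording is paraphrased from the list above, and the function name and sample note are assumptions.

```python
# Refined axial codes, paraphrased from the ones listed in the conversation.
CATEGORIES = [
    "Tour scheduling or rescheduling issue",
    "Human handoff or transfer issue",
    "Formatting error with output",
    "Conversational flow issue",
    "Follow-up promise not kept",
]


def build_categorization_prompt(note, categories):
    """Build a prompt that asks for exactly one category for one open code."""
    category_list = ", ".join(categories)  # the comma-separated list
    return (
        "Categorize the following note into one of the following categories: "
        f"{category_list}.\n"
        "Reply with the category name only.\n\n"
        f"Note: {note}"
    )


# Hypothetical open code, for illustration only.
prompt = build_categorization_prompt(
    "User asked for a person twice; assistant kept answering itself.",
    CATEGORIES,
)
```

Asking for the category name only keeps the reply easy to match against the list afterwards.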
Lenny Rachitsky: For folks watching, I like all these different prompts and formulas you’re sharing. This is the Google Sheets AI prompt.
Time Investment & Consistency
Shreya Shankar: Huge fan.
The Joy in the Process
Hamel Husain: And so basically, what you could do is you can categorize your traces into one of the buckets, and that’s what we have here. We have categorized all those problems that we encountered into one of these things.
Shreya Shankar: And this is automatic, which is very exciting. I mean, the AI is doing it. So this also drives home the point that your open codes have to be detailed, right? You can’t just say janky because if the AI is reading janky, it’s not going to be able to categorize it. Even a human wouldn’t, right? It would have to go and remember why you said janky, so it’s important to be somewhat detailed in your open code.
Courses and Deep Dives
Lenny Rachitsky: Okay. So avoid the word janky. It’s a good rule of thumb.
Shreya Shankar: Yeah. Or have it with 10 other words.
Quick Lightning Q&A
Lenny Rachitsky: Oh, okay. What is-
Hamel Husain: Yeah. I was being funny.
Lenny Rachitsky: Yeah, okay. What are some of those other words that people often use that you think are not good?
Shreya Shankar: I don’t think it’s specific words. I think it’s just people are not detailed enough in the open code, so it’s hard to do the categorization.
Lenny Rachitsky: Great. And by the way, the reason you have to map them back is because, say, Claude or ChatGPT gave you suggestions and you change them and iterated on them, so you can’t just go back and say, “Cool, whatever,” in each bucket?
Hamel Husain: Yeah, yeah.
Lenny Rachitsky: Great.
Hamel Husain: That’s a really good question, actually. It’s good to iterate and think about it a little bit like, “Do I like these open codes? Do these actually make sense to me?” Just like anything that AI does, it’s really good to kind of put yourself in the middle just a little bit.
Lenny Rachitsky: It’s in the loop. Still space for us. Great.
Shreya Shankar: One of the things that I like to do with this step if I’m trying to use AI to do this labeling, is also have a new category called none of the above. So an AI can actually say, “None of the above,” in the axial code, and that informs me, “Okay, my axial codes are not complete. Let’s go look at those open codes, let’s figure out what some new categories are or figure out how to reword my other axial codes.”
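One possible way to handle that extra category in code is sketched below, assuming the model's replies have already been collected as strings. Any reply that does not match a known axial code falls back to "None of the above", and those notes are pulled out for another look. The helper names and sample data are hypothetical.

```python
NONE_OF_THE_ABOVE = "None of the above"


def normalize_label(raw_label, categories):
    """Map a model reply onto a known category, else 'None of the above'."""
    cleaned = raw_label.strip().rstrip(".").lower()
    for category in categories:
        if cleaned == category.lower():
            return category
    return NONE_OF_THE_ABOVE


def uncovered_notes(notes, labels):
    """Return the open codes that no axial code covered."""
    return [
        note for note, label in zip(notes, labels) if label == NONE_OF_THE_ABOVE
    ]


# Hypothetical categories, notes, and model replies.
categories = ["Human handoff or transfer issue", "Conversational flow issue"]
notes = ["Never transferred to a person", "Quoted the wrong rent"]
replies = ["human handoff or transfer issue.", "Pricing error"]

labels = [normalize_label(reply, categories) for reply in replies]
needs_new_category = uncovered_notes(notes, labels)
```

A growing `needs_new_category` list is the signal that the axial codes are incomplete or need rewording.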
Lenny Rachitsky: Awesome. And what’s cool about this is you don’t need to do this many, many times.
Shreya Shankar: No.
Lenny Rachitsky: For most products, you do this process once, and then you build on it, I imagine, and you just tweak it over time?
Shreya Shankar: Absolutely. And it gets so fast. People do this once a week, and you can do all of this in 30 minutes, and suddenly your product is so much better than if you were never aware of any of these problems.
Lenny Rachitsky: Yeah. It’s absurd to feel like you wouldn’t know this is happening. Watching this happening, I’m like, “How could you not do this to your product?”
Shreya Shankar: A lot of people have no idea.
Lenny Rachitsky: Most people. Yeah. We’ll talk about that. There’s a whole debate around this stuff that we want to talk about. Okay, cool. So you have the sheet. What comes next?
Hamel Husain: Okay. So here’s sort of the big unveil. This is the magic moment right now. So we have all these codes that we applied, the ones that we like on our traces. Now, you can do the ta-da, you can count them.
So here’s a pivot table, and we can just do a pivot table on those, and we can count how many times those different things occurred. So what do we find on these traces that we categorized? We found 17 conversational flow issues. And I really like pivot tables because you can do cool things. You can double-click on these. You can say, “Oh, okay. Let me take a look at those,” but that’s going into an aside about pivot tables, how cool they are.
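The counting step is small enough to show in a few lines of Python. This sketch uses `collections.Counter` in place of the pivot table. The labels are made-up toy data, not the counts from the demo.

```python
from collections import Counter


def count_failure_modes(labels):
    """Count how often each axial code was assigned, most frequent first."""
    return Counter(labels).most_common()


# Toy labels, for illustration only.
labels = [
    "Conversational flow issue",
    "Human handoff or transfer issue",
    "Conversational flow issue",
    "Formatting error with output",
    "Conversational flow issue",
    "Human handoff or transfer issue",
]

for failure_mode, count in count_failure_modes(labels):
    print(f"{count:3d}  {failure_mode}")
```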
But now, we have just a nice, rough cut of what are our problems? And now, we have gone from chaos to some kind of thinking around, “Oh, you know what? These are my biggest problems. I need to fix conversational issues, maybe these human handoff issues.” It’s not necessarily the count is the most important thing. It might be something that’s just really bad and you want to fix that, but okay. Now, you have some way of looking at your problem, and now you can think about whether you need evals for some of these.
So there might be some of these things that might be just dumb engineering errors that you don’t need to write an eval for because it’s very obvious on how to fix them. Maybe the formatting error with output, maybe you just forgot to tell the LLM how you want it to be formatted, and you didn’t even say that in the prompt. So just go ahead and fix the prompt maybe, and we can decide, “Okay, do you want to write an eval for that?” You might still want to write an eval for that because you might be able to test that with just code. You could just test the string, does it have the right formatting potentially? Without running an LLM.
So there’s a cost-benefit trade-off to evals. You don’t want to get carried away with it, but you want to usually ground yourself in your actual errors. You don’t want to skip this step. And so the reason I’m kind of spending so much time on this is this is where people get lost. They go straight into evals like, “Let me just write some tests,” and that is where things go off the rails.
Okay. So let’s say we want to tackle one of these things. So for example, let’s say we want to tackle this human handoff issue, and we’re like, “Hmm, I’m not really sure how to fix this. That’s a kind of subjective sort of judgment call on should we be handing off to a human? And I don’t know immediately how to fix it. It’s not super obvious per se. Yeah. I can change my prompt, but I’m not sure. I’m not 100% sure.”
Well, that might be sort of an interesting thing for an LLM as a judge, for example. So there’s different kinds of evals. One is code-based, which you should try to do if you can because they’re cheaper. LLM as a judge is something, it’s like a meta eval. You have to eval that eval to make sure the LLM that’s judging is doing the right thing, which we’ll talk about in a second.
So, okay. LLM as a judge, that’s one thing. Okay. How do you build an LLM as a judge?
Lenny Rachitsky: Before we get into that actually, just to make sure people know exactly what you’re describing there, these two types of evals. One is you said it’s code-based and one is LLM as judge. Maybe Shreya, just help us understand what code-based eval even is? It’s essentially a unit test? Is that a simple way to think about it?
Shreya Shankar: Yeah. Maybe eval is not the right term here, but think automated evaluator. So when we find these failure modes, one of the things we want is, “Okay. Can we now go check the prevalence of that failure mode in an automated way without me manually labeling and doing all the coding and the grouping, and I want to run it on thousands and thousands of traces, I want to run it every week.” That is, okay. You should probably build an automated evaluator to check for that failure mode.
Now, when we’re saying code-based versus LLM-based, we’re saying, “Okay. So maybe I could write a Python function or a piece of code to check whether that failure mode is present in a trace or not.” And that’s possible to do for certain things like checking the output is JSON, or checking that it’s markdown, or checking that it’s short. These are all things you can capture in code or you could approximately capture in code.
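Two of the checks mentioned here, valid JSON and short output, might look like the following in Python. This is a minimal sketch: the word limit is an arbitrary assumption, and a real product would pick its own threshold.

```python
import json


def is_valid_json(output):
    """Return True if the output parses as JSON."""
    try:
        json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


def is_short(output, max_words=60):
    """Return True if the output has at most `max_words` words.

    The default of 60 is an illustrative assumption, not a recommendation.
    """
    return len(output.split()) <= max_words
```

Checks like these run without calling a model, which is why they are cheap to apply to every trace.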
When we’re talking about LLM judge here, we’re saying that this is a complex failure mode and we don’t know how to evaluate in an automated way. So maybe we will try to use an LLM to evaluate this very, very narrow, specific failure mode of handoffs.
Lenny Rachitsky: So just to try to mirror back what you’re describing, you want to test what your, say, agent or AI product is doing. You ask it a question, it gets back with something.
One way to test if it’s giving you the right answer is if it’s consistently doing the same thing, that you could write code to tell you this is true or false. For example, will it ever say there’s a virtual tour? So you could ask it.
Shreya Shankar: Yes.
Lenny Rachitsky: “Do you provide virtual tours?” It says yes or no, and then you could write code to tell you if it’s correct based on that specific answer.
But if you’re asking about something more complicated and it’s not binary, in one world, you need a human to tell you this is correct. The solution to avoid humans having to review all this every time is LLMs replacing human judgment automatically, and you’d call it an LLM as judge. The LLM is the judge of whether this is correct or not.
Shreya Shankar: Absolutely. You nailed it.
Lenny Rachitsky: Great.
Shreya Shankar: So people always think, “Oh, this is at least as hard as my problem of creating the original agent.” And it’s not, because you’re asking the judge to do one thing, evaluate one failure mode, so the scope of the problem is very small and the output of this LLM judge is pass or fail. So it is a very, very tightly scoped thing that LLM judges are very capable of doing very reliably.
Lenny Rachitsky: And the goal here is just to have a suite of tests that run before you ship to production that tell you things are going the way you want them to? The way your agent is interacting is correct?
Shreya Shankar: The beautiful thing about LLM judges, you can use them in unit tests or CI, sure, but you could also use it online for monitoring, right? I can sample 1000 traces every day, run my LLM judge, real production traces, and see what the failure rate is there. This is not a unit test, but still now we get an extremely specific measure of application quality.
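A sketch of that monitoring loop, under stated assumptions: `traces` is a list of production traces already loaded in memory, and `judge` is any callable that returns True when the failure mode is present. Both are placeholders. The stand-in judge below is a plain string check, not an LLM call.

```python
import random


def estimate_failure_rate(traces, judge, sample_size=1000, seed=None):
    """Sample traces, run the judge on each, and return the share flagged."""
    rng = random.Random(seed)
    if len(traces) <= sample_size:
        sample = list(traces)
    else:
        sample = rng.sample(traces, sample_size)
    if not sample:
        return 0.0
    failures = sum(1 for trace in sample if judge(trace))
    return failures / len(sample)


# Stand-in data and judge, for illustration only.
traces = ["ok"] * 90 + ["missed handoff"] * 10
rate = estimate_failure_rate(traces, judge=lambda t: t == "missed handoff")
```

Tracking this number day over day is what turns a single judge into an ongoing quality measure.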
Lenny Rachitsky: Cool. That’s a really great point because a lot of people just see evals for being this not-real-life thing. It’s a thing that you test before it’s actually in the real world. And what’s actually happening in the real world, you’re saying you should actually do exactly that?
Shreya Shankar: Yeah.
Lenny Rachitsky: Test your real thing running in production? And it’s a daily, hourly sort of thing you could be running?
Shreya Shankar: Totally.
Lenny Rachitsky: Awesome. Okay. Hamel’s got an example of an actual LLM as a judge eval here, so let’s take a look.
Hamel Husain: I love how Shreya really teed it up for me, so thank you so much. So what we have is an LLM as a judge prompt for this one specific failure. Like Shreya said, you would want to do one specific failure and you want to make it binary because we want to simplify things. We don’t want, “Hey, score this on a rating of one to five. How good is it?” In most cases, that’s a weasel way of not making a decision. Like, “No, you need to make a decision. Is this good enough or not? Yes or no?”
It can be painful to think about what that is, but you should absolutely do it. Otherwise, this thing becomes very intractable, and then when you report these metrics, no one knows what 3.2 versus 3.7 means, so.
Shreya Shankar: Yeah. We see this all the time also, and even with expert-curated content on the internet where it’s like, “Oh, here’s your LLM judge evaluator prompt. Here’s a one-to-seven scale.”
And I always text Hamel like, “Oh, no. Now, we have to fight the misinformation again because we know somebody is going to try it out and then come back to us and say, ‘Oh, I have 4.2 average,’” and we’re going to be like, “Okay.”
Lenny Rachitsky: It’s wild how much drama there is in the evals space. We’re going to get to that. Oh, man.
Hamel Husain: Okay, so this is your judge prompt. There’s no one way to do it. It’s okay to use an LLM to help you create it, but again, put yourself in the loop. Don’t just blindly accept what the LLM does, and in all of these cases, that’s what we did. With the axial codes, we iterated on this. You can use an LLM to help you create this prompt, but make sure you read it, make sure you edit it, whatever. This is not necessarily the perfect prompt. This is just the stupid, keeping it very simple just to show you the idea. It’s like, “Okay, for this handoff failure,” I said, “Okay, I want you to output true or false,” it’s a binary judge. That’s what we recommend. Then I just go through and say, “Okay, when should you be doing a handoff?” And I just list them out.
Okay, explicit human requests ignored or looped, some policy-mandated transfer, sensitive resident issues, tool or data unavailability, same-day walk-in or tour requests. You need to talk to a human for that, so on and so forth. The idea is, now that I know that this is a failure from my data, I’m interested in iterating on it, because I know this is actually happening all the time. Like Shreya said, it would be nice to have a way not only to evaluate this on the data I have, but also on production data, just to get a sense of, at what scale is this happening? Let me find more traces, let me have a way to iterate on this. We can take this prompt and I’m going to use the spreadsheet again. The first step is, okay, when I’m doing this judge… I wrote the prompt.
Now, a lot of people stop there and they say, “Okay, I have my judge prompt. We’re done. Good, let’s just ship it.” If the judge says it’s wrong, it’s wrong. They just accept it as the gospel, like, “Okay, the LLM says it’s wrong, it must be wrong.” Don’t do that, because that’s the fastest way that you can have evals that don’t match what’s going on, and when people lose trust in your evals, they lose trust in you. It’s really important that you don’t do that, so before you release your LLM as a judge, you want to make sure it’s aligned to the human. How do you do that? You have those axial codes and you want to measure your judge against the axial codes, and say, “Hey, does it agree with me? My own judge, does it agree with me?” Just measure it.
What we have here is, okay, I say, “Assess this LLM trace.” Again, I’m using just spreadsheets here, “Assess this LLM trace according to these rules,” and the rules are just the prompt that I just showed you. I ask it, “Okay, is there a handoff error, true or false?” Then this column, let me just zoom in a bit. Column H, I have, “Okay, did this error occur?” Column G is whether I thought the error occurred or not. You can see-
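To make the shape of such a binary judge concrete, here is a hedged Python sketch. The criteria are paraphrased from the list above, and the prompt wording, function names, and strict true/false parsing are assumptions rather than the prompt shown on screen. No model is called; the sketch only builds the prompt and parses a reply.

```python
# Handoff conditions, paraphrased from the conversation.
HANDOFF_CRITERIA = [
    "An explicit request for a human was ignored or looped",
    "A policy-mandated transfer did not happen",
    "A sensitive resident issue was not escalated",
    "A tool or data source was unavailable and no handoff occurred",
    "A same-day walk-in or tour request was not routed to a human",
]


def build_judge_prompt(trace, criteria):
    """Build a single-failure-mode judge prompt with a binary answer."""
    bullets = "\n".join(f"- {item}" for item in criteria)
    return (
        "You are checking one thing only: did the assistant fail to hand off "
        "to a human when it should have?\n"
        f"A handoff failure means any of the following:\n{bullets}\n\n"
        "Answer with exactly one word: true (failure) or false (no failure).\n\n"
        f"Trace:\n{trace}"
    )


def parse_verdict(reply):
    """Convert the judge's reply to a bool; reject anything not true/false."""
    text = reply.strip().lower()
    if text in ("true", "false"):
        return text == "true"
    raise ValueError(f"Judge reply was not true/false: {reply!r}")
```

Raising on anything other than true or false keeps a vague reply from being silently counted as a pass.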
Lenny Rachitsky: You’re going through manually, you do that.
Hamel Husain: Yeah, yeah, which we already did. We already went through it manually. It’s not like we have to do it again, because we have that cheat code from the axial coding, we already did it. You might have to go through it again if you need more data, and there’s a lot of details to this on how to do this correctly. You want to split your data and do all these things, so that you’re not cheating, but I just want to show you the concept. Basically, what you can do is measure the agreement. Now, one thing you should know, as a product manager, is a lot of people go straight to this agreement. They say, “Okay, my judge agrees with the human some percentage of the time.”
Now that sounds appealing, but it’s a very dangerous metric to use, because a lot of times, errors only happen on the long tail and they don’t happen as frequently, so if you only have the error 10% of the time, then you can easily have 90% agreement by just having a judge say it passes all the time. Does that make sense? 90% agreement looks good on paper, but it might be misleading.
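The arithmetic behind that warning can be checked with a small Python example. The labels are synthetic: 100 traces where the human marked an error on 10, and a judge that always says there is no error.

```python
def agreement(human_labels, judge_labels):
    """Fraction of traces where the judge and the human gave the same label."""
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


# True means "error present". Synthetic labels for illustration.
human = [True] * 10 + [False] * 90
always_pass_judge = [False] * 100

overall = agreement(human, always_pass_judge)
errors_caught = sum(h and j for h, j in zip(human, always_pass_judge))

print(overall)        # 0.9
print(errors_caught)  # 0
```

The judge agrees with the human on 90 of 100 traces while catching none of the 10 real errors, which is why raw agreement alone says little about a rare failure mode.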
Lenny Rachitsky: It’s rare, it’s a rare error. Yeah.
Hamel Husain: As a product manager or someone, even if you’re not doing this calculation yourself, if someone ever reports agreement to you, you should immediately ask, “Okay, tell me more.” You need to look into it. To give you more intuition, here is a matrix for this specific judge in the Google Sheet, and this is, again, a pivot table, just keeping it dumb and simple. On the rows I have, what did the human think? What did I think? Did it have an error, true or false? Then, did my judge say it had an error, true or false?
Shreya Shankar: The intuition here is exactly what Hamel said, where you need to look at each type of error. When the human said false, but the judge said true, or vice versa, so those non-green diagonals here, and if they’re too large, then go iterate on your prompt, make it more clear to the LLM judge, so that you can reduce that misalignment. You want to get to a point where most… You’re going to have some misalignment, that’s okay. We talk in our course also about how to correct for that misalignment, but at this stage, if you’re a product manager and the person who’s building the LLM judge eval has not done this, and they’re saying, “It agrees 75% of the time, we’re good,” but they don’t have this matrix and they haven’t iterated to make sure that these two types of errors have gone down to zero, then it’s a bad smell. Go and ask them to go fix that.
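The matrix and the two kinds of disagreement can be computed directly from paired labels. The sketch below is one way to do it in Python, with True meaning "error present". The per-class rates (share of real errors the judge caught, share of clean traces it left alone) are a common way to summarize the matrix, used here as an illustration rather than as the course's exact method.

```python
def confusion_counts(human_labels, judge_labels):
    """Tally the four cells of the human-vs-judge matrix."""
    counts = {"both_true": 0, "judge_only": 0, "human_only": 0, "both_false": 0}
    for human, judge in zip(human_labels, judge_labels):
        if human and judge:
            counts["both_true"] += 1
        elif judge:
            counts["judge_only"] += 1   # judge said error, human said none
        elif human:
            counts["human_only"] += 1   # human said error, judge missed it
        else:
            counts["both_false"] += 1
    return counts


def per_class_rates(counts):
    """Share of human-marked errors caught, and of clean traces passed."""
    errors = counts["both_true"] + counts["human_only"]
    clean = counts["both_false"] + counts["judge_only"]
    caught = counts["both_true"] / errors if errors else None
    passed = counts["both_false"] / clean if clean else None
    return caught, passed


# Synthetic example: 10 real errors, judge always says "no error".
human = [True] * 10 + [False] * 90
judge = [False] * 100
cells = confusion_counts(human, judge)
caught, passed = per_class_rates(cells)
```

Here `cells["human_only"]` is 10 and `caught` is 0.0, which exposes the problem that a single agreement percentage hides.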
Lenny Rachitsky: Awesome. That’s a really good tip, what to look for when someone’s doing this wrong.
Shreya Shankar: Yeah.
Lenny Rachitsky: Actually, can you take us back to the LLM as judge prompt? I just want to highlight something really interesting here. I’ve had some guests on the podcast recently who’ve been saying, “Evals are the new PRDs,” and if you look at this, this is exactly what this is. Product managers, product teams, here’s what the product should be, here’s all the requirements, here’s the how it should work. They built a thing and then they test it. Manually, often. What’s cool about this is this is exactly that same thing, and it’s running constantly. It’s telling you, “Here’s how this agent should respond,” and it’s very specific ways. “If it’s this, this, this, do that. If it’s this, this, that, do that.” It’s exactly what I’ve been hearing again and again, you could see right here. This is the purest sense of what a product requirements document should be, is this eval judge that’s telling you exactly what it should be, and it’s automatic and running constantly.
Shreya Shankar: Yeah, absolutely. It’s derived from our own data, so of course, it’s a product manager’s expectations. What I find that a lot of people miss is they just put in what their expectations are before looking at their data, but as we look at our data, we uncover more expectations that we couldn’t have dreamed up in the first place, and that ends up going into this prompt.
Lenny Rachitsky: That is interesting. Your advice is to not skip straight to evals and LLM as judge prompts before you build the product, and to still write traditional one-pager PRDs to tell your team what we’re doing, why we’re doing it, what success looks like. But then at the end, you could probably pull from that and even improve that original PRD if you’re evolving the product using this process.
Shreya Shankar: I would go even further to say you’re going to improve… It’s going to change. You’re never going to know what the failure modes are going to be upfront, and you’re always going to uncover new vibes that you think that your product should have. You don’t really know what you want until you see it with these LLMs, so you got to be flexible, have to look at your data, have to… PRDs are a great abstraction for thinking about this. It’s not the end all, be all. It’s going to change.
Lenny Rachitsky: I love that, and Hamel’s pulling up some cool research report. What’s this about?
Hamel Husain: This is one of the coolest research reports you can possibly read if you want to know about evals. It was authored by someone named Shreya Shankar.
Shreya Shankar: Oh, my God.
Hamel Husain: And her collaborators. It’s called “Who Validates the Validators?”
Lenny Rachitsky: That’s the best name for a researcher.
Shreya Shankar: Thank you, thank you.
Hamel Husain: I should let Shreya talk about this. I think one of the most important things to pay attention to in this paper is criteria drift, and what she found.
Shreya Shankar: We did this super fun study when we were doing user studies with people who were trying to write LLM judges or just validate their own LLM outputs. I think this was before evals was extremely popular, I feel like, on the internet. We did this project late 2023 was when we started it. But then the thing that really was burning in my mind as a researcher is like, “Why is this problem so hard? We’ve been having machine learning and AI for so long, it’s not new, but suddenly, this time around, everything is really difficult.” We just did this user study with a bunch of developers and we realized, “Okay, what’s new here is that you can’t figure out your rubrics upfront. People’s opinions of good and bad change as they review more outputs, they think of failure modes only after seeing 10 outputs they would never have dreamed of in the first place,” and these are experts. These are people who have built many LLM pipelines and now agents before, and you can’t ever dream up everything in the first place. I think that’s so key in today’s world of AI development.
Lenny Rachitsky: That is a really good point. That’s very much reinforcing what we were just talking about and that’s why I’ll pull this up, is just… Okay-
Shreya Shankar: The research behind it.
Lenny Rachitsky: Yeah, okay, great. You still got to do product the same way, but now you have this really powerful tool that helps you make sure what you’ve built is correct. It’s not going to replace the PRD process. Cool. How many, say, I don’t know, LLM as judge prompts do you usually end up with? I know, obviously, it depends on the complexity of the product, but what’s a number in your experience?
Shreya Shankar: For me, between four and seven.
Lenny Rachitsky: That’s it.
Shreya Shankar: It’s not that many, because a lot of the failure modes, as Hamel said earlier, can be fixed by just fixing your prompt. You just didn’t think to put it in your prompts, so now you put it in your… You shouldn’t do an eval like this for everything, just the pesky ones that you’ve described your ideal behavior in your agent prompt, but it’s still failing.
Lenny Rachitsky: Got it. Say you found a problem, you fixed it. In traditional software development, you’d write a unit test to make sure it doesn’t happen again. Is your insight here, “Don’t even bother writing an eval around that if it’s just gone”?
Shreya Shankar: I think you can if you want to, but the whole game here is about prioritizing. You have finite resources and finite time, you can’t write an eval for everything, so prioritize the ones that are the more pesky areas.
Lenny Rachitsky: Probably the ones that are most risky to your business if they say something like Mecha Hitler, Grok.
Shreya Shankar: Yikes.
Lenny Rachitsky: Cool. Okay, so that’s very relieving, because this prompt was a lot of work to really think through all these details.
Shreya Shankar: But it’s a lot of one-time cost. Right now, forever, you can run this on your application.
Hamel Husain: Okay, data analysis is super powerful and is going to drive lots of improvements very quickly to your application. We showed the most basic kind of data analysis, which is counting, which is accessible to everyone. You can get more sophisticated with the data analysis. There’s lots of different ways to sample and look at data. We made it look easy in a sense, but there’s a lot of skill involved in doing it well: building an intuition and a nose for how to sort through this data. For example, let’s say I find these conversational flow issues. Maybe if I was trying to chase down this problem further, I would think about ways to find other conversational flow issues that I didn’t code. I would maybe dig through the data in several ways, and there’s different ways to go about this. It’s very similar, if not almost identical, to traditional analytics techniques that you would do on any product.
Lenny Rachitsky: Give us just a quick sense of what comes next and then let’s talk about the debate around evals and a couple more things.
Shreya Shankar: What comes next after you’ve built your LLM judge? Well, we find that people just try to use that everywhere they can, so they’ll put the LLM judge in unit tests and they will build, “Here are some example traces where we saw that failure, because we labeled it. Now we’re going to make those part of unit tests and make sure that, every time we push a change to our code, these tests are going to pass.” They also use it for online monitoring. People are making dashboards on this, and I think that’s incredible. I think the products that are doing this, they have a very sharp sense of how well their application is performing, and people don’t talk about it, because this is their moat. People are not going to go and share all of these things, because it makes sense. If you are an email-writing assistant, and you’re doing this and you’re doing it well, you don’t want somebody else to go and build an email-writing assistant and then get you out of business.
I really want to stress this point: try to use these artifacts that you’re building wherever possible online, and repeatedly use them to drive improvements to your product. Oftentimes, Hamel and I will tell people how to do this up to this very point, and it clicks for people and then they never come back again. Either they have, I don’t know, quit their jobs, they’re not doing AI development anymore, or they know what to do from here on out. I think it’s the latter, but I think it’s very powerful.
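To make the unit-test idea concrete, here is a minimal sketch of a regression check built from previously labeled failures. Everything in it is an assumption for illustration: `run_app` stands in for your application, `toy_handoff_judge` stands in for a real LLM judge call (a real one would send the trace plus a binary pass/fail prompt to a model), and the saved inputs are invented.

```python
from typing import Callable

# Hypothetical inputs that produced a labeled "handoff" failure during error analysis.
REGRESSION_INPUTS = [
    {"id": "t1", "user_message": "Can I talk to a person about my lease?"},
    {"id": "t2", "user_message": "I need a human, please."},
]

def run_app(user_message: str) -> str:
    """Stand-in for the real application; returns the assistant's reply."""
    return "Let me connect you with a member of our team right now."

def toy_handoff_judge(user_message: str, reply: str) -> bool:
    """Stand-in for an LLM judge. Returns True when the reply passes.
    This toy rule only checks that the reply mentions connecting the user."""
    return "connect you" in reply.lower()

def failing_ids(
    inputs: list[dict],
    app: Callable[[str], str],
    judge: Callable[[str, str], bool],
) -> list[str]:
    """Re-run the app on each saved input and collect the ids the judge fails."""
    failures = []
    for item in inputs:
        reply = app(item["user_message"])
        if not judge(item["user_message"], reply):
            failures.append(item["id"])
    return failures

def test_no_handoff_regressions() -> None:
    # Runs on every code change, for example under pytest.
    assert failing_ids(REGRESSION_INPUTS, run_app, toy_handoff_judge) == []
```

The same `failing_ids` helper could be pointed at a sample of production traces for online monitoring, which keeps the judge as the single shared artifact.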
Lenny Rachitsky: Just watching you do this really opened my eyes to what this is and how systematic the process is. I always imagine you just sit on a computer, “Okay, what are the things I need to make sure work correctly?” What you’re showing us here is it’s a very simple step-by-step based on real things that are happening in your product, how to catch them, identify them, prioritize them, and then catch them if they happen again and fix them.
Shreya Shankar: Yeah, it’s not magic. Anyone can do this, you’re going to have to practice the skill, like any new skill, you have to practice, but you can do it. I think what’s very empowering now is that product managers are doing this and can do this, and can really build very, very profitable products with this skill set.
Lenny Rachitsky: Okay, great segue to a debate that we got pulled into that was happening on X the other day. I did not realize how much controversy and drama there is around evals. There’s a lot of people with very strong opinions. How about Shreya? Give us just a sense of the two sides of the debate around the importance and value of evals, and then give us your perspective.
Shreya Shankar: Yeah. All right, I’ll be a little bit placating and say I think everyone is on the same side. I think the misconception is that people have very rigid definitions of what evals is. For example, they might think that evals is just unit tests, or they might think that evals is just the data analysis part and no online monitoring or no monitoring of product-specific metrics, like the actual number of chats engaged in or whatnot. I think everyone has a different mindset of evals going in, and the other thing I will say is that people have been burned by evals in the past. I think people have done evals badly. One concrete example of this is they’ve tried to do an LLM judge, but it has not aligned with their expectations. They only uncovered this later on and then they didn’t trust it anymore, and then they’re like, “I’m anti evals.”
I 100% empathize with that, because you should be anti Likert scale LLM judge. I absolutely agree with you, we are anti that as well. A lot of the misconception stems from two things, like people having a narrow definition of evals and then people not doing it well and then getting burned and then wanting to avoid other people making that mistake. Then, unfortunately, X or Twitter is a medium where people are misinterpreting what everybody is saying all the time, and you just get all these strong opinions of, “Don’t do evals, it’s bad. We tried it, it doesn’t work. We’re Claude Code,” or whatever other famous product, “And we don’t do evals.” There’s just so much nuance behind all of it, because a lot of these applications are standing on the shoulders of evals. Coding agents is a great example of that, Claude Code. They’re standing on the shoulders of Claude base model… Not base, but the fine-tuned Claude models have been evaluated on many coding benchmarks. Can’t argue against that.
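One way to catch a misaligned judge early is to compare its binary verdicts against human labels on the same traces. The sketch below is an illustration under stated assumptions, not a method described in this conversation: the label lists are invented, and the choice of true positive rate and true negative rate as the agreement measures is ours.

```python
def agreement_rates(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare binary pass/fail verdicts (True = pass) from a judge against human labels.

    Returns the true positive rate (judge passes what humans pass) and the
    true negative rate (judge fails what humans fail).
    """
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must be the same length")

    human_pass = [j for h, j in zip(human, judge) if h]
    human_fail = [j for h, j in zip(human, judge) if not h]

    # Guard against empty groups so the function never divides by zero.
    tpr = sum(human_pass) / len(human_pass) if human_pass else 0.0
    tnr = sum(not j for j in human_fail) / len(human_fail) if human_fail else 0.0
    return {"true_positive_rate": tpr, "true_negative_rate": tnr}

# Invented labels for five traces.
human_labels = [True, True, True, False, False]
judge_labels = [True, True, False, False, True]
rates = agreement_rates(human_labels, judge_labels)
print(rates)
```

Reporting the two rates separately matters when failures are rare, since a judge that passes everything can still show high raw agreement.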
Lenny Rachitsky: Just to make clear exactly what you’re talking about there, one of the heads, I think maybe the head engineer of Claude Code, went on a podcast and he’s like, “We don’t do evals, we just vibe. We just look at vibes,” and vibes meaning they just use it and feel if it’s right or wrong.
Shreya Shankar: I think that works. There’s two things to that, right? One is they’re standing on the shoulders of the evals that their colleagues are doing for coding.
Lenny Rachitsky: Of the Claude foundational model.
Shreya Shankar: Absolutely, right? We know that they report those numbers, because we see the benchmarks, we know who’s doing well on those. The other thing is they are actually probably very systematic about the error analysis to some extent. I bet you that they’re monitoring who is using Claude, how many people are using Claude, how many chats are being created, how long these chats are. They’re also probably monitoring in their internal team, they’re dogfooding. Anytime something is off, they maybe have a queue or they send it to the person developing Claude Code, and this person is implicitly doing some form of the error analysis that Hamel talked about. All of this is evals, right? There’s no world in which they’re just being like, “I made Claude Code, I’m never looking at anything,” and unfortunately, when you don’t think about that or talk about that, I think that the community…
Most of the community is beginners or people who don’t know about evals and want to learn about it, and it sends the wrong message there. Now, I don’t know what Claude Code is doing, obviously, but I would be willing to bet money that they’re doing something in the form of evals.
Hamel Husain: We’ll also say that coding agents are fundamentally very different than other AI products, because the developer is the domain expert, so you can short circuit a lot of things, and also, the developer is using it all day long, so there’s a type of dogfooding and type of domain expertise that is… You can collapse the activities, you don’t need as much data, you don’t need as much feedback or exploration, because you know, so your eval process should look different.
Lenny Rachitsky: Because you’re seeing the code, you see the code it’s generating. You can tell, “This is great, this is terrible.”
Hamel Husain: Yeah, yeah. I think a lot of people have generalized from coding agents, because coding agents are the first AI product released into the wild, and I think it’s a mistake to try to generalize that at large.
Shreya Shankar: The other thing is, yeah, engineers have a dogfooding personality. There are plenty of applications where people are trying to build AI in certain domains and they don’t have dogfooding. Doctors, for example, are not out there trying to get all the most incorrect advice from AI and be tolerant and receptive to that. It’s very important, I think, to keep these nuanced things in mind.
Lenny Rachitsky: What I’m hearing from you, Shreya, interestingly, is that if humans on the team are doing very close data analysis, error analysis, dogfooding like crazy, and essentially, they’re the human evals and you’re describing that as that’s within the umbrella of evals. You could do it that way if you have time and motivation to do that, or you could set these things up to be automatic.
Shreya Shankar: Absolutely, it’s also about the skills. People who work at Anthropic are very, very highly skilled. They’ve been trained in data analysis or software engineering or AI, and whatnot. You can get there, anyone can get there, of course, by learning the concepts, but most people don’t have that skill right now.
Hamel Husain: Dogfooding is a dangerous one, only because a lot of people will say they’re dogfooding. They’re like, “Yeah, we dogfooded,” but are they, really? A lot of people aren’t really dogfooding it at that visceral level that you would need to close that feedback loop. That’s the only caveat I would add.
Lenny Rachitsky: There’s also this, feels like, straw man argument of evals versus A-B tests. Talk about your thoughts there, because that feels like a big part of this debate. People are having like, “Do you need evals if you have A-B tests that are testing production level metrics?”
Shreya Shankar: A-B tests are, again, another form of evals, I imagine, right? When you’re doing an A-B test, you have two different experimental conditions and then you have a metric that quantifies the success of something, and you’re comparing the metric. Again, an eval in our mind is systematic measurement of quality, some metric. You can’t really do an A-B test without the eval to compare, so maybe we just have a different weird take on it.
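As a rough illustration of "two conditions, one metric," the sketch below compares a binary eval pass rate between two variants using a standard pooled two-proportion z statistic. The counts are invented, and the choice of this particular test is an assumption for the example, not something prescribed here.

```python
import math

def compare_pass_rates(pass_a: int, n_a: int, pass_b: int, n_b: int) -> dict[str, float]:
    """Compare eval pass rates for variants A and B.

    Returns each pass rate, the difference (B minus A), and a pooled
    two-proportion z statistic for that difference.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both variants need at least one evaluated trace")

    rate_a = pass_a / n_a
    rate_b = pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    # If every trace passed or every trace failed, the standard error is zero.
    z = (rate_b - rate_a) / std_err if std_err > 0 else 0.0
    return {"rate_a": rate_a, "rate_b": rate_b, "diff": rate_b - rate_a, "z": z}

# Invented counts: 100 traces per variant, scored pass/fail by the same evaluator.
result = compare_pass_rates(pass_a=40, n_a=100, pass_b=55, n_b=100)
print(result)
```

The key point carried over from the discussion is that the pass/fail column has to come from an eval you trust before the comparison means anything.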
Lenny Rachitsky: Yeah, okay. What I’m hearing is you consider A-B tests as part of the suite of evals that you do. I think when people think A-B tests, it’s like we’re changing something in the product, we’re going to see if this improves some metric we care about. Is that enough? Why do we need to test every little feature? If it’s impacting a metric we care about as a business, we have a bunch of A-B tests that are just constantly running.
Shreya Shankar: This is now a great point. I think a lot of people prematurely do A-B tests, because they’ve never done any error analysis in the first place. They just have hypothetically come up with their product requirements and they believe that, “We should test these things,” but it turns out, when you get into the data, as Hamel showed, that the errors that you’re seeing are not what you thought what the errors might be. They were these weird handoff issues or, I don’t know, the text message thing was strange. I would say that, if you’re going to do A-B tests and they’re powered by actual error analysis as we’ve shown today, then that’s great, go do it. But if you’re just going to do them, which we find that people try to do, just want to do them based on what you hypothetically think is what is important, then I would encourage people to go and rethink that and ground your hypotheses.
Lenny Rachitsky: Do you have thoughts on what Statsig is going to do at OpenAI? Is there anything there that’s interesting? That was a big deal, a huge acquisition. An A-B test company. People are like, “A-B tests, the future.” Thoughts?
Hamel Husain: Just to add to the previous question a little bit, why is there this debate, A-B testing versus evals? I think, fundamentally, evals is… People are trying to wrap their head around how to improve their applications and fundamentally need to do… Data science is useful in products. Looking at data, doing data analytics. There’s a whole suite of tools, and you don’t need to invent anything new. Sure, you don’t necessarily need the whole breadth of data science, and it looks slightly different, just slightly, with LLMs. Your tactics might be different, so really what it is is using analytic tools to understand your product. Now, people say the word “Evals,” trying to carve out this new thing, and saying evals and then A-B testing, but if you zoom out, it’s the same data science as before, and I think that’s what’s causing the confusion. “Hey, we need data science thinking.” It’s helpful to have that thinking in AI products, like it is in any product. That’s my take on that.
Lenny Rachitsky: That’s a really good take, I think just the word “Evals” triggers people now.
Shreya Shankar: Yeah.
Lenny Rachitsky: If you just call it, “We’re just doing error analysis, doing data science to understand where our product breaks and just setting up tests to make sure we know-”
Shreya Shankar: That’s boring, sounds boring. No, no, no. We need a mysterious term, like “Evals,” to really get the momentum going. Your question about Statsig, I think it’s very exciting. To be honest, I don’t know much about it, because I just imagine that they’re this company that… There’s a tool that many people use, and maybe it just so happened that OpenAI acquired them. I’m sure they’ve been using them in the past, I’m sure OpenAI’s competitors are using Statsig as well, so maybe there is something strategic in that acquisition. I have no idea, I don’t know anything there, but I think those are really the bigger questions for me than, “Is this fundamentally changing A-B testing or making evals more of a priority?” I think they’ve always been a priority, I think OpenAI has always been doing some form of them, and OpenAI has gone so far, historically speaking, as to go and look at all the Twitter sentiment and try to do some retrospective on that, and then tie that back to their products.
Certainly, they’re doing some amount of evals before they ship their new foundation models, but they’re going so much beyond and being like, “Okay, let’s find all the tweets that are complaining about it, all the Reddit threads that are complaining about it, and go try to figure out what’s going on.” It goes to show that evals are very, very important. No one has really figured it out yet. People are using all the available sources of signal that they can to improve their products.
Hamel Husain: What I’ll say is I’m really hopeful that it might shift or create a focus within OpenAI, hopefully. Up until now, a lot of the big labs understandably focused on general benchmarks like MMLU score, HumanEval, things like that, which are very important for foundation models. Those are not very related to product-specific evals, like the ones we talked about today, like handoff and stuff like that. They tend not to correlate.
Shreya Shankar: Yeah, they don’t correlate with math problem-solving, sorry to say.
Hamel Husain: Exactly. If you look at the eval products, let’s say the ones up until recently that some of the big labs have, they don’t have error analysis. They have a suite of generic tools, cosine similarity, hallucination score, whatever, and that doesn’t work. It’s a good first stab at it. It’s okay. At least you’re doing something, maybe it’s getting people to look at data. But eventually, what we hope to see is, okay, a bit more data science thinking in this eval process. That’s hopefully the tools we’ll get to.
Shreya Shankar: Yeah, Hamel and I should not be the only two people on the planet that are promoting a structured way of thinking about application-specific evals. It’s mind-boggling to me. Why are we the only two people doing this in the whole world? What’s wrong? I hope that we’re not the only people and that more people catch on.
Lenny Rachitsky: The fact that your course on Maven is the number one highest grossing course on Maven, clearly there’s demand and interest, and there’s more people I think on your side. Interestingly, just as an example you’ve been sharing on Twitter that I think is informative, everyone’s been saying how Claude Code doesn’t care about evals. They’re all about vibes, and everyone’s like, “And they’re the best coding agent out there, so clearly, this is right.” More recently, there’s all this talk about Codex, OpenAI Codex being better and everyone’s switching and they’re so pro evals.
Shreya Shankar: I know.
Lenny Rachitsky: Yeah.
Shreya Shankar: It gets me every time. The Internet’s so inconsistent. My favorite thing was yesterday, I believe, a couple of lab mates and I were out getting dessert or something, and somebody said like, “Oh, do you like Codex or Claude better or whatever?” The other person said, “Oh, I like Claude.” Then, someone else said, “But the new version of Codex is better.” Then, the first person said, “Oh, but the last I checked was two days ago, so maybe my thoughts, maybe I’m not up-to-date.” I was like, “Oh, my God.”
Lenny Rachitsky: So true, so true. This is the world we live in. Oh, my God. Okay. I want to ask about just top misconceptions people have with evals and top tips and tricks for being successful. Maybe just share one or two of each. Let me just start with misconceptions, and maybe I’ll go to Hamel first. Just what are a couple of the most common misconceptions people have with evals still?
Hamel Husain: The top one is, “Hey, I can just buy a tool, plug it in, and it’ll do the eval for you. Why do I have to worry about this? We live in the age of AI. Can’t the AI just eval it?” That’s the most common misconception, and people want that so much that people do sell it, but it doesn’t work. That’s the first one.
Lenny Rachitsky: Shoot, many humans are still great. I think that’s great news.
Hamel Husain: The second one that I see a lot is just not looking at the data. In my consulting, people come to me with problems all the time, and the first thing I’ll say is, “Let’s go look at your traces.” You can see their eyes pop open and be like, “What do you mean?” I’m like, “Yeah, let’s look at it right now.” They’re surprised that I am going to go look at individual traces, and we always, 100% of the time, learn a lot and figure out what the problem is. I think people just don’t know how powerful looking at the data is, like we showed on this podcast.
Shreya Shankar: I would agree with that.
Lenny Rachitsky: Those are the top two? Okay.
Shreya Shankar: Yes.
Lenny Rachitsky: Is there anything else, or are those the ones? Solve those problems.
Shreya Shankar: Oh, those are definitely… Then, I guess the third one I would add is, there’s no one correct way to do evals. There are many incorrect ways of doing evals, but there are also many correct ways of doing it. You got to think about where you are at with your product, how much resources you have, and figure out the plan that works best for you. It’ll always involve some form of error analysis as we showed today, but how you operationalize those metrics is going to change based on where you’re at.
Lenny Rachitsky: Amazing. Okay. What are a couple of just tips and tricks you want to leave people with as they start on their eval journey or just try to get better at something they’re already doing?
Shreya Shankar: Tip number one is just don’t be alarmed or don’t be scared of looking at your data. The process, we try to make it as structured as possible. There are inevitably questions that are going to come up. That’s totally fine. You might feel like you’re not doing it perfectly. That’s also fine. The goal is not to do evals perfectly, it’s to actionably improve your product. We guarantee you, no matter what you do, if you’re doing parts of this process, you’re going to find ways of actionable improvement, and then you’re going to iterate on your own process from there.
The other tip that I would say is, we are very pro-AI. Use LLMs to help you organize any thoughts that you have throughout this entire process. This could be everything ranging from initial product requirements. Figure out how to organize them for yourself. Figure out how to improve on that product requirements doc based on the open codes that you’ve created. Don’t be afraid to use AI in ways that present information better for you.
Lenny Rachitsky: Sweet, so don’t be scared. Use LLMs as much as you can throughout the process.
Shreya Shankar: But not to replace yourself.
Lenny Rachitsky: Right. Okay, great. There’s still jobs. It’s great. Hamel.
Hamel Husain: Yeah. Let me actually share my screen, because I want to show something. To piggyback off what Shreya said, if you heard any phrase in this podcast, you’ve probably heard “look at your data” more than anything else. It’s so important that we teach that you should create your own tools to make it as easy as possible. I showed you some tools when we were going through the live example of how to annotate data. Most of the people I work with, they realize how important this is and they vibe code their own tools, or we shouldn’t say vibe code. They make their own tools, and it’s cheaper than ever before because you have AI that can help you.
AI is really good at creating simple web applications that can show you data, that can write to a database. It’s very simple. For the Nurture Boss use case, we wanted to remove all the friction of looking at data. What you see here is just some screenshots of what the application that they created looks like. It’s just, “Okay, they have the different channels, voice, email, text. They have the different threads, they hid the system prompt by default.” Little quality of life improvements. Then, they actually have this axial coding part here where you can see in red the count of different errors. They automated that part in a nice way and they created this within a few hours. It’s really hard to have a one size fits all thing for looking at your data. You don’t have to go here immediately, but something to think about is make it as easy as possible because, again, it’s the most powerful activity that you can engage in. It’s the highest ROI activity you can engage in. With AI, yeah, just remove all the friction.
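The count-per-error view described above comes down to tallying category labels across annotated traces. Here is a minimal sketch; the annotation records and category names are invented for illustration and are not taken from the Nurture Boss data.

```python
from collections import Counter

# Hypothetical annotations: each reviewed trace carries zero or more failure categories.
annotations = [
    {"trace_id": "a1", "codes": ["handoff issue"]},
    {"trace_id": "a2", "codes": ["conversational flow", "handoff issue"]},
    {"trace_id": "a3", "codes": []},  # no problems found in this trace
    {"trace_id": "a4", "codes": ["handoff issue"]},
    {"trace_id": "a5", "codes": ["conversational flow"]},
]

def count_failure_modes(records: list[dict]) -> list[tuple[str, int]]:
    """Tally how often each failure category appears, most frequent first."""
    counts = Counter(code for record in records for code in record["codes"])
    return counts.most_common()

for category, count in count_failure_modes(annotations):
    print(f"{category}: {count}")
```

Sorting by frequency is what turns a pile of notes into a priority list: the category at the top is usually the first candidate for a fix or an automated evaluator.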
Lenny Rachitsky: That’s amazing. Again, I think that ROI piece is so important. We haven’t even touched on this enough. The goal here is to make your product better, which will make your business more successful. This isn’t just a little exercise to catch bugs and things like that. This is the way to make AI products better because the experience is how users interact with your AI.
Hamel Husain: Absolutely. If anything, we teach our students, “Hey, when you’re doing these evals, if you see something that’s wrong, just go fix it.” The whole point is not to have evals, a beautiful eval suite, where you can point at it and say, “Oh, look at my evals.” No, just fix your application, make it better. If it’s obvious, do it. Totally agree with you.
Lenny Rachitsky: Amazing. A question I didn’t ask, but this is I think something people are thinking about. How long do you spend on this? How long does it usually take to do the first time?
Shreya Shankar: I can answer for myself for applications that I work with. Usually, I’ll spend three to four days really working with whoever to do initial rounds of error analysis. A lot of labeling, until we feel like we’re in a good place to create the spreadsheet that Hamel had, everyone’s on board and convinced, and we even have a few LLM judge evaluators. But this is a one-time cost. Once I’ve figured out how to integrate that in unit tests, or I have a script that automatically runs it on samples, I’ll create a cron job to just do this every week. I would say it’s like, I don’t know, I find myself probably spending more time looking at data because I’m just data hungry like that. I’m so curious.
I’m like, I’ve gained so much from this process and it’s put me above and beyond in any of my collaborations with folks, so I want to keep doing it, but I don’t have to. I would say maybe 30 minutes a week after that.
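A script of the kind mentioned above, one that runs an evaluator over a sample and could be scheduled weekly, might look roughly like this. It is a sketch under assumptions: the trace list, the `toy_judge` rule, and the sample size are all placeholders, and a real version would load recent traces from your logging store and call your actual judge.

```python
import random
from typing import Callable, Optional

def sampled_failure_rate(
    traces: list[str],
    judge: Callable[[str], bool],
    sample_size: int = 50,
    seed: Optional[int] = None,
) -> float:
    """Judge a random sample of traces and return the fraction that fail.

    `judge` returns True when a trace passes. A seed makes the sample reproducible.
    """
    rng = random.Random(seed)
    sample = rng.sample(traces, min(sample_size, len(traces)))
    if not sample:
        return 0.0
    failures = sum(1 for trace in sample if not judge(trace))
    return failures / len(sample)

def toy_judge(trace: str) -> bool:
    """Placeholder for an LLM judge call; flags traces containing 'ERROR'."""
    return "ERROR" not in trace

if __name__ == "__main__":
    # Invented traces; a scheduled job would fetch the past week's traces instead.
    recent_traces = ["ok reply"] * 8 + ["ERROR: wrong handoff"] * 2
    rate = sampled_failure_rate(recent_traces, toy_judge, sample_size=10, seed=0)
    print(f"failure rate this week: {rate:.0%}")
```

Tracking this one number week over week is enough to notice when a failure mode creeps back in, which is where the roughly 30 minutes of follow-up review would go.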
Lenny Rachitsky: It’s a week essentially, a week essentially upfront, and then 30 minutes to keep improving on adding to your suite?
Shreya Shankar: Yeah, it’s really not that much time. I think people just get overwhelmed by how much time they spend up front and then thinking that they have to keep doing this all the time.
Lenny Rachitsky: Amazing. Is there anything else that you wanted to share or leave listeners with? Anything else you wanted to double down as a point before we get to a very exciting lightning round?
Hamel Husain: I would say this process is a lot of fun, actually. It’s like, okay, you’re looking at data. Oh, it sounds like you’re annotating things. Okay. Actually, I was just looking at a client’s data yesterday, the same exact process. It’s an application that sends emails, recruiting emails to try to get candidates to apply for a job. We decided to start looking at traces. We jumped right into it. “Hey, let’s look at your traces.” We looked at a trace, the first thing I saw was this email that is worded, “Given your background, blah, blah, blah, blah, blah.” I asked the person right away, and this is where putting your product hat on and just being critical comes in, and this is where the fun part is.
I said, “You know what? I hate this email. Do you like the email, given your background?” When I receive a message given your background, comma, I just delete that. I’m like, “What is this, given your background with machine learning and blah blah?” I’m like, “This is a generic thing.” I asked the person like, “Hey, can we do better than this? This sounds like generic recruiting.” They’re like, “Oh, yeah, maybe.” Because they were proud of it, they’re like, “The AI is doing the right thing, it’s sending this email with the right information, with the right link, with the right name, everything.” That’s where the fun part is, is put your product hat on and get into, is this really good?
Lenny Rachitsky: Something I want to make sure we cover before we get to a very exciting lightning round is, this is just scratching the surface of all the things you need to know to do this well. I think this is the best primer I’ve ever seen on how to do this well.
Shreya Shankar: Nice.
Lenny Rachitsky: But I think we did it. But you guys teach a course that goes much, much deeper for people that really want to get good at this and take this really seriously. Share what else you teach in the course that we didn’t cover, and what else you get as a student being part of the course you teach at Maven.
Shreya Shankar: Yeah, I can talk about the syllabus a little bit, and then Hamel can talk about all the perks. We go through a lifecycle of error analysis, then automated evaluators, then how to improve your application, how do you create that flywheel for yourself? We also have a few special topics that we find pretty much no one has ever heard of or taught before, which is exciting. One is, how do you build your own interfaces for error analysis? We go through actual interfaces that we’ve built and we also live code them on the spot for new data. We show how we use Claude Code, Cursor, whatever we’re feeling in the moment that day to build these interfaces.
We also talk about broadly cost-optimization as well. A couple of people that I’ve worked with, they get to a point where their evals are very good, their product is very good, but it’s all very expensive because they’re using state-of-the-art models. How can we replace certain uses of the most expensive GPT-5, with 5-nano, 4-mini whatnot and save a lot of money, but still maintain the same quality? We also give some tips for that. Hamel, you’re on. We also have many perks.
Lenny Rachitsky: Yeah. Talk about the perks.
Hamel Husain: Okay, the perks. My favorite perk is there’s a 160-page book that’s meticulously written, that we’ve created, that walks through the entire process in detail of how to do evals and supplements the course. You don’t have to sit there and take all these notes. We’ve done all the hard work for you and we have documented it in detail and organized things. That is really useful. Another really interesting thing, and something that I got the idea from you, Lenny, is, okay, this is an AI course. Education shouldn’t be this thing where you are only watching lectures and doing homework assignments. Students should have access to an AI that also helps them. What we have done is, just like there’s the LennyBot that you have…
Lenny Rachitsky: Dot com.
Hamel Husain: Yeah, lennybot.com, we have made the same thing with the same software that you’re using, and we have put everything we’ve ever said about evals into that. Every single lesson, every office hours, every Discord chat, any blogs, papers, anything that we’ve ever said publicly and within our course, we’ve put it in there. We’ve tested it with a bunch of students and they’ve said it’s helpful. We’re giving all students 10 months free unlimited access to that alongside the course.
Lenny Rachitsky: Amazing. Then, you’ll charge for that later down the road?
Hamel Husain: I have no idea. I just take one month at a time. I don’t know where we’re going with that.
Lenny Rachitsky: Eight months and then we’ll have to figure it out. I was thinking this whole interview should have just been our bots talking to each other.
Shreya Shankar: That’s amazing. I would watch that, only for 10 minutes then I don’t know what they’re talking about.
Lenny Rachitsky: Yeah, maybe 30 seconds. Do you guys train it on the voice mode, by the way? That’s my favorite feature of Delphi’s product. If not, you should do that.
Hamel Husain: Oh, I think, I can’t remember, I should look at it.
Lenny Rachitsky: You definitely should. Now that we have this podcast episode, you could use this content to train it. It’s ElevenLabs powered. It’s so good. Okay, so how do they get to… I guess that’s okay. They get to that once they enroll in your course.
Shreya Shankar: Yeah, sign up for the course and then you’ll get a bunch of emails. Everything will be clear, hopefully.
Lenny Rachitsky: Amazing. Okay.
Shreya Shankar: We also have a Discord of all the students who have ever taken the class. That Discord is so active. I can’t go on vacation without getting notified on the plane.
Lenny Rachitsky: Bittersweet, bittersweet. Incredible. Okay. With that, we’ve reached our very exciting lightning round. I’ve got five questions for you. Are you ready?
Shreya Shankar: Yes. Let’s go.
Lenny Rachitsky: Let’s do it. Okay. I’m going to bounce between you two. Share something if you want. You can pass if you want. First question, Shreya, what are two or three books that you find yourself recommending most to other people?
Shreya Shankar: I like to recommend a fiction book because life is about more than evals. Recently, I read Pachinko by Min Jin Lee. A really great book. Then, I also am currently reading Apple in China, which the name of the author is slipping my mind, but this is more of an exposition, written by a journalist on how Apple did a lot of manufacturing processes in Asia over the last couple, several decades. Very eye-opening.
Lenny Rachitsky: Amazing. Hamel.
Hamel Husain: Yeah, I have them right here. I’m a nerd. Okay, so I’m not as cool as Shreya is. I actually have textbooks, which are my favorite. This one is a very classic one, Machine Learning by Mitchell. Now, it’s theoretical, but the thing I like about it is it really drives home the fact that Occam’s razor is prevalent not only in science, but also in machine learning and AI. A lot of times the simplest, and also engineering, so a lot of times the simpler approach generalizes better. That’s the thing I internalize deeply from that book. I also really like this one. Another textbook. I told you I’m a nerd. This is also a very old one, and this is Norvig algorithms. I really like it because it’s just human ingenuity and it’s lots of clever useful things in computing.
Shreya Shankar: They’re down the street, him and Berkeley.
Lenny Rachitsky: The people that did that research?
Shreya Shankar: Yeah, textbook authors.
Lenny Rachitsky: Super cool. Oh, man, nerds, I love it. Okay, next question. Favorite recent movie or TV show? I’ll jump to Hamel first.
Hamel Husain: Okay, so I’m a dad of two parents. I have two parents. Sorry, two kids. Yeah, I’m a dad of two kids, and I don’t really get the time to watch any TV or movies, so I watch whatever my kids are watching. I’ve watched Frozen three times in the last week.
Lenny Rachitsky: Only three? Oh, okay. In the last week. Okay.
Hamel Husain: That’s my life.
Lenny Rachitsky: Great, Hamel. Frozen. I love it. Okay, Shreya.
Shreya Shankar: Yeah, I don’t have kids, so I can give all these amazing answers. Actually, so my husband and I have been watching The Wire recently. We never actually saw it growing up, so we started watching it and it’s great.
Lenny Rachitsky: I feel like everyone goes through that. Eventually in their life they decide, I will watch The Wire.
Shreya Shankar: I know, so we are in that right now.
Lenny Rachitsky: It’s like a year of your life. It’s great. It’s such a great show. Oh, man. But it’s so many episodes and every one’s an hour long.
Shreya Shankar: I know. I know.
Lenny Rachitsky: It’s such a commitment.
Shreya Shankar: We get through two or three a week, so we’re very slow.
Lenny Rachitsky: Worth it. Okay, next question. Do you have a favorite product you’ve recently discovered that you really love? We’ll start with Shreya.
Shreya Shankar: Yeah. I really like using Cursor, honestly. Now, Claude Code. I’ll say why. I’m a researcher more so than anything else. I write papers, I write code, I build systems, everything, and I find that a tool… I’m so bullish on AI assisted coding because I have to wear a lot of hats all the time. Now, I can be more ambitious with the things that I build and write papers about, so I’m super excited about those. Cursor was my entry point into this, but I’m starting to find myself always trying to keep up with all these AI assisted coding tools.
Lenny Rachitsky: Hamel?
Hamel Husain: Yeah, I really like Claude Code and I like it because I feel like the UX is outstanding. There’s a lot of love that went into that. It’s just really impressive as a terminal application that is that nice.
Lenny Rachitsky: Ironic that you two both love Claude Code when it’s just built on vibes.
Shreya Shankar: I think it’s false. It’s not just built on vibes.
Lenny Rachitsky: There we go. Okay, two more questions. Hamel, do you have a favorite life motto that you find yourself using and coming back to in work or in life?
Hamel Husain: Keep learning and think like a beginner.
Lenny Rachitsky: Beautiful. Shreya?
Shreya Shankar: I like that. For me, it’s to always try to think about the other side’s argument. I find myself sometimes just encountering arguments on the internet, like this race to eval debates and really think, “Okay, put myself in their shoes. There’s probably a generous take, generous interpretation.” I think we’re all much stronger together than if we start picking fights. My vision for evals is not that Hamel and I become billionaires. It is that everyone can build AI products, and we’re all on the same page.
Lenny Rachitsky: Slash everyone becomes billionaires.
Shreya Shankar: Yes.
Lenny Rachitsky: Amazing. Final question. When I have two guests on, I always like to ask this question and I’ll start with Hamel. What’s something about Shreya that you like most? What do you like most about Shreya? I’m going to ask her the same question in reverse.
Hamel Husain: Yeah. Shreya is one of the wisest people that I know, especially for being so young relative to me. I feel like she’s much wiser than I am, honestly, seriously. She’s very grounded and has a very even perspective on things. I’m just really impressed by that all the time.
Lenny Rachitsky: Shreya?
Shreya Shankar: Yeah. My favorite thing about Hamel is his energy. I don’t know anybody who consistently maintains momentum and energy like Hamel does. I often think that I would start caring much less about evals, if not for Hamel. Everyone needs a Hamel in their life, for sure.
Lenny Rachitsky: Well, we all have a Hamel in our life now. This was incredible. This was everything I’d hoped it’d be. I feel like this is the most interesting in-depth consumable primer on evals that I’ve ever seen. I’m really thankful you two made time for this. Two final questions. Where can folks find you? Where can they find the course and how can listeners be useful to you? I’ll start with Shreya.
Shreya Shankar: Yeah, you can reach me via email. It’s on my website. If you Google my name, that is the easiest way to get to my website. You can find the course if you Google AI Evals for engineers and product managers, or just AI Evals course, you’ll find it. We’ll send some links hopefully after this, so it’s easy. How to be helpful? Two things always for me. One is ask me questions when you have them. I’ll try to respond as soon as I can. The other one is tell us your successes. One of the things that keeps us going is somebody tells us what they implemented or what they did, a real case study. Hamel and I get so excited from these and it really keeps us going, so please share.
Hamel Husain: Yeah, it’s pretty easy to find me. My website is Hamel.dev. I’ll give you the link. You can find me on social media, LinkedIn, Twitter. The thing that’s most helpful is to echo what Shreya said, we would be delighted if we are not the only people teaching evals. We would love other people to teach evals. Any kind of blog posts, writing, especially anything you want to share as you go through this and learn it, we would be delighted to help re-share that or amplify that.
Lenny Rachitsky: Amazing. Very generous. Thank you two, so much for being here. I really appreciate it, and you guys have a lot going on, so thank you.
Shreya Shankar: Thanks, Lenny, for having us and for all the compliments.
Lenny Rachitsky: My pleasure. Bye everyone. Thank you so much for listening. If you found this valuable, you can subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Also, please consider giving us a rating or leaving a review as that really helps other listeners find the podcast. You can find all past episodes or learn more about the show at Lennyspodcast.com. See you in the next episode.
Glossary
| English | 中文 |
|---|---|
| 11Labs | 11Labs(语音 AI 公司,保留原文) |
| agreement | 一致性(指 judge 与人类判断的吻合程度) |
| Andrew Ng | Andrew Ng(知名机器学习学者,保留原名,中文领域亦通行此写法) |
| axial codes | 轴心编码 |
| bad smell | 坏味道(源自软件工程中的 code smell 概念,指值得警惕的信号) |
| benevolent dictator | 仁慈的独裁者 |
| bucket | 桶(分类桶) |
| CI | CI(持续集成,Continuous Integration,保留原文) |
| cohort | 队群 |
| cosine similarity | 余弦相似度 |
| criteria drift | 标准漂移 |
| Cron Job | Cron Job(定时任务,保留原文) |
| Delphi | Delphi(AI 数字分身平台,保留原文) |
| Discord | Discord(社区沟通平台,保留原文) |
| dogfooding | dogfooding(指团队使用自己开发的产品进行测试的做法,保留原文) |
| error analysis | 错误分析 |
| evals | 评估 |
| failure mode | 失败模式 |
| flywheel | 飞轮(指自我强化的正反馈循环) |
| foundation models | 基础模型 |
| Gemini | Gemini |
| generalize | 泛化(机器学习术语,指模型在新数据上的表现能力) |
| hallucination score | 幻觉评分 |
| HubSpot | HubSpot(知名营销与销售软件公司,保留原文) |
| jank | 糙(指粗糙、不完善的表现) |
| Julius AI | Julius AI |
| Jupyter notebook | Jupyter notebook |
| Likert scale | Likert 量表(一种常用评分量表) |
| LLM as a judge | LLM as a judge(用 LLM 充当评判者来评估输出质量的评估方式,保留原文) |
| MMLU | MMLU(大规模多任务语言理解基准测试,保留原文) |
| moat | 护城河(商业竞争壁垒) |
| observability | 可观测性 |
| Occam’s razor | 奥卡姆剃刀(科学哲学原则,指在解释同一现象时,假设越少越好) |
| open coding | 开放编码 |
| pivot table | 数据透视表 |
| prompt | prompt |
| RAG retrieval | RAG 检索 |
| rubrics | 评分标准 |
| Statsig | Statsig(A/B 测试与实验平台公司,保留原文) |
| stochastic | 随机的 |
| surface area | 表面积 |
| taxonomy | 分类体系 |
| theoretical saturation | 理论饱和 |
| trace | trace(追踪) |
| vibe checks | 感觉检查 |
为什么 AI 评估是产品建设者最炙手可热的新技能 | Hamel Husain 和 Shreya Shankar(排名第一的评估课程创作者)
关于评估(evals)
Lenny Rachitsky: 要构建出色的 AI 产品,你必须非常擅长构建评估。这是你能从事的投资回报率最高的活动。
Hamel Husain: 这个过程非常有趣,每个做过的人都会立刻上瘾。在构建 AI 应用时,你会学到很多东西。
Lenny Rachitsky: 厉害的地方在于,你不需要反复做很多次。对大多数产品来说,你做一次这个过程,然后就可以在此基础上继续构建。
Shreya Shankar: 目标不是把评估做到完美,而是切实改进你的产品。
Lenny Rachitsky: 我没想到围绕评估有这么多争议和风波,很多人持有非常强烈的观点。
Shreya Shankar: 人们过去在评估上吃过亏:做过糟糕的评估,然后就不信任它了,接着就说“我反对评估”。
Lenny Rachitsky: 人们对评估最常见的几个误解是什么?
Hamel Husain: 排第一的是,“我们生活在 AI 时代,难道不能让 AI 自己来评估吗?”但这行不通。
Lenny Rachitsky: 你在文章中用过一个非常吸引人的概念——“仁慈的独裁者”(benevolent dictator)。
Hamel Husain: 在做开放编码(open coding)的时候,很多团队会被委员会式的决策拖住。在很多情况下,这完全没必要。你不希望把这个过程弄得太昂贵以至于无法执行。你可以指定一个你信任其品味的人,这应该是有领域专业知识的人,通常来说就是产品经理。
今天的嘉宾:Hamel Husain 与 Shreya Shankar
Lenny Rachitsky: 今天的嘉宾是 Hamel Husain 和 Shreya Shankar。过去一年里,这个播客最热门的话题之一就是评估的兴起。Anthropic 和 OpenAI 的首席产品官都表示,评估正在成为产品构建者最重要的新技能。从那以后,这成了我邀请的众多顶级 AI 构建者口中反复出现的主题。两年前,我从未听说过评估这个词。现在它不断出现。上一次出现产品构建者必须掌握才能成功的新技能,是什么时候的事了?
Hamel 和 Shreya 在将评估从一个冷门、神秘的议题转变为 AI 产品构建者最必要的技能之一这件事上,发挥了重要作用。他们教授关于评估的权威在线课程,恰好也是 Maven 平台上排名第一的课程。他们已经教授了超过 2000 名产品经理和工程师,覆盖 500 家公司,包括 OpenAI 和 Anthropic 的大量团队成员,以及其他所有主要 AI 实验室。
在这次对话中,我们做了很多实操演示而非纸上谈兵。我们走了一遍开发有效评估的流程,解释了评估到底是什么、长什么样,讨论了围绕评估的许多主要误解,给了你开始为产品构建评估的前几个步骤,还分享了 Hamel 和 Shreya 过去几年积累的大量最佳实践。这期节目是你能找到的关于评估世界最深入同时又最易懂的入门指南。说实话,它让我都兴奋地想写评估了,尽管我根本没有什么需要写评估的东西。我想你看完之后也会有同样的感觉。
如果你看完这次对话感到兴奋,一定要去看看 Hamel 和 Shreya 在 Maven 上的课程。我们会在节目说明中附上链接。购买课程时使用代码 LENNYSLIST,可以享受 35% 的折扣。下面,请出 Hamel Husain 和 Shreya Shankar。
评估入门
Lenny Rachitsky: Hamel 和 Shreya,非常感谢你们来到这里,欢迎来到播客。
Hamel Husain: 谢谢邀请。
Shreya Shankar: 非常期待。
Lenny Rachitsky: 我比你们更期待。好的,几年前,我从未听说过评估这个词。现在它基本上是我播客上最热门的话题之一——要构建出色的 AI 产品,你必须非常擅长构建评估。而且,事实证明,世界上一些增长最快的公司基本上就是在为 AI 实验室构建、销售和创建评估。我之前刚邀请了 Mercor 的 CEO 来到播客。所以这里确实正在发生一些大事。我想利用这次对话帮助大家深入理解这个领域,但让我们从基础开始。到底什么是评估?对于完全不知道我们在说什么的人,给我们一个快速的理解,让我们先从 Hamel 开始。
Hamel Husain: 没问题。评估是一种系统化地衡量和改进 AI 应用的方式,它完全不需要让人觉得可怕或高不可攀。它的核心其实就是对 LLM 应用做数据分析,以及用系统化的方式来审视这些数据,并在必要的地方创建指标,这样你就能衡量正在发生的事情,然后进行迭代、做实验并改进。
Lenny Rachitsky: 这是一个非常好的宏观理解方式。如果再深入一层,给大家一种更具体的想象和可视化的方式,哪怕有一个示例来展示就更好了,有没有更深层的理解评估的方式?
Hamel Husain: 假设你有一个房地产助手应用,它没有按照你想要的方式工作——它没有按你的期望给客户写邮件,或者没有调用正确的工具,或者出现了各种其他错误。在评估出现之前,你只能靠猜。你可能会修改一个 prompt,然后祈祷你没有因此破坏其他东西,你可能依赖感觉检查(vibe checks),这倒也没什么问题。
评估比感觉检查走得更远
Hamel Husain: 感觉检查本身没问题,一开始你也应该做感觉检查。但它很快就会变得非常难以管理,因为随着应用不断增长,单纯依赖感觉检查实在太困难了。你会感到茫然无措。而评估可以帮助你创建指标,用来衡量应用的运行状况,让你有信心地去改进应用——你有了一个可以据此迭代的反馈信号。
Lenny Rachitsky: 为了让大家更有实感,我们继续用这个房地产助手的例子。假设它在帮客户预约看房或者参加开放日。这个助手在和用户对话、回答问题、推荐房源。作为这个助手的构建者,你怎么知道它给出的是好建议、好回答?它有没有在说完全错误的东西?
所以评估的本质,就是建立一套测试,告诉你这个助手有多频繁地在做你不希望它做的错误事情。“错误”可以有很多种定义:可能是在编造信息,可能是以非常奇怪的方式回答。我对评估的理解(你们告诉我对不对)是,它简单来说就像代码中的单元测试。你笑了。你是不是在想:“不,你这个傻瓜。”
Shreya Shankar: 不,我刚才不是那个意思。
Lenny Rachitsky: 好吧好吧好吧,那你来说,这个比喻感觉怎么样?
Shreya Shankar: 好。我同意你一开始说的——我们给了一个非常宽泛的定义。评估是一个衡量应用质量的大光谱,方法多种多样。单元测试确实是其中一种方式。也许你的 AI 助手有一些不可妥协的功能,单元测试能够检查这些。但另一方面,由于这些 AI 助手执行的任务非常开放,你也会想要衡量它们在模糊或含混的事情上表现如何——比如应对新型用户请求,或者判断是否出现了新的数据分布,比如有新类型的用户开始使用你的房地产助手,你甚至不知道会有这样的用户来使用你的产品。然后你突然意识到,“哦,我需要用不同的方式来适应这个新群体。”
评估也可以是一种定期查看数据的方式,用来发现这些新的用户群体。评估还可以是你想要持续追踪的指标,比如你想追踪用户说”是的,点赞,我喜欢这条回复”。这些都是非常基础的东西,不一定与 AI 直接相关,但可以回流到改进你产品的飞轮之中。所以总的来说,单元测试只是那个巨大拼图里很小的一部分。
用真实案例演示评估
Lenny Rachitsky: 很好。你们其实带来了一个评估的具体示例,好让大家看看我们到底在说什么。我们一直在讲大概念,那不如我们打开一个例子,让大家看看,“评估就是这个东西。”
Hamel Husain: 好,让我先稍微铺垫一下背景。呼应一下 Shreya 刚才说的,很重要的一点是,我们不要把评估仅仅等同于测试。很多人会掉进一个常见陷阱——一上来就直奔测试,“我来写几个测试”。但通常这不是你应该做的。你应该先做一些数据分析,来明确你到底应该测试什么。这跟传统的软件工程有所不同,传统工程中你对系统的行为有更多确定性预期。而 LLM 的表面积大得多,随机性也强,所以这里的做法有不同的味道。
今天我要给大家看的例子,确实是一个房地产相关的例子,但类型不同。它来自一家叫 Nurture Boss 的公司。我先共享一下屏幕,给大家看看他们的网站,帮助你们理解这个用例。所以这是我和他们合作过的一家公司,叫 Nurture Boss,它是面向物业经理的 AI 助手,帮助他们管理公寓,涉及各种任务,比如潜在客户跟进、客户服务、预约看房等等。物业经理日常可能做的各种运营工作,它都能帮忙处理。你们可以看到他们的业务范围。这是一个非常好的例子,因为它具有现代 AI 应用的许多复杂性。
它有多种不同的交互渠道——聊天、短信、语音;还有大量的工具调用——预约看房、查询房源信息等等;还有 RAG 检索——获取客户和物业等相关信息。所以作为 AI 应用来说,它相当完整。他们也很大方地允许我使用他们的数据作为教学案例。数据已经做了匿名化处理。今天我要带大家走过的是,好,我们如何开始为 Nurture Boss 构建评估。我们为什么要做这件事?
错误分析:从数据开始
让我们从最初始的阶段开始,也就是我们所说的错误分析——看看应用的数据,首先搞清楚哪里出了问题。我接下来打开一个可观测性工具。这里你用什么工具都可以。我碰巧把这些数据加载到了一个叫 Braintrust 的工具里,但你可以加载到任何工具中。我们并没有偏好哪个工具。在我们和你一起写的那篇博文中,我们用了同样的例子,但用的是 Phoenix Arize,我记得 Aman 在他自己的博文里也用了 Phoenix Arize。还有 LangSmith。这些都是你可以使用的不同工具。
现在你在屏幕上看到的是应用的日志。让我给大家展示一下它是什么样子。你看到的——我把它全屏一下——这是某个客户与 Nurture Boss 应用的一次具体交互,它是所发生一切的详细日志。这叫做一个 trace(追踪),就是一系列事件日志的工程术语。trace 这个概念已经存在很长时间了,但在 AI 应用中它变得尤为重要。
这里包含了 AI 完成其工作所需的全部不同组件、模块和信息,所有内容都被记录下来,我们现在看到的就是这样一个视图。你可以看到这里有一个系统 prompt。系统 prompt 写的是:“你是一个 AI 助手,担任 Retreat at Acme Apartments 的租赁团队成员。“记住我说过这是匿名化的,所以名字是 Acme Apartments。“你的主要职责是回复现有住户和潜在住户的短信。你的目标是提供准确、有用的信息”等等等等。后面还有大量关于行为准则的详细规定。
Lenny Rachitsky: 顺便问一下,这是这家公司实际使用的系统 prompt 吗?
Hamel Husain: 是的,没错。
Lenny Rachitsky: 太棒了,太酷了。
Hamel Husain: 这是真实的系统 prompt。
Lenny Rachitsky: 真了不起,因为很少能看到一个真实公司产品的系统 prompt。那通常被视为他们的皇冠上的珠宝,所以这件事本身就非常酷。
实际 trace 示例
Hamel Husain: 是的,真的非常酷。你可以看到各种不同的功能、不同的使用场景,比如预约看房、处理申请、针对不同角色的沟通指导等等。你可以看到用户直接进来问:“你们有一室一书房的房源吗?我在虚拟看房里看到了。“然后你可以看到 LLM 调用了一些工具。它调用了获取个人信息工具,拉回了那个人的信息,然后又查了社区的可用房源。它实际上是在查询一个数据库,获取那个公寓群的空房信息。
最后 AI 回复说:“我们有好几套一居室,但没有专门标注带书房的。以下是几个选项。”
然后用户说:“有一室一书房的能通知我吗?”
AI 说:“我目前没有一室一书房房源的具体可用信息。”
用户说:“谢谢。”
AI 说:“不客气,如果还有其他问题,随时联系我们。“现在这是一个 trace 的示例,我们正在看的是一个具体的数据点。所以在做 LLM 应用的数据分析时,有一件事非常重要,就是去看数据。你可能会想,“这些日志太多了,乱糟糟的,各种东西都在里面。到底要怎么看这些数据?难道就在数据里淹死吗?怎么分析这些数据?“
错误分析的方法
Hamel Husain: 事实上有一种完全可以驾驭的方法来做这件事,而且不是我们发明的。它在机器学习和数据科学领域已经存在很长时间了,叫做错误分析。具体做法是,征服这类数据的第一步就是写笔记。你得戴上产品经理的帽子——这也是为什么我们要跟你聊这个,因为产品人员必须在场,必须参与到这种工作中来。通常开发者不太适合做这件事,尤其当它不是一个编程类应用的时候。
Lenny Rachitsky: 我复述一下,你说产品人员必须参与的原因是——这就是你产品的用户体验。人们和这个 agent 对话,本质上就是整个产品,所以产品经理深度参与其中是完全合理的。
Hamel Husain: 对。让我们来回顾一下这段对话。用户询问了房源可用性,AI 说:“哦,我们没有这种房源,祝你有美好的一天。“对于一个帮助管理线索的产品来说,这算好吗?你觉得这是我们想要的结果吗?
Lenny Rachitsky: 不太理想。
Hamel Husain: 对,不理想,我很高兴你这么说。很多人会说:“哦,很好啊,AI 做得对。它查了,说了没有可用房源,确实没有嘛。“但戴上产品经理的帽子,你就知道这不对。所以你要做的就是在这里快速写一条笔记。你可以弹出来写一条备注。每个可观测性工具都有写笔记的功能。你不需要去判定到底哪里出了问题——在这个例子中,它确实没有做正确的事——你只需要快速写一条笔记:“应该转交给人工。”
Lenny Rachitsky: 看着这个过程,就像你提到的这点,后面还会展开讲——你这样做,感觉非常手工、不可规模化,但正如你所说,这只是流程的第一步,有一套系统的方法。那只是第一步。
Hamel Husain: 对,而且你不需要对所有数据都这样做。你抽样一部分数据看一下就行,你会惊讶于这样做能学到多少东西。每个这样做的人都会立刻上瘾,他们说:“这是构建 AI 应用时你能做的最有价值的事。“你真的能学到很多,然后你会想:“嗯,这不是我想要的工作方式。好,记下来。“这就是一个例子。
你写完这条笔记,然后我们可以看下一个 trace。这是下一个 trace。我在键盘上按了一个快捷键,让我切回去看一下。
Lenny Rachitsky: 这些工具让你可以很方便地浏览一批数据,快速添加这些笔记。
短信渠道的问题
Hamel Husain: 对。这是另一个例子。类似的系统 prompt。我们不需要再全部过一遍,直接跳到用户的问题。“我一直给你发短信”,是不是很有趣?然后用户说:“请。“好,这个其实是一个应用层面的错误——这是一个短信应用,更准确地说,客户沟通的渠道是短信,所以你会收到一些非常混乱的内容。你可以看到这里内容有点不通。句子被截断了,比如”与此同时”然后就断了,系统不知道怎么回复,因为你知道人们发短信的方式——他们写很短的短语,把一句话拆成四五条来发。所以这个情况下——
Lenny Rachitsky: 那遇到这种情况你怎么处理?
Hamel Husain: 对,这是一种不同类型的错误。
Lenny Rachitsky: 嗯。
Hamel Husain: 这更像是”我们没有正确处理这种交互”,更多是一个技术问题,而不是”AI 没有按照我们的期望去做”这种问题。所以我们也会把它记下来。
Lenny Rachitsky: 但这也很有价值。
Hamel Husain: 对。
Lenny Rachitsky: 在这里能抓到这种问题真的很了不起。否则你根本不知道正在发生这种事。
Hamel Husain: 对,你可能根本不知道有这种情况。所以你只需要说“好的”,写一条笔记,比如“短信导致对话流程体验很差”。
Lenny Rachitsky: 我喜欢你用了“体验很差”这个词,说明这个阶段可以多么不正式。
Hamel Husain: 对,就是要放松。不要想太多。这有一个方法论的。大家总是会问,怎么做?是不是要找出这个 trace 里所有的问题?笔记该写什么?答案是:只写下你看到的第一个错误,最上游的那个错误。不要管其他所有错误,只抓住你看到的第一个问题,然后停下来,继续看下一个。你可以很快上手这个方法。前两三个可能会很痛苦,但之后你可以做得非常快。
幻觉问题
Hamel Husain: 这是另一个例子,我们同样跳过系统 prompt。用户问:“你好,我在找两到三居、一到两个卫生间的房源。你们提供虚拟看房吗?”
然后调用了一堆工具,它回复说:“你好 Sarah,目前我们有一套三居室、两个半卫生间的公寓,租金 2,175 美元。很遗憾目前没有两居室选项。我们确实提供虚拟看房,你可以预约一次看房,“等等。但碰巧的是,虚拟看房根本不存在。
Lenny Rachitsky: 嗯,有意思。
Hamel Husain: 所以它在幻觉(hallucinating)一个不存在的东西。这时候你需要带入你作为工程师甚至产品人员的上下文知识,说:“这有点不对。我们不应该告诉用户有虚拟看房,但我们根本没有这个服务。“
开放编码中 LLM 的局限
Hamel Husain: 所以你会说:“好的,提供了虚拟看房,“然后直接写下笔记。你可以看到我们观察到的错误类型是多种多样的,而且我们实际上在很短的时间内就对应用了解了很多。
Shreya Shankar: 我们在这个阶段经常被问到的一个问题是:“我理解是怎么回事了。能不能让 LLM 来帮我做这个过程?”
Lenny Rachitsky: 嗯,好问题。
Shreya Shankar: 我很喜欢 Hamel 刚才最后那个例子,因为当我们尝试让 LLM 来做这种错误分析时,它通常只会说这个 trace 看起来没问题,因为它不具备判断某个东西是否有产品层面问题的上下文。比如那个虚拟看房的幻觉,对吧?我敢打赌,如果我把这个放到 ChatGPT 里问“有没有错误?”,它会说“没有,做得很好。”
但 Hamel 有上下文,知道”哦,我们实际上没有虚拟看房这个功能”。所以我认为在这些情况下,确保你自己手动做这件事是非常重要的。我们可以稍后再谈什么时候在流程中使用 LLM,但这里头号陷阱就是人们说”让我用 LLM 自动化这个过程”。
Lenny Rachitsky: 你觉得我们会到一个 agent 能做到这一步的阶段吗?就是说它拥有那种上下文?
Shreya Shankar: 哦,不会。不会不会不会。抱歉。错误分析中确实有 LLM 适合做的部分,我们可以在播客后面聊到。但现在,在自由形式的笔记记录这个阶段,不是 LLM 该上场的地方。
Lenny Rachitsky: 明白了。这个步骤就是你们所说的开放编码(open coding),对吗?
Shreya Shankar: 是的,完全正确。
仁慈的独裁者
Lenny Rachitsky: 好。你们在文章中用到的另一个术语我很喜欢,也和这一步相关,就是”仁慈的独裁者”这个概念。也许可以聊聊这是什么,Shreya 你来讲讲?
Shreya Shankar: 好的,这个术语其实是 Hamel 提出来的。
Lenny Rachitsky: 好吧,那还是让 Hamel 来讲。
Hamel Husain: 没问题。而且我们实际上会在这个例子中展示 LLM 自动化,因为我们要把这个例子一路贯穿到底。
Lenny Rachitsky: 太棒了。
Hamel Husain: 仁慈的独裁者只是一个好记的说法,指的是在做开放编码的时候,很多团队会被委员会式的方式拖住。而在很多情况下,那完全没必要。人们会很不舒服,觉得”我们想让所有人都参与进来,我们想让每个人都同意”,等等。你需要穿透这些噪音。在很多组织中,如果你深入观察,尤其是中小型公司,你可以指定一个你信任其品味的人。你可以用很少的人,通常一个人,就能完成这件事。让它切实可行非常重要。你不能让这个过程变得昂贵到无法执行,否则你就亏了。
这就是仁慈的独裁者背后的理念——“嘿,你需要在尽可能多的维度上简化这件事。“我们后面还会谈到另一点,就是构建 LLM 作为评判者的时候,你需要一个二元评分。你不想去想”这是 1、2、3、4 还是 5?“给它打分。你做不到,那会拖慢整个过程。
Lenny Rachitsky: 为了确保仁慈的独裁者这个概念真的很清楚——基本上,这个人就是做笔记的那个人,理想情况下是这个领域的专家。如果是法律方面的,可能是一个法务人员来负责,也可以是产品经理。给我们点建议,这个人应该是什么样的?
Hamel Husain: 对,应该是具备领域专业知识的人。在这个例子中,就是了解公寓租赁业务、有上下文判断这是否合理的人。就像你说的,始终是领域专家。法律方面就是法务人员,心理健康方面就是心理健康专家,无论是精神科医生还是其他人。
Lenny Rachitsky: 明白。
Hamel Husain: 不过通常情况下,是产品经理。
Lenny Rachitsky: 好。所以这里的建议就是选定那个人。可能感觉不太公平,由他一个人说了算、当独裁者,但他是仁慈的,没问题的。
Hamel Husain: 对,没问题的。不求完美。你只是想取得进展、快速获得信号,这样你就知道该做什么,因为一不小心这事可以变得无限昂贵。
更多开放编码的实例
Lenny Rachitsky: 对。好,我们回到你的例子。
Hamel Husain: 好,没问题。这是另一个例子,有人说:“好的,你们有什么优惠吗?“助手或 AI 回答:“我们有 5% 的军人折扣。“用户接着回复,换了话题:“能告诉我有几层楼吗?你们有没有一居室?或者一楼的一居室?“AI 回答:“好的,我们有好几套一居室公寓。“用户想确认:“那些有在一楼的吗?一居室多少钱?“同时,因为这是一个现有住户,所以他们还问”我需要提交一个维修请求。”
你可以看到真实世界的杂乱,而助手直接调用了一个工具说转接电话,但什么都没说。它就突然执行了转接电话,所以我觉得挺糙的。就是——
Lenny Rachitsky: 又一个糙的。
Hamel Husain: 另一种糙,不同类型的糙。所以当你在写开放笔记的时候,你不应该说“糙”,因为我们要做的是理解发生了什么,等我们之后回头看笔记的时候,要知道怎么回事。
所以你只需要写“未与用户确认即转接电话”。不需要完美,你只需要对发生的事有个大概的了解。
理论饱和
Lenny Rachitsky: 好。
Hamel Husain: 那好。假设我们做了,Shreya 和我建议至少做 100 个。问题总是”要做多少个?“所以没有魔法数字。我们说 100 只是因为我们知道,一旦你开始做这件事,做了 20 个之后,你会自动发现它太有用了,你会继续做下去。
所以我们说 100 只是为了在心理上帮你解压,不那么吓人。就像”别担心,你就做 100 个。“实际上有个术语,正确的回答是”持续看 trace,直到你觉得学不到新东西为止。“也许 Shreya 可以讲讲——
Shreya Shankar: 对,确实有个术语——在数据分析和质性分析中叫做理论饱和(theoretical saturation)。意思就是,当你做了所有这些查看数据的过程后,什么时候停下来?就是当你达到了理论上的饱和,或者说你不再发现新的笔记类型、新的概念类型,或者没有什么会对你流程的下一步产生实质性改变的东西了。
培养理论饱和的直觉
Shreya Shankar: 这种直觉的培养需要一点时间,所以通常情况下,人们并不太确定自己是否已经达到了理论饱和。这完全没问题。当你做了两三个例子或轮次之后,你会逐渐培养起这种直觉。很多人会意识到,“哦,好的,我只需要做 40 个,只需要做 60 个。实际上,我只需要做 15 个。” 这说不准,取决于具体的应用场景,也取决于你对错误分析的熟练程度。
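One way to make this stopping rule concrete is to track how many traces have gone by since a review last produced a new kind of note. The sketch below is an illustration, not a method described in the conversation: it assumes each reviewed trace has already been reduced to one short, normalized label, the `window` of 20 is an arbitrary choice, and `saturation_point` is a hypothetical helper name.

```python
from typing import Iterable, Optional


def saturation_point(codes: Iterable[str], window: int = 20) -> Optional[int]:
    """Return how many traces had been reviewed when saturation was reached.

    `codes` holds one normalized label per reviewed trace, in review order.
    Saturation here means `window` consecutive traces produced no label that
    had not been seen before. Returns None if that never happens.
    """
    seen: set[str] = set()
    since_last_new = 0
    for reviewed, code in enumerate(codes, start=1):
        if code in seen:
            since_last_new += 1
        else:
            # A new kind of note resets the counter.
            seen.add(code)
            since_last_new = 0
        if since_last_new >= window:
            return reviewed
    return None
```

A `None` result suggests that new kinds of problems are still turning up and more traces are worth reading. The number itself matters less than the habit of noticing when the notes start repeating.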
Lenny Rachitsky: 你说的”你会想做很多”,我想是因为你会有这样的感觉:“哦,我发现了这么多问题,我得看看还有些什么。”
Shreya Shankar: 没错。
Lenny Rachitsky: 是这样吗?
Shreya Shankar: 没错,而且我保证,到了某个阶段,你就不会再发现新类型的问题了。
Lenny Rachitsky: 好的。假设我们做了 100 个,下一步是什么?
从开放编码到轴心编码
Hamel Husain: 好。你做了 100 个,现在你手里有了所有这些笔记。这时候你可以开始用 AI 来帮你了。不过,就像我们讨论过的,亲自查看数据这个环节很重要,你不希望把这部分过度自动化。
Lenny Rachitsky: 人类仍然会有工作。这是关键收获。太好了。
Hamel Husain: 是的。
Lenny Rachitsky: 就是审查 trace。至少目前还剩这一份工作。很好。
Hamel Husain: 没错。好,你有了所有这些笔记。要把它们变成有用的东西,你可以做基本的计数。基本计数是数据科学中最强大的分析技术,因为它如此简单,而且在很多情况下被低估了,所以对人们来说非常容易上手。
第一件事就是把这些笔记拿过来,用 LLM 对它们进行分类,有很多不同的方法可以做到这一点。就在这期播客录制之前,我试了三种不同的编程 agent 或 AI 工具来对这些笔记进行分类。一种方法是,我把数据上传到一个 Claude project 里,上传了一个包含这些笔记的 CSV 文件,直接从界面导出的。有很多不同的做法,但我展示给你的是最简单、最笨的方法,最基础的做法。
我把 CSV 导进去,说”请分析以下 CSV 文件。“我告诉它有一个 metadata 字段里面包含笔记,但我用的词是”open codes”(开放编码),我说”我有不同的开放编码”——这是一个专业术语。LLM 知道什么是开放编码,也知道什么是轴心编码(axial codes),因为这个概念已经存在很长时间了,所以这些术语帮我快速跳到了我想做的事情上。
Lenny Rachitsky: 太棒了。prompt 的最后是让它创建轴心编码?
Hamel Husain: 对,创建轴心编码。它的作用是——
Shreya Shankar: 也许值得讲讲什么是轴心编码,或者说这一步的意义是什么?你有一堆杂乱的开放编码,你并没有 100 个不同的问题。实际上很多都是重复的,只是因为你用了不同的表述方式。你在开放编码时不应试图创建自己的失败分类体系,你只想把问题记录下来,然后再整理:“最常见的失败模式是什么?”
所以轴心编码的目的,说白了就是一个失败模式。它是一个标签或类别。我们的目标是得到这些失败模式的聚类,找出最普遍的那个,然后去攻克那个问题。
Lenny Rachitsky: 这非常有帮助。基本上就是把所有这些综合起来——
Shreya Shankar: 完全正确。
Lenny Rachitsky: 归纳成类别和主题。太酷了。我们会在节目笔记里附上这个 prompt,这样听众就不用截图再自己打字了。
Hamel Husain: 好主意。然后 Claude 就分析了这个 CSV 文件,决定如何解析它,等等。这些细节我们不需要操心,但它产出了大量的轴心编码。基本上,轴心编码就是类别,就像 Shreya 说的。比如:能力限制、表述不当、流程和协议违规、人工交接问题、沟通质量。它创建了这些类别。
我喜欢所有这些类别吗?并不完全。我喜欢其中一些。这是一个不错的初步尝试。我可能会对它们做一些重命名,因为有些类别太笼统了。比如”能力限制”是什么意思?有点太宽泛了,不够可操作。我希望让它更具可操作性,这样如果我确定它是一个问题,我就知道该怎么处理它。不过这个我们稍后再讨论。你可以用任何工具来做这件事,这是最笨的方法,但有时候笨方法正是一个好的起点。
Lenny Rachitsky: 这正是 LLM 擅长的事情——接收大量信息并加以综合提炼。
Shreya Shankar: 完全正确。帮我们综合提炼以便理解,对吧?注意它并不是自动提出修复方案什么的,那是我们的工作,但现在我们可以更轻松地理清这一堆开放编码了。
在生成轴心编码的这个 prompt 中,还有一个有趣的地方——你可以根据需要写得非常具体。你可以说”我希望每个轴心编码实际上是一个可操作的失败模式”,也许 LLM 会理解并照做;或者”我希望你按照用户故事的阶段来对这些开放编码分组。“所以在这里你可以发挥创意,或者根据你作为产品经理或工程师的具体需求来调整,这将有助于你后续的改进工作。
Lenny Rachitsky: 所以并没有一个确定的 prompt,说”这就是唯一正确的方式”?
Shreya Shankar: 完全没有。
Lenny Rachitsky: 你的意思是你可以迭代,看看什么对你有效?
Shreya Shankar: 完全正确。
Lenny Rachitsky: 有意思的是现有工具并没有做这件事,还是说它们尝试了但做得不好?
Shreya Shankar: 没有,我觉得它们没做。我们一直在到处呼吁,“拜托,拜托——”
Lenny Rachitsky: 哦,真的吗。
Shreya Shankar: “——做这个吧。“我确实觉得这件事有一点难度。Hamel 和我教的评估课程中的经验是,很多人实际上并不知道这些,所以也许是因为人们不了解这些方法,也不知道如何为此构建工具。希望我们能揭开这些魔法的面纱。
Lenny Rachitsky: 再深入确认一下,这并不是所有人都做或都知道的事情。这是你们两位基于在其他公司做数据分析和数据科学的经验总结出来的?
Shreya Shankar: 我想先澄清一下,我们并没有发明错误分析。我们实际上不想发明新东西。那是不好的信号。如果有人向你推销一种全新的方法,而且不以数百年来的理论和文献为基础,那你应该,怎么说呢,对此保持一点警惕。
但我们尝试做的是提炼出:“好的,要理解 LLM 的错误分析,你需要哪些新的工具和技术?“然后我们创建了一套课程或结构化的方法来做这件事。所以这一切都是针对 LLM 的,但开放编码、轴心编码这些术语,都是根植于社会科学的。
Lenny Rachitsky: 太棒了。你们做这件事有意思的地方在于,我也想找个地方试试。我手头并没有什么 AI 产品可以拿来操作,但就是觉得:“这太有意思了。“坐在那里,找出所有遇到的问题,把它们分类,然后尝试修复。
Shreya Shankar: 我很喜欢这个想法。
Lenny Rachitsky: Hamel 刚才打开了一个视频。你这里在展示什么?
机器学习中由来已久的传统
Hamel Husain: 对,我打开了一个视频,就是为了印证 Shreya 的观点。我们并没有发明任何东西,所以你在屏幕上看到的是 Andrew Ng,世界上最著名的机器学习研究者之一,坦白说,他教会了很多人机器学习。你可以看到这是一个八年前的视频,他在讲错误分析。这种技术用于分析随机系统已经有很长时间了,我们只是把同样的机器学习理念和原则搬到这里来,因为同样的,这些也是随机系统。
Lenny Rachitsky: 太好了。顺便一提,我们正在联系 Andrew 上我们的播客,正在沟通,所以到时候——
Shreya Shankar: 太好了。
Lenny Rachitsky: ——应该会非常有意思。另外,我很高兴今天刚发布的播客节目出现在你们的推荐信息流里,而且排名很靠前,我对此非常满意。
Hamel Husain: 很不错。推荐算法确实挺好的。
Lenny Rachitsky: 是啊,希望你会点进去。别搞砸我的算法。好,酷。我们已经做了一些综合。我知道我们不会把整个步骤都走一遍,你们有一整套课程,需要好几天才能学完全部流程。关于这个过程,你们还想分享什么?
用 AI 工具完成编码与分类
Hamel Husain: 好的。你可以在任何工具上做这件事,在 ChatGPT 里用同样的 prompt 也完全没问题。你可以看到它生成了轴心编码。我特别喜欢用 Julius AI,这是我最喜欢的工具之一。Julius 是一个使用 notebook 的第三方工具。我个人很喜欢 Jupyter notebook,它更偏向数据科学领域,但现在很多产品经理也在学习使用 notebook,挺酷的。它就像一个有趣的游乐场,你可以在里面写代码、看数据。不过我们不需要深入讲这个,只是想提一下,你可以使用很多工具。AI 在这方面确实很擅长。
那我们进入有趣的部分。好,现在我们有了这些轴心编码。所以我喜欢做的第一件事是:我手上有开放编码,也有轴心编码(假设是从 Claude 项目或 ChatGPT 中生成的),我先把它们收集起来看一眼,这些轴心编码是否合理?我会查看不同轴心编码与开放编码之间的对应关系,然后做一个审视:这些编码好吗?能不能改进?能不能更精细化?能不能让它们更具体、更有可操作性,而不是笼统模糊的?
你可以看到我整理出来的这些编码:看房预约与改期问题、人工交接或转接问题、输出格式错误、对话流程问题(我们之前在短信场景中看到过)、后续承诺未兑现。
所以基本上,现在你有了这些轴心编码,我把它们收集到一个列表里,这是一个 Excel 公式,把这些编码收集成一个列表,现在我们有了一个逗号分隔的编码列表。然后你只需要做的是:把你记录的那些开放编码拿出来,告诉一个 AI——这里为了简单起见用的是 Gemini——再说一遍,我们尽量保持简单——让它把这些记录分类到以下类别中。
Lenny Rachitsky: 给观看的人说一下,我喜欢你分享的这些不同的 prompt 和公式。这是 Google Sheets 的 AI prompt。
Shreya Shankar: 超级推荐。
Hamel Husain: 基本上,你可以把你的 trace 分类到不同的桶里,我们这里做的就是这件事。我们把我们遇到的所有问题分类到了这些类别中。
Shreya Shankar: 而且这是自动完成的,非常令人兴奋。我的意思是,AI 在做这件事。这再次说明了一点:你的开放编码必须详细。你不能只写一个”糙”,因为如果 AI 读到”糙”,它是没法分类的。即使是人类也不行,对吧?人类得回去回忆你为什么说”糙”,所以在开放编码中保持一定程度的详细很重要。
Lenny Rachitsky: 好的。所以避免使用”糙”这个词。这是个好的经验法则。
Shreya Shankar: 对。或者后面再跟上十个词也行。
Lenny Rachitsky: 哦,好吧。那——
Hamel Husain: 我刚才是在开玩笑。
Lenny Rachitsky: 好。那人们常用的、你觉得不太好的词还有哪些?
Shreya Shankar: 我觉得不是特定词语的问题,而是人们在开放编码中不够详细,所以很难完成分类。
Lenny Rachitsky: 明白了。顺便问一下,之所以要把它们映射回去,是因为比如说 Claude 或 ChatGPT 给了你建议,你做了修改和迭代,所以你不能直接回过头说:“好的,每个桶随便放”——是这样吗?
Hamel Husain: 对,对。
Lenny Rachitsky: 好的。
Hamel Husain: 这个问题问得真的很好。迭代和思考是很有必要的——“我喜欢这些开放编码吗?这些对我来说真的合理吗?“就像 AI 做的任何事情一样,把自己放在中间环节审视一下是很有必要的。
Lenny Rachitsky: 人在回路中。还是有我们发挥的空间。很好。
Shreya Shankar: 如果我想用 AI 来做这个标注,我在这一步还喜欢做的一件事是增加一个叫“以上都不是”的新类别。这样 AI 可以实际输出“以上都不是”这个轴心编码,这会告诉我:“好吧,我的轴心编码不完整。让我们回去看看那些开放编码,找出一些新类别,或者想想怎么重新措辞现有的轴心编码。”
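The escape-hatch category can be sketched in a few lines. This is an illustration only: the prompt wording is not the one used in the episode, and `build_classification_prompt` and `uncovered_share` are hypothetical helper names. The second function treats any label outside the agreed axial codes, including an explicit "none of the above", as a sign that the taxonomy may be incomplete.

```python
NONE_OF_THE_ABOVE = "none of the above"


def build_classification_prompt(categories: list[str], note: str) -> str:
    """Ask for exactly one label from `categories` or the escape-hatch label."""
    options = ", ".join(categories + [NONE_OF_THE_ABOVE])
    return (
        "Classify the following note into exactly one of these categories: "
        f"{options}.\nNote: {note}\nAnswer with the category name only."
    )


def uncovered_share(labels: list[str], categories: list[str]) -> float:
    """Fraction of returned labels that fall outside the axial codes.

    Unknown strings are counted the same way as 'none of the above'.
    """
    if not labels:
        return 0.0
    allowed = {c.lower() for c in categories}
    misses = sum(1 for label in labels if label.strip().lower() not in allowed)
    return misses / len(labels)
```

If the uncovered share is more than a small fraction, that is a cue to reread the open codes behind those rows and add or reword a category.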
Lenny Rachitsky: 太棒了。而且这个方法很棒的地方在于,你不需要反复做很多次。
Shreya Shankar: 不需要。
持续改进的节奏
Lenny Rachitsky: 对于大多数产品来说,你只需要做一次这个过程,然后在此基础上不断构建,随着时间推移不断调整,对吗?
Shreya Shankar: 完全正确。而且速度会变得很快。人们每周做一次,整个过程三十分钟就能完成,然后你的产品突然就比之前对这些问题一无所知时好了一大截。
Lenny Rachitsky: 是啊。不知道这些问题正在发生,这太荒谬了。看着这个过程进行,我心想:“你怎么能不对自己的产品做这件事呢?”
Shreya Shankar: 很多人完全不知道。
Lenny Rachitsky: 大多数人。是的。这个问题我们后面会聊到,围绕这些内容有一场很值得讨论的辩论。好,酷。你现在有了这个表格。接下来呢?
Hamel Husain: 好的。接下来就是重头戏了。现在是见证奇迹的时刻。我们已经把所有这些编码应用到了我们的 trace 上,都是我们认可的编码。现在,你可以来揭晓答案了——你可以对它们进行计数。
从数据透视表到问题优先级排序
Hamel Husain: 这是一个数据透视表,我们可以直接对这些编码做数据透视,统计各类问题出现的次数。我们发现了什么?在我们分类好的这些 trace 上发现了什么?我们发现了 17 个对话流程问题。我非常喜欢数据透视表,因为你可以做很多很酷的操作。你可以双击这些条目,说”哦,让我看看这些具体内容”,不过这是关于数据透视表有多酷的题外话了。
但现在我们有了一个粗略但清晰的切面,看到了我们的问题是什么。我们从混乱走向了某种有序的思考——“哦,你知道吗?这些是我最大的问题。我需要修复对话类问题,也许是这些人工转接的问题。“不一定数量最多的问题就是最重要的,也可能某个问题虽然数量不多但特别严重,你想优先修复它。但总之,现在你有了一种审视问题的方式,然后可以考虑其中一些是否需要编写评估。
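The pivot-table count has a direct equivalent in a few lines of Python. This is a minimal sketch that assumes each labeled trace is a dictionary with an `axial_code` field; that field name and the function name are hypothetical, not taken from the spreadsheet shown in the episode.

```python
from collections import Counter


def count_failure_modes(labeled_traces: list[dict]) -> list[tuple[str, int]]:
    """Count traces per axial code, most frequent first (like a pivot table)."""
    counts = Counter(
        row["axial_code"] for row in labeled_traces if row.get("axial_code")
    )
    return counts.most_common()


# Hypothetical rows, only to show the shape of the input.
example_rows = [
    {"trace_id": 1, "axial_code": "conversation flow"},
    {"trace_id": 2, "axial_code": "human handoff issue"},
    {"trace_id": 3, "axial_code": "conversation flow"},
    {"trace_id": 4, "axial_code": None},  # no failure noted on this trace
]
ranking = count_failure_modes(example_rows)
```

As in the conversation, the ranking is only a starting point: a rare but severe failure mode can still deserve attention ahead of a frequent, harmless one.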
何时需要编写评估
其中有些可能只是低级的工程错误,你不需要为它写评估,因为修复方法非常明显。也许是输出格式错误——你可能只是忘记告诉 LLM 你想要的格式,连 prompt 里都没写。那就直接去改 prompt 就好了,然后我们可以决定:“好吧,要不要为这个写个评估?“你可能还是想写,因为你可以用纯代码来测试。你可以直接检查字符串是否具有正确的格式,而不需要运行 LLM。
所以评估存在成本收益的权衡。你不应该过度沉迷其中,但通常应该以实际错误为出发点。你不想跳过这一步。我之所以在这上面花这么多时间,是因为这正是人们迷失方向的地方。他们直接跳到评估阶段,说”让我来写些测试”,然后事情就脱轨了。
构建 LLM as a Judge
好,假设我们想解决其中一个问题。比如我们想解决这个人工转接的问题,然后我们会想:“嗯,我不太确定怎么修。这涉及到一种主观判断——我们是否应该转接给人工?而且我不能立刻知道怎么修,这不是特别显而易见。是的,我可以改 prompt,但我不确定。我没有百分之百的把握。”
那这可能就是一个适合用 LLM as a judge 来处理的场景。评估有不同的类型。一种是基于代码的,如果可以的话你应该优先尝试这种方式,因为成本更低。LLM as a judge 则是一种元评估——你需要对评估本身再进行评估,确保做评判的 LLM 在做正确的事,这个我们稍后会谈到。
好,LLM as a judge,这是一种方式。那怎么构建 LLM as a judge 呢?
Lenny Rachitsky: 在讲这个之前,为了确保大家完全理解你刚才描述的这两种评估类型——一种是你说基于代码的,一种是 LLM as a judge。也许 Shreya,帮我们理解一下基于代码的评估到底是什么?简单来说就是一个单元测试?可以这样理解吗?
Shreya Shankar: 对。也许”评估”这个词在这里不是最准确的,但可以把它理解为一个自动化评估器。当我们发现这些失败模式后,我们想要的是:“好,现在我们能不能用自动化的方式检查这个失败模式的发生频率,而不需要我手动标注、做所有的编码和分组?我想在成千上万的 trace 上运行,我想每周都跑一遍。“这就是——你应该构建一个自动化评估器来检查那个失败模式。
当我们说基于代码的和基于 LLM 的时候,我们说的是:“好,也许我可以写一个 Python 函数或一段代码来检查某个 trace 中是否存在那个失败模式。“这对于某些情况是可行的,比如检查输出是否为 JSON,或者是否为 markdown,或者是否够简短。这些都是可以用代码捕获的,或者至少近似地用代码捕获的。
当我们谈到 LLM judge 时,我们指的是这是一种复杂的失败模式,我们不知道如何用自动化方式评估。所以也许我们会尝试用 LLM 来评估人工转接这个非常狭窄、特定的失败模式。
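The code-based checks mentioned here (is the output JSON, is it short enough) might look like the sketch below. The function names and the 320-character limit are assumptions chosen for illustration; nothing in the conversation specifies them.

```python
import json
from typing import Callable


def is_valid_json(output: str) -> bool:
    """True if the model output parses as JSON."""
    try:
        json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


def is_short_enough(output: str, max_chars: int = 320) -> bool:
    """True if the output fits within an (assumed) length budget."""
    return len(output) <= max_chars


def failure_rate(outputs: list[str], check: Callable[[str], bool]) -> float:
    """Share of outputs that fail a given check."""
    if not outputs:
        return 0.0
    return sum(1 for o in outputs if not check(o)) / len(outputs)
```

Checks like these need no model call, so they are cheap to run over thousands of traces, which is why the speakers suggest trying them before reaching for an LLM judge.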
Lenny Rachitsky: 为了确认我理解了你描述的内容——你想测试你的 agent 或 AI 产品的输出。你问它一个问题,它返回一些东西。
一种测试它是否给出正确答案的方式是:如果它持续做同一件事,你可以写一段代码来判断结果是否正确。比如,它会不会说有虚拟看房服务?你可以问它。
Shreya Shankar: 是的。
Lenny Rachitsky: “你们提供虚拟看房吗?“它回答是或否,然后你可以用代码根据那个具体答案判断是否正确。
但如果你问的是更复杂的问题,答案不是非此即彼的,这种情况下你就需要一个人来告诉你这是否正确。为了避免每次都靠人工审阅,解决方案就是让 LLM 替代人工判断,你们称之为 LLM as a judge,也就是让 LLM 作为评判者来判断这是否正确。
Shreya Shankar: 完全正确。你说得很准确。
Lenny Rachitsky: 很好。
Shreya Shankar: 很多人总觉得:“哦,这至少和构建原始 agent 的问题一样难。“其实不然,因为你只要求 judge 做一件事——评估一个失败模式——所以问题的范围非常小,而且这个 LLM judge 的输出就是通过或不通过。这是一个范围非常紧密的任务,LLM judge 非常擅长做这种事,而且做得非常可靠。
Lenny Rachitsky: 这里的目标就是拥有一套测试,在你上线到生产环境之前运行,告诉你一切是否按照你期望的方式进行?你的 agent 的交互行为是正确的?
Shreya Shankar: LLM judge 的美妙之处在于,你当然可以在单元测试或 CI 中使用它们,但你也可以在线上用于监控。我可以每天抽样 1000 条 trace,运行我的 LLM judge,用的是真实的生产 trace,看看那里的失败率是多少。这不是单元测试,但我们同样获得了一个极其具体的应用质量衡量指标。
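The daily sampling idea can be sketched as follows. This is an assumption-laden illustration: `judge` stands for any callable that returns `True` when the failure mode is present (an LLM judge or a code check), and `sampled_failure_rate` is a hypothetical name. How traces are fetched from production is left out entirely.

```python
import random
from typing import Callable, Optional, Sequence


def sampled_failure_rate(
    traces: Sequence[str],
    judge: Callable[[str], bool],
    sample_size: int = 1000,
    seed: Optional[int] = None,
) -> float:
    """Estimate how often a failure mode occurs from a random sample of traces."""
    if not traces:
        return 0.0
    rng = random.Random(seed)  # a seed makes the sample reproducible
    k = min(sample_size, len(traces))
    sample = rng.sample(list(traces), k)
    return sum(1 for trace in sample if judge(trace)) / k
```

Run on a schedule, the resulting rate becomes a time series that can sit on a dashboard next to other product metrics.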
Lenny Rachitsky: 这点非常好,因为很多人只把评估看作一种脱离现实的东西——你在真实世界之前测试的东西。而你说在真实世界中实际发生的,你也应该用完全一样的方式来做?
Shreya Shankar: 是的。
Lenny Rachitsky: 测试你生产环境中实际运行的东西?而且可以是每天、每小时运行的?
Shreya Shankar: 完全可以。
Lenny Rachitsky: 太好了。好,Hamel 这里有一个实际的 LLM as a judge 评估的例子,我们来看看。
一个 LLM as a Judge 评估的实例
Hamel Husain: 我很喜欢 Shreya 帮我把铺垫做好了,非常感谢。我们这里有一个针对这一特定失败模式的 LLM as a judge prompt。正如 Shreya 所说,你应该针对一个特定的失败模式来做,而且要把它变成二元的,因为我们想简化事情。我们不要”嘿,给这个打个一到五的分,看看有多好”。在大多数情况下,那只是一种逃避做决定的狡猾方式。不是的,你需要做一个决定——这够好了吗?是还是否?
想清楚那个标准可能会很痛苦,但你一定要去做。否则这件事会变得非常难以处理,而且当你汇报这些指标的时候,没有人知道 3.2 和 3.7 意味着什么。
Shreya Shankar: 是的,这种情况我们见得太多了,甚至网上那些专家策划的内容也是这样,比如“这是你的 LLM judge 评估 prompt,这是一到七分的量表。”我总会给 Hamel 发消息说,“完了,我们又得去纠正那些错误信息了,因为我们知道肯定有人会去试,然后回来跟我们说‘我的平均分是 4.2’,我们就只能……好吧。”
Lenny Rachitsky: 评估领域的争议真是太多了。我们后面还会聊到这个。天哪。
构建 Judge Prompt 与对齐人类判断
Hamel Husain: 好,这就是你的 judge prompt。没有唯一的正确做法。你可以用 LLM 来帮你创建它,但同样,把自己放在环里。不要盲目接受 LLM 的输出,在我们做的所有这些案例中,我们都是这么做的。借助轴心编码,我们对这个 prompt 进行了迭代。你可以用 LLM 来帮你创建这个 prompt,但一定要读它、编辑它。这不一定是个完美的 prompt。这只是个很简单的示例,只是为了展示思路。就像,“好,针对这个交接失败”,我说,“好,我要你输出 true 或 false”,这是一个二元的 judge。这就是我们推荐的做法。然后我就列出来,“好,什么时候应该做交接?“我就一条条列出来。
明确的、被忽略或循环的人类请求、某些政策规定的转接、敏感住户问题、工具数据不可用、当天的看房或参观请求——这些你需要跟人工沟通,诸如此类。核心思路是,既然我知道数据中存在这个失败,我就有兴趣去迭代它,因为我知道这种情况其实一直在发生。正如 Shreya 所说,如果能有一种方法,不仅在已有数据上评估,还能在生产数据上评估,了解这种情况在什么规模下发生,那就好了。让我找到更多 trace,让我有办法迭代。我们可以拿这个 prompt,再次使用电子表格。第一步是,做这个 judge 的时候……我写了 prompt。
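A binary judge of this kind can be sketched as a prompt template plus a strict parser. Everything here is illustrative: the template paraphrases the handoff conditions listed above and is not the actual prompt from the episode, and `call_llm` is a stand-in for whatever model client is in use, passed in as a plain callable so that no specific API is assumed.

```python
from typing import Callable

JUDGE_TEMPLATE = """You are reviewing one conversation trace from a leasing assistant.
Failure mode under review: the assistant should have handed off to a human but did not.
A handoff is expected when, for example:
- the user explicitly asks for a human, or that request is ignored or loops
- policy requires a transfer
- the issue is a sensitive resident matter
- tool data is unavailable
- the user requests a same-day tour
Answer with exactly one word: true (failure present) or false (no failure).

Trace:
{trace}
"""


def parse_verdict(raw: str) -> bool:
    """Accept only a binary answer; anything else is an error, not a score."""
    answer = raw.strip().lower()
    if answer == "true":
        return True
    if answer == "false":
        return False
    raise ValueError(f"Judge did not return a binary verdict: {raw!r}")


def judge_handoff(trace: str, call_llm: Callable[[str], str]) -> bool:
    """Return True when the judge reports a handoff failure in this trace."""
    return parse_verdict(call_llm(JUDGE_TEMPLATE.format(trace=trace)))
```

Rejecting anything other than `true` or `false` keeps the judge from drifting into hedged or scaled answers, which is the behavior the speakers warn against.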
很多人到这里就停了,他们说,“好,我有了 judge prompt,搞定了,直接上线吧。”然后只要 judge 说有错,那就算有错。他们直接把它当作金科玉律,觉得“好,LLM 说错了,那肯定错了。”不要这样做,因为这是让评估和实际情况脱节的最快方式。当人们对你的评估失去信任,他们对你也会失去信任。这一点非常重要,不要这样做。所以在发布你的 LLM as a judge 之前,你要确保它与人类判断对齐。怎么做?你有那些轴心编码,你要拿你的 judge 跟轴心编码比对,看“嘿,它跟我的判断一致吗?我自己的 judge,它同意我吗?”直接测量就行。
我们这里做的是,好,我说“评估这个 LLM trace。”同样,我只是用电子表格,“根据这些规则评估这个 LLM trace。”规则就是我刚才展示的那个 prompt。我问它,“好,是否存在交接错误,true 还是 false?”然后这一列,让我放大一点。H 列是“好,这个错误发生了吗?”G 列是:我认为错误是否发生了。你可以看到——
Lenny Rachitsky: 你是手动逐条看的,你是那样做的。
Hamel Husain: 对对,我们已经做过了。我们已经手动过了一遍了。不需要再做一遍,因为我们已经有了轴心编码那个捷径,我们已经做过。如果你需要更多数据,可能需要再过一遍,这里面有很多关于如何正确操作的细节。你要拆分数据,做所有这些事情,确保你没有作弊,但我只是想展示这个概念。基本上,你可以测量一致性。现在有一件事你应该知道,作为产品经理,很多人会直接去看这个一致性。他们说,“好,我的 judge 在一定百分比的情况下跟人类一致。”
这听起来很诱人,但这是一个非常危险的指标,因为很多时候,错误只发生在长尾上,频率并不高,所以如果错误只出现10%的时间,那你只要让 judge 永远说通过,就能轻松达到90%的一致性。明白吗?90%的一致性看起来很好,但可能会有误导性。
Lenny Rachitsky: 因为是罕见错误。对。
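The arithmetic behind this warning is easy to reproduce. The sketch below uses made-up data: 100 traces, 10 of which a human marked as containing the failure, and a degenerate judge that never flags anything. The helper name `agreement` is hypothetical.

```python
def agreement(human: list[bool], judge: list[bool]) -> float:
    """Share of traces where the judge's verdict matches the human label."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    if not human:
        return 0.0
    return sum(h == j for h, j in zip(human, judge)) / len(human)


# Hypothetical data: True means "failure present".
human_labels = [True] * 10 + [False] * 90
always_pass = [False] * 100  # a judge that always says "pass"

overall = agreement(human_labels, always_pass)  # 0.9
failures_caught = sum(h and j for h, j in zip(human_labels, always_pass))  # 0
```

The judge agrees with the human 90% of the time while catching none of the ten real failures, which is why a single agreement number is not enough.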
Hamel Husain: 作为产品经理,或者即使你自己不做这个计算,如果有人向你汇报一致性,你应该立刻追问,“好,详细说说。“你需要深入看。他们给你更多直觉,这里是这个特定 judge 在 Google 表格中的一个矩阵,同样,这只是一个数据透视表,保持简单。“好,行上是,人类怎么想的?我怎么想的?有没有错误,true 还是 false?然后 judge 有没有错误,true 还是 false?”
Shreya Shankar: 这里的直觉正是 Hamel 说的,你需要看每种类型的错误。当人类说 false 但 judge 说 true,或者反过来,就是这里那些不在对角线上的非绿色格子。如果这些格子里的数字太大,就去迭代你的 prompt,让 LLM judge 更清楚,这样你就能减少那种不一致。你要达到一个程度,大多数情况……你会有一些不一致,这没关系。我们在课程里也讲了如何用代码纠正那种不一致,但在现阶段,如果你是产品经理,而构建 LLM judge 评估的人没有做这一步,他们只是说“一致性 75%,没问题”,但他们没有这个矩阵,也没有迭代以确保这两种类型的错误降到接近零,那就是一个坏味道。去要求他们修复。
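The two-by-two matrix described here, and the two error types read off it, can be computed as below. This is a sketch under the assumption that both the human labels and the judge verdicts are booleans where `True` means the failure is present; the function names are hypothetical.

```python
from collections import Counter


def confusion_counts(human: list[bool], judge: list[bool]) -> Counter:
    """Count (human_label, judge_label) pairs, i.e. the four matrix cells."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    return Counter(zip(human, judge))


def true_rates(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Return (true positive rate, true negative rate) of the judge."""
    cells = confusion_counts(human, judge)
    tp, fn = cells[(True, True)], cells[(True, False)]
    tn, fp = cells[(False, False)], cells[(False, True)]
    tpr = tp / (tp + fn) if tp + fn else 0.0  # real failures the judge caught
    tnr = tn / (tn + fp) if tn + fp else 0.0  # clean traces the judge passed
    return tpr, tnr
```

The off-diagonal cells are `(True, False)` (a missed failure) and `(False, True)` (a false alarm). Iterating on the judge prompt aims to shrink both, not just to raise overall agreement.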
Lenny Rachitsky: 太棒了。这是一个很好的提示,告诉你当别人做错了的时候该看什么。
Shreya Shankar: 对。
评估即需求文档
Lenny Rachitsky: 其实,能不能带我们回到那个 LLM as a judge 的 prompt?我想强调一个非常有意思的地方。最近我播客上的一些嘉宾一直在说,“评估就是新的 PRD”,你看看这个,确实正是如此。产品经理、产品团队,这是产品应该是什么样的,这是所有的需求,这是它应该如何运作。他们构建了一个东西,然后测试它。往往是手动的。这个做法厉害的地方在于,它做的完全是一样的事,而且在持续运行。它告诉你,“这个 agent 应该这样回应”,而且是非常具体的方式。“如果是这样、这样、这样,就那样做。如果是这样、这样、那样,就那样做。” 这正是我反复听到的,你可以在这里直接看到。这就是产品需求文档最纯粹的形式——这个评估 judge 在告诉你它应该是什么样,而且是自动化的、持续运行的。
Shreya Shankar: 对,完全同意。它是从我们自己的数据中推导出来的,所以当然,它代表的是产品经理的期望。我发现很多人忽略的一点是,他们先把自己的期望写进去,而不先看数据,但随着我们查看数据,我们会发现更多一开始根本想不到的期望,这些最终都会进入这个 prompt。
Lenny Rachitsky: 这很有意思。你的建议不是在构建产品之前就直接跳到评估和 LLM as a judge prompt,仍然要写传统的单页 PRD 来告诉团队我们在做什么、为什么做、成功的标准是什么。但到了最后,你可以从中提取内容,甚至在使用这个流程迭代产品时改进最初的那份 PRD。
Shreya Shankar: 我甚至更进一步说——你会改进……它会变的。你永远不可能事先知道失败模式会是什么,你总是会发现自己希望产品具备的新感觉。在你亲眼用这些 LLM 看到之前,你并不真正知道自己想要什么,所以你必须保持灵活,必须看数据,必须……PRD 是思考这些问题的一个很好的抽象。但它不是全部,也不是终点。它会变。
Lenny Rachitsky: 说得好。Hamel 刚刚调出了一个很酷的研究报告。这是关于什么的?
Hamel Husain: 如果你想了解评估,这大概是你能读到的最棒的研究报告之一。它的作者是一位名叫 Shreya Shankar 的人。
Shreya Shankar: 天哪。
Hamel Husain: 还有她的合作者。论文题目叫“Who Validates the Validators?”。
Lenny Rachitsky: 这是研究论文能取的最好的名字了。
Shreya Shankar: 谢谢,谢谢。
Hamel Husain: 我应该让 Shreya 来讲这个。我觉得这篇论文中最值得关注的一点是标准漂移(criteria drift),以及她的发现。
Shreya Shankar: 我们做了这个超级有趣的研究,当时我们在对一些试图编写 LLM judge 或只是验证自己 LLM 输出的人做用户研究。我觉得那时候是在评估在网上变得极其火爆之前。我们是 2023 年底开始这个项目的。但作为一个研究者,一直在我脑子里燃烧的问题是,“为什么这个问题这么难?我们已经有了机器学习和 AI 这么长时间,它不是新东西,但突然之间,这一次,一切都变得非常困难。” 我们就和一群开发者做了这个用户研究,我们意识到,“好的,新的地方在于你没法事先确定你的评分标准。人们在审阅更多输出后,对好和坏的看法会发生变化,他们只有在看了 10 个输出之后才会想到一些之前根本想不到的失败模式,“而且这些都是专家。这些人之前已经构建过很多 LLM 流水线,现在又构建了 agent,但你永远不可能在最初就把一切都想出来。我觉得这在当今 AI 开发中非常关键。
Lenny Rachitsky: 这确实是个非常好的观点。它非常有力地印证了我们刚才讨论的内容,所以我把它调出来,就是……好的——
Shreya Shankar: 背后是有研究支撑的。
Lenny Rachitsky: 对,好的,没问题。你还是得以同样的方式做产品,但现在你有了一个非常强大的工具,帮助你确保构建的东西是正确的。它不会取代 PRD 流程。好的。那你通常会最终有多少个——比如说——LLM as a judge prompt?我知道这显然取决于产品的复杂度,但根据你的经验,大概是个什么数字?
Shreya Shankar: 对我来说,四到七个之间。
Lenny Rachitsky: 就这些?
Shreya Shankar: 没有那么多,因为很多失败模式,正如 Hamel 之前说的,可以通过直接修改你的 prompt 来修复。你只是没想过把它写进 prompt 里,所以现在把它加进去……你不应该为每件事都做这样的评估,只针对那些棘手的问题——你已经把理想行为写进了 agent prompt,但它仍然会失败的情况。
Lenny Rachitsky: 明白了。假设你发现了一个问题,你修复了它。在传统软件开发中,你会写一个单元测试来确保它不再发生。你这里的观点是不是说,“如果问题已经没了,甚至都不用费心为它写评估”?
Shreya Shankar: 我觉得如果你想写也可以,但这里的关键是关于优先级。你的资源和时间都是有限的,你不可能为所有事都写评估,所以优先处理那些更棘手的领域。
Lenny Rachitsky: 可能是那些如果出了问题对业务风险最大的——比如说了类似 Mecha Hitler 这种话的,Grok。
Shreya Shankar: 哎呀。
Lenny Rachitsky: 好。那这很让人放心,因为这个 prompt 确实花了很大功夫去想清楚所有这些细节。
Shreya Shankar: 但它主要是一次性投入。现在写好了,以后你可以永远在你的应用上运行它。
数据分析的进阶技巧
Hamel Husain: 好,数据分析非常强大,会给你的应用带来非常快速的改进。我们展示了最基本的数据分析方式——计数,这对所有人来说都是可用的。你可以用更复杂的方式来做数据分析。有很多不同的采样和查看数据的方法。我们让它看起来很简单,但要做好其实需要很多技能。需要培养一种直觉和嗅觉来筛选这些数据。比如,假设我发现了对话问题,对话流程的问题。如果我想进一步追踪这个问题,我会想办法找到其他我还没有编码的对话流程问题。我可能会用多种方式深挖数据,有不同的切入路径。这非常类似于——如果不说是完全一样的话——你对任何产品都会做的传统分析技术。
Lenny Rachitsky: 给我们简单说一下接下来是什么,然后我们来聊聊关于评估的争论,还有其他几件事。
LLM judge 的后续应用
Shreya Shankar: 构建好 LLM judge 之后接下来做什么?我们发现人们会尽可能把它用在所有地方——把 LLM judge 放进单元测试里,然后构建出这样的测试集:“这里是一些我们观察到故障的示例 trace,因为我们已经标注过了。现在我们要把它们变成单元测试的一部分,确保每次推送代码变更时,这些测试都能通过。” 他们还用它做线上监控。人们在上面搭建仪表盘,我觉得这太棒了。我认为做到这一点的产品对自身应用的表现有着非常敏锐的感知,但人们不会去谈论这件事,因为这是他们的护城河。人们不会去分享所有这些东西,因为理所当然——如果你是一个邮件写作助手产品,你在做这件事而且做得很好,你不会希望别人也去做一个邮件写作助手然后把你挤出市场。
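Turning previously failing traces into a regression check could look like the sketch below. It assumes two callables that are not defined in the conversation: `run_app`, which replays a user input through the current version of the application, and `judge`, which returns `True` when the failure mode is present in the new output. Both names, and `still_failing`, are hypothetical.

```python
from typing import Callable


def still_failing(
    failing_inputs: list[str],
    run_app: Callable[[str], str],
    judge: Callable[[str], bool],
) -> list[str]:
    """Replay inputs that once triggered a failure; return those that still do."""
    return [
        user_input
        for user_input in failing_inputs
        if judge(run_app(user_input))
    ]


def check_no_regressions(
    failing_inputs: list[str],
    run_app: Callable[[str], str],
    judge: Callable[[str], bool],
) -> None:
    """Raise AssertionError if any known failure reappears (usable in CI)."""
    remaining = still_failing(failing_inputs, run_app, judge)
    assert not remaining, f"{len(remaining)} known failure(s) reproduced: {remaining}"
```

The same `judge` callable can then be reused for the online monitoring the speakers describe, so the test suite and the dashboard measure the same thing.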
我真的想强调的一点是:尽量把你构建的这些产物在线上所有能用的地方都用起来,反复使用它们来驱动产品改进。很多时候,Hamel 和我会教人们一路做到这一步,然后他们就会恍然大悟,之后就再也不会回来了。要么是——我不知道——他们辞职了,不再做 AI 开发了,要么就是他们从这里开始已经知道该怎么做了。我觉得是后者,而且我认为这非常强大。
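Lenny Rachitsky: Watching you do this really opened my eyes to what this is and how systematic the process is. I always imagined you'd just sit at your computer thinking, "Okay, what do I need to make sure works?" But what you showed here is a very simple, step-by-step approach grounded in what actually happens in your product: how to catch problems, identify them, prioritize them, and then catch and fix them when they happen again.
Shreya Shankar: Yeah, it's not magic. Anyone can do it, but you need to practice the skill, like any new skill, and you can do it. What I find really empowering right now is that product managers are doing this too, can do this too, and can really build very profitable products with this skill set.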
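The Controversy Around Evals
Lenny Rachitsky: Great, that's a good segue to the debate we got pulled into on X the other day. I did not realize how much controversy and drama there is around evals. There's a lot of people with very strong opinions. Shreya, how about you? Give us a rough sense of the two sides of this debate over the importance and value of evals, and then tell us where you land.
Shreya Shankar: Sure. I'll be a bit of a peacemaker here: I think everyone is actually on the same side. I think the misunderstanding is that people have very rigid definitions of evals. For example, they might think evals are just unit tests, or that evals are only the data analysis part, with no online monitoring and no monitoring of product-specific metrics, like actual chat engagement counts. I think everyone comes in with a different understanding of evals. The other thing I'd say is that people have been burned by evals in the past. People have done evals badly. A concrete example is that they tried an LLM judge, but it wasn't aligned with their expectations. They only found that out later, so then they didn't trust it anymore, and then they're like, "Oh, I'm anti evals."
I 100% understand that feeling, because you should be against LLM judges built on Likert scales. I completely agree, and we're against those too. A lot of the misunderstanding comes from two things: first, people define evals too narrowly, and second, people did evals badly, got burned, and want to keep others from making the same mistake. And then, unfortunately, X, or Twitter, is a platform where people keep misreading each other, so you see all these strong claims: "Don't do evals, they're bad. We tried them, they don't work. We're Claude Code," or whatever well-known product, "and we don't do evals." There's a lot of nuance behind that, because many of these applications stand on the shoulders of evals themselves. Coding agents are a great example, and Claude Code is one. They stand on the shoulders of the Claude base model... not the base model, but the fine-tuned Claude models, which have already been evaluated on many coding benchmarks. You can't deny that.
Lenny Rachitsky: To make what you're describing clearer: one of the leads on Claude Code, I think the lead engineer, went on a podcast and said, "We don't do evals, we just go on vibes." By vibes, they mean they just use it and judge by feel whether it's right.
Shreya Shankar: I think that can work. There are two things here. First, they're standing on the shoulders of the evals their colleagues did for coding.
Lenny Rachitsky: Meaning the evals of the Claude base model.
Shreya Shankar: Of course, right? We know they report those numbers, because we can see the benchmark results and we know who does well on them. The second thing is that they're probably very systematic about error analysis in some way. I'd bet they monitor who uses Claude, how many people use it, how many traces get generated, and how long the chats are. They're probably also dogfooding on their internal team. Any time something is off, they probably have a queue, or they send the issue to the person who builds Claude Code, and that person is implicitly doing a form of error analysis, the kind Hamel talked about. All of that is evals, right? There's no world in which they just say, "I built Claude Code and I don't look at anything." And unfortunately, when people don't think or talk about this, I think the community...
Most of the community is actually beginners, or people who don't know evals and want to learn, and this sends them the wrong message. Of course, I don't know exactly what Claude Code does, but I'd bet they're doing evals in some form.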
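Hamel Husain: I'd also say coding agents are fundamentally different from other AI products, because the developer is the domain expert, so you can take a lot of shortcuts. Also, the developers use it all day, so there's a kind of dogfooding and domain expertise... You can compress these activities. You don't need as much data, feedback, or exploration, because you already know, so your eval process should look different.
Lenny Rachitsky: Because you can see the code, you see the code it generates. You can tell, "This is good, this is terrible."
Hamel Husain: Right, right. I think a lot of people have over-generalized from what coding agents do, because coding agents were among the first AI products released into the wild, and I think it's a mistake to generalize that approach broadly to other domains.
Shreya Shankar: One more thing: engineers have a dogfooding personality. But in many application domains, people are trying to build AI where dogfooding isn't possible. It doesn't work for a product aimed at doctors, for example. You can't have doctors go out and take all the wrong advice the AI gives and still be tolerant and accepting of it. I think it's really important to keep these nuances in mind.
Lenny Rachitsky: What's interesting, Shreya, is that what I'm hearing from you is that if the humans on the team are doing very careful data analysis, error analysis, and dogfooding like crazy, they are essentially human evals themselves, and you count that as part of evals. If you have the time and motivation, you can do that, or you can set these things up to be automated.
Shreya Shankar: Totally agree, and it's also about skill. The people who work at Anthropic are very, very highly skilled. They're trained in data analysis, software engineering, AI, and so on. Of course, anyone can get there by learning the concepts, but most people don't have that skill today.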
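The Limits of Dogfooding
Hamel Husain: Dogfooding is something to be wary of, because a lot of people will say they're dogfooding. They say, "Yeah, we dogfood," but do they really? A lot of people aren't dogfooding at that truly visceral level that closes the feedback loop. That's the only caveat I'd add.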
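Evals vs. A/B Tests
Lenny Rachitsky: Another topic that feels like a straw man is evals versus A/B testing. What's your take? It seems to be a big part of this discussion. People are arguing, "If you have A/B tests checking production metrics, do you still need evals?"
Shreya Shankar: A/B tests are, I think, just another form of evals, right? When you run an A/B test, you have two different experimental conditions and a metric that quantifies how successful something is, and you're comparing that metric. Likewise, in our view, evals are the systematic measurement of quality, some metric. You can't run an A/B test without an eval to compare, so maybe we just have a different, slightly weird take on this.
Lenny Rachitsky: Okay. What I'm hearing is that you see A/B tests as part of the set of evals you run. I think when people think about A/B tests, they're saying we changed something in the product and we're seeing whether it improves a metric we care about. Is that enough? Why do we need to test every little feature if it moves a metric we care about as a business and we have a bunch of A/B tests constantly running?
Shreya Shankar: That raises a great point. I think a lot of people A/B test prematurely, because they've never done any error analysis. They've only hypothesized their product requirements and then decided, "We should test these things," but in reality, when you dig into the data, like Hamel showed, the errors you see aren't the ones you expected. They're weird handoff problems, or, I don't know, the text message thing was strange. What I'd say is, if you're going to run A/B tests and they're driven by real error analysis, like what we showed today, great, go do it. But if you just want to run A/B tests, which we find a lot of people do, based only on what you hypothetically think matters, I'd encourage you to go back, rethink, and ground your hypotheses.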
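Statsig Acquired by OpenAI
Lenny Rachitsky: Do you have any thoughts on what Statsig is going to do after being acquired by OpenAI? Anything interesting there? It's a big deal, a huge acquisition. An A/B testing company got acquired, and people started saying, "A/B testing is the future." Any thoughts?
Hamel Husain: Just to add to the last question, why is there an evals versus A/B testing debate at all? I think, fundamentally, evals are... people are trying to figure out how to improve their applications, and what they fundamentally need is... data science is useful in products. Look at the data, do data analysis. There are many tool suites, and you don't need to invent anything new. Of course, you don't necessarily need the full breadth of data science, and it looks slightly different with LLMs, but only slightly. Your specific tactics may differ, so it's essentially using analytical tools to understand your product. Now people say the word "evals" and try to carve out a new field, and then talk about evals versus A/B tests, but if you step back, it's the same data science as before. I think the confusion comes from exactly that: "Hey, we need a data science mindset." AI products need it, just like any product does. That's my take.
Lenny Rachitsky: That's a really great point. I think it's just that the word "evals" triggers people right now.
Shreya Shankar: Yeah.
Lenny Rachitsky: If you just said, "We're just doing error analysis, doing data science to understand where our product breaks, and then setting up tests to make sure we know..."
Shreya Shankar: That's so boring, it sounds boring. No, no, no. We need a mysterious term like "evals" to really get momentum.
As for what you asked about Statsig, I think it's very exciting. Honestly, I don't know much about it, because I just picture it as a company with a tool that a lot of people use, and maybe OpenAI just happened to acquire them. I'm sure they were using it before, and I'm sure OpenAI's competitors use Statsig too, so maybe there's something strategic about the acquisition. I have no idea, I have no information on that. But to me those are the bigger questions, rather than, "Does this fundamentally change A/B testing, or make evals more of a priority?" I think evals have always been important, and I think OpenAI has always done some form of evals. Historically, OpenAI has gone even further: they would look at all the sentiment on Twitter, try to do some retrospective analysis, and tie it back to the product. Of course, they do a certain amount of evals before releasing a new foundation model, but they go further and say, "Okay, let's find all the tweets complaining about it, all the Reddit posts complaining about it, and figure out what's actually going on." That shows evals are very, very important. No one has really figured it out yet. People are using every signal source they can get to improve their products.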
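Hopes for the Future of Eval Tooling
Hamel Husain: What I'd say is that I have high hopes for this: it may shift or create a center of gravity inside OpenAI, hopefully. So far, the big labs have naturally focused on general benchmarks, like MMLU scores, HumanEval, and so on, which are very important for foundation models. But those don't correlate much with product-specific evals, like the handoff issues we discussed today. They're often unrelated.
Shreya Shankar: Right, they don't correlate with math problem solving, sorry to say.
Hamel Husain: Exactly. If you look at the eval products, say the ones the big labs had until recently, they have no error analysis features. They have a set of generic tools, cosine similarity, hallucination score, and so on, but that doesn't work. It's a first attempt, and that's fine. At least you're doing something and getting people to look at data. But ultimately, what we want to see is more data science thinking built into the eval process. Hopefully our tools get there.
Shreya Shankar: Right, Hamel and I shouldn't be the only two people on the planet promoting a structured way to think about application-specific evals. That's mind-boggling to me. Why are we the only two people in the world doing this? Something's off. I hope we're not the only two, and I hope more people catch on.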
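Demand and Market Momentum for Evals
Lenny Rachitsky: Your course is the highest-grossing course on Maven, which in itself shows the demand is real, and I think more and more people are coming over to your side. Interestingly, there's an example you've been sharing on Twitter that I find illuminating: everyone was saying Claude Code doesn't care about evals, they're all about vibes, and people said they're the best coding agent, so obviously that's right. But lately, everyone is talking about Codex, saying OpenAI Codex is better, and people are switching over, and they are big believers in evals.
Shreya Shankar: I know.
Lenny Rachitsky: Yeah.
Shreya Shankar: It makes me laugh and cry every time. People on the internet are so inconsistent. One of my favorite moments was yesterday: I was out for dessert or something with a few labmates, and someone asked, "Do you prefer Codex or Claude?" Another person said, "I like Claude." Then someone else said, "But the new Codex is better." Then the first person said, "Oh, but I last checked two days ago, so my opinion... maybe I'm out of date." And I thought, oh my God.
Lenny Rachitsky: It's true, it's true. That's the world we live in. Oh man. Okay, I want to ask about the most common misconceptions people have about evals, and the highest-leverage tips for doing evals well. Maybe each of you can share one or two. Let's start with misconceptions, and I'll go to Hamel first. What are the most common misconceptions people have about evals?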
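Common Misconceptions About Evals
Hamel Husain: The top one is, "Hey, I can just buy a tool, plug it in, and it'll do the evals for me. Why do I have to worry about this? We live in the age of AI. Can't the AI just eval it?" That's the most common misconception, and people want it so badly that some vendors really do sell it, but it doesn't work. That's the first one.
Lenny Rachitsky: Nonsense. Humans still matter. I think that's good news.
Hamel Husain: The second one I see a lot is not looking at the data at all. In my consulting work, people come to me with a problem, and the first thing I say is, "Let's go look at your traces." You can see their eyes go wide: "What do you mean?" I say, "Yeah, right now." They're surprised that I'd look at individual traces, and every time, 100% of the time, we learn a lot and find the problem. I think people just don't know how powerful looking at data is, like we demonstrated on this podcast.
Shreya Shankar: I agree.
Lenny Rachitsky: So those are the top two? Okay.
Shreya Shankar: Yeah. And the third misconception I'd add is that there's one correct way to do evals. There isn't. There are many incorrect ways, but there's also more than one correct way. You need to think about what stage your product is at and how many resources you have, and then come up with the plan that fits you best. It will always involve some form of error analysis, like we showed today, but how you operationalize those metrics will change depending on your stage.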
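Practical Advice for Evals
Lenny Rachitsky: Great. Okay, what tips and advice do you want to leave people with, whether they're just getting started with evals or want to improve what they already do?
Shreya Shankar: The first tip is: don't be afraid to look at your data, and don't panic. We try to make this process as structured as possible, but all kinds of questions will inevitably come up, and that's totally fine. You might feel like you're not doing it perfectly, and that's totally fine too. The goal is not to do evals perfectly, it's to actionably improve your product. We guarantee that however you do it, as long as you do some part of this process, you'll find things you can actionably improve, and then you'll keep iterating on your own process.
The other tip is that we're very pro AI. Use LLMs throughout the process to help you organize your thinking. That can be anything: starting from the initial product requirements, figuring out how to organize those requirements for yourself, figuring out how to improve the product requirements doc based on the open codes you created. Don't be afraid to use AI to help you present information better.
Lenny Rachitsky: Great, so don't be afraid, and use LLMs as much as you can throughout the process.
Shreya Shankar: But not to replace yourself.
Lenny Rachitsky: Right. Okay, great. The jobs are still there, great. Hamel?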
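Building Tools for Looking at Data
Hamel Husain: Yeah. Let me share my screen, because I want to show something. Building on what Shreya said, if you remember one thing from this podcast, it should probably be the words "look at your data." It's so important that we teach that you should create your own tools to make it as easy as possible. Earlier in the live demo I showed some tools for annotating data. Most people I work with, once they realize how important this is, vibe code their own tools, or rather, they build the tools themselves. It's cheaper to do that now than ever before, because AI can help you.
AI is very good at creating simple web apps: showing data, writing to a database, all simple. In the Nurture Boss case, we wanted to remove all friction from looking at data. The screenshots you're seeing are what the app they built looks like. It's just, okay, there are different channels: voice, email, text; there are different conversation threads; the system prompt is hidden by default. Small quality-of-life improvements. Then they also have this axial coding part, where you can see the error counts in red. They automated that part, made it look nice, and built it in a few hours. Building a general-purpose tool for looking at data is really hard, and you don't have to go this far at the start, but it's worth thinking about: make looking at data as easy as possible, because, again, it's the most powerful activity you can engage in, the highest ROI activity. With AI, just remove all the friction.
Lenny Rachitsky: Awesome. Again, I think the ROI point is really important, and we haven't even fully unpacked it. The goal here is to make your product better and, in turn, make your business more successful. This isn't just a little exercise for catching bugs. It's how you make an AI product good, because the user experience is how users interact with your AI.
Hamel Husain: Totally agree. We even teach our students, "Hey, when you're doing these evals, if you see something wrong, just go fix it." The whole point is not to have a beautiful eval suite you can point to and show off: "Look at my evals." No, just fix your application and make it better. If the problem is obvious, just do it. Totally agree with you.
Lenny Rachitsky: Great. One more question I haven't asked, but I think everyone's wondering: how long does this take? How long does it usually take the first time?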
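Time Investment and Upkeep
Shreya Shankar: I can answer for the applications I've worked on. Usually, I spend three to four days doing the initial error analysis with the relevant people. That means a lot of annotating, getting to the point where we can create a spreadsheet like Hamel's, getting everyone aligned and convinced, and even building a few LLM judge evaluators. But that's a one-time cost. Once I've figured out how to integrate it into unit tests, or written a script that runs automatically on samples and set up a cron job to run it every week, then after that I probably, I don't know, I find I might spend more time looking at data because I'm just data hungry. I'm just too curious. I get so much out of this process, and it gives me a big edge in any collaboration, so I want to keep doing it, but I don't have to. After that, maybe 30 minutes a week.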
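The "script that runs on samples, scheduled weekly" idea might look roughly like the sketch below. Everything here is assumed for illustration: the trace data is synthetic, `placeholder_judge` stands in for a real LLM judge call, and the file name in the cron comment is hypothetical.

```python
import random

# Example crontab entry (hypothetical file name), running Mondays at 09:00:
#   0 9 * * 1 python weekly_eval.py

def sample_traces(traces, k, seed=None):
    """Pick up to k traces uniformly at random for this week's review."""
    rng = random.Random(seed)
    return rng.sample(traces, min(k, len(traces)))

def pass_rate(traces, judge):
    """Fraction of traces the judge passes; 0.0 for an empty sample."""
    if not traces:
        return 0.0
    return sum(1 for t in traces if judge(t)) / len(traces)

def placeholder_judge(trace):
    """Stand-in for an LLM judge call; passes any non-empty output."""
    return bool(trace["output"].strip())

# Synthetic data; a real script would load recent production traces instead.
traces = [{"id": i, "output": "ok" if i % 5 else ""} for i in range(100)]
weekly = sample_traces(traces, k=20, seed=0)
print(f"sampled={len(weekly)} pass_rate={pass_rate(weekly, placeholder_judge):.2f}")
```

Sampling a fixed number of traces keeps the weekly review small enough to fit in the 30 minutes mentioned above, while the pass rate gives a number to watch over time.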
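Lenny Rachitsky: So basically a week up front, and then 30 minutes a week to keep improving and expanding your eval suite?
Shreya Shankar: Yeah, it really doesn't take that much time. I think people just get scared off by the up-front time and assume they have to keep that up forever.
Lenny Rachitsky: Awesome. Is there anything else you want to share or leave listeners with? Any points you want to double down on before we get to the very exciting lightning round?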
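The Fun of the Process
Hamel Husain: What I'd say is that this process is actually a lot of fun. It's like, okay, you have to look at data, oh, that sounds like labeling. Actually, I was looking at a client's data just yesterday, the exact same process. It's an application that sends emails, recruiting emails, trying to get candidates to apply for a job. We decided to start looking at traces. "Let's look at your traces." We looked at one trace, and the first thing I saw was an email worded, "Given your background, blah blah blah." I immediately asked the person, and this is where you put on your product manager hat and get critical, and it's also where the fun is. I said, "You know what? I hate this email. Do you like this 'Given your background' email?" When I get a message that starts with "Given your background," I just delete it. I think, "What is this, 'Given your background in machine learning' or whatever?" To me that's template stuff. I asked them, "Can we do better? This sounds like a generic recruiting email." They said, "Oh, yeah, maybe." Because before that they were pretty proud of it: "The AI did the right thing. It sent the right email, with the right information, the right link, the right name, everything is right." That's where it gets fun: putting on your product manager hat and digging in. Is this actually good enough?
Lenny Rachitsky: Before we get to the exciting lightning round, I want to make sure we mention that this is just the tip of the iceberg of what you need to know to do this well. But I think it's the best primer I've seen on how to do evals well.
Shreya Shankar: Okay.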
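The Course and Going Deeper
Lenny Rachitsky: I think we did it. But the two of you teach a course that goes much deeper for people who really want to master this and take it seriously. Share what you teach in the course that we didn't cover, and what else people get as students of your course on Maven.
Shreya Shankar: I can talk about the syllabus, and then Hamel can cover all the perks. We go through a full life cycle: error analysis, then automated evaluators, then how to improve your application, how to build that kind of flywheel for yourself. We also have some special topics that, as far as we can tell, almost no one has heard of or teaches, which is exciting. One is how to build your own interfaces for error analysis. We show the interfaces we've actually built, and we also code live on new data. We show how to build these interfaces with Claude Code, Cursor, or whatever tool you're in the mood for that day. We also talk extensively about cost optimization. Some people I've worked with get to a point where their evals are great and the product is great, but everything is very expensive because they're using state-of-the-art models. How can we swap the most expensive GPT-5 for 5-nano, 4-mini, or the like in some cases, save a lot of money, and keep the same quality? We give some advice on that too. Hamel, over to you. We also have lots of perks.
Lenny Rachitsky: Okay, tell us about the perks.
Hamel Husain: Okay, perks. My favorite perk is a 160-page book we created, written in great detail, that walks through the whole eval process step by step to go with the course. You don't have to sit there taking all the notes. We've already done the hard work for you, documenting and organizing everything in detail. That one is really useful. Another fun thing, and I got the inspiration from you, Lenny: this is an AI course. Education shouldn't just be watching lectures and doing homework. Students should also have access to an AI that helps them learn. What we did is, just like you have LennyBot.
Lenny Rachitsky: .com.
Hamel Husain: Right, lennybot.com. We built the same thing with the same software you use, and put in everything we've ever said about evals. Every lesson, every office hours session, every Discord chat, any blog post or paper, everything we've said publicly and in the course, it's all in there. We tested it with a group of students, and they found it helpful. We're giving all students 10 months of free, unlimited access along with the course.
Lenny Rachitsky: Awesome. And then it starts charging after that?
Hamel Husain: I don't know. I'm taking it month by month. I don't know where this thing is going.
Lenny Rachitsky: Let's talk again in eight months. I was just thinking this whole interview could have been our bots talking to each other.
Shreya Shankar: That would be so fun. I'd watch, but only for 10 minutes, and after that I wouldn't know what they were saying.
Lenny Rachitsky: Hmm, maybe 30 seconds. By the way, did you train it with voice mode? That's my favorite feature of the Delphi product. If not, you should try it.
Hamel Husain: Oh, I think... I don't remember, I'll have to check.
Lenny Rachitsky: You have to do it. Now that you have this podcast episode, you can train it on that. It's powered by 11Labs, and it works really well. Okay, so how do they... I guess this is all it takes: they sign up for the course and they get all of this.
Shreya Shankar: Yeah, after signing up you'll get a series of emails. Everything should be clear, hopefully.
Lenny Rachitsky: Awesome. Okay.
Shreya Shankar: We also have a Discord with all the students who've taken the course. That Discord is very active. I can't escape the notifications even on a plane going on vacation.
Lenny Rachitsky: Bittersweet. Incredible. Okay, with that, we've reached the very exciting lightning round. I have five questions. Ready?
Shreya Shankar: Ready. Let's go.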
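Lightning Round
Lenny Rachitsky: Here we go. Okay, I'll alternate between the two of you. Answer if you want, skip if you want. First question, Shreya: what are two or three books you most often recommend to other people?
Shreya Shankar: I like to recommend a novel, because there's more to life than evals. Recently I read Pachinko by Min Jin Lee, a really great book. I'm also reading Apple in China. The author's name escapes me right now, but it's more of a nonfiction account, written by a journalist, about how Apple ran a huge amount of its manufacturing in Asia over the past few decades. It's very eye-opening.
Lenny Rachitsky: Awesome. Hamel?
Hamel Husain: Okay, I've got them right here. I'm a nerd. Okay, I'm not as cool as Shreya. My favorites are actually textbooks. This one is a real classic, Machine Learning by Mitchell. It's on the theoretical side, but what I like about it is that it really drives home that Occam's razor applies not only to science but also to machine learning and AI, and engineering too: a lot of the time, the simpler approach generalizes better. That's a lesson I deeply internalized from that book. I also really like this one, another textbook. I told you I'm a nerd. This one is also very old. It's Algorithms by Norvig. I like it because it shows off human ingenuity, with lots of clever and useful computing techniques.
Shreya Shankar: He's just down the street, him and Berkeley.
Lenny Rachitsky: The person who did that research?
Shreya Shankar: Yeah, the textbook author.
Lenny Rachitsky: So cool. Oh man, nerds, I love it. Okay, next question. Favorite recent movie or TV show? Let's start with Hamel.
Hamel Husain: Okay, I'm a dad of two, sorry, two kids. Yeah, I'm a dad of two kids and basically have no time for TV or movies, so I watch whatever the kids are watching. Last week I watched Frozen three times.
Lenny Rachitsky: Only three times? Oh okay, last week. Okay.
Hamel Husain: That's my life.
Lenny Rachitsky: Great, Hamel. Frozen, I love it. Okay, Shreya.
Shreya Shankar: I don't have kids, so I can give a great answer. Actually, my husband and I have been watching The Wire lately. We didn't watch it growing up, so we started, and it's really good.
Lenny Rachitsky: I feel like everyone goes through this phase, where at some point in life you decide, I'm going to watch The Wire.
Shreya Shankar: Yeah, that's the phase we're in.
Lenny Rachitsky: That's basically going to take you a year. It's great, a really good show. Oh man, but there are so many episodes, and each one is an hour.
Shreya Shankar: I know, I know.
Lenny Rachitsky: It's a huge commitment.
Shreya Shankar: We get through two or three episodes a week, so it's slow going.
Lenny Rachitsky: Worth it. Okay, next question. Is there a product you've recently discovered and really love? Let's start with Shreya.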
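Shreya Shankar: Honestly, I really love using Cursor. And Claude Code. Let me explain why. I'm a researcher first and foremost. I write papers, write code, build systems, everything. I find a tool... I'm very bullish on AI-assisted coding, because I always have to wear many hats. Now I can be more ambitious about what I build and what I write papers on, so I'm very excited about these tools. Cursor was my entry point, but I find myself constantly trying to keep up with all the updates to these AI-assisted coding tools.
Lenny Rachitsky: Hamel?
Hamel Husain: I love Claude Code too, and I love it because I think the UX is outstanding. A lot of care went into it. It's really impressive that a terminal application can be this polished.
Lenny Rachitsky: The irony is that you both love Claude Code, and it was built on nothing but vibe checks.
Shreya Shankar: I think that's wrong. It's not just vibe checks.
Lenny Rachitsky: Okay. Two more questions. Hamel, do you have a favorite life motto that you often come back to in work or in life?
Hamel Husain: Keep learning. Think like a beginner.
Lenny Rachitsky: Great. Shreya?
Shreya Shankar: I like that. For me, it's to always try to see things from the other person's point of view. Sometimes I see arguments online, like the debates over evals, and I seriously think, "Okay, put yourself in their shoes, maybe there's a charitable reading." I think we're much stronger united than fighting each other. My vision for evals isn't for Hamel and me to become billionaires, it's for everyone to be able to build AI products, with everyone on the same page.
Lenny Rachitsky: And for everyone to become a billionaire.
Shreya Shankar: Right.
Lenny Rachitsky: Awesome. Last question. I always like asking this when I have two guests. Let's start with Hamel. What's your favorite thing about Shreya? Shreya, I'll ask you the same in a moment.
Hamel Husain: Okay. Shreya is one of the wisest people I know, especially given how much younger she is than me. Honestly, I think she's much wiser than I am. She's very grounded and has a very even-keeled perspective. I've always been impressed by that.
Lenny Rachitsky: Shreya?
Shreya Shankar: My favorite thing about Hamel is his energy. I don't know anyone who stays as driven and energetic as Hamel. I often think that if it weren't for Hamel, my enthusiasm for evals might have faded long ago. Everyone needs a Hamel in their life, for sure.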
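Lenny Rachitsky: All right, now we all have a Hamel in our lives. This was awesome, exactly what I hoped it would be. I think this is the most fun, in-depth, and accessible primer on evals I've seen. Thank you both so much for making the time. Two final questions. Where can people find you? Where can they find the course? And how can listeners be useful to you? Let's start with Shreya.
Shreya Shankar: You can reach me by email, and the address is on my website. Google my name and you'll find my website, which is the easiest way. For the course, if you Google "AI Evals for engineers and product managers," or just search "AI Evals course," you'll find it. We'll send some links afterward to make it easy to find. How can you help us? For me it's always two things. One is to ask me when you have questions, and I'll reply as fast as I can. The other is to tell us your success stories. One thing that keeps us going is when people tell us what they implemented and what they did, a real case study. Hamel and I get really excited when we see those, and it genuinely keeps us going, so please share.
Hamel Husain: It's easy to find me. My website is Hamel.dev, and I'll give you the link. You can also find me on social media, LinkedIn, Twitter. The most helpful thing, and Shreya said it too: we'd be thrilled if we weren't the only two people teaching evals. We want other people to teach evals too. Any blog posts or writing, especially what you want to share as you learn, we're happy to help reshare and amplify.
Lenny Rachitsky: Awesome, very generous. Thank you both so much for being here. I really appreciate it, with everything you both have going on. Thank you.
Shreya Shankar: Thank you, Lenny, for having us, and thanks for all the kind words.
Lenny Rachitsky: My pleasure. Bye, everyone. Thank you so much for listening. If you found this valuable, you can subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Also, please consider giving us a rating or leaving a review, as that really helps other listeners find the podcast. You can find all past episodes or learn more about the show at Lennyspodcast.com. See you in the next episode.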
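Glossary
| English | Chinese |
|---|---|
| 11Labs | 11Labs (voice AI company; kept in English) |
| agreement | 一致性 (how closely the judge matches human judgment) |
| Andrew Ng | Andrew Ng (well-known machine learning researcher; name kept in English, as is also common in Chinese usage) |
| axial codes | 轴心编码 |
| bad smell | 坏味道 (from the software engineering idea of a code smell; a signal worth being wary of) |
| benevolent dictator | 仁慈的独裁者 |
| bucket | 桶 (a category bucket) |
| CI | CI (Continuous Integration; kept in English) |
| cohort | 队群 |
| cosine similarity | 余弦相似度 |
| criteria drift | 标准漂移 |
| Cron Job | Cron Job (scheduled task; kept in English) |
| Delphi | Delphi (AI digital-twin platform; kept in English) |
| Discord | Discord (community chat platform; kept in English) |
| dogfooding | dogfooding (a team testing a product by using it themselves; kept in English) |
| error analysis | 错误分析 |
| evals | 评估 |
| failure mode | 失败模式 |
| flywheel | 飞轮 (a self-reinforcing positive feedback loop) |
| foundation models | 基础模型 |
| Gemini | Gemini |
| generalize | 泛化 (machine learning term for how well a model performs on new data) |
| hallucination score | 幻觉评分 |
| HubSpot | HubSpot (well-known marketing and sales software company; kept in English) |
| jank | 糙 (rough, unpolished behavior) |
| Julius AI | Julius AI |
| Jupyter notebook | Jupyter notebook |
| Likert scale | Likert 量表 (a commonly used rating scale) |
| LLM as a judge | LLM as a judge (an evaluation approach that uses an LLM to judge output quality; kept in English) |
| MMLU | MMLU (Massive Multitask Language Understanding benchmark; kept in English) |
| moat | 护城河 (a barrier to business competition) |
| observability | 可观测性 |
| Occam’s razor | 奥卡姆剃刀 (principle in the philosophy of science: when explaining the same phenomenon, the fewer assumptions the better) |
| open coding | 开放编码 |
| pivot table | 数据透视表 |
| prompt | prompt |
| RAG retrieval | RAG 检索 |
| rubrics | 评分标准 |
| Statsig | Statsig (A/B testing and experimentation platform company; kept in English) |
| stochastic | 随机的 |
| surface area | 表面积 |
| taxonomy | 分类体系 |
| theoretical saturation | 理论饱和 |
| trace | trace (追踪) |
| vibe checks | 感觉检查 |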