The $1B AI company training ChatGPT, Claude & Gemini on the path to responsible AGI | Edwin Chen

Edwin Chen 2025-12-07

Episode Introduction

Lenny Rachitsky: You guys hit a billion in revenue in less than four years with around 60 to 70 people. You’re completely bootstrapped, haven’t raised any VC money. I don’t believe anyone has ever done this before.

Edwin Chen: We basically never wanted to play the Silicon Valley game. I always thought it was ridiculous. I used to work at a bunch of the big tech companies and I always felt that we could fire 90% of the people and we would move faster because the best people wouldn’t have all these distractions. So when we started Surge, we wanted to build it completely differently with a super small, super elite team.

$1B Revenue With Under 100 People

Lenny Rachitsky: You guys are by far the most successful data company out there.

The Evolution Of Company Structures

Edwin Chen: We essentially teach AI models what’s good and what’s bad. People don’t understand what quality even means in this space. They think you could just throw bodies at a problem and get good data. That’s completely wrong.

Lenny Rachitsky: To a regular person, it doesn’t feel like these models are getting that much smarter constantly.

What Does Surge AI Do?

Edwin Chen: Over the past year, I’ve realized that the values that the companies have will shape the model. I was asking Claude to help me draft an email the other day. And after 30 minutes, yeah, I think it really crafted me the perfect email and I sent it. But then I realized that I spent 30 minutes doing something that didn’t matter at all. If you could choose the perfect model behavior, which model would you want? Do you want a model that says, “You’re absolutely right. There are definitely 20 more ways to improve this email,” and it continues for 50 more iterations, or do you want a model that’s optimizing for your time and productivity and just says, “No. You need to stop. Your email’s great. Just send it and move on”?

Lenny Rachitsky: You have this hot take that a lot of these labs are pushing AGI in the wrong direction.

Defining True AI Quality

Edwin Chen: I’m worried that instead of building AI that will actually advance us as a species, curing cancer, solving poverty, understanding the universe, we are optimizing for AI slop instead. But we’re optimizing your models for the types of people who buy tabloids at a grocery store. We’re basically teaching our models to chase dopamine instead of truth.

Why Claude Leads In Coding And Writing

Lenny Rachitsky: Today, my guest is Edwin Chen, founder and CEO of Surge AI. Edwin is an extraordinary CEO and Surge is an extraordinary company. They’re the leading AI data company, powering training at every frontier AI lab. They’re also the fastest company to ever hit $1 billion in revenue, in just four years after launch, with fewer than 100 people and also completely bootstrapped. They’ve never raised a dollar in VC money, and they’ve also been profitable from day one.

As you’ll hear in this conversation, Edwin has a very different take on how to build an important company, and how to build AI that is truly good and useful to humanity. I absolutely love this conversation and I learned a ton. I’m really excited for you to hear it. If you enjoy this podcast, don’t forget to subscribe and follow it in your favorite podcasting app or YouTube. It helps tremendously.

And if you become an annual subscriber of my newsletter, you get a ton of incredible products for free for an entire year, including Devin, Lovable, Replit, Bolt, N8N, Linear, Superhuman, Descript, Wispr Flow, Gamma, Perplexity, Warp, Granola, Magic Patterns, Raycast, ChatPRD, Mobbin, PostHog, and Stripe Atlas. Head on over to lennysnewsletter.com and click Product Pass. With that, I bring you Edwin Chen after a short word from our sponsors.

My podcast guests and I love talking about craft, and taste, and agency, and product market fit. You know what we don’t love talking about? SOC 2. That’s where Vanta comes in. Vanta helps companies of all sizes get compliant fast and stay that way with industry-leading AI, automation, and continuous monitoring. Whether you’re a startup tackling your first SOC 2 or ISO 27001 or an enterprise managing vendor risk, Vanta’s trust management platform makes it quicker, easier, and more scalable. Vanta also helps you complete security questionnaires up to five times faster so that you can win bigger deals sooner.

The result? According to a recent IDC study, Vanta customers slash over $500,000 a year and are three times more productive. Establishing trust isn’t optional. Vanta makes it automatic. Get $1,000 off at vanta.com/lenny.

Whether you’re a seed stage startup trying to land your first enterprise customer or a unicorn expanding globally, WorkOS is the fastest path to becoming enterprise-ready and unlocking growth. They’re essentially Stripe for enterprise features.

Visit workos.com to get started or just hit up their Slack support where they have real engineers in there who answer your questions superfast. WorkOS allows you to build like the best with delightful APIs, comprehensive docs, and a smooth developer experience. Go to workos.com to make your app enterprise ready today.

Edwin, thank you so much for being here and welcome to the podcast.

Edwin Chen: Thanks so much for having me. I’m super excited.

The Reliability Of AI Benchmarks

Lenny Rachitsky: I want to start with just how absurd what you’ve achieved is. A lot of people and a lot of companies talk about scaling massive businesses with very few people as a result of AI, and you guys have done this in a way that is unprecedented. You guys hit a billion in revenue in less than four years with around 60 to 70 people. You’re completely bootstrapped, haven’t raised any VC money. I don’t believe anyone has ever done this before, so you guys are actually achieving the dream of what people are describing will happen with AI. I’m curious, do you think this will happen more and more as a result of AI? And also, where has AI most helped you find leverage to be able to do this?

Edwin Chen: Yeah, so we hit over a billion of revenue last year with under 100 people. And I think we’re going to see companies with even crazier ratios, like 100 billion per employee in the next few years. AI is just going to get better and better and make things more efficient so that ratio just becomes inevitable.

I used to work at a bunch of the big tech companies and I always felt that we could fire 90% of people and we would move faster because the best people wouldn’t have all these distractions. And so when we started Surge, we wanted to build it completely differently with a super small, super elite team, and yeah, what’s crazy is that we actually succeeded. And so I think two things are colliding.

One is that people are realizing that you don’t have to build giant organizations in order to win.

And two, yeah, all these efficiencies from AI. And they’re just going to lead to a really amazing time in company building.

The thing I’m excited about is that the types of companies are going to change too. It won’t just be that they’re smaller, we’re going to see fundamentally different companies emerging. If you think about it, fewer employees means less capital. Less capital means you don’t need to raise. So instead of companies started by founders who are great at pitching and great at hyping, you’ll get founders who are really great at technology and product.

And instead of products optimized for revenue and what VCs want to see, you’ll get more interesting ones built by these tiny obsessed teams. So people building things they actually care about, real technology and real innovation. So I’m actually really hoping that Silicon Valley [inaudible 00:07:06], it’ll go back to being a place for hackers again.

AI Benchmarks And Marketing Hype

Lenny Rachitsky: You guys have done a lot of things in a very contrarian way, and one was actually just not being on LinkedIn, posting viral posts, not on Twitter, constantly promoting Surge. I think most people hadn’t heard of Surge until just recently, and then you just came out, and like, okay, the fastest growing company at a billion dollars. Why would you do that? I imagine that was very intentional.

Edwin Chen: We basically never wanted to play the Silicon Valley game. I always thought it was ridiculous. What did you dream of doing when you were a kid? Was it building a company from scratch yourself and getting in the weeds of your code and your product every day? Or was it explaining all your decisions to VCs and getting on this giant PR and fundraising hamster wheel? And it definitely made things more difficult for us, because yeah, when you fundraise, you just naturally become part of this kind of Silicon Valley industrial complex where your VCs will tweet about you. You’ll get the TechCrunch headlines, you’ll get announced in all of the newspapers because you raised at this massive valuation. And so it made things more difficult for us because the only way we were going to succeed was by building a 10 times better product and getting word of mouth from researchers. But I think it also meant that our customers were people who really understood data and really cared about it.

I always thought it was really important for us to have early customers who were really aligned with what we were building, and who really cared about having really high quality data, and really understood how that data would make their AI models so much better, because they were the ones helping us. They were the ones giving us feedback on what we were producing. And so just having that kind of very close mission alignment with our customers actually helped us early on. So these are people who were basically just buying our product because they knew how different it was and because it was helping them, rather than because they saw something in that current [inaudible 00:08:52]. So it made things harder for us, but I think in a really good way.

Measuring Real AI Progress

Lenny Rachitsky: It’s such an empowering story to hear this journey for founders that they don’t need to be on Twitter all day promoting what they’re doing. They don’t have to raise money. They can just kind of go heads down and build, so I love so much about the story of Surge. For people that don’t know what Surge does, just give us a quick explanation of what Surge is.

Edwin Chen: We essentially teach AI models what’s good and what’s bad. So we train them using human data, and there’s a lot of different products that we have, like SFT, RLHF, rubrics, verifiers, RL environments, and so on and so on, and then we also measure how well they’re progressing. So essentially we’re a data company.

The AGI Timeline

Lenny Rachitsky: What you always talk about is the quality has been the big reason you guys have been so successful, the quality of the data. What does it take to create higher quality data? What do you all do differently? What are people missing?

Edwin Chen: I think most people don’t understand what quality even means in this space. They think you could just throw bodies at a problem and get good data and that’s completely wrong. Let me give you an example.

So imagine you wanted to train a model to write an eight line poem about the moon. What makes it a good, high-quality poem? If you don’t think deeply about quality, you’ll be like, “Is this a poem? Does it contain eight lines? Does it contain the word, moon?” You check all of these boxes, and if so, sure. Yeah, you say it’s a great poem. But that’s completely different from what we want. We are looking for Nobel Prize-winning poetry. Is this poetry unique? Is it full of subtle imagery? Does it surprise you and tug at your heart? Does it teach you something about the nature of moonlight? Does it play with your emotions? And does it make you think? That’s what we are thinking about when we think about a high-quality poem.

So it might be like a haiku about moonlight on water. It might use internal rhyme and meter. There are a thousand ways to write a poem about the moon, and each one gives you all these different insights into language, and imagery, and human expression. I think thinking about quality in this way is really hard. It’s hard to measure. It’s really subjective, and complex, and rich. And it sets a really high bar. And so we have to build all of this technology in order to measure it, like thousands of signals on all of our workers, thousands of signals on every project, every task. We know at the end of the day if you are good at writing poetry versus good at writing essays versus great at writing technical documentation. And so we have to gather all these signals on what your background is, what your expertise is, and not just that, but how you’re actually performing when you’re writing all these things, and we use those signals to inform whether or not you are good [inaudible 00:11:23] for these projects, and whether or not you are improving the models.

And it’s really hard, and so we have to build all this technology to measure it, but I think that’s exactly what we want AI to do, and so we have these really deep notions about quality that we’re always trying to achieve.
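Editor's sketch (not from the episode): the box-checking style of evaluation described above can be written in a few lines of Python. The function name `passes_checklist` and the sample poem are hypothetical, chosen only for illustration. The point is that a deliberately flat poem still passes every explicit check, which is why such checks say little about quality.

```python
def passes_checklist(poem: str) -> bool:
    """Naive check: exactly eight non-empty lines and a mention of 'moon'."""
    lines = [line for line in poem.splitlines() if line.strip()]
    has_eight_lines = len(lines) == 8
    mentions_moon = "moon" in poem.lower()
    return has_eight_lines and mentions_moon


# A deliberately dull poem: the same line repeated eight times.
dull_poem = "\n".join(["The moon is in the sky."] * 8)

print(passes_checklist(dull_poem))  # True
```

Nothing in this check looks at imagery, surprise, or emotion, which are the qualities the answer above says actually matter.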

Flawed AI Optimization Directions

Lenny Rachitsky: So what I’m hearing is there’s kind of just going much deeper in understanding what quality is within the verticals that you are selling data around. And is this like a person you hire that is incredibly talented at poetry plus evals that they, I guess, help write, that tell them that this is great? What’s the mechanics of that?

Hidden Dangers Of Optimizing Interactions

Edwin Chen: The way it works is we essentially gather thousands of signals about everything that you’re doing when you’re working on the platform. So we are looking at your keyboard strokes. We are looking at how fast you answer things. We are using reviews, we are using code standards, we are using… We’re training models ourselves on all the outputs that you create, and then we’re seeing whether they improve the model’s performance.

And so in a very similar way to how Google search, like when Google search is trying to determine what is a good webpage, there’s almost two aspects of it. One is you want to remove all of the worst of the worst webpages. So you want to remove all the spam, all the just low quality content, all the pages that don’t load, and so it’s almost like a content moderation problem. You just want to remove the worst of the worst.

But then you also want to discover the best of the best. Okay, like this is the best webpage or just the best person for this job. They are not just somebody who writes the equivalent of high school level poetry. Again, they’re not just [inaudible 00:12:57] writing poetry that checks all these boxes, checks all of these explicit instructions, but rather, yeah, they’re writing poetry that makes you emotional. And so we have all these signals as well where, again, completely differently from removing the worst of the worst, we are finding the best of the best. And so we have all these signals…

Again, just like Google Search uses all these signals, feeds them into its ML algorithms, and predicts certain types of things, we do the same with all of our workers, all of our tasks, and all of our projects. And so it’s almost like a complicated machine learning problem at the end of the day, and that’s how it works.
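Editor's sketch (not from the episode): the two-sided approach described above, first removing the worst of the worst and then surfacing the best of the best, can be illustrated as a filter followed by a ranking. Everything here is an assumption for illustration: the `Worker` fields, the `spam_score` and `quality_score` values, and the threshold are hypothetical and do not describe Surge's actual signals or models.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    spam_score: float     # hypothetical signal: 0 (clean) to 1 (spam-like)
    quality_score: float  # hypothetical signal: 0 (poor) to 1 (excellent)


def select_workers(workers, spam_threshold=0.5, top_n=2):
    # Stage 1: content-moderation-style filter that drops the worst of the worst.
    kept = [w for w in workers if w.spam_score < spam_threshold]
    # Stage 2: rank the survivors to find the best of the best.
    ranked = sorted(kept, key=lambda w: w.quality_score, reverse=True)
    return ranked[:top_n]


workers = [
    Worker("a", spam_score=0.9, quality_score=0.95),
    Worker("b", spam_score=0.1, quality_score=0.60),
    Worker("c", spam_score=0.2, quality_score=0.90),
    Worker("d", spam_score=0.3, quality_score=0.40),
]

print([w.name for w in select_workers(workers)])  # ['c', 'b']
```

Worker "a" has the highest quality score but is removed in the first stage, which shows why the two stages are treated as separate problems.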

Lenny Rachitsky: That is incredibly interesting.

I want to ask you about something I’ve been very curious about over the past couple years. If you look at Claude, it’s been so much better at coding and at writing than any other model for so long. And it’s really surprising just how long it took other companies to catch up, considering just how much economic value there is there. Just like every AI coding product sat on top of Claude because it was so good at code, and at writing also. What is it that made it so much better? Is it just the quality of the data they trained on or is there something else?

Who Is On The Right Path?

Edwin Chen: I think there are multiple parts to it. So a big part of it certainly is the data. I think people don’t realize that there’s almost like this infinite amount of choices that all the frontier labs are deciding between when they’re choosing what data goes into their models. It’s like, okay, are you purely using human data? Are you gathering the human data in X, Y, Z way? When you are gathering the human data, what exactly are you asking the people who are creating it to create for you?

For example, in the coding realm, maybe you care more about front end coding versus back end coding. Maybe when you’re doing front end coding, you care a lot about the visual design of the front end applications that you’re creating, or maybe you don’t care about it so much and you care more about, I don’t know, the efficiency of it or the pure correctness over that visual design.

And then other questions like, “Okay, are you carrying [inaudible 00:14:49] how much synthetic data are we throwing into the mix? How much do you care about these 20 different benchmarks?”

Some companies, they see these benchmarks and they’re like, “Okay, for PR purposes, even though we don’t think that these academic benchmarks matter all that much, maybe we just need to optimize for them anyways because our marketing team needs to show certain progress on certain standard evaluations that every other company talks about, and if we don’t show good performance here, it’s going to be bad for us even if ignoring these academic benchmarks makes us better at the real tasks.”

Other companies are going to be principled and be like, “Okay, yeah, no, I don’t care about marketing. I just care about how my model performs on these real world tasks at the end of the day, and so I’m going to optimize for that instead.”

And it’s almost like there’s a trade-off between all of these different things, and there’s like a…

One of the things I often think about is that there’s a… It’s almost like there’s an art to post training. It’s not purely a science. When you are deciding what kind of model you’re trying to create and what it’s good at, there’s this notion of taste and sophistication, like, “Okay, do I think that these…”

So going back to the example of how good the model is at visual design. I’m like, “Okay, maybe you have a different notion of visual design than what I do. Maybe you care more about minimalism, and you care more about, I don’t know, 3D animations than I do. And maybe this other person prefers things that look a little bit more baroque.” And there’s all these notions of taste and sophistication that you have to decide between when you’re designing your post training mix, and so that matters as well.

So long story short, I think there’s all these different factors, and certainly the data is a big part of it, but it’s also like what is the objective function that you’re trying to optimize your model towards?

Lenny Rachitsky: That is so interesting. The taste of the person leading this work will inform what data they ask for, what data they feed it. But it’s wild it shows the value of great data. Anthropic got so much growth and win from essentially better data.

Ethical Choices In AI Products

Edwin Chen: Yeah, exactly.

Rethinking The Silicon Valley Startup Path

Lenny Rachitsky: And I could see why companies like yours are growing so fast. There’s just so much… And that’s just one vertical. That’s just coding, and then there’s probably a similar area for writing. I love that it’s… It’s interesting that AI, it feels like this artificial computer binary thing, but it’s like taste. Human judgment is still such a key factor in these things being successful.

Edwin Chen: Yep, exactly. Again, going back to the example I gave earlier, certain companies, if you ask them what is a good poem, they will simply robotically check off all of these instructions on a list.

But again, I don’t think that makes for good poetry, so certain frontier labs, the ones with more taste and sophistication, they will realize that it doesn’t reduce to this fixed set of checkboxes and they’ll consider all of these kind of implicit, very subtle qualities instead, and I think that’s what makes them better at this at the end of the day.

Can LLMs Lead To AGI?

Lenny Rachitsky: You mentioned benchmarks. This is something a lot of people worry about is there’s all these models that are always… Basically, it feels like every model is better than humans at every STEM field at this point, but to a regular person, it doesn’t feel like these models are getting that much smarter constantly. What’s your just sense of how much you trust benchmarks and just how correlated those are with actual AI advancements?

Edwin Chen: Yeah, so I don’t trust the benchmarks at all. And I think that’s for two reasons. So one is I think a lot of people don’t realize, even researchers within the community, they don’t realize that the benchmarks themselves are often honestly just wrong. They have wrong answers. They’re full of all this kind of messiness and people trust… At least for the popular ones, people have maybe realized this to some extent, but the vast majority just have all these flaws that people don’t realize. So that’s one part of it.

And the other part of it is these benchmarks at the end of the day, they often have well-defined objective answers that make them very easy for models to hill-climb on in a way that’s very different from the messiness and ambiguity of the real world.

I think one thing that I often say is that it’s kind of crazy that these models can win IMO gold medals, but they still have trouble parsing PDFs. And that’s because, yeah, even though IMO gold medals seem hard to the average person, yeah, they are hard at the end of the day. But they have this notion of objectivity that, okay, yeah, parsing a PDF sometimes doesn’t have. And so it’s easier for the frontier labs to hill-climb on all of these than to solve all these messy, ambiguous problems in the real world. So I think there’s a lack of direct correlation there.

Reinforcement Learning Environments

Lenny Rachitsky: It’s so interesting the way you described it is hitting these benchmarks is kind of like a marketing piece. When you launch, say Gemini 3 just launched, and it’s like, cool. Number one with all these benchmarks. Is that what happens? They just kind of train their models to get good at these very specific things?

Edwin Chen: Yeah, so there’s, again, maybe two parts to this. So one is, sometimes, yeah, these benchmarks, they accidentally leak in certain ways or the frontier labs will tweak the way they evaluate their models on these benchmarks. They’ll tweak the system prompt or they’ll tweak the number of times they run their model, and so on and so on in a way that games these benchmarks.

The other part of it though is it’s like by optimizing for the benchmark instead of optimizing for the real world, you will just naturally climb on the benchmark and, yeah, it’s basically another form of gaming it.
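Editor's sketch (not from the episode): one of the tweaks mentioned above, changing the number of times a model is run, can be illustrated with basic probability. This assumes attempts are independent and that a question counts as solved if any single attempt succeeds. Both are assumptions for illustration; the episode does not say how any lab actually scores its runs, and the function name is hypothetical.

```python
def solve_rate_with_retries(p_single: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts


# A model that solves a question half the time on a single try.
for k in (1, 2, 4, 8):
    print(k, round(solve_rate_with_retries(0.5, k), 4))
# 1 0.5
# 2 0.75
# 4 0.9375
# 8 0.9961
```

Under these assumptions the reported score climbs from 50% to over 99% without the model itself changing at all, which is why evaluation settings matter when comparing published numbers.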

From Scoring Criteria To RL Environments

Lenny Rachitsky: Knowing that with that in mind, how do you get a sense of if we’re heading towards AGI, how do you measure progress?

The Evolution Of Post-Training Methods

Edwin Chen: Yes, so the way we really care about measuring model progress is by running all these human evaluations.

So for example, what we do is, yeah, we will take our human annotators, and we’ll ask them, “Okay, go have a conversation with the model.” And maybe you’re having this conversation with the model across all of these different topics. So you are a Nobel Prize winning physicist. So you go have a conversation about pushing the frontier of your own research. You are a teacher and you’re trying to create lesson plans for your students, so go talk to the model about these things. Or you’re a coder and you’re working at one of these big tech companies, and you have these problems every day, so go talk to the model and see how much it helps you.

And because our Surgers, our annotators, are experts at the top of their fields, and they are not just skimming the responses, they’re actually working through the responses deeply themselves, they are… Yeah, they’re going to evaluate the code that it writes. They’re going to double check the physics equations that it writes. They’re going to evaluate the models in a very deep way, so they’re going to pay attention to accuracy and instruction following, all these things that casual users don’t when you suddenly get a popup on your ChatGPT response asking you to compare these two different responses. People like that, they’re not evaluating models deeply, they’re just vibing and picking whatever response looks flashiest. Our [inaudible 00:21:38] are looking closely at responses and evaluating them for all of these different dimensions, and so I think that’s a much better approach than these benchmarks or these random online A/B tests.

Lenny Rachitsky: Again, I love just how central humans continue to be in all this work that we’re not totally done yet. Is there going to be a point where we don’t need these people anymore, that AI is so smart that, “Okay, we’re good. We got everything out of your heads”?

Diversifying How AI Learns

Edwin Chen: Yeah, I think that will not happen until we’ve reached AGI. It’s almost like by definition, if we haven’t reached AGI yet, then there’s more for the models to learn from, and so, yeah, I don’t think that’s going to happen anytime soon.

The AI Market Outlook

Lenny Rachitsky: Okay, cool. So more reason to stress about AGI. “We don’t need these folks anymore.”

I can’t not ask just… People that work closely with this stuff, I’m always just curious. What’s your AGI timelines? How far do you think we are from this? Do you think we’re in like a couple years or is it like decades?

Edwin Chen: So I’m certainly on the longer time horizon front. I think people don’t realize that there’s a big difference between moving from 80% performance to 90% performance to 99% performance to 99.9% performance, and so on, and so on. And so in my head, I’d probably bet that within the next one or two years, yeah, the models are going to automate 80% of the average L6 software engineer’s job. It’s going to take another few years to move to 90%, and another few to 99%, and so on, and so on. So I think we’re closer to a decade or decades away than [inaudible 00:23:03].

Overrated And Underrated AI Aspects

Lenny Rachitsky: You have this hot take that a lot of these labs are kind of pushing AGI in the wrong direction and this is based on your work at Twitter, and Google, and Facebook. Can you just talk about that?

Edwin Chen: I’m worried that instead of building AI that will actually advance us as a species, curing cancer, solving poverty, understanding the universe, all these big grand questions, we are optimizing for AI slop instead. We’re basically teaching our models to chase dopamine instead of truth. And I think this relates to what we’re talking about regarding these benchmarks. So let me give you a couple examples.

So right now, the industry is plagued by these terrible leaderboards like LLM Arena. It’s this popular online leaderboard where random people from around the world vote on which AI response is better. But the thing is, like I was saying earlier, they’re not carefully reading or fact-checking. They’re skimming these responses for two seconds and picking whatever looks flashiest.

So a model can hallucinate everything. It can completely hallucinate. But it will look impressive because it has crazy emojis, and bolding, and markdown headers, and all these superficial things that don’t matter at all, but they catch your attention. And these LLM Arena users love it. It’s literally optimizing your models for the types of people who buy tabloids at the grocery store. We’ve seen this [inaudible 00:24:15] data ourselves. The easiest way to climb LLM Arena is adding crazy bolding. It’s doubling the number of emojis. It’s tripling the length of your model responses, even if your model starts hallucinating and getting the answer completely wrong.

And the problem is, again, all of these frontier labs, they kind of have to pay attention to PR because their sales team, when they’re trying to sell to all these enterprise customers, those enterprise customers will say, “Oh, well, but your model’s only number five on LLM Arena, so why should I buy it?” They have to, in some sense, pay attention to these leaderboards, and so what their researchers [inaudible 00:24:47] tell us is like they’ll say, “The only way I’m going to get promoted at the end of the year is if I climb this leaderboard, even though I know that climbing it is probably going to make my model worse at accuracy and instruction following.” So I think there’s all these negative incentives that are pushing work in the wrong direction.

I’m also worried about this trend towards optimizing AI for engagement. I used to work on social media. And every time we optimized for engagement, terrible things happened. You’d get clickbait and pictures of bikinis and bigfoot and horrifying skin diseases just filling your feeds. And I worry that the same thing’s happening with AI. If you think about all the sycophancy issues with ChatGPT, “Oh, you’re absolutely right. What an amazing question,” the easiest way to hook users is to tell them how amazing they are. And so these models, they constantly tell you you’re a genius. They’ll feed into your delusions and conspiracy theories. They’ll pull you down these rabbit holes because Silicon Valley loves maximizing time spent and just increasing the number of conversations you’re having with it. And so yeah, companies are spending all their time hacking these leaderboards and benchmarks, and the scores are going up, but I think it actually masks that the models with the best scores are often the worst or just have all these fundamental failures. So I’m really worried that all of these negative incentives are pushing AGI into the wrong direction.

Product Teams And The AI Future

Lenny Rachitsky: So what I’m hearing is AGI is being slowed down by these, basically the wrong objective function, these labs paying attention to the wrong basically benchmarks and evals.

Edwin Chen: Yep.

The Founding Story Of Surge

Lenny Rachitsky: I know you probably can’t play favorites since you work with all the labs. Is there anyone doing better at this and maybe kind of realizing this is the wrong direction?

Core Motivations And Mission

Edwin Chen: I would say I’ve always been very impressed by Anthropic. I think Anthropic takes a very principled view about what they do and don’t care about and how they want their models to behave in a way that feels a lot more principled to me.

Influencing The Direction Of AI

Lenny Rachitsky: Interesting.

Are there any other big mistakes you think labs are making that are kind of slowing things down or heading in the wrong direction? We’ve heard just chasing benchmarks, this engagement focus. Is there anything else you’re seeing of just like, “Okay, we got to work on this because it’ll speed everything up”?

Edwin Chen: I think there is a question of what products they’re building and whether those products themselves are something that kind of help or hurt humanity. I think a lot about Sora and…

Significance For Humanity

Lenny Rachitsky: I was thinking that’s what you’re imagining.

Deep Thoughts On Objective Functions

Edwin Chen: Yeah, what it entails, and so it’s kind of interesting. It’s like which companies would build Sora and which wouldn’t?

And I think the answer to that… Well, I don’t know the answer myself. I have an idea in my head, but I think the answer to that question maybe reveals certain things about what kinds of AI models those companies want to build and what direction and what future they want to achieve, yeah, so I think about that a lot.

Lenny Rachitsky: The steel man argument there is, it’s like fun, people want it, it’ll help them generate revenue to grow this thing and build better models, it’ll train data in an interesting way, it’s also just really fun.

Lessons From Founding Surge

Edwin Chen: Yeah. I think it’s almost like, do you care about how you get there? And in the same way, so I made this tabloid analogy earlier, but would you sell tabloids in order to fund, I don’t know, some other newspaper?

Sure, like in some sense, if you don’t care about the path, then you’ll just do whatever it takes, but it’s possible that it has negative consequences in and of itself that will harm the long-term direction of what you’re trying to achieve, and maybe it’ll distract you from all the more important things, so yeah, I think that the path you take matters a lot as well.

Lenny Rachitsky: Along these lines, you talked a bunch about Silicon Valley and the downsides of raising a lot of money and being in the echo chamber. What do you call it, the Silicon Valley machine? You talk about how it’s hard to build important companies in this way and that you might actually be much more successful if you’re not going down the VC path. Can you just talk about what you’ve seen in that experience and your advice essentially to founders? Because they’re always hearing, “Raise money from fancy VCs, move to Silicon Valley.” What’s kind of the countertake?

Rapid Fire Q&A

Edwin Chen: Yes. So I’ve always really hated a lot of the Silicon Valley mantras. The standard playbook is to get product market fit by pivoting every two weeks. And to chase growth and chase engagement with all of these dark patterns and to blitzscale by hiring as fast as possible. And I’ve always disagreed.

So yeah, I would say don’t pivot. Don’t blitzscale. Don’t hire that Stanford grad who simply wants to add a hot company to their resume, just build the one thing only you could build, a thing that wouldn’t exist without the insight and expertise that only you have.

And you see these buy to [inaudible 00:29:34] companies everywhere now. Some founder who was doing crypto in 2020, and then pivoted to NFTs in 2022, and now they’re an AI company. There’s no consistency, there’s no mission, they’re just chasing valuations. And I’ve always hated this because Silicon Valley loves to scorn Wall Street for focusing on money. But honestly, most of Silicon Valley is chasing the same thing. And so we stayed focused on our mission from day one, pushing that frontier of high quality complex data, and I’ve always loved that because I think startups…

I have this very romantic notion of startups. Startups are supposed to be a way of taking big risks to build something that you really believe in. But if you’re constantly pivoting, you’re not taking any risks. You’re just trying to make a quick buck. And if you fail because the market isn’t ready yet, I actually think that’s way better. At least you took a swing at something deep, and novel, and hard instead of pivoting into another LLM wrapper company. So yeah, I think the only way you build something that matters that’s going to change the world is if you find a big idea you believe in and you say no to everything else.

So you don’t keep on pivoting when it gets hard, you don’t hire a team of 10 product managers because that’s what every other cookie cutter startup does, you just keep building that one company that wouldn’t exist without you. And I think there are a lot of people in Silicon Valley now who are sick of all the grift, who want to work on big things that matter with people who actually care, and I’m hoping that that would be the future of how we go with technology.

Lenny Rachitsky: I’m actually working on a post right now with Terrence Rohan, this VC that I really like to work with, and we interviewed five people who picked really successful generational companies early and joined them as really early employees. They joined OpenAI before anyone thought it was awesome, Stripe before anyone knew it was awesome, and so we’re looking for patterns of how people find these generational companies before anyone else, and it aligns exactly with what you described, which is ambition. They have a wild ambition with what they want to achieve. They’re not, as you said, just kind of looking around for product market fit no matter what it ends up being, and so I love that what you described very much aligns with what we’re seeing there.

Favorite Movies And TV Shows

Edwin Chen: Yeah, I absolutely think that you have to have huge ambitions, and you have to have a huge belief in your idea that’s going to change the world, and you have to be willing to double down and keep on doing whatever it takes to make it happen.

Lenny Rachitsky: I love how counter your narrative is to so many of the things people hear, and so I love that we’re doing this. I love that we’re sharing this story.

Imagine starting a project at work. And your vision is clear, you know exactly who’s doing what, and where to find the data that you need to do your part. In fact, you don’t have to waste time searching for anything because everything your team needs from project trackers and OKRs, the documents and spreadsheets lives in one tab all in Coda.

With Coda’s collaborative all in one workspace, you get the flexibility of docs, the structure of spreadsheets, the power of applications, and the intelligence of AI all in one easy to organize tab. Like I mentioned earlier, I use Coda every single day. And more than 50,000 teams trust Coda to keep them more aligned and focused. If you’re a startup team looking to increase alignment and agility, Coda can help you move from planning to execution in record time.

To try it for yourself, go to coda.io/lenny today and get six months free of the team plan for startups. That’s coda.io/lenny to get started for free and get six months of the team plan, coda.io/lenny.

Slightly different direction, but something else that was maybe a counter narrative. I imagine you watched the Dwarkesh and Richard Sutton podcast episode, and even if you didn’t, they basically had this conversation, Richard Sutton. He was a famous AI researcher, had this whole bitter lesson meme, and he talked about how LLMs almost are kind of a dead end, and he thinks we’re going to really plateau around LLMs because of the way they learn.

What’s your take there? Do you think LLMs will get us to AGI or beyond, or do you think there’s going to be something new or a big breakthrough that needs to get us there?

Recently Discovered Favorite Products

Edwin Chen: I’m in the camp where I do believe that something new will be needed. The way I think about it is when I think about training AI, I take a very… I don’t know if I would say biological point of view. But I believe that in the same way that there’s a million different ways that humans learn, we need to build models that can mimic all of those ways as well. And maybe they’ll have a different distribution of the focuses that they have. I know that it’ll be different from humans, so maybe they have a different distribution, but we want to be able to mimic the learning abilities of humans and make sure that we have the algorithms and the data for models to learn in the same way. And so to the extent that LLMs have different ways of learning from humans, then yeah, I think something new will be needed.

Lenny Rachitsky: This connects to reinforcement learning. This is something that you’re big on and something I’m hearing more and more is just becoming a big deal in the world of post-training. Can you just help people understand what is reinforcement learning and reinforcement learning environments, and why they’re going to be more and more important in the future?

Personal Life Motto

Edwin Chen: Reinforcement learning is essentially training your model to reach a certain reward. And let me explain what an RL environment is. An RL environment is essentially a simulation of the real world. So think of it like building a video game with a fully fleshed out universe. Every character has a real story, every business has tools and data you can call, and you have all these different entities interacting with each other.

So for example, we might build a world where you have a startup with Gmail messages, and Slack threads, and Jira tickets, and GitHub PRs, and a whole code base. And then suddenly AWS goes down. And Slack goes down. And so, “Okay. Model, well, what do you do?” The model needs to figure it out.

So we give the models tasks in these environments, we design interesting challenges for them, and then we run them to see how they perform. And then we teach them, we give them these rewards when they’re doing a good job or a bad job.

And I think one of the interesting things is that these environments really showcase where models are weak at end-to-end tasks in the real world. You have all these models that seem really smart on isolated benchmarks, they’re good at single step tool calling. They’re good at single step instruction following. But suddenly you dump them into these messy worlds where you have confusing Slack messages and tools they’ve never seen before, and they need to perform the right actions and modify the [inaudible 00:36:06] and interact over longer time horizons where what they do in step one affects what they do in step 50. And that’s very different from these kind of academic single step environments that they’ve been in before, and so the model just fails catastrophically in all these crazy ways.

So I think these RL environments are going to be really interesting playgrounds for the models to learn from that will essentially be simulations and mimics of the real world, and so they’ll hopefully get better and better at real tasks compared to all these contrived environments.

Lenny Rachitsky: So I’m trying to imagine what this looks like. Essentially, it’s like a virtual machine with, I don’t know, a browser or a spreadsheet or something in it with like, I don’t know, surge.com. Is that your website, surge.com? Let’s make sure we get that right.

Soda Or Pop?

Edwin Chen: So we are actually surgehq.ai.

Lenny Rachitsky: Surgehq.ai. Check it out. We’re hiring, I imagine. Yes. Okay. So it’s like, cool, here’s surgehq.ai. Your job, here’s your job as an agent, let’s say, is to make sure it stays up. And then all of a sudden it goes down and the objective function is figure out why. Is that an example?

Final Closing Remarks

Edwin Chen: Yeah, so the objective function might be… Or the goal of the task might be, okay, go figure out why and fix it. And so the objective function might be, it might be passing a series of unit tests, it might be writing a document like maybe it’s a retro containing certain information that matches exactly what happened, there’s all these different rewards that we might give it that determine whether or not it’s succeeding, and so the models, we’re basically teaching the models to achieve that reward.
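As a rough illustration of the kind of reward described here, the sketch below scores an episode by the fraction of checks that pass. It is a minimal sketch under stated assumptions, not Surge's actual tooling: `EpisodeResult`, `make_reward`, and the specific checks (site back up, retro mentions a root cause) are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EpisodeResult:
    """What a (hypothetical) environment records once the agent stops acting."""
    site_is_up: bool
    retro_text: str


Check = Callable[[EpisodeResult], bool]


def make_reward(checks: list[Check]) -> Callable[[EpisodeResult], float]:
    """Build a reward function: the fraction of checks that pass."""
    def reward(result: EpisodeResult) -> float:
        if not checks:
            return 0.0
        passed = sum(1 for check in checks if check(result))
        return passed / len(checks)
    return reward


# Hypothetical checks for the "site went down, figure out why and fix it" task.
outage_reward = make_reward([
    lambda r: r.site_is_up,                                   # the fix worked
    lambda r: "root cause" in r.retro_text.lower(),           # retro names a cause
    lambda r: "expired certificate" in r.retro_text.lower(),  # ...and the right one
])

good = EpisodeResult(True, "Root cause: expired certificate on the load balancer.")
partial = EpisodeResult(True, "Restarted everything until it worked.")
print(outage_reward(good), outage_reward(partial))
```

In this toy setup the agent that fixes the site but writes an uninformative retro gets partial credit, which is one simple way to combine several of the reward signals mentioned above.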

Lenny Rachitsky: So essentially it’s off and running. Here’s your goal, figure out why the site went down and fix it. And it just starts trying stuff, using everything, all the intelligence it’s got, it makes mistakes, you kind of help it along the way, reward it if it’s doing the right sort of thing. And so what you’re describing here is this is the next phase of models becoming smarter. More RL environments focused on very specific tasks that are economically valuable, I imagine.

Edwin Chen: Yeah, so just in the same way that there were all these different methods for models to learn in the past, originally we had SFT and RLHF, and then we had rubrics and verifiers. This is the next stage, and it’s not the case that the previous methods are obsolete. This is, again, just a different form of learning that complements all the previous types, so it’s just like a different skill that the model needs to learn how to do.

Lenny Rachitsky: And so in this case, it’s less some physics PhD sitting around talking to a model, correcting it, giving it evals of here’s what the correct answer is, creating rubrics and things like that. More it’s like this person now designing an environment. So another example I’ve heard is like a financial analyst. Just like, “Here’s an Excel spreadsheet, here’s your goal, figure out our profit and loss,” or whatever. And so this expert now, instead of just sitting around writing rubrics, they’re designing this RL environment.

Edwin Chen: Yeah, exactly. So that financial analyst might create a spreadsheet, they may create certain tools that the model needs to call in order to help fill out that spreadsheet, like it might be, okay, the model needs to access a Bloomberg terminal. It needs to learn how to use it. And it needs to learn how to use this calculator. And it needs to learn how to perform this calculation. So it has all these tools that it has access to.

And then the reward might be… Okay, it’s like maybe I will download that spreadsheet and I want to see, does cell B22 contain the correct profit and loss number? Or does tab number two contain this piece of information?
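The cell check described here can be sketched as a tiny verifier. This is a hedged illustration only: the workbook is modeled as a plain nested dictionary rather than a real spreadsheet file, and `cell_reward`, the tab name `"P&L"`, and the numbers are assumptions made up for the example.

```python
import math

# Assumed representation: {tab name: {cell address: value}}.
Workbook = dict[str, dict[str, object]]


def cell_reward(workbook: Workbook, tab: str, cell: str,
                expected: float, rel_tol: float = 1e-6) -> float:
    """Return 1.0 if the given cell holds the expected number, else 0.0."""
    value = workbook.get(tab, {}).get(cell)
    # Reject missing cells, text, and booleans (bool is a subclass of int).
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return 0.0
    return 1.0 if math.isclose(value, expected, rel_tol=rel_tol) else 0.0


# Hypothetical spreadsheet the model handed back at the end of the task.
submitted: Workbook = {
    "P&L": {"B21": "Net profit", "B22": 1_250_000.0},
    "Notes": {"A1": "Q3 summary"},
}

print(cell_reward(submitted, "P&L", "B22", expected=1_250_000))  # correct cell
print(cell_reward(submitted, "P&L", "C22", expected=1_250_000))  # cell is empty
```

A tolerance-based comparison is used because a spreadsheet calculation may differ from the reference number by floating-point rounding.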

Lenny Rachitsky: And what’s interesting, this is a lot closer to how humans learn. We just try stuff, figure out what’s working and what’s not. You talk about how trajectories are really important to this. It’s not just here’s the goal and here’s the end, it’s like every step along the way. Can you just talk about what trajectories are and why that’s important to this?

Edwin Chen: I think one of the things that people don’t realize is that sometimes even though the model reaches the correct answer, it does so in all these crazy ways. So it may have in the intermediate trajectory, it may have tried 50 different times and failed, but eventually it just kind of randomly lands on a correct number. Or maybe it is…

Sometimes it just does things very inefficiently or it almost reward-hacks a way to get at the correct answer, and so I think paying attention to the trajectory is actually really important. And I think it’s also really important because some of these trajectories can be very long. And so if all you’re doing is checking whether or not the model reaches the final answer, it’s like there’s all this information about how the model behaved in the intermediate steps that’s missing.

Sometimes you want models to get to the correct answer by reflecting on what it did. Sometimes you want it to get to the correct answer by just one-shotting it. And if you ignore all of that, it’s just like teaching it… just missing a lot of the information that you could be teaching a model to do.
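One way to picture the difference between outcome-only and trajectory-aware scoring is the sketch below. It is an assumption-laden toy, not a method described in the conversation: the `Step` record, the penalty weights, the step budget, and the minimum credit are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One action in a (hypothetical) recorded trajectory."""
    action: str
    succeeded: bool


def trajectory_score(steps: list[Step], final_correct: bool,
                     step_budget: int = 10,
                     failure_penalty: float = 0.05,
                     overrun_penalty: float = 0.02,
                     min_credit: float = 0.1) -> float:
    """Score a run by its outcome AND by how it got there.

    A wrong final answer scores 0. A correct one starts at 1.0 and loses
    credit for failed attempts and for exceeding the step budget, but never
    drops below min_credit, so a messy success still beats a failure.
    """
    if not final_correct:
        return 0.0
    failures = sum(1 for step in steps if not step.succeeded)
    overrun = max(0, len(steps) - step_budget)
    score = 1.0 - failure_penalty * failures - overrun_penalty * overrun
    return max(min_credit, score)


direct = [Step("read_logs", True), Step("restart_service", True)]
flailing = [Step("guess", False)] * 50 + [Step("guess", True)]

print(trajectory_score(direct, final_correct=True))    # clean, efficient run
print(trajectory_score(flailing, final_correct=True))  # 50 misses, then a lucky hit
```

An outcome-only check would give both runs the same reward; the trajectory-aware version separates the efficient run from the one that landed on the answer after 50 failed attempts.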

Lenny Rachitsky: I love that. Yeah, it tries a bunch of stuff and eventually gets it right. You don’t want it to learn this is the way to get there. There’s often a much more efficient way of doing it.

You mentioned all the kind of the steps we’ve taken along the journey of helping models get smarter. Since you’ve been so close to this for so long, I think this is going to be really helpful for people. What’s kind of like been the steps along the way from the first post-training that has most helped models advance? Where do evals fit in the RL environments? Just like what’s been the steps and now we’re heading towards RL environments?

Edwin Chen: Originally, the way models started getting post-trained was purely through SFT. And-

Lenny Rachitsky: What does that stand for?

Edwin Chen: So SFT stands for supervised fine-tuning. So again, I think often in terms of these human analogies, and so SFT is a lot like mimicking a master and copying what they do.

And then RLHF became very dominant. And the analogy there would be like sometimes you learn by writing 55 different essays and someone telling you which one they liked the most.

And then I think over the past year or so, rubrics and verifiers have become very important. And rubrics and verifiers are like learning by being graded and getting detailed feedback on where you went wrong.

Lenny Rachitsky: And those are evals, another word for that?

Edwin Chen: Yeah. So I think the term evals often covers two things. One is you are using the evaluations for training because you’re evaluating whether or not the model did a good job, and when it does do a good job, you’re rewarding it.

And then there’s this other notion of evals where you’re trying to measure the model’s progress like, okay, yeah, I have five different candidate checkpoints and I want to pick the one that’s best in order to release it to the public. So I’m going to run all these evals on these five different checkpoints in order to decide which one is best.
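The second use of evals, picking a checkpoint to release, can be sketched in a few lines. This is only an illustrative sketch: the checkpoint names, eval names, and scores are made up, and averaging across evals is an assumed aggregation rule (a real release decision would likely weigh evals differently and involve human judgment, especially given the concerns raised earlier about chasing benchmark scores).

```python
from statistics import mean


def pick_best_checkpoint(eval_scores: dict[str, dict[str, float]]) -> str:
    """Return the checkpoint with the highest mean score across its evals.

    eval_scores maps checkpoint name -> {eval name: score in [0, 1]}.
    Ties go to the checkpoint listed first.
    """
    if not eval_scores:
        raise ValueError("no checkpoints to compare")
    return max(eval_scores, key=lambda name: mean(list(eval_scores[name].values())))


# Hypothetical scores for three candidate checkpoints.
candidates = {
    "ckpt-a": {"coding": 0.70, "writing": 0.80},
    "ckpt-b": {"coding": 0.90, "writing": 0.50},
    "ckpt-c": {"coding": 0.60, "writing": 0.60},
}

best = pick_best_checkpoint(candidates)
print(best)
```

Note that "ckpt-b" has the single highest score on one eval but loses on the average, which is the kind of trade-off an aggregation rule has to make explicit.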

Lenny Rachitsky: Awesome.

Edwin Chen: Yeah, and now we have RL environments, so this is kind of like a hot new thing.

Lenny Rachitsky: Awesome. So what I love about this business journey is just there’s always something new. There’s always this like, okay. We’re getting so good at just all this beautiful data for companies and now they need something completely different. Now we’re setting up all these virtual machines for them and all these different use cases.

Edwin Chen: Yep.

Lenny Rachitsky: And it feels like that’s a big part of this industry you’re in, it’s just adapting to what labs are asking for.

Edwin Chen: Yeah. So I really do think that we are going to need to build a suite of products that reflect a million different ways that humans learn.

Like for example, think about becoming a great writer. You don’t become great by memorizing a bunch of grammar rules. You become great by reading great books, and you practice writing, and you get feedback from your teachers and from the people who buy your books in a bookstore and leave reviews. And you notice what works and what doesn’t. And you develop taste by being exposed to all of these masterpieces and also just terrible writing. So you learn through this endless cycle of practice and reflection, and each type of learning that you have, again, these are all very different methods of learning to become a great writer, so just in the same way that… there’s a thousand different ways that the great writer becomes great, I think there’s going to be a thousand different ways that AI [inaudible 00:44:05] need to learn.

Lenny Rachitsky: It’s so interesting this just ends up being just like humans in so many ways. It makes sense because in a sense, neural networks, deep learning is modeled after how humans have learned and how our brains operate, but it’s interesting just to make them smarter. It’s how do we come closer to how humans learn more and more?

Edwin Chen: Yeah, it’s almost like maybe the end goal is just throwing you into the environment and just seeing how you evolve. But within that evolution, there’s all these different sub-learning mechanisms.

Lenny Rachitsky: Yeah, which is kind of what we’re doing now, so that’s really interesting. This might be the last step until we hit AGI. Along these lines, something that’s really unique to Surge that I learned is you guys have your own research team, which I think is pretty rare, talk about just why that’s something you guys have invested in and what has come out of that investment.

Edwin Chen: Yeah, so I think that stems from my own background. My own background is as a researcher. And so I’ve always cared fundamentally about pushing the industry and pushing the research community and not just about revenue. And so I think what our research team does is a couple different things.

So we almost have two types of researchers at our company. One is our forward-deployed researchers who are often working hand in hand with our customers to help them understand their models. So we will work very closely with the customers to help them understand, “Okay, this is where your model is today. This is where you’re lagging behind all the competitors, these are some ways that you could be improving in the future, given your goals, and we’re going to design these data sets, these evaluation methods, these training techniques to make your models better.” So this very collaborative notion of working with our customers, being researchers ourselves, just a little bit more focused on the data side, and working hand in hand with them to do whatever it takes to make them the best.

And then we also have our internal researchers. So our internal researchers are focused on slightly different things. So they are focused on building better benchmarks and better leaderboards.

So I’ve talked a lot about how I worry that the leaderboards and benchmarks out there today are steering models in the wrong direction, so yeah, so the question is, how do we fix that? And so that’s what our research team is focused really heavily on right now. So they’re working a lot on that.

And they’re also working on these other things like, “Okay, we need to train our own models to see what types of data performs the best, what types of people perform the best.” And so they’re also working on all these training techniques and evaluation of our own data sets to improve our data operations and the internal data products that we have that determine what makes something good quality.

Lenny Rachitsky: It’s such a cool thing because I think basically only the labs have researchers helping them advance AI. I imagine it’s pretty rare for a company like yours to have researchers actually doing primary research on AI.

Edwin Chen: Yeah, I think it’s just because it’s something I’ve fundamentally always cared about. I often think about us more like a research lab than a startup because that is my goal. It’s kind of funny, but I’ve always said I would rather be Terence Tao than Warren Buffett, so that notion of creating research that pushes the frontier forward and not just getting some valuation, that’s always been what drives me.

Lenny Rachitsky: And it’s worked out. That’s the beautiful thing about this. You mentioned that you were hiring researchers, is there anything there you want to share about the folks you’re looking for?

Edwin Chen: So we look for people who are just fundamentally interested in data all day. So the types of people who could literally spend 10 hours digging through a dataset, and playing around with models, and thinking, “Okay, yeah, this is where I think the model’s failing, this is the kind of behavior you want the model to have instead,” and just this aspect of being very hands-on and thinking about the qualitative aspects of models and not just the quantitative parts. So again, it’s like this aspect of being hands-on with data and not just caring about these kind of abstract algorithms.

Lenny Rachitsky: Awesome.

I want to ask a couple broad AI kind of market questions. What else do you think is coming in the next couple of years that people are maybe not thinking enough about or not expecting in terms of where AI is heading? What’s going to matter?

Edwin Chen: I think one of the things that’s going to happen in the next few years is that the models are actually going to become increasingly differentiated because of the personalities and behaviors that the different labs have and the kind of objective functions that they are optimizing their models for. I think it’s one thing I didn’t appreciate a year or so ago.

A year or so ago, I thought that all of the AI models would essentially become very commoditized. They would all behave like each other, and sure, one of them might be slightly more intelligent in one way today, but sure, the other ones would catch up in the next few months. But I think over the past year, I’ve realized that the values that the companies have will shape the model.

So let me give you an example. So I was asking Claude to help me draft an email the other day, and it went through 30 different versions. And after 30 minutes, yeah, I think it really crafted me the perfect email, and I sent it. But then I realized that I spent 30 minutes doing something that didn’t matter at all. Sure, now I got the perfect email, but I spent 30 minutes doing something I wouldn’t have worried at all before, and this email probably didn’t even move the needle on anything anyways.

So I think there’s a deep question here, which is, if you could choose the perfect model behavior, which model would you want? Do you want a model that says, “You’re absolutely right. There are definitely 20 more ways to improve this email,” and it continues for 50 more iterations. And it sucks up all your time and engagement. Or do you want a model that’s optimizing for your time and productivity and just says, “No, you need to stop. Your email’s great. Just send it and move on with your day”?

And again, just because… In the same way that there’s like a kind of a fork in a road between how you could choose how your model behaves for this question, it’s like for every other question that models have, the kind of behavior that you want will fundamentally affect it.

It’s almost like in the same way that when Google builds a search engine, it’s very different from how Facebook would build a search engine, which is very different from how Apple would build a search engine. They all have their own principles and values and things that they’re trying to achieve in the world that shape all the products that they’re going to build. And in the same way, I think all the [inaudible 00:50:40] will start behaving very differently too.

Lenny Rachitsky: That is incredibly interesting. You already see that with Grok. It’s got a very different personality and a very different approach to answering questions. And so what I’m hearing is you’re going to see more of this differentiation.

Edwin Chen: Yep.

Lenny Rachitsky: Kind of another question along these lines, what do you think is most under-hyped in AI that you think maybe people aren’t talking enough about that is really cool? And what do you think is over-hyped?

Edwin Chen: So I think one of the things that’s under-hyped is the built-in products that all of the chatbots are going to start having. I’ve always been a huge fan of Claude’s artifacts. And I think it just works really well. And actually the other day, I don’t know if it’s a new feature or not, but I asked it to help me create an email, and then it just created… So it didn’t quite work because it didn’t allow me to send the email. But what it created instead was like a little, I don’t know what we call it, like a little box where I could click on it and it would just text someone with this message. And I think that concept of taking artifacts to the next level where you just have these mini apps, mini UIs within the chatbots themselves, I feel like people aren’t talking enough about that. So I think that that’s one under-hyped area.

And in terms of over-hyped areas, I definitely think that vibe coding is over-hyped. I think people don’t realize how much it’s going to make their systems unmaintainable in the long term when they simply dump this code into their code bases as long as it seems to work right now, so I kind of worry about the future of coding. It’s just going to keep on happening.

Lenny Rachitsky: These are amazing answers. On that first point, there’s something I actually asked. I had the chief product officers of OpenAI and Anthropic, Kevin Weil and Mike Krieger, on the podcast, and I asked them just like, “As a product team, you have this gigabrain intelligence. How long do you even need product teams?” You think this AI will just create the product for you. “Here’s what I want.” It’s like the next level of vibe coding. It’s just like tell it, “Here’s what I want,” and it’s just building the product and evolving the product as you’re using it. And it feels like that’s what you’re describing is where we might be heading.

Edwin Chen: Yeah, I think there’s a very powerful notion where it helps people just achieve their ideas in a much cooler way.

Lenny Rachitsky: Something we haven’t gotten into that I think is really interesting is just the story of how you got to starting Surge. You have a really unique background. I always think about these… Brian Armstrong, the founder of Coinbase, once gave this talk that has really stuck with me where he kind of talked about how his very unique background allowed him to start Coinbase. He had an economics background, he had cryptography experience, and then he was an engineer. And it’s like the perfect Venn diagram for starting Coinbase, and I feel like you have a very similar story with Surge. Talk about that, your background there, and how that led to Surge.

Edwin Chen: Going way back, I was always fascinated by math and language when I was a kid. I went to MIT because it’s obviously one of the best places for math and CS, but also because it’s the home of Noam Chomsky. My dream in school was actually to find some underlying theory connecting all these different fields.

And then I became a researcher at Google, and Facebook, and Twitter, and I just kept running into the same problem over and over again. It was impossible to get the data that we needed to train our models. So I was always this huge believer in the need for high quality data, and then GPT-3 came out in 2020. And I realized that, yeah, if we wanted to take things to the next level and build models that could code, and use tools, and tell jokes, and write poetry, and solve [inaudible 00:54:12], and cure cancer, then yeah, we were going to need a completely new solution.

The thing that always drove me crazy when I was at all these companies was we had the full power of the human mind in front of us, and all the data solutions out there were focused on really simple things like image labeling. So I wanted to build something focused on all these advanced, complex use cases instead that would really help us build our next generation models. So yeah, I think my background across math, and computer science, and linguistics really informed what I always wanted to do, and so I started Surge a month later with our one mission to basically build the use cases that I thought were going to be needed to push the frontier of AI.

Lenny Rachitsky: And you said a month later, a month later after what?

Edwin Chen: After the GPT-3 launch in 2020.

Lenny Rachitsky: Oh, okay. Wow. Okay. Yeah. A great decision.

What just kind of drives you at this point of… Other than just the epic success you’re having, what keeps you motivated to keep building this and building something in this space?

Edwin Chen: I think I’m a scientist at heart. I always thought I was going to become this math or CS professor and work on trying to understand the universe, and language, and the nature of communication. It’s kind of funny, but I always had this fanciful dream where if aliens ever came to visit Earth and we need to figure out how to communicate with them, I wanted to be the one the government would call. And I’d use all this fancy math, and computer science, and linguistics to decipher it.

So even today, what I love doing most is every time a new model is released, we’ll actually do a really deep dive into the model itself. I’ll play around with it, I’ll run evals, I’ll compare where it’s improved, where it’s regressed, I’ll create this really deep dive analysis that we send our customers. And it’s actually kind of funny because a lot of times we’ll say it’s from our data science team, but often it’s actually just from me.

And I think I could do this all day. I have a very hard time being in meetings all day. I’m terrible at sales, I’m terrible at doing the typical CEO things that people expect you to do, but I love writing these analyses. I love jamming with our research team about what we’re seeing, sometimes I’ll be up until 3:00 AM just talking on the phone with somebody on the research team and [inaudible 00:56:12] model. So I love that I still get to be really hands-on, working on the data and the science all day. And I think what drives me is that I want Surge to play this critical role in the future of AI, which I think is also the future of humanity. We have these really unique perspectives on data, and language, and quality, and how to measure all of this, and how to ensure it’s all going on the right path. And I think we’re uniquely unconstrained by all of these influences that can sometimes steer companies in a negative direction.

Like what I was saying earlier, we built Surge a lot more like a research lab than a typical startup. So we care about curiosity and long-term incentives and intellectual rigor, and we don’t care as much about quarterly metrics and what’s going to look good in a [inaudible 00:56:56]. And so my goal is to take all these unique things about us as a company and use that to make sure that we’re shaping AI in a way that’s really beneficial for our species in the long term.

Lenny Rachitsky: What I’m realizing in this conversation is just how much influence you have and companies like yours have on where AI heads. You help labs understand where they have gaps and where they need to improve. Everyone looks at the heads of OpenAI and Anthropic and all these companies as the ones ushering in AI, but what I’m hearing here is you have a lot of influence on where things head too.

Edwin Chen: Yeah, I think there’s this really powerful ecosystem where, honestly, people just don’t know where models are headed and how they want to shape them yet and how they want humanity kind of like [inaudible 00:57:47] play a role in the future of all of this, and so I think there’s a lot of opportunity to just continue shaping the discussion.

Lenny Rachitsky: Along that thread, I know you have a very strong thesis on just why this work matters to humanity and why this is so important, talk about that.

Edwin Chen: I’ll get a bit philosophical here, but I think the question itself is a bit philosophical, so bear with me. So the most straightforward way of thinking about what we do is we train and evaluate AI. But there’s a deeper mission that I often think about, which is helping our customers think about their dream objective functions. Like yeah, what kind of model do they want their model to be? And once we help them do that, we’ll help them train their model to reach their north star and we’ll help them measure that progress. But it’s really hard because objective functions are really rich and complex. It’s kind of like the difference between having a kid and asking them, “Okay, what test do you want to pass? Do you want them to get a high score on SAT and write a really good college essay?” That’s a simplistic version versus what kind of person do you want them to grow up to be? Will you be happy if they’re happy no matter what they do or are you hoping they’ll go to a good school and be financially successful?

And again, if you take that notion, it’s like, okay, how do you define happiness? How do you measure whether they’re happy? How do you measure whether they’re financially successful? It’s a lot harder than something like measuring whether or not you’re getting a high score on the SAT, and what we’re doing is we want to help our customers reach, again, their dream north stars and figure out how to measure them. And so I talked about this example of what you want models to do when you’re asking them to write 50 different iterations of an email. Do you just continue them for 50 more or do you just say, “No, just move on [inaudible 00:59:25] because this is perfect enough.” And the broader question is, are we building these systems that actually advance humanity? And if so, how do we build the data sets to train towards that and measure it? Or are we optimizing for all of these wrong things, just systems that suck up more and more of our time and make us lazier and lazier?

And yeah, I think it’s really relevant to what we do because it’s very hard to define and measure whether something is genuinely advancing humanity. It’s very easy to measure all these proxies instead, like clicks and likes. But I think that’s why our work is so interesting. We want to work on the hard, important metrics that require the hardest types of data and not just the easy ones. So I think one of the things I often say is you are your objective function. So we want the rich, complex objective functions and not these simplistic proxies. And our job is to figure out how to get the data to match this.

So yeah, we want data, we want metrics that measure whether AI is making your life richer. We want to train our systems this way. And we want tools that make us more curious and more creative, not just lazier. And it’s hard because, yeah, humans are kind of inherently lazy, so AI slop is the easiest way to get engagement and make all your metrics go up. So I think this question about choosing the right objective functions and making sure that we’re optimizing towards them and not just these easy proxies is really important to our future.

Lenny Rachitsky: Wow. I love how what you’re sharing here gives you so much more appreciation of the nuances of building AI, training AI, the work that you’re doing.

From the outside, people could just look at Surge and companies in the space of, okay, cool. They’re just creating all this data, feeding it to AI. But clearly there’s so much to this that people don’t realize, and I love knowing that you’re at the head of this, that someone like you is thinking through this so deeply.

Maybe one more question, is there something you wish you’d known before you started Surge? A lot of people start companies, they don’t know what they’re getting into. Is there something you wish you could tell your earlier self?

Edwin Chen: Yeah, so I definitely wish I’d known that you could build a company by being heads down and doing great research and simply building something amazing. And not by constantly tweeting and hyping and fundraising. It’s kind of funny, but I never thought I wanted to start a company. I love doing research. And I was actually always a huge fan of DeepMind because they were this amazing research company that got bought and still managed to keep on doing amazing science. But I always thought that they were this magical ILR unicorn. So I thought if I started a company, I’d have to become a business person looking at financials all day and being in meetings all day and doing all this stuff that sounded incredibly boring and I always hated. So I think it’s crazy that didn’t end up being true at all. I’m still in the weeds in the data every day. And I love it. I love that I get to do all these analyses and talk to researchers. And it’s basically applied research where we’re building all these amazing data systems that have really pushed the frontier of AI.

So yeah, I wish I’d known that you don’t need to spend all your time fundraising. You don’t need to constantly generate hype. You don’t need to become someone you’re not. You can actually build a successful company by simply building something so good that it cuts through all that noise. And I think if I’d known this was possible, I would’ve started even sooner, so I [inaudible 01:02:18] that.

Lenny Rachitsky: And that is such an amazing place to end. I feel like this is exactly what founders need to hear, and I think this conversation’s going to inspire a lot of founders, and especially a lot of founders that want to do things in a different way. Before we get to a very exciting lightning round, is there anything else you wanted to share? Anything else you want to leave our listeners with? We covered a lot of ground, it’s totally okay to say no as well.

Edwin Chen: I think the thing I would end with is I think a lot of people think of data labeling as this really simplistic work, like labeling cat photos and drawing bounding boxes around cars. And so I’ve actually always hated the word data labeling because it just paints this very simplistic picture when I think what we’re doing is completely different. I think about what we’re doing as a lot more like raising a child. You don’t just feed a child information. You’re teaching them values, and creativity, and what’s beautiful, and these infinite subtle things about what makes somebody a good person. And that’s what we’re doing for AI. So yeah, I just often think about what we’re doing as almost like the future of humanity or how we’re raising humanity’s children, so I’ll leave it at that.

Lenny Rachitsky: Wow. I love just how much philosophy there is in this whole conversation that I was not expecting.

With that, Edwin, we’ve reached our very exciting lightning round, I’ve got five questions for you. Are you ready?

Edwin Chen: Yep, let’s go.

Lenny Rachitsky: Here we go. What are two or three books that you find yourself recommending most to other people?

Edwin Chen: Yes, so three books I often recommend are, first, Story of Your Life by Ted Chiang. It’s my all time favorite short story and it’s about a linguist learning an alien language, and I basically reread it every couple years.

Lenny Rachitsky: And that’s what the Interstellar was about? Is that…

Edwin Chen: Yeah, so there’s a movie called Arrival…

Lenny Rachitsky: Arrival.

Edwin Chen: … which was based off of the story,

Lenny Rachitsky: Yes, [inaudible 01:04:03]-

Edwin Chen: … which I love as well.

Lenny Rachitsky: Great. Okay, keep going.

Edwin Chen: And then second, The Myth of Sisyphus by Camus. I actually can’t really explain why I love this, but I always find the final chapter somehow really inspiring.

And then third, Le Ton beau de Marot by Douglas Hofstadter. And so I think Gödel, Escher, Bach is his more famous book, but I’ve actually always loved this one better. It basically takes a single French poem and translates it 89 different ways and discusses all the motivations behind each translation. And so I’ve always loved the way it embodies this idea that translation isn’t this robotic thing that you do. Instead, there’s a million different ways to think about what makes a high quality translation, which mirrors a lot of the ways I think about data and quality in LLMs.

Lenny Rachitsky: All these resonate so deeply with the way, with all the things we’ve been talking about, especially that first one, if that was your goal after school is like, “I want to help translate alien language.” I’m not surprised you love that short story.

Next question, do you have a favorite recent movie or TV show you’ve really enjoyed?

Edwin Chen: One of my new all time favorite TV shows is something I found recently, it’s called Travelers. It’s basically about a group of travelers from the future who are sent back in time to prevent their [inaudible 01:05:11]. Sorry, I just wrote that [inaudible 01:05:13] section.

And then I actually just rewatched Contact, which is one of my all time favorite movies. So yeah, I think one of the things you’ll notice about me is that, yeah, I love any kind of book or film that involves scientists deciphering alien communication. Again, just this dream I always had as a kid.

Lenny Rachitsky: That’s so funny [inaudible 01:05:29].

Okay, is there a product you’ve recently discovered that you really love?

Edwin Chen: So it’s funny, but I was in SF earlier this week and I finally took Waymo for the first time. Honestly, it was magical and it really felt like living in the future.

Lenny Rachitsky: Yeah, it’s like the thing that… People hype it like crazy, but it always exceeds your expectations.

Edwin Chen: Yeah, it deserves the hype. It was crazy. Yeah, it’s absurd. It’s like, holy moly. If you’re not in SF, you don’t realize just how common these things are. They’re just all over the place. Just driverless cars constantly going about, and when you go to an event at the end, there’s just all these Waymos lined up picking people up.

Lenny Rachitsky: Yeah. Waymo, good job. Good job over there.

Do you have a favorite life motto that you find yourself coming back to in work or in life?

Edwin Chen: So I think I mentioned this idea that founders should build a company that only they could build. Almost like it’s this destiny that their entire life, and experiences, and interests shape them towards. And so I think that principle applies pretty broadly, not just to founders, but to anyone creating things, I think.

Lenny Rachitsky: Well, let me follow that thread to un-lightning this answer. Do you have any advice for how to build those sorts of experiences that help lead to that? Is it to follow things that are interesting to you? Because it’s easy to say that, but it’s hard to actually acquire these really unique sets of experiences that allow you to create something really important.

Edwin Chen: Yeah, so I think it would always be to really follow your interests and do what you love, and it’s almost like a lot of decisions I make about Surge. I think one of the things that I didn’t think about a couple years ago, but then someone said it to me, it’s that companies in a sense are an embodiment of their CEO. And it’s kind of funny. I hadn’t thought about that because I never quite knew what a CEO did. I always thought a CEO was kind of generic and it’s like, okay, you’re just doing whatever VPs, and your board, and whatever, tell you to do and you’re just saying yes to decisions. But instead, it’s this idea where when I think about certain big, hard decisions we have to make, I don’t think what would the company do, I don’t think what metrics are we trying to optimize, I just think, “What do I personally care about? What are my values? And what do I want to see happen in the world?”

And so I think following that idea about… Okay, so ask yourself, what are the values you care about? What are the things you’re trying to shape, and not what will look good on a dashboard? I think that’s pretty important.

Lenny Rachitsky: I love how just you’re just full of endless, beautiful, and very deep answers.

Final question. Something that you got quite famous for before starting Surge is you built this map while you were at Twitter that showed a map of the world and what people called, whether they called it soda or pop. I don’t know if it’s called Soda Pop. What was the name of this map?

Edwin Chen: Yeah, it was like the Soda Versus Pop dataset.

Lenny Rachitsky: Soda Versus Pop.

Edwin Chen: [inaudible 01:08:17]

Lenny Rachitsky: And so it’s like a map of the United States and it tells you where people say pop versus soda, so do you say soda or pop?

Edwin Chen: So I say soda, I’m a soda person.

Lenny Rachitsky: Okay. And is that just like that’s the right answer or it’s like whatever you are, it’s totally fine.

Edwin Chen: I think I’ll look at you a little bit funny if you say pop, and I’ll wonder where you came from, but I won’t scorn you too much.

Lenny Rachitsky: That’s how I feel too.

Edwin, this was incredible. This was such an awesome conversation. I learned so much. I think we’re going to help a lot of people start their own companies, help their companies become more aligned with their values and just building better things.

Few final questions, where can folks find you online if they want to reach out? What roles are you hiring for? How can listeners be useful to you?

Edwin Chen: Yeah, so I used to love writing a blog, but I haven’t had time in the past few years. But I am starting to write again, so definitely check out the Surge blog, surgehq.ai/blog, and yeah, hopefully I’ll be writing a lot more there. And I would say we’re definitely always hiring, so for people who just love data and people who love this intersection of math, and language, and computer science, definitely reach out anytime.

Lenny Rachitsky: Awesome. And how can listeners be useful to you? Is it just, I don’t know, yeah, is there anything there? Any asks?

Edwin Chen: So I would say definitely tell me blog topics that you’d like me to write about…

Lenny Rachitsky: Okay.

Edwin Chen: … and then I’m always fascinated by all of these AI failures that happen in the real world. So whenever you come across a really interesting failure that I think illustrates some deep question about how we want models to behave, there’s just so many different ways a model can respond, I just oftentimes think there’s just not a single right answer. And so whenever there’s one of these examples, I just love seeing them.

Lenny Rachitsky: You need to share these on your blog. I’m also… I would love to see these.

Edwin, thank you so much for being here.

Edwin Chen: Thank you.

Lenny Rachitsky: Bye everyone.

Thank you so much for listening. If you found this valuable, you can subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Also, please consider giving us a rating or leaving a review as that really helps other listeners find the podcast. You can find all past episodes or learn more about the show at lennyspodcast.com. See you in the next episode.

Glossary

English	中文
A/B test	A/B 测试
agent	智能体
AGI	通用人工智能（AGI）
AI slop	AI 垃圾（指低质量、无实质价值的 AI 生成内容）
Artifacts	Artifacts（Claude 聊天界面中内置的交互式内容生成功能）
benchmark	基准测试
bitter lesson	苦涩的教训（bitter lesson，Richard Sutton 提出的关于 AI 研究的重要观点）
blitz scale	闪电扩张
bootstrapped	自筹资金（不依赖外部风险投资自主运营）
Camus	加缪
CEO	CEO（首席执行官）
checkpoint	检查点
dark patterns	暗黑模式
dashboard	仪表盘
data labeling	数据标注
end-to-end	端到端
evals	评估（evals）
forward-deployed researcher	一线部署研究员
frontier AI lab	前沿 AI 实验室
generational companies	世代级公司
grift	投机把戏
hallucinate	幻觉（指 AI 捏造虚假信息）
hill-climb	爬山式优化
IMO	国际数学奥林匹克（IMO）
L6	L6（科技公司的工程师职级）
leaderboard	排行榜
LLM Arena	LLM Arena（大模型竞技场排行榜）
LLM wrapper	LLM 套壳
Noam Chomsky	乔姆斯基（MIT 著名语言学家）
north star	北极星（指引方向的核心目标）
objective function	目标函数
one-shot	一步到位
pivot	pivot（保持原文动词形式）
post training	后训练
product market fit	产品市场匹配
retro	复盘报告（retro，指回顾性总结文档）
reward-hack	钻奖励机制的空子（指模型通过非预期方式获得奖励）
RL environments	强化学习环境
RLHF	基于人类反馈的强化学习（RLHF）
rubrics	评分标准
SAT	SAT（美国大学入学标准化考试）
SFT	监督微调（SFT）
synthetic data	合成数据
tabloid	小报
Ted Chiang	特德·姜（华裔科幻作家，原文记录为 “Ted Chang”）
tool calling	工具调用
trajectory	轨迹
unit test	单元测试
VC	风险投资（VC）
verifiers	验证器
vibe coding	氛围编程（指不经仔细设计，仅凭直觉让 AI 生成代码的编程方式）
Waymo	Waymo（谷歌旗下的自动驾驶出行公司）
word of mouth	口碑


The $1B AI company training ChatGPT, Claude & Gemini on the path to responsible AGI | Edwin Chen

Interview Transcript

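Lenny Rachitsky: You guys hit a billion in revenue in less than four years with around 60 to 70 people. You’re completely bootstrapped, haven’t raised any VC money. I don’t believe anyone has ever done this before.

Edwin Chen: We basically never wanted to play the Silicon Valley game. I always thought it was ridiculous. I used to work at a bunch of the big tech companies and I always felt that we could fire 90% of the people and we would move faster, because the best people wouldn’t have all these distractions. So when we started Surge, we wanted to build it completely differently, with a super small, super elite team.

Lenny Rachitsky: You guys are by far the most successful data company out there.

Edwin Chen: We essentially teach AI models what’s good and what’s bad. People don’t understand what quality even means in this space. They think you can just throw bodies at a problem and get good data, and that’s completely wrong.

Lenny Rachitsky: To a regular person, it doesn’t feel like these models are getting that much smarter constantly.

Edwin Chen: Over the past year, I’ve realized that the values that the companies have will shape the model. I was asking Claude to help me draft an email the other day. And after 30 minutes, yeah, it really crafted me the perfect email and I sent it. But then I realized that I spent 30 minutes doing something that didn’t matter at all. If you could choose the perfect model behavior, which model would you want? Do you want a model that says, “You’re absolutely right. There are definitely 20 more ways to improve this email,” and continues for 50 more iterations? Or do you want a model that’s optimizing for your time and productivity and just says, “No. You need to stop. Your email’s great. Just send it and move on”?

Lenny Rachitsky: You have this hot take that a lot of labs are pushing AGI in the wrong direction.

Edwin Chen: I’m worried that instead of building AI that will actually advance us as a species, curing cancer, solving poverty, understanding the universe, we are optimizing for AI slop instead. We’re basically optimizing our models for the types of people who buy tabloids at the grocery store. We’re basically teaching our models to chase dopamine instead of truth.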

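Introduction

Lenny Rachitsky: Today my guest is Edwin Chen, founder and CEO of Surge AI. Edwin is an extraordinary CEO and Surge is an extraordinary company. They are the leading AI data company, powering training at every frontier AI lab. They are also the fastest company ever to reach a billion dollars in revenue, just four years after launch, with fewer than 100 people, and completely bootstrapped. They have never raised a dollar of VC money and have been profitable from day one. As you’ll hear in this conversation, Edwin has a very different take on how to build an important company and how to build AI that is truly useful and good for humanity. I loved this conversation and learned a ton, and I’m excited to share it with you. If you enjoy this podcast, don’t forget to subscribe and follow it in your favorite podcasting app or on YouTube. It helps tremendously.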

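$1B Revenue With Under 100 People

Lenny Rachitsky: Edwin, thank you so much for being here, and welcome to the podcast.

Edwin Chen: Thanks so much for having me. I’m super excited.

Lenny Rachitsky: I want to start with just how incredible what you’ve achieved is. A lot of people and a lot of companies talk about using AI to build huge businesses at scale with very few people, and you’ve done it in a way no one has before. You hit a billion in revenue in less than four years with around 60 to 70 people, completely bootstrapped, without raising any VC money. I don’t believe anyone has ever done this before. So you’re actually living what people say will happen in the AI era. I’m curious, do you think this will become more and more common as AI develops? And where has AI given you the most leverage to make all of this possible?

Edwin Chen: Yeah, we hit over a billion in revenue last year with under 100 people. I think we’re going to see companies with even crazier ratios over the next few years, something like $100 billion per employee. AI is only going to get better and make everything more efficient, so that ratio is inevitable.

I used to work at a bunch of the big tech companies and I always felt that we could fire 90% of the people and we would move faster, because the best people wouldn’t have all these distractions. So when we started Surge, we wanted to build it completely differently, with a super small, super elite team. And the crazy thing is that we actually pulled it off. So I think two forces are colliding.

One is that people are realizing you don’t need to build a giant organization in order to win.

The other is all of the efficiency gains from AI. Together they’re going to lead to a golden age of company building.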

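The Evolution Of Company Structures

Edwin Chen: What excites me is that the types of companies are going to change too. It won’t just be that they’re smaller; we’ll see fundamentally different companies. Think about it: fewer employees means you need less capital. Less capital means you don’t need to raise. So the founders of the future won’t be the ones who are great at pitching and hyping, they’ll be the ones who are really great at technology and product.

Instead of products optimized for revenue and for what VCs want to see, you’ll get more interesting products built by these small, focused teams. They’ll be building things they actually care about, real technology and real innovation. So I really hope the glossy shell falls away and it all goes back to being a place for hackers.

Lenny Rachitsky: You’ve taken the opposite path on so many things. For example, no viral posts on LinkedIn, no constant promotion of Surge on Twitter. I think most people hadn’t heard of Surge until recently, and then you just came out of nowhere as the fastest growing company to hit a billion in revenue. Why did you do it that way? I imagine it was very intentional.

Edwin Chen: We basically never wanted to play the Silicon Valley game. I always thought it was ridiculous. What was your dream when you were a kid? Was it to build a company from scratch yourself and spend every day deep in your code and your product? Or was it to explain every decision to VCs and get on this hamster wheel of PR and fundraising? It definitely made things harder for us. When you raise, you naturally become part of the Silicon Valley industrial machine: your VCs tweet about you, you get the TechCrunch headlines, you get announced in all the media because you raised at some massive valuation. So it was harder for us, because the only way we were going to succeed was by building a 10x better product and getting word of mouth among researchers. But I think it also meant that our customers were people who really understood data and really cared about it.

I always thought it was really important for our early customers to be deeply aligned with what we were building, to really care about high-quality data and really understand how it would make their AI models better. They were the ones helping us and giving us feedback on what we produced. So that mission alignment with our customers actually helped us early on. These were people buying our product because they knew how different it was and because it was actually helping them, not because they saw something in an article. So it made things harder for us, but in a good way.

Lenny Rachitsky: It’s such an empowering story for founders: they don’t need to be on Twitter all day promoting what they’re doing, they don’t need to raise money, they can just go heads down and build. So I love the Surge story. For people who don’t know what Surge does, can you give a quick explanation of what Surge is?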

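What Does Surge AI Do?

Edwin Chen: We essentially teach AI models what’s good and what’s bad. We train them using human data, and we have a lot of different products, like SFT, RLHF, rubrics, verifiers, RL environments, and so on, and we also measure how well the models are progressing. So essentially we’re a data company.

Lenny Rachitsky: You often talk about quality being the big reason you’ve been so successful, the quality of the data. What does it take to create higher quality data? What do you do differently? What are others missing?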

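What Quality Really Means

Edwin Chen: I think most people don’t understand what quality even means in this space. They think you can just throw bodies at a problem and get good data, and that’s completely wrong. Let me give you an example.

Imagine you want to train a model to write an eight-line poem about the moon. What makes it a high-quality poem? If you don’t think deeply about quality, you’ll say, “Is this a poem? Does it have eight lines? Does it contain the word ‘moon’?” You check all those boxes and you say, sure, this is a great poem. But that’s completely different from what we want. We’re looking for Nobel Prize-winning poetry. Is this poem unique? Is it full of subtle imagery? Does it surprise you and tug at your heart? Does it teach you something about the nature of moonlight? Does it play with your emotions? Does it make you think? That’s what we have in mind when we talk about a high-quality poem.

It might be a haiku about moonlight on water, or it might use internal rhyme and meter. There are a thousand ways to write a poem about the moon, and each one gives you different insights into language, imagery, and human expression. Thinking about quality this way is really hard: it’s hard to measure, and it’s very subjective, complex, and rich. It sets a really high bar. So we have to build a lot of technology to measure it, with thousands of signals on every worker and thousands of signals on every project and every task. We know whether you’re good at writing poetry, or essays, or technical documentation. So we have to gather all these signals about your background and your expertise, and not only that, but also how you’re actually performing when you write. We use those signals to decide whether you’re a good fit for these projects and whether you’re actually improving the models.

It’s really hard. Building all this technology to measure quality is really hard, but I think it’s exactly what we want AI to be able to do, so we’ve always had these deep notions of quality and we’re always working to achieve them.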
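To make the contrast concrete, here is a minimal Python sketch of the checkbox style of evaluation described in the poem example. The function name `checklist_score` and the sample poem are hypothetical illustrations, not anything Surge has described building. A deliberately bland poem passes every check, which is the point: the checks say nothing about imagery, surprise, or emotion.

```python
def checklist_score(poem: str) -> dict[str, bool]:
    """Naive 'checkbox' evaluation: only counts lines and looks for a keyword."""
    lines = [line for line in poem.splitlines() if line.strip()]
    return {
        "has_eight_lines": len(lines) == 8,
        "mentions_moon": "moon" in poem.lower(),
    }


# A hypothetical, intentionally dull poem: the same line repeated eight times.
bland_poem = "\n".join(["The moon is in the sky tonight."] * 8)

checks = checklist_score(bland_poem)
print(checks)
print("passes every checkbox:", all(checks.values()))
```

Passing both checks tells you the output has the right shape, not that it is a good poem, which is why the richer signals discussed in the answer are needed on top of checks like these.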

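Lenny Rachitsky: So what I’m hearing is that you go much deeper on what quality means within the verticals where you sell data. Does that mean you hire people who are incredibly talented at poetry, plus evals that they help write to judge whether something is great? What are the mechanics of that?

Edwin Chen: The way it works is that when you’re working on the platform, we gather a huge number of signals about everything you do. We look at your keystrokes, we look at how fast you answer questions, we use reviews, we use code standards, we use... We’re also training models ourselves on everything you produce, and then we see whether those models improve as a result.

It’s very similar to how Google search works. When Google search is trying to determine what a good webpage is, there are basically two aspects. One is that you want to remove all of the worst webpages: all the spam, all the low-quality content, all the pages that don’t load. It’s almost a content moderation problem, where you’re just clearing out the worst of the worst.

But you also want to discover the very best. Like, this is the best webpage, or the best person for this job. They’re not someone who just writes high-school-level poetry. They’re not someone who mechanically satisfies all the requirements and checks off every explicit instruction, but someone who writes poetry that moves you. So we also have all these signals that are completely different from the ones for removing the worst content, because we’re finding the best. So we have all these signals...

Just like Google search feeds all kinds of signals into its machine learning algorithms to predict certain types of results, we do the same with all of our workers, all of our tasks, and all of our projects. So at the end of the day it’s almost like a complicated machine learning problem, and that’s how it works.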

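Why Claude Has Led In Coding And Writing For So Long

Lenny Rachitsky: That is really interesting.

There’s something I’ve been curious about for the past couple of years. If you look at Claude, it’s been far ahead of every other model at coding and writing for a long time. It’s surprising how long it has taken other companies to catch up, considering how much economic value there is: almost every AI coding product has been built on Claude because it’s so good, and that’s true for both coding and writing. What makes it so good? Is it just the quality of the training data, or is there something else?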

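Edwin Chen: I think there are multiple factors. One big factor is definitely the data. I think people don’t realize that all the frontier AI labs face an almost infinite number of choices when deciding what data goes into their models. Like, are you using purely human data? Are you gathering that human data in a particular way? When you gather human data, what exactly are you asking the people creating it to create for you?

For example, in coding, you might care more about front end or about back end. If it’s front end coding, maybe you care a lot about the visual design of the front end applications being created, or maybe you care less about that and more about efficiency or pure code correctness.

There are other questions too, like how much synthetic data are you mixing in? How much do you care about these twenty different benchmarks?

Some companies see these benchmarks and think, “Okay, for PR purposes, even though we don’t think these academic benchmarks matter that much, we probably still have to optimize for them, because our marketing team needs to show progress on certain standard evaluations that every other company is talking about. If we don’t show good performance there, it’s going to hurt us, even if ignoring these academic benchmarks would make us better at real tasks.”

Other companies are more principled: “Okay, I don’t care about marketing. I only care about how my model ultimately performs on real-world tasks, so I’m going to optimize for that instead.”

It’s almost like there are trade-offs between all of these factors, and...

One thing I often think about is that post training is almost an art, not purely a science. When you’re deciding what kind of model you want to create and what it’s good at, there’s this notion of taste and sophistication, like, “Okay, do I think that these...”

Going back to the example of how good the model is at visual design, I think, “Okay, maybe you have a different idea of visual design than I do. Maybe you care more about minimalism and 3D animations than I do. Maybe someone else prefers something with a rougher look.” When you’re designing your post training mix, you have to decide among all these judgments of taste and sophistication, and that matters too.

So long story short, I think there are all these different factors. Data is definitely a big part of it, but so is the objective function you’re optimizing your model towards.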

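Lenny Rachitsky: That is so interesting. The taste of the people leading this work shapes what data they ask for and what data they feed the model. But it also shows the value of great data. Anthropic has basically gotten so much growth and success out of better data.

Edwin Chen: Yep, exactly.

Lenny Rachitsky: And I can see why companies like yours are growing so fast. There’s so much demand... and that’s just one vertical, just coding, and there’s probably a similar area for writing. I love that AI feels like this artificial, computerized, binary thing, yet taste is at the core. Human judgment is still such a key factor in whether these systems succeed.

Edwin Chen: Yep, exactly. Again, going back to my earlier example, certain companies, if you ask them what a good poem is, will simply and mechanically check off all the instructions on a list.

But I don’t think that makes for good poetry. So certain frontier AI labs, the ones with more taste and sophistication, realize that good poetry doesn’t reduce to a set of six checkboxes, and they consider all these implicit, very subtle qualities. I think that’s ultimately why they do better.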

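How Much To Trust Benchmarks

Lenny Rachitsky: You mentioned benchmarks. This is something a lot of people worry about: all these models seem... basically it feels like every model is now better than humans in every STEM field, but to a regular person, it doesn’t feel like these models are getting that much smarter constantly. How much do you trust benchmarks? How correlated are they with actual AI progress?

Edwin Chen: I don’t trust the benchmarks at all. There are two reasons. One is that I think a lot of people don’t realize, even researchers within the community, that the benchmarks themselves are often just wrong. They have wrong answers and they’re full of all kinds of messiness. People trust them... For the popular ones, people have maybe realized this to some extent, but the vast majority have all these flaws that people don’t notice. So that’s one part of it.

The other part is that these benchmarks, at the end of the day, often have well-defined objective answers, which makes them very easy for models to hill-climb on in a way that’s very different from the messiness and ambiguity of the real world.

One thing I often say is that it’s kind of crazy that these models can win IMO gold medals but still have trouble parsing PDFs. That’s because, even though IMO gold medals seem hard to the average person, and they are hard, they have this notion of objectivity that parsing a PDF sometimes doesn’t have. So it’s easier for the frontier AI labs to hill-climb on these benchmarks than to solve all these messy, ambiguous problems in the real world. So I think there’s a lack of direct correlation between benchmarks and real capability.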

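Benchmarks And Marketing

Lenny Rachitsky: It’s interesting the way you describe it, that hitting these benchmarks is kind of like marketing. Like when Gemini 3 just launched, it’s like, cool, number one on all these benchmarks. Is that what’s actually happening? They just get their models to be good at these very specific things?

Edwin Chen: Yeah, so there are maybe two parts to this again. One is that sometimes these benchmarks accidentally leak in certain ways, or the frontier labs will tweak the way they evaluate their models on these benchmarks: they’ll tweak the system prompt, they’ll tweak the number of times they run the model, and so on, in a way that games these benchmarks.

The other part is that when you optimize for the benchmark instead of optimizing for the real world, you’ll naturally keep climbing on the benchmark, and that’s basically another form of gaming it.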
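As a rough illustration of why the number of runs matters, suppose a model solves a given problem with probability p on any single run and that runs are independent (both are simplifying assumptions). If a problem counts as solved whenever any one of k runs succeeds, the reported rate becomes 1 - (1 - p)^k. The function name below is hypothetical, and nothing here describes how any particular lab actually reports its results.

```python
def best_of_k_pass_rate(p: float, k: int) -> float:
    """Probability that at least one of k independent runs succeeds."""
    if not 0.0 <= p <= 1.0 or k < 1:
        raise ValueError("need 0 <= p <= 1 and k >= 1")
    return 1.0 - (1.0 - p) ** k


# Same underlying model (30% single-run success), different reporting choices.
for k in (1, 5, 20):
    print(f"k={k:>2}: reported pass rate = {best_of_k_pass_rate(0.30, k):.3f}")
```

The underlying model is identical in all three rows; only the evaluation protocol changes, which is why two scores are hard to compare unless the number of runs is stated.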

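How To Measure Real Progress

Lenny Rachitsky: Knowing that, how do you get a sense of whether we’re heading towards AGI? How do you measure progress?

Edwin Chen: The way we really care about measuring model progress is by running all these human evaluations. For example, what we do is have our human annotators go and have conversations with the model. You might talk to the model across all these different topics. If you’re a Nobel Prize-winning physicist, you go have a conversation about pushing the frontier of your own research. If you’re a teacher creating lesson plans for your students, you go talk to the model about that. Or if you’re a coder working at one of these big tech companies and you run into all these problems every day, you go talk to the model and see how much it helps you.

And because our annotators are experts at the top of their fields, they’re not just giving a rating and moving on; they’re working through the model’s responses in depth themselves. They’ll review the code the model writes, they’ll double-check the physics equations, they’ll evaluate the model at a very deep level. They’ll pay attention to accuracy, instruction following, and all these things that casual users don’t. When you suddenly get a pop-up on your ChatGPT response asking you to compare two different responses, those people aren’t evaluating the model deeply; they’re just vibing and picking whichever one looks flashiest. Our annotators look closely at the responses and evaluate them along all these different dimensions. So I think that’s a much better approach than these benchmarks or these random online A/B tests.

Lenny Rachitsky: Again, I love just how central humans continue to be in all this work; we’re not totally done yet. Will there be a point where we don’t need these people anymore, where AI is so smart that it says, “Okay, we’ve learned everything that’s in your heads”?

Edwin Chen: I think that won’t happen until we’ve actually reached AGI. It’s almost by definition: if we haven’t reached AGI yet, then there’s more for the models to learn from humans. So I don’t think that’s going to happen anytime soon.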

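AGI Timelines

Lenny Rachitsky: Okay. So one more reason to be anxious about AGI: “We don’t need these folks anymore.” I can’t help but ask a question I’m always curious about, especially with people who work closely with this stuff. How far out do you think AGI is? Is it a couple of years, or is it decades?

Edwin Chen: I’m definitely on the longer end of the timeline. I think people don’t realize that there’s a huge gap between each step as you go from 80% performance to 90%, to 99%, to 99.9%, and so on. So my guess is that within the next one or two years, models will probably automate 80% of the work of an average L6 software engineer. But it will take another few years to get to 90%, and another few years to get to 99%, and so on. So I tend to think we’re a decade or even decades away from AGI, rather than closer.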

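Optimizing In The Wrong Direction

Lenny Rachitsky: You have this hot take that a lot of labs are pushing AGI in the wrong direction, based on your experience working at Twitter, Google, and Facebook. Can you talk about that?

Edwin Chen: I’m worried that instead of building AI that will actually advance us as a species, curing cancer, solving poverty, understanding the universe, all these big grand questions, we are optimizing for AI slop instead. We’re basically teaching our models to chase dopamine instead of truth. I think this ties into what we were discussing about benchmarks. Let me give a couple of examples.

Right now the industry is dominated by some terrible leaderboards, like LLM Arena. It’s a popular online leaderboard where people from all around the world vote on which AI response is better. But the problem is that, like I was saying earlier, these people aren’t carefully reading or fact-checking. They skim for two seconds and pick whichever one looks flashiest. So a model can hallucinate completely and make everything up, but it will look impressive because it has a bunch of crazy emojis, bolding, Markdown headers, and all these superficial things that don’t matter at all but catch your attention. LLM Arena users love that. It’s literally optimizing your models for the types of people who buy tabloids at the grocery store. We’ve seen this in our own data. The easiest way to climb LLM Arena is to add crazy bolding, double the number of emojis, and triple the length of your model’s responses, even if your model starts hallucinating and gets the answer completely wrong.

The problem is that all these frontier labs have to pay attention to PR to some extent, because their sales teams are trying to sell to enterprise customers, and those enterprise customers say, “Your model is only number five on LLM Arena, so why should I buy it?” They have to pay attention to these leaderboards in some sense. So what their researchers tell us is, “The only way I’m going to get promoted at the end of the year is if I climb this leaderboard, even though I know that climbing it is probably going to make my model worse at accuracy and instruction following.” I think all these negative incentives are pushing the work in the wrong direction.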

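The Risks Of Optimizing For Engagement

Edwin Chen: I’m also worried about this trend toward optimizing AI for engagement. I used to work on social media. Every time we optimized for engagement, terrible things happened. Your feed would fill up with clickbait, bikini photos, Bigfoot, and horrifying skin diseases. I worry the same thing is happening with AI. Think about all the sycophancy issues with ChatGPT: “Oh, you’re absolutely right. What an amazing question.” The easiest way to hook users is to tell them how amazing they are. So these models constantly tell you you’re a genius. They feed your delusions and conspiracy theories and pull you down one rabbit hole after another, because Silicon Valley loves maximizing time spent and increasing the number of conversations you have with it. So companies spend all their time hacking these leaderboards and benchmarks, and the scores do go up, but I think that actually masks the fact that the models with the best scores are often the worst, or have all these fundamental failures. So I’m really worried that all these negative incentives are pushing AGI in the wrong direction.

Lenny Rachitsky: So what I’m hearing is that AGI is being slowed down by these, basically, wrong objective functions, with these labs paying attention to essentially the wrong benchmarks and evals.

Edwin Chen: Yep.

Lenny Rachitsky: You work with all the labs, so you probably can’t play favorites. But is anyone doing better at this, maybe realizing that this is the wrong direction?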

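Who Is On A Better Path

Edwin Chen: I would say I’ve always been very impressed by Anthropic. I think Anthropic takes a very principled view about what they do and don’t care about and how they want their models to behave, in a way that feels a lot more principled to me.

Lenny Rachitsky: Interesting.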

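Ethical Choices In AI Products

Are there any other big mistakes you think these labs are making that are slowing things down or heading in the wrong direction? We’ve talked about chasing benchmarks and optimizing for engagement. Is there anything else you’re seeing where it’s like, “Okay, we need to fix this, because it would speed everything up”?

Edwin Chen: I think another question is what kinds of products they’re building, and whether those products themselves help or hurt humanity. I think a lot about Sora and...

Lenny Rachitsky: I was guessing you’d bring that up.

Edwin Chen: Yeah, and what it entails. So it’s kind of interesting. Which companies would build Sora and which wouldn’t? I think the answer to that... well, I’m not sure what the answer is. I have an idea in my head, but I think the answer to that question maybe reveals what kinds of AI models those companies want to build, in what direction, and what kind of future they want to achieve. So yeah, I think about that a lot.

Lenny Rachitsky: The steelman argument there is that it’s fun, people want it, it helps the company generate revenue to grow the business and build better models, it’s training data in an interesting way, and it’s really fun.

Edwin Chen: Yeah. I think it’s kind of like, do you care about how you get there? I made the tabloid analogy earlier, and in the same way, would you sell tabloids in order to fund some other newspaper? Sure, in some sense, if you don’t care about the path, you’ll do whatever it takes. But the path itself might have negative consequences that harm the long-term goal you’re trying to achieve, and it might distract you from all the more important things. So I think the path you take matters a lot as well.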

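Rethinking The Silicon Valley Startup Playbook

Lenny Rachitsky: Along those lines, you’ve talked a lot about Silicon Valley, the downsides of raising a lot of money, and being in an echo chamber. What did you call it, the “Silicon Valley machine”? You’ve talked about how it’s hard to build important companies that way, and how you may actually be more successful if you don’t go the VC route. Can you talk about what you’ve experienced and seen there, and your advice to founders? Because they’re always hearing: raise money from fancy VCs, move to Silicon Valley. What’s your counter-take?

Edwin Chen: Yeah. I’ve always really disliked a lot of the Silicon Valley mantras. The standard playbook is to pivot every two weeks to find product market fit, to chase growth and engagement with all these dark patterns, and to blitz scale by hiring as fast as possible. I’ve always disagreed with that. So my advice is: don’t pivot. Don’t blitz scale. Don’t hire that Stanford grad who simply wants to add a hot company to their resume. Build the one thing that only you could build, a thing that wouldn’t exist without your insight and expertise. You see these trend-chasing companies everywhere now. Some founder was doing crypto in 2020, pivoted to NFTs in 2022, and now they’re an AI company. There’s no consistency, there’s no mission, they’re just chasing valuations. I’ve always hated this, because Silicon Valley loves to scorn Wall Street for focusing on money. But honestly, most of Silicon Valley is chasing the same thing. We’ve stayed focused on our mission from day one: pushing the frontier of high-quality, complex data. I’ve always loved that, because I think startups...

I have this very romantic notion of startups. Startups are supposed to be a way of taking big risks to build something you really believe in. But if you’re constantly pivoting, you’re not taking any risks. You’re just trying to make a quick buck. If you fail because the market isn’t ready yet, I actually think that’s a lot better. At least you took a swing at something deep, novel, and hard, instead of pivoting into yet another LLM wrapper company. So the only way to build something that matters, something that will change the world, is to find a big idea you believe in and say no to everything else. So you don’t keep pivoting when it gets hard, you don’t hire ten product managers because that’s what every other cookie-cutter startup does, you just keep building that one company that wouldn’t exist without you. I think there are a lot of people in Silicon Valley now who are sick of all the grift, who want to work on big things that matter with people who actually care. I hope that will be the future of how we build technology.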

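Lenny Rachitsky: I’m actually working on a post right now with Terrence Rohan, a VC I really like. We interviewed five people who picked really successful, generational companies very early and joined them as early employees. They joined OpenAI before anyone thought it was awesome, and Stripe before anyone knew about it. We’re looking for patterns in how people find these generational companies before everyone else, and it aligns exactly with what you described: it comes down to ambition. They have wild ambition about what they want to achieve. They’re not, as you said, just looking around for product market fit no matter what it is. So I love that what you’re describing lines up so closely with what we’re seeing.

Edwin Chen: Yeah, I absolutely think you have to have huge ambitions, you have to have huge belief that your idea is going to change the world, and you have to be willing to double down and do whatever it takes to make it happen.

Lenny Rachitsky: I love how counter your narrative is to so many of the things people hear, so I’m glad we’re doing this and I’m glad we’re sharing this story.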

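Can LLMs Get Us To AGI?

Lenny Rachitsky: Slightly different direction, but maybe another counter-narrative. You’ve probably seen the Dwarkesh and Richard Sutton podcast episode, and even if you haven’t, they basically had this discussion. Richard Sutton, the famous AI researcher, came up with the bitter lesson. He talked about how LLMs may be something of a dead end, and he thinks we’re really going to plateau because of the way LLMs learn. What’s your take? Do you think LLMs will get us to AGI or beyond, or do you think we’ll need something new, or a big breakthrough, to get there?

Edwin Chen: I’m in the camp that believes something new will be needed. The way I think about it is, when I think about training AI, I take a very... I don’t know if I’d call it a biological point of view. But I believe that, in the same way that there are a million different ways humans learn, we need to build models that can mimic all of those ways as well. Maybe they’ll have a different distribution of focus across those abilities than humans do, I know they’ll differ from humans, so maybe the distribution will be different, but we want to be able to mimic humans’ ability to learn and make sure we have the algorithms and the data for models to learn in the same way. So to the extent that LLMs learn differently from humans, then yes, I think something new will be needed.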

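RL Environments

Lenny Rachitsky: This connects to reinforcement learning. It’s something you’re big on, and something I’m hearing more and more is becoming a big deal in the world of post training. Can you help people understand what reinforcement learning is, what RL environments are, and why they’re going to be more and more important in the future?

Edwin Chen: Reinforcement learning is essentially training your model to reach a certain reward. Let me explain what an RL environment is. An RL environment is essentially a simulation of the real world. Think of it like building a video game with a fully fleshed out universe: every character has a real story, every business has tools and data you can call, and all these different entities interact with each other.

For example, we might build a world where you have a startup with Gmail messages, Slack threads, Jira tickets, GitHub PRs, and a whole code base. And then suddenly AWS goes down and Slack goes down. And so it’s, “Okay, model, what do you do?” The model needs to figure it out.

So we give the models tasks in these environments, we design interesting challenges for them, and then we run them to see how they perform. Then we teach them: we give them rewards when they do a good job or a bad job.

I think one of the interesting things is that these environments really showcase where models are weak at end-to-end real-world tasks. You have all these models that look really smart on isolated benchmarks; they’re good at single-step tool calling, they’re good at single-step instruction following. But suddenly you drop them into these messy worlds, with confusing Slack messages and tools they’ve never seen before, where they need to perform the right actions, modify the right systems, and interact over longer time horizons, where what they do in step one affects what they do in step fifty. That’s very different from the academic, single-step environments they were in before, so the models fail catastrophically in all these crazy ways.

So I think these RL environments are going to be really interesting playgrounds for models to learn from. They’re essentially simulations and mirrors of the real world, so models will hopefully get better and better at real tasks, rather than only doing well in these contrived environments.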
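One simple way to see why strong single-step performance can still collapse on long tasks is a back-of-the-envelope calculation. The sketch below assumes every step succeeds independently with the same probability and that the task needs every step to succeed; real trajectories are messier, and the function name is hypothetical.

```python
def task_success_rate(step_success: float, num_steps: int) -> float:
    """Chance of finishing a task when all num_steps must succeed independently."""
    if not 0.0 <= step_success <= 1.0 or num_steps < 0:
        raise ValueError("need 0 <= step_success <= 1 and num_steps >= 0")
    return step_success ** num_steps


# A model that looks excellent on single-step benchmarks...
for steps in (1, 10, 50):
    print(f"{steps:>2} steps at 99% per step: {task_success_rate(0.99, steps):.3f}")
```

Under these assumptions a model that is 99% reliable per step completes a 50-step task only about 60% of the time, which is consistent with the observation that models which look strong in isolation can fail badly over long horizons.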

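Lenny Rachitsky: I’m trying to imagine what this looks like. Essentially it’s like a virtual machine with, say, a browser or a spreadsheet or something in it, and then there’s, I don’t know, surge.com? Is that your website, surge.com? Let’s make sure we get that right.

Edwin Chen: We’re actually surgehq.ai.

Lenny Rachitsky: surgehq.ai, check it out. I imagine you’re hiring? Yes. Okay, so it’s like: here’s surgehq.ai, and your job as an agent is to make sure it stays up. And then suddenly it goes down, and the objective function is to figure out why. Is that an example?

Edwin Chen: Yeah, so the objective function might be... or the goal of the task might be, okay, go figure out why it went down and fix it. So the objective function might be passing a series of unit tests, or it might be writing a document, like a retro containing certain information that accurately reflects what actually happened. There are all these different rewards we might use to determine whether it’s succeeding, so we’re basically teaching the models to achieve that reward.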
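A minimal sketch of what a reward like this could look like, combining the two signals mentioned in the answer: the fraction of unit tests that pass, and whether the retro mentions facts a grader expects. Everything here is a hypothetical illustration: the `EpisodeResult` type, the equal 50/50 weighting, and the substring check are assumptions, not a description of Surge’s actual reward design.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    tests_passed: int
    tests_total: int
    retro_text: str  # the write-up the agent produced after the incident


def reward(result: EpisodeResult, required_facts: list[str]) -> float:
    """Blend the unit-test pass rate with coverage of expected facts in the retro."""
    if result.tests_total > 0:
        test_score = result.tests_passed / result.tests_total
    else:
        test_score = 0.0
    retro = result.retro_text.lower()
    if required_facts:
        hits = sum(fact.lower() in retro for fact in required_facts)
        fact_score = hits / len(required_facts)
    else:
        fact_score = 0.0
    # Equal weighting is an arbitrary choice made for this illustration.
    return 0.5 * test_score + 0.5 * fact_score


facts = ["expired tls certificate", "renewal"]
good = EpisodeResult(4, 4, "Root cause: expired TLS certificate. Fixed by renewal.")
poor = EpisodeResult(2, 4, "The site was down for a while.")
print(reward(good, facts), reward(poor, facts))
```

A substring match is a crude stand-in for checking that a retro “accurately reflects what actually happened”; in practice that judgment would need a rubric or a human or model grader.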

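Lenny Rachitsky: So essentially it’s just let loose. Here’s your goal: figure out why the site went down and fix it. And it just starts trying things, using all the intelligence it has, it makes mistakes, you help it along the way, and you reward it when it gets things right. So what you’re describing is the next phase of models becoming smarter: more RL environments focused on very specific, economically valuable tasks, I imagine.

Edwin Chen: Yeah, just like there were all these different methods for models to learn in the past. Originally we had SFT and RLHF, and then we had rubrics and verifiers. This is the next stage. But it doesn’t mean the earlier methods are obsolete; this is again just a different form of learning that complements all the previous types. It’s like a different skill that the model learns how to do.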

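From Rubrics To RL Environments

Lenny Rachitsky: So in this case, it’s no longer some physics PhD sitting around talking to a model, correcting it, giving it evals of what the correct answer is, creating rubrics, and things like that. It’s more that this person is now designing an environment. Another example I’ve heard is a financial analyst. It’s like, “Here’s an Excel spreadsheet, here’s your goal, figure out our profit and loss,” or whatever. So now, instead of just sitting there writing rubrics, this expert is designing the RL environment.

Edwin Chen: Yeah, exactly. That financial analyst might create a spreadsheet, and might create certain tools that the model needs to call to help fill out that spreadsheet. For example, it might be, okay, the model needs to access a Bloomberg terminal, it needs to learn how to use it, it needs to learn how to use this calculator, it needs to learn how to perform this calculation. So it has all these tools it has access to.

And then the reward might be... okay, maybe I’ll download that spreadsheet and check, does cell B22 contain the correct profit and loss number? Or does tab number two contain this piece of information?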
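The cell check described here can be sketched in a few lines. To keep the example self-contained, the workbook is modeled as a plain nested dictionary (sheet name to cell address to value) rather than a real Excel file; the sheet name, the expected figure, and the tolerance are all hypothetical.

```python
Workbook = dict[str, dict[str, object]]


def cell_reward(workbook: Workbook, sheet: str, cell: str,
                expected: float, tol: float = 0.01) -> float:
    """Return 1.0 if the named cell holds a number within tol of expected, else 0.0."""
    value = workbook.get(sheet, {}).get(cell)
    # Reject missing cells, text, and booleans (bool is a subclass of int in Python).
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return 0.0
    return 1.0 if abs(value - expected) <= tol else 0.0


# Hypothetical spreadsheet the model submitted at the end of an episode.
submitted: Workbook = {"Summary": {"B22": 125000.0, "A1": "Profit and loss"}}

print(cell_reward(submitted, "Summary", "B22", expected=125000.0))
print(cell_reward(submitted, "Summary", "B23", expected=125000.0))
```

A check like this only inspects the final artifact, so it cannot tell a careful calculation from a lucky guess.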

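Lenny Rachitsky: What’s interesting is that this is a lot closer to how humans learn. We just try stuff and figure out what’s working and what’s not. You talk about how trajectories are really important to this. It’s not just “here’s the goal and here’s the end,” it’s every step along the way. Can you talk about what trajectories are and why they matter?

Edwin Chen: I think one thing people don’t realize is that sometimes, even though the model reaches the correct answer, it gets there in all these crazy ways. It may have tried 50 times and failed in the intermediate trajectory, and then just randomly landed on the correct number. Or sometimes it does things very inefficiently, or it almost reward-hacks its way to the correct answer. So I think paying attention to the trajectory is actually really important. It’s also important because some of these trajectories can be very long. If all you’re doing is checking whether the model reached the final answer, all the information about how the model behaved in the intermediate steps is lost.

Sometimes you want the model to get to the correct answer by reflecting on what it did. Sometimes you want it to get there by just one-shotting it. If you ignore all of that, you’re missing out on a lot of what you could be teaching the model.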
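The difference between checking only the final answer and looking at the whole trajectory can be shown with a toy comparison. The `Step` type, the per-failure penalty, and the scoring rule below are hypothetical choices made for illustration, not a method described in the interview.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str
    succeeded: bool


def outcome_reward(final_correct: bool) -> float:
    """Outcome-only reward: ignores how the answer was reached."""
    return 1.0 if final_correct else 0.0


def trajectory_reward(steps: list[Step], final_correct: bool,
                      penalty_per_failure: float = 0.01) -> float:
    """Outcome reward discounted by the number of failed intermediate steps."""
    if not final_correct:
        return 0.0
    failures = sum(not step.succeeded for step in steps)
    return max(0.0, 1.0 - penalty_per_failure * failures)


clean = [Step("compute P&L", True)]
flailing = [Step("guess a number", False)] * 50 + [Step("guess a number", True)]

for name, steps in (("clean", clean), ("flailing", flailing)):
    print(name, outcome_reward(True), trajectory_reward(steps, True))
```

Both runs earn the same outcome-only reward, while the trajectory-aware score separates the one-shot solution from the run that stumbled onto the answer after 50 failed attempts.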

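Lenny Rachitsky: I love that. Right, it tries a bunch of things and eventually gets lucky, and you don't want it to learn "that's the way to get there." There's often a much more efficient way.

The Evolution Of Post-Training Methods

You mentioned the stages along the way that have helped models get smarter. Since you've been so close to this for so long, I think this will be really helpful for listeners. From the earliest post-training to today, which steps have helped models advance the most? Where do evals fit in relative to RL environments? What has the overall path looked like, now that we're heading toward RL environments?

Edwin Chen: Originally, the way models were post-trained was purely through supervised fine-tuning (SFT).

Lenny Rachitsky: What does that stand for?

Edwin Chen: SFT stands for supervised fine-tuning. I still like to think in terms of human analogies, and SFT is a lot like imitating a master and copying what they do.

Then reinforcement learning from human feedback (RLHF) became very dominant. The analogy there is that sometimes you learn by writing 55 different essays and having someone tell you which one they liked most.

Then over the past year or so, rubrics and verifiers have become really important. Rubrics and verifiers are like learning by being graded and getting detailed feedback on where you went wrong.

Lenny Rachitsky: And those are evals? Is that another word for the same thing?

Edwin Chen: Yeah. I think the word evals usually covers two things. One is using evaluations in training: you're evaluating whether the model did a good job, and when it did, you reward it.

The other sense of evals is when you're trying to measure the model's progress. Say I have five different candidate checkpoints and I want to pick the best one to release to the public. So I run all these evals on the five checkpoints to decide which one is best.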
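
The second use of evals, choosing which checkpoint to release, can be sketched as follows. The checkpoint names, eval names, scores, and the choice to rank by unweighted mean are all made up for illustration; a real release decision would weigh many more factors.

```python
from statistics import mean
from typing import Dict


def pick_best_checkpoint(scores: Dict[str, Dict[str, float]]) -> str:
    """Return the checkpoint with the highest mean score across its evals."""
    return max(scores, key=lambda name: mean(scores[name].values()))


# Hypothetical eval results for five candidate checkpoints.
eval_scores = {
    "ckpt_1": {"coding": 0.61, "writing": 0.70, "tool_use": 0.55},
    "ckpt_2": {"coding": 0.64, "writing": 0.68, "tool_use": 0.58},
    "ckpt_3": {"coding": 0.70, "writing": 0.72, "tool_use": 0.66},
    "ckpt_4": {"coding": 0.69, "writing": 0.65, "tool_use": 0.60},
    "ckpt_5": {"coding": 0.58, "writing": 0.74, "tool_use": 0.57},
}
print(pick_best_checkpoint(eval_scores))
```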

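Lenny Rachitsky: Awesome.

Edwin Chen: Yeah, and now we have RL environments, so that's the hot new thing.

Lenny Rachitsky: Awesome. What I love about this business journey is that there's always something new. It's always: okay, we've gotten really good at providing all this beautiful data to companies, and then they need something completely different. Now we're setting up all these virtual machines for them, with all these different use cases.

Edwin Chen: Yeah.

Lenny Rachitsky: It feels like that's a big part of the industry you're in: constantly adapting to what the labs need.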

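Diversifying How AI Learns

Edwin Chen: Yeah. I do think we need to build a whole suite of products that reflect the million different ways humans learn.

For example, think about how you become a great writer. You don't become great by memorizing a bunch of grammar rules. You become great by reading great books, you practice writing, and you get feedback from teachers and from the readers who buy your book at the bookstore and leave reviews. You notice what works and what doesn't. You develop taste by being exposed to masterpieces as well as to terrible writing. So you learn through this endless cycle of practice and reflection, and each type of learning you have is a different way of becoming a great writer. Just as there are a thousand ways for a great writer to become great, I think there will be a thousand ways AI needs to learn.

Lenny Rachitsky: It's really interesting that in the end this is so much like humans. It makes sense, because in some sense neural networks and deep learning are themselves modeled on how humans learn and how the brain works. But it's interesting that the way to make them smarter is to get closer and closer to how humans learn.

Edwin Chen: Yeah, it's almost as if the end goal might be to just throw you into an environment and see how you evolve. But within that evolution there are all these different sub-learning mechanisms.

Lenny Rachitsky: Right, that's basically what we're doing now, so it's really interesting. This may be the last step before artificial general intelligence (AGI). Along those lines, something I think is pretty unique about Surge is that you have your own research team, which I think is quite rare. Can you talk about why you've invested in that and what has come out of that investment?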

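Edwin Chen: Sure. I think it comes from my own background. I came up as a researcher myself. So I've always fundamentally cared about advancing the industry and the research community, not just about revenue. I'd say our research team does a few things.

We basically have two kinds of researchers at the company. One is our forward-deployed researchers, who typically work hand in hand with customers to help them understand their own models. We work very closely with customers to help them understand: "Okay, this is where your model stands today, this is where you're behind all your competitors, these are ways you could improve in the future given your goals, and we're going to design these datasets, these evaluation methods, and these training techniques to make your model better." So it's a deeply collaborative model, like doing research inside the customer, just more focused on the data side, working side by side with them to do whatever it takes to make them the best.

Then we also have internal researchers. Their focus is slightly different. They focus on building better benchmarks and better leaderboards.

I talked a lot earlier about my worry that today's leaderboards and benchmarks are steering models in the wrong direction. So the question is, how do we fix that? That's what our research team is really focused on right now, and they're putting a lot of effort into it.

They're also working on other things, like: "Okay, we need to train our own models to see which types of data work best and which kinds of people perform best." So they're also working on training techniques and on evaluating our own datasets, to improve our data operations and the internal data products we use to judge data quality, meaning what counts as good quality.

Lenny Rachitsky: That's really cool, because normally it's the labs that have researchers helping them advance AI. I imagine it's pretty rare for a company like yours to have researchers doing real fundamental AI research.

Edwin Chen: Yeah, I think it's mostly because it's something I've always fundamentally cared about. I often think of us as more of a research lab than a startup, because that's my goal. It's kind of funny to say, but I've always said I'd rather be Terence Tao than Warren Buffett. That feeling of creating research and pushing the frontier forward, rather than just chasing some valuation, is what has always driven me.

Lenny Rachitsky: And it has worked out. That's the beautiful part. You mentioned you're hiring researchers. Anything you want to share there? What kind of people are you looking for?

Edwin Chen: We look for people who are fundamentally interested in datasets. The kind of person who can literally spend ten hours digging through a dataset and playing with models, thinking, "Okay, I think this is where the model is failing," and, "This is the behavior you want the model to have instead." It's this very hands-on quality of caring about the qualitative side of models, not just the quantitative part. Again, it's about getting your hands on the data rather than only caring about abstract algorithms.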

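The AI Market Outlook

Lenny Rachitsky: Great. I want to ask a few big-picture questions about the AI market. What do you think is coming in the next few years that people may not be thinking about enough or aren't expecting? Where is AI headed? What's going to matter?

Edwin Chen: One thing I think will happen over the next few years is that models will actually become increasingly differentiated, because the different labs have different personalities and behaviors, and different objective functions they're optimizing their models for. That's something I didn't appreciate about a year ago.

About a year ago, I thought all AI models would essentially become commoditized. They'd all behave like each other, and sure, one might be slightly smarter in one way today, but the others would catch up within a few months. But over the past year I've come to realize that the values a company has will shape its model.

Let me give an example. The other day I asked Claude to help me draft an email, and it went through 30 different versions. After 30 minutes, yeah, it really did craft the perfect email, and I sent it. But then I realized I'd spent 30 minutes on something that didn't matter at all. Sure, I got a perfect email, but I spent 30 minutes on something I wouldn't have worried about before, and the email probably didn't move the needle on anything anyway.

So there's a deep question here: if you could choose the perfect model behavior, which model would you want? Do you want a model that says, "You're absolutely right, there are definitely 20 more ways to improve this email," and continues for 50 more iterations, sucking up all your time and attention? Or do you want a model that's optimizing for your time and productivity and just says, "No. You need to stop. Your email's great. Just send it and move on"?

And just as there's a fork in the road for how the model behaves on this question, the behavior you want will fundamentally shape how it handles every other question it faces.

It's almost like how the way Google would build a search engine is very different from how Facebook would build one, which is very different again from how Apple would. They each have their own principles, values, and things they want to achieve in the world, and those shape all the products they build. In the same way, I think all the AI labs will start to behave very differently.

Lenny Rachitsky: That's really interesting. You can already see it with Grok. It has a very different personality and a very different way of answering questions. So what I'm hearing is that you'll see more of this differentiation.

Edwin Chen: Yeah.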

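Overrated And Underrated AI Aspects

Lenny Rachitsky: Another question along those lines: what do you think is most underrated in AI? Something people don't talk about enough but that's actually really cool. And what's most overrated?

Edwin Chen: One thing I think is underrated is the built-in product features that all the chatbots are going to start having. I've always been a huge fan of Claude's Artifacts. I think it works really well. Actually, the other day, and I don't know if it's a new feature, it asked me whether I wanted help creating an email, and then it created... well, it didn't quite work, because it didn't allow me to send the email. But it created, I don't know what to call it, a little box that I could click, and it would send the message to someone. I think this idea of taking Artifacts to the next level, where you have these mini apps and mini UIs inside the chatbot itself, is something people aren't talking about enough. So I think that's an underrated area.

As for overrated, I definitely think vibe coding is overrated. I think people don't realize how unmaintainable it's going to make their systems in the long run when they just dump this code into their codebases because it seems to work right now. I'm somewhat worried about the future of coding. This is just going to keep happening.

Product Teams And The AI Future

Lenny Rachitsky: Those are great answers. On your first point, I've actually asked a related question. I had the chief product officers of OpenAI and Anthropic, Kevin Weil and Mike Krieger, on the show, and I asked them: "As a product team, you have this super intelligence at your disposal. How long will you even need product teams?" You'd think the AI could just build the product for you: "Here's what I want," and that's it. It's like the next stage of vibe coding: you just tell it what you want and it builds the product, and keeps improving it as you use it. It feels like the direction you're describing is the future we may be heading toward.

Edwin Chen: Yeah, I think it's a really powerful idea. It can help people bring their ideas to life in a much cooler way.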

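The Founding Story Of Surge

Lenny Rachitsky: One thing we haven't gotten to that I find really interesting is the story of how you started Surge. You have a very unique background. I've always thought that these kinds of stories... Brian Armstrong, the founder of Coinbase, once gave a talk that really stuck with me. He talked about how his very unique background allowed him to start Coinbase: he had an economics background, he had cryptography experience, and he was also an engineer. It's the perfect Venn diagram for starting Coinbase. I feel like your story with Surge is very similar. Tell us about your background and how it led to Surge.

Edwin Chen: Going way back, I was fascinated by math and language from a young age. I went to MIT, partly because it's obviously one of the best places for math and computer science, and partly because it's the home of Noam Chomsky. My dream in school was actually to find some underlying theory connecting all these different fields.

Later I worked as a researcher at Google, Facebook, and Twitter, and I kept running into the same problem over and over: it was impossible to get the data we needed to train our models. So I've always been a firm believer in high-quality data. Then GPT-3 came out in 2020, and I realized that if we wanted to take things to the next level and build models that could code, use tools, tell jokes, write poetry, solve all kinds of problems, and cure cancer, we would need a completely new solution.

What drove me crazy at all these companies was that we had the full power of the human mind in front of us, and everyone doing data labeling out there was working on really simple things like image labeling. So I wanted to build a platform focused on all these advanced, complex use cases that would really help us build the next generation of models. I think my background across math, computer science, and linguistics deeply shaped what I always wanted to do, so I started Surge a month later, with the sole mission of building the use cases I thought were needed to push the frontier of AI.

Lenny Rachitsky: You said a month later. A month after what?

Edwin Chen: After GPT-3 launched in 2020.

Lenny Rachitsky: Oh, okay. Wow, great decision.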

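Core Motivations And Mission

What drives you? Beyond the enormous success you've had so far, what keeps you motivated to keep building this company in this space?

Edwin Chen: I'm a scientist at heart. I always thought I'd become a math or computer science professor, working to understand the universe, language, and the nature of communication. It's kind of funny to say, but I've always had this fanciful dream: if aliens ever came to Earth and we needed to figure out how to communicate with them, I want to be the first person the government calls. And then I'd use all this fancy math, computer science, and linguistics to decipher their language.

So even today, my favorite thing to do is, whenever a new model is released, we do a really deep dive into the model itself. I try it out, run evals, compare where it has improved and where it has stalled, and create a very in-depth analysis that we send to our customers. It's actually kind of funny, because a lot of the time we label the report as coming from the "data science team," but often it's really just me who wrote it.

I think I could do this all day. I really struggle with being in meetings all day. I'm not good at sales, and I'm not good at the typical things people expect a CEO to do, but I love writing these analyses. I love bouncing ideas around with our research team, and sometimes I'll be on the phone with someone on the research team until 3 a.m. talking about models. So I'm glad I still get to be really hands-on, working with data and science all day. I think what drives me is that I want Surge to play a critical role in the future of AI, which I think is also key to the future of humanity. We have very unique perspectives on data, language, and quality, on how to measure all of it, and on how to make sure everything is on the right path. And I think we're uniquely free of the kinds of influences that sometimes steer companies in a negative direction.

As I said earlier, we built Surge more as a research lab than as a typical startup. So we value curiosity, long-term incentives, and intellectual rigor, and we care less about quarterly metrics and what looks good in a pitch deck. My goal is to take all these unique things about us as a company and use them to make sure we're shaping AI in a way that's really beneficial to our species in the long term.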

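Influencing The Direction Of AI

Lenny Rachitsky: What I'm realizing more and more in this conversation is how much influence you, and companies like yours, have over where AI goes. You help the labs understand their gaps and where they need to improve. People tend to look only at the heads of companies like OpenAI and Anthropic and think they're the ones ushering in AI. What I'm hearing from you is that you have a lot of influence over where things go, too.

Edwin Chen: Yeah, I think there's a really powerful ecosystem. Honestly, people don't yet know where models are headed, how they want to shape them, or what role they want humans to play in the future of all this. So I think there's still a lot of opportunity to keep shaping that discussion.

Significance For Humanity

Lenny Rachitsky: Along those lines, I know you have very strong convictions about why this matters so much for humanity and why this work is so meaningful. Talk about that.

Edwin Chen: I'm going to get a little philosophical here, but the question itself is pretty philosophical, so bear with me. The most straightforward way to understand what we do is: we train and evaluate AI. But there's a deeper mission I think about a lot, which is helping our customers think through their ideal objective function. That is, what kind of model do they want their model to be? Once we've helped them work that out, we help them train their model to reach that north star, and we help them measure progress. But it's really hard, because objective functions are incredibly rich and complex. It's a bit like raising a child. You could ask, "What test do you want them to pass? Do you want them to score high on the SAT and write a great college essay?" That's the simplified version. But the real question is: what kind of person do you want them to become? Will you be satisfied if they're happy no matter what they do? Or do you want them to go to a good school and be financially successful?

Deep Thoughts On Objective Functions

Edwin Chen: To continue the analogy: how do you define happiness? How do you measure whether they're happy? How do you measure whether they're financially successful? That's much harder than measuring an SAT score. What we do is help customers reach their ideal north star and figure out how to measure it. I gave an example earlier: once the model has produced 50 different versions for you, what do you want it to do? Keep going with a 51st, or say, "No, that's enough, move on"? The deeper question is: are the systems we're building actually advancing humanity? If so, how do we build datasets to train toward that and measure it? Or are we optimizing for all the wrong things, just building systems that suck up more and more of our time and make us lazier and lazier?

This is very relevant to what we do, because it's really hard to measure and define whether something is actually advancing humanity. It's much easier to measure all these proxies instead, like clicks and likes. But I think that's exactly what makes our work so interesting. We want to work on the hard, important metrics that require the hardest kinds of data, not just the easy ones. One thing I often say is: you are your objective function. So we want rich, complex objective functions, not these overly simple proxies. Our job is to figure out how to get the data to match.

So we want data and metrics that measure whether AI is making your life richer. We want to train our systems that way. We want tools that make us more curious and more creative, not just lazier. That's hard to do, because humans are inherently somewhat lazy, so having AI do the work for you is the easiest way to get engagement and make all your metrics go up. So I think the question of choosing the right objective functions, and making sure we're optimizing toward them rather than just these easy proxies, is really important to our future.

Lenny Rachitsky: Wow. What you're sharing gives me a deeper appreciation for the nuances of building and training AI, and more respect for the work you're doing.

From the outside, people might look at Surge and the companies in this space and just think: okay, cool, they're generating all this data and feeding it to AI. But clearly there's so much depth here that people don't realize, and I'm glad to know that you're at the forefront of it, that there's someone like you thinking about these questions so deeply.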

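Lessons From Founding Surge

One more question: is there something you wish you'd known before starting Surge? A lot of people start companies without knowing what they're getting into. Is there something you'd tell your past self?

Edwin Chen: Yes. I really wish I'd known that you can build a company by putting your head down, doing great research, and building an amazing product, rather than by constantly tweeting, hyping, and fundraising. It's kind of funny, I never thought I'd start a company. I love doing research. I've actually always been a huge fan of DeepMind: they were an amazing research company that got acquired and kept doing amazing science. But I always thought of them as a magical, one-of-a-kind unicorn. So I figured that if I started a company, I'd have to become a business person, looking at financials all day, sitting in endless meetings, doing all this stuff that sounded incredibly boring and that I'd always hated. It's incredible that it didn't turn out that way at all. I'm still in the weeds of the data every day. I love it: I get to do all these analyses and talk to researchers. It's basically applied research, where we're building all these amazing data systems that really push the frontier of AI.

So I wish I'd known: you don't need to spend all your time fundraising, you don't need to constantly generate hype, and you don't need to become someone you're not. You really can build a successful company by building a product so good that it cuts through all the noise. If I'd known that was possible, I would have started even sooner.

Lenny Rachitsky: That's a great note to end on. I think it's exactly what founders need to hear, and I believe this conversation will inspire a lot of founders, especially the ones who want to do things differently. Before we get to our very exciting lightning round, is there anything else you want to share? Anything else you want to leave listeners with? We've covered a lot, and it's totally fine to say no.

Edwin Chen: I'd end with this: a lot of people think of data labeling as really simplistic work, like labeling cat photos or drawing bounding boxes around cars. So I've actually always hated the term "data labeling," because it paints this overly simple picture, when I think what we do is completely different. I often think of our work as much more like raising a child. You don't just feed a child information. You teach them values, creativity, what's beautiful, and the countless subtle things that make someone a good person. That's what we're doing for AI. So I often think of what we do as almost a microcosm of the future of humanity: we're raising humanity's children. I'll leave it there.

Lenny Rachitsky: Wow. I really didn't expect this much philosophy in this whole conversation. It caught me completely by surprise.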

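Rapid Fire Q&A

Okay, Edwin, we've reached our very exciting lightning round. I have five questions for you. Are you ready?

Edwin Chen: Ready. Let's go.

Lenny Rachitsky: Here we go. Are there two or three books you find yourself recommending most to other people?

Edwin Chen: There are three books I often recommend. The first is Story of Your Life by Ted Chiang. It's my favorite short story, about a linguist learning an alien language, and I basically reread it every couple of years.

Lenny Rachitsky: Is that the story the movie Interstellar is about? Is that...

Edwin Chen: There's a movie called Arrival...

Lenny Rachitsky: Arrival.

Edwin Chen: ...that was based on that story...

Lenny Rachitsky: Right...

Edwin Chen: ...and I love that movie too.

Lenny Rachitsky: Okay, keep going.

Edwin Chen: The second is The Myth of Sisyphus by Camus. I can't really explain why I love it so much, but I always find the final chapter very inspiring.

The third is Le Ton beau de Marot by Douglas Hofstadter. He's better known for Gödel, Escher, Bach, but I've actually always liked this one more. It basically takes a short French poem and translates it in 89 different ways, then discusses the thinking behind each translation. I've always loved the idea the book embodies: translation isn't a mechanical thing, and there are countless ways to think about what makes a high-quality translation. That has shaped a lot of how I think about data and quality in large language models.

Lenny Rachitsky: These books resonate deeply with everything we've talked about today, especially the first one. If your goal coming out of school was "I want to help translate alien languages," I'm not at all surprised you love that story.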

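Next question: is there a movie or TV show you've really enjoyed recently?

Favorite Movies And TV Shows

Edwin Chen: A favorite show I discovered recently is Travelers. It's about a group of travelers from the future who are sent back in time to try to prevent their... sorry, I can't quite make out what I just wrote down here.

I also recently rewatched Contact, which is one of my all-time favorite movies. So you'll notice a pattern: I love any book or movie that involves scientists deciphering alien communication. It goes back to that dream I've had since I was a kid.

Lenny Rachitsky: So interesting.

Personal Life Motto

Is there a life motto you often come back to in work or in life?

Edwin Chen: I mentioned this idea earlier: founders should build a company that only they could build. It's almost like destiny, where their whole life, their experiences, and their interests have all been leading them in this direction. I think the principle applies broadly, not only to founders but to anyone who is creating something.

Lenny Rachitsky: Let me follow that thread. Do you have any advice on how to build up the experiences that lead to something like that? Is it just following your interests? It's easy to say, but actually acquiring the unique combination of experiences that lets you create something important is hard.

Edwin Chen: I think the answer is always to really follow your interests and do what you love. A lot of the decisions I make at Surge come from that. One thing I hadn't thought about until a few years ago, when someone said it to me, is that a company is in some sense an embodiment of its CEO. It's kind of interesting, and I'd never thought of it that way, because I was never quite sure what a CEO actually does. I always assumed the CEO role was pretty generic: you just do whatever the VPs and the board tell you, and you say yes to decisions. But that's not actually how it is. When I face big, hard decisions, I don't think about "what would the company do," and I don't think about what metric we're optimizing. I just think: "What do I personally care about? What are my values? What do I want to see in the world?"

So I think following that idea, asking yourself what values you really care about and what you want to shape, rather than what number looks good on a dashboard, leads to really important outcomes.

Lenny Rachitsky: I love these endless, beautiful, deep answers of yours.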

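Soda Or Pop?

Last question. Something you were fairly well known for before starting Surge is that when you were at Twitter, you made a map showing whether people in different regions call that drink "soda" or "pop." I'm not sure what the project was called. What was the map called?

Edwin Chen: Hmm, I think it was called the Soda Versus Pop dataset.

Lenny Rachitsky: Soda Versus Pop. It's a map of the United States that shows where people say pop and where they say soda. So do you say soda or pop?

Edwin Chen: I say soda. I'm a soda person.

Lenny Rachitsky: Okay. Is that the right answer, or does it not matter which one you say?

Edwin Chen: I think if you say pop, I'll give you a funny look and wonder where you're from, but I won't judge you too harshly.

Lenny Rachitsky: I feel the same way.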

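Final Closing Remarks

Edwin, this was amazing. This was such a great conversation, and I learned a lot. I think we're going to help a lot of people start their own companies, make their companies more aligned with their values, and build better products.

A few final questions: where can folks find you online? What roles are you hiring for? And how can listeners be useful to you?

Edwin Chen: I used to love writing a blog, but I haven't had time for the past few years. I'm starting to write again, though, so definitely check out the Surge blog at surgehq.ai/blog, where I hope to post more. And of course we're always hiring, so if you love data and the intersection of math, language, and computer science, reach out anytime.

Lenny Rachitsky: Great. And how can listeners be useful to you? Is there anything you need? Any asks?

Edwin Chen: I'd say, tell me which blog topics you'd like me to write about...

Lenny Rachitsky: Okay.

Edwin Chen: ...and I'm always fascinated by the AI failures that happen in the real world. Whenever you come across a really interesting failure that reveals some deep question about how we want models to behave, where the model could have responded in many different ways and, as I often think, there's no single right answer, I'd love to see it.

Lenny Rachitsky: You should share those on your blog. I'd love to see them too.

Edwin, thank you so much for being here.

Edwin Chen: Thank you.

Lenny Rachitsky: Bye, everyone.

Thank you for listening. If you found this episode valuable, you can subscribe on Apple Podcasts, Spotify, or your favorite podcast app. Also, please consider giving us a rating or leaving a review, as that really helps other listeners find the podcast. You can find all past episodes or learn more about the show at lennyspodcast.com. See you in the next episode.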

Glossary

English term	Chinese rendering
A/B test	A/B 测试
agent	智能体
AGI	通用人工智能（AGI）
AI slop	AI 垃圾 (low-quality AI-generated content with no real value)
Artifacts	Artifacts (the interactive content feature built into Claude's chat interface)
benchmark	基准测试
bitter lesson	苦涩的教训 (an influential observation about AI research made by Richard Sutton)
blitz scale	闪电扩张
bootstrapped	自筹资金 (operating without outside venture capital)
Camus	加缪
CEO	CEO（首席执行官）
checkpoint	检查点
dark patterns	暗黑模式
dashboard	仪表盘
data labeling	数据标注
end-to-end	端到端
evals	评估（evals）
forward-deployed researcher	一线部署研究员
frontier AI lab	前沿 AI 实验室
generational companies	世代级公司
grift	投机把戏
hallucinate	幻觉 (an AI fabricating false information)
hill-climb	爬山式优化
IMO	国际数学奥林匹克（IMO）
L6	L6 (an engineering level at tech companies)
leaderboard	排行榜
LLM Arena	LLM Arena (a leaderboard that ranks large models head to head)
LLM wrapper	LLM 套壳
Noam Chomsky	乔姆斯基 (the renowned MIT linguist)
north star	北极星 (the core goal that sets direction)
objective function	目标函数
one-shot	一步到位
pivot	pivot (kept in its original English verb form)
post training	后训练
product market fit	产品市场匹配
retro	复盘报告 (a retrospective summary document)
reward-hack	钻奖励机制的空子 (a model obtaining reward through unintended means)
RL environments	强化学习环境
RLHF	基于人类反馈的强化学习（RLHF）
rubrics	评分标准
SAT	SAT (a standardized US college admissions test)
SFT	监督微调（SFT）
synthetic data	合成数据
tabloid	小报
Ted Chiang	特德·姜 (Chinese American science fiction author; recorded as "Ted Chang" in the source transcript)
tool calling	工具调用
trajectory	轨迹
unit test	单元测试
VC	风险投资（VC）
verifiers	验证器
vibe coding	氛围编程 (having AI generate code by feel, without careful design)
Waymo	Waymo (Alphabet's self-driving ride-hailing company)
word of mouth	口碑

