OpenAI 研究员谈为什么软技能是工作的未来 | Karina Nguyen
OpenAI researcher on why soft skills are the future of work | Karina Nguyen
About the Guest
Lenny Rachitsky: Not only are you working at the cutting edge of AI and LLMs, you’re actually building the cutting edge.
Karina Nguyen: When I first came to Anthropic and I was like, “Oh my God, I really love front-end engineering.” And then the reason why I switched to research is because I realized, “Oh my God, Claude is getting better at front-end. Claude is getting better at coding. I think Claude can develop new apps.”
The Main Interview
Lenny Rachitsky: What skills do you think will be most valuable going forward for product teams, in particular?
Model Training as an Art
Karina Nguyen: Creative thinking and you kind of want to generate a bunch of ideas and filter through them and not just build the best product experience. I think it’s actually really, really hard to teach the model how to be aesthetic or really good visual design or how to be extremely creative in the way they write.
The Data Wall and Synthetic Data
Lenny Rachitsky: What do you think people most misunderstand about how models are created?
Karina Nguyen: When you taught the model, some of the self-knowledge of you actually don’t have a physical body to operate in the physical world, the model would get extremely confused.
From Chatbots to Collaborative Agents
Lenny Rachitsky: Today my guest is Karina Nguyen. Karina is an AI researcher at OpenAI where she helped build Canvas, tasks, the o1 chain-of-thought model and more. Prior to OpenAI, she was at Anthropic where she led work on post-training and evaluation for the Claude 3 models, built a document upload feature with 100K context windows and so much more. She was also an engineer at New York Times, was a designer at Dropbox and at Square. It’s very rare to get a glimpse into how someone working on the bleeding edge of AI and LLMs operates and how they think about where things are heading. Canvas
In our conversation, we talk about how teams that OpenAI operate and build products, what skills she thinks you should be building as AI gets smarter, how models are created, why synthetic data will allow models to keep getting smarter and why she moved from engineering to research after realizing how good LLMs are going to be at coding. If you enjoy this podcast, don’t forget to subscribe and follow it in your favorite podcasting app or YouTube. It’s the best way to avoid missing feature episodes and it helps the podcast tremendously. With that, I bring you Karina Nguyen.
Christina Cacioppo: Great to be here. Big fan of the podcast and the newsletter.
Lenny Rachitsky: Vanta is a longtime sponsor of the show, but for some of our newer listeners, what does Vanta do and who is it for?
Christina Cacioppo: Sure. So we started Vanta in 2018. Focused on founders, helping them start to build out their security programs and get credit for all of that hard security work with compliance certifications like SOC 2 or ISO 27001 today, we currently help over 9,000 companies, including some startup household names like Atlassian, Ramp and LangChain start and scale their security programs and ultimately build trust by automating compliance, centralizing GRC, and accelerating security reviews.
Lenny Rachitsky: That is awesome. I know from experience that these things take a lot of time and a lot of resources and nobody wants to spend time doing this.
Christina Cacioppo: That is very much our experience, but before the company and to some extent during it. But the idea is with automation, with AI, with software, we are helping customers build trust with prospects and customers in an efficient way. And our joke, we started this compliance company so you don’t have to.
Lenny Rachitsky: We appreciate you for doing that. And you have a special discount for listeners, they can get 1,000 off Vanta. Thanks for that, Christina.
Christina Cacioppo: Thank you.
Lenny Rachitsky: Karina, thank you so much for being here. Welcome to the podcast.
Core Behavior 2: Updating Documents
Karina Nguyen: Thank you so much, Lenny, for inviting me.
Lenny Rachitsky: I’m very excited to have you here because not only are you working at the cutting edge of AI and LLMs, you’re actually building the cutting edge of AI and LLMs. You recently launched this feature, which basically… the first agent feature of OpenAI. I also just did this survey, I don’t know if you know about this. I did a survey of my readers and asked them what tools do you use every day in your work and most use? And ChatGPT was number one, above Gmail, above Slack, above anything else. 90% of people said they use ChatGPT regularly.
Decision Boundaries in Code Editing
Karina Nguyen: That’s quite good.
Core Behavior 3: Making Annotations
Lenny Rachitsky: It’s absurd. It wasn’t around two years ago.
Karina Nguyen: Yeah.
A Day in the Life
Lenny Rachitsky: Also, we’re recording this the week that OpenAI announced Stargate, which is this half trillion dollar investment in AI infrastructure. So there’s just a lot happening constantly in AI and you have a really unique glimpse into how things are working, where things are going, how work gets done. So I have a lot of questions for you. I want to talk about how you operate and how you work at OpenAI, where you think things are going, what skills are going to matter more and less in the future, and also just where things are going broadly. So how does that sound?
Running Evals in Practice
Karina Nguyen: Sounds great. Thank you so much. Yeah, I was extremely lucky to join early days Anthropic and learned a lot of things there. And I joined OpenAI around eight months ago. So, yeah, I’m excited to dive more in into-
Prototyping Products with Prompts
Lenny Rachitsky: Okay, I’m going to definitely ask you about the differences between those, but I want to start more technical and just dive right in. I want to talk about model training. People always hear about models being trained, these big models, how much data takes, how long it takes, how much money toss it takes, how we’re running out of data, which I want to talk about. Let me just ask you this question. What do you think people most misunderstand about how models are created?
Karina Nguyen: Model training is more an art than a science. And in a lot of ways we, as model trainers, think a lot about data quality. It’s one of the most important things in model training is like how do you ensure the highest quality data for certain interaction model behavior that you want to create? But the way you debug models is actually very similar the way you debug software. So one of the things that I’ve learned early days at Anthropic was we’ve discovered especially this Claude 3 training, when you taught the model some of the soft-knowledge of, “Hey, you actually don’t have a physical body to operate in the physical world.” But then at the same time you had data that taught the model some of the function calls, which is like, “This is how you set the alarm.”
And so the model would get extremely confused about whether it can set an alarm, but it doesn’t have a body in the physical world. So it’s like the model gets confused and sometimes it’ll over accuse. So sometimes it says, “Look, I don’t know. Sorry, I cannot help you.” And so there is always a balance trade off between how do you make the model to be more helpful for users, but also not being harmful in other scenarios. And so it’s always about how do you make the model more robust and operate across a variety of diverse scenarios.
The Birth of Tasks Feature
Lenny Rachitsky: That is so funny. I never thought about that. Most of the data that it’s trained on is kind of assuming it’s like a human describing the world and how they operate. It assumes there’s a body and you could do things, and the model is told you don’t have a body.
Data Strategy: Synthetic Data First
Karina Nguyen: Yeah.
Researcher vs. Model Designer
Lenny Rachitsky: Okay. I want to talk a little bit about data while we’re on this topic. I know you have strong opinions here. There’s this meme that models are going to stop getting smarter because they’re running out of data. They’re trained in a large part on the internet and there’s only one internet and they’ve already been trained on it, what more can you show them about the world? And there’s this trend of synthetic data, this term synthetic data. What is synthetic data? Why do you think it’s important? Do you think it’s going to work?
Karina Nguyen: I think there are two questions here. We can unpack one at a time. But people say we are hitting the data wall. I think people think more in the terms of pre-trained large models that are trained on the entire internet to predict the next token. But what actually the model is learning during that process is actually how do you compress the compression algorithm here? The model learns to compress a lot of knowledge and it learns how to model the world. So the next prediction of the word, like, “Teach me how to drive,” basically. And you only have a few words that will match that, a car. So the model actually learns about the world in itself. So it’s like it’s modeling human behavior, sometimes it’s modeling… And when you talk to pre-trained models which are very, very large, they’re actually extremely diverse and extremely creative because you can talk to almost any Reddit user through a pre-trained model.
But I think what’s happening right now with new paradigm of o1 series is that the scaling in post-training itself is not hitting the wall. And that’s because basically we went from raw data sets from pre-trained models to infinite amount of tasks that you can teach the model in the post-training world via reinforcement learning. So any task, for example, how to search the web, how to use the computer, how to write, wow, all sorts of tasks that you trying to teach the model all the different skills. And that’s why we’re saying there’s no data wall or whatever, because there will be infinite amount of tasks and that’s how the model becomes extremely super intelligent. And we are actually getting saturated in all benchmarks.
So I think the bottleneck is actually in evaluations that we don’t have all the frontier, like evals like, I don’t know, GPQA, which is a Google-proof question answering, PhD level intelligence. The benchmark is getting to, I don’t know, more than 60, 70%, which is what PhD gets. So it’s literally hitting the wall in like evals.
Future Outlook in the AI Era
Lenny Rachitsky: I want to follow both those threads. So the first is on this idea of synthetic data. Is a simple way to understand it, that the models are generating the data that future models are trained on and you ask it to generate all these ways of doing stuff, all these tasks as you described, and then the newer models trained on this data that the previous model generated?
The Future of Education and Research
Karina Nguyen: Some tasks are synthetically curated. So this is an active research area is how can you synthetically construct new tasks with models to learn. Sometimes when you develop products, you get a lot of data from the product and user feedback and you can use that data too in this cross-training world. Sometimes you still want to use human data because actually some of the tasks can be really, really hard to teach. Experts only know certain knowledge about some chemicals or biological knowledge, so you actually need to tap into the experts’ knowledge a lot. So yeah, I think to me synthetic data training is more for product… It’s a rapid model iteration for similar product outcomes. And we can dive more into it, but the way we made Canvas and tasks and new product features for ChatGPT was mostly done by synthetic training.
The Most Valuable Future Skills
Lenny Rachitsky: Let’s actually get into that. That’s really interesting. I want to talk about evals, but let’s follow that thread. So talk about how this helped you create Canvas.
Why Models Struggle with Creativity
Karina Nguyen: So when I first came to OpenAI, I really had this idea of, “Okay, it would be really cool for ChatGPT to actually change the visual interface but also change the way it is with people.” So going from being a chatbot to more of a collaborative agent, and the collaborator is a step towards more genetic systems that become innovators ultimately. And so the entire team of applied engineers, designers, products, research got formed in the air almost out of nothing. It’s just like a collection of people who just got together and we rapidly started iterating with each other.
Actually Canvas is one of the… I would say the first project at OpenAI, where researchers and applying engineers started working together from the very beginning of the product development cycle. And I think there’s a lot of things that we have learned on the way, but I definitely came with the mindset of, “We need to do a really rapid model situation such that it would be much easier for engineers to work with the latest model possible, but also learn from user feedback or early internal dog food. How do we improve the model very rapidly?”
And it’s really hard to kind of like figure out how people… when you deploy a product, how people would be able to use it. And so the way you synthetically train the model is physically figuring out what are the most core behaviors that you wanted the product feature to do. And for Canvas, for example, it came down to three main behaviors. It was how do you trigger Canvas for prompts like, “Write me a long essay,” when the user intention is mostly iterating over long documents? Or, “Write me a piece of code,” or when to not trigger Canvas for prompts like, “Can you tell me more about President…” I don’t know, some of the general questions. So you don’t want to trigger Canvas because the user intention is mostly getting answer, not necessarily iterate over the long document.
The second behavior is how do we teach the model to update the document when the user asks? So one of the behaviors that we taught the model is actually have some agency and autonomy to literally go to the document and select specific sections and either delete it or edit, so highlight it and rewrite certain sections. Sometimes the user would just say, “Change the second paragraph to be something friendlier,” and we would have to teach the model to literally find the second paragraph in the document and change it to a friendly tone. So basically you teach both how to trigger edit itself, but also how do you teach the model to get higher quality edit for the document?
In case of coding, for example, there’s also the question of how good the model is of completely rewriting the document, versus having a very specific target edits. So that’s another layer of decision boundary within edit itself is, “Let’s select the entire document and rewrite completely, or do you want to have a very targeted custom behavior.” And when we first launched the model, we would bias the model towards more rewrites because we saw the quality of the rewrites were much higher. But over time you are shifting based on user feedback and what you’re learning from iterative deployment.
Lastly, the third behavior that we taught synthetically the model is how to make comments on any document. So the way we used that is we would use o1 model to seem a way of user conversation, let’s say like, “Write me a document about XYZ.” But then we used o1 to produce the document and then we injected user prompt to be like, “Oh, make some comments, critique my piece of writing or critique this piece of writing that you just made.” And then we taught the model to make comments on the document on very specific [inaudible 00:17:45] So it’s also what kind of comments you want the model to make. Do they make sense or not? How do you teach the quality of that? And it all came down to measuring progress via very robust evals. But, yeah, this is how you used o1 and a synthetic data generation for the training.
AI and Strategic Thinking
Lenny Rachitsky: Okay, that’s so interesting. So you talk about this idea of teaching the model and you mentioned how it’s using synthetic data to teach the model different behaviors is a simple way to think about it. Basically that’s where you do that by showing it what success looks like using basically evals. Is that the simple way to think about it? Like, “Here’s what you doing this successfully would look like,” and that teaches it, “Okay, I see this is what I should be doing [inaudible 00:18:31]”
Karina Nguyen: Yeah, great. Yeah, amazing. Yeah, you got it.
Skills Worth Developing
Lenny Rachitsky: Okay, got it. I want to start unpacking what your day-to-day looks like as you’re building these sort of things. Is it like you sitting there talking to some version of ChatGPT, crafting these evals?
Karina Nguyen: Sometimes I do that. Sometimes I do sit with ChatGPT. Actually, I think I learned this so much from Anthropic, is people spend so much time prompting models and where quality’s a really bad batch all the time, and you actually get a lot of new ideas of how do you make the model better? It’s like, “This response is kind of weird. Why’s it doing this?” And you start debugging or something, or you start figuring out new methods of how do you teach the model to respond in the different way, have better personality, let’s say.
So it’s the same thing of how personality is made in the models with those. It’s very similar methods. But, yes, I think my time at OpenAI have changed. I think when I first came, I was mostly research IC work so I was like building a lot of… I was running code, training models, write evals, working with PMs and designers to learn, teach them how to even think about evaluation. I think that was really cool experience and I think it was just like an adoption of, “How do we do this product management of AI feature for our AI models?” Yeah, but now it’s mostly management and mentorship. I’m still doing IC research code up to 4:00 PM, although. But I just kind of changed.
Comparing Anthropic and OpenAI
Lenny Rachitsky: All right, don’t talk too much about being a manager.
Early Days at Anthropic
Karina Nguyen: Okay.
Product Forms and Async Paradigms
Lenny Rachitsky: Because everyone’s in firing their managers. “Who needs managers anymore?” That’s what I hear now. Just kidding. It’s interesting that so much of your time was spent on teaching product teams how evals integrate and how important it is. And I’ve heard this a few times and I haven’t personally experienced it yet, so I think it’s an important thread to follow is just how writing these evaluations is going to become increasingly an important part of the job of product teams, especially when they’re building AI features and working with LLMs. So can you just talk a bit more about what that looks like? Is it sitting there with an Excel spreadsheet basically showing, “Here’s the input, here’s the output, here’s how good the result was”? Talk about what that actually looks like very practically.
Launching the 100K Context Window
Karina Nguyen: It certainly depends on what you’re developing, but there are various types of evaluations. Sometimes I do ask product managers, or there’s also new roles that we have, model designers, to go through some of the user feedback maybe or think of various user conversations that should have triggered… Under these circumstances, it should trigger Canvas. And then you have this ground truth label of, “Okay with this conversation it should look trigger Canvas, under this conversation it should not trigger Canvas.” And you have this very deterministic kind of eval that for decision-making behaviors is like this.
When we were launching tasks, for example, how do you make correct schedules is actually really hard for the model. But we built out some of the deterministic evaluations that is like, “Okay, if the user says 7:00 PM, the model should say 7:00 PM.” So if you can have deterministic evals whether it’s pass or fail. And the way it works is all the… Sometimes I ask product managers to just go create a double sheet, have different tabs and what’s the current behavior, what’s the ideal behavior and why, and some notes.
And sometimes they usually use it with evals, sometimes we use it for training. Because if you give the spreadsheet to o1 model, it can probably figure out how to teach itself a good behavior. And I think there are second type of evals that is more prevalent is human evaluations. And you can have specific trainers or you can have internal people to when you have a conversation of the prompt and then you have various completion of models, you choose the win rate. Which model is the best? Which model produce the highest quality comment or edit? And then you can have continuous win rates. And as you develop new models it should always win over the previous models. So it depends on what you want to measure.
Key Milestone Moments
Lenny Rachitsky: So interesting. Basically what I’m hearing, and there’s something I’m learning about as I talk to people, is product development might move from this, “Here’s a spec PRD, let’s build it together and then cool, let’s review it. Are we happy with this?” From that to, “Hey, AI, build this thing for me and here’s what correct looks like,” and I’m spending all my time on what does correct look like on evals essentially.
Transforming Content Formats
Karina Nguyen: You definitely want to measure progress of your model and this is where evals is, is because you can have prompted model as a baseline already. And the most robust evals is the one where prompted baselines get the lowest score or something. And then because then you know if you’re trained a good model, then it should just hill climb on that eval all the time, while not also regressing on other intelligence evals. That’s what I’m saying, it’s more of an art than science. It’s like, “Okay, if you optimize the model for this behavior, you don’t want to brain damage in other areas of intelligence or…” This is happening all the time in every lab, in every research team.
I would say prompting is also a way to prototype new product ideas. Early days at Anthropic when I was working file uploads feature, I remember I was just prompting the model to just… I remember we were launching a hundred key contexts. I was just prototyping this in their local browser. I did the demo. People really, really loved it. And they just wanted API for file uploads or something. And then that’s when it clicked to me, and also one of the blog posts a long time ago, it clicked on me prompting is a new way of product development or prototyping for designers and for product managers.
For example, one of the features that I want to do is have a personalized starter prompts. So whenever you come to Claude, it should recommend you starter prompts based on what your interests are. And so you can literally do it prompting for that.
Conversation as an Enduring Paradigm
Lenny Rachitsky: Mm-hmm. To experiment with that.
Karina Nguyen: Another feature was generating titles for the conversations. It’s a very small micro experience but I’m really proud of. The way we did that was we took five latest conversation from the model, asked the model, “What’s the style of the user?” And then for the next new conversation, the generated title will be of the same style. It’s just like really little micro experiences like this.
Operator: Computer-Using Agent
Lenny Rachitsky: That’s so cool. Did you do that at Anthropic or at OpenAI?
Technical Challenges of Computer-Using Agents
Karina Nguyen: At Anthropic.
Lightning Round Q&A
Lenny Rachitsky: Okay, cool. I love the file upload feature that Claude has by the way. ChatGPT doesn’t have that yet, is that right?
Karina Nguyen: I think has the way.
Lenny Rachitsky: [inaudible 00:26:23]
Karina Nguyen: I think the way it’s implement is very different though.
Lenny Rachitsky: Okay. Maybe it’s the PDF feature, because I use it all the time with Claude.
Karina Nguyen: Yeah.
Lenny Rachitsky: Okay.
Karina Nguyen: That’s cool.
Lenny Rachitsky: Somebody needs to get on that. Main, it’s wild how many features you built that I use every day and that many people use every day. This prototyping point you made is really important. It’s something that comes up a ton on this podcast also of how that… is maybe the way that AI has most impacted the job of product builders recently is just prototyping instead of going from showing just like, “Here’s a PRD, here’s a design.” PMs are more and more just, “Here’s the prototype with the idea that I have,” and it’s working. You can play with it.
Karina Nguyen: Yeah.
Lenny Rachitsky: Yeah. Okay, I want to spend a little more time on how you operate. So you talked about you built this in launch of this tasks feature, is that the way to describe your tasks?
Karina Nguyen: Yeah.
Lenny Rachitsky: So talk about how that emerged and let’s better understand just how you collaborate with product teams and how OpenAI works in that way, whatever you can share there.
Karina Nguyen: I think Canvas and tasks are going into the bucket of projects where it’s more short or medium terms. And actually the way Canvas and tasks came about to be was it started with one person prototyping and creating a spec. It’s kind of like PRD. It’s like creating a spec of the behavior of the model. I don’t think tasks is extremely groundbreaking feature necessarily. What makes it really cool is because the models are so general… Model can now search, they can write sci-fi stories, they can search for stocks, they can summarize the news every day. Because the models are so general giving something familiar to people that notifications is very familiar, having reminders is very familiar. So feeling like a form factor for the people who are very familiar, same as Canvas, Google Docs is very familiar, but then you add magical AI moment and it becomes very powerful.
But the way it comes usually operationally… Yeah, size is like a prototype, literally prompted prototype of how you would want the model to behave. For tasks, for example, you need to design… Literally design thinking is like okay, well, if the user says, “Remind me to go to lunch at 8:00 AM tomorrow,” what information does the model need to extract from that prompt in order to create a reminder? And so this is how you design a spec for a new feature, like a tool. Canvas and tasks are all tools. So it’s like how do you create the tool stack?
And then it’s mostly like developing JSON schema. It was like, “Okay, from this problem maybe the model should extract the time that the user requested.” And then you think about which format do you want the time to be? And then how do you want the model to notify you is basically the user should give instruction to the model. And then this instruction would fire off every day or something at that particular time. So, for example, if you say, “Every day I want to learn know about the latest AI news,” the model should rewrite into, “Okay search for the latest AI news and this task will get fired at that particular type that the user requested.”
And then your design is like tool spec. Actually, I don’t know. I feel like sometimes it’s through conversations I… Either people ask me to join the [inaudible 00:30:15] team and they’re like, “Oh my god, we need researchers.” Or like, “We need some support. We need to train the models,” or sometimes. Canvas was mostly like I just pitched the idea of… It got staffed quite immediately during the break, so it’s dependent on the project. And then usually with staffing is mostly a product manager, model designer, actual product designer, a couple of researchers and a bunch of applied engineers. Depends on the complexity of a project. And then for tasks it took, I don’t know, like two months or so to go from zero to one basically.
Lenny Rachitsky: Oh wow.
Karina Nguyen: For Canvas this was like four, five months, I guess, to go from zero to one. And then you teach product managers how to build evals and maybe how do we not only ship the better feature, but how do we think longer term? What kind of cool features did you want tasks to have? I think it would be nice for tasks to be a little bit more personalized. It’d be nice to have to create tasks via voice on a mobile, right? This is how you get research roadmap right here is thinking how the feature will be developed in the future.
And then from there it’s like you start getting data sets. With evals, you want to make sure that goes well. And then you need to have a trade-off between what methods you want to use. And the reason why I really love relying purely on synthetic data instead of collecting data from humans is because it’s much more scalable, it’s cheap, less than half. You literally sample from the model and you teach the core behaviors of the models and that will generalize to all sorts of diverse coverage.
And when you launch the beta feature, you learn so much from the users that you can… All your synthetic sets can be shifted in the distribution and how the users behave on the product behavior. And this is how we improve. And this is what happened with Canvass too when we launched from beta to GA.
Lenny Rachitsky: Okay.
Meetings start with everyone on the same page and end early. Problem solved, time saved. We know that everyone isn’t a one-take wonder when it comes to recording videos. So Loom comes with easy editing and AI features to help you record once and get back to the work that counts. Save time, align your team, stay connected and get more done with Loom. Now part of Atlassian, the makers of Jira. Try Loom for free today at Loom.com/Lenny. That’s L-O-O-M.com/Lenny.
Something that I want to help people understand, and I don’t even 100% understand this, is what’s the simplest way to understand the job of a researcher versus say a model designer and other folks involved? What’s the simplest way to understand what researchers do at OpenAI?
Karina Nguyen: So the project that I described are mostly product-oriented. Research is mostly product research. Another component of my team is actually more longer term exploratory projects. And it’s more about developing new methods, understanding those methods under a variety of circumstances. So basically developing methods, you need to follow very similar recipe of building evals but it’s much more sophisticated evals. You want to have outer distribution or if you want to measure generalization, you need to capture that.
But it is basically more sciencey in a way where… If we talk about synthetic data, one of the hardest things about synthetic data is how do you make it more diverse? Diversity in synthetic data is one of the most important questions right now. And so it’s like exploring ways to inject diversity as a general method that will work for all is one of the research explorations. Other ones is more developing new capabilities. I feel like it’s always about you work on this new method and you have signs of life that it’s working, either you think of how do you make it more general or you think of how do you make it very useful? And this is how the longer-term projects become more medium, short-term project.
Lenny Rachitsky: That makes sense. Essentially working on developing ways to make the model smarter, o4, o5, o6. New ways to… o1 was a big breakthrough, right?
Karina Nguyen: Yeah.
Lenny Rachitsky: The way it operates where it’s not just, “Here’s your answer,” it actually thinks and takes time to think through the process of coming up with an answer. Okay.
Karina Nguyen: Yeah.
Lenny Rachitsky: Very helpful. Speaking of that, of thinking about the future, where things are going, I want to spend some time on just this insight that basically you are building the cutting edge of AI, at the very bleeding edge of where AI is going and where it is. And so I’m very curious to hear just your take on how you think things are going to change in the world and how people work based on where you see things are going. And I know it’s a broad question, but let’s say in the next three years, how do you see the world changing? How do you see people’s way of working changing?
Karina Nguyen: It’s a very humbling experience to be in both labs, I guess. To me when I first came to Anthropic and I was like, “Oh no, I really love front-end engineering.” And then the reason why I switched to research is because I realized at that time it’s like, “Oh my god, Claude is getting better at front-end. Claude is getting better at coding. I think Claude can develop new apps or something and so it can develop new features for the thing that I’m working.” So it was kind of like this meta realization where it’s like, “Oh my god, the world is actually changing.” And when we first launched 100K context at that time, obviously I’m thinking about form factors that’s like file uploads were very natural, very familiar to people. But you can imagine we could just make infinite chats in the Claude.ai app, as if it’s 100K context.
But because file uploads… It’s like form follows function. It’s like the form factor, the file uploads can enable people to just literally upload anything, the books, any reports, financial and ask any task to the model. And then I remember it was either enterprise customers, financial customers were really interested in that. It’s like, “Oh wow.” It’s actually one of the very common tasks that people do in that setting. It’s kind of crazy to see how some of the redundant tasks are getting automated basically by these smart models.
And they’re entering the era where, I actually don’t know for example sometimes if o1 gives me the correct answer or not because I’m not an expert in that field. And it’s like, “I don’t even know how to verify the outputs of the models.” It’s because all my experts know they can verify this. So, yes, so basically there are trends that are going on. The first trend is the cost of reasoning and intelligence is drastically going down.
I had a blog post about this. Maybe I should update on latest benchmarks, because at that time everybody was doing one benchmark and they’d be… quickly saturated the benchmarks. So I’m like, “Now we need to do the same plot but with another frontier eval.” But the cost of intelligence is going down because it becomes that much cheaper. Small models are becoming even smarter than large models and that’s because of the distillation research.
This happened with Claude 3 Haiku. I was working with the training on the Claude 3 Haiku and I realized it was much smarter than Claude 2, which was way bigger, lots [inaudible 00:39:08]. But the power of small models become very intelligent and fast and cheap. We are moving towards that world. That has multiple implications, but the news is that people will have more access AI and that’s really good. Builders and developers will have much better access to AI, but also it means all the work that has been bottlenecked by intelligence will be unblocked.
I’m thinking about healthcare, right? Instead of going to a doctor, I can ask ChatGPT or give ChatGPT a list of symptoms and ask me, “Would I have a cold, flu, something else?” I can literally get the access to doctor almost. And there’s been some research studies around that.
Lenny Rachitsky: There was a New York Times story about that where they compared doctors to doctors using ChatGPT to just ChatGPT and just ChatGPT was the best of them. All doctors made it worse.
Karina Nguyen: Yeah, that’s crazy. Yeah. Yeah, that’s crazy, right? Education I think I would have dreamt if I had the tool like ChatGPT when I was young and would learn so much. But it’s like people can now learn almost anything from these models. So they can learn new language, they can learn how to build new look apps and write anything they do want. It’s humbling to have… launch Canvas and bring that thing to the people, enable them to do something else that they couldn’t have ever before. There’s something magical around this experience.
Education will have massive implications. I guess like scientific research, I think it’s the dream of any AI research is to automate AI research. It’s kind of scary, I’d say, which makes me think that people management will stay. It’s one of the hardest thing to… Emotional intelligence with the models, creativity in itself is one of the hardest things. So writers, I don’t think people should be worried as much. I think will alleviate a lot of redundant tasks for people.
Lenny Rachitsky: This is awesome. Okay, I want to follow this thread for sure. And it’s funny that what you described as you were an engineer at Anthropic and you’re like, “Okay, Claude is going to be very good at engineering. This isn’t going to be a potentially career long term, so I’m going to move into research and AI is going to need me for a long time to build it, to make it smarter.”
Karina Nguyen: I would say we still have… I think Canvas team has still have really cool front engineers that are really people who really care about interaction, design, interacting experience. I don’t think models are there yet I think if… But we can get the models to this top 1% of front-ends and things for sure.
Lenny Rachitsky: So what I want to move on to next along these lines is just, and this is just speculation, but what skills do you think will be most valuable going forward for product teams in particular? So folks are listening and they’re like, “Okay, this is scary. What should I be building now to help me stay ahead and not be in trouble down the road?” What skills do you think are going to be more and more important to build?
Karina Nguyen: Yeah, I think creative thinking. You want to generate a bunch of ideas and filter through them and not just build the best product experience. Listening. You want to build something that the most general model will not replace you. And oftentimes you build something and you make it really, really good for specific set of users and actually the mode is now in your user feedback. The mode is more in whether you listen to them, whether you can rapidly iterate. The mode is in here. I don’t think we are yet to… There are so many ideas, I think there’s an abundance of ideas that you can work on. I wouldn’t be worried. I feel like in fact I just think people in AI field are like… I wish they were a little bit more creative and connecting the dots across the print fields or something like that to develop really cool new generation and new paradigms of interactions with this AI.
I don’t think we’ve cracked this problem at all. A couple of years ago I was telling some people, I was like, “You want to build for the future.” So it’s like it doesn’t necessarily matter whether the model is good or not, good right now, but you can build product ideas such that by the time the models will be really good, it’ll work really well. I think it just happened naturally. For example, at Anthropic the Claude artifacts… And I feel early days of Canvas was, back in 2022 before ChatGPT, writing ideas was our knowledge [inaudible 00:44:36]. But I feel like Claude 1.3 model itself was not there to have made really extreme good high quality edits. For example, like coding.
And I feel like I see startups like Kaeser was doing super well. And that’s because they iterate so fast. They invent new ways of training models. They move really fast. They listen to what users like, massive distributions. Yeah, it’s kind of cool.
Lenny Rachitsky: That’s really helpful actually. So what I’m hearing is that soft skills essentially are going to be more and more important, powerful. You just talked about management, leading people, being creative and coming up with innovative insights, listening. There’s a post I wrote that I’ll link to where I try to analyze how AI will impact product management. And we’re actually very aligned, and my sense was the same thing, that soft skills are going to become more and more important. And the things that are going to be replaced is the hard skills, which is interesting because usually people value the hard skills like coding, design, writing really well. And it’s interesting that AI is actually really good at that because it’s taking a bunch of data, synthesizing it and writing, creating a thing, versus all these fuzzy things around of what influences, convinces people to do things and aligning and listening, like you said, creativity, anything along those lines come up as I say that.
Karina Nguyen: I think it’s actually a really, really hard to teach the model how to be aesthetic or do really good visual design or how to be extremely creative in the way they write. I still think ChatGPT kind of sucks at writing and that’s because it’s bottlenecked by this creative reasoning. I think characterization is one of the most important… I think for a manager, I feel like…
Actually, AI research progress is bottlenecked by management, research management. It’s because you have constrained set of compute and you need to allocate the compute to the research paths that you feel the most convinced about. It was like you need to have a really high conviction in the research paths to put the compute, and it’s more return on investment kind of situation. It’s like, “Okay, I’m thinking a lot about across all my projects, which projects are higher priority?” Prioritization and also on the lower level, “Which experiments are really important to run right now and which are not?” and cut through the line. So I was thinking prioritization, communication, management. People skills like empathy, understanding people, collaboration.
I think Canvas wouldn’t be an amazing launch if it wasn’t about people and I think it’s a wonderful group of people. And I get a chance to work with people like Lee Byron who’s a co-creator at GraphQL and some of the best Apple designers. It’s so cool to see… and how do you create this collaboration between people. It’s just something that’s still humane, I think.
Lenny Rachitsky: Let me just follow through a little bit. I imagine people listening are like, “Okay, but once we have AGI or SGI it’s like it’ll do all this.” There’s a world where like, “Why isn’t all this done?” I think it’s easy to just assume all that. I’m curious this idea of creativity and listening, why you think AI isn’t good at it, other than it’s just very hard to train it to do this well. Is there anything there of just why this is especially difficult for AI and LLMs to get good at?
Karina Nguyen: I think currently it’s difficult for many reasons. I think it’s still an active research area and it’s something that I think my team is working on. It’s like, “Okay, how do we teach the models to be more creative in the writing?” And so I’m thinking this new paradigm of wise that the models think more should actually lead to better writing in itself. But when it comes down to idea generation or discriminating of what is a good visual design or not, I feel like it hasn’t had learned examples from people to discriminate it very well. I do think it’s because there are not that many people who are actually really… It’s not accessible to models to learn from these people I guess. So I definitely think that’s why it sucks.
Lenny Rachitsky: Yeah, that makes sense. Basically there’s not enough of you yet, researchers teaching it to do these things, slash people that have incredible taste and creativity that can teach these things. You could argue this will come.
Karina Nguyen: Right.
Lenny Rachitsky: But we don’t need to keep going down that thread. Let me ask you a specific question. In this post I wrote, I made this argument that a lot of people disagreed with that strategy is something that AI tooling will become increasingly great at and take over. There’s the sense that that’s the thing that people will continue to be much better at and you can’t offload to AI basically developing your strategy, telling you what to do to win. My case is, “Isn’t strategy, just take all the inputs, all the data you have available, understand the world around you and come up with a plan to win?” It feels like AI and LLM would be incredibly smart at this. What’s your take?
Karina Nguyen: I think so too. I think again, you teach the model all sorts of tools and capabilities and reasoning and it’s like when it comes down to… For Canvas right now, it would be very cool for the model just aggregate all the feedback from users, summarize me the top five most painful flows on user experiences. And then the model itself is very capable of thinking of knowing how it’s been made, figure out how to create a dataset for itself to train on it. And I don’t think that we are far away from that self-improvement, models becoming self-improved by…
That, and the part of development, is basically self-improving. It’s kind of like its own organism or something. Again, like strategies, it’s more like data analysis and coming up with… I think what models are really good at is connecting the dots, I think. It’s like if you have user feedback from this source, but you also have an internal dashboard with metrics and then you have other feedback or input and then it can create a plan for you, recommendations even. And I think this is one of the most common use cases for ChatGPT too, is coming up with these sort of things.
Lenny Rachitsky: That makes sense essentially a human can only comprehend so much information at once and look at so much data at once to synthesize takeaways. And as you said, these context windows are huge now. Here’s all the information, what’s the most important thing I should do?
Karina Nguyen: Yeah, same as scientific research. Ideally the model would be able to suggest ideas, new ideas, or iterate on the experimental given the empirical results of the previous experiments like how do you come up with new ideas or the methods?
Lenny Rachitsky: Yeah. Oh, man. Okay, so just to close the loop on this conversation, this part of the thread is the skills you’re suggesting people focus on building and leaning into is soft skills like creativity, managing influence, collaboration, looking for patterns. Is that generally where your mind is at?
Karina Nguyen: Yeah, I’m thinking a lot about how do we make organizations more effectively and I think this is mostly management, I guess. It’s like how do you organize research teams or generally teams combined… Compose teams such that they will be at their maximally succeed or at the maximal performance of what can possibly… We can literally create the next generation of computers. It’s just the matter of conviction and the way you manage through that. It’s scaling organizations or scaling product research, I guess.
Lenny Rachitsky: Yeah, I think you’re basically building this thing and not efficiently doing it is limiting the potential of the human species right now.
Karina Nguyen: Right.
Lenny Rachitsky: It’s mismanagement within the research team in OpenAI and Anthropic and some of these other models.
Karina Nguyen: Yeah, it’s kind of crazy to think about it.
Lenny Rachitsky: Holy moly. Okay, so speaking of Anthropic and OpenAI, you’ve worked at both. Very few people have worked at both companies and have seen how they operate. I’m curious just what you’ve noticed about the differences between these two, how they operate, how they think, how they approach stuff. What can you share along those lines?
Karina Nguyen: It’s more similar than different. Obviously there was a lot of… There are some differences always comes to nuances. I would say culture. I really love Anthropic and I have a lot of friends there. And I also love OpenAI and they still have a lot of friends though. So it’s not about enemies. I feel like there’s in AI, it’s all like, “Yeah, they’re competitors. There’s enemies.” It’s actually like one big community of people doing the same thing. I would say what I’ve learned from Anthropic is this real care and craft towards model behavior, model craft, model training.
And I’ve been thinking a lot about, “Okay, what makes Claude Claude and what makes ChatGPT ChatGPT?” And it’s like I still have some sense of operational processes that leads to the outputs, to the model. It’s the outputed model. And it’s like the reason why Claude has so much more personality and is more like a librarian… I don’t know. I don’t know. I am visualizing Claude being like a librarian at some point, very nerdy or something. … is because I feel like it’s the reflection of the creators who are making this model. And a lot of details around the character and the personality and whether the model should follow up on this question or not.
What’s the correct ethical behavior for the model in these scenarios? A lot of crafts and curated datasets. This is where I learned that part of art, I guess, at Anthropic. I would say Anthropic is much smaller. When I joined it was, what, like 70 people? When I left it was tons of people. And obviously the culture changed so much. I really enjoyed being early days startup lives, and people knew each other as a family. But the culture shifted.
I would say that I learned from Anthropic that they’re much better at focusing and prioritization of… Very hardcore prioritization, I guess. And they need to do it. But I think OpenAI’s much more innovative and much more risk-takers in terms of product or research. Actually, in way your full-time job can be just teaching the model how to be creative writers. And it’s like there’s some luxury in this research freedom that comes with scale, maybe. I don’t know. I’d say I have much more creative product freedom to do almost anything, I guess, within OpenAI, evolve ChatGPT into the vision that we want. It’s more probably bottoms-up, I guess.
Lenny Rachitsky: Yeah, that’s how I was thinking about it. It feels like OpenAI is more bottoms-up, distributed, people bubble up ideas, try stuff. And that leads to more products launching, I imagine more things just kind of being tried versus more of a, “Let’s just make sure everything we do is awesome and great and craft and thinking deeply about every investment.”
Karina Nguyen: Right.
Lenny Rachitsky: That’s really interesting. I’ve never heard it described this way. Karina, we’ve covered so much ground. This is going to help a lot of people with so many ways of thinking about where the future’s going. Before we get to our very exciting lightning round, I’m curious if there’s anything else that you think might be helpful to share or get into?
Karina Nguyen: One of my regrets, I guess, when I was early days at Anthropic was that… I think there was some luxury of the time, because pre-ChatGPT, to actually come in with a bunch of ideas and prototype almost every day. And I think that we did a lot of cool ideas like Claude, and Slack was actually one of the first tool-usey products. It’s like Claude could operate in your workplace now. It’s kind of cool because you can add Claude to summarize the thread. So maybe you have an entire conversation with someone and then you want a summary of what happened you can ask Claude, “Summarize this.”
Also, it was really fun to iterate on the model itself. It’s like when you just talk to the model in Slack forever. It created some social element, it was kind like [inaudible 00:58:19] and this Discord, people learned so much about prompting and how to work with Claude. Actually, one of the features that was early tasks prototype is every Monday Claude would just summarize the entire channel. Or every Friday we’d just summarize a bunch of channels and give the news about the organization, or something.
And it’s kind of like really cool form factor. I think thinking about form factor’s a really important question in AI, especially we haven’t even figured out how do we create an awesome product experience with o-series models. It’s like the paradigm between synchronous real time give an answer paradigm into more asynchronous paradigm of agents working on the background. But then now the question is the agents should build trust with you, right? And trust builds over time, which is like with humans. And you start this collaboration which is why this collaboration model with you and the model is so important because you build trust and the model learns from your preferences so that it can become more personalized and it will start predicting the next action that you want to take on the computer or something. And it’s more predictive, much more… We went from personal computers to personal model basically here.
Lenny Rachitsky: Why is it not a thing? That seems like such an obvious feature that every LLM should have as a Slack bot version of them. Is that a thing I can help you install? Or is that not a thing right now?
Karina Nguyen: I know that Claude and Slack was sunset in 2023 or something. I think it was after ChatGPT was mostly the focus on customer use cases or enterprise use cases.
Lenny Rachitsky: Mm-hmm. Bummer.
Karina Nguyen: I think the form factor of Claude and Slack was kind of constrained a little bit when you want to talk about new features.
Lenny Rachitsky: Bummer. I want that.
Karina Nguyen: I know that ChatGPT had Slackbar tools. I don’t know, maybe it will come back sometime.
Lenny Rachitsky: All right, I would pay for that. Any other memories from that time of early days? Because that’s a really special place to have been is early days Anthropic. Any other memories or stories from that time that might be interesting to share?
Karina Nguyen: I think the very first launch when we felt… When click from use, again, was 100K context launch is when the models could input the entire book and give you a summary of the book or something. Or the financial… or catalog multi files financial reports and then give you an answer to the question, to very specific questions. I think there was something in there that was kind like, “Oh my god, this is a really cool new capability.” Not model capability, but more like the capabilities that came from the product form factor itself rather than the model capability as much.
I think other prototypes that we were thinking about… There’s one part having a Claude workspaces and it’s kind of the same idea of Claude and I would have this shared workspace and that share workspace is like a document and we can iterate on the document. And I feel like sometimes the ideas, [inaudible 01:01:55] and they’re locked for two years, just like in this case.
Lenny Rachitsky: It’s interesting, there’s these milestones that kind of open up our view of what is happening and where things are going. ChatGPT think was the first of just like, “Wow, this is much better than I would’ve thought.” You talked about 100K context windows where you could upload a book and ask it questions and have it summarize. I actually use that all the time. When I have interview guests and they wrote a book, I sometimes don’t have time to read the whole book. So I use it to help me understand what the most interesting parts are. And then I actually dive into the book, just to be clear. And then, I don’t know, maybe voice was another one where you could talk to say ChatGPT. Is there any other moments there that you’re like, “Wow, this is much better than I thought it was going to be?”
Karina Nguyen: Yeah, I think the computer use agents, like the model operating the desktop. And you can essentially think of new kind of experience where the model can learn the way you browse. And from that preference it can just browse as just like you. It’s kind of simulated persona. And it’s actually very similar to the idea of like, “Okay, maybe Sam Altman doesn’t have a lot of time. Maybe I want to talk to his simulation and ask…” Or, for example, I really appreciate some of the technical mentorship. Yeah, cool. But he doesn’t have a lot of time so it’s like I really want to ask him this questions. How do you respond with simulated environments like this would be really cool.
Lenny Rachitsky: That’s a great place to plug Lennybot, have one of those. It’s trained on all of my podcasts and newsletters.
Karina Nguyen: Oh, cool.
Lenny Rachitsky: It sits on many models. I don’t know which exactly they use, but it’s exactly that. And it’s not even me, it’s all the guests that have been on the podcast and on newsletter as I wrote. And you could just ask it, “How do I grow my product? How do I develop a strategy?” And it’s actually shockingly good.
Karina Nguyen: Do you feel like it reflects who you are?
Lenny Rachitsky: Yeah.
Karina Nguyen: Or would it be… Okay.
Lenny Rachitsky: The best part of it is you can talk to it. There’s an ElevenLabs voice version that’s trained on my voice from this podcast, and it’s actually very good and people have told me they sit there for hours talking to it.
Karina Nguyen: Wow.
Lenny Rachitsky: And somebody told it, “Interview me like I am on Lenny’s podcast, ask me questions about my career.” And he did a half hour podcast episode with Lennybot.
Karina Nguyen: Oh my god, that’s so fun.
Lenny Rachitsky: It’s incredible. Future is wild.
Karina Nguyen: Yeah. I think content transformation is… I would imagine sometime when you generate a sci-fi story in Canvas, you can transform this into audiobook where you have very natural content transformation of one media to another media. I think one of my earliest inspiration is one of the last episodes of Westworld where, I don’t want to spoil, but where Dolores comes to her work at that time and she comes to this new workspace and she starts writing a story. And then as she writes a story, a 3D, virtual reality, starts creating on the fly. So I kind of want to create that. Kind of cool.
Lenny Rachitsky: Wow. Speaking of medium, I guess I was wondering if I should go in this direct or not, but real quick. Kevin Weill/Kevin Weill, I don’t know exactly how to pronounce his last name, the CPO of OpenAI.
Karina Nguyen: Kevin Weill, uh-huh.
Lenny Rachitsky: Is it Weill or Weill?
Karina Nguyen: I think Weill.
Lenny Rachitsky: Weill. Okay. Okay. Let’s just say that. We’ll go with that.
Karina Nguyen: I hope, yeah.
Lenny Rachitsky: He did a panel at the Lenny and Friends Summit last year and he made this really fascinating point that chat is a really interesting interface for these tools because they’re just getting smarter and smarter and smarter and smarter and smarter. And chat continues to work as a paradigm to just interact with them, similar to a human. You could talk to Albert Einstein. You could talk to someone not very smart and it’s all conversation still. And so it’s a really flexible way to interact with increasingly good intelligence. At some point it’ll not be so great, and you were talking about all these ways that you’re adding additional ways to interact. But it’s interesting chat proved to be a really powerful layer on top of all this stuff.
Karina Nguyen: Yeah, that’s real cool. I feel like chat also has social element which is very humane. It’s like, yeah, you sometimes want to get into group chat. And having conversations with AI is kind of like a group chat in itself, as messaging. Actually, this idea of how do you build features like this? I see tasks as this general feature that will scale very nicely as the models would develop new capabilities themselves. The models will be able to do better searches and create new… come up with more creative writing on render, react apps and like HTML apps. And you can have everyday new puzzle for you, every day continue the story from the previous days. It scales very nicely.
Lenny Rachitsky: You mentioned something as we were getting into this extra section that we ended up going down is this idea of the agents using a computer. I know this is actually something you are going to launch today, the day we’re recording it, which will be out by the time this comes out, called Operator, can you talk about this very cool feature that people will have access to?
Karina Nguyen: Yeah, so I unfortunately did not work on that, but I’m really, really excited about this launch. It’s basically an agent that can complete the task in its own virtual computer, in its own virtual environment. You can do any literally task like order me a book on Amazon. And then ideally the model will either follow up with you which book do you want, or know you so well that it start recommending, “Oh, here is the five books that I might recommend you to buy.” And then you hit, “Yeah, help me buy.” And then the model goes off into its own virtual little browser and complete the task and buy the book on the Amazon. And then if you give the model credentials, credit cards, obviously it comes with a lot of trust and safety, then it will just complete the thing for you. It’s a virtual assistant.
Lenny Rachitsky: It’s interesting how this just sounds like obviously this should happen. Why is this not yet a thing? Which is also mind-blowing that we’re just assuming this should exist. Just some AI doing things for you on a computer we just ask it to do.
Karina Nguyen: Yeah.
Lenny Rachitsky: It’s absurd.
Karina Nguyen: It’s actually really hard. And I think you’re still cracking this, but feel like… I don’t know if you use Tuple like a pair programming product.
Lenny Rachitsky: No.
Karina Nguyen: But at Anthropic we loved pair programming, so if you used-
Lenny Rachitsky: Oh yeah, Shopify uses this. I remember it came up on a podcast episode.
Karina Nguyen: Oh, nice. Yeah, so it is a very cool product where you can just call anyone at any time and then share screen and the other person can have access to the screen or start literally operating your computer. And it’s very realtime… The allegiance is very… it’s very high quality. And it’s just like I kind of want the same. I want to pair program with my model and the model should even talk to me. Draw very specific section in my code and just go to tell me… Obviously teach me and we can have different modes. It’s like right, this is a product right here for you. I don’t know. Some people should build that.
Lenny Rachitsky: It sounds like a startup just got birthed-
Karina Nguyen: Yes.
Lenny Rachitsky: … from someone listening to this. You mentioned that it’s very hard to do this agent controlling a computer as you and helping out. What makes it so hard for whatever, however much you can explain briefly?
Karina Nguyen: Much of it is because right now the model’s operating on pixels instead of language or whatnot. Pixels is actually really, really hard. The models [inaudible 01:10:25] perception, or visual perception. I think there’s still a lot of multimodal research that’s going on, but I think language scaled so much easier compared to multimodal because of that.
Another thing that I guess my team is working that is how do you derive human intent very correctly? It’s like sometimes does the model know enough information to ask a follow-up question or to complete the task? You don’t want an agent to go off for 10 minutes and then come back with an answer that you didn’t even want. That actually creates much more worse user experience. And this comes with teaching the model people skills. It’s like, “What do people like? Kind of like creating the mental model of the user and care about the user in order to ask certain questions. Actually, that part is hard to do for the models.
Lenny Rachitsky: That relates to what we talked about earlier where this kind of the soft skill, people skills piece is not where these models are strong yet.
Karina Nguyen: Yeah.
Lenny Rachitsky: Okay. I’m going to skip the lightning round. I want to ask just one question from the lightning round, something fun.
Karina Nguyen: Yeah.
Lenny Rachitsky: Okay, so when AI replaces your job, Karina, I’m curious what you’re… And it gives you a stipend, gives you a monthly stipend. Here’s your salary for the month. What would you want to do? What do you want to spend your time on? What will you be doing in this future world?
Karina Nguyen: I’ve been thinking about this a lot times. I feel like I have a lot of jobs options. I would love to be a writer, I think. I think that would be super cool. You should write short stories, sci-fi stories, novels. I really like art history, so you know those conservationists in the museums who just try to preserve art paintings, but just painting through a long day?
Lenny Rachitsky: Mm-hmm.
Karina Nguyen: I think that would be really cool to do. Yeah.
Lenny Rachitsky: That sounds beautiful.
Karina Nguyen: I don’t know.
Lenny Rachitsky: What I’m hearing is you need to Nerf these models to not get very good at writing so that you can continue… Although at that point you don’t need to do it from… You don’t need people to buy it, you’re just doing it for fun, so it doesn’t even matter if they’re incredibly good at writing or art conservation. Oh man, what an episode of our conversation. What a wild time we’re living in. Karina, thank you so much for being here. Two final questions. Where can folks find you online if they want to reach out and follow up on anything? And how can listeners be useful to you?
Karina Nguyen: You can find me, I’m on Twitter it’s KarinaNguyen. You can also shoot me an email on my website. And my team is hiring and so I’m looking for research engineers, research scientists, as well as machine learning engineers, people who come from product engineers who want to learn model training. I’m actually hiring for my team. My team is called Frontier Product Research, and we train models, we develop new methods but for product oriented outcomes.
Lenny Rachitsky: What a place to work. Holy moly. What’s the best way for people to apply for these very lucrative roles?
Karina Nguyen: I think you can shoot me a DM on Twitter.
Lenny Rachitsky: Okay.
Karina Nguyen: Or I’m yet to create a job description for them.
Lenny Rachitsky: Okay. This is the job description.
Karina Nguyen: Or you can apply into post training team. Yeah.
Lenny Rachitsky: Okay. You’re going to get a flood of DMs. I hope you’re prepared. Karina, thank you so much for being here. This was incredible.
Karina Nguyen: Thank you so much, Lenny.
Lenny Rachitsky: Bye, everyone.
Karina Nguyen: It was fun.
Lenny Rachitsky: Thank you so much for listening. If you found this valuable, you can subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Also, please consider giving us a rating or leaving a review as that really helps other listeners find the podcast. You can find all past episodes or learn more about the show at LennysPodcasts.com. See you in the next episode.
Glossary
| English | 中文 |
|---|---|
| AGI | AGI(通用人工智能) |
| artifacts | artifacts |
| bottoms-up | 自下而上 |
| Canvas | Canvas |
| collaborative agent | 协作型智能体 |
| computer use agents | 计算机使用智能体(computer use agents) |
| context window | 上下文窗口 |
| curated dataset | 精心策划的数据集 |
| data wall | 数据墙 |
| distillation | 蒸馏(distillation) |
| dogfooding | 内部测试(dogfooding) |
| evals | evals(评估基准) |
| form factor | 产品形态(form factor) |
| frontier eval | 前沿 eval |
| Frontier Product Research | Frontier Product Research(前沿产品研究团队) |
| function calls | 函数调用(function calls) |
| GPQA | GPQA(Google-proof Question Answering) |
| GraphQL | GraphQL |
| ground truth | 真值 |
| hill climb | 攀升(hill climb) |
| IC (Individual Contributor) | IC(独立贡献者) |
| JSON schema | JSON schema |
| Kaeser | Kaeser |
| Kevin Weil | Kevin Weil |
| Lee Byron | Lee Byron |
| model craft | 模型工艺(model craft) |
| next token prediction | 下一个 token 预测 |
| Operator | Operator(计算机使用智能体产品) |
| post-training | 后训练 |
| PRD (Product Requirements Document) | PRD(产品需求文档) |
| reinforcement learning | 强化学习 |
| SGI | SGI(超级通用人工智能) |
| soft skills | 软技能 |
| spec | 规格文档(spec) |
| synthetic data | 合成数据 |
| tool stack | 工具栈(tool stack) |
| Tuple | Tuple(结对编程产品) |
| win rate | 胜率 |
Reformatted by reformat_english.py
OpenAI 研究员谈为什么软技能是工作的未来 | Karina Nguyen
文字稿
Lenny Rachitsky: 你不仅工作在 AI 和大语言模型的最前沿,你实际上就在构建这个最前沿。
Karina Nguyen: 我刚到 Anthropic 的时候,心想:“天哪,我真的很喜欢前端工程。“后来我转做研究,是因为我意识到:“天哪,Claude 在前端方面越来越厉害了,编程方面也越来越厉害了。我觉得 Claude 可以开发新的应用。”
Lenny Rachitsky: 你认为未来对产品团队来说,哪些技能会最有价值?
Karina Nguyen: 创造性思维——你需要产生大量想法,然后从中筛选,而不仅仅是构建最好的产品体验。我认为,教模型学会审美、真正优秀的视觉设计,或者以极具创造力的方式写作,其实是极其困难的。
Lenny Rachitsky: 你认为人们对模型创建过程最大的误解是什么?
Karina Nguyen: 当你教模型认知自身——告诉它”你实际上没有物理身体,无法在物理世界中操作”时——模型会变得极其困惑。
嘉宾介绍
Lenny Rachitsky: 今天的嘉宾是 Karina Nguyen。Karina 是 OpenAI 的一名 AI 研究员,她参与构建了 Canvas、Tasks、o1 思维链模型(chain-of-thought)等产品。在加入 OpenAI 之前,她在 Anthropic 主导了 Claude 3 模型的后训练和评估工作,构建了支持 100K 上下文窗口的文档上传功能,还有更多其他成果。她还在《纽约时报》做过工程师,在 Dropbox 和 Square 做过设计师。能窥见一位身处 AI 和大语言模型最前沿的人是如何工作的、她如何看待未来的发展方向,这样的机会非常难得。
在这次对话中,我们聊到了 OpenAI 团队如何运作和构建产品、她认为随着 AI 变得越来越智能你应该培养哪些技能、模型是如何创建的、为什么合成数据将使模型持续变得更智能,以及为什么她在意识到大语言模型将在编程方面变得多么出色之后,从工程转向了研究。
正式对话
Lenny Rachitsky: Karina,非常感谢你来参加节目,欢迎来到播客。
Karina Nguyen: 非常感谢你邀请我,Lenny。
Lenny Rachitsky: 我非常高兴你能来,因为你不仅工作在 AI 和大语言模型的最前沿,你实际上就在构建这个最前沿。你最近发布了这个功能,基本上是……OpenAI 的第一个 Agent 功能。我还做了一个调查,不知道你有没有注意到——我调查了我的读者,问他们在工作中每天使用哪些工具、最常使用哪些工具。ChatGPT 排第一,超过了 Gmail,超过了 Slack,超过了其他一切。90% 的人说他们经常使用 ChatGPT。
Karina Nguyen: 这个数据挺不错的。
Lenny Rachitsky: 太不可思议了。两年前它还不存在。
Karina Nguyen: 是的。
Lenny Rachitsky: 另外,我们录制这期节目的这一周,OpenAI 刚刚宣布了 Stargate 项目,这是一个五千亿美元的 AI 基础设施投资计划。AI 领域不断有大事发生,而你对事情是如何运作的、未来走向哪里、工作是如何推进的,有着非常独特的视角。所以我有很多问题想问你。我想聊聊你在 OpenAI 是如何工作和运作的,你认为未来哪些技能会越来越重要、哪些会越来越不重要,以及从更宏观的角度来看事情正在走向何方。听起来怎么样?
Karina Nguyen: 听起来很棒,非常感谢。是的,我非常幸运能在 Anthropic 早期就加入,在那里学到了很多东西。我大约八个月前加入了 OpenAI,所以,是的,我很期待深入聊聊——
Lenny Rachitsky: 好,我肯定会问你这两家公司的区别,但我想先从更技术性的话题开始,直接切入正题。我想聊聊模型训练。人们总是听到关于模型被训练的说法——那些大模型、需要多少数据、要花多长时间、要砸多少钱,还有我们快要没数据了——这个话题我后面也想聊。我先问你这个问题:你认为人们对模型创建过程最大的误解是什么?
模型训练更像一门艺术
Karina Nguyen: 模型训练与其说是科学,不如说更像一门艺术。在很多方面,我们作为模型训练者,会花大量时间思考数据质量。这是模型训练中最重要的因素之一——如何为你想要创建的特定交互模型行为确保最高质量的数据?但调试模型的方式其实和调试软件非常相似。我在 Anthropic 早期学到的其中一个经验是,我们特别是在 Claude 3 的训练中发现,当你教给模型一些软知识,比如”嘿,你其实没有物理身体来在物理世界中操作”,但同时又用数据教了模型一些 function calls(函数调用),就是”设置闹钟是这样操作的”——
于是模型就会非常困惑,它到底能不能设闹钟,但它又没有物理世界的身体。模型被搞糊涂了,有时候会过度拒绝。有时候它会说,“抱歉,我不知道,我没法帮你。“所以这里面始终存在一个平衡取舍:如何让模型对用户更有帮助,同时在其他场景下又不产生危害。这始终是关于如何让模型更加健壮,能在各种多样化的场景下正常运行。
Lenny Rachitsky: 这太有意思了,我从来没想过这一点。它训练用的大部分数据基本上都默认它像一个描述世界及其运作方式的人类。数据假设存在一个身体、你可以做各种事情,然后模型却被告知你没有身体。
Karina Nguyen: 对。
数据墙与合成数据
Lenny Rachitsky: 好,我想顺着这个话题聊聊数据。我知道你在这方面有很鲜明的观点。现在有一个流行的说法,认为模型会停止变得更聪明,因为数据快用完了。它们的训练很大程度上依赖互联网,但互联网只有一个,而且已经被训练过了,你还能给它们展示什么关于世界的新东西?还有一个趋势是合成数据,这个术语”合成数据”。什么是合成数据?你为什么觉得它重要?你觉得这条路走得通吗?
Karina Nguyen: 我觉得这里面其实有两个问题,我们可以一个一个拆解。人们说我们在撞数据墙。我觉得大家更多是在预训练大型模型的语境下思考,这些模型在整个互联网上训练来做 next token prediction(下一个 token 预测)。但模型在这个过程中实际上学到的是——如何压缩,这里的压缩算法是什么?模型学会了压缩大量知识,学会了如何对世界建模。比如下一个词的预测,“教我开车”这个例子,基本上只有少数几个词能匹配上,比如”汽车”。所以模型实际上本身就在学习关于世界的知识。它在建模人类行为,有时候在建模……当你跟预训练模型对话时,那些模型非常非常大,它们实际上极其多样化、极具创造力,因为你几乎可以通过一个预训练模型跟任何 Reddit 用户对话。
但我认为目前发生的事情是,随着 o1 系列的新范式出现,后训练(post-training)阶段的扩展本身并没有撞墙。原因基本上是我们从预训练模型的原始数据集,转向了后训练世界中可以通过强化学习教给模型的无穷无尽的任务。任何任务,比如如何搜索网页、如何使用电脑、如何写作,各种你想教给模型的不同技能。这就是为什么我们说没有数据墙之类的,因为任务的数量是无限的,这也是模型变得极其超智能的方式。而且我们实际上已经在所有基准测试上趋于饱和。
所以我认为瓶颈其实在于评估——我们没有所有前沿的 evals,比如 GPQA,也就是 Google-proof question answering,博士级别的智能。这个基准已经达到了百分之六七十,这就是博士的水平。所以确实是在评估上撞墙了。
Lenny Rachitsky: 这两条线索我都想接着聊。第一个是关于合成数据的想法。一种简单的理解方式是不是:模型生成未来模型要训练的数据,你让它生成各种做事的方式、各种你说的任务,然后更新的模型就用之前模型生成的这些数据来训练?
Karina Nguyen: 有些任务确实是合成筛选出来的。这是一个活跃的研究领域——如何合成地构造新任务让模型去学习。有时候你开发产品时,会从产品和用户反馈中获得大量数据,你也可以在后训练中使用这些数据。有时候你仍然需要使用人类数据,因为实际上有些任务真的非常非常难教。只有专家才知道某些化学知识或生物知识,所以你确实需要大量借助专家的知识。所以我觉得对我来说,合成数据训练更多是为了产品——它是一种针对类似产品成果的快速模型迭代。我们可以更深入地聊,但我们做 Canvas、任务以及 ChatGPT 的新产品功能,主要就是通过合成训练完成的。
从聊天机器人到协作智能体
Lenny Rachitsky: 那我们就展开聊聊吧,这真的很有意思。我想聊聊 evals,但先顺着这条线索走。说说这怎么帮你创建了 Canvas。
Karina Nguyen: 我刚来 OpenAI 的时候,真的有这样一个想法:“如果 ChatGPT 能真正改变视觉界面,同时也改变它与人的交互方式,那就太酷了。“也就是从一个聊天机器人变成一个协作型智能体(collaborative agent),而这个协作者是朝着最终成为创新者的通用系统迈出的一步。于是整个团队——应用工程师、设计师、产品、研究人员——几乎是凭空组建起来的。就是一群人聚在一起,然后我们开始快速迭代。
实际上 Canvas 是 OpenAI 第一个研究人员和应用工程师从产品开发周期最一开始就一起合作的项目。我认为过程中我们学到了很多东西,但我来的时候确实带着这样的思路:“我们需要做非常快速的模型迭代,这样工程师就能更容易地使用最新的模型,同时也能从用户反馈或早期的内部测试(dogfooding)中学习。如何非常快速地改进模型?”
当产品上线后,很难搞清楚人们会怎么使用它。而合成训练模型的方式,本质上是弄清楚你希望产品功能具备哪些最核心的行为。以 Canvas 为例,最终归结为三个主要行为:一是如何在类似”给我写一篇长文”这样的提示下触发 Canvas——当用户意图主要是对长文档进行迭代时;或者”给我写一段代码”;以及何时不触发 Canvas——比如”能告诉我更多关于某某总统的事吗”这类一般性问题。你不想触发 Canvas,因为用户的意图主要是获取答案,而不一定是要对长文档进行迭代。
第二种核心行为:教模型更新文档
Karina Nguyen: 第二种行为是我们如何教模型在用户提出要求时更新文档。我们教给模型的行为之一,实际上就是让它具有一定的主动性和自主性——直接去文档里选中特定的部分,要么删除,要么编辑,比如高亮某段然后重写。有时候用户只会说”把第二段改得更友好一些”,我们就得教会模型真正找到文档中的第二段,然后把语气改得更友好。所以基本上你要同时教它两件事:一是如何触发编辑行为本身,二是如何让模型产出更高质量的文档编辑。
代码编辑中的决策边界
在代码场景中,比如说,还有一个问题是模型完全重写整个文档的能力有多强, versus 进行非常精准的定向编辑。所以在编辑行为内部还有另一层决策边界:“是把整个文档选中全部重写,还是做非常精准的定制化编辑。“我们刚上线模型的时候,会倾向于让模型更多地做整体重写,因为我们看到重写的质量明显更高。但随着时间推移,你会根据用户反馈和迭代部署中学到的东西不断调整。
第三种核心行为:教模型做评注
最后,第三种我们通过合成方式教给模型的行为是如何对任何文档做出评注。我们的做法是用 o1 模型来模拟一段用户对话,比如”给我写一篇关于某某主题的文档”。然后我们用 o1 来生成文档,接着注入一条用户提示,比如”哦,给一些评注,批评一下我的写作”或者”批评一下你刚才写的这篇文章”。然后我们教模型对文档做出非常具体的评注。所以这里也涉及你希望模型做出什么样的评注——它们是否有道理?你如何评判评注的质量?这一切最终都归结到通过非常严谨的 evals(评估基准)来衡量进展。但总的来说,这就是我们如何利用 o1 和合成数据生成来做训练的。
Lenny Rachitsky: 好的,这太有意思了。你提到了”教模型”这个概念,也提到了用合成数据教模型不同行为——这可以理解为一种简单的思考方式。基本上你就是通过向它展示”成功是什么样的”来做这件事,用的基本上就是 evals。是不是可以这样简单地理解:“这就是你做这件事成功时应该有的样子”,然后它就学到了,“好的,我明白了,这就是我应该做的”?
Karina Nguyen: 对,说得很好,没错,你理解对了。
日常工作是什么样的
Lenny Rachitsky: 好的,明白了。我想开始展开聊聊你做这些事情时日常工作是什么样的。是你坐在那里跟某个版本的 ChatGPT 对话,来设计这些 evals 吗?
Karina Nguyen: 有时候我会这么做。有时候我确实会坐在那里和 ChatGPT 一起工作。其实我觉得我从 Anthropic 那里学到了很多——他们的人花大量时间去给模型写提示词,而且总是在遇到质量很差的批次,但恰恰在这个过程中你会获得很多新的想法:怎么让模型变得更好?比如”这个回复有点奇怪,它为什么这么做?“然后你开始调试,或者开始琢磨新的方法——怎么教模型以不同的方式回应,让它的性格更好之类的。
所以模型性格的形成用的也是类似的方法,非常相似的套路。但回到正题,我觉得我在 OpenAI 的工作内容发生了变化。我刚来的时候主要是做研究 IC 的工作,所以我在跑代码、训练模型、写 evals,和产品经理、设计师合作,教他们怎么思考评估这件事。我觉得那段经历非常棒,本质上就是一个采纳的过程——“我们该如何为 AI 模型做 AI 功能的产品管理?“不过现在我的工作主要是管理和带人。不过我还是会做 IC 研究和写代码,一直到下午四点。只是角色有所转变。
Lenny Rachitsky: 好了,别太多聊做管理者的事。
Karina Nguyen: 好。
Lenny Rachitsky: 因为现在大家都在裁管理者。“谁还需要管理者?“这是我现在的感受。开玩笑的。不过很有意思的是,你花了大量时间教产品团队理解 evals 如何整合进来,以及它的重要性。我听过好几次这种说法了,但自己还没有亲身经历过,所以我觉得这是一条很重要的线索——写这些评估将会越来越成为产品团队工作的重要部分,尤其是当他们构建 AI 功能、与 LLM 协作的时候。所以你能不能再多聊聊这具体是什么样的?是坐在那里对着一个 Excel 电子表格,写着”这是输入、这是输出、这是结果有多好”?请非常具体地讲讲实际操作是什么样的。
evals(评估基准)的实际操作
Karina Nguyen: 这当然取决于你在开发什么,但有各种不同类型的评估方式。有时候我会让产品经理——我们现在还有新的角色叫模型设计师——去翻看一些用户反馈,或者想出各种应该触发某个行为的用户对话场景:在这些情况下应该触发 Canvas。然后你就有了一个真值标签:“好吧,这段对话应该触发 Canvas,那段对话不应该触发 Canvas。“你就有了这种非常确定性的 eval,专门用于决策行为的判断。
比如我们上线任务功能的时候,如何正确地处理日程安排其实对模型来说非常难。但我们构建了一些确定性评估,比如”如果用户说了晚上 7 点,模型就应该输出晚上 7 点。“这样你就可以有确定性的 eval——通过或不通过。运作方式是这样的——有时候我会让产品经理直接建一个电子表格,里面有各种标签页,写着当前行为是什么、理想行为是什么、为什么,还有一些备注。
有时候他们把它用于 evals,有时候我们用于训练。因为如果你把这个电子表格给到 o1 模型,它大概能自己想明白怎么学到好的行为。还有第二种更常见的 eval 是人工评估。你可以有专门的训练师,或者内部人员——当给定一段提示词对话,然后有多个模型的完成结果,你来选择胜率。哪个模型最好?哪个模型产出的评注或编辑质量最高?然后你就可以持续跟踪胜率。当你开发新模型时,它应该总是能赢过之前的模型。所以取决于你想衡量什么。
Lenny Rachitsky: 太有意思了。基本上我听到的——这也是我在和不同人交流时正在学到的东西——是产品开发可能会从”这是一份规格文档 PRD,我们一起把它建出来,然后好了,我们来 review 一下,满意了吗?“转变到”嘿,AI,帮我建这个东西,这是正确的样子应该是什么”——然后我把所有时间都花在定义”正确的样子”上,本质上就是在做 evals。
Karina Nguyen: 你当然需要衡量模型的进展,这就是 evals 的用武之地——因为你已经有了一个提示词驱动的模型作为基线。而最健壮的 eval 是那些提示词基线得分最低的 eval。因为这样的话,你就知道如果你训练了一个好模型,它应该能在这个 eval 上不断攀升,同时不会在其他智能 eval 上退化。这就是我说的,它与其说是科学不如说是艺术。就像,“好吧,如果你让模型针对某种行为做优化,你不想让它在其他智能领域出现脑损伤……”这种情况在每个实验室、每个研究团队中都在不断发生。
用提示词做产品原型
我想说提示词也是一种产品创意原型开发的方式。在 Anthropic 早期我做文件上传功能的时候,我记得我就是让模型……我记得我们当时在上架一百 key 上下文。我就在本地浏览器里做了原型演示。大家真的非常喜欢,他们就是想要文件上传的 API 之类的。那时候我恍然大悟,再加上很久以前的一篇博文,让我意识到提示词是产品开发的一种新方式,或者说对设计师和产品经理来说是一种新的原型开发方式。
比如我想做的一个功能是个性化的起始提示词。每次你打开 Claude,它应该根据你的兴趣推荐起始提示词。你完全可以用提示词来做这件事的实验。
Lenny Rachitsky: 嗯哼,用这个来实验。
Karina Nguyen: 另一个功能是为对话生成标题。这是一个很小的微体验,但我真的很自豪。我们的做法是取最近五条对话,问模型”这个用户的风格是什么?“然后下一条新对话生成的标题就会是同样的风格。就是这种非常小的微体验。
Lenny Rachitsky: 太酷了。你这是在 Anthropic 做的还是 OpenAI?
Karina Nguyen: 在 Anthropic。
Lenny Rachitsky: 好的,酷。顺便说一句,我很喜欢 Claude 的文件上传功能。ChatGPT 还没有这个功能,对吧?
Karina Nguyen: 我觉得已经有了。
Lenny Rachitsky: 嗯……
Karina Nguyen: 但我觉得实现方式很不一样。
Lenny Rachitsky: 好吧。也许是 PDF 功能,因为我用 Claude 的时候一直在用。
Karina Nguyen: 对。
Lenny Rachitsky: 好的。得有人赶紧跟进一下。真的,你做了太多我每天都在用、很多人每天都在用的功能,这太疯狂了。你提到的这个原型开发的观点非常重要。这个播客里也经常谈到——也许 AI 最近对产品构建者工作最大的影响就是原型开发,从”给你一份 PRD、给你一份设计”变成 PM 越来越多地说”这是我想法的原型”,而且它真的能跑,你可以直接玩。
Karina Nguyen: 对。
任务功能的诞生
Lenny Rachitsky: 好的,我想再多聊聊你的工作方式。你之前提到你构建并上线了任务功能,可以这样描述吗?
Karina Nguyen: 对。
Lenny Rachitsky: 那聊聊它是怎么产生的,让我们更好地理解你和产品团队怎么协作,以及 OpenAI 在这方面是怎么运作的,能分享多少就分享多少。
Karina Nguyen: 我觉得 Canvas 和任务属于那种中短期项目的范畴。实际上 Canvas 和任务诞生的过程都是从一个人做原型、写规格文档开始的。有点像 PRD,就是写模型行为的规格。我不觉得任务是一个极其开创性的功能。它之所以很酷,是因为模型如此通用——模型现在可以搜索,可以写科幻故事,可以查股票,可以每天总结新闻。因为模型如此通用,给人们一个熟悉的东西——通知是非常熟悉的,提醒是非常熟悉的——给人一种熟悉的形态,就像 Canvas 一样,Google Docs 是非常熟悉的,但加上神奇的 AI 时刻,它就变得非常强大。
但在运作层面它通常是怎么来的……对,就是一个人做原型,就是字面意义上的提示词原型,描述你希望模型怎么表现。比如对于任务功能,你需要设计……字面意义上的设计思维就是,好吧,如果用户说”明天早上 8 点提醒我去吃午饭”,模型需要从这条提示词中提取什么信息才能创建一个提醒?这就是你为新功能设计规格的方式,比如一个工具。Canvas 和任务都是工具。所以问题是如何创建工具栈?
然后就主要是开发 JSON schema。就是”好吧,从这个问题出发,模型也许应该提取用户要求的时间。“然后你想让时间的格式是什么?你希望模型怎么通知你——基本上就是用户应该给模型一个指令,然后这个指令会每天或在那个特定时间触发。比如你说”每天我想了解最新的 AI 新闻”,模型应该把它改写成”搜索最新的 AI 新闻,然后这个任务会在用户要求的时间触发。”
然后你的设计就是工具规格。其实我也不确定。有时候是通过对话——有人让我加入团队,说”天哪我们需要研究员”,或者”我们需要支持,我们需要训练模型”之类的。Canvas 的话基本上就是我自己去提案的……它在休息期间就很快组建好了,所以取决于项目。团队组建通常是一个产品经理、一个模型设计师、一个产品设计师、几个研究员和一批应用工程师。取决于项目的复杂度。然后任务功能大概花了两个月左右从零到一。Canvas 大概四五个月从零到一。然后你教产品经理怎么构建 evals,以及如何不仅发布更好的功能,还要怎么考虑更长远。你希望任务功能未来有什么酷功能?我觉得任务功能如果能更个性化就好了。能在手机上通过语音创建任务也不错。你的研究路线图就是这样来的——思考功能未来会如何演进。
数据策略:合成数据优先
然后就到了数据集的阶段。通过 evals 你要确保效果良好。然后你需要在用什么方法之间做权衡。我之所以非常倾向于纯粹依赖合成数据而不是从人类那里收集数据,是因为它的可扩展性强得多,便宜,成本不到一半。你直接从模型采样,教模型核心行为,这些行为就能泛化到各种多样化的覆盖范围。
当你发布测试版功能时,能从用户那里学到很多东西,你可以……所有的合成数据集都可以根据用户的实际产品行为来调整分布。这就是我们改进的方式。Canvas 从测试版到正式版发布时也是这样的过程。
研究员与模型设计师的区别
Lenny Rachitsky: 有件事我想帮助大家理解——我自己也不百分之百清楚——就是最简单地理解研究员的工作和模型设计师及其他相关人员的区别是什么?在 OpenAI,研究员到底做什么,最简单的方式怎么理解?
Karina Nguyen: 我刚才描述的项目大多是产品导向的。研究主要是产品研究。我团队的另一个部分其实是更长远的探索性项目,更多的是开发新方法,在各种条件下理解这些方法。基本上就是开发方法,你需要遵循非常类似的构建 evals 的流程,但那些 evals 要复杂得多。你想要覆盖分布外的情况,或者如果你想衡量泛化能力,就需要捕捉到这些。
但这基本上更偏科学性……以合成数据为例,合成数据最难的一点是怎么让它更多样化?合成数据的多样性是当前最重要的问题之一。所以探索注入多样性的通用方法,使其适用于所有场景,就是研究方向之一。还有一类是开发新能力。我觉得总是在于你开发了一个新方法,看到了它有效的生命迹象,然后你要么想怎么让它更通用,要么想怎么让它更实用。长远项目就是这样逐步变成中期、短期项目的。
Lenny Rachitsky: 有道理。本质上是开发让模型变得更聪明的方法——o4、o5、o6。新的方法……o1 是一个重大突破,对吧?
Karina Nguyen: 对。
Lenny Rachitsky: 它的运作方式不是直接”这是你的答案”,而是真正去思考,花时间在给出答案之前把过程想清楚。好的。
Karina Nguyen: 是的。
AI 时代的未来展望
Lenny Rachitsky: 非常有帮助。说到这个,说到思考未来和趋势走向,我想花点时间聊聊这个洞察——你基本上是在构建 AI 的最前沿,站在 AI 发展和现状的最尖端。所以我非常想听听你的看法:根据你所看到的,你认为世界将如何变化,人们的工作方式将如何改变。我知道这个问题很宽泛,但比方说在未来三年内,你看到世界会发生什么变化?人们的工作方式会发生什么变化?
Karina Nguyen: 在这两家实验室工作的经历让人非常谦卑。我刚到 Anthropic 的时候,心想”哦不,我真的很喜欢前端工程”。后来我转到研究的原因是我当时意识到,“天哪,Claude 的前端能力越来越强了。Claude 的编程能力越来越强了。我觉得 Claude 可以开发新的应用之类的,所以它也能为我正在做的东西开发新功能。“这就是一种元层面的觉醒——“天哪,世界真的在变。“当时我们首次推出 100K 上下文长度的时候,很显然我在想产品形态,文件上传对人们来说非常自然、非常熟悉。但你可以想象我们完全可以在 Claude.ai 应用里做成无限聊天,就像利用 100K 上下文那样。
但因为文件上传……形式追随功能。文件上传这种产品形态可以让人们直接上传任何东西——书籍、报告、财务文件——然后向模型提出任何任务。我记得当时企业客户、金融客户对那个功能非常感兴趣。“哇。“实际上在那个场景下这是人们非常常见的任务。看到一些重复性工作正在被这些智能模型自动化,真的很疯狂。
我们正在进入这样一个时代——有时候我甚至不知道比如 o1 给我的答案是否正确,因为我不是那个领域的专家。就是”我甚至不知道怎么验证模型的输出。“因为所有我认识的专家他们能验证这个。所以,是的,基本上有一些趋势正在发生。第一个趋势是推理和智能的成本正在急剧下降。
我之前写过一篇博客文章讲这个。也许我应该用最新的基准测试来更新,因为当时所有人都在用同一个基准,很快就把它刷饱和了。所以我想”现在我们需要用另一个前沿 eval 做同样的图表。“但智能的成本在下降,因为它变得那么便宜。小模型正在变得比大模型更聪明,这是因为蒸馏(distillation)研究。
这在 Claude 3 Haiku 上就发生了。当时我在参与 Claude 3 Haiku 的训练工作,我发现它比 Claude 2 聪明得多,而 Claude 2 大得多,参数量也多得多。但小模型的力量就在于变得非常智能、快速且便宜。我们正在走向那个世界。这有多重影响,但好消息是人们将更多地接触到 AI,这真的很好。开发者和构建者将获得更好的 AI 访问权限,但这也意味着所有受限于智能的工作都将被解锁。
比如医疗健康。与其去看医生,我可以问 ChatGPT 或者给 ChatGPT 一系列症状让它判断”我是感冒、流感还是别的什么?“我几乎可以真正获得医生的诊疗。已经有一些相关的研究了。
Lenny Rachitsky: 《纽约时报》有一篇报道就是这个话题,他们对比了医生、使用 ChatGPT 的医生,以及单独的 ChatGPT,结果单独的 ChatGPT 是表现最好的。所有医生反而拉低了效果。
Karina Nguyen: 对,这太疯狂了。是的,真的太疯狂了,对吧?还有教育——我经常想,如果我小时候有 ChatGPT 这样的工具,我能学到多少东西。但现在人们几乎可以从这些模型身上学到任何东西。他们可以学新语言,学怎么构建新的应用,写任何他们想写的东西。能够发布 Canvas 并把它带给大家,让他们做到以前做不到的事情,这是一种令人谦卑的体验。这个体验周围有一种魔力。
教育与研究的未来
Karina Nguyen: 教育将产生深远的影响。我觉得科学研究的领域也是如此,我认为任何 AI 研究的梦想都是自动化 AI 研究本身。我得说这有点可怕,这也让我认为人员管理会长期存在。这是最难被替代的事情之一……模型在情商方面、创造力本身,都是最难攻克的方向。所以作家们,我觉得大家不用太担心。我认为 AI 会帮人们减轻大量重复性工作。
Lenny Rachitsky: 太棒了。好的,我想沿着这条线继续深入聊下去。有趣的是,你刚才描述的——你曾是 Anthropic 的工程师,然后你想,“好吧,Claude 在工程方面会非常强,这可能不是一份长期的职业,所以我要转向研究,AI 在很长一段时间内都会需要我来帮它变得更聪明。”
Karina Nguyen: 我想说的是,我们仍然有……我觉得 Canvas 团队仍然有一些非常出色的前端工程师,他们是真正在乎交互、设计和用户体验的人。我觉得模型还没达到那个水平。但我认为我们完全可以让模型达到前 1% 的前端水平。
未来最有价值的技能
Lenny Rachitsky: 接下来我想聊的就是——这纯粹是推测——你觉得未来产品团队最需要的技能是什么?那些正在听播客的人可能会想,“好吧,这挺可怕的。我现在应该培养什么能力,才能保持领先、不被淘汰?” 你认为哪些技能会变得越来越重要?
Karina Nguyen: 我觉得是创造性思维。你想生成大量想法并从中筛选,而不仅仅是打造最好的产品体验。还有倾听。你想构建的东西是那些最通用的模型不会替代你的。很多时候,你构建一个产品,把它做得对特定用户群体非常好,实际上你的护城河就在于用户反馈。护城河更多在于你是否倾听他们,是否能快速迭代。护城河就在这里。我觉得我们还没有……还有那么多想法,我认为想法是取之不尽的,你完全不用担心。事实上我觉得 AI 领域的人反而应该……我希望他们能更有创造力一些,把跨领域的点连接起来,开发出真正酷的、全新一代的、与 AI 交互的新范式。
我认为我们完全没有解决这个问题。几年前我跟一些人说过,“你要为未来而构建。” 也就是说,模型现在好不好其实不那么重要,你可以设计出这样的产品理念:等到模型真正变得很好的时候,它就能运行得非常好。我觉得这自然就发生了。比如在 Anthropic 的 Claude artifacts……我觉得 Canvas 早期,回到 2022 年 ChatGPT 之前,写作方面的想法就是我们的知识积累。但我觉得 Claude 1.3 模型本身还没有达到能做出真正高质量编辑的程度。比如写代码方面。
我看到像 Kaeser 这样的初创公司做得非常好。那是因为他们迭代速度极快,他们发明了训练模型的新方法,行动非常迅速,倾听用户的需求,分发能力也很强。确实很酷。
Lenny Rachitsky: 这真的很有启发。所以我听到的是,软技能会变得越来越重要、越来越强大。你刚才谈到了管理、带领团队、创造力、提出创新见解、倾听。我之前写过一篇文章,我会把链接放上来,是关于分析 AI 将如何影响产品管理的。我们的观点其实非常一致,我的感觉也是软技能会越来越重要。而被替代的会是硬技能,这很有意思,因为通常人们更看重硬技能,比如编程、设计、写作。有意思的是,AI 恰恰擅长这些,因为它的本质就是吸收大量数据,综合整理,然后写出来、创造出东西。而那些模糊的、关于如何影响和说服他人、达成共识、倾听、创造力之类的,反而成为了难以替代的部分。我说这些,你觉得呢?
模型为何难以掌握创造力
Karina Nguyen: 我觉得教会模型审美、做出好的视觉设计,或者以极具创造力的方式写作,其实是非常非常难的。我仍然觉得 ChatGPT 的写作能力不太好,这是因为受限于这种创造性推理。我认为角色化表达是最重要的能力之一……对于管理者来说,我觉得……
其实,AI 研究的进步受制于管理,即研究管理。因为你的计算资源是有限的,你需要把计算资源分配到那些你最有信心的研究路径上。这就像你需要对研究路径有极高的判断力才能投入计算资源,这更多是一种投资回报率的考量。就好比,“我要通盘考虑所有项目,哪些项目优先级更高?” 优先级排序,以及在更细的层面上,“哪些实验现在非跑不可,哪些不着急?“然后果断砍掉不必要的。所以我觉得关键在于优先级判断、沟通、管理。还有人际技能,比如共情、理解他人、协作。
我认为如果没有人,Canvas 不可能是一个如此出色的发布,我觉得这是一群非常棒的人。我有机会和像 Lee Byron 这样的人共事,他是 GraphQL 的联合创造者,还有一些最顶尖的 Apple 设计师。看到这种人与人之间的协作是如何形成的,真的很酷。我觉得这仍然是属于人的东西。
Lenny Rachitsky: 让我顺着再追问一下。我猜有些听众在想,“好吧,但一旦我们有了 AGI 或 SGI,这一切不就都能做了吗?” 确实存在这样一种可能性——“为什么这些不能全被替代?“我觉得很容易就这么想当然。关于创造力和倾听,除了”训练起来很困难”之外,你为什么认为 AI 不擅长这些?这其中有什么特别的原因吗,为什么这对 AI 和 LLM 来说尤其困难?
Karina Nguyen: 我觉得目前来说困难有很多原因。这仍然是一个活跃的研究领域,也是我们团队正在做的事情。比如,“我们如何教会模型在写作上更有创造力?“所以我在想,这种新的思维范式——让模型进行更深入的思考——应该能够带来更好的写作本身。但涉及到创意生成,或者判断什么是好的视觉设计时,我觉得模型还没有从人那里学到足够多的样本来做出很好的判断。我确实认为这是因为在审美和创造力方面真正顶尖的人其实并不多……这些对模型来说不太容易学习到。所以我绝对认为这就是它做得不好的原因。
Lenny Rachitsky: 有道理。基本上就是像你这样的人还不够多——去教它做这些事情的研究者,以及那些拥有极高品味和创造力、能够教授这些东西的人。不过你也可以说,这一切终将到来。
Karina Nguyen: 对。
AI 与战略能力
Lenny Rachitsky: 不过我们不需要继续这个话题了。让我问你一个具体的问题。在我写的这篇文章里,我提出了一个很多人不同意的观点:战略(strategy)是 AI 工具会越来越擅长并接管的领域。人们普遍觉得战略是人类会持续比 AI 强得多的领域,你不能把制定战略——也就是告诉你要做什么才能赢——交给 AI。我的论点是,“战略不就是把你手头所有的输入、所有的数据汇总,理解你周围的世界,然后制定一个赢的计划吗?” 感觉 AI 和大语言模型在这件事上会极其擅长。你怎么看?
Karina Nguyen: 我也这么想。我觉得,同样地,你教模型各种工具、能力和推理,当落实到具体操作时——比如现在对于 Canvas 来说,如果模型能自动聚合所有用户反馈,总结出用户体验中最令人痛苦的五个流程,那就太好了。而且模型本身也非常有能力思考它是如何被构建的,找出如何为自己创建数据集来训练。我觉得我们离那种自我改进不远了,模型通过……来实现自我改进。
这本身以及开发过程,本质上就是自我改进。有点像它自己的有机体之类的。同样,说到战略,更像是数据分析和提出……我觉得模型真正擅长的是把点连起来。比如你从这个来源有用户反馈,同时你内部仪表盘有指标数据,还有其他反馈或输入,然后它可以为你制定计划,甚至给出建议。而且我认为这也是 ChatGPT 最常见的用例之一,就是帮你梳理出这类东西。
Lenny Rachitsky: 说得通,本质上一个人一次只能理解有限的信息、查看有限的数据来综合出结论。而正如你所说,现在上下文窗口已经非常大了。把所有信息给它,然后问”我应该做的最重要的事情是什么?”
Karina Nguyen: 对,科学研究也是一样。理想情况下,模型应该能提出新的想法,或者根据之前实验的经验结果来迭代实验设计——比如你怎么想出新的想法或方法?
值得专注培养的技能
Lenny Rachitsky: 是啊,天哪。好的,总结一下这段讨论,你建议人们专注培养和倚重的技能就是软技能,比如创造力、影响力管理、协作、发现模式。大致就是你的想法方向吗?
Karina Nguyen: 对,我一直在思考如何让组织更高效,我觉得这主要就是管理。就是你怎么组织研究团队,或者一般来说如何组合团队,使它们能达到最大的成功,或者达到可能的最大绩效。我们真的可以创造下一代计算机,关键在于信念以及你管理这个过程的方式。就是扩展组织或扩展产品研究。
Lenny Rachitsky: 对,我觉得你们基本上就是在造这个东西,而不够高效地做这件事正在限制人类物种的潜力。
Karina Nguyen: 没错。
Lenny Rachitsky: OpenAI 和 Anthropic 这些研究团队内部的管理不善——以及其他一些模型公司。
Karina Nguyen: 对,仔细想想挺疯狂的。
Anthropic 与 OpenAI 的比较
Lenny Rachitsky: 我的天。好的,说到 Anthropic 和 OpenAI,你两家都工作过。很少有人在两家公司都工作过、亲眼见过它们的运作方式。我很好奇你注意到了两家公司之间有哪些差异——运作方式、思维方式、做事方法。你能分享些什么?
Karina Nguyen: 相似之处比不同之处多。当然总是有一些差异,具体到细节上的不同。我会说是文化。我很喜欢 Anthropic,那里有很多朋友。我也很喜欢 OpenAI,那里也有很多朋友。所以这不是什么敌对关系。我觉得在 AI 领域,大家总说”他们是竞争对手、是敌人”,但实际上就是一个大的社区,大家在做同样的事情。我想说从 Anthropic 学到的是对模型行为、模型工艺(model craft)、模型训练的真正用心和匠心。
我一直在想,“是什么让 Claude 成为 Claude,什么让 ChatGPT 成为 ChatGPT?“我对最终产出模型之前的运营流程还有些感觉。Claude 之所以有那么多个性,更像一个图书管理员——我不知道,我脑子里有时候会把 Claude 想象成一个图书管理员,很书呆子气的那种——是因为我觉得这反映了创造这个模型的人。很多关于角色和个性的细节,比如模型是否应该追问这个问题。
在这些场景下模型正确的伦理行为应该是什么?大量的匠心和精心策划的数据集。这是我在 Anthropic 学到的那部分艺术。我想说 Anthropic 规模小得多。我刚加入时大概七十人?我离开时已经有很多人了。显然文化也变了很多。我很享受早期创业阶段的生活,大家彼此像家人一样了解。但后来文化发生了转变。
我想说从 Anthropic 学到的另一点是,他们在专注和优先级排序方面做得更好——非常硬核的优先级取舍。而且他们必须这样做。但我认为 OpenAI 在产品和研究方面更具创新精神,也更敢于冒险。在 OpenAI,你的全职工作真的可以就是教模型如何成为创意写手。这种研究自由度伴随着规模而来,也许可以说是一种奢侈。我不知道。我想说在 OpenAI 我有更多的产品创作自由,几乎可以做任何事情,把 ChatGPT 演变成我们想要的愿景。可能更偏向自下而上的方式。
Lenny Rachitsky: 对,我也是这么想的。感觉 OpenAI 更自下而上、更分散,大家冒出想法、去尝试。这导致了更多产品发布,我猜想更多的东西被尝试,而不是更偏向”我们要确保做的每件事都很出色、很精致、对每个投入都深思熟虑”。
Karina Nguyen: 对。
Lenny Rachitsky: 这很有意思,我从来没听人这样描述过。Karina,我们聊了太多了。这会帮助很多人从很多角度思考未来的方向。在我们进入非常精彩的快问快答环节之前,我好奇你还有没有什么觉得可能有帮助想分享的?
早期 Anthropic 的经历
Karina Nguyen: 我的一个遗憾,我想是我在 Anthropic 早期的时候……那时候有某种时间的奢侈,因为那是在 ChatGPT 之前,你可以带着一堆想法进来,几乎每天都能做原型。我觉得我们做了很多很酷的东西,比如 Claude,还有 Slack 里的 Claude 实际上是最早的工具使用类产品之一。就是 Claude 可以在你的工作场所里操作了。挺酷的,因为你可以把 Claude 加进来总结对话串。比如你和某人有一长段对话,然后你想要一个摘要,你可以让 Claude “帮我总结一下。”
另外,在模型本身上进行迭代也很有趣。就像你在 Slack 里不停地和模型对话。它创造了一种社交元素,有点像……和 Discord 里一样,人们学到了很多关于提示词和如何与 Claude 协作的知识。实际上,早期原型中的一个功能是每周一 Claude 会自动总结整个频道。或者每周五总结一堆频道,给你提供关于组织的新闻动态之类的。
产品形态与异步范式
这是一种非常酷的产品形态(form factor)。我认为思考产品形态是 AI 中一个非常重要的问题,尤其是我们甚至还没有想清楚如何为 o 系列模型创造出色的产品体验。这就是从同步实时给出答案的范式,向智能体在后台工作的更异步范式的转变。但现在的问题是,智能体应该和你建立信任,对吧?而信任是随着时间建立的,就像人与人之间一样。你开始了这种协作,这就是为什么你和模型之间的这种协作模式如此重要——因为你建立了信任,模型从你的偏好中学习,从而变得更加个性化,它会开始预测你想在电脑上执行的下一个操作。它更具预测性,更加……我们基本上从个人电脑走向了个人模型。
Lenny Rachitsky: 为什么现在没有这个功能了?这似乎是一个显而易见的功能,每个 LLM 都应该有它的 Slack 机器人版本。这是我能帮你安装的东西吗?还是说现在已经没有了?
Karina Nguyen: 我知道 Claude for Slack 在 2023 年左右被停用了。我想是在 ChatGPT 之后,重心主要转移到了客户用例和企业用例上。
Lenny Rachitsky: 嗯。太遗憾了。
Karina Nguyen: 我觉得 Claude for Slack 的产品形态在你想要谈论新功能时会有点受限。
Lenny Rachitsky: 太遗憾了。我想要那个功能。
Karina Nguyen: 我知道 ChatGPT 有 Slack 相关的工具。我不确定,也许某天它会回来的。
Lenny Rachitsky: 好吧,我愿意为此付费。关于那段早期时光还有没有其他回忆?因为能在 Anthropic 早期待过是一个非常特殊的经历。还有没有其他有趣的回忆或故事想分享的?
100K 上下文窗口发布
Karina Nguyen: 我想我们第一次感到……让我印象深刻的,是 100K 上下文窗口发布的时候,模型可以输入整本书然后给你一个摘要之类的。或者是处理金融报告——整理多个文件的金融报告,然后针对非常具体的问题给出答案。我觉得其中有某种东西让人觉得,“天哪,这真是一个很酷的新能力。“不是模型能力本身,而更多是来自产品形态本身的能力,而非纯粹的模型能力。
我记得我们在考虑的其他原型……有一个是 Claude 工作空间,和现在的思路类似——我们会有一个共享工作空间,这个共享工作空间就像一个文档,我们可以在文档上迭代。我觉得有时候这些想法……它们会被搁置两年,就像这个例子一样。
里程碑时刻
Lenny Rachitsky: 很有趣,有这些里程碑式的时刻打开了我们对正在发生的事情和未来方向的视野。ChatGPT 应该是第一个让人觉得”哇,这比我想象的好太多了”的。你提到的 100K 上下文窗口,你可以上传一本书然后提问、让它总结。我实际上一直在用这个功能。当我的访谈嘉宾写了书,我有时没时间读完整本书。所以我用它来帮我理解最有趣的部分是什么。然后我确实会深入阅读那本书,说明一下。然后,我不知道,也许语音是另一个里程碑,你可以和 ChatGPT 对话。还有其他让你觉得”哇,这比我想象的好太多了”的时刻吗?
Karina Nguyen: 有的,我觉得是计算机使用智能体(computer use agents),就是模型操作桌面电脑。你基本上可以想象一种全新的体验,模型可以学习你浏览网页的方式。从这些偏好出发,它可以像你一样去浏览。这有点像一个模拟的人格。这其实很像这样一个想法——“好吧,也许 Sam Altman 没有很多时间。也许我想和他的模拟体对话并提问……”或者,比如说,我很感激某些技术上的指导。是的,很酷。但他没有那么多时间,所以我就真的很想问他这些问题。在这样的模拟环境中得到回应,会非常酷。
Lenny Rachitsky: 这是安利 Lennybot 的好时机,我就有一个。它是用我所有的播客和通讯训练的。
Karina Nguyen: 哦,很酷。
Lenny Rachitsky: 它基于多个模型。我不确定具体用的是哪些,但就是这个思路。而且不仅是我,还包括所有上过播客的嘉宾和我写的通讯内容。你可以直接问它”我该如何增长我的产品?我该如何制定策略?“而且它实际上好得令人惊讶。
Karina Nguyen: 你觉得它反映了你本人吗?
Lenny Rachitsky: 是的。
Karina Nguyen: 或者说会不会……好吧。
Lenny Rachitsky: 最棒的是你可以和它对话。有一个 ElevenLabs 的语音版本,是用我在这个播客里的声音训练的,效果非常好,有人告诉我他们坐在那里和它聊了好几个小时。
Karina Nguyen: 哇。
Lenny Rachitsky: 有人跟它说,“像我在 Lenny 的播客上一样采访我,问我关于我职业生涯的问题。“然后他和 Lennybot 做了半小时的播客节目。
Karina Nguyen: 天哪,太有趣了。
Lenny Rachitsky: 太不可思议了。未来太疯狂了。
内容形态转化
Karina Nguyen: 是的。我觉得内容形态转化是……我可以想象将来某天当你在 Canvas 里生成一个科幻故事时,你可以把它转化成有声书,实现一种非常自然的内容从一种媒介到另一种媒介的转化。我最早的灵感之一是《西部世界》最后几集中的一集,我不想剧透,但就是 Dolores 在那个时候去上班,她来到一个新的工作空间,开始写一个故事。然后随着她写故事,一个 3D 虚拟现实就实时构建出来。所以我有点想创造那样的东西。挺酷的。
Lenny Rachitsky: 哇。说到媒介,我在想是不是该直接问这个,但很快问一下。Kevin Weil,我不确定他的姓怎么发音,OpenAI 的 CPO。
Karina Nguyen: Kevin Weil,嗯哼。
Lenny Rachitsky: 是 Weil 还是 Weil?
Karina Nguyen: 我觉得是 Weil。
Lenny Rachitsky: Weil。好的好的。就这么念吧。
Karina Nguyen: 希望吧,嗯。
对话作为一种持久的交互范式
Lenny Rachitsky: 去年在 Lenny and Friends Summit 上他参加了一个 panel,提出了一个非常有趣的观点——对话(chat)作为这些工具的交互界面非常有趣,因为它们正变得越来越聪明、越来越聪明、越来越聪明。而对话作为一种与它们交互的范式始终适用,就像和人类交流一样。你可以和爱因斯坦对话,也可以和一个不太聪明的人对话,本质上都还是对话。所以这是一种非常灵活的方式来与日益增强的智能进行交互。当然到了某个阶段,对话可能就不够好了,你之前也谈到了你们正在添加的各种额外交互方式。但很有趣的是,对话被证明是叠加在所有这些技术之上的一层非常强大的交互层。
Karina Nguyen: 是的,这确实很酷。我觉得对话还有一种社交属性,非常人性化。就像,有时候你就是想进入一个群聊。和 AI 进行对话本身就有一种群聊的感觉,一种消息交流的感觉。实际上,关于如何构建这类功能,我把 tasks 视为一个通用型功能,随着模型自身发展出新的能力,它能很好地扩展。模型将能够做更好的搜索,创造出更……想出更有创意的写作,或者渲染 React 应用、HTML 应用之类的东西。每天可以给你出新的谜题,每天从前一天的故事继续往下讲。它的扩展性非常好。
Operator:计算机使用智能体
Lenny Rachitsky: 你在进入我们刚才额外聊的这个话题时提到了智能体使用计算机。我知道这其实是你们今天(我们录制当天)要发布的功能,等到这期节目上线的时候它已经发布了,叫 Operator。你能聊聊这个大家很快就能用到的超酷功能吗?
Karina Nguyen: 好的,这个功能我本人并没有参与开发,但我对这个发布非常非常兴奋。它本质上是一个智能体,可以在自己的虚拟计算机里、自己的虚拟环境中完成任务。你可以给它下达任何任务,比如”帮我在亚马逊上买本书”。理想情况下,模型要么会追问你想要哪本书,要么对你足够了解以至于开始推荐——“哦,这里有五本我可能会推荐你购买的书”。然后你说”好,帮我买”,模型就会进入自己的虚拟浏览器,完成任务,在亚马逊上把书买好。如果你给模型提供了凭证、信用卡——显然这涉及到大量的信任与安全问题——那它就会替你把事情办好。它就是一个虚拟助手。
Lenny Rachitsky: 有意思的是,这一切听起来就像是”理所当然应该发生的”。为什么这到现在才实现?同时也很令人震撼的是,我们已经在理所当然地认为它应该存在——就是某个 AI 在电脑上替你做事,你只需开口让它做。
Karina Nguyen: 对。
Lenny Rachitsky: 太荒谬了。
Karina Nguyen: 其实这真的很难。我觉得大家也还在攻克这个问题。不过你用过 Tuple 吗,就是那个结对编程的产品?
Lenny Rachitsky: 没有。
Karina Nguyen: 在 Anthropic 我们很喜欢结对编程,所以如果你用过——
Lenny Rachitsky: 哦对,Shopify 在用这个。我记得在一期播客里提到过。
Karina Nguyen: 哦不错。是的,它是一个非常酷的产品,你可以随时呼叫任何人,然后共享屏幕,对方可以操控你的屏幕甚至直接操作你的电脑。它的实时性非常……画质非常高。我就想要类似的东西——我想和我的模型结对编程,模型甚至应该能跟我说话,在代码中圈出非常具体的部分,告诉我……显然还能教我,我们可以有不同的模式。就像,这就是一个现成的产品。我不知道,应该有人去做这个。
Lenny Rachitsky: 听起来一个创业公司刚刚诞生了——
Karina Nguyen: 对。
Lenny Rachitsky: 就在某个听众的脑海中。你提到让智能体控制计算机并帮你做事非常难。能简单说说到底难在哪里吗?
计算机使用智能体的技术挑战
Karina Nguyen: 很大程度上是因为目前模型操作的是像素,而不是语言之类的东西。像素实际上非常非常难。模型在感知方面,或者说视觉感知方面还有很多挑战。我觉得目前仍然有大量多模态研究在进行中,但我认为语言比多模态容易扩展得多,原因就在于此。
另一方面,我团队正在研究的问题是如何非常准确地理解用户的意图。比如,模型是否掌握了足够的信息来追问,还是应该直接完成任务?你不希望一个智能体跑出去十分钟,然后带回一个你根本不想要的答案。那实际上会严重损害用户体验。这就涉及到教模型学会人际技能——比如”人们喜欢什么?什么?某种程度上就是构建用户的心智模型并关心用户,才能提出恰当的问题。实际上,这部分对模型来说很难做到。
Lenny Rachitsky: 这和我们之前聊到的相关,就是这种软技能、人际技能的部分,还不是这些模型擅长的领域。
Karina Nguyen: 对。
闪电问答
Lenny Rachitsky: 好的。闪电问答环节我就跳过了,只想问其中一个问题,来点有趣的。
Karina Nguyen: 好啊。
Lenny Rachitsky: 好了,所以当 AI 取代了你的工作,Karina,我好奇你会……而且它还给你发津贴,每月给你发一笔生活费。这是你这个月的薪水。你会想做什么?你想把时间花在什么上面?在这个未来世界里你会做什么?
Karina Nguyen: 我想过很多次这个问题。我觉得我有很多职业选择。我想当一名作家,我觉得。我觉得那会非常酷。写短篇小说、科幻故事、长篇小说。我也很喜欢艺术史,你知道博物馆里那些文物保护师吗?就是整天在那里修复保存油画的人?
Lenny Rachitsky: 嗯。
Karina Nguyen: 我觉得做那个会非常酷。对。
Lenny Rachitsky: 听起来很美好。
Karina Nguyen: 我不知道。
Lenny Rachitsky: 我听到的是,你需要给这些模型降点难度,别让它们在写作方面变得太厉害,这样你才能继续……不过到那个时候你也不需要靠这个谋生了,你纯粹是为了乐趣而做,所以即使它们在写作或艺术品保护方面极其厉害也无所谓了。天哪,我们这期对话真是太精彩了。我们生活在一个多么疯狂的时代。Karina,非常感谢你来。最后两个问题。大家如果想联系你、跟进任何话题,在网上哪里可以找到你?听众怎样才能帮到你?
Karina Nguyen: 可以在 Twitter 上找到我,账号是 KarinaNguyen。也可以通过我的网站给我发邮件。我的团队正在招聘,我在找研究工程师、研究科学家,以及机器学习工程师,也包括想学模型训练的产品工程师。我正在为自己的团队招人。我的团队叫 Frontier Product Research,我们训练模型、开发新方法,但目标是面向产品的成果。
Lenny Rachitsky: 这工作也太令人向往了。我的天。大家申请这些非常抢手的职位,最好的方式是什么?
Karina Nguyen: 可以在 Twitter 上私信我。
Lenny Rachitsky: 好的。
Karina Nguyen: 不过我还没来得及写职位描述。
Lenny Rachitsky: 没关系,这就是职位描述了。
Karina Nguyen: 或者你也可以申请加入后训练(post-training)团队。嗯。
Lenny Rachitsky: 好的。你会收到铺天盖地的私信的,希望你做好准备。Karina,非常感谢你来参加节目。这期对话太精彩了。
Karina Nguyen: 非常感谢你,Lenny。
Lenny Rachitsky: 大家再见。
Karina Nguyen: 很开心。
Lenny Rachitsky: 非常感谢大家的收听。如果你觉得这期节目有价值,可以在 Apple Podcasts、Spotify 或你喜欢的播客应用上订阅本节目。也请考虑给我们评分或留下评论,这真的能帮助更多听众发现这个播客。你可以在 LennysPodcasts.com 找到所有往期节目或了解更多关于本节目的信息。下期再见。
术语表
| 原文 | 中文 |
|---|---|
| AGI | AGI(通用人工智能) |
| artifacts | artifacts |
| bottoms-up | 自下而上 |
| Canvas | Canvas |
| collaborative agent | 协作型智能体 |
| computer use agents | 计算机使用智能体(computer use agents) |
| context window | 上下文窗口 |
| curated dataset | 精心策划的数据集 |
| data wall | 数据墙 |
| distillation | 蒸馏(distillation) |
| dogfooding | 内部测试(dogfooding) |
| evals | evals(评估基准) |
| form factor | 产品形态(form factor) |
| frontier eval | 前沿 eval |
| Frontier Product Research | Frontier Product Research(前沿产品研究团队) |
| function calls | 函数调用(function calls) |
| GPQA | GPQA(Google-proof Question Answering) |
| GraphQL | GraphQL |
| ground truth | 真值 |
| hill climb | 攀升(hill climb) |
| IC (Individual Contributor) | IC(独立贡献者) |
| JSON schema | JSON schema |
| Kaeser | Kaeser |
| Kevin Weil | Kevin Weil |
| Lee Byron | Lee Byron |
| model craft | 模型工艺(model craft) |
| next token prediction | 下一个 token 预测 |
| Operator | Operator(计算机使用智能体产品) |
| post-training | 后训练 |
| PRD (Product Requirements Document) | PRD(产品需求文档) |
| reinforcement learning | 强化学习 |
| SGI | SGI(超级通用人工智能) |
| soft skills | 软技能 |
| spec | 规格文档(spec) |
| synthetic data | 合成数据 |
| tool stack | 工具栈(tool stack) |
| Tuple | Tuple(结对编程产品) |
| win rate | 胜率 |
此文档由 AI 分片翻译(translate_long_document)