AI Engineering 101 with Chip Huyen (Nvidia, Stanford, Netflix)
Transcript
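Chip Huyen: One question I get asked over and over is, "How do we keep up with the latest AI news?" And I ask, "Why? Why do you need to keep up with the latest AI news?" If you talk to users who know what they want and what they don't want, and really dig into their feedback, you can improve your application way, way, way more.
Lenny Rachitsky: A lot of companies are building AI products, and a lot of them are not having a good time doing it.
Chip Huyen: We are in an idea crisis. We now have all these really cool tools that can do anything from scratch and produce brand-new designs. They can help you write code and build websites. So in theory we should be seeing a lot more output, yet people are somehow stuck. They don't know what to build.
Lenny Rachitsky: With all the AI hype, the data actually shows that most companies try it, it doesn't do much, and they stop. Where do you think the gap is?
Chip Huyen: Productivity is really hard to measure. So I ask people to go ask their managers, "Would you rather buy everyone on the team a very expensive coding agent subscription, or get one extra headcount?" Almost every manager says headcount. But if you ask someone at the VP level, or someone who manages many teams, they say, "I want the AI assistant." As a manager you are still growing your team, so one more headcount is a big deal for you. Executives care more about business metrics, so they actually think about what drives the productivity metrics they care about.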
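Lenny Rachitsky: My guest today is Chip Huyen. Unlike many people who share insights on building AI products and where the industry is heading, Chip has actually built multiple successful AI products, platforms, and tools. She was a core developer on NVIDIA's NeMo platform and an AI researcher at Netflix. She taught machine learning at Stanford, is a two-time founder, and is the author of two of the most popular books in AI, including her latest, AI Engineering, which has been the most-read book on the O'Reilly platform since its launch. She also works with many enterprises on their AI strategies, so she sees firsthand what is actually happening inside many different companies.
In our conversation, Chip explains a lot of the basics: what do pre-training and post-training actually look like? What is RAG? What is reinforcement learning? What is RLHF? We also go deep on everything she has learned about building great AI products, including what people think it takes versus what it actually takes. We cover the pitfalls companies fall into most often, where she sees the biggest productivity gains, and much more.
This episode is fairly technical, more so than most of my conversations, and is meant for anyone who wants a more in-depth discussion of AI. If you enjoy this podcast, don't forget to subscribe and follow it in your favorite podcast app or on YouTube.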
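The viral table: what people think improves AI apps vs. what actually does
Lenny Rachitsky: Chip, thank you so much for being here, and welcome to the podcast.
Chip Huyen: Hi, Lenny. I've followed the podcast for a long time, so I'm really excited to be here. Thank you for having me.
Lenny Rachitsky: I want to start with a table you shared on LinkedIn a while ago. It spread very widely, and I think it did because it hit a nerve with a lot of people. Let me read it, and viewers on YouTube can see the image. It's a very simple table comparing "what people think will improve AI apps" with "what actually improves AI apps." What people think will improve AI apps: staying up to date with the latest AI news, adopting the newest agentic framework, agonizing over which vector database to use, constantly evaluating which model is smarter, and fine-tuning a model. What actually improves AI apps: talking to users, building more reliable platforms, preparing better data, optimizing end-to-end workflows, and writing better prompts. Why do you think this table hit such a nerve? If you had to boil it down, what are people missing about building successful AI apps?
Chip Huyen: One question I get asked over and over is, "How do we keep up with the latest AI news?" And I ask, "Why? Why do you need to keep up with the latest AI news?" I know it sounds counterintuitive, but there is just so much news out there. People also ask me things like, "How do I choose between two different technologies?" Recently, for example, MCP versus the agent-to-agent protocol. Then it's, "Which one is better, this or that?" I think the better question to ask is, "First, how much improvement do you get from the optimal solution compared with a non-optimal one?" And sometimes they say, "Actually, not much."
Then I say, "OK, if it doesn't make much difference, why spend so much time debating something that barely affects your performance?" The other question I ask them is, "If you adopt a new technology, how hard would it be to switch to another one later?" Sometimes they say, "Oh, I think switching away would be a lot of work." And I say, "Hmm, so here is a new technology that hasn't been tested by many people, and if you adopt it you'll be stuck with it. Do you really want to adopt it?" Maybe you should think twice before overcommitting to new technologies that haven't been well tested.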
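From chasing the latest technology back to fundamentals
Lenny Rachitsky: I love that broader advice: to build successful AI apps, talk to users, build better data, write better prompts, and optimize the user experience, rather than chasing the newest, shiniest thing, like which model is best right now or what just changed in AI. Let me follow the fine-tuning thread, which is basically post-training. People hear all kinds of terms in AI, and I think this is a great chance to help everyone understand what we are actually talking about, because you really do this work, build these things, and work with the companies doing it. I want to weave a few terms into the conversation, but let's start here. What is the simplest way to understand the difference between pre-training and post-training? And where does fine-tuning fit in? What is fine-tuning, really?
Chip Huyen: A disclaimer first: I don't have full visibility into the secret work inside the big frontier labs. But as far as I know, one piece is supervised fine-tuning. You have demonstration data: a group of experts says, "OK, here is the prompt, and here is what the answer should look like." You train the model to imitate the human experts. This is also what a lot of open-source models do, and they do it through distillation. Instead of having human experts write very good answers to prompts, you have a popular, well-known, strong model generate the responses, and then use that data to train a smaller model to imitate them.
Sometimes you see people just... By the way, I really admire the open-source community, but there is a big gap between "being able to train a model that imitates an existing strong model" and "being able to train a truly strong model from scratch." So we have supervised fine-tuning, and then there is another very important thing. I'm not sure whether your previous guests have covered it, but reinforcement learning is everywhere now.
Lenny Rachitsky: Let's pause there, because I do want to spend time on that. It has come up more and more in my recent conversations and it's really interesting. But first let me summarize what you just shared, because I think it's really important. A model is essentially a piece of algorithmic code. The frontier models are fed the content of the entire internet, and the model essentially tests itself by predicting the next word across all that data. Strictly speaking it's a token, but the simpler way to think about it is the next word in the text. When it predicts wrong, it adjusts these things called weights. Is that right, even if it's only a surface-level description?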
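Language modeling and statistical information
Chip Huyen: I think of language modeling as a way of encoding statistical information about language. Say we all speak English. We have a sense of what is statistically more likely. If I say "My favorite color is," you expect the next word to be a color. The word "blue" is far more likely than most other words, because statistically "blue" is more likely to follow "my favorite color is." So it's a way of encoding statistical information.
In language modeling, when you train on a large amount of data, you see many languages and many domains. So the model learns what the baseline looks like. Then the user gives a prompt and it can output the most likely next token. By the way, this isn't a new idea. It's very old, going back to a 1951 paper on the entropy of English by Claude Shannon, which is a great paper. It reminds me of a story I love, from... have you read Sherlock Holmes?
Lenny Rachitsky: Yes, I've read a few Sherlock Holmes books.
Chip Huyen: There's a story where Sherlock Holmes uses this kind of statistical information to help solve a case. Someone leaves a message written in little stick-figure symbols. Holmes reasons that the most common letter in English is E, so the most common stick figure must be E. Then he works the rest out step by step. That is really a simple form of language modeling, except he does it at the character level. Tokens sit in between: a token isn't quite a word, but it's bigger than a single character. We use tokens because they help keep the vocabulary manageable. Characters give you the smallest vocabulary, since the alphabet has only 26 letters, but there are an enormous number of words. Tokens find a sweet spot between the two.
Take a new word, "podcasting." Suppose it's new, but you can split it into "podcast" and "ing." People can understand it: we know what "podcast" means, and "ing" is a verb suffix. So we can understand "podcasting" even if we've never seen it, and that's what tokens do for us. Back to the main point: pre-training is essentially encoding the statistical information of language so you can predict what is most likely to come next. I think "most likely" is the simplest way to think about it, because the model is really building a probability distribution: there's a 90% chance the next token is a color and a 10% chance it's something else. So it's a distribution the language model can pick from, depending on your sampling strategy. Do you want it to always pick the most likely token, or do you want it to pick something more creative? I think sampling strategy is extremely important. It can improve performance a lot, and it's very underrated.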
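To make the sampling point concrete, here is a minimal sketch of two common strategies over a next-token distribution: greedy decoding and temperature sampling. The token list and the scores are made-up illustrative values, not output from any real model, and the function names are hypothetical.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(tokens, logits):
    """Always pick the single most likely token."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return tokens[best_index]

def sample(tokens, logits, temperature=1.0, rng=random):
    """Draw one token at random, weighted by its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical scores for the token that follows "My favorite color is"
tokens = ["blue", "green", "red", "table"]
logits = [3.0, 2.0, 1.5, -2.0]

print(greedy(tokens, logits))                                  # the most likely token
print(sample(tokens, logits, temperature=1.5, rng=random.Random(0)))  # a more varied pick
```

Greedy decoding is deterministic, while a higher temperature flattens the distribution so that less likely tokens are chosen more often. That is the "most likely versus more creative" trade-off described above.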
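Lenny Rachitsky: OK. So in essence a model is a piece of code plus a whole set of weights, essentially a statistical model that has learned to predict what comes next after particular words and phrases?
Chip Huyen: Right.
Weights and fine-tuning
Lenny Rachitsky: And then post-training, and fine-tuning specifically, does the same kind of thing. After pre-training you get GPT-5. Fine-tuning is someone taking GPT-5 and doing the same thing, nudging those weights for a specific use case, using data they think is necessary for that use case. Is that right?
Chip Huyen: Right. I think of weights as part of a function. Say you have a function where Lenny's height is 1 times x plus a constant, or 2 times x plus a constant. Numbers like that constant are the weights. You keep adjusting them until the function fits the right data, which here is my height and your height. So you can think of weights as the parameters of a function. You train, adjusting the weights so they fit the data, which is the training data.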
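As a toy illustration of "weights are the parameters of a function that you adjust until it fits the data", the sketch below fits a line y = w*x + b to a few points with plain gradient descent. The data, learning rate, and step count are illustrative assumptions. Real models have billions of weights and use far more elaborate optimizers.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # the two "weights" of this tiny model
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Nudge each weight a little in the direction that reduces the error
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy training data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

In this analogy, fine-tuning corresponds to starting from already-fitted values of `w` and `b` and running a few more steps on new data, rather than starting again from zero.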
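Lenny Rachitsky: OK. We just covered pre-training, post-training, and fine-tuning. Is there anything else important that people need to understand about what these are?
Chip Huyen: The vast majority of the time, we don't touch the pre-trained model. As users, we never really interact with it.
Lenny Rachitsky: Right, that part has been done for us.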
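Differences between pre-training and post-training
Chip Huyen: Right. I find it fun to watch friends who train models. They play with the pre-trained model, the results are terrible, and they say things like "Oh my god." It really is crazy. So it's fascinating to see how much post-training can change a model's behavior. I think a lot of people and a lot of frontier labs are putting their energy into post-training now. Pre-training has always been used to improve a model's general capability, and it needs a lot of data and model scale to do that. At some point we basically ran out of internet data; text data is nearly exhausted. Many people are trying other data, such as audio and video, and everyone is thinking about where new data will come from. But the key point is that everyone's pre-training data is very similar, and post-training is where the differences are made today.
Lenny Rachitsky: That is a good segue. You mentioned supervised learning and unsupervised learning. By the way, I'm really glad we got here, this is super interesting. You talked about labeled data. Basically, supervised learning is AI learning from data that people have already labeled, telling it what is right and what is wrong. For example: this is spam, this is not spam; this is a good short story, this is not a good short story. We've had the CEOs of several companies in this business on the show, such as Mercor, Scale, Handshake, and Micro1. So what these companies do for the labs is essentially provide labeled data, high-quality training data, right?
Chip Huyen: To some extent, yes, but I think it's one output of a much larger equation with many more components. That's why I brought up reinforcement learning earlier. I'm not sure whether you discussed it with those CEOs. The core idea is this: you have a model, you give it a prompt, and it produces an output. You want to use reinforcement to encourage the model to produce better outputs. So the question becomes: how do we know whether a response is good or bad? Usually people rely on signals. One way to get a good-or-bad judgment is human feedback. You show two responses and ask which is better. We do it that way because it's hard for humans to give a concrete score, but comparing is much easier.
If you ask me to score a song, I'm not a musician, I don't know how to judge it, and I might just say six. Ask me again a month later and I'll have completely forgotten what I said. Maybe this time I say seven, or four. But if you ask me to compare two songs and tell you which one I would rather play at a birthday party, I can say, "This one." So comparison is much easier. After collecting human feedback, you use it to train a reward model, and the reward model does the judging: the model generated this response, score it, is it good? Then you try to steer the model toward better responses. Another way is to use AI instead of humans to judge whether a response is good. And what people care a lot about now is verifiable rewards, which makes sense. You give the model a math problem, it produces a solution, and the expected answer is, say, 42. If it doesn't give 42, it's wrong, and that is not a good response. So yes, a lot of the time people use human labor to produce expert-level questions with matching answers, constructed in a verifiable way so the model can be trained on them.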
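The two kinds of signal described above can be sketched in a few lines. The first function is a verifiable reward: an exact-match check against a known answer. The second part uses the Bradley-Terry formulation, which is a commonly used way to turn pairwise "A is better than B" comparisons into a training loss for a reward model. The transcript does not say which formulation any particular lab uses, so treat that part as an assumption; all function names are hypothetical.

```python
import math

def verifiable_reward(model_answer: str, expected: str) -> float:
    """Reward 1.0 if the final answer matches the known answer exactly, else 0.0."""
    return 1.0 if model_answer.strip() == expected.strip() else 0.0

def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry: probability that response A is preferred over response B,
    given the scalar scores a reward model assigns to each."""
    return 1.0 / (1.0 + math.exp(score_b - score_a))

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the human's choice; lower means the reward
    model agrees more strongly with the human comparison."""
    return -math.log(preference_probability(score_chosen, score_rejected))

print(verifiable_reward(" 42 ", "42"))   # matches the expected answer
print(verifiable_reward("41", "42"))     # does not match
print(preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0))
```

The comparison-based loss only needs to know which of two responses a person preferred, which mirrors the point that comparing two songs is easier than giving one of them an absolute score.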
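RLHF and signal collection
Lenny Rachitsky: OK, I'm glad you got there. That is essentially RLHF, reinforcement learning from human feedback, which is exactly what I wanted to discuss, right?
Chip Huyen: Right. I think it's essentially one way of learning. It's reinforcement learning training, and whether it learns from human feedback, AI feedback, or verifiable rewards, I think those are just different ways of collecting the signal.
Lenny Rachitsky: Great. We had a co-founder of Anthropic on the podcast, and he talked about their variant of RLHF, reinforcement learning driven by AI feedback. I like how you put it: essentially you want to help the model, you want to reinforce the right behavior and the right answers, and this is how you do it. Whether it's an engineer looking at the model's output and saying "No, this is how I would write that code," or training a separate model to tell the original model whether it got it right, it's the same idea. Is that roughly right?
Chip Huyen: Right.
Lenny Rachitsky: OK.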
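Chip Huyen: I think that's one way to look at it. This area is so exciting right now because there are so many domain-expert tasks we need models to do well. Say you're an accountant. You might want the model to handle accounting tasks, which requires a lot of accounting examples from accountants. So you need to hire a lot of accountants to provide data. Or physics problems, legal problems, engineering problems. Someone told me they want to use coding to solve scientific problems, not just to build products, which is a completely different thing. Then there is using very specific tools: I don't know what you use, maybe some app, or QuickBooks, Google Sheets, or Excel. Each has very specific tooling and expertise. So you need the model to learn it.
That means you need a lot of experts in the field to create training data. It's a huge undertaking, because everyone wants a lot of data and wants an unlimited budget. But I think there are some interesting economics here, and I don't know whether you've discussed them with guests. It's interesting to think about because the market is very lopsided. There are very few frontier labs, but they need huge amounts of data, while there are a great many startups and companies supplying that data. Look at the data-labeling startups: they may have very high ARR (annual recurring revenue), but if you ask them, "How many customers do you have?", the number may be very small. I'm not sure... I see you laughing.
Lenny Rachitsky: Yes, yes, we've talked about this.
Chip Huyen: Right, so I would be a little... uneasy. A company is growing like crazy but depends heavily on two or three customers. Meanwhile, if I'm a frontier lab, what is the economically rational move? I want lots of startups and lots of vendors so I can pick and choose and have them compete with each other to push prices down, without becoming heavily dependent on any one of them. So I find the whole economic picture very interesting, and I'm curious to see how it plays out.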
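The outlook for data-labeling companies
Lenny Rachitsky: What I'm hearing is that you're pessimistic about the future of these data-labeling companies because, as you said, they don't have much pricing power: too few customers and too many suppliers entering the space. So even though they are among the fastest-growing companies in the world, you think there are challenges ahead.
Chip Huyen: I'm not sure I'd call it pessimistic. It's more curiosity, because things tend to unfold in ways I don't expect. Maybe these companies, sitting on so much data, can extract insights from it that help them stay ahead. So I can't really say.
Lenny Rachitsky: That's a fair answer. While we're here, I want to talk about evals, a recurring theme on this podcast. Evals are another kind of data these companies share with AI labs, and the labs need them badly. Can you explain what an eval is, in the simplest terms, and how it helps models get smarter?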
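What an eval is
Chip Huyen: I think when people talk about evals they are really facing two very different problems. In one case, I'm building an application, say a chatbot, which is the first example that comes to mind, and I want to know whether the chatbot is good or bad. I need a way to evaluate it. The other case I think of as task-level eval design. Say I'm a model developer and I want my model to get better at writing code. Then the question is, "How do I actually measure whether the code is good?"
You need someone who understands coding to work out what counts as good code, and then to design the whole dataset and the evaluation criteria. I find this kind of eval design very interesting: how to set the standards and guidelines, how to run it, and how to train people to do it effectively. I think evals are really fun because they are so creative. I look at evals different people have built and think "wow." It's not boring at all.
Lenny Rachitsky: We did a whole episode on evals with Hamel and Shreya, and that's exactly what they said: building evals for companies is actually really fun. Let's go a bit deeper. There's a debate online, and I'm not sure how loud it really is, but it feels like many people have spent time on it: do AI products need evals at all? Some of the best companies say they don't really do evals and go on vibes: "Does this feel good to use? Can you feel it?" What's your take on the importance of evals, and the skill of doing them, for AI apps as opposed to model companies?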
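The return on investment of evals and pragmatic trade-offs
Chip Huyen: I think you don't need to be perfect to win. You just need to be good enough and consistent. That's not a philosophy I subscribe to, but I've worked with enough companies to see the pattern repeat. Why do some companies skip evals? Suppose you're an executive who wants to launch a new use case. You build it, it works well, and customers are reasonably happy. You don't have precise metrics, but traffic is growing, users seem happy, and purchases continue. Then an engineer comes and says, "We need evals for this." How much investment is that? The engineer says maybe two people and this much effort. Maybe it improves things. You ask, "How much improvement do you expect?" The engineer says, "Maybe from 80% to 82% or 85%." And you think, "Rather than putting two engineers on evals, I'd get much more out of having them build a new feature."
So I think that's one scenario: some people feel that if it's good enough, leave it alone. If you put a lot of effort into evals, the gains are incremental, whereas the same effort spent on a new use case might get you to good enough, judged by feel.
I think that may be the heart of the debate. A lot of the time people get something to a "good enough, it runs" state and stop. Of course there's a lot of risk in that. Without clear metrics to monitor how the application or model is doing, it might do something very stupid, possibly with serious consequences. So if your product runs at scale and failure would be catastrophic, you really do need strict control over what gets exposed to users, and you need to understand the failure modes and where things can go wrong. Also, if the feature is your competitive advantage and you want to be the best, you need a very clear picture of where you stand and how far you are from competitors.
But if it's a relatively minor feature, not core but helpful to users, maybe you don't need to be so obsessive and theoretical about it. Good enough is fine, and if something breaks you deal with it then. I know that sounds scary. But I think it comes down to return on investment. Personally I'm a big fan of evals and I love reading about them. But I also understand why some people choose not to focus on evals first and prioritize shipping new features.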
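Lenny Rachitsky: That's a very pragmatic answer. What I'm hearing is: evals are good and important, especially at scale, but pick your battles. You don't need evals for every small feature. Hamel and Shreya also shared that people really only need five to seven evals for the most important parts of the product. Is that what you see? Or do you see people building and needing more evals in production?
Chip Huyen: I don't think there is a fixed number. What is the purpose of an eval? It's to guide product development. I'm a big fan of evals because they help you find what is working well and where there's room to improve. Sometimes it's very obvious: you look at the eval results, see the model does badly for one particular user segment, dig into why, and find out it was just that our messaging wasn't good. So people should focus on the places that perform poorly but have a lot of room to improve. The number of evals really depends. We've seen products with hundreds of different metrics.
Lenny Rachitsky: Wow.
Chip Huyen: People make it that elaborate because the product is very general and has many facets. There's an eval for, say, verbosity, an eval for sensitive user data, an eval for length... OK, let's take a concrete example, Deep Research. You have an application where you use a model to do deep research for you. You give it a prompt like, "Do comprehensive research on Lenny's Podcast and write me a report on which topics he is interested in, which kinds of videos get the most views, and which topics he hasn't covered but should." Given that prompt, how do you evaluate the result? I don't think any single metric does it. Maybe you need a hundred. I think someone built a benchmark where they brought in a hundred experts, wrote a bunch of prompts, and went through the AI's answers one by one. But that's very expensive and very slow.
I have a somewhat different take. I was talking with a friend about this, and one idea is: how do you evaluate the quality of that summary? What do you need to do first? Gather information. To gather information you run a lot of search queries. You get the search results, aggregate them, maybe find something is still missing, and go down another search path. So every step needs evaluation. You don't have to look only at the final result end to end. Maybe the first step is the search queries: I wrote five queries, so how good are they? Are they too similar to each other? You need five distinct queries. If they are all "Lenny Podcast," "Lenny Podcast last month," and "Lenny Podcast two months ago," that isn't very useful. But if the query keywords are more diverse, you can then look at the search results. You enter one query, say "Lenny Podcast data labeling," and get 10 results; you enter "Lenny Podcast frontier labs" and get another 10, all different pages. How much do they overlap? Are we getting both breadth, meaning lots of pages, and depth? And what about relevance? If the queries we come up with have nothing to do with the original prompt, they are useless. So I think each step needs its own evaluation. The question isn't "How many evals should I have?" but "How many evals do I need to get good coverage of, and high confidence in, my application's performance, while helping me pinpoint where it falls short so I can improve it?"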
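One of the step-level checks described above, whether the generated search queries are too similar to each other and how much two result sets overlap, can be sketched with a simple word-level Jaccard similarity. This is only an illustrative metric under that assumption. A real pipeline might compare embeddings or URLs instead, and the names and example queries here are hypothetical.

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_query_overlap(queries):
    """Average pairwise word overlap between queries; high values mean redundant queries."""
    pairs = list(combinations(queries, 2))
    if not pairs:
        return 0.0
    word_sets = lambda q: set(q.lower().split())
    return sum(jaccard(word_sets(a), word_sets(b)) for a, b in pairs) / len(pairs)

def result_overlap(results_a, results_b):
    """Overlap between two lists of result identifiers (for example, page URLs)."""
    return jaccard(set(results_a), set(results_b))

redundant = ["lenny podcast", "lenny podcast last month", "lenny podcast two months ago"]
diverse = ["lenny podcast data labeling", "frontier labs interview", "chip huyen evals"]

print(mean_query_overlap(redundant) > mean_query_overlap(diverse))
print(result_overlap(["page-a", "page-b"], ["page-b", "page-c"]))
```

A score like this does not say whether the final report is good, but it isolates one step of the pipeline so that a regression there can be noticed and fixed on its own.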
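Lenny Rachitsky: Great. The other thing I'm hearing is that the core use cases, the paths users take most often in the product, are where you should invest most.
Chip Huyen: Right, right.
What RAG is
Lenny Rachitsky: OK. There's one more term I want to cover before we change direction. RAG. People see it all the time, R-A-G. What does it mean?
Chip Huyen: RAG stands for retrieval-augmented generation, and it isn't specific to generative AI. The core idea is simple: for many questions, we need context to answer. I think the idea comes from a paper from around 2017. People working on question-answering benchmarks found that if you give the model relevant information about the question, the answers are much better. So they tried retrieving information from Wikipedia, put the retrieved content into the context, and then had the model answer, and results improved a lot. It sounds obvious, right? So the simplest way to understand RAG is: give the model relevant context so it can answer better. What makes it more interesting is that RAG was originally almost entirely text.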
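Data preparation for RAG
So we talk a lot about how to prepare data so the model can retrieve effectively. Not everything is a Wikipedia page. Wikipedia pages are fairly well structured, and you know the content is all about one topic. But often the documents you face have all kinds of odd structures. Suppose you have a document about Lenny's Podcast that says at the top, "From here on, 'podcast' refers to Lenny's Podcast." Later, if someone asks "Tell me about Lenny's work" and the relevant passage never mentions "Lenny," you may fail to retrieve it. If the document is long and gets split into parts, and the second part doesn't contain the word "Lenny," you won't retrieve it. So you have to process the data in a way that ensures information gets retrieved correctly even when it doesn't look directly related to the query on the surface.
People have come up with various methods. One is contextual retrieval: add a relevant summary or metadata to each chunk so the system knows what the chunk is about. Others use hypothetical questions, which is very interesting: for each chunk, you first generate a set of questions that the chunk can answer, and when a user query comes in you match it against those hypothetical questions and pull up the corresponding chunk. So my point is that data preparation for RAG is extremely important. In many companies I've seen, the biggest gains in RAG performance came from better data preparation, not from agonizing over which database to use. Of course the database matters too. You do need to care about latency and read/write patterns. But purely in terms of answer quality, data preparation is the key.
Lenny Rachitsky: When you say data preparation, can you give an example to make it more concrete?
Chip Huyen: Take the chunking I just mentioned. You have to think about how big each chunk should be, because you want to get the most out of your context. As a simple example, say you can retrieve a thousand words of content. If chunks are long, each one is more likely to contain relevant context, so it's more likely to be a hit. But if chunks are too long, say a thousand words each, you can only retrieve one chunk, which isn't very useful. Conversely, if chunks are too short, you can retrieve more relevant documents and chunks, but each one is so small that it doesn't carry enough useful information.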
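The chunk-size trade-off and the "add context to each chunk" idea can be sketched as follows. This is a minimal word-based chunker with optional overlap, plus a helper that prefixes each chunk with a short document-level note. The chunk size, the overlap, and the note text are illustrative assumptions, not recommendations from the conversation, and production systems typically split on tokens or document structure rather than whitespace.

```python
def chunk_words(text: str, chunk_size: int, overlap: int = 0):
    """Split text into chunks of at most chunk_size words, sharing `overlap` words
    between neighbouring chunks."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

def add_context(chunks, doc_note: str):
    """Prefix every chunk with a document-level note so it stays retrievable
    even when the chunk itself never names its subject."""
    return [f"{doc_note}\n{chunk}" for chunk in chunks]

text = " ".join(f"w{i}" for i in range(10))  # a stand-in 10-word document
chunks = chunk_words(text, chunk_size=4, overlap=1)
print(chunks)
print(add_context(chunks, "Document: Lenny's Podcast episode notes")[0])
```

With a fixed retrieval budget, a larger `chunk_size` means fewer, richer chunks fit into the context, and a smaller one means more but thinner chunks. That is the tension described above.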
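More data-preparation techniques
Chip Huyen: So you have a good chunking design and you've decided how big each chunk should be. You've added contextual information such as summaries, metadata, and hypothetical questions. Someone told me one of their biggest performance gains came from rewriting data into question-and-answer format. Say they have a podcast. Rather than chunking the podcast directly, they reorganize it into "here is a question, here is the answer" and generate lots of Q&A pairs. AI can help with that process too. So that's one example of data processing.
I've also seen many examples of helping AI work with documents written for a specific purpose. Many of the documents we write today are written for humans to read, and AI reads differently. Humans have common sense and roughly know what's going on. Human experts also have context that AI doesn't have.
Someone told me about a big improvement they made. Say you have a function whose documentation describes the library's output, maybe in some obscure terms, such as a temperature or a value on a chart that should be 1, 0, or -1. As a human expert you may understand the scale and what 1 means on it. But the AI really doesn't understand what it means. So they added another annotation layer for the AI, saying for example what temperature equal to 1 means: it isn't an actual temperature, it refers to that particular scale. So it's all this data processing to make it easier for the AI to retrieve the relevant information and answer the question.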
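The state of AI tool adoption in enterprises
Lenny Rachitsky: OK, you've talked a fair bit about how you work with companies on their AI strategy, on building AI products, on deciding which tools to build, and so on. I want to spend more time here, because a lot of companies are building AI products and a lot of them are not having a good time doing it. Let me ask a few questions along those lines, about what you've learned from working with the companies that do it well. The first is about AI tool adoption and adoption inside companies in general. There has been a lot of talk about AI hype lately, but the data actually shows that most companies try it, it doesn't do much, and they stop. So many people feel this may be as far as it goes. What have you observed about adoption of AI tools inside companies?
Chip Huyen: For generative AI in the enterprise, the tools I see fall roughly into two categories. One is for internal productivity: coding tools, Slack chatbots, internal knowledge bases. Many large enterprises have built some kind of model wrapper that connects to different kinds of RAG solutions. What we discussed earlier was mainly text-based RAG; we haven't talked about agentic RAG or multimodal RAG, which is also a very exciting area. Basically it gives employees access to internal documents. Someone asks: I'm having a baby, what's the maternity or paternity leave policy? Does our health insurance cover this surgery? I want to refer a friend for an interview, what's the process? Many companies have this kind of internal chatbot to support internal operations.
The other category is customer-facing or partner-facing. Customer support chatbots are a big category. If you're a hotel chain, you might have a booking chatbot, and these applications are very common. I do have a theory: many companies pick certain applications because their outcomes can be measured concretely. Booking or sales chatbots are typical. What is the conversion rate now, and how much does it improve with the chatbot? The results are clear, so companies find these easier to accept. So a lot of companies have built customer-facing chatbots.
That's the other category of tools. For customer-facing or external tools, because people tend to pick applications with measurable outcomes, adoption depends on whether they can see the outcome. Of course this isn't perfect, because sometimes a bad outcome doesn't mean the idea or the application was bad; it means the way it was built wasn't good enough. So it's complicated.
For adoption of internal tools, or internal productivity gains, it gets trickier. When companies think about AI strategy, I think there are usually two key aspects: use cases and talent. You may have great data for a great use case, but without the talent you can't build it.
When generative AI was just getting started, and I really admire many companies for this, they said: we need our employees to be very aware of generative AI and to be highly AI literate. So they first brought in a set of tools for their teams, ran lots of upskilling workshops, and encouraged learning, which is really great. They were also willing to spend a lot to drive adoption, buying subscriptions for employees and licenses for all kinds of tools to raise AI literacy. But then many of them say: we spent a lot of money on these tools and we don't see the effect. They can see the usage data, but people don't seem to be really using the tools, so where is the problem? So I think it really is tricky.
Lenny Rachitsky: Where do you think the problem is? Is it that people don't know how to use the tools? What do you think the gap is? Do you think we'll reach a point where, wow, because of AI, many companies work in a completely different way?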
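The difficulty of measuring productivity
Chip Huyen: The core problem is that productivity itself is really hard to measure. I've talked to a lot of people about this. Take coding first: many companies have rolled out coding agents or coding tools. I ask them, "Do you think it helps your productivity?" Often the answer is vague: "Hmm, I think it's a bit better." I ask why, and they say, "Well, because we see more PRs, more code..." But of course lines of code is not a good metric. So it's really tricky, and there's another interesting phenomenon. I do ask people to go ask their managers, because I usually work with VP-level people who manage multiple teams. So I ask them, "Would you go ask your managers whether they would rather..."
"Would you rather buy everyone on the team a very expensive coding agent subscription, or get one extra headcount?" Almost all the managers choose headcount. But if you ask VP-level people, or people who manage many teams, they say to go with AI, with the good tooling. The manager's choice makes sense in its own way: you're still growing, and you're not yet managing hundreds or thousands of people. For you, one more full headcount is a big deal. So you want headcount not for productivity reasons but because you simply want more people working for you. Executives, on the other hand, probably care more about business metrics, so they really think about what drives their productivity metrics. So it's tricky. On the productivity question, I'm not sure the root cause is that AI fails to make people more productive. It's that we don't have a good way to measure the improvement.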
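How engineers at different levels respond to AI tools
There's another point worth noting. People do tell me that different kinds of employees respond differently to AI assistant tools. I keep coming back to coding because coding is a big area and in some ways easier to analyze. I've had mixed reports. One team lead told me he divided his team into three groups without telling them: the current top performers, the average performers, and the lowest performers. Then he ran a randomized trial, giving half of each group access to Cursor. Over time he noticed something interesting. In his view, and he is very close to his team, the group that got the biggest performance boost was the senior engineers, the top performers. The best engineers gained the most. The middle group came second. His take was that the best engineers also have good practices and know how to solve problems, so they make better use of the tool. The lowest performers weren't very invested in the work to begin with, so they were more likely to go on autopilot, let it generate code and hand that in, or simply not know what to do with it. But another company told me the exact opposite: the senior engineers were the most resistant to using AI tools. They are more opinionated and have higher standards, and they say, "AI-generated code is terrible." So they strongly resist using it. So I don't know. I haven't been able to reconcile these very different reports.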
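Lenny Rachitsky: That's fascinating. Let me check what I heard. A company you work with ran a three-group test on its engineering team, splitting engineers into top, middle, and lowest performers, and then gave some of them access to... it was Cursor, right? Was it Cursor?
Chip Huyen: I believe it was Cursor.
Lenny Rachitsky: OK. So...
Chip Huyen: I didn't work with them. It's a friend's company.
Lenny Rachitsky: OK, a friend's company. So did they give Cursor to half of the high performers and not the other half? How exactly was it split?
Chip Huyen: Right, they gave access to half of the whole company, and it was half and half within each group too.
Lenny Rachitsky: Wow.
Chip Huyen: And then they watched the difference in productivity.
Lenny Rachitsky: I see. How did they do that? Did they just say, "You get Cursor, you don't"? How did it work? That's so interesting.
Chip Huyen: I didn't ask about the exact mechanism, but I did say, "I'm really impressed that you were able to run a randomized trial."
Lenny Rachitsky: That's so cool.
Chip Huyen: Yes.
Lenny Rachitsky: Wow. How big is the engineering team? Hundreds of people?
Chip Huyen: Not that big. Maybe thirty or forty.
Lenny Rachitsky: Thirty or forty, OK.
Chip Huyen: Right.
Lenny Rachitsky: Wow. So they found the top engineers benefited most from the AI tools, then the middle group, and the lowest group benefited least. OK.
Chip Huyen: But it isn't the same everywhere.
Lenny Rachitsky: Right, right, right.
Chip Huyen: Some companies see something different.
Lenny Rachitsky: Right. The other example you shared is that in some places senior engineers are the most resistant to changing how they work, which I understand. I do feel that right now, apart from ML and AI researchers like you, the most valuable people are senior engineers. So much of a junior engineer's work has been taken over by AI, and an engineer who really knows the craft and understands how large-scale systems work, paired with AI tools, effectively has unlimited junior engineers at their command. That feels like an extremely valuable and powerful asset.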
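Chip Huyen: Yes, I really do appreciate that. You can see it in companies: we value engineers who understand the whole system well and have strong problem-solving skills, people who think globally rather than locally. One company, after seeing how their people worked with AI tools, told me, "We are completely different now." They actually restructured the engineering org so that more senior engineers are involved in code review, because they are the ones who write the guidelines on what good engineering practice is and what the process should look like.
Chip Huyen: Put another way, they wrote a lot of process guidelines on how to work effectively, then had more junior engineers produce the code and submit PRs, while senior engineers are more in a reviewing role. So I think this may be preparation for the future. Another company told me something very similar: they are preparing for a future where they only need a small group of very, very strong engineers to set the process and review code to get it shipped, while AI or junior engineers produce the code. But then the question is: how does someone become a very strong senior engineer?
Lenny Rachitsky: Right. Exactly. Exactly. That's the problem.
Chip Huyen: Yeah. So I'm not sure what that path is. I'm thinking about it too.
Lenny Rachitsky: Nobody is thinking about this. It's a problem. In ten or twenty years we won't have any engineers, because nobody is hiring junior engineers anymore. There is another way to look at it, though. Junior engineers, people just entering computer science now, are the AI-native generation. In theory, if they stay curious and don't simply delegate their learning and thinking to AI, but really learn to use AI to learn how to write good code and do architecture properly, they will grow very fast. You could argue they'll be the most successful engineers of the future.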
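Chip Huyen: I'd put what I said earlier about architecture under system thinking. I really do think it's a very important skill, because AI can help automate many isolated, discrete skills, but knowing how to combine those skills to solve a problem is hard. I attended an online panel with Mehran Sahami, one of my favorite professors, who chaired the curriculum committee of Stanford's computer science department, so he has spent a lot of time thinking about what students should learn in the age of AI coding. The other panelist was Andrew Ng, a legend in AI, of course. Professor Sahami said something very interesting. He said many people think computer science is about writing code, and it isn't. Coding is just a means to an end.
The core of computer science is system thinking, using coding to solve real problems, and problem solving will never go away, because as AI automates more and more, the problems just get bigger. The process of understanding root causes and designing step-by-step solutions will always be there. Let me give an example. I actually have a fair number of complaints about AI for debugging. I don't know how often you use AI to write code, but something my friends and I have noticed is that when your task is very clear and well defined, AI works quite well: writing documentation, fixing a specific feature, or building an app from scratch that doesn't need to interact with a large codebase. But once things get a bit more complex, such as needing to interact with other components, it usually doesn't do so well.
For example, I was using AI to deploy an application, trying out a new hosting service I wasn't familiar with. One benefit AI gives me is the confidence to try new tools. Before AI, trying a new tool meant reading the documentation from scratch, and now I just think, OK, let's try it and learn as I go. So I was trying this new hosting service and kept hitting a bug that was very, very annoying. I asked the AI to fix it, and it kept changing approaches: change the environment variables, change the code, swap this function for that function, switch languages, maybe it doesn't support JavaScript, I don't know, all kinds of attempts. None of it worked. Finally I said, OK.
I'll read the documentation myself and see what the problem is. It turned out that the tier I was on didn't include the feature I needed. It wasn't available on that tier. So I think the AI's problem was that it kept trying to fix things from inside one component when the root cause came from somewhere completely different. That made me think that understanding how different components work together, and where a problem might be coming from, requires a global view. It also made me think about how we teach AI system thinking. I've seen human experts build very detailed cognitive scaffolding: for this kind of problem, look here, look there, then check that. Maybe that's one way. But it also made me reflect: how do we teach humans system thinking? So I think it's a very interesting and very important skill.
Lenny Rachitsky: That matches exactly what Bret Taylor shared on the podcast. He's the co-founder of Sierra, co-created Google Maps, was co-CEO of Salesforce, and also worked on Quip and other projects. I asked him whether people should learn to code. His view was exactly what you said: taking computer science classes isn't about learning Java and Python, it's about learning how systems work, how code runs, and how software works broadly, not just "this is a function that implements some feature."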
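ML engineers vs. AI engineers
There's one more thing I want to help people understand. You wrote a book called AI Engineering, which is essentially about helping people understand this new kind of engineer. You have a very clean way of drawing the line between ML engineers and AI engineers, and the analogy carries over nicely to product management, to the difference between AI product managers and non-AI product managers. Your description is that ML engineers build the models themselves, while AI engineers use existing models to build products. Anything to add?
Chip Huyen: One thing I really dislike about writing books is having to give definitions, and I don't think any definition is perfect, because there are always edge cases. But in general, I think as GenAI becomes available more and more as a service, someone has built the model for you, and the base models already perform quite well. That makes it easy for people to say: OK, I want to integrate AI into my product, and I don't need to learn the details of model training. Knowing them does help, but the barrier to entry for people who want to build products with AI has become very low. At the same time, AI is so capable that it has expanded the range of applications it can be used for. So the barrier to entry is very low and the demand for AI applications is much larger. That's very, very exciting and opens up a whole new space of possibilities.
Lenny Rachitsky: Right. It's like you no longer have to spend time building the AI brain. You just use it to get things done, which is a huge unlock. OK, maybe a last question. You see a lot of what works, what doesn't, and where things are heading. I'm curious: looking ahead two or three years, what do you think will be different about building products? What will be different about how companies work? If you think about the biggest changes we might see in how companies operate over the next few years, what's your view?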
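Changes in organizational structure
Chip Huyen: I think many organizations don't move that fast, but at the same time they are moving faster than I expected. Again, I think I have a bias here: I don't work with dinosaur companies that don't care about AI. Many of the executives who come to me are very forward-looking. So my view of organizations is heavily skewed toward the ones that move quickly. One major change I see is in organizational structure. We used to have very siloed teams, with a clear division between engineering and product. But then the question comes up: who writes the evals? Who owns the metrics? It turns out evals are not an isolated problem but a systemic one, because you need to look at the different components and how they interact. You need to understand user behavior, because you have to know what users care about to write evals that reflect their needs.
The component side you can address by looking at the architecture of the different parts, setting up guardrails, and so on. That's engineering, while understanding users is product work. Because evals are so important, they pull product and engineering, and even marketing and user acquisition, closely together. So people are adjusting their org structures so that functions that used to be clearly separated communicate much more.
Another change I see is about teams. Of course I think about which work can and can't be automated in the next few years. I see people already slimming down. It's a bit scary to think about, but people on these teams also tell me, OK, this is good, we've already cut a lot of functions, like work we used to outsource.
Traditionally, companies outsource work that is non-core and can be made more systematic. Now you can use AI to automate a lot of that work. As a result, people are rethinking what the value of junior engineers and senior engineers is, and how the engineering org should be restructured. I do think successful organizations keep adjusting how they allocate resources and thinking about use cases: whether to incubate new use cases, and who should lead new projects. That's a major change.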
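Slowing progress in base models and the rise of post-training
Another aspect of AI, and I'm not sure how accurate this view is, though I lean toward this camp myself, is this: the base models we have may not have fully topped out, but we're unlikely to see another model that is dramatically more powerful in a jaw-dropping way.
Think back. We had GPT, then GPT-2, a big improvement over GPT, then GPT-3, and then GPT-4, each a very big jump over its predecessor. And of course there is GPT-5, but whether GPT-5's leap over its predecessor is as large as the earlier ones is, I think, debatable. I don't think improvements in base model performance will be as astonishing as they used to be, at least not like the past three years. So a lot of the improvement I see is happening in post-training and in application building. That's also where I expect to see a lot of progress.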
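Challenges and opportunities in multimodality
I'm also very interested in multimodality. We've seen a lot of text-based applications, but I think there are many exciting use cases in audio and video. Audio hasn't been solved well yet. I work with a few voice startups, and when you think seriously about voice interaction, it's a completely different beast.
Take chatbots. Going from a text chatbot to a voice chatbot, the challenges are completely different. A voice chatbot has to worry about latency, because it goes through multiple steps: speech to text first, then from the text question to a text answer, then the text answer back to speech. With that many hops, latency becomes critical. Another question is how to make it sound natural. In human conversation, if you want to interrupt me and say "Chip...", I stop and listen. But sometimes you're just making a small sound to show you're listening, like "mm-hmm," and in that case I shouldn't stop, I should keep talking. So handling interruptions, and whether to stop or not, is a key factor in whether people perceive the conversation as natural.
There's also a regulatory angle, because often people want to build voice chatbots that sound human and try to make users think they're talking to a real person, but there may be regulations requiring you to disclose to users whether they're talking to a human or an AI. I think this whole area is far less mature than you'd expect. But it's not entirely a base-model problem. Detecting human interruptions is really a classic machine learning problem; framed differently, you can solve it with a classifier. Latency is a huge engineering challenge rather than an AI challenge. Of course it can become an AI challenge too, because people are trying to build voice-to-voice models: instead of transcribing speech to text, having a model generate a text answer, and having another model turn it into speech, you go directly from voice to voice. That's what people are working toward, but it's very hard.
So even audio, which I think is simpler than video because video includes both images and speech, is already hard enough. I think there are still many challenges in this area.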
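Summary and recap
Lenny Rachitsky: That's a great list. Let me quickly mirror it back. The changes you predict in how we'll work over the next few years resonate with a lot of conversations I've had on this podcast, so this further confirms where things are heading. First, the lines between functions blur: it's no longer design and engineering each in their own lane, everyone does many different things. Second, more work gets automated by agents and AI tools, and in theory productivity goes up. Third, the focus shifts from pre-training to post-training, fine-tuning and so on, because as you said, the pace at which models get smarter may be slowing.
I do want to mention, though, that I spoke with a co-founder of Anthropic, and he made a great point: we're very bad at feeling what an exponential means while we're inside it. Also, models are released more often now, so we may not notice the difference between releases simply because they happen more frequently, whereas GPT-3 came out about a year after GPT-2. Maybe that's right, maybe not. The fourth thing you mentioned is multimodality and investing in multimodal experiences. I can't wait for ChatGPT's voice mode to get better at handling interruptions. It's exactly the problem you described: I'm talking to it, someone makes a small noise, and it just stops, and you have to start over. So annoying.
Chip Huyen: I'm shocked that we still don't have better voice assistants at home. Honestly, I've tested a bunch. I'm always hopeful, thinking, oh my god, this might be the one, and then I've had to give a lot of them away because they really weren't good enough.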
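The future of voice assistants
Lenny Rachitsky: I think it's coming soon. I hear it's coming. Anthropic is working on it with a partner, and I don't know whether it has launched yet.
Chip Huyen: Right. I want to go back to what you said about your guest from Anthropic and performance improvements. I think one big change is the distinction between the model's own capability and its perceived performance. I mean the difference between the pre-trained base model and the performance people perceive. For example, are you familiar with the concept of test-time compute?
Lenny Rachitsky: Not really. Explain it for us.
Chip Huyen: The idea is this: you have a fixed amount of compute. You spend a lot of it on pre-training, that is, training the model. After pre-training you spend some more on fine-tuning. The ratio of compute between pre-training and post-training is extreme, and it varies a lot between labs. Then there's the compute used at inference, when you have a trained, fine-tuned model and you serve it to users. I type a question into the prompt, and it generates output and does inference, which takes compute. So the discussion now is: should I spend more compute on pre-training, on fine-tuning, or on inference? People refer to compute at the inference stage as test-time compute. The strategy of spending more on inference means allocating more compute to generation at inference time to get better performance. How does that work in practice?
Say you have a math problem. Instead of generating one answer, you generate four different answers and see which is best by some criterion. Or three of the four say 42 and one says 20, so you say OK, three agree, the answer should be 42. You generate a batch of results. Another way is reasoning or thinking: letting the model generate more thinking tokens and spend more time thinking before showing the final answer. That takes more compute but also gives better performance. So from the user's point of view, when the model spends more time exploring different candidate answers and thinks longer, it can give a much better final answer. But the base model itself hasn't changed.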
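The "generate several answers and take the one most of them agree on" idea from the example above can be sketched as a simple majority vote. The `generate` callable stands in for a call to a model and is a hypothetical placeholder; here it is simulated with a fixed list of answers.

```python
from collections import Counter

def majority_vote(answers):
    """Return the answer that appears most often among the candidates."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]

def answer_with_more_compute(generate, n):
    """Spend n model calls instead of one, then keep the most common answer."""
    return majority_vote([generate() for _ in range(n)])

# Simulated model outputs for the same math problem (three say 42, one says 20)
simulated = iter(["42", "20", "42", "42"])
print(answer_with_more_compute(lambda: next(simulated), n=4))
```

Nothing about the underlying model changes in this sketch. Only the amount of compute spent per question goes up, which is the distinction drawn above between base-model capability and perceived performance.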
Lenny Rachitsky: Got it.
Chip Huyen: Does that make sense?
Lenny Rachitsky: Yes, that makes total sense.
Chip Huyen: Right?
Lenny Rachitsky: That ties in nicely with Ben Mann's point.
Chip Huyen: Right.
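AI strategy and the "idea crisis"
Lenny Rachitsky: Chip, we've covered a lot. Everything I wanted to learn, and more. Before we get to our very exciting lightning round, is there anything else you want to share? Anything you want to leave listeners with?
Chip Huyen: I do work with some companies on getting their employees to come up with ideas. There's a big debate about how AI strategy should be done: top-down or bottom-up? Should executives come up with one or two killer use cases and put the whole company's resources behind them? Or should engineers, product managers, and smart people on the front line come up with ideas themselves? I think it should be a mix. Some companies say, we hired a bunch of smart people, let's see what they come up with, and then they run hackathons or internal challenges to get people building. One thing I've noticed is that many people have no idea what to build. That shocked me. I think we're in something of an idea crisis.
We now have all these really cool tools that can do everything from scratch. They can do design for you, write code for you, build websites for you. We should be seeing a lot more ideas come out, yet people are stuck and don't know what to build. I think it may have to do with social expectations. We've entered a phase of heavy specialization, where people are expected to be excellent at one thing rather than to have a big-picture view. Without a big-picture view, it's hard to come up with what to build.
So when I work with these companies on hackathons, we put together a guide on how to come up with ideas. Our usual advice is to look back over the past week. Spend a week paying attention to what you do and what frustrates you. When something frustrates you, ask: is there a way to make this better? Can it be done differently so it's less annoying? You can share these frustrations across the team, and I've even seen people compare frustrations with each other. Some of them may be the inspiration for something you can build. So pay attention to how we work, think about how to improve it, and keep asking, "How could this be done better?" Then build solutions around those frustrations. I think that's a good way to learn and adopt AI.
Lenny Rachitsky: I think people get exactly the feeling you describe every time they open one of those vibe coding tools. You can describe anything you want, and I just freeze: "I don't know, what do I want?" I love this very practical advice: pay attention to what frustrates you. For example, I just vibe coded a cool little app. I was writing a newsletter post in Google Docs and had pasted in lots of screenshots and other images, and then I remembered: right, you can't get images out of Google Docs. It's a Hotel California experience. Things go in but they're very hard to get back out. So I went to the vibe coding tools and built an app: give it a link to a Google Doc and it automatically downloads all the images. It works great, and I made it cute too. I'll put a link in the show notes.
Chip Huyen: Oh, I'd love to see it. I'm very bullish on using AI to create micro tools, the things that make your life a little bit easier.
Lenny Rachitsky: A hundred percent. I think that's one of the main ways people use these tools: solving their own niche problems.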
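Lightning round
So, Chip, we've reached our very exciting lightning round. I have five questions. Are you ready?
Chip Huyen: Sure, anytime. No, no, no, it depends on how hard the questions are.
Lenny Rachitsky: Every guest gets the same questions, so I'm guessing you may have heard them. First: are there two or three books you find yourself recommending most to other people?
Chip Huyen: I'm actually afraid of recommending books, because I think what someone should read really depends on what they want, where they are in life, and where they want to go. But a few books did change how I think and see the world. One is The Selfish Gene. It actually helped me think through whether to have children, because it helps you understand that many of our functions and the way we operate are functions of our genes, and genes only want one thing: to replicate.
Chip Huyen: In a sense the book makes another point too. Everyone wants to live forever, maybe not consciously, but subconsciously. There are two ways to do that. One is through genes, which want to carry on forever. The other is ideas, and I think the same logic applies: if you put ideas out into the world and they can last a long time, they keep living. I know it's a bit abstract, but I find it really interesting.
Another book I really, really love is by Singapore's former leader, I believe he's considered the founding father of Singapore, Lee Kuan Yew. I'm not quite sure of the title, but he's the person who took Singapore from a third-world country to a first-world country in 25 years. I've never seen a national leader put so much effort into systematically writing down his thinking on how to run a country.
The book discusses public policy at length, how to design policies that encourage people to do what's good for the country, and it also covers foreign affairs, foreign policy, national independence, and so on. It's a very thought-provoking book. For me it's a kind of system thinking, but about a system at a different level, a country, and most of us will never get the chance to experiment with a system like that. So being able to learn from it is very valuable.
Lenny Rachitsky: What's the second book called?
Chip Huyen: It's called From Third World to First. Actually, I think I have a copy right here. Yes.
Lenny Rachitsky: There it is.
Chip Huyen: It's a very thick book.
Lenny Rachitsky: Show it off.
Chip Huyen: Mm.
Lenny Rachitsky: Amazing, I definitely need to read it. That's a great recommendation. I've heard about the huge impact he had, and I've seen lots of videos on Twitter of his insights on building a thriving society. Clearly, the results speak for themselves.
Chip Huyen: Yeah, can you believe it? Where did he find the time to write such a thick book? It's crazy.
Lenny Rachitsky: Right. Claude, please summarize it for me. Just kidding. By the way, I love The Selfish Gene too. Such a good pick. It's a bit under the radar, but it really did change how I see the world. Great recommendation. OK, next question. Is there a movie or TV show you've especially enjoyed recently?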
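Chip Huyen: I watch a lot of movies and TV, mainly for research. I'm writing my first novel, which I sold recently. It's a drama, not the kind of science fiction people in tech usually read. So I know it's a bit unexpected, but I've basically been watching TV shows to study what kinds of stories become popular and to understand the narrative patterns. So I'm not sure the audience would like...
Lenny Rachitsky: Just name one. Which one taught you something about writing?
Chip Huyen: For example, Story of Yanxi Palace, a Chinese TV series.
Lenny Rachitsky: Cool, OK. I haven't heard anyone recommend that on the podcast. Nice.
Chip Huyen: Mm.
Lenny Rachitsky: Next question. Do you have a life motto that you often come back to when things get hard at work or in life?
Chip Huyen: This may sound nihilistic. What I'd say is: in the end, nothing matters. I usually think of it this way: on the grand scale, a billion years from now, nothing will exist and nobody will be there. I know some people will argue with me on that. My theory is that in a billion years none of us will exist. So no matter what terrible or crazy things we did or what we messed up, nobody will remember and nobody will be there. In a way that sounds a bit scary, but it's actually very liberating, because it lets me tell myself: OK, just go try it, right? What's the worst that could happen?
Something happened recently. A family member of mine passed away. Because I couldn't make it home, I was on the phone with my dad.
I asked my dad, "Is there anything I can do to make him..." I just wanted to find something that would comfort him, anything that would make him happy. And my dad said, "What could he still want at this moment?" It made me feel deeply that at the end of life, nothing material can bring you happiness. Not money, not products, nothing. It also made me ask: in the end, what do I actually care about? So I think, maybe I failed, maybe I didn't sign that contract, but at the end of life those things don't really matter. Thinking that way is actually liberating.
Lenny Rachitsky: You say that might be a bit nihilistic, but Steve Jobs shared a similar thought in his most famous speech. We're all going to die someday, so don't take things too seriously, and that really is freeing. Absolutely. It makes you cherish every moment and every day. As you said, go do the hard, scary things. OK, last question. You mentioned you're writing a novel, and most people in tech have never written anything creative or fictional. What is one thing you've learned in the process about how to write better stories and better fiction?
Chip Huyen: When we read, we often get tripped up by small details. I wanted to try creative writing to become a better writer. Maybe writing for a different audience can help me get better at predicting what that kind of reader wants to hear and cares about. So it's also a way of training myself. I think whether you're writing or creating any kind of content, at its core you're predicting how the audience will react, right?
Lenny Rachitsky: Predicting the next token.
Chip Huyen: You do it with the podcast too.
Lenny Rachitsky: Just kidding.
Chip Huyen: Right, right. You do a podcast, so you think: what kind of content will listeners find engaging? I think that carries over to many companies too. When you launch a product you need a narrative, and you have to think: how do we position this product so users want it? I've been doing technical writing for a while and have some experience predicting what engineers want to hear and care about. But I had no experience at all with this completely different audience. That's why I wanted to try creative writing and write a story. It's also why I did so much research. The research was enjoyable too. I watched a lot of TV shows to see what people like. One thing I care about a lot, and I think I learned this from an editor, is the emotional journey.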
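Creative writing and the emotional journey
Chip Huyen: So when we write, we care about how readers will feel across the whole story. The opening needs something that draws people in, a hook that makes them want to keep reading. But you can't have too much drama either, or readers get tired, because it drains them emotionally. It feels like being emotionally manipulated all the time. So the emotional journey needs ups and downs, maybe a climax followed by some release. Another thing I hadn't realized: for me, technical writing is entirely about the content and the argument, and it's very impersonal. If people like content about ML compilers, it doesn't matter at all whether they like the person explaining compilers, because it's purely objective. Fiction is different. People care whether the character is likable.
In the first draft of my story, I wrote the character as very rational, very calm, logical in everything she did. Then I got feedback. A very good friend of mine read the story, he's a wonderful person, and he said, "Chip, honestly, I hate this character." So as a story, the problem was that the character was so unlikable that he didn't want to keep reading. In the second draft I made her more likable. How? By giving her moments of vulnerability, such as running into setbacks, so readers can relate. In many ways it's been really fun. A lot of it is about understanding the emotional side: how readers feel, not only about the story itself but about the characters.
Lenny Rachitsky: That's so interesting. Wow, I learned much more than I expected. Awesome, great example. Chip, two final questions. Where can people find you online if they want to work with you or just share what you've put out? And how can listeners be useful to you?
Chip Huyen: I'm on social media, LinkedIn and Twitter. I don't post much, but I keep telling myself I should post more, because I love talking with readers. I'm actually about to start a Substack. There's a placeholder page on Substack now, and I'm thinking of writing mainly about system thinking, because I think it's a very interesting skill. I'm also thinking of starting a YouTube channel for book reviews, basically of books that help you think better. I think the first book I review may be this one, because it has been my favorite book since I was young and I keep rereading it. So how can people help me? Recommend books you love, books that changed how you think or how you do things. I'd really appreciate it.
Lenny Rachitsky: Amazing, I'm looking forward to reading that book.
Chip Huyen: Mm.
Lenny Rachitsky: Chip, thank you so much for being here.
Chip Huyen: Thank you so much for having me, Lenny.
Lenny Rachitsky: Bye, everyone. Thank you so much for listening. If you found this episode valuable, you can subscribe on Apple Podcasts, Spotify, or your favorite podcast app. Please also consider giving us a rating or leaving a review, as that really helps other listeners find the podcast. You can find all past episodes or learn more about the show at lennyspodcast.com. See you in the next episode.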
Glossary
| English term | Chinese rendering |
|---|---|
| agentic RAG | agentic RAG(代理式检索增强生成) |
| AI literate / AI literacy | AI 素养 |
| ARR | ARR(年度经常性收入) |
| base model | 基础模型(base model) |
| classifier | 分类器(classifier) |
| Claude Shannon | 克劳德·香农 |
| code review | 代码审查(code review) |
| coding agent | 编码代理(coding agent) |
| conversion rate | 转化率 |
| distillation | 蒸馏(distillation) |
| eval | 评估(eval) |
| fine-tuning | 微调(fine-tuning) |
| frontier lab | 前沿实验室(frontier lab) |
| GenAI | 生成式 AI |
| guardrails | 护栏(guardrails) |
| inference | 推理(inference) |
| labeled data | 标注数据(labeled data) |
| latency | 延迟(latency) |
| MCP | MCP(Model Context Protocol) |
| multi-modal RAG | 多模态 RAG(多模态检索增强生成) |
| multimodality | 多模态(multimodality) |
| post-training | 后训练 |
| PR | PR(Pull Request) |
| pre-training | 预训练 |
| RAG | RAG(检索增强生成) |
| randomized trial | 随机试验 |
| reinforcement learning | 强化学习(reinforcement learning) |
| reward model | 奖励模型(reward model) |
| RLHF | RLHF(基于人类反馈的强化学习) |
| sampling strategy | 采样策略(sampling strategy) |
| soft skills | 软技能 |
| supervised fine-tuning | 监督微调(supervised fine-tuning) |
| supervised learning | 监督学习(supervised learning) |
| system thinking | 系统思维(system thinking) |
| test time compute | 测试时计算(test time compute) |
| tier | 服务层级(tier) |
| token | token |
| unsupervised learning | 无监督学习(unsupervised learning) |
| verifiable rewards | 可验证奖励(verifiable rewards) |
| vibe coding | vibe coding(用自然语言描述需求、由 AI 生成代码的开发方式) |
| voice-to-voice model | 语音到语音模型(voice-to-voice model) |
| weights | 权重(weights) |
AI Engineering 101 with Chip Huyen (Nvidia, Stanford, Netflix)
Chip Huyen: A question that get asked a lot and a lot is, “How do we keep up to date with the latest AI news?” Why do you need to keep up to date with the latest AI news? If you talk to the users who understand what they want or they don’t want, look into the feedback, then you can actually improve the application way, way, way more.
Lenny Rachitsky: A lot of companies are building AI products. A lot of companies are not having a good time building AI products.
Chip Huyen: We are in an idea crisis. Now, we have all these really cool tools to do everything from scratch and have new designs. They can help you write code. You can have a new website. So in theory, we should see a lot more, but at the same time, people are somehow stuck. They don't know what to build.
Lenny Rachitsky: All this AI hype, the data is actually showing most companies try it, doesn’t do a lot. They stop. What do you think is the gap here?
Chip Huyen: It's really hard to measure productivity. So, I do ask people to ask their managers, "Would you rather give everyone on the team very expensive coding agent subscriptions or get an extra head count?" Almost every one of the managers will say head count. But if you ask VP level or someone who manages a lot of teams, they would say, "I want the AI assistant." Because as a manager, you are still growing, so for you having one more head count is big. Whereas for executives, maybe you have more business metrics that you care about. So you actually think about what actually drives productivity metrics for you.
Lenny Rachitsky: Today, my guest is Chip Huyen. Unlike a lot of people who share insights into building great AI products and where things are heading, Chip has built multiple successful AI products, platforms, tools. Chip was a core developer on NVIDIA’s NeMo platform, an AI researcher at Netflix. She taught machine learning at Stanford. She’s also a two-time founder and the author of two of the most popular books in the world of AI, including her most recent book called AI Engineering, which has been the most read book on the O’Reilly platform since its launch.
She’s also gotten to work with a lot of enterprises on their AI strategies, and so she gets to see what’s actually happening on the ground inside a lot of different companies. In our conversation, Chip explains a lot of the basics like, what exactly does pre-training and post-training look like? What is RAG? What is reinforcement learning? What is RLHF? We also get into everything she’s learned about how to build great AI products, including what people think it takes and what it actually takes. We talk about the most common pitfalls that companies run into, where she’s seeing the most productivity gains and so much more.
This episode is quite technical, more technical than most conversations I've had, and is meant for anyone looking for a more in-depth conversation about AI. If you enjoy this podcast, don't forget to subscribe and follow it in your favorite podcasting app or YouTube.
Chip, thank you so much for being here, and welcome to the podcast.
Pre-Training vs Post-Training Differences
Chip Huyen: Hi, Lenny. I’ve been a big fan of the podcast for a while, so I’m really excited to be here. Thank you for having me.
RLHF and Signal Collection
Lenny Rachitsky: I want to start with this table/chart that you shared on LinkedIn a while ago that went super viral, and I think it went super viral because it hit a nerve with a lot of people. Let me just read this and we’ll show this on YouTube for people that are watching. So it’s this very simple table you shared of what people think will improve AI apps and what actually improves AI apps. What people think will improve AI apps, staying up to date with the latest AI news, adopting the newest agentic framework, agonizing about what vector databases to use, constantly evaluating what model is smarter, fine-tuning a model. And then you have what actually improves AI apps, talking to users, building more reliable platforms, preparing better data, optimizing end-to-end workflows, writing better prompts. Why do you think this hit such a nerve with people? If you had to boil it down, what do you think people are missing about building successful AI apps?
Chip Huyen: One question that gets asked a lot is, “How do we keep up to date with the latest AI news?” I’m like, “Why? Why do you need to keep up to date with the latest AI news?” I know it sounds very counterintuitive, but there’s just so much news out there. A lot of people also ask me questions like, “How do I choose between two different technologies?” Recently, for example, MCP versus the agent-to-agent protocol. And it’s like, “Which one is better, this or that?” I think a better question to ask is, “First, how much improvement could you get from the optimal solution versus a non-optimal one?” Right? And sometimes they say, “Actually, it’s not much.” Right?
I was like, “Okay, if it’s not much improvement, then why do you want to spend so much time debating something that doesn’t make that much difference to your performance?” Another question is, “If you adopt a new technology, how hard would it be to switch it out for another?” And sometimes they say, “Oh, I think it could be a lot of work switching it out.” And I’m just like, “Hmm, let’s say here’s a new technology. It hasn’t been tested by a lot of people, and if you adopt it, you’ll be stuck with it forever. Do you actually want to adopt it?” Maybe you want to think twice about overcommitting to new technologies that haven’t been well tested.
The Future of Data Labeling Companies
Lenny Rachitsky: I love that your broader advice is just simple: to build successful AI apps, talk to users, build better data, write better prompts, optimize the user experience, versus just asking, what is the latest and greatest? What’s the best model to use right now? What’s happening in AI? Let me follow this thread of fine-tuning and, basically, post-training. There are all these terms that people hear in AI, and I think this is going to be a really good opportunity for people to learn what we’re actually talking about, since you actually do these things, you build these things, you work with companies doing these things. There are a few terms I want to sprinkle in through the conversation, but let’s start with this one. What’s the simplest way for someone to understand the difference between pre-training and post-training, and then how fine-tuning fits into that, and what fine-tuning actually is?
What Are AI Evaluations?
Chip Huyen: Quick disclaimer: I don’t have full visibility into what these big, secretive frontier labs are doing. But from what I’ve heard, one part is supervised fine-tuning, where you have demonstration data. You have a bunch of experts saying, “Okay, here’s a prompt, and here is what the answer should be like,” and you just train the model to emulate what the human expert would write. That’s also what a lot of open-source models are doing, but they do it by distillation. So instead of having human experts write really great answers to prompts, they get very popular, well-known good models to generate responses, and then train smaller models to emulate them.
Sometimes you see people just like… I really appreciate the open-source community, by the way, but being able to train models that can emulate an existing good model is very different from being [inaudible 00:08:38] trained good models, like an output for existing good model. So, it’s a big step there. So yeah, we have supervised fine-tuning, and another thing that’s very big, I’m not sure if you’ve had guests talk about it already, is reinforcement learning, which is everywhere.
Lenny Rachitsky: Let’s pause on that because I definitely want to spend time on that, and that’s such a cool topic that’s emerging more and more in my conversations. But just to summarize the things you just shared, which I think is really, really important stuff. So, the idea here is that a model is essentially an algorithm, a piece of code that someone writes, and for the frontier models, say, they feed it basically the entire internet of content, and it’s trying to test itself on predicting, across all that data, the next word, essentially. Token is the correct way of thinking about it, but a simpler way to think about it is the next word in text. As it gets it wrong, it adjusts these things called weights, essentially. Is that a simple way to think about it, even if it’s very surface level?
Evaluation ROI and Pragmatic Trade-Offs
Chip Huyen: So, I think of language modeling as a way of encoding statistical information about language, right? So, let’s say that we both speak English, so we have a sense of what is more statistically likely. If I say, “My favorite color is,” then you would say, “Okay, that should be a color.” The word “blue” would be much more likely to appear than a word like [inaudible 00:09:59], right? Because statistically, “blue” is more likely to [inaudible 00:10:02] “my favorite color is.” So, it’s a way of encoding statistical information.
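A minimal sketch of "encoding statistical information about language", assuming a four-sentence toy corpus invented for illustration and a fixed two-word context. The function names are hypothetical, and real language models learn these statistics with neural networks rather than raw counts.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
corpus = [
    "my favorite color is blue",
    "my favorite color is blue",
    "my favorite color is green",
    "my favorite food is pizza",
]

def train_next_word_counts(sentences, context_size=2):
    """Count which word follows each context of `context_size` words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(context_size, len(words)):
            context = tuple(words[i - context_size:i])
            counts[context][words[i]] += 1
    return counts

def next_word_distribution(counts, context):
    """Turn the raw counts for one context into probabilities."""
    followers = counts[tuple(context)]
    total = sum(followers.values())
    return {word: n / total for word, n in followers.items()}

counts = train_next_word_counts(corpus)
dist = next_word_distribution(counts, ["color", "is"])
print(dist)  # "blue" gets a higher probability than "green"
```

After "color is", the toy model has only ever seen color words, and "blue" more often than "green", which is the sense in which some continuations are statistically more likely than others.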
So with language modeling, when you train on a large amount of data, the model sees a lot of languages, a lot of domains. So it can tell, okay, your basic size is standard. Then the user gives a prompt, and it can come up with the next most likely token. By the way, it’s actually not a new idea. The idea is very, very old; it comes from a 1951 paper on the entropy of English. I think it’s by Claude Shannon, and it’s a great paper. And a related story I really like is from… Did you read Sherlock Holmes, by the way?
What Is RAG?
Lenny Rachitsky: Yeah, I read a few Sherlock Holmes books. Yeah.
Chip Huyen: Yeah. So this is a story where Sherlock Holmes uses this statistical information to help solve a case. Somebody left a message made of a lot of stick figures. Sherlock Holmes knows that in English the most common letter is E, so the most common stick figure must be E. And then he goes on like that, [inaudible 00:11:07]. So in a way, it’s simple language modeling, but instead of working at the word level, he does it at the character level, and a token is something in between, right? A token is not quite a word, but it’s bigger than a character. We use tokens because they help us reduce the vocabulary. Characters give the smallest vocabulary: the alphabet has 26 characters, but words can number in the millions and millions, right? Whereas with tokens, you can get the sweet spot between the two.
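The first step of the stick-figure reasoning can be sketched as a frequency count over a substitution cipher. The plaintext and the substitution key below are hypothetical, chosen only so that "e" is the most frequent letter. This is an illustration of character-level statistics, not a full cipher solver.

```python
from collections import Counter
import string

# Hypothetical plaintext in which "e" is the most frequent letter.
plaintext = "meet me here where the three trees were felled"

# Hypothetical substitution key: the i-th letter of the alphabet maps to key[i].
key = "qwertyuiopasdfghjklzxcvbnm"
encode = str.maketrans(string.ascii_lowercase, key)
ciphertext = plaintext.translate(encode)

def most_common_symbol(text):
    """Return the most frequent alphabetic symbol in the text."""
    counts = Counter(ch for ch in text if ch.isalpha())
    return counts.most_common(1)[0][0]

# Holmes-style first guess: the most common symbol stands for "e".
top_symbol = most_common_symbol(ciphertext)
first_guess = {top_symbol: "e"}
print(ciphertext)
print(first_guess)
```

The remaining letters would need further clues (common short words, repeated pairs), which this sketch does not attempt.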
So let’s say we have a new word, like “podcasting,” right? Let’s say it’s a new word, but you can divide it into “podcast” and “ing.” So people understand: okay, “podcast,” we know the meaning, and we know that “ing” marks a verb form, a gerund, whatever it is. So we can work out the word “podcasting” even if it’s new, and that’s where tokens come in. But yeah, pre-training is basically encoding statistical information about language to help you predict what is most likely. “Most likely” is the simplest way of putting it, because it’s really building a distribution: okay, for the next token, 90% of the time it could be a color, and 10% of the time it could be something else. So it’s basically a distribution that the language model picks from, depending on your sampling strategy. Do you want it to always pick the most likely token, or do you want it to pick something more creative? So I think sampling strategy is something extremely important. It can boost performance in a huge way, and it’s very, very underrated.
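A short sketch of two sampling strategies over a next-token distribution, assuming hypothetical probabilities for the context "my favorite color is". Greedy decoding always takes the most likely token, while temperature sampling reshapes the distribution before drawing from it. The temperature formula used here, raising each probability to the power 1/T and renormalizing, is one common formulation and not necessarily what any particular model API does internally.

```python
import random

# Hypothetical next-token probabilities: 90% colors, 10% something else.
next_token_probs = {"blue": 0.5, "green": 0.3, "red": 0.1, "pizza": 0.1}

def greedy(probs):
    """Always pick the single most likely token."""
    return max(probs, key=probs.get)

def apply_temperature(probs, temperature):
    """Sharpen (T < 1) or flatten (T > 1) a distribution, then renormalize."""
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: s / total for tok, s in scaled.items()}

def sample(probs, temperature=1.0, rng=None):
    """Draw one token from the temperature-adjusted distribution."""
    rng = rng or random.Random()
    adjusted = apply_temperature(probs, temperature)
    tokens = list(adjusted)
    weights = [adjusted[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the demo is repeatable
print(greedy(next_token_probs))
print([sample(next_token_probs, temperature=2.0, rng=rng) for _ in range(5)])
```

Lower temperatures push the choice toward the most likely token, and higher temperatures give unlikely tokens more of a chance, which is the "more creative" end of the trade-off.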
Data Preparation for RAG
Lenny Rachitsky: Okay, awesome. So essentially, a model is just code with this whole set of weights, essentially the statistical model that has learned to predict what comes next after certain words and phrases?
More Tips for Data Preparation
Chip Huyen: Yeah.
The State of Enterprise AI Adoption
Lenny Rachitsky: And then post-training, and fine-tuning specifically, is doing that same thing. So with pre-training you get GPT-5. Fine-tuning is someone taking GPT-5 and doing the same sort of thing, adjusting these weights a little bit for specific use cases, on data that they find is necessary for their very specific use case. Is that a simple way to think about it?
The Challenge of Measuring Productivity
Chip Huyen: Yeah, I think of weights as part of a function, right? So let’s say you have a function where maybe Lenny’s height is 1x plus something, or 2x [inaudible 00:13:38], and the multiplier and the plus-something are the weights, right? You change them until the function fits the correct data, which is my height and your height. So you can think of the weights as just the parameters of a function. You train, adjusting the weights so they fit the data, which is the training data.
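To make "adjust the weights until the function fits the data" concrete, here is a minimal sketch that fits y = w·x + b by gradient descent. The four data points are hypothetical, generated from y = 2x + 1, and the learning rate and step count are illustrative choices. A language model does the same kind of adjustment with billions of weights instead of two.

```python
# Hypothetical training data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # start from arbitrary weights
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Nudge each weight in the direction that reduces the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

Fine-tuning, in this picture, would start from the already-fitted `w` and `b` and continue the same loop on a smaller, more specific dataset.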
AI Tool Adoption Across Engineer Skill Levels
Lenny Rachitsky: Awesome. Okay. So we’re talking about pre-training, post-training, fine-tuning. Is there anything else here that’s important to share about just what this is exactly? What people need to understand about these parts of training?
ML Engineers vs AI Engineers
Chip Huyen: So the vast majority of the time, we don’t touch pre-trained models. As users, we don’t use them at all.
Lenny Rachitsky: Right. It’s already done for us.
Organizational Structure Shifts
Chip Huyen: Yeah. So I think my [inaudible 00:14:15] is a bit of a fun process: when my friends are training models, they try to play with their pre-trained models, and they’re horrendous. They’re saying things like [inaudible 00:14:23], “Oh, my gosh.” Yeah, it’s crazy. So it’s very interesting to look at how much post-training can change a model’s behavior, and I think that’s where a lot of people at the frontier labs are spending their energy nowadays: on post-training. Pre-training has been used to increase the general capabilities of a model, and it needs a lot of data and model size to do that. And at some point, we have kind of maxed out on internet data; text data has maxed out. I think a lot of people are working with other data like audio and video, and everyone’s trying to think about what the new source of data is. But because everyone has very similar pre-training data, post-training is where they make a big difference nowadays.
Lenny Rachitsky: This is a good segue. You talked about supervised learning versus unsupervised learning. I love that we’re getting into this, by the way. This is super interesting. So you talk about labeled data. Basically, supervised learning is AI learning on data that somebody has already labeled and told it, here’s correct versus incorrect. For example, this is spam versus not spam. This is a good short story. This is not a good short story. We’ve had the CEOs of a lot of these companies that do this for labs, Mercor and Scale, Handshake, there’s Micro, there’s a few others. So is that essentially what these companies are doing for labs, giving them labeled data, high-quality data to train on?
Slowing Foundation Models and Post-Training Rise
Chip Huyen: It is in a way, but I think it’s more like one part of a bigger equation. There are a lot more components than that. That’s why I was talking about reinforcement learning. I’m not sure if your CEO [inaudible 00:16:09] interviews brought up that term. So the idea is that once you [inaudible 00:16:14]… So let’s say you have a model, you give the model a prompt, and it produces an output. You want to reinforce, to encourage, the model to produce outputs that are better. So now it comes down to: how do we know whether an answer is good or bad? Usually, people rely on signals. One way to tell good from bad is human feedback. You have two responses, and you say, okay, this one’s better than the other. And we do that because, as humans, it’s very hard for us to give a concrete score, but it’s easier to do comparisons.
If you ask me, “Okay, give this song a score,” I’m not a musician and I don’t know how hard it is. It’s like, yeah, I don’t know, out of 10 I’m going to give it a six. And if you ask me again a month from now, when I’ve completely forgotten, okay, maybe now it’s a seven, or only a four, I don’t know. But if you ask me, “Okay, here are two songs, which one would you prefer to play at the birthday party?” I’d be like, “Okay, I prefer this song.” So comparisons are a lot easier. So [inaudible 00:17:18] you have human feedback, and then you use this human feedback to train a reward model to tell which response is better. Then the reward model helps you: okay, the model produces this response.
It [inaudible 00:17:30] can score: is this good or bad? And you try to bias the model toward producing better responses. Another way is that, instead of using a human, you can use AI to look at the response and say good or bad, right? Or the thing that people are very big on nowadays is verifiable rewards, which is natural. So basically, they give the model a math problem, and the model outputs a solution. Okay, the expected response should be 42, and if it doesn’t provide 42, then it’s wrong; it’s not a good response. So yes, a lot of the time, people are using human labelers to produce, how to say, expert questions and expected answers, in ways that [inaudible 00:18:16] systems that are verifiable, so that the models can be trained on them. Yeah.
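A verifiable reward can be as simple as checking the model's final answer against a known expected answer. The sketch below assumes a hypothetical output convention in which the model ends its response with a line such as `Answer: 42`. The function names and the format are illustrative, not taken from any specific training framework.

```python
def extract_final_answer(model_output):
    """Return the text after the last 'Answer:' line, or None if absent."""
    for line in reversed(model_output.strip().splitlines()):
        stripped = line.strip()
        if stripped.lower().startswith("answer:"):
            return stripped.split(":", 1)[1].strip()
    return None

def verifiable_reward(model_output, expected_answer):
    """Reward 1.0 when the final answer matches exactly, otherwise 0.0."""
    answer = extract_final_answer(model_output)
    return 1.0 if answer == expected_answer else 0.0

good = "6 * 7 = 42\nAnswer: 42"
bad = "6 * 7 = 41\nAnswer: 41"
print(verifiable_reward(good, "42"), verifiable_reward(bad, "42"))
```

The appeal is that no human or AI judge is needed at training time: the expected answer written by an expert acts as the signal.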
Lenny Rachitsky: Okay, I’m really glad you went there. This is essentially RLHF, reinforcement learning from human feedback, which is exactly what I also wanted to talk about, right?
Challenges and Opportunities in Multimodality
Chip Huyen: Yeah. So I think in general, it’s a way of learning. The training is [inaudible 00:18:33] learning, and whether it learns from human feedback, AI feedback, or verifiable rewards, I’d say those are just different ways of collecting signals.
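For the human-feedback signal, one common way to turn "this one is better than that one" comparisons into scores is a Bradley-Terry style model. That is an assumption here, since the conversation does not name a specific method. The sketch learns one scalar score per item from hypothetical preference pairs. A real reward model would instead score arbitrary responses with a neural network.

```python
import math

# Hypothetical comparisons from human raters: (preferred, rejected).
comparisons = [
    ("song_a", "song_b"),
    ("song_a", "song_c"),
    ("song_b", "song_c"),
    ("song_a", "song_b"),
    ("song_c", "song_b"),
]

def fit_scores(pairs, steps=500, learning_rate=0.1):
    """Learn a score per item so preferred items tend to score higher."""
    scores = {item: 0.0 for pair in pairs for item in pair}
    for _ in range(steps):
        for winner, loser in pairs:
            # Probability the current scores assign to the observed preference.
            p_win = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
            # Gradient ascent on the log-likelihood of that preference.
            step = learning_rate * (1.0 - p_win)
            scores[winner] += step
            scores[loser] -= step
    return scores

scores = fit_scores(comparisons)
print(sorted(scores, key=scores.get, reverse=True))
```

Nobody had to assign a number out of 10 to any song. The scores fall out of the comparisons alone.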
Summary and Key Takeaways
Lenny Rachitsky: Awesome. Yeah. We had the CEO of Anthropic on the podcast and he talked about their version of RLHF, which is AI driven reinforcement learning. I love the way you phrased it where basically you want to help the model, you want to reinforce correct behavior and correct answers, and this is the method to do it, whether it’s say an engineer seeing an output from a model being like, “No, here’s how I would code it differently.” And it’s training a different model that the original model works with to tell it, am I correct or not correct? Is that right, roughly?
Chip Huyen: Yeah.
The Future of Voice Assistants
Lenny Rachitsky: Okay.
Chip Huyen: I think that’s one way of looking at it. I think this space is so exciting nowadays because there are so many domain-expert tasks that model developers want models to do well on, right? Let’s say you’re an accountant. Maybe you want to use a model to help with accounting tasks, and that needs a lot of accounting data and examples from accountants. So you need to hire a lot of them to do it. Or everyone [inaudible 00:19:41] physics problems, or, I don’t know, legal questions and stuff, or engineering questions. Somebody was telling me they want to use coding to solve scientific problems, not just coding to build products, which is a whole different realm of things. And there’s also the use of very specific tooling. I’m not sure what apps you use, but maybe a [inaudible 00:20:04] app or QuickBooks or Google Excel. They’re very specific tools requiring specific expertise. So you want the model [inaudible 00:20:13] learn.
So they need a lot of human experts in these areas to create data to train the models, and it’s a massive thing, because everyone wants a lot of data and wants [inaudible 00:20:25] unlimited budget. But I think there’s also some low-key interesting economics here. I’m not sure if you’ve talked to your guests about it, but I think it’s very interesting [inaudible 00:20:35] think about, because it’s very lopsided, right? There’s only a very small number of frontier labs, and they want a lot of data, and there’s a massive number of startups or companies providing related data. So you see these startups doing data labeling. They may have massive ARR, but if you ask them, “Okay, so how many customers do you have?” it could be a very small number. I’m not sure… I saw you smiling.
AI Strategy and the Creative Crisis
Lenny Rachitsky: Yeah, yeah, yeah, we chatted about that.
Rapid Fire Q&A
Chip Huyen: Yeah, so I’m a little bit [inaudible 00:21:08] uneasy. You have a company that’s growing like crazy, but it’s heavily dependent on two or three companies. And at the same time, if I were one of these frontier labs, what would be the right economic thing for me to do? I’d want a lot of startups. I’d want to have a lot of providers so I can pick and choose, and so these providers also compete with each other to lower the price, and it’s so dependent on [inaudible 00:21:34] regardless. So I feel like, yeah, this whole economics is very interesting to me, and I’m curious to see how it plays out.
Lenny Rachitsky: What I’m hearing is you’re bearish on the future of these data labeling companies because, as you said, they don’t have a lot of leverage over pricing, since they have so few customers and there are so many people getting into the space. So basically, even though these are some of the fastest-growing companies in the world, you’re feeling like there’s a challenge up ahead.
Creative Writing and Emotional Journeys
Chip Huyen: I’m not sure if I’m bearish on it. I think I’m curious, because things have a way of working out in ways that I don’t expect. Maybe these companies, since they have a lot of data, would be able to use that to gain some insight that helps them stay ahead of the curve. So I don’t know.
Lenny Rachitsky: A very fair answer. Okay, while we’re on this topic, I want to chat about evals, which is a very recurring topic in this podcast. This is the other piece of data content these companies share that AI labs really need. Can you just talk about what an eval is, the simplest way to understand it and then how this helps models get smarter?
Chip Huyen: So when people approach evals, I think there are two very different problems. One is as an app builder: say I have an app, maybe a chatbot, a very simple example, the first thing that came to my mind, and I want to know if the chatbot is good or bad. So I need to come up with a way to evaluate the chatbot. The other I think of as task-specific eval design. So let’s say I’m a model developer and I want to make my model better at creative writing. And it’s like, “Okay, but how do I even measure creative writing?”
So I need someone who understands creative writing to think about what makes a story good, and then design the whole dataset and the criteria to evaluate creative writing. So yeah, I think there’s that. I think it’s the eval design that is very interesting: [inaudible 00:23:39] work criteria, [inaudible 00:23:42] work guidelines, how to do it, and then also training people to do it effectively. So I guess, [inaudible 00:23:49], I think evals are really, really fun because they’re extremely creative. I was looking at different evals people have built and it was like, “Wow.” It’s not dry at all. It’s just super, super, super fun.
Lenny Rachitsky: We had a whole podcast on evals with Hamel and Shreya. That’s exactly what they talked about is just, it’s actually really fun to create evals for companies, especially. So let’s still dig into that one a little bit more. There’s this kind of debate online that, I don’t know how big of a deal this debate is, but it feels like people spend a lot of time thinking about this, this idea of, do we need evals for AI products? Some of the best companies say they don’t really do evals, they just go on vibes. They’re just like, “Is this working well? Can I feel it or not?” What’s your take on just the importance of building evals and the skill of evals for AI apps, not the model companies?
Chip Huyen: You don’t have to be absolutely perfect, I think, to win. You just need to be good enough and be consistent about it. Okay, this is not a philosophy I follow, but I have worked with enough companies to see that play out. So why would a company not do evals? Let’s say you are an executive and you want to have a new use case. So here’s a use case: you started out, built it, and it works well. The customers are somewhat happy. You don’t have the exact metric for it.
So the traffic keeps increasing, people seem happy, people keep buying stuff, and now here’s an engineer coming in like, “Okay, we need evals for it.” And it’s like, “Okay, how much effort do we need to put into evals?” And they’re like, “Okay, maybe two engineers, this much, this much.” And it could maybe improve things, so it’s like, “Okay, so how much expected gain can I get from it?” And the engineer would be like, “Oh, maybe you can improve it from 80% to 82%, 85%.”
And it’s like, “Okay, but [inaudible 00:25:35] those two engineers and we launch a new feature, then it could give me so much more improvement.” So I think that’s one reason. Sometimes people think, okay, this is good enough, just don’t touch it. If you spend a lot of energy on evals, you’d only get an incremental improvement, whereas you could spend that energy on another use case, and maybe [inaudible 00:25:55] good enough that you can vibe check it.
So I think maybe that’s what the debate is about. A lot of the time, people just get things to the place where it’s, okay, good enough, and they run with it. But of course there’s a lot of risk associated with that, because if you don’t have a clear metric, you don’t have good visibility into [inaudible 00:26:17] applications or models are performing. It might do something very dumb, or, I don’t know, something crazy can happen. So yeah, I do think evals are very, very important if you operate at scale and where failures can have catastrophic consequences.
Then you do need to be very tyrannical about what you put in front of users, and understand the different failure modes and what could go wrong. The same goes for a space where the feature or product is a competitive advantage: you want to be the best at it, so you want a very strong understanding of where you are and where you stand against competitors.
But if it’s something more low-key, something that’s not the core but helps our users, then maybe you don’t need to be so obsessed or theoretical about it. It’s like, okay, that’s good enough for now, and if it fails, then it fails. Okay, I know that sounds terrifying. But yeah, I think it’s all a question of return on investment. I’m a big fan of evals; I love reading evals. That said, I understand why some people would choose not to focus on evals right away and choose to bring on new functionality instead.
Lenny Rachitsky: Awesome. That is a really pragmatic answer. What I’m hearing is evals are great, very important, especially if you’re operating at scale, but pick your battles. You don’t need to write evals for every little feature. Something that Hamel and Shreya shared is that people need just, I don’t know, five or seven evals for the most important elements of their product. Is that what you see or do you see a lot more in production that people build and need?
Chip Huyen: I don’t think there’s a fixed number of evals. What is the goal of evals? The goal is to guide product development. The reason I’m a big fan of evals is that they help you uncover opportunities by showing where the product is doing well and where it isn’t. So sometimes, we’ve seen a very obvious [inaudible 00:28:15] where you look at the evals and realize, okay, it performed really poorly on this specific segment of users. Then we look into it: okay, what’s wrong? And it turns out we just didn’t have good messaging for them. So people should just focus on the things that are doing poorly and can improve significantly. So yeah, the number of evals really depends. We have seen products with hundreds of different metrics.
Lenny Rachitsky: Oh, wow.
Chip Huyen: People go crazy. This is because the product is general, so they have evals with different names: one eval for, I don’t know, verbosity, one eval for sensitive user data, another for length. But okay, let me give a concrete example, like deep research. So you have an application where you use a model to do deep research for you. Okay, you have a prompt. Let’s say, “Do comprehensive research on Lenny’s Podcast and show me a report on what kinds of topics he’s interested in, what kinds of videos get the most views, or what topics he’s missing that he should be covering,” right? You have that prompt. Then how do you evaluate the result? I don’t think there’s one metric that would do it. I think somebody has a benchmark where they get a hundred experts to write a bunch of prompts and then go through the AI’s answers and grade them. And it’s extremely costly and slow.
But [inaudible 00:29:47] might have something else. One way I was thinking about it, I was talking to a friend about this, is: how would you produce the summary? First you need to gather information, and to gather information you need to do a lot of search queries. You grab the search results, you aggregate some of them, and then maybe say, okay, I’m still missing this. You have to go another round, and after another round, [inaudible 00:30:17] have the summary. So every step of the way, you need evaluation. You don’t [inaudible 00:30:21] end-to-end. Take the search queries first: okay, now I write five search queries. I might look into how good the search queries are. Are they similar to each other? Because you don’t want five search queries that are very similar: okay, “Lenny Podcast,” “Lenny Podcast last month,” “Lenny Podcast two months ago.”
It’s not very exciting. But if the queries’ keywords are more diverse, you then look at the results of each search query. Say you enter the search query “Lenny Podcast data labeling” and it comes up with 10 pages, 10 results. And then you come up with, oh, “Lenny Podcast frontier labs,” I don’t know, and you have 10 results. [inaudible 00:31:06] different webpages. Okay, how much do they overlap? Are we getting breadth, pulling in a lot of pages, but also do we have depth, and also do we have relevance? Because we might come up with a search query that’s completely irrelevant to the original prompt. So I feel like every aspect of it needs a way of evaluating. So I don’t think the question is how many evals I should have, but how many evals I need to get good coverage and high confidence in my application’s performance, and also to help me understand where it is not performing well so that I can fix it.
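One of the step-level checks described here, how much the results of different search queries overlap, can be sketched with set arithmetic. The page identifiers and result lists below are hypothetical, and the third "diverse" query is invented for the example. Relevance and depth would need separate evaluations, for instance human or model judgments against the original prompt.

```python
from itertools import combinations

def jaccard(a, b):
    """Share of results two queries have in common (0 = none, 1 = identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def mean_pairwise_overlap(results_per_query):
    """Average overlap across every pair of queries."""
    pairs = list(combinations(results_per_query.values(), 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

def unique_result_ratio(results_per_query):
    """Breadth: fraction of all returned results that are distinct pages."""
    all_results = [page for pages in results_per_query.values() for page in pages]
    return len(set(all_results)) / len(all_results)

# Hypothetical search results keyed by query.
redundant = {
    "Lenny Podcast": ["page_1", "page_2", "page_3"],
    "Lenny Podcast last month": ["page_1", "page_2", "page_4"],
    "Lenny Podcast two months ago": ["page_1", "page_3", "page_4"],
}
diverse = {
    "Lenny Podcast data labeling": ["page_1", "page_5", "page_6"],
    "Lenny Podcast frontier labs": ["page_7", "page_8", "page_9"],
    "Lenny Podcast evals": ["page_2", "page_10", "page_11"],
}

for name, results in (("redundant", redundant), ("diverse", diverse)):
    print(name, mean_pairwise_overlap(results), unique_result_ratio(results))
```

A high overlap with a low unique-result ratio suggests the agent is spending its search budget on near-duplicate queries.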
Lenny Rachitsky: Awesome. And I’m hearing also just especially for the very core use case, the most common path people take in your product is where you want to focus.
Chip Huyen: Yeah, yeah.
Lenny Rachitsky: Okay. There’s one more term I want to cover, and then I want to go in a somewhat different direction. RAG? People see this term a lot, R-A-G. What does it mean?
Chip Huyen: So RAG stands for retrieval-augmented generation, [inaudible 00:32:08] not specific to generative AI. The idea is just that for a lot of questions, we need context to answer. I think it comes from a paper from 2017. They realized that on a bunch of question-answering benchmarks, if we give the model information about the question, the answer can be much, much better. So what they did was try to retrieve information from Wikipedia. For a question [inaudible 00:32:39], just retrieve that, put it into the context, and answer. It does much better. It sounds like a no-brainer, right? I mean, obviously. So I think that’s what RAG is: in the simplest sense, it’s just providing the model with relevant context so that it can answer the question. And this is where things get more interesting, because traditionally, when it started out, RAG was mostly text.
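In its simplest form, the retrieve-then-answer pattern can be sketched as below. The three documents, the word-overlap scoring, and the prompt template are all assumptions made for illustration. Production systems typically use embedding-based or keyword search engines, and the resulting prompt would be sent to a model, which this sketch does not do.

```python
import re

# Hypothetical document store.
documents = {
    "doc_evals": "This episode covers how teams design evals for AI products.",
    "doc_rag": "This episode explains retrieval and how context improves answers.",
    "doc_labeling": "This episode discusses data labeling companies and frontier labs.",
}

def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, docs, k=1):
    """Rank documents by how many words they share with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        docs,
        key=lambda doc_id: len(query_tokens & tokenize(docs[doc_id])),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, docs, k=1):
    """Put the retrieved text in front of the question."""
    context = "\n".join(docs[doc_id] for doc_id in retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

query = "Which episode discusses data labeling companies?"
prompt = build_prompt(query, documents)
print(prompt)
```

The model never needs to have memorized the documents. It only needs the right passage placed in its context.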
So we talk a lot about how to prepare data so that the model can retrieve effectively. Not everything is a Wikipedia page. A Wikipedia page is pretty self-contained, and you know, okay, everything in it is about one topic. But a lot of the time, you have documents like [inaudible 00:33:19], and they’re structured in weird ways. Let’s say you have a document about Lenny’s Podcast, and at the beginning of the document it says, “From now on, ‘podcast’ will refer to Lenny’s Podcast.” Then somebody in the future says, “Okay, tell me about Lenny, Lenny’s work.” Because [inaudible 00:33:40] document does not have the term “Lenny,” you might not retrieve it. And if the document is long enough that it’s chunked into different parts, the second part doesn’t have the word “Lenny,” so you cannot retrieve it. So you have to find a way to process the data to make sure you can retrieve the information that is relevant to the query, even though it might not be immediately obvious that it’s related.
So people come up with things like, I think, contextual retrieval: giving each chunk of the data the relevant context, maybe a summary or metadata, so that it knows. Or some people use hypothetical questions. It’s very interesting: for a given chunk of a document, you generate a bunch of questions that the chunk can help answer, so that when you have a query, it’s like, okay, does it match any of the hypothetical questions? Then you can fetch it. So it’s a very interesting approach. Okay, so before I go to the next thing, I just want to say that data preparation for RAG is extremely important. In a lot of the companies I have seen, the biggest performance gains in their RAG solutions come from better data preparation, not from agonizing over which vector database to use. Of course, the vector database is very important if you care about things like latency, or if you have very specific access patterns like read-heavy or write-heavy; of course it matters. But in terms of pure answer quality, I think data preparation is just [inaudible 00:35:07].
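The effect of enriching chunks with context and hypothetical questions can be sketched as follows. The chunks, the context sentences, and the questions are all hypothetical and hand-written here. In practice they would usually be generated by a model, and matching would use embeddings rather than word overlap.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Hypothetical chunks. The first never mentions "Lenny" in its own text.
chunks = [
    {
        "text": "The podcast publishes weekly interviews on product and growth.",
        "context": "From a document about Lenny's Podcast.",
        "hypothetical_questions": ["What is Lenny's work about?"],
    },
    {
        "text": "Vector databases differ in latency and read or write patterns.",
        "context": "From a document about retrieval infrastructure.",
        "hypothetical_questions": ["How do vector databases differ?"],
    },
]

def indexed_text(chunk, enriched):
    """Text used for matching: the raw chunk, or chunk plus added context."""
    if not enriched:
        return chunk["text"]
    return " ".join([chunk["context"], chunk["text"], *chunk["hypothetical_questions"]])

def best_chunk(query, enriched):
    """Return (index, score) of the chunk sharing the most words with the query."""
    query_tokens = tokenize(query)
    scores = [len(query_tokens & tokenize(indexed_text(c, enriched))) for c in chunks]
    best = max(range(len(chunks)), key=scores.__getitem__)
    return best, scores[best]

query = "Tell me about Lenny's work"
raw_index, raw_score = best_chunk(query, enriched=False)
enriched_index, enriched_score = best_chunk(query, enriched=True)
print(raw_score, enriched_score)
```

With the raw text, no chunk shares a single word with the query. With the added context and questions, the podcast chunk becomes the clear match.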
Lenny Rachitsky: When you say data preparation, what’s an example to make that real and concrete for us to understand?
Chip Huyen: So one way, as I mentioned, is that you have chunks of data, so you have to think about how big each chunk should be. Here’s a very simple example. There’s a context size you want to fill; say you want to retrieve a thousand words. If a chunk is long, it’s more likely to contain the relevant information. But if it’s too long, say each chunk is a thousand words, then you can retrieve only one chunk, so it’s not very useful. If chunks are short, you can retrieve more of them, from a wider range of documents, but at the same time each chunk may be too small to contain the relevant information.
So you need a good chunk design: how big each chunk should be. You add contextual information like summaries, metadata, hypothetical questions. Somebody was telling me that a very big performance gain they got came from rewriting their data in a question-answering format. Say they have a podcast. Instead of just chunking the podcast, you rewrite it into “here’s a question, here’s the answer,” and produce a lot of those. You can use AI for that as well. So that’s one example of data processing. Another example I see a lot is people using AI to help with specific [inaudible 00:36:40] use and documentation. And they rewrite the documentation. A lot of documentation today is written for human reading, and AI reading is different, because humans have common sense and we kind of know what things mean. And human experts also have context that AI doesn’t quite have.
So somebody told me that one big change they made is this. Let’s say you have a function, and the documentation for the library says the output is, I don’t know, some crazy term, maybe some temperature or something on a graph, and it should be one, zero, or minus one. A human expert may understand the scale, what one on that scale means, but the AI really doesn’t understand what that means. So they actually add another annotation layer for the AI. It says, okay, temperature equal to one means this. It’s not an actual temperature; it’s associated with the scale over there. So it’s all this data processing to make it easier for the AI to retrieve the relevant information to answer the questions.
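The chunk-size trade-off from the thousand-word example is simple arithmetic, sketched here under the assumption of a fixed retrieval budget measured in words. The function name `chunks_that_fit` and the sizes 250 and 50 are illustrative choices, not figures from the conversation.

```python
def chunks_that_fit(context_budget_words: int, chunk_size_words: int) -> int:
    """How many whole chunks of a given size fit into a fixed retrieval budget."""
    return context_budget_words // chunk_size_words

budget = 1000  # the "retrieve a thousand words" budget from the example
for size in (1000, 250, 50):
    print(f"chunk size {size:>4} words: {chunks_that_fit(budget, size)} chunk(s) retrievable")
```

Thousand-word chunks leave room for a single chunk, so everything depends on that one retrieval being right. Fifty-word chunks allow twenty, from a wider range of documents, but each may be too small to carry the relevant information on its own.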
Lenny Rachitsky: Awesome. Okay. So you’ve talked a bit about how you work with companies on these sorts of things, on their AI strategies, on their AI products, how they build, which tools they build, all these things. I want to spend a little time here because a lot of companies are building AI products. A lot of companies are not having a good time building AI products. Let me ask a few questions along these lines of what you’ve learned working with companies that are doing this well. One is just, I guess, in terms of AI tool adoption and adoption in general within companies, there’s all this talk recently of just all this AI hype. The data is actually showing most companies try it. Doesn’t do a lot, they stop. And so there’s all this just maybe this isn’t going anywhere. So in terms of just adoption of tools in AI within companies, what are you seeing there?
Chip Huyen: For GenAI in companies, I think there are two types of GenAI tooling that I’ve seen. One is for internal productivity, like coding tools, Slack chatbots, internal knowledge. A lot of big enterprises have some wrapper around models, with access to maybe some type of RAG solution. I think so far we’ve talked about text-based RAG. We haven’t talked about agentic RAG or multi-modal RAG yet, but there’s a whole very exciting area around that. So basically, it should allow employees to access internal documents. Somebody asks, okay, I’m having a baby, what is the maternity or paternity policy? Or I’m having this operation, does the health benefit cover it? Or I want to refer my friend, what is the process for that? So a lot of this is having an internal chatbot to help with internal operations.
And another category is more customer-facing or partner-facing. Customer support chatbots are a big one. If you’re a hotel chain, you might have a booking chatbot, which is somehow massive. There are a lot of booking chatbots, and I do have this theory that a lot of the applications companies pursue are the ones where they can measure a concrete outcome. With a booking or sales chatbot, it’s very clear: here is the conversion rate right now with human operators, and here is what the conversion rate could be with a chatbot. So the outcomes are very clear, and it’s easier for companies to buy into these solutions. So a lot of companies have that customer-facing chatbot.
So that is another category of tool. For customer-facing or external-facing tools, people are driven to choose applications with clear outcomes, so the question of adopting them is really based on whether they see the outcome or not. Of course, it’s not perfect, because sometimes the outcome can be bad not because the application’s idea [inaudible 00:41:52] is bad, but just because the process of building it is not that great. Yeah. So it’s tricky. For the internal adoption of tooling for internal productivity, that’s where it gets tricky. I would say a lot of companies [inaudible 00:42:08] think of AI strategy. I think AI strategies usually have two key aspects: the first is use cases and the second is talent. You might have great data for great use cases, but if you don’t have the talent, you cannot do it.
So a lot of the time at the beginning with GenAI, and I really admire a lot of companies for this, [inaudible 00:42:28] was like, okay, we need our employees to be very GenAI-aware, very AI-literate. So what they do is start adopting a bunch of tools for the team to use. They hold a lot of up-skilling workshops, they encourage learning, and that’s a really, really good thing. They’re also willing to spend a lot of money on adoption, buying people ChatGPT subscriptions, [inaudible 00:42:56] subscriptions, to get the employees to be more AI-literate. And the thing is, a lot of… There’s a [inaudible 00:43:05] who may say, okay, we spent a ton of money on this tooling, and we can see the usage, but people don’t seem to use the tools as much. What is the issue? So yeah, I think that is tricky. Yeah.
Lenny Rachitsky: What do you think is the issue? Is it just they don’t know how to use them? What do you think is the gap here? Do you think we’ll get to a place of just like, wow, work is completely different because of AI for a lot of companies?
Chip Huyen: The main thing, again, is that it’s really hard to measure productivity. So I talk to a lot of people on their website. First of all, [inaudible 00:43:40] is coding. A lot of companies are now using coding agents or coding [inaudible 00:43:45] coding. And I was asking, “Do you think that it helps with your productivity?” And a lot of times, the answers are very [inaudible 00:43:56]: okay, I feel like it’s [inaudible 00:43:59] better. And they say, okay, because we have more PRs, we see more code, and then immediate [inaudible 00:44:04]. Okay, but of course, number of lines of code is not a good metric for that. So it’s really, really tricky, and here’s something funny. I usually work at the VP level, with people who have multiple teams under them, so I ask them to ask their managers, okay, would you rather have access…
Would you rather give everyone on the team very expensive coding agent subscriptions, or get an extra headcount? And almost all the managers would say headcount. But if you ask someone at VP level, or someone who manages a lot of teams, they would say [inaudible 00:44:48] AI assistant tools. And the reason, you could say, is that as a manager, you are still growing. You’re not at the level where you manage hundreds or thousands of people. So for you, having one HR headcount is big. You want that not for productivity reasons, but because you just want to have more people working for you. Whereas executives care more about business metrics, so they actually think about what drives productivity metrics for them. So it is tricky, the question of productivity. I’m not sure if it’s fundamentally the [inaudible 00:45:32] more productive, but it’s just that we don’t have a good way of measuring productivity improvement.
Another thing is also very [inaudible 00:45:40]. People do tell me that they notice different buckets of employees having different reactions to AI assistant tools. I keep going back to coding because coding is big and it’s somehow easier to reason about. So I have different reports. One person told me that, among his engineers, he thinks senior engineers get the most out of it and become more productive. That person is very interesting. He actually divided his team into three buckets, but he didn’t tell them, obviously: currently best performing, average performing, and lowest performing. And then he ran a randomized trial. They gave half of each group access to Cursor. And then [inaudible 00:46:31] noticed over time something funny about which group got the biggest performance boost, in his opinion; he was very close to his team.
The biggest performance boost [inaudible 00:46:41] the senior engineers, the highest performing. So the highest-performing engineers got the biggest boost out of it. And the second group was the average performing. His opinion is that the highest-performing engineers also follow good practices. They also know how to solve problems, so they use it to solve problems better. Whereas the lowest-performing people either don’t care much about the work, so it’s easier to just go on autopilot and get it to generate the code, or they just don’t know how to use it. Another company, however, told me that senior engineers are actually the ones most resistant to using AI-assisted tooling, because they are more opinionated and they have very high standards. They say, okay, but AI code, [inaudible 00:47:30] code just sucks. So they’re just very, very resistant to using it. So I don’t know, I haven’t quite been able to reconcile these very different reports yet.
Lenny Rachitsky: This is so interesting. So just to make sure I’m hearing what the story, so there’s a company you work with, that did a three bucket test with their engineering team where they created three sorts of groups, the highest performing engineers, mid-performing engineers, lowest performing engineers, and gave some of them, so they gave some of them access to say, Cursor. Was it Cursor or what did they give them access to? It was Cursor, right?
Chip Huyen: I think it was Cursor.
Lenny Rachitsky: Okay, cool. And so within-
Chip Huyen: I didn’t work with them. This is more like a friend company.
Lenny Rachitsky: Okay. It’s a friend’s company.
Chip Huyen: Yeah.
Lenny Rachitsky: So did they give half of the higher performing engineers Cursor and half not or how did they do the split there?
Chip Huyen: Yeah, so they give half of the entire company but half of each bucket. Yeah.
Lenny Rachitsky: Whoa.
Chip Huyen: And then they observe the difference in productivity.
Lenny Rachitsky: I see. So how do they even do that? They’re just like, “Okay, you get Cursor, you don’t get Cursor.” How did they do that? That’s so interesting.
Chip Huyen: Yeah, I didn’t get into the mechanics of it, but I was like, “I respect you for doing a randomized trial on that.”
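The mechanics were not discussed, so the following is only one plausible sketch of the “half of each bucket” split: a stratified random assignment. The function name `assign_half_per_bucket`, the seed, the bucket sizes, and the engineer IDs are all hypothetical.

```python
import random

def assign_half_per_bucket(buckets: dict[str, list[str]], seed: int = 0) -> dict[str, list[str]]:
    """Randomly pick half of each performance bucket to receive the tool."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    treated = {}
    for name, members in buckets.items():
        shuffled = members[:]          # copy so the input is not mutated
        rng.shuffle(shuffled)
        treated[name] = sorted(shuffled[: len(shuffled) // 2])
    return treated

# Hypothetical team of 36 engineers, split into three buckets of 12.
buckets = {
    "highest": [f"eng{i:02d}" for i in range(0, 12)],
    "average": [f"eng{i:02d}" for i in range(12, 24)],
    "lowest": [f"eng{i:02d}" for i in range(24, 36)],
}

treated = assign_half_per_bucket(buckets, seed=7)
for name, members in treated.items():
    print(name, "gets the tool:", members)
```

Randomizing within each bucket, rather than across the whole team, keeps the treated and untreated groups balanced on prior performance, which is what makes a per-bucket comparison meaningful.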
Lenny Rachitsky: That is so cool.
Chip Huyen: Yeah. Yeah.
Lenny Rachitsky: Okay. Wow. How large was this engineering team? Was it like hundreds of people?
Chip Huyen: It’s not that large. It’s about maybe 30 to maybe 40. Yeah.
Lenny Rachitsky: 30 to 40. Okay.
Chip Huyen: Yeah.
Lenny Rachitsky: Wow. Okay. So they found that the highest performing engineers had the most benefit from using AI tools and then behind them was the middle tier engineers and the worst performers or the lowest performers. Okay.
Chip Huyen: But it’s also not the same everywhere.
Lenny Rachitsky: Right. Right. Right, right.
Chip Huyen: Some companies are different.
Lenny Rachitsky: Right. This other example you shared of just senior engineers in this one example are most resistant to changing the way they work, which I get. I do feel like the most valuable people right now other than ML researchers and AI researchers like yourself, are senior engineers because it feels like junior engineers are just, so much of this is now done by AI, but an engineer that knows what they’re doing that understands how things work at a large scale with AI tools, just basically infinite junior engineers doing their bidding, feels like an extremely valuable and powerful asset.
Chip Huyen: Yeah, I definitely see that companies really appreciate engineers who have a good understanding of the whole system, have good problem-solving skills, and think holistically instead of locally. One company I have seen told me that the way they work is completely different now. They actually restructured the engineering org so that senior engineers are more in the peer review role, and they write guidelines on what good engineering practices are and what the process should look like.
Or maybe, okay, they write a lot of processes on how to work well. And then they have more junior engineers just produce code and submit PRs, with senior engineers more in the reviewing role. So I think they might be preparing for the future. Another company actually told me something very similar. They are preparing for a future where they only need a very small group of very, very strong engineers to create processes and review code before it goes to production, and get AI or junior engineers to produce the code. But then the question becomes, how does one become a very strong senior engineer?
Lenny Rachitsky: Right. That’s right. That’s right. That’s the problem. Yeah.
Chip Huyen: Yeah. So I don’t know what’s the process I was thinking about, yeah.
Lenny Rachitsky: No one’s thinking about it. It’s a problem. We won’t have any more in 10, 20 years. There’ll be no more engineers because no one’s hiring junior engineers. Although I could make the case. Junior engineers, people just getting into computer science right now, are just AI native. And in theory, you could argue they will become really good really fast if they’re curious, aren’t just delegating, learning and thinking to AI, but learning how to actually, using it to learn how to code well and architect correctly. You could argue they’ll be the most successful engineers in the future.
Chip Huyen: I do think that what you mentioned about architecting, I group that under system thinking. I do think it’s a very important skill, because AI can help automate a lot of disjointed skills, but knowing how to use the skills together to solve problems is hard. There was a webinar with Mehran Sahami, who is one of my favorite professors. He was chair of the curriculum at the CS department at Stanford, so he has spent a lot of time thinking about CS education and what students should learn nowadays in the era of AI coding. And the other person was Andrew Ng, who of course is a legend in the AI space. And Professor Sahami said something very interesting. He said a lot of people think that CS is about coding, but it’s not. Coding is just a means to an end.
CS is about system thinking, using coding to solve actual problems, and problem solving will never go away, because as AI can automate more stuff, the problems just get bigger. But the process of understanding what caused an issue and how to design a step-by-step solution to it will always be there. As an example, I actually have a lot of issues with AI when it comes to debugging. I’m not sure if you use AI a lot for coding, but something I have noticed, and also seen from my friends, is that it is pretty good when you have very clear, well-defined tasks: maybe writing documentation, fixing a specific feature, or building an app from scratch that doesn’t have to interact with a large existing code base. But when you add something a little bit more complicated, maybe something that requires interaction with other components, it’s usually not that good.
For example, I was using AI to deploy an application, and I was testing out a new hosting service I was not familiar with. One thing working with AI does give me is the confidence to try a new tool. Before AI, trying a new tool meant reading the documentation from the beginning, but now I was like, okay, just try it out and learn. So I was testing out this new hosting service and I kept getting a bug, which was very, very annoying. And I asked [inaudible 00:53:51] to fix it. And it kept changing things: maybe change the environment variable, fix the code, maybe change from this function to that function, maybe change the language, maybe it doesn’t process JavaScript, I don’t know, whatever. And it didn’t work. And I was like, okay, that’s it.
I’m just going to read the documentation myself and see what’s wrong. And it turns out I was on another tier, and the [inaudible 00:54:16] I wanted is not available in that tier, right? So the issue was that [inaudible 00:54:22] was just focused on fixing things in one component when the issue came from a different component. So you need to understand how different components work together and where the source of an issue might come from. You need a holistic view of it. And it made me think, okay, how do we teach AI system thinking? Maybe have human experts provide very much [inaudible 00:54:46] scaffolding, like, okay, for this kind of problem, look into this, look into that, and so on. So [inaudible 00:54:53] that could be one way, but it also made me think, how do we teach humans system thinking? Yeah. So I think it’s a very interesting skill. I do think it’s very important.
Lenny Rachitsky: That’s exactly the same insight Bret Taylor shared on the podcast. He’s the co-founder of Sierra. He created Google Maps. He was CEO of Salesforce, Quip, a few other things. And I asked him just like, should people learn to code? And his point is exactly what you said, which is taking computer science classes is not about learning Java and Python. It’s learning how systems work and how code operates and how software works broadly, not just, here’s a function to do a thing.
One thing that I wanted to help people understand: you wrote this book called AI Engineering, which is essentially helping people understand this new genre of engineer, and you have this really simple way of thinking about the difference between an ML engineer and an AI engineer, which has a really good corollary to product managers now, an AI product manager versus a non-AI product manager. The way you describe it, and fill in what I’m missing, is that ML engineers build models themselves. AI engineers use existing models to build products. Anything you want to add there?
Chip Huyen: One thing I really dislike about writing books is that you have to define things, and I think no definition will be perfect because there will always be edge cases. But yeah, in general, I think GenAI is more like a service, where somebody builds the models for you and the base model performance is pretty [inaudible 00:56:26]. So it enables people to say, okay, now I want to integrate AI into my product, and I don’t need to learn [inaudible 00:56:34], even though knowing that could really help. So it makes the entry barrier really low for people who want to use AI to build products, and at the same time, AI capabilities are so strong that it also increases the possibilities, the types of applications that AI can be used for. So I think yes, the entry barrier is super low and the demand for AI applications is a lot bigger. So it’s very, very exciting. It opens up a whole new world of possibilities.
Lenny Rachitsky: Yeah. It’s like now you don’t have the time, now you don’t have to spend time building this AI brain. Now you can just use it to do stuff, such an unlock. Okay. Maybe just a final question. You get to see a lot of what’s working, what’s not working, where things are heading. I’m curious just if you had to think about in the next two or three years, just where things are heading, how do you think building products will be different? How do you think companies working will be different if you had to think of maybe the biggest change we expect to see in the next few years, in terms of how companies work?
Chip Huyen: I think a lot of organizations don’t move that fast, but at the same time, they move faster than I expected. Again, I think I’m biased, because I don’t work with dinosaur companies who don’t care. A lot of the executives who come to me are very forward-looking, so I’m very biased towards organizations that move fast. So yeah, one big change I see is in organizational structure. I think this a lot of value plays in… So before, we had a lot of disjointed teams. We had a very clear engineering team and product team, but then there’s the question of who should write evals. Who should own the metrics? And it turns out eval is not a separate problem. It’s a system problem, because you need to look into different components and how they interact with each other. You need to understand user behavior, because you need to know what users care about so that you can write evals that reflect what users care about.
So looking into different component architectures, placing guardrails and so on, that’s engineering, but understanding users is product. Because of that, and because eval is extremely important, it kind of brings the product team and the engineering team, even the marketing team, like user acquisition, very close to each other. So in a way, people are restructuring so that there’s more communication between previously very distinct functions. Another thing I also see is teams thinking about what can be automated in the next few years and what work cannot be automated. And I’ve seen people already shedding, actually it’s a little bit scary to think about it, but some teams have told me, okay, this is between you and me, but we have got rid of these functions, for a lot of things that were previously outsourced, for example.
Traditionally, a business outsources what is not core to it and can be more systematized. So with that, you can actually use AI to automate a lot of it. And separately, people are thinking more about what the value of junior engineers or senior engineers is, and how we should restructure the engineering org for that. Yeah, so I definitely think that is one thing in successful organizations. People are moving pieces around and thinking about use cases, whether they need to spin out new use cases, and who would lead a new effort. That is one big change. Another thing, in terms of AI, and I’m not sure how true this is, but I guess I’m also in the camp that thinks it has merit, is the view that, okay, base models have probably not quite maxed out, but we’re unlikely to see really, really strong, crazily stronger models.
So you remember when we had GPT, right? And then GPT2, which was a big step up, an [inaudible 01:00:49] better than GPT, and then GPT3, which was much, much bigger, and GPT4, much, much bigger. And then of course GPT5, but whether GPT5 is the same scale of step jump compared to the previous one, I think it’s debatable. So I think the base model performance improvement is not going to be as mind-blowing as it was in the last three years. I think a lot of the improvement will be in the post-training phase and in the application building phase. That’s where I feel we will see a lot of improvement. I’m also very interested in multimodality. We’ve seen a lot of text-based use cases, but I think there are a lot of audio and video use cases that are very, very exciting.
And I think audio is not quite as solved. I work with a couple of voice startups, and voice is an entirely different beast. Let’s say we go from a text chatbot to a voice chatbot. The constraints are completely different, because with a voice chatbot we need to think about latency. There are multiple steps: first voice to text, then text to text, the text question into a text answer, and then text to voice for the answer. So you have multiple hops, and latency becomes very important. And there’s the question of what makes it sound natural. For example, when humans talk to each other, if you try to interrupt me and say, Chip [inaudible 01:02:36], I would pause and try to hear you out.
But sometimes you just say some word to acknowledge, like mm-hmm, mm-hmm, and then I shouldn’t stop. I should just continue. So the question of interruption, whether I should stop or not, is a big part of what is perceived as natural conversation. And there are also regulations, because a lot of the time people want to build voice chatbots that sound like humans, to try to trick users into thinking that they’re talking to humans, but there’s also potential regulation saying, okay, you have to disclose to users whether they are talking to a human or an AI. So I think this whole space is not quite as solved as you’d think. But it’s not quite an AI foundation model problem, because human interruption detection is actually a classical machine learning problem.
It’s a different framing, but you can build a classifier for that. Or the question of latency is actually a massive engineering challenge, not an AI challenge. Of course, it can be an AI challenge, because people are trying to build voice-to-voice models. So instead of having to first transcribe my voice into text, then get a model [inaudible 01:03:54] text answer, and then get another model to turn the text into speech, you can just do voice-to-voice directly. So that is something we’re working on, but it’s very hard. Yeah. So even audio, I think it’s easier than video, because video has both image and voice, and it’s already pretty hard. So I think there are a lot of challenges in that space.
Lenny Rachitsky: That was an awesome list of things. Let me mirror them back real quick. So what you’re predicting in the next few years, things that will change in the way we work, and these actually resonate with so many conversations I’ve had on this podcast. So says, just kind of doubling down on where things are heading. One is the blurring of lines between different functions instead of just design engineering. Everyone’s going to be doing a lot of different things now. Two is, just more of work being automated with agents and all these AI tools and just in theory, productivity going up. Third is, a shifting from pre-training models to post-training, fine-tuning and things like that because to your point, models maybe are slowing down in how smart they’re getting.
Although I’ll point folks to a chat I had with the co-founder of Anthropic. He made a really good point here. He’s like, we’re really bad at understanding what exponentials feel like when we’re in the middle of one. And also, models are being released more often, so we may not notice the difference between them because releases are just happening more often, versus GPT3 coming out a year after GPT2. Maybe true, maybe not. And then the fourth point you made is this idea of multimodal, investing in multimodal experiences. I cannot wait for ChatGPT voice mode to get better at interruption, exactly what you’re saying. I’m just talking to it and then someone makes a little sound and it’s like [inaudible 01:05:33]. Okay. And then it stops talking. It’s so annoying.
Chip Huyen: I’m shocked that we don’t have better voice assistant at home yet. I think I have been testing out a bunch, honestly. I keep hoping, oh my God, that could be the one and then I know how many of them I just had to give away because they’re not that good.
Lenny Rachitsky: I think it’s coming. I hear it’s coming. Anthropic’s working with someone that, I don’t know if it’s launched or not yet.
Chip Huyen: Yeah, [inaudible 01:05:55] want to go back to what you mentioned about your guest from Anthropic and the performance improvement. I think there’s a difference between a model’s base capability, so I’m talking about the pre-trained model, versus the perceived performance. So let’s say, are you familiar with the term test-time compute?
Lenny Rachitsky: I don’t think so. Help us understand.
Chip Huyen: So the idea is, okay, you have a fixed amount of compute. You’re going to spend a lot of compute on pre-training the model, and then you spend some compute on fine-tuning, and the ratio of pre-training to post-training compute varies a lot between different labs. And then you also spend compute on inference. Once you have a trained and fine-tuned model, you want to serve it to users. So I might type a question in a prompt, and the model does inference to generate an answer, and that requires compute. So there’s a discussion of whether to spend more compute on pre-training, fine-tuning, or inference. Spending more compute on inference is called test-time compute. It’s a strategy of allocating more compute resources to inference, which should bring better performance. And how does it do that?
Let’s say you have a math question, and instead of generating just one answer, you generate four different answers and pick whichever is the best according to some standard. Or, okay, I have four answers, and three of them say 42 and one of them says 20. Three of them are in agreement, so the answer should be 42. So people generate a bunch of answers. Another thing is reasoning, or thinking: the model generates more thinking tokens, spending more time thinking before showing the final answer. It requires more compute but also gives better performance. So from the user’s perspective, when the model spends more time exploring different potential answers and thinking longer, it can give you much better final answers. But the base model itself does not change.
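The “three say 42, one says 20” case can be sketched as a majority vote over sampled answers. This is a minimal illustration, assuming the four answers have already been generated by some model; `majority_answer` is a hypothetical helper name, and the sampling itself is not shown.

```python
from collections import Counter

def majority_answer(samples: list[str]) -> str:
    """Return the most frequent answer among several sampled generations."""
    counts = Counter(samples)
    answer, _ = counts.most_common(1)[0]
    return answer

# Four sampled answers to the same math question, as in the example.
samples = ["42", "42", "20", "42"]
print(majority_answer(samples))  # prints 42
```

The extra compute goes into producing four samples instead of one; the model weights are untouched. If several answers tie for the top count, `Counter.most_common` returns the one that appeared first, so a real system would need an explicit tie-breaking rule.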
Lenny Rachitsky: Awesome.
Chip Huyen: Does it make sense?
Lenny Rachitsky: Yes, that does. Absolutely.
Chip Huyen: Yeah?
Lenny Rachitsky: That is a good corollary to Ben Mann’s point.
Chip Huyen: Yeah.
Lenny Rachitsky: Chip, we covered a lot of ground. I’ve gone through everything I was hoping to learn and more. Before we get to a very exciting lightning round, is there anything else that you wanted to share? Anything else you want to leave listeners with?
Chip Huyen: So I do work with a few companies that want employees to come up with ideas. There’s a big debate on what is the better approach to AI strategy: should it be top down or bottom up? Should executives come up with one or two killer use cases and have everyone allocate resources to those, or should you let engineers and PMs and smart people come up with ideas? And I think it’s a mixture of both. So at some companies it was like, okay, we hire a bunch of smart people, let’s see what they come up with, and they organize hackathons or internal challenges to get people to build products. And one thing that I noticed is that a lot of people just don’t know what to build. And it shocked me. I feel like we are in some kind of an idea crisis, right?
Now we have all these really cool tools to help you do everything from scratch. They can help you design, they can help you write code, they can build websites. So in theory, we should see a lot more, but at the same time, people are somehow stuck. They don’t know what to build. And I think maybe a lot of it has to do with societal expectations, because we have gone into this phase of specialization. People are very highly specialized, and people are supposed to focus on doing one thing really well instead of seeing the big picture. And when we don’t have a big-picture view, it’s hard to come up with ideas of what to build.
So when I work with these companies on hackathons, we work on coming up with a guideline for how to come up with ideas. And usually one tip is, look back at the last week. Or for a week, just pay attention to what you do and what frustrates you. And when something frustrates you, think about, is there anything we can do? Can it be done a different way, so it’s not frustrating? And you can talk, people can swap to accept [inaudible 01:10:27] or teams, and even see if they have common frustrations. Maybe there’s something you can build around that. So yeah, I feel like, just notice how you work, think of ways, constantly ask the question, how can this be better? And then just build something to address the frustrations. I think it’s a good way to learn and adopt AI.
Lenny Rachitsky: I think people have felt exactly what you’re describing every time they open up one of these vibe coding tools where you could just describe anything you want. I’m like, “I don’t know, what do I want?” And I love this very tactical piece of advice: what frustrates you? Just pay attention to where you’re frustrated. For example, I just built a very cool little vibe coded app. I was working on a newsletter post inside Google Docs and I pasted all these images into the Google Doc, from screenshots and stuff, and then I remembered, oh yeah, you can’t take images out of Google Docs. It’s like this Hotel California experience where you can paste stuff into it, but it’s very hard to get images back out. So I just went to all the vibe coding tools and built an app where I can give it a Google Doc URL and it lets me download all the images automatically. And it worked amazingly well and I made it really cute. And I’ll link to it in the show notes.
Chip Huyen: Oh, I would love to see that. I’m very bullish on using AI to create micro tools, just something to make your life a bit easier.
Lenny Rachitsky: A hundred percent. I feel like that’s one of the main ways people are using these tools, just a little niche problem they have. With that, Chip, we’ve reached our very exciting lightning round. I’ve got five questions for you. Are you ready?
Chip Huyen: Yeah, always. No, no, no. It depends on how hard the questions are.
Lenny Rachitsky: They’re very consistent across every guest. So I imagine you’ve heard them before. First question, what are two or three books that you find yourself recommending most to other people?
Chip Huyen: I’m really terrified of book recommendations because I feel like what books [inaudible 01:12:15] you should read really depends on what they want, where they are in life, and where they want to get to. But there are several books that I do think have really changed the way I think and see the world. One is The Selfish Gene. It actually helped me with the question of whether I want to have kids or not, because it helped me understand that a lot of the way we operate is a function of our genes, and genes want to do one thing: to procreate.
So yes, in a way, the book also proposes another thing: everyone wants to live forever. Maybe not consciously, but subconsciously, we do want that. And there are two ways. One is via genes. Genes [inaudible 01:13:00] want to continue forever, but [inaudible 01:13:03] two ideas. I think there’s something [inaudible 01:13:05]. If you put some ideas out there and they last for a long time, they’re going to live on. I know it’s a little bit abstract, but I thought it was very interesting.
The other book I really, really like is the book from Singapore’s former, I think he is [inaudible 01:13:24] as the Father of Singapore, I don’t know, Lee Kuan Yew. I’m not sure what the title is, but he was the one who changed Singapore from a third world country to a first world country within 25 years. And I have never seen any country’s leader put so much effort into writing down his thoughts on how to build a country like that.
And he talks a lot about public policy, how to create policies that encourage people to do the right things that are good for the nation, and he also talks about foreign affairs, foreign policy, the liberation of the country, among others. So it’s a really good book to think about. For me, it’s systems thinking, but it’s a different kind of system, a country, which a lot of us don’t ever get a chance to experiment with in our lives. So it’s good to learn about that.
Lenny Rachitsky: What was the name of that second book?
Chip Huyen: It’s called From Third World to First. Actually, I think I have it somewhere here. Yeah.
Lenny Rachitsky: There it is.
Chip Huyen: It’s a very heavy book.
Lenny Rachitsky: Show and tell.
Chip Huyen: Yeah.
Lenny Rachitsky: That’s awesome. I definitely want to read that. That’s a really good [inaudible 01:14:26]. I’ve heard a lot about just the impact he’s had and I’ve seen all these videos on Twitter of just his really wise insights into how to build a thriving society. And clearly, it works.
Chip Huyen: Yeah. Can you believe it? How did he have time to write such a thick book? It’s insane.
Lenny Rachitsky: That is. Claude, please summarize. I’m just joking. By the way, Selfish Gene, I also absolutely love that book. That is such a good choice. It’s such an under the radar kind of book that really changed the way I see the world as well. So really good pick. Okay, next question. Do you have a favorite recent movie or TV show you really enjoyed?
Chip Huyen: So I watch a lot of movies and TV shows as research, because I’ve been working on my first novel and I recently sold it. It’s a drama. It’s not science fiction or anything that tech people usually read. I know it’s very out of left field. So it’s almost like watching TV to see what kind of stories become popular, trying to understand the tropes and stuff like that. So I’m not sure if the audience will like…
Lenny Rachitsky: Well, what’s one? What’s one that taught you something about writing?
Chip Huyen: I think like Yanxi Palace. It’s a Chinese TV show.
Lenny Rachitsky: Cool. Okay. I haven’t heard that one on the podcast before. Okay, cool.
Chip Huyen: Yeah.
Lenny Rachitsky: Next question. Do you have a life motto that you often think about, come back to when you’re dealing with something hard, whether it’s in work or in life?
Chip Huyen: This sounds very nihilist, but I’d say: in the end, nothing really matters. Usually I think, in the grand scheme of things, in a billion years, no one will be there. Okay, someone will argue with me about that. [inaudible 01:16:05]. So my theory is, in a billion years, none of us will exist. So whatever messy, crazy things we do, or however badly we do them, no one will be there to remember it. And in a way it sounds scary, but it’s very liberating, because it allows me to say, okay, let’s just try things out, right? Why does it matter? And there’s a recent story. I had a family member who passed away recently, and I was talking to my dad because I couldn’t be home for that.
I was asking my dad, “Okay, is there anything I can do to make the person…” Something like comfort, anything I could get for the person. And my dad was just like, “What can he possibly want at this moment?” It made me feel that at the end of life, there’s nothing material that can bring you joy. No money, no product, nothing. And in a way, it makes me ask, okay, what do I really care about at the end of the day? So I think about it like, okay, maybe I fail, maybe I don’t get that contract, maybe those things happen, but at the end of life, I don’t think that really matters. So in a way, it’s quite liberating.
Lenny Rachitsky: I know you said it might be nihilistic. This is what Steve Jobs shared too in one of his most famous speeches. We all die someday, so don’t take things so seriously, and it is freeing. Absolutely. It just makes you appreciate every moment, every day you have. Just like, yeah, let’s just do something hard and scary. Okay, final question. You talked about how you’re writing a novel. Most people in tech have never written something creative like fiction. What’s just one thing you learned in the process about how to write better stories, better fiction?
Chip Huyen: A lot of the time when we read, we get tripped up by some small things. I wanted to do creative writing because I just want to become a better writer, and it felt like maybe trying a different audience could help me become better at anticipating what a different type of audience would want to hear and what they care about. So it’s a way for me to get a… So I think writing, or any kind of content creation, is about predicting the user’s reactions, right?
Lenny Rachitsky: The next token.
Chip Huyen: You do a podcast.
Lenny Rachitsky: Just kidding.
Chip Huyen: Yeah. Yeah, so you do a podcast, and it’s like, okay, what kind of things would the users find engaging, right? And I see this with a lot of companies too: when you launch a product, you have a narrative coming out and you say, okay, how do we position this product in a way that users would want? So I have done technical writing for a while, and I felt like I had some experience trying to predict what engineers would want to hear or care about. But I didn’t have any experience with this completely different type of audience. So that’s why I wanted to do creative writing, writing a story. And that’s why I was doing a lot of research [inaudible 01:18:55]. I mean, doing research [inaudible 01:18:56] enjoyed a lot, watching a lot of dramas. I just see what people like. So one thing that I care about is, I think I learned what an emotional journey was from an editor.
So when we write something, we care about how users would feel across the story. In the beginning, we need to have a hook so that people continue reading. But we also don’t want too much drama, because readers get tired, they’re emotionally exhausted, because it’s like being emotionally manipulated a lot of the time. So you give them an emotional journey, maybe have some climax, then something more chill, maybe like… And another thing I didn’t realize is, for technical writing, you entirely focus on the content, the argument. It’s very impersonal. For example, for people learning about ML compilers, it doesn’t matter if they like the person telling them about compilers or not, because it’s just objective [inaudible 01:19:56]. But for a novel, people care about character likeability.
So in the first version of my story, I made the main character very logical, very rational, someone who just does everything very rationally. And then I had a very good friend read it, he’s an amazing person, he’s a great person. And he was like, “Chip, I’ll be honest, I hate that person.” So it doesn’t matter what the story is; the person is so unlikeable that he doesn’t want to continue. So in the second version, I made the character more likable. And how you make a character more likable is you put in some vulnerability. Sometimes it’s like, okay, maybe this person has a setback, because that’s something we can relate to. So in a lot of ways, it’s very interesting. A lot of it is about understanding the emotional bit, how the users feel, not just about the story but also about the characters.
Lenny Rachitsky: That is so interesting. Wow. I learned a lot more there than I thought. That was awesome. Really good example. Chip, two final questions. Where can folks find you online if they want to reach out and maybe work with you? Maybe even just share the stuff that you offer. And then how can listeners be useful to you?
Chip Huyen: I’m on social media, LinkedIn, Twitter. I don’t post a lot, but I keep telling myself that I should do more because I kind of like the conversation with readers. So I’m actually about to start a Substack. I have a placeholder for the Substack right now, and I’m thinking of focusing it more on system thinking because I think it’s a very interesting skill. I’m also thinking of doing a YouTube channel on book reviews, basically books that help you think better. I think the first book I review will probably be this book, because it’s my favorite book from growing up and I keep rereading it. So yeah, how can you be helpful? Send me books that you like, books that have changed the way you think or changed the way you do anything. I would appreciate it.
Lenny Rachitsky: Amazing. I’m excited to read that book.
Chip Huyen: Mm-hmm.
Lenny Rachitsky: Chip, thank you so much for being here.
Chip Huyen: Thank you so much, Lenny, for having me.
Lenny Rachitsky: Bye everyone. Thank you so much for listening. If you found this valuable, you can subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Also, please consider giving us a rating or leaving a review as that really helps other listeners find the podcast. You can find all past episodes or learn more about the show at lennyspodcast.com. See you in the next episode.
Glossary
| English | 中文 |
|---|---|
| agentic RAG | agentic RAG(代理式检索增强生成) |
| AI literate / AI literacy | AI 素养 |
| ARR | ARR(年度经常性收入) |
| base model | 基础模型(base model) |
| classifier | 分类器(classifier) |
| Claude Shannon | 克劳德·香农 |
| code review | 代码审查(code review) |
| coding agent | 编码代理(coding agent) |
| conversion rate | 转化率 |
| distillation | 蒸馏(distillation) |
| eval | 评估(eval) |
| fine-tuning | 微调(fine-tuning) |
| frontier lab | 前沿实验室(frontier lab) |
| GenAI | 生成式 AI |
| guardrails | 护栏(guardrails) |
| inference | 推理(inference) |
| labeled data | 标注数据(labeled data) |
| latency | 延迟(latency) |
| MCP | MCP(Model Context Protocol) |
| multi-modal RAG | 多模态 RAG(多模态检索增强生成) |
| multimodality | 多模态(multimodality) |
| post-training | 后训练 |
| PR | PR(Pull Request) |
| pre-training | 预训练 |
| RAG | RAG(检索增强生成) |
| randomized trial | 随机试验 |
| reinforcement learning | 强化学习(reinforcement learning) |
| reward model | 奖励模型(reward model) |
| RLHF | RLHF(基于人类反馈的强化学习) |
| sampling strategy | 采样策略(sampling strategy) |
| soft skills | 软技能 |
| supervised fine-tuning | 监督微调(supervised fine-tuning) |
| supervised learning | 监督学习(supervised learning) |
| system thinking | 系统思维(system thinking) |
| test time compute | 测试时计算(test time compute) |
| tier | 服务层级(tier) |
| token | token |
| unsupervised learning | 无监督学习(unsupervised learning) |
| verifiable rewards | 可验证奖励(verifiable rewards) |
| vibe coding | vibe coding(用自然语言描述需求、由 AI 生成代码的开发方式) |
| voice-to-voice model | 语音到语音模型(voice-to-voice model) |
| weights | 权重(weights) |