A/B 测试终极指南 | Ronny Kohavi（Airbnb、Microsoft、Amazon）

Ronny Kohavi 2023-07-27

文字稿

Ronny Kohavi (00:00:00): 我很明确地表示，我是“测试一切”的坚定拥护者：你做的任何代码变更、引入的任何功能，都必须放在某个实验里。因为我一次又一次地观察到令人惊讶的结果：即使是很小的 bug 修复、很小的改动，有时也会产生出人意料的影响。

Ronny Kohavi (00:00:22): 所以我认为实验不可能做得太多。有时候你必须把资源分配给那些高风险、高回报的想法。我们要尝试一些大概率会失败的东西，但如果它赢了，那就是一个本垒打。

Ronny Kohavi (00:00:38): 而且你必须准备好理解并接受，大多数尝试都会失败。我见过太多次，人们提出新设计或激进的新想法，并且深信不疑，这没问题。我只是不断提醒他们：“如果你要做大事，那就试试看，但要准备好 80% 的时候会失败。”

嘉宾介绍

Lenny (00:01:05): 欢迎来到 Lenny’s Podcast，在这里我采访世界级的产品领导和增长专家，从他们打造和增长当今最成功产品的宝贵实战经验中学习。

Lenny (00:01:14): 今天的嘉宾是 Ronny Kohavi。Ronny 被许多人视为 A/B 测试和实验领域的世界级专家。最近，他在 Airbnb 担任副总裁兼技术 Fellow，领导搜索体验团队。在此之前，他在 Microsoft 担任企业副总裁，领导 Microsoft 实验平台团队。再之前，他在 Amazon 担任数据挖掘和个性化总监。

Lenny (00:01:38): 目前他全职从事咨询和教学工作。他也是实验领域的权威著作《Trustworthy Online Controlled Experiments》的作者。在节目笔记中，你会找到一个优惠码，可以用折扣价参加他在 Maven 上的直播小班课程。

本期内容预告

Lenny (00:01:53): 在我们的对话中，我们会非常实操地深入 A/B 测试。Ronny 分享了他的建议：何时应该开始在公司做实验，如何改变公司文化使其更具实验驱动性，哪些迹象表明你的实验可能无效，为什么信任是成功的实验文化和平台最重要的要素，以及如果你想在公司开始运行实验该如何入手。他还解释了 P 值到底是什么，以及所谓的 Twyman’s law，另外还有一些关于 Airbnb 和实验的犀利观点。这期节目适合所有对在公司创建实验驱动文化感兴趣的人，或者想要优化现有实验体系的人。在简短的赞助商信息之后，请享受与 Ronny Kohavi 的这期节目。

赞助商信息

Lenny (00:02:39): 本期节目由 Mixpanel 赞助。以公平的、随增长扩展的价格，深入了解用户在漏斗每个阶段的行为。Mixpanel 让你快速获取关于用户的答案——从认知、获客到留存。通过在 Mixpanel 中直接捕获网站活动、广告数据和多点归因，你可以改善完整用户漏斗的各个方面。基于第一方行为数据而非第三方 cookie，Mixpanel 在更强大和更易用方面都超越了 Google Analytics。探索适合各种规模团队的方案，看看 Mixpanel 能为你做什么，请访问 mixpanel.com/friends/lenny。此外，他们也在招聘，所以去看看 mixpanel.com/friends/lenny。

Lenny (00:03:27): 本期节目由 Round 赞助。Round 是一个由科技领导者为科技领导者打造的私密网络。Round 结合了教练、学习和真诚关系的最佳要素，帮助你明确方向并加速前进——这就是为什么他们的候补名单上有数千名科技高管。Round 的使命是塑造科技的未来及其对社会的影响。在科技领域领导是一项独特的挑战，而在理解你日常经历的领导者身边，做好这件事会容易得多。当我们与对的人相遇并建立关系时，我们更有可能学习、发现新机会、灵活思考并实现目标。建立和管理你的人脉不必让人觉得是在”社交”。加入 Round，让自己被来自科技领域最具创新力公司的领导者环绕。建立关系，获得启发，付诸行动。访问 round.tech/apply 并使用优惠码 Lenny 跳过候补名单。网址是 round.tech/apply。

最令人意外的实验结果

Lenny (00:04:30): Ronny，欢迎来到节目。

Ronny Kohavi (00:04:33): 谢谢邀请。

Lenny (00:04:34): 你被很多人视为 A/B 测试和实验领域最顶尖的专家，我认为这是每家产品公司最终都会尝试去做的事情——而且往往做得很糟糕。所以我非常期待深入探讨实验和 A/B 测试的世界，帮助大家跑出更好的实验。再次感谢你的到来。

Ronny Kohavi (00:04:54): 很好的目标，谢谢。

Lenny (00:04:56): 让我从一个有趣的问题开始吧。你运行过的 A/B 测试中，最出人意料的或者说结果最令你惊讶的是哪个？

Ronny Kohavi (00:05:06): 我在书里和课堂上常用的开篇案例，应该是我们能公开讨论的最令人惊讶的案例了。这是一个相当有趣的实验。有人提议改变必应（Bing）搜索引擎上广告的展示方式。他的想法基本上是：“把第二行拿上来，提升到第一行，让标题行变得更大。”

Ronny Kohavi (00:05:37): 当你想到这一点——如果你去看我的书或者课程，有实际的截图展示发生了什么——但实事求是地说，这看起来就是一个”也就那样”的想法。为什么这会是一个值得做的、有趣的事情呢？事实上，当我们回头去看待办列表的时候，它在上面躺了好几个月，无人问津，很多其他事项的优先级都比它高。

Ronny Kohavi (00:06:05): 但关键在于，这个改动实现起来微不足道。所以如果你考虑投资回报率，我们只需要几个工程师花几个小时实现它，就能拿到数据。

Ronny Kohavi (00:06:19): 事情正是如此。必应团队里有人不断在待办列表里看到这个条目，说：“天哪，我们花了太多时间讨论它了，我直接实现就好了。“他确实这么做了——花了几天时间实现，然后按照必应的惯例，上线了实验。

Ronny Kohavi (00:06:37): 然后有趣的事情发生了。我们收到一个告警——重大升级，收入指标出了问题。这个告警以前也触发过几次，都是真正的错误，比如有人把收入记录了两次，或者有数据问题。但这次没有 bug。那个简单的想法让收入增长了大约 12%。

Ronny Kohavi (00:07:01): 这种事简直不会发生。我们稍后可以谈谈 Twyman’s law，但当时的第一个反应就是：“这好得难以置信。找个 bug 吧。“我们确实去找了。我们反复检查了好几次，重复了好几次实验，没有任何问题。在那个必应规模小得多的时期，这个改动价值一亿美元。

Ronny Kohavi (00:07:22): 关键在于，它没有损害用户体验指标。通过做一些表面功夫来增加收入很容易——展示更多广告就是提高收入的简单方法，但会损害用户体验。我们做过实验证明了这一点。而这次，这是一个纯粹的本垒打——提高了收入，没有显著损害护栏指标。我们对这样一个微不足道的改动产生了如此大的效果感到敬畏。那是必应历史上收入影响最大的单一改动。

Lenny (00:07:57): 基本上就是两行互换位置，对吧？搜索结果里换了两行。

Ronny Kohavi (00:08:02): 就是把第二行挪到第一行。之后你会去跑大量实验来理解到底发生了什么。是不是因为标题行的字体更大，有时颜色不同？所以我们跑了一大批实验。

Ronny Kohavi (00:08:16): 通常就是这样。我们有了一个突破。你开始更多地理解——我们能做什么？然后突然转向：“好，还有哪些事情可以提升收入？“我们想出了很多后续想法，帮助很大。

Ronny Kohavi (00:08:34): 但对我来说，这是一个微小的改动成为必应历史上最佳收入创意的例子，而我们当时并没有正确评估它。没有人给它后来回想起来应得的优先级。这种情况经常发生。我是说，我们经常因为在预测实验结果方面表现糟糕而感到谦卑。

Lenny (00:09:01): 这让我想起我在 Airbnb 时做过的一个经典实验，我们稍后会聊到 Airbnb。搜索团队只是跑了一个小实验——如果每次有人点击搜索结果时都打开一个新标签页，而不是直接跳转到那个房源页面，会怎样。那是搜索领域最大的胜利之一——

Ronny Kohavi (00:09:18): 顺便说一下，我不知道你是否了解这件事的历史，我在课上讲过这个。我们在大约 2008 年就做过这个实验。这比 Airbnb 早得多。我记得当时争论很激烈。为什么要在一个新标签页里打开？用户并没有要求这个。设计师们也强烈反对。我们还是跑了那个实验。同样，这是一个高度令人意外的结果，我们从中学到了很多。

“新标签页”实验的传播与遗忘

Ronny Kohavi (00:09:49): 我们最开始做的，是在英国为打开 Hotmail 做的，然后我们把它移到了 MSN，让搜索在新标签页中打开，所有这一系列实验都带来了非常非常大的收益。我们发表了这些成果。我必须说，当我来到 Airbnb 时，我跟我们共同的朋友 Ricardo 聊过这件事。当时这个做法确实被做过，效果很好，但后来就半被遗忘了，这也是关于组织记忆的一个教训。当你有成功的案例时，一定要把它们沉淀下来、记住它们。所以在 Airbnb，在我加入之前很长一段时间，房源页面在新标签页中打开这个做法已经存在了，但后来新设计的东西却没有沿用这种方式。我把它重新介绍给团队，我们看到了巨大的提升。

Lenny (00:10:35): 致敬 Ricardo，我们共同的朋友，是他帮忙促成了这次对话。人们一直在寻找一种实验的圣杯——一小时的工作换来巨大的成果。我猜想这非常罕见，不应期待它经常发生。以你的经验来看，你多久会发现一块就这样躺在那里的金块？

实验成功率与”英寸式推进”

Ronny Kohavi (00:10:57): 是的。这又是一个我非常关心的话题。每个人都想要这些惊人的结果，我在书的第一章展示了多个这样的例子——微小的努力，巨大的收益。

Ronny Kohavi (00:11:13): 但正如你所说，它们非常罕见。我认为大多数时候，胜利是这样一寸一寸地取得的。我在书里展示了一张真实图表，是必应广告团队如何随时间提升每千次搜索收入的过程，你可以看到每个月都有小幅提升、小幅提升。有时候因为法律原因或其他因素会出现退化。比如有人担心我们的广告标注不够规范，你不得不突然去做一些你知道会损害收入的事情。但是的，我认为大多数结果都是这样一寸一寸推进的。你做大量小幅改进。我想说最好的例子是几个我能公开讲的。

Ronny Kohavi (00:12:00): 一个是在必应，相关性团队，数百人全部致力于提升必应的相关性。他们有一个指标，我们稍后会聊到 OEC，即总体评估准则（Overall Evaluation Criterion）。但他们的目标是每年将这个指标提升 2%。这是一个很小的数字，而这 2% 你可以看到这里是 0.1，这里是 0.15，这里是 0.2，然后它们加起来大约每年 2%，这非常了不起。

Ronny Kohavi (00:12:28): 另一个我能公开说的 Airbnb 的例子是，我在任期间搜索相关性方面跑了大约 250 个实验。同样，小改进不断累积，最终带来了 6% 的收入提升。所以当你想到 6%，这是一个很大的数字，但它不是来自一个想法，而是许多更小的想法，每个都给你带来一点收益。

Ronny Kohavi (00:13:00): 事实上，还有一个我可以说出来的数字。在这些实验中，92% 未能提升我们试图改善的指标。所以只有 8% 的想法真正成功地推动了关键指标。

Lenny (00:13:17): 这里有太多线索我想追问，但让我先跟这一条。你刚提到 92% 的实验失败了。以你在很多公司见过大量实验运行的经验来看，这典型吗？人们跑实验时应该预期什么？他们应该预期多大的失败比例？

实验失败率的行业数据

Ronny Kohavi (00:13:31): 首先，我在职业生涯中发表过三个不同的数字。在微软总体来说，大约 66%，也就是三分之二的创意失败了。不要把 66 这个数字看作精确值，大概是三分之二。在必应，这是一个优化了很长时间的领域，失败率在 85% 左右。所以对一个你已经优化了很久的东西来说，提升更难。然后在 Airbnb，92% 这个数字是我观察到的最高失败率。

Ronny Kohavi (00:14:09): 我也引用过其他来源的数据。并不是我所在的团队特别差，Booking、Google Ads 等其他公司发表过的数字也在 80% 到 90% 的创意失败率之间。这就是实验的重要性所在。需要认识到的是，当你有一个实验平台时，很容易获得这个数字——你看有多少实验跑了，其中多少被上线了。并不是每个实验都对应一个创意。

Ronny Kohavi (00:14:39): 所以可能的情况是，当你有一个想法时，第一次实现后启动一个实验，轰的一下就糟糕透顶，因为有 bug。事实上，大约 10% 的实验会在第一天就被中止。这些通常不是因为想法不好，而是存在实现上的问题或者我们没有考虑到的东西，迫使你中止实验。

Ronny Kohavi (00:15:01): 你可能迭代、转向再来。最终，如果你做了两三次或四次转向或 bug 修复，你可能会成功上线。但那些 80% 到 92% 的失败率是按实验计算的。

Ronny Kohavi (00:15:17): 非常令人谦卑。我知道每个开始跑实验的团队，总是先觉得自己与众不同，成功率会高得多，结果都被现实教育了。

Lenny (00:15:29): 你提到”点击链接打开新标签页”这个模式在很多不同地方都奏效了。

Ronny Kohavi (00:15:36): 是的。

Lenny (00:15:37): 还有其他类似的版本吗？你有没有收集过一个清单，“这里是当我们想要推动某些指标时通常有效的做法”——你能分享一些吗？我不知道你脑子里有没有这样一个清单。

经验法则与实验模式

Ronny Kohavi (00:15:48): 我可以给你两个资源。一个是我们在微软写的一篇论文，叫 Rules of Thumb，当时我们做的事情就是审视数千个实验，从中提取出一些模式。这是一篇论文，我们可以把链接放到节目笔记里。

Lenny (00:16:07): 完美。

Ronny Kohavi (00:16:08): 不过还有一个更准确的，我想说，我推荐给人们的有用资源。是一个叫 goodui.org 的网站，goodui.org 正是一个试图大规模做你所描述的事情的网站。

Ronny Kohavi (00:16:25): 那个人叫 Jacob [听不清 00:16:28]。他让人们把实验结果发给他，然后把它们归类成模式。到现在大概有 140 个模式了吧。对每个模式，他会说明：“这个帮助了谁？帮助了多少次，效果多大？” 这样你就能知道，比如某个模式在五次尝试中成功了三次，而且影响很大。实际上，你在里面就能找到“在新窗口中打开”这个模式。

Lenny (00:16:54): 我觉得把这个喂给 ChatGPT，基本上就有了一个能创建路线图的产品经理工具了。

Ronny Kohavi (00:17:01): 顺便说一下，总的来说，这很大程度上是组织记忆（institutional memory），也就是你能不能把事情记录得足够好，让组织记住成功和失败，并从中学习？

Ronny Kohavi (00:17:17): 我认为一些公司犯的错误之一是，他们做了大量实验，但从不回过头去总结经验教训。所以我实际上在组织学习这个想法上投入了很多精力，比如每季度举办一次会议，专门分享最令人惊讶的实验。

Ronny Kohavi (00:17:32): 顺便说一下，“令人惊讶”也是一个人们经常不太清楚的问题。什么是令人惊讶的实验？对我来说，令人惊讶的实验是事前的预估结果和实际结果之间差异很大。也就是说，这个差值的绝对值很大。

Ronny Kohavi (00:17:53): 你可能预期某个东西会很好，结果却是持平的。好吧，你学到了一些东西。但如果你预期某个东西效果很小，结果却非常好，就像那个把广告第二行提升为标题的例子，那你就学到了很多。或者反过来，如果你预期某个东西效果很小，结果却是非常负面的，你可以通过理解为什么会这么负面来学到很多。这就很有意思。

Ronny Kohavi (00:18:17): 所以我们不仅关注赢家，也关注令人惊讶的输家，那些人们觉得闭着眼都应该跑的实验。然后出于某种原因，结果却非常负面。有时候，恰恰是那种负面结果能给你带来洞见。实际上，我正好想起一个这样的例子，我应该提一下。

意料之外的实验结果

Ronny Kohavi (00:18:36): 我们在微软做了一个改进 Windows 索引器的实验，团队在离线测试中展示了它在索引方面做得好得多，他们展示了一些相关性更高的结果，还有各种好的方面。然后他们把它作为实验跑了。你知道发生了什么吗？令人惊讶的结果。索引的相关性确实更高了，但它把电池续航给毁了。

Ronny Kohavi (00:19:03): 所以这就是一个从天而降、你完全没预料到的东西。它在笔记本电脑上消耗了大量的 CPU。它在毁掉笔记本电脑。所以，好吧，我们学到了一些东西。把它记录下来。记住这件事，这样我们在设计下一个迭代时就能把这个额外因素考虑进去。

记录与传承实验经验

Lenny (00:19:23): 你对人们真正记住这些意外有什么建议？你说这很大程度上是组织层面的。你建议人们怎么做，才能在三年后有人离职时还记得这些？

Ronny Kohavi (00:19:34): 记录下来。我们内部有一个很大的幻灯片集，记录了这些成功和失败，我们鼓励人们去查看。另一件非常有用的事情就是保留你的全部实验历史，并具备按关键词搜索的能力。

Ronny Kohavi (00:19:52): 所以我有一个想法，输入几个关键词，看看从跑过的数千个实验中……顺便说一下，这些数字是很合理的。在微软，告诉你一下，我 2019 年离开时，我们每年的实验量大约是 2 万到 2.5 万个。所以每个工作日，我们大约启动 100 个新实验组。很大的数字。所以当你在像 Bing 这样跑着成千上万实验的团队里工作时，你希望能问，“有没有人做过关于这个、这个或这个的实验？” 所以这个搜索能力是集成在平台里的。

Ronny Kohavi (00:20:32): 但不止于此，我觉得每季度开一次会，分享最成功的……抱歉，不只是最成功的，而是最有意思的实验，这一点非常关键。这也推动了实验的飞轮效应。

实验是否会导致只做微优化

Lenny (00:20:45): 这正好引出了我想聊的一个话题，就是人们常常有一种担忧，觉得跑太多实验、太数据驱动不好，觉得实验只会把你引向那些微优化，而你并没有真正创新、做大事。你对此怎么看？在你看来，会不会过度依赖实验？

Ronny Kohavi (00:21:07): 我很明确，我是”测试一切”的坚定拥护者，也就是说你做的任何代码变更、引入的任何功能，都必须放在某个实验里。因为我一次又一次观察到这种令人惊讶的结果，即使是小的 bug 修复，即使是微小的变更，有时也会产生出乎意料的影响。

Ronny Kohavi (00:21:30): 所以我认为实验不可能做得太多。但我认为确实有可能只专注于增量变更，因为有些人会说，“我们就围绕这个东西测试 17 个变体吧。”你得思考一下，这就像股票一样，你需要一个投资组合。你需要一些增量实验，朝着你知道只要尝试够多就一定会成功的方向推进。但你也必须时不时把一部分资源分配给那些高风险、高回报的想法。我们要尝试一些很可能失败的东西，但如果它赢了，那就是一个本垒打。

Ronny Kohavi (00:22:14): 所以你必须把一部分精力分配给这类实验，而且你必须准备好理解并接受大多数会失败。我看到过无数次有人提出新设计或激进的新想法，他们对它深信不疑，这没问题。我只是不断提醒他们，“嘿，如果你想搞大的，去试吧，但要做好 80% 会失败的准备。”

高风险实验：Bing 的社交整合

Ronny Kohavi (00:22:42): 有一个真实的例子，我能够谈论它是因为我们把它写进了我的书里——我们在 Bing 试图改变搜索的格局。其中一个想法，一个宏大的想法，就是与社交进行整合。所以我们接入了 Twitter 的 fire hose 数据流，也接入了 Facebook，我们在这个想法上投入了 100 人年的工作量。

Ronny Kohavi (00:23:14): 然后它失败了。你现在看不到它了。它存在了大约一年半，所有实验结果从负面到持平。这是一次尝试。尝试本身是合理的。我觉得我们花的时间稍微长了一点才决定这是一个失败。但至少我们有数据。我们做了数百个实验。没有一个取得突破。我记得给 Qi Lu 发邮件附上一些统计数据，说明是时候中止了，是时候承认失败了。他决定继续。这就是一个百万美元的问题：你继续下去，也许下个月突破就来了，还是中止？几个月后，我们中止了。

Lenny (00:24:07): 这让我想起 Netflix，他们也尝试过一个社交功能，同样失败了。在 Airbnb 早期，也有过一个很大的社交尝试，做的是”这里是你朋友住过的 Airbnb”，完全没有影响。所以也许这是我们应该记录下来的一个经验。

Ronny Kohavi (00:24:21): 是的，这很难。这确实很难。但话说回来，这正是实验的价值所在——实验是一个能给你数据的神谕。你可能对某些想法很兴奋，你可能相信这是个好主意。但最终，神谕就是受控实验。它告诉你用户是否真的从中受益——你和用户、公司和用户是否都受益。

什么时候不值得做 A/B 测试

Lenny (00:24:48): 运行一个实验显然有一些开销和代价——搭建整个实验、分析结果。有没有什么东西是你觉得不值得做 A/B 测试的？

Ronny Kohavi (00:24:59): 首先，A/B 测试有一些必要的前提条件。我直说吧，不是所有领域都适合做 A/B 测试。你没法对并购做 A/B 测试。那是一次性的事情，你要么收购，要么不收购。

Ronny Kohavi (00:25:14): 所以你确实需要具备一些前提条件。你需要有足够的单元数，主要是用户数，才能让统计学的运算成立。所以如果你规模太小，可能还太早，不适合做 A/B 测试。但我发现，在软件领域，做 A/B 测试是如此容易，搭建平台也是如此容易。

Ronny Kohavi (00:25:39): 我不是说搭建平台很容易。但一旦你搭建好了平台，运行一个实验的边际成本应该趋近于零。我们在微软就达到了那个状态——过了一段时间之后，运行实验的成本如此之低，以至于没有人会质疑”所有东西都应该做实验”这个理念。

Ronny Kohavi (00:25:59): 不过，我觉得我们在 Airbnb 就没有达到那个状态。Airbnb 的平台远没有那么成熟，需要更多的分析师来解读结果、发现其中的问题。所以我确实认为这里存在一个权衡：你愿意在平台上投入多少。把边际成本做到接近零是有可能的。但当你还没达到那个程度时，成本依然很高，可能就有理由不做 A/B 测试。

初创公司何时开始 A/B 测试

Lenny (00:26:28): 你提到规模太小可能不适合做 A/B 测试，这对初创公司来说是一个反复出现的问题：我们什么时候应该开始做 A/B 测试？你有没有什么启发式方法或者经验法则，告诉我们什么时候真的应该开始考虑跑 A/B 测试了？

Ronny Kohavi (00:26:42): 是的，这是每个人都问的那个价值百万美元的问题。实际上，我们会在节目笔记里放个链接，我去年做过一个演讲，我把它叫做“实用默认值”（practical defaults）。其中我展示的一点是，除非你至少有数万用户，否则对于你关注的大多数指标来说，统计上根本算不出结论。

Ronny Kohavi (00:27:05): 事实上，我给了一个具体的实际数字：一个具有某种转化率的零售网站，试图检测至少 5% 的正向变化——这是初创公司应该关注的幅度。他们不应该关注 1% 的变化，而应该关注 5% 和 10%。那么你需要大约 20 万用户。

Ronny Kohavi (00:27:25): 所以当你有数万用户的时候就开始做实验。但这时你只能检测到大的效果。然后当你达到 20 万用户时，魔法就开始了。那时你可以开始测试更多的东西。那时你就有能力测试一切，确保你没有在退化，并从实验中获得价值。所以你要一个经验法则——20 万用户，你就拥有魔法了。在这之前，开始建设文化，开始搭建平台，开始整合。这样随着你的规模增长，你就会开始看到价值。
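
上面“约 20 万用户”的经验法则，可以用常见的样本量近似公式粗略复现。下面只是一个最小示意，并非演讲中的原始计算：假设基线转化率为 5%（原文只说“某种转化率”，这个数值是为演示而假设的），并采用双侧 α = 0.05、功效 80% 时常用的近似 n ≈ 16σ²/δ²。函数名 `users_per_variant` 也是为示例虚构的。

```python
def users_per_variant(baseline_rate: float, relative_lift: float) -> int:
    """估算每个实验组需要的用户数（近似公式 n ≈ 16 * 方差 / 变化量的平方）。

    baseline_rate: 基线转化率（这里的取值是假设的）
    relative_lift: 想要检测的相对变化，例如 0.05 表示 5%
    """
    variance = baseline_rate * (1.0 - baseline_rate)  # 伯努利指标的方差 p(1-p)
    delta = baseline_rate * relative_lift              # 想要检测的绝对变化量
    return round(16.0 * variance / delta ** 2)


if __name__ == "__main__":
    n = users_per_variant(baseline_rate=0.05, relative_lift=0.05)
    print("每组用户数:", n, "两组合计:", 2 * n)
```

在这些假设下，每组约需 12 万用户，两组合计约 24 万，与“约 20 万”的量级一致。如果只想检测 10% 的变化，所需用户数会降到约四分之一，这也解释了为什么用户较少时只能检测到大的效果。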

OEC：总体评估准则

Lenny (00:28:00): 说得好。回到人们对实验的一个担忧——实验会阻碍你创新和下大赌注。我知道你有一个叫做总体评估准则（overall evaluation criterion）的框架，我觉得它能帮助解决这个问题。你能谈谈这个吗？

Ronny Kohavi (00:28:14): OEC，即总体评估准则，我觉得很多刚开始涉足 A/B 测试的人都会忽略它。问题在于：你在优化什么？这个问题比人们想象的要难得多，因为说”我们要优化金钱、优化收入”很容易。但这不是一个正确的问题，因为你可以做很多坏事来提升收入。所以必须有一个制衡指标，告诉你：我如何在提升收入的同时不损害用户体验？

Ronny Kohavi (00:28:53): 我们拿搜索来举个好例子。你可以在页面上放更多广告，你会赚更多钱。这是毫无疑问的。你在短期内会赚更多钱。问题在于，用户体验会怎样，这会在长期对你产生什么影响？

Ronny Kohavi (00:29:13): 我们跑过这些实验，我们能够精确地映射出——这个数量的广告会导致流失率增加这么多，这个数量的广告会导致用户找到成功结果所需的时间增加这么多。然后我们构建了一个基于所有这些指标的 OEC，让你可以说：“好的，我愿意接受这笔额外的收入，只要我没有把用户体验损害超过这个程度。“这里存在一个权衡。

Ronny Kohavi (00:29:41): 一个很好的表述方式是，把它作为一个约束优化问题。我希望你增加收入，但我会给你一个固定的平均版面预算。对于一个查询，你可以有零个广告；对于另一个查询，你可以有三个广告；对于第三个查询，你可以有更宽、更大的广告。我只计算你占用的像素数——垂直像素数。然后我给你一个预算额度。如果你能在同样的预算下赚到更多的钱，那就没问题。

Ronny Kohavi (00:30:16): 所以对我来说，这把问题从一个定义不清的”我们就是要赚更多钱”——任何页面都可以开始贴更多广告、短期内赚更多钱，但这不是目标——转变为一个合理的问题。目标是长期增长和收入。那你就需要加入这些其他准则：我对用户体验做了什么？一种方法就是设置这个约束。另一种方法就是直接监测这些其他指标。还是拿我们做过的来说，去观察用户体验——用户达到一次成功点击需要多长时间？有多少比例的会话是成功的？这些都是总体评估准则中的关键指标，是我们在使用的。
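
上面提到的“约束优化”可以写成如下形式。这只是一个示意性的记法，所有符号都是为说明而引入的假设：$q$ 表示查询，$a(q)$ 表示该查询下展示的广告方案，$R$ 为收入，$h$ 为广告占用的垂直像素数，$B$ 为平均版面预算。

```latex
\max_{a(\cdot)} \; \mathbb{E}_{q}\big[ R(q, a(q)) \big]
\quad \text{s.t.} \quad
\mathbb{E}_{q}\big[ h(a(q)) \big] \le B
```

约束只限制平均像素占用，所以某些查询可以不展示广告，另一些查询可以展示更多或更大的广告。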

Ronny Kohavi (00:30:55): 我再给你举一个例子，来自酒店行业，或者我们共同工作过的 Airbnb。你可以说”我想提高转化率”，但你可以更聪明一点，说”仅仅把用户转化为购买、为房源付款是不够的。我希望他们在几个月后真正入住时是满意的。”

Ronny Kohavi (00:31:19): 所以这可以成为你 OEC 的一部分——“他们对那个房源的评分会是多少？当他们真正入住的时候。“而这带来了一个有趣的问题，因为你现在没有这个数据。你要三个月后，等他们真正入住时才会有。所以你必须构建一个训练集，让你能够预测这个用户——Lenny 在这个便宜的地方会不会满意。还是说不行，我应该给他推荐更贵的东西，因为 Lenny 喜欢住更好的地方，那里水是热的、能从水龙头里流出来的那种。

Lenny (00:31:52): 这倒是真的。好的，所以听起来这个方法的核心基本上就是：有一个制衡指标确保你没有损害对业务真正重要的东西，然后非常清楚地知道我们最在意的长期指标是什么。

Ronny Kohavi (00:32:05): 对我来说，关键词是终身价值（lifetime value），也就是说你必须这样定义 OEC，使它在因果上能预测用户的终身价值。正是这一点促使你正确地思考问题——我做的这件事只是短期有利，还是长期有利？一旦你建立了终身价值这个模型，人们就会说：“好的，那留存率呢？我们可以衡量。完成一个任务的时间呢？我们也可以衡量。“而这些就是那些使 OEC 真正有用的制衡指标。

Lenny (00:32:43): 那要理解这些长期指标，我听到的是用模型、预测和推算，还是说你有时会建议用长期对照组（long-term holdout）或其他方法？你觉得观察长期效果最好的方式是什么？

长期实验与建模

Ronny Kohavi (00:32:57): 对，我觉得可以从两个角度来看。一个是你可以为了学习的目的运行长期实验。比如我提到过，在 Bing 我们确实做了这样的实验——增加广告、减少广告，从而了解关键指标会发生什么变化。

Ronny Kohavi (00:33:16): 另一个是你完全可以构建模型，利用一些背景知识或者用数据科学去分析历史数据……我再给你举一个好例子。我到 Amazon 的时候，向我汇报的团队中有一个是邮件团队——不是那种你买东西后收到的交易邮件，而是负责发送推荐邮件的团队。“这是你买过的某位作者的新书。这是我们推荐的产品。“问题是，我们怎么给这个团队算功劳？

Ronny Kohavi (00:33:49): 最初的做法是，每当用户从邮件点击过来并在 Amazon 上购买东西，我们就把这笔功劳记给那封邮件。结果发现，这个做法没有任何制衡指标。你发的邮件越多，记给这个团队的收入就越多。于是这就导致了垃圾邮件轰炸。真的是一个非常有趣的问题。团队直接大幅提高了发送邮件的数量，宣称赚了更多的钱，他们的 fitness function（适应度函数）也改善了。

Ronny Kohavi (00:34:20): 所以我们回过头来说：“好的，我们可以把它表述为一个约束满足问题——你每 X 天才能给用户发一封邮件”；或者，我们最终采用的做法是——“让我们来建模，算一算给用户发垃圾邮件的代价。”

Ronny Kohavi (00:34:37): 代价是什么？当他们退订的时候，我们就不能再给他们发邮件了。所以我们做了一项独立的数据科学研究，问的问题是：“一次退订我们损失的价值是多少？“我们得出了一个数字，大概是几美元。但关键在于，现在我们有了这个制衡指标。我们说：“这是邮件产生的收入。这是我们在长期价值上的损失。权衡点在哪里？“然后当我们把这些纳入公式后，正在发送的邮件活动中超过一半是负收益的。
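
上面的权衡可以写成一个很简单的净值公式：净值 = 邮件带来的收入 − 退订人数 × 每次退订的长期损失。下面是一个最小示意，其中的活动名称和所有数字都是虚构的（原文只说每次退订的损失“大概是几美元”，这里假设为 3 美元），函数名 `campaign_net_value` 也是为示例起的。

```python
def campaign_net_value(revenue: float, unsubscribes: int,
                       cost_per_unsubscribe: float) -> float:
    """邮件活动的净值：短期收入减去退订造成的长期价值损失。"""
    return revenue - unsubscribes * cost_per_unsubscribe


if __name__ == "__main__":
    # (收入, 退订人数, 每次退订的损失)，全部为假设数据
    campaigns = {
        "new_book_by_author": (1200.0, 150, 3.0),
        "generic_recommendation": (800.0, 400, 3.0),
    }
    for name, (revenue, unsubscribes, cost) in campaigns.items():
        print(name, campaign_net_value(revenue, unsubscribes, cost))
```

只看收入时，两个活动都“赚钱”。把退订成本计入之后，第二个活动的净值变成负数，这正是上文所说的“负收益”活动。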

Ronny Kohavi (00:35:14): 这在 Amazon 是一个巨大的洞察——如何发送正确的营销活动。而这正是我喜欢这类发现的地方。我们把退订纳入考量这一事实，引导我们发现了一个新功能——“好吧，我们不要因为邮件而损失用户未来的终身价值。当他们退订时，默认选项应该是让他们只退订这一类邮件。”

Ronny Kohavi (00:35:41): 所以当你收到一封邮件说某位作者出了新书时，默认的退订选项是”把我从该作者的邮件中退订”。这样一来，负面影响——制衡指标——就小得多了。这同样是一个突破，让我们能发送更多邮件，并根据用户从哪些邮件中退订来判断哪些邮件真正有价值。

Lenny (00:36:06): 我喜欢那种出人意料的结果。

Ronny Kohavi (00:36:08): 我们都喜欢。这就是令人谦卑的现实。人们常说 A/B 测试有时只会带给你渐进式的改善……但我实际上认为，许多这种小的洞察会带来关于方向选择的根本性洞见——某些策略应该采取，某些东西应该开发。帮助很大。

大型重新设计的失败

Lenny (00:36:31): 这让我想到，每次我做一个产品的全面重新设计，我不记得有哪次结果是正向的。然后团队总是不得不去补救他们刚弄坏的东西，试图搞清楚到底哪里搞砸了。你也有这样的经验吗？

Ronny Kohavi (00:36:47): 完全同意，是的。实际上我在 LinkedIn 帖子里发表过一些案例，展示了大量大型发布和重新设计戏剧性地失败了，这非常常见。正确的做法是说：“是的，我们想做重新设计，但要分步进行，过程中不断测试和调整。“这样你就不需要一次性上 17 个新改动，其中很多注定会失败。朝你认为有利的方向逐步推进。边走边调。

Lenny (00:37:24): 我觉得这些经历中最糟糕的是，花了三到六个月来构建。到上线的时候，“我们不可能不上线这个。所有人都朝这个方向在干活。所有新功能都假设这个东西能跑起来。“你基本上被卡住了。

Ronny Kohavi (00:37:41): 这就是沉没成本谬误。我们在这个上面投入了这么多年，那就上线吧，即使它对用户不好。不，那太糟糕了。是的。所以认识到”大多数想法会失败”这个令人谦卑的现实还有另一个好处。如果你相信我发表的那个统计数据，那么把 17 个改动放在一起做，更可能是负面的。用更小的增量来做，从中学习——这叫 OFAT，one-factor-at-a-time（单因子实验）。做一个因子，从中学习，然后调整。17 个里面，也许你有 4 个好主意。那些才是最终能上线并产生正向效果的。

如何应对团队对大型重新设计的执念

Lenny (00:38:22): 我总体上同意这一点，也一直尽量规避大型重新设计，但完全避免它们很难。经常有团队成员非常激昂地说：“我们必须重新思考整个体验。靠渐进式改进根本走不到那里。“你有没有找到什么有效的方法，帮助人们看到这个视角，或者至少让一个更大的赌注更可能成功？

Ronny Kohavi (00:38:42): 顺便说一下，我并不反对大型重新设计。我尽量给团队提供数据来说明：“看，这里有大量大型重新设计失败的案例。“试着把你的重新设计分解——如果你做不到一次一个因子，那就分解成一小批一小批的因子。从这些更小的改动中学习什么有效、什么无效。

Ronny Kohavi (00:39:08): 当然，做一个完整的重新设计也是可以的。只是——正如你自己说的——要做好失败的准备。你真的想在一个东西上花六个月或一年时间，然后跑 A/B 测试，才发现你把收入或其他关键指标拖低了好几个百分点吗？一个数据驱动的组织是不会允许你上线的。你的年度考评你打算写什么？

Lenny (00:39:33): 但从来没有人觉得会失败。他们觉得，“不会的，我们搞定了，我们跟那么多人聊过。”

Ronny Kohavi (00:39:38): 但我认为那些开始跑实验的组织，会从那些较小的改动中很早就被现实教育而变得谦卑。对吧？你说得对。我给你讲一个有趣的故事。我从 Amazon 到 Microsoft 的时候，加入了一个团队，然后因为某种原因，那个团队在我加入一个月后就解散了。

Ronny Kohavi (00:39:57): 于是有人来找我，说，“你看，你刚加入公司，级别是 partner。你想想看怎么能帮到 Microsoft。“我说，“我要搭建一个实验平台，“因为在 Microsoft 没有人在跑实验。而我们在 Amazon 尝试的想法中，超过 50% 都失败了。经典的回应是，“我们这里的 PM 更厉害。”

Ronny Kohavi (00:40:26): 对吧？大家完全否认 Microsoft 实施的想法中可能有 50% 是不行的——顺便说一下，那还是在三年的开发周期里。Office 发布就要那么久。经典模式是每三年发一次版。

Ronny Kohavi (00:40:42): 后来数据出来了，显示 Bing 是第一个真正大规模实施实验的。我们和公司其他团队分享了这些令人惊讶的结果。当 Office 还在……这里要归功于 Qi Lu 和 Satya Nadella，是他们说，“Ronny，你去试着让 Office 跑实验。我们给你撑腰。“这很难，但我们做到了。花了一些时间，但 Office 开始跑实验了，然后他们意识到自己的很多想法也在失败。

Lenny (00:41:20): 你刚才提到有一个失败重新设计的合集网站。是在你的书里，还是一个你可以分享给大家的链接，用来帮助说服团队的？

Ronny Kohavi (00:41:29): 我在课上会讲这个，但我觉得我也在 LinkedIn 上发过，回答过一些相关问题。我很乐意把它放到笔记里。

Lenny (00:41:36): 好的，太好了。我们会放到节目笔记里。因为我觉得正是这类数据经常能帮助说服一个团队：“也许我们不应该从零开始重新构思整个 onboarding 流程。也许我们应该边迭代边学习。”

Lenny (00:41:48): 本节目由 Eppo 赞助播出。Eppo 是一个下一代 A/B 测试平台，由 Airbnb 前员工为现代增长团队打造。DraftKings、Zapier、ClickUp、Twitch 和 Cameo 等公司都依赖 Eppo 来驱动他们的实验。

Lenny (00:42:02): 无论你在哪里工作，跑实验变得越来越不可或缺，但市面上没有商业工具能与现代增长团队的技术栈集成。这导致你要么浪费时间搭建内部工具，要么通过一个笨拙的营销工具来跑实验。

Lenny (00:42:15): 我在 Airbnb 的时候，最喜欢的事情之一就是我们内部的实验平台，我可以按设备类型、国家、用户阶段来切片和钻取数据。

Lenny (00:42:25): Eppo 做到了这一切，甚至更多——快速输出结果，避免令人烦恼的漫长分析周期，帮你轻松定位你发现的任何问题的根因。Eppo 让你超越基本的点击率指标，转而使用你的北极星指标，比如激活、留存、订阅和支付。Eppo 支持前端测试、后端测试、邮件营销，甚至机器学习客户端。去 geteppo.com 了解 Eppo，没错就是 geteppo.com，让你的实验速度提升 10 倍。

大赌注与局部最优

Lenny (00:42:55): 有时候是不是也值得说，“我们干脆把整个东西重新想一遍，试一把，“来跳出局部最小值或局部最大值？

Ronny Kohavi (00:43:03): 是的。我觉得你说得有道理。我确实想分配一定比例的资源给大赌注。就像你说的，我们把这个东西优化到极致了。能不能彻底重新设计？这是一个非常合理的想法。你也许能跳出局部最优。但我要告诉你的是，80% 的情况下你会失败。所以做好准备。人们通常的期望是，“我的重新设计一定会成功。“不，你大概率会失败，但如果真的成功了，那就是突破性的。

Lenny (00:43:35): 我喜欢这个 80% 的经验法则。这只是一个简单的思考方式吗？80% 的——

Ronny Kohavi (00:43:39): 这是我的经验法则。我听有人说是 70% 或 80%。但大致就在这个范围。我觉得当你在讨论在已知路径和高风险高回报之间分配多少投入时，这通常是大多数组织最终采用的分配比例。你采访过 Shreyas。他提到 Google 大约是 70% 做搜索广告，20% 做一些应用和新东西，然后 10% 做基础设施。

Lenny (00:44:16): 我觉得这里面最重要的一点是，如果你不跑实验，你上线的功能中有 70% 在伤害你的业务。

指标持平就不该上线

Ronny Kohavi (00:44:23): 嗯，不一定是伤害，是持平到负面。有些是持平的。顺便说一句，持平对我来说——如果一个结果不显著，那就是不上线，因为你引入了更多代码。上线是有维护成本的。我听人说过，“看，我们已经花了那么多时间。如果我们不上线，团队士气会受打击。“而我的回应是，“不，各位，这不对。我们要搞清楚，上线这个项目没有价值，反而在让代码库变得更复杂。维护成本会上升。“指标持平就不上线，除非是法律要求。当法务部门过来说，“你必须做 X 或 Y，“那你不得不在持平甚至负面的情况下上线。这是可以理解的。

Ronny Kohavi (00:45:08): 但同样，我觉得很多人犯的错误是说，“法务说我们必须这么做，所以我们就认栽吧。”不，法务给你的是一个你必须遵守的框架。试试三种不同的方案，然后上线伤害最小的那个。

Airbnb 的实验文化

Lenny (00:45:25): 这让我想起 Airbnb 推出品牌重塑的时候，就连那个也是作为实验跑的，包括整个首页重新设计、新 logo，等等。而且我记得还设置了长期留出组（holdout），最终结果好像是正向的，如果我没记错的话。

Lenny (00:45:41): 说到 Airbnb，我想聊聊 Airbnb。我知道你能分享的内容有限，但很有意思的是，Airbnb 似乎正在走向另一个方向——变得更加自上而下，以 Brian 的愿景为导向。Brian 甚至说过他对跑实验的积极性没那么高了。他不想像以前那样跑那么多实验。现在业务发展得不错，所以很难说这种做法有问题。你在那里工作了很多年，基本上管的是搜索团队。大概说说看，你的体验是怎样的？然后你觉得大概的走向如何？

Ronny Kohavi (00:46:15): 如你所知，我受限制不能谈论 Airbnb。我可以说几句我被允许说的。第一，在我负责的搜索相关性团队，所有东西都经过了 A/B 测试。所以虽然 Brian 可以专注于一些设计层面的事情，但真正做神经网络和搜索的人，所有东西都被 A/B 测试到极致。没有什么是不经过 A/B 测试就上线的。我们有围绕改善某些指标的目标，一切都是通过 A/B 测试来做的。

Ronny Kohavi (00:46:50): 至于其他团队，有些做了，有些没做。我想说的是，当你说”业务发展得不错”时，我觉得我们并不知道反事实。我相信如果 Airbnb 留住了像 Greg Greeley 这样大力推动数据驱动的人，并且跑更多实验，今天的状态会更好。但这是反事实，我们无从得知。

Lenny (00:47:14): 这个视角非常有意思。Airbnb 简直就是一个有趣的天然实验——用另一种方式做事。一方面弱化了实验，另一方面他们在疫情期间关掉了付费广告。我不知道现在是什么状况，但感觉付费广告已经变成增长策略中很小的一部分了。谁知道他们有没有恢复到以前的水平，但我觉得，五年、十年后再回头看，这将是一个非常有趣的案例研究。

Ronny Kohavi (00:47:38): 这是一个一次性的实验，很难对 Airbnb 正在做的某些事情赋予价值。我个人认为，如果运行更多受控实验，它的规模和成功程度本可以大得多。但我不能谈论一些我亲自运行的实验——那些实验表明，一些最初未经测试的东西实际上是负面的，可以做得更好。

Lenny (00:48:04): 好吧，很神秘。还有一个问题。你在疫情期间在 Airbnb，那对 Airbnb 来说是一段相当疯狂的时期。我们请过 Sanchan 上播客，谈过旅行基本停滞时的种种混乱，当时有一种感觉认为 Airbnb 完了，旅行未来好几年都不会恢复。在那种必须快速行动、做出疯狂决定和重大决策的环境下，你对实验怎么看？那段时间是什么样的？

Ronny Kohavi (00:48:34): 实际上我认为在这样的状态下，运行 A/B 测试更加重要，对吧？因为你希望能看到的是，如果我们做了这个变更，它在当前环境下是否真的有帮助？这里有一个外部可泛化性的概念。它在疫情期间能奏效吗？之后还能泛化吗？这些是你可以通过受控实验真正回答的问题，有时这意味着你可能需要在六个月后复现实验——比如当新冠疫情的影响不再那么严重的时候。

Ronny Kohavi (00:49:11): 说必须快速决策——我想请你看看成功率。如果在和平时期你有三分之二到 80% 的时候是错的，为什么在战时、在疫情期间你会突然变对呢？

Ronny Kohavi (00:49:26): 所以我不相信因为预订量大幅下降，公司就应该突然不再数据驱动、换一种做法。我认为如果 Airbnb 保持原样，什么都不做，收入也会以同样的方式回升。

Lenny (00:49:49): 非常有意思。

Ronny Kohavi (00:49:49): 事实上，如果你看看当时做的一项重大投资——线上体验（online experiences），最初的数据并不太乐观。而今天，它只是一个脚注。

Lenny (00:50:01): 是的，又一个写进历史教材的案例研究——Airbnb 体验。我想稍微转换一下话题，谈谈你的书，你之前提到过几次。书名叫 Trustworthy Online Controlled Experiments，我认为它基本上就是 A/B 测试领域的权威之书。请问，在写作和出版这本书以及看到反响的过程中，最让你意外的是什么？

Ronny Kohavi (00:50:24): 令人惊喜的是，它的销量超出了我们的预期，也超出了剑桥大学出版社的预测。最初，剑桥在我们做了一场教程之后找我们写书，我当时觉得，“我不知道，这是一个太小的细分领域。”

Ronny Kohavi (00:50:47): 他们说，“你能卖出几千本，对世界也有帮助。“然后我找到了我的合著者，他们非常出色。我们写了一本书，我们认为它不是以统计为导向的，公式比通常见的少，重点放在实践层面和信任上——信任是关键。

Ronny Kohavi (00:51:10): 正如我所说，这本书更加成功。英文版卖出了超过 20,000 册。它被翻译成了中文、韩文、日文和俄文。看到我们帮助世界通过实验变得更加数据驱动，这很棒，我为此感到高兴。这是一个令人愉快的惊喜。

Ronny Kohavi (00:51:31): 顺便说一下，这本书的所有收益都捐给了慈善机构。所以如果我在这里推销这本书，我并没有从中获得经济利益。我觉得我们做的这个决定是一个好决定。所有收益都捐给了慈善机构。

Lenny (00:51:47): 太棒了。我不知道这件事。我们会在节目笔记中附上这本书的链接。“信任”（Trust）就在书名里。你刚刚提到信任对实验有多重要。很多人在问，“如何更快地跑实验？“而你则非常强调信任。为什么信任在实验中如此重要？

Ronny Kohavi (00:52:03): 对我来说，实验平台是安全网，也是神谕。它实际上有两个功能。安全网意味着如果你上线了糟糕的东西，你应该能够快速中止，对吧？安全部署、安全速度——有不同的叫法。但这是平台能给你的一项核心价值。

Ronny Kohavi (00:52:25): 另一个功能，也是更标准的功能，是在两周实验结束时，我们会告诉你关键指标和其他替代指标、调试指标以及护栏指标发生了什么变化。信任是积累起来的，也很容易失去。

Ronny Kohavi (00:52:43): 所以对我来说，当你展示结果并说”这是科学，这是受控实验，这是结果”时，你最好相信这是值得信赖的，这一点非常重要。

Ronny Kohavi (00:52:57): 所以我在这方面花了很多精力。我认为这让我们获得了组织层面的信任，让大家真正相信……好处是，当我们建立了所有这些检查机制来确保实验是正确的，如果有什么问题，我们会停下来并说，“嘿，这个实验有问题。”

Ronny Kohavi (00:53:17): 我认为这是其他地方一些早期实现没有做到的，而且是一个很大的失误。我在书中提到了这件事，所以我可以在这里也提一下。

Ronny Kohavi (00:53:28): Optimizely 在早期对统计非常 naive（天真）。他们基本上说，“嘿，我们是实时的。我们可以实时计算你的 P 值，”然后你可以在 P 值达到统计显著时停止实验。这是一个很大的错误。这会大幅膨胀你的第一类错误（type one error），也就是假阳性率。所以，如果你以为自己的第一类错误率是 5%，也就是以 P 值小于 0.05 为目标，那么使用 Optimizely 当时的实时 P 值监控，你的错误率可能会达到 30% 左右。
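
“一旦显著就停止”为什么会推高假阳性率，可以用一个 A/A 模拟来直观感受。下面是一个最小示意，并不是 Optimizely 的实际实现：假设指标服从方差已知为 1 的正态分布，两组来自同一分布，所以任何“显著”结果都是假阳性。函数名 `aa_false_positive_rate` 以及模拟次数、查看次数等参数都是为演示而假设的。

```python
import random
from statistics import NormalDist


def aa_false_positive_rate(n_sims: int, n_per_arm: int, n_looks: int,
                           alpha: float = 0.05, seed: int = 0) -> float:
    """模拟 A/A 测试，返回被错误判为“显著”的实验比例。

    n_looks=1 表示只在实验结束时检验一次；
    n_looks>1 表示中途等间隔查看多次，任意一次显著就停止并宣布结果。
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # 双侧检验的临界值
    step = n_per_arm // n_looks
    false_positives = 0
    for _sim in range(n_sims):
        sum_a = sum_b = 0.0
        for look in range(1, n_looks + 1):
            for _user in range(step):
                sum_a += rng.gauss(0.0, 1.0)
                sum_b += rng.gauss(0.0, 1.0)
            n = look * step
            # 两组均值之差的 z 统计量，差值的方差为 1/n + 1/n
            z = (sum_b / n - sum_a / n) / (2.0 / n) ** 0.5
            if abs(z) > z_crit:
                false_positives += 1
                break
    return false_positives / n_sims


if __name__ == "__main__":
    print("只在结束时检验:", aa_false_positive_rate(1000, 1000, n_looks=1))
    print("中途查看 20 次:", aa_false_positive_rate(1000, 1000, n_looks=20))
```

只检验一次时，假阳性率应该接近名义上的 5%。中途反复查看并在显著时停止，假阳性率会明显更高，查看越频繁，膨胀越严重。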

Ronny Kohavi (00:54:06): 这导致的结果是，开始使用 Optimizely 的人以为平台告诉他们实验非常成功。但当他们真正开始观察时，“嗯，它告诉我们这是正向的收入，但我在长期中看不到这一点。到现在我们应该已经赚了双倍的钱了。”

Ronny Kohavi (00:54:23): 所以关于平台信任的问题开始浮现。有一篇非常著名的帖子，有人写了”Optimizely 差点让我被炒了”，作者基本上说，“看，我来到组织里，我说，‘我们取得了所有这些成功。‘但后来我说，‘有什么不对劲。’”

Ronny Kohavi (00:54:40): 他讲述了自己如何运行了一个 A/A 测试（A 和 B 之间没有任何区别），而 Optimizely 告诉他结果统计显著的次数太多了。Optimizely 后来吸取了教训。有好几个人指出了这个问题，我在 Amazon 上为 Optimizely 作者早期写的那本书写书评时也指出了这一点。我说，“嘿，你们的统计方法不对。”

Ronny Kohavi (00:55:05): 斯坦福的 Ramesh Johari 指出了这个问题，成为了公司的顾问，然后他们修复了它。但对我来说，这是一个非常好的关于如何失去信任的例子。他们在市场上失去了大量信任。他们失去所有这些信任，是因为他们构建的东西有着严重膨胀的错误率。

Lenny (00:55:26): 想想你一直在跑这些实验，而它们实际上没有告诉你准确的结果，这确实挺可怕的。如果你刚开始跑实验，有哪些迹象表明你的做法可能不靠谱？然后如何避免那种情况？对于那些正在尝试跑实验的人，你有什么建议可以分享？

Ronny Kohavi (00:55:47): 我的书里有一整章讲这个，但我先说一个最常见的问题，远超其他，那就是 sample ratio mismatch（样本比例失调）。那么，什么是 sample ratio mismatch？

样本比例失调

Ronny Kohavi (00:56:00): 如果你设计实验，把 50% 的用户分到对照组，50% 分到实验组，这应该是通过随机数或哈希函数来实现的。如果结果偏离了 50%，那就是一个红旗。
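
“通过哈希函数分配”通常是指对用户 ID 做确定性哈希，再把哈希值映射到分组。下面是一个最小示意，并不代表任何具体平台的实现：这里假设用 SHA-256，并把实验名拼进键里，让不同实验的分组互不相关。函数名 `assign_variant` 和键的格式都是为示例虚构的。

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   treatment_share: float = 0.5) -> str:
    """根据用户 ID 和实验名做确定性分组：同一用户在同一实验里总是落在同一组。"""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # 取前 32 位并映射到 [0, 1)
    return "treatment" if bucket < treatment_share else "control"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10000)]
    treated = sum(assign_variant(u, "exp-demo") == "treatment" for u in users)
    print("实验组占比:", treated / len(users))
```

分组是确定性的，所以不需要存储分配结果。设计比例为 50/50 时，实际比例应该非常接近 50%。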

Ronny Kohavi (00:56:15): 举个真实例子。假设你正在跑一个实验，规模很大，有一百万用户，你得到了 50.2。人们会说，“嗯，我也不知道。不会刚好是 50，50.2 差不多吧？“其实有一个公式可以直接算。我有一个电子表格供有兴趣的人使用，你输入对照组有多少用户，实验组有多少用户，我的设计是 50/50，它就能告诉你这种情况纯粹由随机产生的概率是多少。

Ronny Kohavi (00:56:45): 在刚才这个例子里，你把数字代进去，它可能会告诉你这种情况只有在五十万次实验中才会出现一次。除非你真的跑了五十万次实验，否则得到 50.2 比 49.8 的比例是极不可能的。因此，实验出了问题。
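
上面说的“可以直接算”的公式，一种常见做法是卡方拟合优度检验。下面是一个最小示意，不是原文提到的那份电子表格：假设总共 100 万用户，实际划分为 498,000 对 502,000（这组数字是为演示而假设的），函数名 `srm_p_value` 也是虚构的。

```python
import math


def srm_p_value(control_users: int, treatment_users: int,
                expected_treatment_share: float = 0.5) -> float:
    """样本比例失调（SRM）检验：在设计比例下，出现至少这么大偏差的概率。"""
    total = control_users + treatment_users
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # 自由度为 1 的卡方分布的上尾概率：P(X > chi2) = erfc(sqrt(chi2 / 2))
    return math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    p = srm_p_value(control_users=498_000, treatment_users=502_000)
    print(f"p = {p:.2e}, 约每 {1 / p:,.0f} 次实验才会偶然出现一次")
```

具体概率取决于实际的用户数。在这里假设的划分下，p 值约为 6×10⁻⁵，也就是说纯靠随机几乎不可能出现这样的偏差，实验应视为有问题。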

Ronny Kohavi (00:57:06): 我记得当我们实现这个检查的时候，我们看到有多少实验存在这个问题，非常惊讶。2018 年我们发表了一篇论文，分享了微软的数据：尽管我们已经跑了一段时间的实验，仍有大约 8% 的实验存在 sample ratio mismatch。

Ronny Kohavi (00:57:29): 这个数字很大。想想看，你一年跑两万个实验。其中 8% 是无效的。必须有人去搞清楚，到底发生了什么？我们知道结果不可信，但为什么？

Ronny Kohavi (00:57:44): 随着时间的推移，你开始理解——是数据管道有问题，或者是爬虫（bots）在作怪。爬虫是导致 sample ratio mismatch 的一个非常常见的原因。我团队发表的那篇论文就讨论了如何诊断 sample ratio mismatch。

Ronny Kohavi (00:58:06): 在过去大概一年半里，看到这么多第三方公司实现了 sample ratio mismatch 检测，而且所有公司都在报告，“天哪，6%、8%、10%。“所以有时候回头想想还挺有意思的：在你有了这个 sample ratio mismatch 检测之前，过去有多少结果其实是无效的？

Lenny (00:58:32): 确实挺可怕的。最常见的原因是你在代码中错误的位置分配用户吗？

Ronny Kohavi (00:58:40): 说到最常见，我认为最常见的是爬虫。它们以不同的比例访问对照组和实验组。因为你修改了网站，爬虫可能无法解析页面，然后就会更频繁地请求访问。这是一个典型的例子。另一个原因就是数据管道。

Ronny Kohavi (00:58:58): 我们遇到过这样的情况：我们在某些条件下试图清除无效流量，但由于对照组和实验组之间的差异，清除操作本身就是有偏差的。我还见过有人在网站的某个页面中间启动实验，但没有意识到某个营销活动正在从侧面向这个页面导入用户。

产品经理与信任博弈

Ronny Kohavi (00:59:13): 原因有很多。这种情况发生的频率之高令人惊讶。我给你讲一个有趣的故事：我们最初在平台上加了这个检测，只是放了一个横幅，写着”你存在 sample ratio mismatch，不要相信这些结果。“结果我们发现人们直接无视了它，照样展示带有这个横幅的实验结果。

Ronny Kohavi (00:59:37): 于是我们把记分卡整个留空了，放了一个大大的红色提示：“无法查看此结果。你存在 sample ratio mismatch。点击查看结果。“为什么我们需要那个确认按钮？因为调试原因时你需要能够查看，有时候指标本身能帮你理解为什么会出现 sample ratio mismatch。

Ronny Kohavi (01:00:00): 我们清空了记分卡，加了这个按钮，然后我们发现人们按了按钮之后仍然展示有 sample ratio mismatch 的实验结果。于是我们最终达成了一个绝妙的折中方案：记分卡上的每一个数字都加上红色删除线，这样如果你截图，其他人就能看出来你存在 sample ratio mismatch。

Lenny (01:00:24): 这帮产品经理真是够了。

Ronny Kohavi (01:00:26): 这就是直觉。人们会说，“嗯，我的效果很小，所以我还是可以展示结果的。“人们渴望看到成功。这是一种自然的偏差，我们必须非常自觉地对抗这种偏差，当结果看起来好得难以置信时，要去调查。

Twyman’s law

Lenny (01:00:45): 这恰好引出你之前简要提到的一个概念，叫 Twyman’s law。能聊聊这个吗？

Ronny Kohavi (01:00:51): 好的。Twyman’s law 的一般说法是：任何看起来有趣或不同的数字，通常都是错的。这句话最早是由一位在英国广播媒体工作的人说的，我非常推崇这个定律。我对人们的主要建议是：如果结果看起来好得难以置信，比如你的实验正常波动通常不到 1%，突然出现了 10% 的波动，先别急着开庆祝晚宴。庆祝只是你的第一反应，对吧？“带大家去吃顿大餐，我们刚刚提升了数百万美元的收入。”先别急着庆祝，去调查、去核查，因为大概率是结果本身有问题。可以说，十次里有九次，当我们援引 Twyman’s law 时，确实会在实验中找到某种缺陷。

Ronny Kohavi (01:01:45): 当然也有例外。我之前分享的第一个实验，也就是把第二行提升到标题、形成长标题的那个，是成功的。但那个结果被反复复现验证，经过了二次、三次检查，一切都是可靠的。而许多其他看起来效果巨大的结果，最终都被证明是假的。所以我非常推崇 Twyman’s law。我有一个幻灯片，也可以放到节目笔记里，里面分享了一些 Twyman’s law 的真实案例。

Lenny (01:02:14): 太好了。我想谈谈在公司推广实验时遇到的失败情况。但在那之前，我希望你能解释一下 P 值。我知道人们对它有误解，现在可能是个好时机，帮大家理解一下，P 值到底在告诉你什么，比如 0.05？

P 值的正确理解

Ronny Kohavi (01:02:30): 我不知道这个场合是否适合解释 P 值，因为 P 值的定义很简单，但它背后隐藏的东西非常复杂。我想说的一点是，很多人把 1 减去 P 值当作你的实验组优于对照组的概率。你跑了一个实验，得到 P 值为 0.02，他们就认为实验组优于对照组的概率是 98%。这是错误的。所以与其定义 P 值，我更想提醒大家，最常见的解读方式是不正确的。

Ronny Kohavi (01:03:08): P 值是一个条件概率，或者说是一个假设之下的概率。它假设原假设（null hypothesis）为真，然后计算我们观察到的数据与该假设——这个原假设——匹配的概率。

Ronny Kohavi (01:03:27): 要得到大多数人真正想要的那个概率，我们需要应用贝叶斯定理（Bayes’ rules），将概率从”在假设成立条件下数据的概率”反转为”在数据条件下假设成立的概率”。为此我们需要一个额外的数值，即你正在检验的假设成功的先验概率（prior probability）。

Ronny Kohavi (01:03:49): 这个先验概率是未知的。我们能做的是利用历史数据说，“看，实验失败的比例是三分之二或 80%。“然后把这个数值代入计算。我们在一篇论文里做过这件事，我会在节目笔记中给出链接，这样你就可以计算你真正想要的那个数字——即所谓假阳性风险（false positive risk）。

Ronny Kohavi (01:04:10): 我觉得这是大家需要内化的一点：你真正应该关注的是这个假阳性风险，它往往比人们以为的 5% 要高得多。Airbnb 有一个经典案例，他们的实验失败率非常高——当你在 Airbnb，或者 Airbnb 搜索那里，成功率只有 8%，如果你得到了一个统计显著的结果，P 值小于 0.05，那么有 26% 的概率这是一个假阳性结果。不是 5%，是 26%。
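
上面的 26% 可以用贝叶斯定理粗略复现。下面是一个最小示意，计算时用到的两个假设在这段对话里没有明说：统计功效取 80%；并且只有“正向且显著”才算获胜，所以原假设为真时出现显著正向结果的概率取 α/2 = 0.025。函数名 `false_positive_risk` 是为示例起的。

```python
def false_positive_risk(success_rate: float, alpha: float = 0.05,
                        power: float = 0.8) -> float:
    """假阳性风险：在结果显著为正的条件下，实际并无效果的概率。

    success_rate: 历史上创意真正有效的比例（先验概率）
    alpha:        双侧显著性水平，只有一半落在正向一侧
    power:        统计功效（这里的取值是假设的）
    """
    prior_null = 1.0 - success_rate
    p_sig_given_null = alpha / 2.0
    p_sig = p_sig_given_null * prior_null + power * success_rate
    return p_sig_given_null * prior_null / p_sig


if __name__ == "__main__":
    for label, rate in [("成功率约 1/3", 1 / 3), ("成功率 15%", 0.15), ("成功率 8%", 0.08)]:
        print(label, f"{false_positive_risk(rate):.1%}")
```

在这些假设下，8% 的成功率对应约 26% 的假阳性风险，与上文的数字一致。成功率为三分之一时，假阳性风险约为 6%。先验成功率越低，同样的 P 值越不可信。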

Ronny Kohavi (01:04:54): 这才是你脑子里应该有的数字。这也是为什么我在 Airbnb 工作时，我们做的其中一件事就是说，“好，如果你的 P 值在 0.01 到 0.05 之间，就重跑一次，做复现。“复现的时候，你可以把两次实验合并，用 Fisher 方法（Fisher’s method）或 Stouffer 方法（Stouffer’s method）得到一个合并 P 值，也就是联合概率（joint probability）。这个值通常会低得多。所以如果你得到两个 0.05 左右的结果，那么出错的概率就会低得多。
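
合并两次实验的 P 值可以按下面的方式计算。这是一个最小示意，假设各次实验相互独立，且输入的是同一方向的单侧 P 值；函数名 `fisher_combined_p` 和 `stouffer_combined_p` 是为示例起的，只依赖 Python 标准库。

```python
import math
from statistics import NormalDist


def fisher_combined_p(p_values: list[float]) -> float:
    """Fisher 方法：X = -2 * sum(ln p) 服从自由度为 2k 的卡方分布。"""
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # 即 X / 2
    # 自由度为偶数 2k 时，卡方分布的上尾概率有闭式解
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))


def stouffer_combined_p(p_values: list[float]) -> float:
    """Stouffer 方法：把每个 P 值转成 z 分数，求和后除以 sqrt(k)。"""
    nd = NormalDist()
    z_scores = [nd.inv_cdf(1.0 - p) for p in p_values]
    z = sum(z_scores) / math.sqrt(len(p_values))
    return 1.0 - nd.cdf(z)


if __name__ == "__main__":
    print("Fisher:  ", fisher_combined_p([0.05, 0.05]))
    print("Stouffer:", stouffer_combined_p([0.05, 0.05]))
```

以两个 0.05 的结果为例，Fisher 方法得到的合并 P 值约为 0.017，Stouffer 方法约为 0.010，都明显低于单次实验的 0.05。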

Lenny (01:05:26): 哇，我从来没听人这样描述过。这让我想到，我们团队里的数据科学家也总是说，“这不够完美。我们不是百分之百确定这个实验是正面的。“但总体来说，如果我们上线的是正向的实验，大概率是在做好事。偶尔出错也没关系。

Ronny Kohavi (01:05:42): 顺便说一句，这没错。总体来看，你大概比五五开要好一些。但大家没有意识到我刚才提到的 26% 有多高。我之所以想要确定，是因为我认为这关系到学习、组织知识（institutional knowledge）的积累。你想要做到的是与整个组织分享成功经验，所以你需要非常确信自己确实成功了。通过降低 P 值门槛，通过要求团队把 P 值控制在 0.01 以下、对更高的值做复现，你就能大大提高成功率，假阳性率（false positive rate）也会低得多。

开始做实验的建议

Lenny (01:06:20): 很有启发。这也说明了追踪公司或某个产品历史上实验失败比例的价值。假如听众里有人想开始做实验，他们目前已经有几万用户了，你会建议的前几步是什么？

Ronny Kohavi (01:06:38): 那么，如果组织里有人之前参与过实验相关工作，那是一个很好的内部咨询途径。我认为关键的决策是自建还是购买（build or buy）。我在 LinkedIn 上发过一个八期系列，邀请嘉宾来讨论这个问题。感兴趣的人可以去看看供应商和代理商对自建还是购买这个问题的看法。通常不是非此即彼，而是两者兼有。一部分自建、一部分购买，问题在于你是自建 10% 还是自建 90%。

Ronny Kohavi (01:07:17): 对于刚起步的人，现在市面上的第三方产品已经相当不错了。我刚开始做这行的时候不是这样的。当我在 Amazon 开始跑实验的时候，我们自建平台，因为当时什么都没有。Microsoft 也是一样。但今天，有足够多的供应商提供可靠的实验平台，所以我认为可以考虑使用其中一个。

推动实验文化的转变

Lenny (01:07:44): 假设你所在的公司对实验和 A/B 测试有抵触情绪，不管是初创公司还是大公司。你发现什么方法有助于推动文化转变？这通常需要多长时间，尤其是在大公司？

Ronny Kohavi (01:07:57): 我的经验主要来自 Microsoft，我们当时以 Bing 作为滩头阵地（beach head）。我们先跑了一些实验，然后被要求聚焦 Bing，于是我们在 Bing 上规模化了实验，搭建了一个大规模的实验平台。

Ronny Kohavi (01:08:13): Bing 成功之后，我们能够分享所有那些令人意外的结果，公司里越来越多的人就愿意接受了。另外一个很有帮助的因素是人才的自然流动。Bing 的人转到其他团队，带动那些团队说，“嘿，有一种更好的软件开发方式。”

Ronny Kohavi (01:08:34): 所以我认为，如果刚起步，找一个实验容易开展的团队。我的意思是他们频繁发布的团队，别去那个每六个月才发布一次的团队，或者像 Office 以前每三年才发布一次的那种。找那些频繁发布的团队——他们用 sprint 迭代，每一两周发布一次，有时候每天都发布。Bing 以前一天发布好几次。

Ronny Kohavi (01:08:59): 然后确保你理解 OEC（Overall Evaluation Criterion，总体评估准则）的问题——他们优化的目标清楚吗？有些团队能得出一个好的 OEC，有些团队则比较困难。

Ronny Kohavi (01:09:11): 我记得一个有趣的例子是 microsoft.com 网站——注意不是 MSN，是 microsoft.com——它有多个不同的利益方，有人把它当支持站点用，有人通过它销售软件，还有人用它发布安全和更新的通知。它有太多目标了。我记得当时那个团队说”我们想跑实验”，我把那个团队和一些管理者召集过来，问他们：“你们知道自己要优化什么吗？”

Ronny Kohavi (01:09:47): 非常有意思，因为他们的回答让我很意外。他们说，“嘿 Ronny，我们读了你的论文，知道有 OEC 这个概念。我们决定把网站停留时间（time on site）作为我们的 OEC。“我说，“等一下。你们的主要目标之一是支持站点。用户在支持站点上花更多时间是好事还是坏事？“然后房间里一半人觉得时间越长越好，另一半觉得时间越长越糟。所以，如果一个 OEC 在方向上大家都无法达成一致，那它就是有问题的。

实验平台的价值

Lenny (01:10:18): 这是一个很好的建议。顺着这个思路，我知道你非常推崇平台化、搭建一个实验平台来跑实验，而不是做一次性的实验。能不能简单谈谈，让人们大概了解他们的实验方法应该往什么方向发展？

Ronny Kohavi (01:10:32): 好的，动机就是把实验的边际成本（marginal cost）降到零。越多的自助化——你打开一个网站，设置实验，定义目标，定义你想要的指标——越好。人们往往没意识到，如果做得对，指标数量增长得非常快。在 Bing，你可以定义一万个指标放在你的记分卡（scorecard）里。数量很大。

Ronny Kohavi (01:11:02): 数量大到人们说计算上效率太低了。我们把它们拆成了模板（templates）：如果你在做 UI 实验，就获得这一组两千个指标；如果你在做收入相关的实验，就获得那一组两千个指标。

Ronny Kohavi (01:11:15): 所以核心思想是搭建一个平台，让你能够快速设置并运行实验，然后进行分析。我想补充一点，在 Airbnb，分析功能相对薄弱，所以他们雇了大量的数据科学家来弥补平台自身分析能力不足的问题。

Ronny Kohavi (01:11:36): 这种情况在其他组织中也存在，本质上是一个权衡。如果你在搭建一个好的平台，那就投入其中，让越来越多的自动化帮助人们查看分析结果，而不需要每次都拉数据科学家进来。

实验成熟度模型

Ronny Kohavi (01:11:53): 我们发表了一篇论文。同样，我会在节目笔记中给出链接，里面有一个很好的六轴矩阵，展示如何从爬行、到行走、到奔跑、到飞翔，以及在每个阶段你需要在六个轴上构建什么。所以有时候我做咨询时，会走进一个组织，问他们：“你们认为自己在这六个轴上处于什么位置？“这应该就是你下一步该做什么的指引。

Lenny (01:12:21): 这大概会是迄今为止节目笔记最丰富的一期了。也许最后一个问题。我们谈到了信任对运行实验有多重要，也谈到了虽然人们总说速度，但信任最终才是最重要的。尽管如此，我还是想问问关于速度的问题。你有没有什么建议可以帮助人们更快地运行实验、更快地获得可以落地实施的结果？

如何加速实验

Ronny Kohavi (01:12:40): 好的，我讲几点。第一，如果你的平台足够好，那么实验结束时，你应该很快就能拿到记分卡（scorecard）。也许需要一天，但不应该是你还得等数据科学家一个星期才能出结果。对我来说，这是加速的第一要务。

Ronny Kohavi (01:13:00): 在数据的高效利用方面，有一些方差缩减（variance reduction）的方法，可以帮助你降低指标的方差，从而需要更少的用户，更快地获得结果。你可以想到的一些例子是对指标进行封顶（capping）。比如你的收入指标非常偏斜，你可以说：“如果有人消费超过 1000 美元，我们就把它算作 1000 美元。“在 Airbnb，一个关键指标是预订夜晚数（nights booked）。

Ronny Kohavi (01:13:30): 结果发现有些人会预订几十个夜晚，他们可能像中介一样，预订上百个夜晚。你可以说：“好的，我们把它封顶。一个人在一个月内预订不太可能超过 30 天。“这种方差缩减技术可以让你更快地获得统计显著的结果。
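
封顶对方差的影响可以用几行代码看出来。下面是一个最小示意，数据是虚构的：大多数用户预订 1 到 4 晚，个别“中介式”用户预订了 120 晚，封顶值取上文提到的 30。函数名 `cap_metric` 是为示例起的。

```python
from statistics import mean, variance


def cap_metric(values: list[float], cap: float) -> list[float]:
    """把超过上限的取值截断到上限，其余保持不变。"""
    return [min(v, cap) for v in values]


if __name__ == "__main__":
    nights = [1, 2, 3, 2, 4, 1, 2, 120]  # 假设数据，最后一个是极端值
    capped = cap_metric(nights, cap=30)
    print("原始: 均值", mean(nights), "方差", variance(nights))
    print("封顶: 均值", mean(capped), "方差", variance(capped))
```

极端值被截断后，指标的方差大幅下降，同样的效果就能用更少的用户检测出来。代价是指标的含义略有变化，因此封顶值应在实验开始前确定，并对两组一视同仁。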

Ronny Kohavi (01:13:53): 第三个技术叫 CUPED，这是我们发表过一篇文章介绍的。同样，我可以在节目笔记里给出链接。它利用实验前的数据来调整结果。我们可以证明，这样得到的结果是无偏的，但方差更低，因此需要更少的用户。
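
CUPED 的核心调整可以概括为：用实验前的同一指标 X 作为协变量，把实验期指标 Y 调整为 Y − θ(X − X̄)，其中 θ = Cov(X, Y) / Var(X)。下面是一个最小示意，数据是虚构的，函数名 `cuped_adjust` 也是为示例起的。实际使用时，θ 和 X̄ 应在两组合并的全部用户上计算，然后再比较两组调整后指标的均值。

```python
from statistics import mean, variance


def cuped_adjust(metric: list[float], pre_metric: list[float]) -> list[float]:
    """用实验前数据对实验期指标做 CUPED 调整，返回调整后的逐用户取值。"""
    mean_pre = mean(pre_metric)
    mean_y = mean(metric)
    cov = sum((x - mean_pre) * (y - mean_y)
              for x, y in zip(pre_metric, metric)) / (len(metric) - 1)
    theta = cov / variance(pre_metric)
    # 减去可由实验前行为解释的部分；均值不变，方差变小
    return [y - theta * (x - mean_pre) for x, y in zip(pre_metric, metric)]


if __name__ == "__main__":
    pre = [10.0, 20.0, 30.0, 40.0, 50.0]   # 实验前指标（假设数据）
    post = [12.0, 19.0, 33.0, 41.0, 48.0]  # 实验期指标（假设数据）
    adjusted = cuped_adjust(post, pre)
    print("均值:", mean(post), "->", mean(adjusted))
    print("方差:", variance(post), "->", variance(adjusted))
```

调整项的均值为零，所以指标均值保持不变，结果无偏。实验前后指标的相关性越高，方差降得越多。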

Lenny (01:14:11): Ronny，在我们进入非常精彩的闪电问答环节之前，还有什么想分享的吗？

Ronny Kohavi (01:14:15): 没有了，我觉得我们问了很多好问题。希望大家喜欢这期内容。

Lenny (01:14:20): 我知道他们会喜欢的。

Ronny Kohavi (01:14:21): 闪电问答。

闪电问答

Lenny (01:14:22): 闪电问答，开始了。我直接进入正题。你向别人推荐最多的两三本书是什么？

Ronny Kohavi (01:14:29): 有一本很有趣的书叫 Calling Bullshit，尽管名字有点极端，作为书名来说我觉得有些过了，但它确实包含了很多我很喜欢的精彩见解。在我看来，它很好地体现了 Twyman’s law 的精髓——当某些事情过于极端时，你的”胡扯探测器”就应该响起来，说：“嘿，我不信。“这是我的第一推荐。

Ronny Kohavi (01:14:57): 还有一本稍早一些的书我很喜欢，叫 Hard Facts, Dangerous Half-Truths And Total Nonsense，作者是斯坦福商学院的教授。书里非常有趣地揭示了许多我们从小到大习以为常的事情，其实根本没有依据。

Ronny Kohavi (01:15:21): 还有一本更奇特的书，我很喜欢，算是偏心理学领域的，叫 Mistakes Were Made (But Not by Me)，讲的是我们人类陷入的各种谬误，以及其中令人谦卑的结论。

Lenny (01:15:37): 这些书的书名都太搞笑了，而且它们之间有一个共同的主题。下一个问题，最喜欢的近期电影或电视剧是什么？

Ronny Kohavi (01:15:47): 我最近看了一部短剧叫《切尔诺贝利》，讲那场灾难的。我觉得拍得非常好，强烈推荐，基于真实事件。和往常一样，影视剧有一些艺术加工的空间。比较有意思的是，在结尾他们说：“电影中的这个女性角色实际上并不是一个真实的人，而是三十位数据科学——不是数据科学家，是三十位科学家，在现实中是他们向领导层展示了所有数据，告诉他们该怎么做。”

Lenny (01:16:22): 我记得那段。一个有趣的事实：我出生在乌克兰的敖德萨，离切尔诺贝利不算太远。我记得我爸告诉我那天他被叫去上班，去清理树上的一些东西。我想是爆炸产生的灰烬之类的。我们住得比较远，应该没有被辐射暴露，但确实在那个区域范围内。挺吓人的。我老婆每次我身体有什么问题，她就会说：“那一定是切尔诺贝利的事。“好的，下一个问题。你面试别人时最喜欢问的面试题是什么？

Ronny Kohavi (01:16:56): 这取决于面试类型，但在我做技术面试时——我现在做得少了——有一个我很喜欢的问题：对于 C++ 这样的语言，说说 static 修饰符的作用。这个问题能让很多人栽跟头，令人惊讶。你可以问它在变量上的作用，也可以问在函数上的作用。令人惊叹的是，超过 50% 面试工程岗位的人答不上来，而且错得离谱。

Lenny (01:17:31): 这绝对是这个问题的答案里最技术性的了。

Ronny Kohavi (01:17:34): 非常技术性，是的。

Ronny Kohavi (01:17:34): 我喜欢。

Lenny (01:17:36): 好的。你最近发现的最喜欢的产品是什么？

Ronny Kohavi (01:17:39): Blink 摄像头。Blink 摄像头是一种小型摄像头，装两节 AA 电池就能用大约六个月。他们宣称最多能用两年，我的经验通常是六个月左右。但它就是让我觉得很神奇——你可以把这些东西随便丢在院子里，然后看到你本来永远不会知道的事情。比如一些路过的小动物。我们有只臭鼬，一直搞不清它从哪进来的，于是我放出去五个摄像头，就看到了它从哪里进来。

Lenny (01:18:18): 它从哪进来的？

Ronny Kohavi (01:18:19): 它从栅栏上一个大概这么高的洞钻进来的。我有视频，那东西就把自己挤着从下面钻过来。我们绝不会想到它从邻居那边、从那个洞过来的。但没错，这些东西确实改变了很多。当你出门旅行时，能说一句”我能看到我家，一切正常”总是好的。有一次我们遇到了一个误报，警察来了，我们拍到了一段很精彩的视频——他们怎么进入房子、怎么拔出枪的。

Lenny (01:18:56): 你得把这个发到 TikTok 上，那是好内容。哇。好的，Blink 摄像头。我也要尽快在我家装上。

Ronny Kohavi (01:19:04): 是的。

结构化叙事

Lenny (01:19:06): 你在团队开发产品的方式上做过什么相对较小的改变，却对团队的执行力产生了很大的影响？

Ronny Kohavi (01:19:14): 我觉得这是我在 Amazon 学到的东西，就是结构化叙事（structured narrative）。Amazon 有一些变体，有时叫六页纸（six pager）之类的。但我在 Amazon 的时候，至今还记得 Jeff 发的那封邮件：“不再用 PowerPoint。我要强迫你们写叙事文档。”

Ronny Kohavi (01:19:34): 我把这话记在了心里。团队在提出功能时，很多都不再用 PowerPoint，而是用一份结构化文档开头，告诉你需要什么、你的想法需要回答哪些问题。然后我们以团队形式一起评审。

Ronny Kohavi (01:19:51): Amazon 那时候是纸质版的。现在全都基于 Word 或 Google Docs，大家可以在上面评论，我觉得这个影响非常棒。我觉得能给人诚实的反馈，让他们感激，而且这些反馈以笔记形式留在文档上、会议结束后依然保留——真的很棒。

Lenny (01:20:13): 最后一个问题：你在生活中做过 A/B 测试吗？比如恋爱、家庭、孩子方面？如果做过，你试了什么？

Ronny Kohavi (01:20:21): 样本量不够啊。还记得我说过，要做真正的 A/B 测试，你需要一万个单位吗？我想说几点。一是我会尽量向家人、朋友和所有人强调一个概念，叫做”证据等级”（hierarchy of evidence）。当你读到某些信息时，信任是有等级的。如果只是轶事性的，不要信。如果是一项实验，但只是观察性的，可以给予一定程度的信任。随着你逐步上升到自然实验、对照实验，以及多个对照实验，你的信任等级也应该相应提高。所以我觉得这是很多人在看到新闻时忽略的一个很重要的东西——这信息的来源是什么？

Ronny Kohavi (01:21:06): 我有一个分享过的演讲，列举了所有那些人们做过并发表的观察性研究。后来不知怎么的，有人做了对照实验，结果证明这些结论的方向是反的。所以我觉得关于证据等级这个概念，有很多值得学习的地方，我也把它分享给家人、孩子和朋友。我觉得有一本书就是基于这个思路写的。类似《如何阅读一本书》（How to Read a Book）那种。

Lenny (01:21:34): 好，Ronny，我们录播客这个实验，我觉得是 100% 正面的，P 值 0.0。非常感谢你来。

Ronny Kohavi (01:21:44): 非常感谢你的邀请，问题也非常好。

Lenny (01:21:47): 太棒了，很感激。最后两个问题。如果大家想联系你，在网上哪里能找到你？另外，听众能为你做些什么吗？

Ronny Kohavi (01:21:55): 在网上找我很简单，就是 LinkedIn。大家能为我做什么？理解对照实验作为一种机制，用来做出正确的数据驱动决策。运用科学。想深入学习的可以读我的书。再次说明，所有收益都捐给慈善机构。如果你想进一步学习，我每个季度在 Maven 上教一门课。我们会在节目笔记里附上链接，也为一直听到这期播客最后的听众提供一些折扣。

Lenny (01:22:31): 好的，太棒了。我们会在顶部放出来，这样大家不会错过，会有一个折扣码用来优惠你的课程。Ronny，再次非常感谢你来。这次太棒了。

Ronny Kohavi (01:22:39): 非常感谢。

Lenny (01:22:40): 大家再见。非常感谢收听。如果你觉得这期有价值，可以在 Apple Podcasts、Spotify 或你喜欢的播客应用上订阅本节目。另外，也请考虑给我们评分或写评论，因为这真的能帮助其他听众发现这个播客。你可以在 lennyspodcast.com 找到所有往期节目或了解更多关于本节目的信息。下期再见。

术语表

原文	中文
A/A test	A/A 测试
A/B testing	A/B 测试
air support	撑腰（上级支持）
backlog	待办列表
Bayes’ rules	贝叶斯定理
beach head	滩头阵地（beach head）
build or buy	自建还是购买（build or buy）
capping	封顶
constraint satisfaction problem	约束满足问题
controlled experiment	对照实验
counterfactual	反事实
cross pollination	人才流动（自然交叉传播）
CUPED	CUPED（利用实验前数据的方差缩减方法）
escalation	升级（告警升级）
external generalizability	外部可泛化性
false positive rate	假阳性率
false positive risk	假阳性风险（false positive risk）
Fisher’s method	Fisher 方法
fitness function	适应度函数
guardrail metrics	护栏指标
hierarchy of evidence	证据等级
holdout	留 holdout 组（对照留出组）
home run	本垒打
institutional knowledge	组织知识（institutional knowledge）
joint probability	联合概率
lifetime value	终身价值
local minima / local maxima	局部最小值 / 局部最大值
marginal cost	边际成本
naive	naive（天真的）
natural experiment	自然实验
neural networks	神经网络
North Star metrics	北极星指标
null hypothesis	原假设
observational study	观察性研究
OEC (Overall Evaluation Criterion)	OEC（总体评估准则）
OFAT (one-factor-at-a-time)	单因子实验
onboarding	onboarding（首次使用引导流程，原文保留）
online experiences	线上体验（online experiences）
oracle	神谕
P value	P 值
paid ads	付费广告
partner level	partner 级别
PM	PM（产品经理）
prior probability	先验概率
sample ratio mismatch	样本比例失调（原文保留 sample ratio mismatch）
scorecard	记分卡
search relevance	搜索相关性
show notes	节目笔记
six pager	六页纸
Statsig	Statsig（实验平台公司名，原文保留）
Stouffer’s method	Stouffer 方法
structured narrative	结构化叙事
sunk cost fallacy	沉没成本谬误
surrogate metrics	替代指标
templates	模板
Twyman’s law	Twyman’s law（特威曼定律，原文保留）
type one error	第一类错误
variance reduction	方差缩减


The ultimate guide to A/B testing | Ronny Kohavi (Airbnb, Microsoft, Amazon)

Ronny Kohavi: I’m very clear that I’m a big fan of test everything, which is any code change that you make, any feature that you introduce has to be in some experiment. Because again, I’ve observed this sort of surprising result that even small bug fixes, even small changes can sometimes have surprising, unexpected impact.

And so I don’t think it’s possible to experiment too much. You have to allocate sometimes to these high risk, high reward ideas. We’re going to try something that’s most likely to fail. But if it does win, it’s going to be a home run.

And you have to be ready to understand and agree that most will fail. And it’s amazing how many times I’ve seen people come up with new designs or a radical new idea. And they believe in it, and that’s okay. I’m just cautioning them all the time to say, “If you go for something big, try it out, but be ready to fail 80% of the time.”

Lenny: Welcome to Lenny’s Podcast, where I interview world-class product leaders and growth experts to learn from their hard-won experiences building and growing today’s most successful products.

Today my guest is Ronny Kohavi. Ronny is seen by many as the world expert on A/B testing and experimentation. Most recently, he was VP and technical fellow of relevance at Airbnb where he led their search experience team. Prior to that, he was corporate vice president at Microsoft, where he led the Microsoft Experimentation Platform team. Before that, he was director of data mining and personalization at Amazon.

He’s currently a full-time advisor and instructor. He’s also the author of the go-to book on experimentation called Trustworthy Online Controlled Experiments. And in our show notes, you’ll find a code to get a discount on taking his live cohort-based course on Maven.

In our conversation, we get super tactical about A/B testing. Ronny shares his advice for when you should start considering running experiments at your company, how to change your company’s culture to be more experiment driven, what are signs your experiments are potentially invalid, why trust is the most important element of a successful experimentation culture and platform, and how to get started if you want to start running experiments at your company. He also explains what a P value actually is and something called Twyman’s law, plus some hot takes about Airbnb and experiments in general. This episode is for anyone who’s interested in either creating an experiment driven culture at their company or just fine-tuning one that already exists. Enjoy this episode with Ronny Kohavi after a short word from our sponsors.

Ronny, welcome to the podcast.

Ronny Kohavi: Thank you for having me.

Lenny: So you’re known by many as maybe the leading expert on A/B testing and experimentation, which I think is something every product company eventually ends up trying to do, often badly. And so I’m excited to dig quite deep into the world of experimentation and A/B testing to help people run better experiments. So thank you again for being here.

Ronny Kohavi: That’s a great goal. Thank you.

Lenny: Let me start with kind of a fun question. What is maybe the most unexpected A/B test you’ve run or maybe the most surprising result from an A/B test that you’ve run?

Ronny Kohavi: So I think the opening example that I use in my book and in my class is the most surprising public example we can talk about. And this was kind of an interesting experiment. Somebody proposed to change the way that ads were displayed on Bing, the search engine. And he basically said, “Let’s take the second line and move it, promote it to the first line so that the title line becomes larger.”

And when you think about that, and if you’re going to look in my book, or in the class, there’s an actual diagram of what happened, the screenshots. But if you think about it, just realistically it looks like a meh idea. Why would this be such a reasonable, interesting thing to do? And indeed, when we went back to the backlog, it was on the backlog for months, and languished there, and many things were rated higher.

But the point about this is it’s trivial to implement. So if you think about return on investment, we could get the data by having some engineers spend a couple of hours implementing it.

And that’s exactly what happened. Somebody at Bing kept seeing this in the backlog and said, “My God, we’re spending too much time discussing it. I could just implement it.” He did. He spent a couple of days implementing it and, as is the common thing at Bing, he launched the experiment.

And a funny thing happened. We had an alarm. Big escalation, something is wrong with the revenue metric. Now this alarm fired several times in the past when there were real mistakes, where somebody would log revenue twice, or there’s some data problem. But in this case, there was no bug. That simple idea increased revenue by about 12%.

And this is something that just doesn’t happen. We can talk later about Twyman’s law, but that was the first reaction, which is, “This is too good to be true. Let’s find a bug.” And we looked for one several times, and we replicated the experiment several times, and there was nothing wrong with it. This thing was worth $100 million at the time when Bing was a lot smaller.

And the key thing is it didn’t hurt the user metrics. So it’s very easy to increase revenue by doing theatrics. Displaying more ads is a trivial way to raise revenue, but it hurts the user experience. And we’ve done the experiments to show that. In this case, this was just a home run that improved revenue, didn’t significantly hurt the guardrail metrics. And so we were in awe of what a trivial change. That was the biggest revenue impact to Bing in all its history.

Lenny: And that was basically shifting in two lines, right? Switching two lines in the search results.

Ronny Kohavi: And this was just moving the second line to the first line. Now you then go and run a lot of experiments to understand what happened here. Is it the fact that the title line has a bigger font, sometimes different color? So we ran a whole bunch of experiments.

And this is what usually happens. We have a breakthrough. You start to understand more about, what can we do? And there’s suddenly a shift towards, “Okay, what are other things we could do that would allow us to improve revenue?” We came up with a lot of follow-on ideas that helped a lot.

But to me, this was an example of a tiny change that was the best revenue generating idea in Bing’s history, and we didn’t rate it properly. Nobody gave this the priority that in hindsight, it deserves. And that’s something that happens often. I mean, we are often humbled by how bad we are at predicting the outcome of experiments.

Lenny: This reminds me of a classic experiment at Airbnb while I was there, and we’ll talk about Airbnb in a bit. The search team just ran a small experiment of what if we were to open a new tab every time someone clicked on a search result, instead of just going straight to that listing. And that was one of the biggest wins in search-

Ronny Kohavi: And by the way, I don’t know if you know the history of this, but I talk about this in class. We did this experiment way back around 2008 I think. And so this predates Airbnb. I remember it was heavily debated. Why would you open something in a new tab? The users didn’t ask for it. There was a lot of pushback from the designers. And we ran that experiment. And again, it was one of these highly surprising results, and we learned so much from it.

So we first did this. It was done in the UK for opening Hotmail, and then we moved it to MSN, so it would open search in a new tab, and the whole set of experiments was highly, highly beneficial. We published this. And I have to tell you, when I came to Airbnb, I talked to our joint friend Ricardo about this. And it was sort of done, it was very beneficial, and then it was semi forgotten, which is one of the things you learn about institutional memory. When you have winners, make sure to document them and remember them. So at Airbnb, listings had opened in a new tab for a long time before I joined, but other things that were designed later did not. And I reintroduced this to the team, and we saw big improvements.

Lenny: Shout out to Ricardo, our mutual friend who helped make this conversation happen. There’s this holy grail of experiments that I think people are always looking for of one hour of work and it creates this massive result. I imagine this is very rare, and don’t expect this to happen. I guess in your experience, how often do you find one of these gold nuggets just lying around?

Ronny Kohavi: Yeah. So again, this is a topic that’s near and dear to my heart. Everybody wants these amazing results, and I show them in chapter one in my book, multiple of these small efforts, huge gain.

But as you said, they’re very rare. I think most of the time, the winnings are made inch by inch. And there’s a graph that I show in my book, a real graph of how Bing ads has managed to improve the revenue per thousand searches over time, and every month you can see a small improvement and a small improvement. Sometimes there’s a degradation because of legal reasons or other things. There was some concern that we were not marking the ads properly. So you have to suddenly do something that you know is going to hurt revenue. But yes, I think most results are inch by inch. You improve small amounts, lots of them. The best examples I can give are a couple that I’m allowed to speak about.

One is at Bing, the relevance team, hundreds of people all working to improve Bing relevance. They have a metric (we’ll talk about the OEC, the overall evaluation criterion), and their goal is to improve it by 2% every year. It’s a small amount, and you can see how that 2% is built: here’s a 0.1, here’s a 0.15, here’s a 0.2, and they add up to around 2% every year, which is amazing.

Another example that I am allowed to speak about from Airbnb is the fact that we ran some 250 experiments in my tenure there in search relevance. And again, small improvements added up. So this became overall a 6% improvement to revenue. So when you think about 6%, it’s a big number, but it came out not of one idea, but many, many smaller ideas that each gave you a small gain.

And in fact, again, there’s another number I’m allowed to say. Of these experiments, 92% failed to improve the metric that we were trying to move. So only 8% of our ideas actually were successful at moving the key metrics.

Lenny: There’s so many threads I want to follow here, but let me follow this one right here. You just mentioned that 92% of experiments failed. Is that typical in your experience seeing experiments run at a lot of companies? What should people expect when they’re running experiments? What percentage should they expect to fail?

Ronny Kohavi: Well, first of all, I published three different numbers from my career. So overall at Microsoft, about 66%, two thirds of ideas fail. And don’t think the 66 is accurate. It’s about two thirds. At Bing, which is a much more optimized domain after we’ve been optimizing it for a while, the failure rate was around 85%. So it’s harder to improve something that you’ve been optimizing for a while. And then at Airbnb, this 92% number is the highest failure rate that I’ve observed.

Now I’ve quoted other sources. It’s not that I worked at groups that were particularly bad. Booking, Google Ads, and other companies published numbers that are around an 80 to 90% failure rate. Note that these are failure rates of experiments, not of ideas. It’s important to realize that when you have a platform, it’s easy to get this number. You look at how many experiments were run and how many of them launched. But not every experiment maps to an idea.

So it’s possible that when you have an idea, your first implementation, you start an experiment. Boom, it’s egregiously bad, because you have a bug. In fact, 10% of experiments tend to be aborted on the first day. Those are usually not that the idea is bad, but that there is an implementation issue or something we haven’t thought about, that forces an abort.

You may iterate and pivot again. And ultimately, if you do two, or three, or four pivots or bug fixes, you may get to a successful launch. But those numbers of 80 to 92% failure rate are of experiments.

Very humbling. I know that every group that starts to run experiments, they always start off by thinking that somehow, they’re different. And their success rate’s going to be much, much higher, and they’re all humbled.

Lenny: You mentioned that you had this pattern of clicking a link and opening a new tab as a thing that just worked at a lot of different places.

Ronny Kohavi: Yeah.

Lenny: Are there other versions of this? Do you collect a list of, “Here are things that often work when we want to move…” Are there some you could share? I don’t know if you have a list in your head.

Ronny Kohavi: I can give you two resources. One of them is a paper that we wrote called Rules of Thumb, and what we tried to do at that time at Microsoft was to just look at thousands of experiments that ran and extract some patterns. And so that’s one paper that we can then put in the notes.

Lenny: Perfect.

Ronny Kohavi: But there’s another more accurate, I would say, resource that’s useful that I recommend to people. And it’s a site called goodui.org, and goodui.org is exactly the site that tries to do what you’re saying at scale.

So the guy’s name is Jacob [inaudible 00:16:28]. He asks people to send him results of experiments, and he puts them into patterns. There’s probably 140 patterns I think at this point. And then for each pattern he says, “Well, who has that helped? How many times and by how much?” So you have an idea: this worked three out of five times, and it was a huge win. In fact, you can find the open-in-a-new-window pattern in there.

Lenny: I feel like you feed that into ChatGPT, and you have basically a product manager creating a roadmap tool.

Ronny Kohavi: In general, by the way, a lot of that is institutional memory, which is can you document things well enough so that the organization remembers the successes and failures, and learns from them?

I think one of the mistakes that some companies make is they launch a lot of experiments and never go back and summarize the learnings. So I’ve actually put a lot of effort into this idea of institutional learning, of doing the quarterly meeting of the most surprising experiments.

By the way, surprising is another question that people often are not clear about. What is a surprising experiment? To me, a surprising experiment is one where the estimated result beforehand and the actual result differ by a lot. So that absolute value of the difference is large.

Now you can expect something to be great and it’s flat. Well, you learn something. But if you expect something to be small and it turns out to be great, like that ad title promotion, then you’ve learned a lot. Or conversely, if you expect that something will be small and it’s very negative, you can learn a lot by understanding why this was so negative. And that’s interesting.
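The definition of surprise above, the absolute difference between the estimated and the actual result, can be sketched in a few lines. This is only an illustration: the `Experiment` record, its field names, and every number except the roughly 12% lift from the ad title example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    name: str
    predicted_lift_pct: float  # the team's estimate before the experiment ran
    actual_lift_pct: float     # the measured effect of the treatment


def surprise(e: Experiment) -> float:
    """Absolute gap between what was expected and what was measured."""
    return abs(e.actual_lift_pct - e.predicted_lift_pct)


def most_surprising(experiments, top_n=3):
    """Rank experiments for a quarterly review, most surprising first."""
    return sorted(experiments, key=surprise, reverse=True)[:top_n]


# Hypothetical predictions; only the ~12% actual lift comes from the interview.
experiments = [
    Experiment("ad title promotion", 0.5, 12.0),
    Experiment("expected big win, came out flat", 5.0, 0.0),
    Experiment("routine tweak", 0.2, 0.3),
    Experiment("no-brainer that backfired", 1.0, -4.5),
]

for e in most_surprising(experiments):
    print(f"{e.name}: surprise = {surprise(e):.1f} points")
```

Ranking by this gap surfaces both unexpected winners and unexpected losers, which is why a flat result for an idea everyone loved can rank above a modest win that everyone predicted.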

So we focused not just on the winners, but also surprising losers, things that people thought would be a no-brainer to run. And then for some reason, it was very negative. And sometimes, it’s that negative that gives you insight. Actually, I’m just coming up with one example of that, that I should mention.

We were running this experiment at Microsoft to improve the Windows indexer, and the team was able to show on offline tests that it does much better at indexing, and they showed that relevance is higher, and all these good things. And then they ran it as an experiment. You know what happened? Surprising result. The indexing relevance was actually higher, but it killed the battery life.

So here’s something that comes from left field that you didn’t expect. It was consuming a lot more CPU on laptops. It was killing the laptops. And therefore, okay, we learned something. Let’s document it. Let’s remember this, so that we now take this other factor into account as we design the next iteration.

Lenny: What advice do you have for people to actually remember these surprises? You said that a lot of it is institutional. What do you recommend people do so that they can actually remember this when people leave, say three years later?

Ronny Kohavi: Document it. We had a large deck internally of these successes and failures, and we encourage people to look at them. The other thing that’s very beneficial is just to have your whole history of experiments and do some ability to search by keywords.

So I have an idea. Type a few keywords and see if from the thousands of experiments that ran… And by the way, these are very reasonable numbers. At Microsoft, just to let you know, when I left in 2019, we were on a rate of about 20 to 25,000 experiments every year. So every working day, we were starting something like 100 new treatments. Big numbers. So when you’re running in a group like Bing, which is running thousands and thousands of experiments, you want to be able to ask, “Has anybody done an experiment on this or this or this?” And so that searching capability is in the platform.

But more than that, I think just doing the quarterly meeting of the most successful… Most interesting, sorry, not just successful, most interesting experiments is very key. And that also helps the flywheel of experimentation.

Lenny: It’s a good segue to something I wanted to touch on, which is there’s often, I guess, a wariness of running too many experiments and being too data-driven, and the sense that experimentation just leads you to these micro optimizations, and you don’t really innovate and do big things. What’s your perspective on that? And then, can you be too experiment driven in your experience?

Ronny Kohavi: I’m very clear that I’m a big fan of test everything, which is any code change that you make, any feature that you introduce has to be in some experiment. Because again, I’ve observed this surprising result that even small bug fixes, even small changes can sometimes have surprising unexpected impact.

And so I don’t think it’s possible to experiment too much. I think it is possible to focus on incremental changes because some people say, “Well, if we only tested 17 things around this,” you have to think about, it’s like in stock. You need a portfolio. You need some experiments that are incremental that move you in the direction that you know you’re going to be successful over time if you just try enough. But some experiments, you have to allocate sometimes to these high risk, high reward ideas. We’re going to try something that’s most likely to fail, but if it does win, it’s going to be a home run.

And so you have to allocate some efforts to that, and you have to be ready to understand and agree that most will fail. And it’s amazing how many times I’ve seen people come up with new designs, or a radical new idea, and they believe in it, and that’s okay. I’m just cautioning them all the time to say, “Hey, if you go for something big, try it out, but be ready to fail 80% of the time.”

And one true example, that again, I’m able to talk about because we put it in my book, is we were at Bing trying to change the landscape of search. And one of the ideas, the big ideas was we are going to integrate with social. So we hooked into the Twitter fire hose feed and we hooked into Facebook, and we spent 100 person years on this idea.

And it failed. You don’t see it anymore. It existed for about a year and a half, and all the experiments were just negative to flat. And it was an attempt. It was fair to try it. I think it took us a little long to fail, to decide that this is a failure. But at least we had the data. We had hundreds of experiments that we tried. None of them were a breakthrough. And I remember mailing Qi Lu with some statistics showing that it’s time to abort, it’s time to fail on this. And he decided to continue more. And it’s a million dollar question. Do you continue, and then maybe the breakthrough will come next month, or do you abort? And a few months later, we aborted.

Lenny: That reminds me of at Netflix, they tried a social component that also failed. At Airbnb, early on there was a big social attempt to make, “Here’s your friends that have stayed at these Airbnbs,” completely had no impact. So maybe that’s one of these learnings that we should document.

Ronny Kohavi: Yeah, this is hard. This is hard. But again, that’s the value of experiments, which are this oracle that gives you the data. You may be excited about things. You may believe it’s a good idea. But ultimately, the oracle is the controlled experiment. It tells you whether users are actually benefiting from it, whether both the company and the users are benefiting.

Lenny: There’s obviously a bit of overhead and downside to running an experiment, setting it all up, and analyzing the results. Is there anything that you ever don’t think is worth A/B testing?

Ronny Kohavi: First of all, there are some necessary ingredients to A/B testing. And I’ll just say outright, not every domain is amenable to A/B testing. You can’t A/B test mergers and acquisitions. It’s something that happens once, you either acquire or you don’t acquire.

So you do have to have some necessary ingredients. You need to have enough units, mostly users, in order for the statistics to work out. So if you’re too small, it may be too early to A/B test. But what I find is that in software, it is so easy to run A/B testing and it is so easy to build a platform.

I don’t say it’s easy to build a platform. But once you build a platform, the incremental cost of running an experiment should approach zero. And we got to that at Microsoft, where after a while, the cost of running experiments was so low that nobody was questioning the idea that everything should be experimented with.

Now, I don’t think we were there at Airbnb for example. The platform at Airbnb was much less mature, and required a lot more analysts in order to interpret the results and to find issues with it. So I do think there’s this trade-off. If you’re willing to invest in the platform, it is possible to get the marginal cost to be close to zero. But when you’re not there, it’s still expensive, and there may be reasons why not to run A/B tests.

Lenny: You talked about how you may be too small to run A/B tests, and this is a constant question for startups is, when should we start running A/B tests? Do you have kind of a heuristic or a rule of thumb of, here’s a time you should really start thinking about running an A/B test?

Ronny Kohavi: Yeah, the million-dollar question that everybody asks. So actually, we’ll put this in the notes, but I gave a talk last year that I called practical defaults. And one of the things I show there is that unless you have at least tens of thousands of users, the math, the statistics just don’t work out for most of the metrics that you’re interested in.

In fact, I gave an actual practical number of a retail site with some conversion rate, trying to detect changes that are at least 5% beneficial, which is something that startups should focus on. They shouldn’t focus on the 1%, they should focus on the 5 and 10%. Then you need something like 200,000 users.

So start experimenting when you’re in the tens of thousands of users. You’ll only be able to detect large effects. And then once you get to 200,000 users, then the magic starts happening. Then you can start testing a lot more. Then you have the ability to test everything and make sure that you’re not degrading and getting value out of experimentation. So you asked for a rule of thumb: at 200,000 users, you’re magical. Below that, start building the culture, start building the platform, start integrating. So that as you scale, you start to see the value.
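To see where a figure like 200,000 users can come from, here is a rough power calculation. It is a sketch under stated assumptions, not the calculation from the talk: the interview only mentions “some conversion rate,” so the 5% baseline below is assumed, as are the conventional 5% two-sided significance level, 80% power, and the standard normal-approximation formula for comparing two proportions. The function name `users_per_variant` is made up for this example.

```python
import math


def users_per_variant(baseline_rate, relative_lift, z_alpha=1.96, z_power=0.84):
    """Approximate users needed in EACH variant to detect a relative lift.

    Uses the normal-approximation formula n = 2 * (z_alpha + z_power)^2 * var / delta^2,
    with var = p * (1 - p) taken at the baseline conversion rate.
    z_alpha = 1.96 corresponds to a two-sided 5% significance level,
    z_power = 0.84 to roughly 80% power.
    """
    delta = baseline_rate * relative_lift           # absolute change to detect
    variance = baseline_rate * (1 - baseline_rate)  # variance of a conversion (Bernoulli)
    n = 2 * (z_alpha + z_power) ** 2 * variance / delta ** 2
    return math.ceil(n)


# Assumed 5% baseline conversion rate (not stated in the interview).
for lift in (0.10, 0.05, 0.01):
    n = users_per_variant(0.05, lift)
    print(f"{lift:.0%} relative lift: ~{n:,} per variant, ~{2 * n:,} in total")
```

Under these assumptions, a 5% relative lift needs on the order of 120,000 users per variant, so a bit over 200,000 in total, while a 10% lift needs only tens of thousands. Because the required sample grows with the inverse square of the effect size, chasing 1% changes needs 25 times more users than chasing 5% changes, which is the reason given for why small sites can only detect large effects.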

Lenny: Love it. Coming back to this kind of concern people have of experimentation, keeps you from innovating and taking big bets, I know you have this framework overall evaluation criterion, and I think that helps with this. Can you talk a bit about that?

Ronny Kohavi: The OEC or the overall evaluation criterion is something that I think many people that start to dabble in A/B testing miss. And the question is, what are you optimizing for? And it’s a much harder question than people think because it’s very easy to say we’re going to optimize for money, revenue. But that’s the wrong question, because you can do a lot of bad things that will improve revenue. So there has to be some countervailing metric that tells you, how do I improve revenue without hurting the user experience?

So let’s take a good example with search. You can put more ads on a page and you will make more money. There’s no doubt about it. You will make more money in the short term. The question is, what happens to the user experience, and how is that going to impact you in the long term?

So we’ve run those experiments, and we were able to map out this number of ads causes this much increase to churn, this number of ads causes this much increase to the time that users take to find a successful result. And we came up with an OEC that is based on all these metrics that allows you to say, “Okay, I’m willing to take this additional money if I’m not hurting the user experience by more than this much.” So there’s a trade-off there.

One of the nice ways to phrase this is as a constrained optimization problem. I want you to increase revenue, but I’m going to give you a fixed amount of average real estate that you can use. So for one query, you can have zero ads. For another query, you can have three ads. For a third query, you can have wider, bigger ads. I’m just going to count the pixels that you take, the vertical pixels. And I will give you some budget. And if you can under the same budget make more money, you’re good to go.

So that to me turns the problem from a badly defined one, “let’s just make more money,” into a well-defined one. Any page can start plastering more ads and make more money short term, but that’s not the goal. The goal is long-term growth and revenue. Then you need to insert these other criteria, and what am I doing to the user experience? One way around it is to put this constraint. Another one is just to have these other metrics. Again, something that we did, to look at the user experience. How long does it take the user to reach a successful click? What percentage of sessions are successful? These are key metrics that were part of the overall evaluation criterion that we used.
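The pixel-budget framing can be illustrated with a small check. This is a hypothetical sketch, not Bing’s actual evaluation: the per-query data, the function names, and the pass/fail rule are assumptions, and a real decision would also need a statistical test on the revenue difference.

```python
from statistics import mean

# Each query is (vertical_ad_pixels, revenue). All numbers are made up.
control = [(0, 0.00), (120, 0.40), (240, 0.90), (120, 0.30)]
reallocated = [(0, 0.00), (60, 0.35), (300, 1.20), (120, 0.30)]
plastered = [(240, 0.50), (240, 0.60), (360, 1.20), (240, 0.50)]


def summarize(queries):
    """Average ad real estate and average revenue per query."""
    avg_pixels = mean(px for px, _ in queries)
    avg_revenue = mean(rev for _, rev in queries)
    return avg_pixels, avg_revenue


def within_budget_and_better(control, treatment):
    """True if the treatment earns more without using more average ad pixels."""
    budget, control_revenue = summarize(control)
    used, treatment_revenue = summarize(treatment)
    return used <= budget and treatment_revenue > control_revenue


print(within_budget_and_better(control, reallocated))  # same average pixels, more revenue
print(within_budget_and_better(control, plastered))    # more revenue, but over budget
```

The `reallocated` variant moves pixels between queries while keeping the average the same, so any revenue gain comes from smarter placement. The `plastered` variant makes more money only by taking more space, which the budget constraint rejects.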

I can give you another example by the way, from the hotel industry or Airbnb that we both worked at. You can say, “I want to improve conversion rate,” but you can be smarter about it and say, “It’s not just enough to convert a user to buy or to pay for a listing. I want them to be happy with it several months down the road when they actually stay there.”

So that could be part of your OEC to say, “What is the rating that they will give to that listing when they actually stay there?” And that causes an interesting problem, because you don’t have this data now. You’re going to have it three months from now when they actually stay. So you have to build the training set that allows you to make a prediction about whether this user, whether Lenny is going to be happy at this cheap place. Or whether no, I should offer him something more expensive, because Lenny likes to stay at nicer places where the water actually is hot and comes out of the faucet.

Lenny: That is true. Okay, so it sounds like the core to this approach is basically have a drag metric that makes sure you’re not hurting something that’s really important to the business, and then being very clear on what’s the long-term metric we care most about.

Ronny Kohavi: To me, the key word is lifetime value, which is you have to define the OEC such that it is causally predictive of the lifetime value of the user. And that’s what causes you to think about things properly, which is, am I doing something that just helps me short term, or am I doing something that will help me in the long term? Once you put that model of lifetime value, people say, “Okay, what about retention rates? We can measure that. What about the time to achieve a task? We can measure that.” And those are these countervailing metrics that make the OEC useful.

Lenny: And to understand these longer term metrics, what I’m hearing is use models, and forecast, and predictions, or would you suggest sometimes use a long-term holdout or some other approach? What do you find is the best way to see these long term?

Ronny Kohavi: Yeah, so there’s two ways that I like to think about it. One is you can run long-term experiments for the goal of learning something. So I mentioned that at Bing, we did run these experiments where we increased the ads and decreased the ads, so that we will understand what happens to key metrics.

The other thing is you can just build models that use some of our background knowledge or use some data science to look at historical… I’ll give you another good example of this. When I came to Amazon, one of the teams that reported to me was the email team. It was not the transactional emails, where you buy something and you get an email. It was the team that sent these recommendations. “Here’s a book by an author that you bought. Here’s a product that we recommend.” And the question is, how do we give credit to that team?

And the initial version was, whenever a user comes from the email and purchases something on Amazon, we’re going to give that email credit. Well, it turned out this had no countervailing metric. The more emails you send, the more money you’re going to credit the team. And so that led to spam. Literally a really interesting problem. The team just ramped up the number of emails that they were sending out, and claimed to make more money, and their fitness function improved.

So then we backed up and we said, “Okay, we can either phrase this as a constraint satisfaction problem, where you’re allowed to send a user an email every X days, or,” which is what we ended up doing, “let’s model the cost of spamming the users.”

What’s that cost? Well, when they unsubscribe, we can’t mail them. So we did some data science study on the side and we said, “What is the value that we’re losing from an unsubscribe?” And we came up with a number, it was a few dollars. But the point was, now we have this countervailing metric. We say, “Here’s the money that we generate from the emails. Here’s the money that we’re losing on long-term value. What’s the trade-off?” And then when we started to incorporate that formula, more than half the campaigns that were being sent were negative.
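The trade-off described here, email revenue minus the long-term value lost to unsubscribes, can be written as a one-line formula. The sketch below is illustrative only: the interview says the cost of an unsubscribe was “a few dollars,” so the 3.0 used here is a placeholder, and the `Campaign` record and campaign figures are invented.

```python
from dataclasses import dataclass


@dataclass
class Campaign:
    name: str
    revenue: float     # revenue credited to purchases that came from the email
    unsubscribes: int  # users who unsubscribed after receiving this campaign


# Placeholder value; the interview only says "a few dollars".
UNSUBSCRIBE_COST = 3.0


def net_value(campaign, unsubscribe_cost=UNSUBSCRIBE_COST):
    """Credited revenue minus the estimated future value lost to unsubscribes."""
    return campaign.revenue - campaign.unsubscribes * unsubscribe_cost


campaigns = [
    Campaign("new book by an author you bought", 5000.0, 200),
    Campaign("broad product blast", 1000.0, 500),
]

negative = [c.name for c in campaigns if net_value(c) < 0]
print(negative)
```

Without the unsubscribe term, both campaigns look profitable and sending more email always looks better. With it, the broad blast turns negative, which mirrors the finding that more than half of the campaigns were net losers once the countervailing metric was included.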

So it was a huge insight at Amazon about how to send the right campaigns. And this is what I like about these discoveries. This fact that we integrated the unsubscribe led us to a new feature to say, “Well, let’s not lose their future lifetime value through email. When they unsubscribe, let’s offer them by default to unsubscribe from this campaign.”

So when you get an email saying there’s a new book by the author, the default unsubscribe would be “unsubscribe me from author emails.” And so now, the negative, the countervailing metric is much smaller. And so again, this was a breakthrough in our ability to send more emails, and understand based on what users were unsubscribing from, which ones are really beneficial.

Lenny: I love the surprising results.

Ronny Kohavi: We all love them. This is the humbling reality. And people talk about the fact that A/B testing sometimes leads you to incremental… I actually think that many of these small insights lead to fundamental insights about which areas to go, some strategies we should take, some things we should develop. Helps a lot.

Lenny: This makes me think about how every time I’ve done a full redesign of a product, I don’t think it has ever been a positive result. And then the team always ends up having to claw back what they just hurt and try to figure out what they messed up. Is that your experience too?

Ronny Kohavi: Absolutely, yeah. In fact, I’ve published some of these in LinkedIn posts showing a large set of big launches and redesigns that dramatically failed, and it happens very often. So the right way to do this is to say, “Yes, we want to do a redesign, but let’s do it in steps and test on the way and adjust,” so you don’t need to take 17 new changes, many of which are going to fail. Start to move incrementally in a direction that you believe is beneficial. Adjust on the way.

Lenny: The worst part of those experiences I find is it took three to six months to build it. And by the time it’s launched, it’s just like, “We’re not going to unlaunch this. Everyone’s been working in this direction. All the new features are assuming this is going to work,” and you’re basically stuck.

Ronny Kohavi: I mean, this is the sunk cost fallacy. “We invested so many years in it. Let’s launch this, even though it’s bad for the user.” No, that’s terrible. Yeah. So this is the other advantage of recognizing this humbling reality that most ideas fail. If you believe in those statistics that I published, then doing 17 changes together is more likely to be negative. Do them in smaller increments and learn from them. It’s called OFAT, one-factor-at-a-time. Do one factor, learn from it, and adjust. Of the 17, maybe you have four good ideas. Those are the ones that will launch and be positive.

Lenny: I generally agree with that, and always try to avoid a big redesign, but it’s hard to avoid them completely. There’s often team members that are really passionate like, “We just need to rethink this whole experience. We’re not going to incrementally get there.” Have you found anything effective in helping people either see this perspective, or just making a larger bet more successful?

Ronny Kohavi: By the way, I’m not opposed to large redesigns. I try to give the team the data to say, “Look, here are lots of examples where big redesigns fail.” Try to decompose your redesign, if not into one factor at a time, then into a small set of factors at a time. And learn from these smaller changes what works and what doesn’t.

Now, it’s also possible to do a complete redesign. Just, as you said yourself, be ready to fail. I mean, do you really want to work on something for six months or a year, and then run the A/B test, and realize that you’ve hurt revenues or other key metrics by several percentage points? And a data-driven organization will not allow you to launch. What are you going to write in your annual review?

Lenny: But nobody ever thinks it’s going to fail. They think, “No, we got this. We’ve talked to so many people.”

Ronny Kohavi: But I think organizations that start to run experiments are humbled early on from the smaller changes. Right? You’re right. I’ll tell you a funny story. When I came from Amazon to Microsoft, I joined the group, and for one reason or another, that group disbanded a month after I joined.

Ronny Kohavi: And so people came to me and said, “Look, you just joined the company. You’re at partner level. You figure out how you can help Microsoft.” And I said, “I’m going to build an experimentation platform,” because nobody at Microsoft is running experiments. And more than 50% of ideas in Amazon that we tried failed. And the classical response was, “We have better PMs here.”

Right? There was this complete denial that it’s possible that 50% of the ideas that Microsoft is implementing would fail, and in a three-year development cycle by the way. This is how long it took Office to release. It was a classical “every three years we release.”

And the data came about showing that Bing was the first to truly implement experimentation at scale. And we shared with the rest of the company the surprising results. And so when Office was… And this was credit to Qi Lu and Satya Nadella, they were the ones who said, “Ronny, you try to get Office to run experiments. We’ll give you the air support.” And it was hard, but we did it. It took a while, but Office started to run experiments, and they realized that many of their ideas are failing.

Lenny: You said that there’s a set of failed redesigns. Is that in your book or is that a site that you can point people to, to help build this case?

Ronny Kohavi: I teach this in my class, but I think I’ve posted this on LinkedIn and answered some questions. I’m happy to put that in the notes.

Lenny: Okay, cool. We’ll put that in the show notes. Because I think that’s the kind of data that often helps convince a team, “Maybe we shouldn’t rethink this entire onboarding flow from scratch. Maybe we should iterate towards and learn as we go.”

Wherever you work, running experiments is increasingly essential, but there are no commercial tools that integrate with a modern growth team stack. This leads to wasted time building internal tools or trying to run your own experiments through a clunky marketing tool.

When I was at Airbnb, one of the things that I loved most about working there was our experimentation platform, where I was able to slice and dice data by device types, country, user stage.

Is it ever worth just going, “Let’s just rethink this whole thing and just give it a shot,” to break out of a local minima or local maxima essentially?

Ronny Kohavi: Yeah. So I think what you said is fair. I do want to allocate some percentage of resources to big bets. As you said, we’ve been optimizing this thing to hell. Could we completely redesign it? It’s a very valid idea. You may be able to break out of a local minima. What I’m telling you is 80% of the time, you will fail. So be ready for that. What people usually expect is, “My redesign is going to work.” No, you’re most likely going to fail, but if you do succeed, it’s a breakthrough.

Lenny: I like this 80% rule of thumb. Is that just a simple way of thinking about it? 80% of your-

Ronny Kohavi: That’s my rule of thumb. And I’ve heard people say it’s 70% or 80%. But it’s in that area where I think when you talk about how much to invest in the known versus the high risk, high reward, that’s usually the right percentage that most organizations end up with for this allocation, right? You interviewed Shreyas. I think he mentioned that Google is like 70% for search and ads, and it’s 20% for some of the apps and new stuff, and then it’s the 10% for infrastructure.

Lenny: And I think the most important point there is if you’re not running an experiment, 70% of stuff you’re shipping is hurting your business.

Ronny Kohavi: Well, it’s not all hurting, it’s flat to negative. Some of them are flat. And by the way, flat to me, if something is not stat-sig, that’s a no-ship, because you’ve just introduced more code. There is a maintenance overhead to shipping your stuff. I’ve heard people say, “Look, we already spent all this time. The team will be demotivated if we don’t ship it.” And I’m like, “No, that’s wrong, guys. Let’s make sure that we understand that shipping this project has no value, is complicating the code base. Maintenance costs will go up.” You don’t ship on flat, unless it’s a legal requirement. When legal comes along and says, “You have to do X or Y,” you have to ship on flat or even negative. And that’s understandable.

But again, I think that’s something that a lot of people make the mistake of saying, “Legal told us we have to do this, therefore we’re going to take the hits.” No, legal gave you a framework that you have to work under. Try three different things, and ship the one that hurts the least.

Lenny: That reminds me when Airbnb launched the rebrand, even that they ran as an experiment with the entire homepage redesigned, the new logo, and all that. And I think there was a long-term holdout even, and I think it was positive in the end from what I remember.

Speaking of Airbnb, I want to chat about Airbnb briefly. I know you’re limited in what you can share, but it’s interesting that Airbnb seems to be moving in this other direction where it’s becoming a lot more top-down, Brian vision oriented. And Brian’s even talked about how he’s less motivated to run experiments. He doesn’t want to run as many experiments as they used to. Things are going well, and so it’s hard to argue with the success potentially. You worked there for many years. You ran the search team essentially. I guess, what was your experience like there? And then roughly, what’s your sense of how things are going, where it’s going?

Ronny Kohavi: Well as you know, I’m restricted from talking about Airbnb. I will say a few things that I am allowed to say. One is in my team in search relevance, everything was A/B tested. So while Brian can focus on some of the design aspects, the people who are actually doing the neural networks and the search, everything was A/B tested to hell. So nothing was launching without an A/B test. We had targets around improving certain metrics, and everything was done A/B test.

Now other teams, some did, some did not. I will say that when you say things are going well, I think we don’t know the counterfactual. I believe that had Airbnb kept people like Greg Greeley, who was pushing to be a lot more data-driven, and had Airbnb run more experiments, it would’ve been in a better state than today. But it’s the counterfactual. We don’t know.

Lenny: That’s a really interesting perspective. Airbnb’s such an interesting natural experiment of a way of doing things differently. There’s de-emphasizing experiments, and also, they turned off paid ads during Covid. And I don’t know where it is now, but it feels like it’s become a much smaller part of the growth strategy. Who knows if they’ve ramped it back up to where it was, but I think it’s going to be a really interesting case study looking back five, 10 years from now.

Ronny Kohavi: It’s a one-off experiment where it’s hard to assign value to some of the things that Airbnb is doing. I personally believe it could have been a lot bigger and a lot more successful if it had run more controlled experiments. But I can’t speak about some of those that I ran and that showed that some of the things that were initially untested were actually negative and could be better.

Lenny: All right. Mysterious. One more question. Airbnb, you were there during Covid, which was quite a wild time for Airbnb. We had Sanchan on the podcast talking about all the craziness that went on when travel basically stopped, and there was a sense that Airbnb was done, and travel’s not going to happen for years and years. What’s your take on experimentation in that world where you have to really move fast, make crazy decisions, and make big decisions? What was it like during that time?

Ronny Kohavi: So I think actually in a state like that, it’s even more important to run A/B tests, right? Because what you want to be able to see is if we’re making this change, is it actually helping in the current environment? There’s this idea of external generalizability. Is it going to work out now during Covid? Is it going to generalize later on? These are things that you can really answer with controlled experiments, and sometimes it means that you might have to replicate them six months down the road when Covid, say, is not as impactful as it is now.

Saying that you have to make decisions quickly, to me, I’ll point you to the success rate. If in peacetime you’re wrong two thirds to 80% of the time, why would you be suddenly right in wartime, in Covid time?

So I don’t believe in the idea that because bookings went down materially, the company should suddenly not be data driven and do things differently. I think if Airbnb stayed the course, did nothing, the revenue would’ve gone up in the same way.

Lenny: Fascinating.

Ronny Kohavi: In fact, if you look at one investment, one big investment that was done at the time was online experiences, and the initial data wasn’t very promising. And I think today, it’s a footnote.

Lenny: Yeah. Another case study for the history books, Airbnb experiences. I want to shift a little bit and talk about your book, which you mentioned a couple times. It’s called Trustworthy Online Controlled Experiments, and I think it’s basically the book on A/B testing. Let me ask you, what surprised you most about writing this book, and putting it out, and the reaction to it?

Ronny Kohavi: I was pleasantly surprised that it sold more than what we thought, more than what Cambridge predicted. So when we were first approached by Cambridge, after a tutorial that we did, to write a book, I was like, “I don’t know, this is too small of a niche area.”

And they were saying, “So you’ll be able to sell a few thousand copies and help the world.” And I found my co-authors, which are great. And we wrote a book that we thought is not statistically oriented, has fewer formulas than you normally see, and focuses on the practical aspects and on trust, which is the key.

The book, as I said, was more successful. It sold over 20,000 copies in English. It was translated to Chinese, Korean, Japanese, and Russian. And so it’s great to see that we help the world become more data-driven with experimentation, and I’m happy because of that. I was pleasantly surprised.

By the way, all proceeds from the book are donated to charity. So if I’m pitching the book here, there is no financial gain for me from having more copies sold. I think we made that decision, which was a good decision. All proceeds go to charity.

Lenny: Amazing. I didn’t know that. We’ll link to the book in the show notes. Trust is in the title. You just mentioned how important trust is to experimentation. A lot of people talk about, “How do I run experiments faster?” You focus a lot on trust. Why is trust so important in running experiments?

Ronny Kohavi: So to me, the experimentation platform is the safety net, and it’s an oracle. So it serves really two purposes. The safety net means that if you launch something bad, you should be able to abort quickly, right? Safe deployments, safe velocity. There’s some names for this. But this is one key value that the platform can give you.

The other one, which is the more standard one, is at the end of the two-week experiment, we will tell you what happened to your key metric and to many of the other surrogate, and debugging, and guardrail metrics. Trust takes time to build up, and it’s easy to lose.

And so to me, it is very important that when you present this and say, “This is science, this is a controlled experiment, this is the result,” you better believe that this is trustworthy.

And so I focus on that a lot. I think it allowed us to gain the organizational trust that this is really… And the nice thing is when we built all these checks to make sure that the experiment is correct, if there were something wrong with it, we would stop and say, “Hey, something is wrong with the experiment.”

And I think that’s something that some of the early implementations in other places did not do, and it was a big mistake. I’ve mentioned this in my book so I can mention this here.

Optimizely in its early days was very statistically naive. They sort of said, “Hey, we’re real time. We can compute your P values in real time,” and then you can stop an experiment when the P value is statistically significant. That is a big mistake. That materially inflates what’s called the type one error, or the false positive rate. So if you think you’ve got a 5% type one error, or you aim for that P value less than 0.05, using the real-time P value monitoring that Optimizely offered, you would probably have a 30% error rate.
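The inflation from stopping as soon as the P value dips below 0.05 can be seen in a small simulation of A/A tests, where there is no true difference, so every "significant" result is a false positive. This is a minimal sketch under stated assumptions: a metric that is standard normal with known variance, 20 evenly spaced looks, and a simple z-test. None of these details come from the transcript, and the exact inflation depends on how often you look.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def aa_test_is_flagged(rng, n_per_arm, checks, alpha, peek):
    """Run one A/A test; return True if it is declared significant."""
    step = n_per_arm // checks
    sum_a = sum_b = 0.0
    z = 0.0
    for i in range(1, checks + 1):
        # The sum of `step` independent N(0, 1) draws is N(0, step),
        # so each batch of users is drawn as a single value.
        sum_a += rng.gauss(0.0, math.sqrt(step))
        sum_b += rng.gauss(0.0, math.sqrt(step))
        n = i * step
        z = (sum_a / n - sum_b / n) / math.sqrt(2.0 / n)
        if peek and two_sided_p(z) < alpha:
            return True  # stopped early at the first "significant" look
    return two_sided_p(z) < alpha  # single test at the planned end


def false_positive_rate(peek, trials=2000, n_per_arm=1000, checks=20,
                        alpha=0.05, seed=0):
    rng = random.Random(seed)
    hits = sum(
        aa_test_is_flagged(rng, n_per_arm, checks, alpha, peek)
        for _ in range(trials)
    )
    return hits / trials


fixed_horizon_rate = false_positive_rate(peek=False)
peeking_rate = false_positive_rate(peek=True)
print(f"fixed horizon: {fixed_horizon_rate:.3f}, with peeking: {peeking_rate:.3f}")
```

Testing once at the planned end keeps the false positive rate near the nominal 5%, while stopping at the first significant look pushes it to several times that.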

So what this led to is that people that started using Optimizely thought that the platform was telling them they were very successful. But then they actually started to see, “Well, it told us this is positive revenue, but I don’t see this over time. By now, we should have made double the money.”

So questions started to come up around the trust in the platform. There’s a very famous post that somebody wrote about how, “Optimizely almost got me fired,” by a person who basically said, “Look, I came to the org. I said, ‘We have all these successes.’ But then I said, ‘Something is wrong.’”

And he tells of how he ran an A/A test, where there is no difference between the A and the B, and Optimizely told him that it was statistically significant too many times. Optimizely learned. Several people pointed this out. I pointed this out in my Amazon review of the book that the authors wrote early on. I said, “Hey, you’re not doing the statistics correctly.”

Ramesh Johari at Stanford pointed this out, became a consultant to the company, and then they fixed it. But to me, that’s a very good example of how to lose trust. They lost a lot of trust in the market. They lost all this trust because they built something that had very much inflated error rate.

Lenny: That is pretty scary to think about you’ve been running all these experiments, and they weren’t actually telling you accurate results. What are signs that what you’re doing may not be valid if you’re starting to run experiments? And then how do you avoid having that situation? What kind of tips can you share for people trying to run experiments?

Ronny Kohavi: There’s a whole chapter on that in my book, but I’ll mention the one that is the most common occurrence by far, which is a sample ratio mismatch. Now, what is a sample ratio mismatch?

If you design the experiment to send 50% of users to control and 50% of users to treatment, it’s supposed to be a random number, or a hash function. If you get something off from 50%, it’s a red flag.
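One common way to get a stable random split is to hash the user ID together with an experiment name and map the hash to a bucket. The sketch below is illustrative only: the function name `assign_variant`, the use of SHA-256, and the 100-bucket scheme are assumptions, not a description of any platform mentioned in this conversation.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Salting with the experiment name keeps splits independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in 0..99
    return "treatment" if bucket < treatment_pct else "control"


# Hypothetical experiment name and user IDs.
assignments = [assign_variant(f"user-{i}", "long_titles") for i in range(10_000)]
treatment_share = assignments.count("treatment") / len(assignments)
print(f"treatment share: {treatment_share:.3f}")
```

Because the assignment depends only on the inputs, the same user always lands in the same variant, and the observed share should sit close to the designed 50%.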

So let’s take a real example. Let’s say you’re running an experiment, and it’s large, it’s got a million users, and you’ve got 50.2. So people say, “Well, I don’t know. It’s not going to be exactly 50. Is 50.2 reasonable or not?” Well, there’s a formula that you can plug in. I have a spreadsheet available for those that are interested, and you can tell it, here’s how many users are in control. Here’s how many users are in treatment. My design was 50/50, and it tells you the probability that this could have happened by chance.

Now in a case like this, you plug in the numbers, it might tell you that this should happen one in half a million experiments. Well, unless you’ve run half a million experiments, it’s very unlikely that you would get a 50.2 versus 49.8 split. And therefore, something is wrong with the experiment.
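The check described here can be sketched as a chi-squared goodness-of-fit test on the two user counts. This is a minimal version using only the standard library, not the spreadsheet mentioned above; the counts of 498,000 and 502,000 are assumed for illustration, since the exact numbers behind the example are not given, and the resulting probability depends on them.

```python
import math


def srm_p_value(control_users, treatment_users, expected_control_share=0.5):
    """P-value that the observed split arose by chance under the designed ratio."""
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


# Assumed counts: one million users split 49.8% / 50.2%.
p = srm_p_value(498_000, 502_000)
print(f"SRM p-value: {p:.2e}")
if p < 0.001:
    print("Sample ratio mismatch: do not trust the scorecard until it is explained.")
```

A very small p-value says the imbalance is unlikely to be chance, so the experiment should be debugged before any metric is read. The 0.001 alert threshold is also an assumption.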

I remember when we implemented this check, we were surprised to see how many experiments suffered from this. Right? And there’s a paper that was published in 2018, where we share that at Microsoft, even though we’d been running experiments for a while, it was around 8% of experiments that suffered from the sample ratio mismatch.

And it’s a big number. Think about this. You’re running 20,000 experiments a year. So many of them, 8% of them, are invalid. And somebody has to go down and understand, what happened here? We know that we can’t trust the results, but why?

So over time, you begin to understand there’s something wrong with the data pipeline. There’s something that happens with bots. Bots are a very common factor for causing sample ratio mismatch. So that paper that was published by my team talks about how to diagnose sample ratio mismatches.

In the last probably year and a half, it was amazing to see all these third party companies implement sample ratio mismatch checks, and all of them were reporting, “Oh my god, 6%, 8%, 10%.” So it’s sometimes fun to go back and say, how many of your results in the past were invalid before you had this sample ratio mismatch test?

Lenny: Yeah, that’s frightening. Is the most common reason this happens is you’re assigning users in the wrong place in your code?

Ronny Kohavi: So when you say most common, I think the most common is bots. Somehow, they hit the control or the treatment in different proportions. Because you change the website, the bot may fail to parse the page, and try to hit it more often. And that’s a classical example. Another one is just the data pipeline.

We’ve had cases where we were trying to remove bad traffic under certain conditions, and it was skewed between the control and treatment. I’ve seen people that start an experiment in the middle of the site on some page, but they don’t realize that some campaign is pushing people from the side.

So there’s multiple reasons. It is surprising how often this happens. And I’ll tell you a funny story, which is when we first added this test to the platform, we just put a banner saying, “You have a sample ratio mismatch. Do not trust these results.” And we noticed that people ignored it. They were starting to present results that had this banner.

And so we blanked out the scorecard. We put a big red, “Can’t see this result. You have a sample ratio mismatch. Click to expose the results.” And why do we need that OK button? We need that OK button because you want to be able to debug the reasons, and sometimes the metrics help you understand why you have a sample ratio mismatch.

So we blanked out the scorecard, we had this button, and then we started to see that people pressed the button and still presented the results of experiments with sample ratio mismatch. And so we ended up with an amazing compromise, which is every number in the scorecard was highlighted with a red line, so that if you took a screenshot, other people could tell you had a sample ratio mismatch.

Lenny: Freaking product managers.

Ronny Kohavi: This is intuition. People just say, “Well, my [inaudible 01:00:30] was small, therefore I can still present the results.” People want to see success. I mean, this is a natural bias, and then we have to be very conscientious and fight that bias and say when something looks too good to be true, investigate.

Lenny: Which is a great segue to something you mentioned briefly, something called Twyman’s law. Yeah. Can you talk about that?

Ronny Kohavi: Yeah. So Twyman’s law, the general statement is that any figure that looks interesting or different is usually wrong. It was first said by this person in the UK who worked in radio media, but I’m a big fan of it. And my main claim to people is if the result looks too good to be true, your normal movement of an experiment is under 1% and you suddenly have a 10% movement, hold the celebratory dinner. That’s just your first reaction, right? Let’s take everybody to a fancy dinner, because we just improved revenue by millions of dollars. Hold that dinner, investigate, see, because there’s a large probability that something is wrong with the result. And I will say that nine out of 10 times, when we invoke Twyman’s law, it is the case that we find some flaw in the experiment.

Now there are obviously outliers. That first experiment that I shared, where we made long titles, that was successful. But that was replicated multiple times, and double and triple checked, and everything was good about it. Many other results that were so big turn out to be false. So I’m a big fan of Twyman’s law. There’s a deck, I could also give this in the notes, where I shared some real examples of Twyman’s law.

Lenny: Amazing. I want to talk about rolling this out at companies and things that you run into that fail. But before I get to that, I’d love for you to explain P value. I know that people kind of misunderstand it, and this might be a good time to just help people understand, what is it actually telling you, P value of say 0.05?

Ronny Kohavi: I don’t know if this is the right forum for explaining P values, because the definition of a P value is simple. What it hides is very complicated. So I’ll say one thing, which is many people assign one minus P value as the probability that your treatment is better than control. So you ran an experiment, you got a P value of 0.02. They think there’s a 98% probability that the treatment is better than the control. That is wrong. So rather than defining P values, I want to caution everybody that the most common interpretation is incorrect.

P value assumes, it’s a conditional probability or an assumed probability. It assumes that the null hypothesis is true. And we’re computing the probability that the data we’re seeing matches the hypothesis, this null hypothesis.

In order to get the probability that most people want, we need to apply Bayes’ rule and invert the probability from the probability of the data given the hypothesis, to the probability of the hypothesis given the data. For that, we need an additional number, which is the prior probability that the hypothesis that you’re testing is successful or not.

Ronny Kohavi: That’s an unknown. What we do is we can take historical data and say, “Look, people fail two thirds of the time or 80% of the time.” And we can apply that number and compute that. We’ve done that in a paper that I will give in the notes, so that you can assess the number that you really want, what’s called a false positive risk.

So I think that’s something for people to internalize, that what you really want to look at is this false positive risk, which tends to be much, much higher than the 5% that people think, right? So I think the classical example is Airbnb, where the failure rate was very, very high. When you get a statistically significant result, let me actually pull the note so that I know the actual number. If you’re at Airbnb, or Airbnb search where the success rate is only 8%, if you get a statistically significant result with a P value less than 0.05, there is a 26% chance that this is a false positive result. It’s not 5%, it’s 26%.
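The inversion with Bayes' rule can be written out in a few lines. This is a sketch under two assumptions not stated in the conversation: experiments are run with 80% power, and only the positive tail of a two-sided test at 0.05 counts as a win, so the chance of a false "win" under the null is 0.025. Under those assumptions, an 8% success rate gives a false positive risk of about 26%, consistent with the figure quoted above.

```python
def false_positive_risk(success_rate, alpha=0.05, power=0.8):
    """P(no real effect | statistically significant positive result).

    Bayes' rule with prior P(real effect) = success_rate.
    `power` and the alpha / 2 one-tail convention are assumptions.
    """
    prior_null = 1.0 - success_rate
    false_positives = (alpha / 2.0) * prior_null  # null true, yet a "win"
    true_positives = power * success_rate         # real effect, detected
    return false_positives / (false_positives + true_positives)


risk_at_8_pct = false_positive_risk(0.08)
risk_at_one_third = false_positive_risk(1.0 / 3.0)
print(f"success rate 8%:   false positive risk {risk_at_8_pct:.1%}")
print(f"success rate 1/3:  false positive risk {risk_at_one_third:.1%}")
```

The lower the historical success rate, the more of the significant results are false positives, which is why the prior matters so much.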

So that’s the number that you should have in your mind. And that’s why when I worked at Airbnb, one of the things we did is we said, “Okay, if you’re less than 0.05, but above 0.01, rerun, replicate.” When you replicate, you can combine the two experiments, and get a combined P value using something called Fisher’s method or Stouffer’s method, and that gives you the joint probability. And that’s usually much, much lower. So if you get two 0.05s or something like that, then the probability that you’ve got them is much, much lower.
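Both combination methods named here fit in a few lines of standard-library Python. This is a sketch that assumes the two experiments are independent and that the P values are one-sided in the same direction; how any particular platform implements the combination is not described in the conversation.

```python
import math
from statistics import NormalDist


def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p) is chi-squared with 2k degrees of freedom."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals statistic / 2
    # Closed-form survival function of chi-squared with an even number (2k) of df.
    return math.exp(-half_stat) * sum(
        half_stat ** i / math.factorial(i) for i in range(k)
    )


def stouffer_combined_p(p_values):
    """Stouffer's method: sum the z-scores and rescale by sqrt(k)."""
    normal = NormalDist()
    z = sum(normal.inv_cdf(1.0 - p) for p in p_values) / math.sqrt(len(p_values))
    return 1.0 - normal.cdf(z)


fisher_p = fisher_combined_p([0.05, 0.05])
stouffer_p = stouffer_combined_p([0.05, 0.05])
print(f"Fisher: {fisher_p:.4f}, Stouffer: {stouffer_p:.4f}")
```

Two replications that each sit at 0.05 combine to roughly 0.017 with Fisher's method and roughly 0.010 with Stouffer's, both well below either individual P value.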

Lenny: Wow, I’ve never heard it described that way. It makes me think about how even data scientists in our teams are always just like, “This isn’t perfect. We’re not 100% sure this experiment is positive.” But on balance, if we’re launching positive experiments, we’re probably doing good things. It’s okay if sometimes we’re wrong.

Ronny Kohavi: By the way, it’s true. On balance, you’re probably better than 50/50, but people don’t appreciate how high that 26% that I mentioned is. And the reason that I want to be sure is that I think it leads to this idea of the learning, the institutional knowledge. What you want to be able to do is share successes with the org. And so you want to be really sure that you’re successful. So by lowering the P value, by forcing teams to work with a P value maybe below 0.01 and do replication on higher ones, you can be much more successful, and the false positive rate will be much, much lower.

Lenny: Fascinating. And it also shows the value of keeping track of what percentage of your experiments are failing historically at the company or within that specific product. Say someone listening wants to start running experiments, say they have tens of thousands of users at this point. What would be the first couple steps you’d recommend?

Ronny Kohavi: Well, so if they have somebody in the org that has previously been involved with experiments, that’s a good person to consult internally. I think the key decision is whether you want to build or buy. There’s a whole series of eight sessions that I posted on LinkedIn where I invited guest speakers to talk about this problem. So if people are interested, they can look at what the vendors say and what agencies say about the build versus buy question. And it’s usually not a zero-one, it’s usually both. You build some and you buy some, and it’s a question of do you build 10% or do you build 90%?

I think for people starting, the third party products that are available today are pretty good. This wasn’t the case when I started working. So when I started running experiments at Amazon, we were building the platform because nothing existed. Same at Microsoft. I think today, there are enough vendors that provide good experimentation platforms that are trustworthy, that I would say it’s a good idea to consider using one of those.

Lenny: Say you’re at a company where there’s resistance to experimentation and A/B testing, whether it’s a startup or a bigger company. What have you found works in helping shift that culture, and how long does that usually take, especially at a larger company?

Ronny Kohavi: My general experience is with Microsoft, where we went with this beachhead of Bing. We were running a few experiments and then we were asked to focus on Bing, and we scaled experimentation and built a platform at scale at Bing.

Once Bing was successful and we were able to share all these surprising results, I think that many, many more people in the company were amenable. It also helped a lot that there’s the usual cross-pollination. People from Bing move out to other groups, and that helped these other groups say, “Hey, there’s a better way to build software.”

So I think if you’re starting out, find a place, find a team where experimentation is easy to run. And by that, I mean they’re launching often, right? Don’t go with the team that launches every six months, or Office used to launch every three years. Go with the team that launches frequently. They’re running on sprints, they launch every week or two. Sometimes they launch daily. I mean, Bing used to launch multiple times a day.

And then make sure that you understand the question of the OEC. Is it clear what they’re optimizing for? There are some groups where you can come up with a good OEC. Some groups are harder.

I remember one funny example was the microsoft.com website. This is not MSN, this is microsoft.com, which has multiple different constituencies: it is a support site, it is a way to sell software through the site, and it warns you about safety and updates. It has so many goals. I remember when the team said, “We want to run experiments,” and I brought the group in and some of the managers and I said, “Do you know what you’re optimizing for?”

It was very funny because they surprised me. They said, “Hey Ronny, we read some of your papers. We know there’s this term called OEC. We decided that time on site is our OEC.” And I said, “Wait a minute. One of your main goals is being a support site. Is people spending more time on the support site a good thing or a bad thing?” And then half the room thought that more time is better, and half the room thought that more time is worse. So an OEC is bad if directionally, you can’t agree on it.

Lenny: That’s a great tip. Along these same lines, I know you’re a big fan of platforms and building a platform to run experiments, versus just one-off experiments. Can you just talk briefly about that to give people a sense of where they probably should be going with their experimentation approach?

Ronny Kohavi: Yeah, so I think the motivation is to bring the marginal cost of experiments down to zero. So the more you self-service, go to a website, set up your experiment, define your targets, define the metrics that you want, right? People don’t appreciate that the number of metrics starts to grow really fast if you’re doing things right. At Bing, you could define 10,000 metrics that you wanted to be in your scorecard. Big numbers.

So it was so big, and people said it’s computationally inefficient. We broke them into templates so that if you were launching a UI experiment, you would get this set of 2,000. If you were doing a revenue experiment, you would get this set of 2,000.

So the point was build a platform that can quickly allow you to set up and run an experiment, and then analyze it. I think one of the things that I will say at Airbnb is the analysis was relatively weak, and so lots of data scientists were hired to be able to compensate for the fact that the platform didn’t do enough.

And this happens in other organizations too, where there’s this trade-off. If you’re building a good platform, invest in it so that more and more automation will allow people to look at the analysis, without the need to involve a data scientist.

We published a paper. Again, I’ll give it in the notes with this nice matrix of six axes, and how you move from crawl, to walk, to run, to fly, and what you need to build on those six axes. So one of the things that I do sometimes when I consult is I go into the org and say, “Where do you think you are on these six axes?” And that should be the guidance for what are the things you need to do next.

Lenny: This is going to be the most epic show notes episode we’ve had yet. Maybe a last question. We talked about how important trust is to running experiments, and how even though people talk about speed, trust ends up being most important. Still, I want to ask you about speed. Is there anything you recommend for helping people run experiments faster and get results more quickly that they can implement?

Ronny Kohavi: Yeah, so I’ll say a couple of things. One is if your platform is good, then when the experiment finishes, you should have a scorecard soon after. Maybe takes a day, but it shouldn’t be that you have to wait a week for the data scientist. To me, this is the number one way to speed up things.

Now, in terms of using the data efficiently, there are mechanisms out there under the title of variance reduction that help you reduce the variance of metrics so that you need fewer users, so that you can get results faster. Some examples that you might think about are capping metrics. So if your revenue metric is very skewed, maybe you say, “Well, if somebody purchased over $1,000, let’s make that $1,000.” At Airbnb, one of the key metrics, for example, is nights booked.

Well, it turns out that some people book tens of nights. They’re like an agency or something, hundreds of nights. You may say, “Okay, let’s just cap this. It’s unlikely that people book more than 30 days in a given month.” So that variance reduction technique will allow you to get statistically significant results faster.
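To make the capping idea concrete, here is a minimal sketch on made-up data. The helper names `cap` and `simulate_nights_booked`, the 0.5% share of agency-like accounts, and the booking ranges are all assumptions for illustration; only the 30-night cap comes from the conversation. It is not the method used at Airbnb or Bing, just one way to see the effect on variance.

```python
import random
import statistics


def cap(values, limit):
    """Replace every value above `limit` with `limit` (one-sided capping)."""
    return [min(v, limit) for v in values]


def simulate_nights_booked(n_users, seed=0):
    """Made-up skewed data: most users book 0-7 nights, about 0.5% book hundreds."""
    rng = random.Random(seed)
    nights = []
    for _ in range(n_users):
        if rng.random() < 0.005:  # hypothetical agency-like account
            nights.append(rng.randint(100, 400))
        else:
            nights.append(rng.randint(0, 7))
    return nights


raw = simulate_nights_booked(10_000)
capped = cap(raw, 30)  # the 30-night cap mentioned in the conversation

raw_var = statistics.variance(raw)
capped_var = statistics.variance(capped)

# For a fixed absolute effect size, the number of users needed scales roughly
# with the metric's variance, so this ratio approximates the sample-size saving.
print(f"variance raw={raw_var:.1f} capped={capped_var:.1f}")
print(f"approximate sample-size ratio: {raw_var / capped_var:.1f}x")
```

One design caveat: the capped metric is a slightly different metric from the raw one, so a change that mostly affects the heaviest users will look smaller after capping.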

And a third technique is called CUPED, which is an article that we published. Again, I can give it in the notes. It uses the pre-experiment data to adjust the result. And we can show that the result is still unbiased, but with lower variance, and hence, it requires fewer users.
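A minimal sketch of the CUPED adjustment described here, assuming the common single-covariate form: each user's metric y is replaced by y - theta * (x - mean(x)), where x is the same user's pre-experiment value and theta = cov(x, y) / var(x). The simulated data, the true effect of 0.2, and the helper names `cuped_adjust` and `mean_diff` are illustrative assumptions, not taken from the published article.

```python
import random
import statistics


def cuped_adjust(y, x):
    """Return y_i - theta * (x_i - mean(x)), with theta = cov(x, y) / var(x).

    theta and mean(x) are estimated on the pooled data (control + treatment).
    """
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / statistics.variance(x)
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]


def mean_diff(values, assign):
    """Treatment mean minus control mean."""
    treat = [v for v, t in zip(values, assign) if t == 1]
    ctrl = [v for v, t in zip(values, assign) if t == 0]
    return statistics.fmean(treat) - statistics.fmean(ctrl)


rng = random.Random(1)
n_users = 10_000
assign = [i % 2 for i in range(n_users)]            # 0 = control, 1 = treatment
pre = [rng.gauss(10, 3) for _ in range(n_users)]    # pre-experiment metric
# In-experiment metric: correlated with the pre-period, plus a true effect of 0.2.
y = [p + rng.gauss(0, 1) + 0.2 * t for p, t in zip(pre, assign)]

y_adj = cuped_adjust(y, pre)

print(f"variance raw={statistics.variance(y):.2f} adjusted={statistics.variance(y_adj):.2f}")
print(f"effect   raw={mean_diff(y, assign):.3f} adjusted={mean_diff(y_adj, assign):.3f}")
```

Because the pre-experiment value cannot be influenced by the assignment, subtracting it does not bias the treatment-versus-control difference; it only removes variance that the pre-period already explains. The stronger the correlation between the pre-period and in-experiment metric, the larger the reduction.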

Lenny: Ronny, is there anything else you want to share before we get to our very exciting lightning round?

Ronny Kohavi: No, I think we’ve asked a lot of good questions. Hope people enjoy this.

Lenny: I know they will.

Ronny Kohavi: Lightning round.

Lenny: Lightning round. Here we go. I’m just going to roll right into it. What are two or three books that you’ve recommended most to other people?

Ronny Kohavi: There’s a fun book called Calling Bullshit, which despite the name, which is a little extreme, I think, for a title, it actually has a lot of amazing insights that I love. And it sort of embodies, in my opinion, a lot of the Twyman’s law showing that things that are too extreme, your bullshit meter should go up and say, “Hey, I don’t believe that.” So that’s my number one recommendation.

There’s a slightly older book that I love called Hard Facts, Dangerous Half-Truths And Total Nonsense by the Stanford professors from the Graduate School of Business. Very interesting to see many of the things that we grew up with as well understood turn out to have no justification.

So a stranger book, which I love, sort of on the verge of psychology, it’s called Mistakes Were Made (But Not by Me), about all the fallacies that we fall into, and the humbling results from that.

Lenny: The titles of these are hilarious, and there’s a common theme across all these books. Next question, what is a favorite recent movie or TV show?

Ronny Kohavi: So I recently saw a short series called Chernobyl, about the disaster. I thought it was amazingly well done. Highly recommend it, based on true events. As usual, there’s some artistic freedom in the movie. It was kind of interesting at the end, they say, “This woman in the movie wasn’t really a woman. It was a bunch of 30 data scientists.” Not data scientists, 30 scientists that, in real life, presented all the data to the leadership on what to do.

Lenny: I remember that. Fun fact, I was born in Odessa, Ukraine, which was not so far from Chernobyl. And I remember my dad told me he had to go to work. They called him into work that day to clean some stuff off the trees. I think ash from the explosion or something. It was far away where I don’t think we were exposed, but we were in the vicinity. That’s pretty scary. My wife, every time something’s wrong with me, she’s like, “That must be a Chernobyl thing.” Okay, next question. Favorite interview question you like to ask people when you’re interviewing them?

Ronny Kohavi: So it depends on the interview, but when I do a technical interview, which I do less of, one question that I love, and it’s amazing how many people it throws off, for languages like C++, is: tell me what the static qualifier does. It has multiple uses: you can apply it to a variable, you can apply it to a function. And it is just amazing that I would say more than 50% of people that interview for an engineering job cannot get this, and get it awfully wrong.

Lenny: Definitely the most technical answer to this question yet.

Ronny Kohavi: Very technical, yeah.

I love it.

Lenny: Okay. What’s a favorite recent product you’ve discovered that you love?

Ronny Kohavi: Blink cameras. So a Blink camera is this small camera. You stick in two AA batteries, and it lasts for about six months. They claim up to two years. My experience is usually about six months. But it was just amazing to me how you can throw these things around in the yard and see things that you would never know otherwise. Some animals that go by. We had a skunk that we couldn’t figure out how he was entering, so I threw five cameras out and I saw where he came in.

Lenny: Where’d he come in?

Ronny Kohavi: He came in under a hole on the fence that was about this high. I have a video of this thing just squishing underneath. We never would’ve assumed that it came from there, from the neighbor. But yeah, these things have just changed. And when you’re away on a trip, it’s always nice to be able to say, “I can see my house. Everything’s okay.” At one point, we had a false alarm, and the cops came in and had this amazing video of how they’re entering the house and pulling the guns out.

Lenny: You got to share that on TikTok. That’s good content. Wow. Okay. Blink cameras. We’ll set those up in my house asap.

Ronny Kohavi: Yes.

Lenny: What is something relatively minor you’ve changed in the way your teams develop product, that has had a big impact on their ability to execute?

Ronny Kohavi: I think this is something that I learned at Amazon, which is a structured narrative. So Amazon has some variants of this, which sometimes go by the name of a six pager or something. But when I was at Amazon, I still remember that email from Jeff, which is, “No more PowerPoint. I’m going to force you to write a narrative.”

I took that to heart. And many of the features that the team presented instead of a PowerPoint, you start off with a structured document that tells you what you need, the questions you need to answer for your idea. And then we review them as a team.

At Amazon, these were paper-based. Now it’s all based on Word or Google Docs where people comment, and I think the impact of that was amazing. I think the ability to give people honest feedback and have them appreciate it, and have it stay after the meeting in these notes on the document, is just amazing.

Lenny: Final question, have you ever run an A/B test on your life, either your dating life, your family, your kids? And if so, what did you try?

Ronny Kohavi: So there aren’t enough units. Remember I said you need 10,000 of something to run true A/B tests? I will say a couple of things. One is I try to emphasize to my family, and friends, and everybody, this idea called the hierarchy of evidence. When you read something, there’s a hierarchy of trust levels. If something is anecdotal, don’t trust it. If there was a study, but it was observational, give it some bit of trust. As you move up to a natural experiment, and controlled experiments, and multiple controlled experiments, your trust levels should go up. So I think that that’s a very important thing that a lot of people miss when they see something in the news: where does it come from?

I have a talk that I’ve shared of all these observational studies that people made that were published. And then somehow, a controlled experiment was run later on and proved that it was directionally incorrect. So I think there’s a lot to learn about this idea of the hierarchy of evidence, and share it with our family, and kids, and friends. I think there’s a book that’s based on this. It’s like How to Read a Book.

Lenny: Well, Ronny, the experiment of us recording a podcast I think is 100% positive P value 0.0. Thank you so much for being here.

Ronny Kohavi: Thank you so much for inviting me and for great questions.

Lenny: Amazing. I appreciate that. Two final questions. Where can folks find you online if they want to reach out, and is there anything that listeners can do for you?

Ronny Kohavi: Finding me online is easy. It’s LinkedIn. And what can people do for me? Understand the idea of controlled experiments as a mechanism to make the right data-driven decisions. Use science. Learn more by reading my book if you want. Again, all proceeds go to charity. And if you want to learn more, there’s a class that I teach every quarter on Maven. We’ll put in the notes how to find it, and some discount for people who managed to stay all the way to the end of this podcast.

Lenny: Yeah, that’s awesome. We’ll include that at the top so people don’t miss it, so there’s going to be a code to get a discount on your course. Ronny, thank you again so much for being here. This was amazing.

Ronny Kohavi: Thank you so much.

Lenny: Bye everyone. Thank you so much for listening. If you found this valuable, you can subscribe to the show on Apple Podcasts, Spotify, or your favorite podcast app. Also, please consider giving us a rating or leaving a review, as that really helps other listeners find the podcast. You can find all past episodes or learn more about the show at lennyspodcast.com. See you in the next episode.

Glossary

English	中文
A/A test	A/A 测试
A/B testing	A/B 测试
air support	撑腰（上级支持）
backlog	待办列表
Bayes’ rule	贝叶斯定理
beach head	滩头阵地（beach head）
build or buy	自建还是购买（build or buy）
capping	封顶
constraint satisfaction problem	约束满足问题
controlled experiment	对照实验
counterfactual	反事实
cross pollination	人才流动（自然交叉传播）
CUPED	CUPED（利用实验前数据的方差缩减方法）
escalation	升级（告警升级）
external generalizability	外部可泛化性
false positive rate	假阳性率
false positive risk	假阳性风险（false positive risk）
Fisher’s method	Fisher 方法
fitness function	适应度函数
guardrail metrics	护栏指标
hierarchy of evidence	证据等级
holdout	留 holdout 组（对照留出组）
home run	本垒打
institutional knowledge	组织知识（institutional knowledge）
joint probability	联合概率
lifetime value	终身价值
local minima / local maxima	局部最小值 / 局部最大值
marginal cost	边际成本
naive	naive（天真的）
natural experiment	自然实验
neural networks	神经网络
North Star metrics	北极星指标
null hypothesis	原假设
observational study	观察性研究
OEC (Overall Evaluation Criterion)	OEC（总体评估准则）
OFAT (one-factor-at-a-time)	单因子实验
onboarding	onboarding（首次使用引导流程，原文保留）
online experiences	线上体验（online experiences）
oracle	神谕
P value	P 值
paid ads	付费广告
partner level	partner 级别
PM	PM（产品经理）
prior probability	先验概率
sample ratio mismatch	样本比例失调（原文保留 sample ratio mismatch）
scorecard	记分卡
search relevance	搜索相关性
show notes	节目笔记
six pager	六页纸
Statsig	Statsig（统计显著性，原文保留）
Stouffer’s method	Stouffer 方法
structured narrative	结构化叙事
sunk cost fallacy	沉没成本谬误
surrogate metrics	替代指标
templates	模板
Twyman’s law	Twyman’s law（特威曼定律，原文保留）
type one error	第一类错误
variance reduction	方差缩减
