
Better Bayesian Filtering

January 2003

(This article was given as a talk at the 2003 Spam Conference. It describes the work I’ve done to improve the performance of the algorithm described in A Plan for Spam, and what I plan to do in the future.)

The first discovery I’d like to present here is an algorithm for lazy evaluation of research papers. Just write whatever you want and don’t cite any previous work, and indignant readers will send you references to all the papers you should have cited. I discovered this algorithm after “A Plan for Spam” [1] was on Slashdot.

Spam filtering is a subset of text classification, which is a well established field, but the first papers about Bayesian spam filtering per se seem to have been two given at the same conference in 1998, one by Pantel and Lin [2], and another by a group from Microsoft Research [3].

When I heard about this work I was a bit surprised. If people had been onto Bayesian filtering four years ago, why wasn’t everyone using it? When I read the papers I found out why. Pantel and Lin’s filter was the more effective of the two, but it only caught 92% of spam, with 1.16% false positives.

When I tried writing a Bayesian spam filter, it caught 99.5% of spam with less than .03% false positives [4]. It’s always alarming when two people trying the same experiment get widely divergent results. It’s especially alarming here because those two sets of numbers might yield opposite conclusions. Different users have different requirements, but I think for many people a filtering rate of 92% with 1.16% false positives means that filtering is not an acceptable solution, whereas 99.5% with less than .03% false positives means that it is.

So why did we get such different numbers? I haven’t tried to reproduce Pantel and Lin’s results, but from reading the paper I see five things that probably account for the difference.

One is simply that they trained their filter on very little data: 160 spam and 466 nonspam mails. Filter performance should still be climbing with data sets that small. So their numbers may not even be an accurate measure of the performance of their algorithm, let alone of Bayesian spam filtering in general.

But I think the most important difference is probably that they ignored message headers. To anyone who has worked on spam filters, this will seem a perverse decision. And yet in the very first filters I tried writing, I ignored the headers too. Why? Because I wanted to keep the problem neat. I didn’t know much about mail headers then, and they seemed to me full of random stuff. There is a lesson here for filter writers: don’t ignore data. You’d think this lesson would be too obvious to mention, but I’ve had to learn it several times.

Third, Pantel and Lin stemmed the tokens, meaning they reduced e.g. both “mailing” and “mailed” to the root “mail”. They may have felt they were forced to do this by the small size of their corpus, but if so this is a kind of premature optimization.

Fourth, they calculated probabilities differently. They used all the tokens, whereas I only use the 15 most significant. If you use all the tokens you’ll tend to miss longer spams, the type where someone tells you their life story up to the point where they got rich from some multilevel marketing scheme. And such an algorithm would be easy for spammers to spoof: just add a big chunk of random text to counterbalance the spam terms.

Finally, they didn’t bias against false positives. I think any spam filtering algorithm ought to have a convenient knob you can twist to decrease the false positive rate at the expense of the filtering rate. I do this by counting the occurrences of tokens in the nonspam corpus double. I don’t think it’s a good idea to treat spam filtering as a straight text classification problem. You can use text classification techniques, but solutions can and should reflect the fact that the text is email, and spam in particular. Email is not just text; it has structure. Spam filtering is not just classification, because false positives are so much worse than false negatives that you should treat them as a different kind of error. And the source of error is not just random variation, but a live human spammer working actively to defeat your filter.
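As a sketch of that knob (Python is used for illustration only; the function name and clamping constants loosely follow the word-probability formula in "A Plan for Spam", and the `bias` parameter is a generalization of the doubling):

```python
def token_prob(spam_count, nonspam_count, nspam, nnonspam, bias=2.0):
    """Spam probability of one token, biased against false positives.

    bias multiplies the nonspam evidence; bias=2.0 is the "count
    nonspam occurrences double" rule described above.  Assumes the
    token occurs in at least one corpus.
    """
    g = bias * nonspam_count          # weighted occurrences in good mail
    b = spam_count                    # occurrences in spam
    p_spam = min(1.0, b / nspam)      # frequency among spam messages
    p_good = min(1.0, g / nnonspam)   # frequency among good messages
    # Clamp so no single token is ever absolutely decisive.
    return max(0.01, min(0.99, p_spam / (p_spam + p_good)))
```

Turning the knob further (bias = 3, 4, and so on) pushes every token's probability down, trading filtering rate for a lower false positive rate.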

Tokens

Another project I heard about after the Slashdot article was Bill Yerazunis’ CRM114 [5]. This is the counterexample to the design principle I just mentioned. It’s a straight text classifier, but such a stunningly effective one that it manages to filter spam almost perfectly without even knowing that’s what it’s doing.

Once I understood how CRM114 worked, it seemed inevitable that I would eventually have to move from filtering based on single words to an approach like this. But first, I thought, I’ll see how far I can get with single words. And the answer is, surprisingly far.

Mostly I’ve been working on smarter tokenization. On current spam, I’ve been able to achieve filtering rates that approach CRM114’s. These techniques are mostly orthogonal to Bill’s; an optimal solution might incorporate both.

“A Plan for Spam” uses a very simple definition of a token. Letters, digits, dashes, apostrophes, and dollar signs are constituent characters, and everything else is a token separator. I also ignored case.

Now I have a more complicated definition of a token: Case is preserved. Exclamation points are constituent characters. Periods and commas are constituents if they occur between two digits. This lets me get ip addresses and prices intact. A price range like $20-25 yields two tokens, $20 and $25. Tokens that occur within the To, From, Subject, and Return-Path lines, or within urls, get marked accordingly. E.g. “foo” in the Subject line becomes “Subject*foo”. (The asterisk could be any character you don’t allow as a constituent.) Such measures increase the filter’s vocabulary, which makes it more discriminating. For example, in the current filter, “free” in the Subject line has a spam probability of 98%, whereas the same token in the body has a spam probability of only 65%.
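A tokenizer along these lines can be sketched with a single regular expression (an approximation, not the actual implementation: the lookbehind enforces the digit-before-period rule, and the special case that splits a price range like $20-25 into two tokens is omitted):

```python
import re

# Constituent characters: letters, digits, dashes, apostrophes, dollar
# signs, and exclamation points.  A period or comma joins a token only
# when it sits between two digits (the lookbehind checks the left side,
# the [0-9] after it checks the right).
TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]+(?:(?<=[0-9])[.,][0-9][A-Za-z0-9$'!-]*)*")

def tokenize(text, context=None):
    """Split text into tokens, preserving case.  If context is e.g.
    "Subject", mark each token as "Subject*token"."""
    tokens = TOKEN_RE.findall(text)
    if context:
        tokens = [context + "*" + t for t in tokens]
    return tokens
```

This keeps ip addresses and prices intact while still treating a trailing period as a separator.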

Here are some of the current probabilities [6]:

Subject*FREE 0.9999
free!! 0.9999
To*free 0.9998
Subject*free 0.9782
free! 0.9199
Free 0.9198
Url*free 0.9091
FREE 0.8747
From*free 0.7636
free 0.6546

In the Plan for Spam filter, all these tokens would have had the same probability, .7602. That filter recognized about 23,000 tokens. The current one recognizes about 187,000.

The disadvantage of having a larger universe of tokens is that there is more chance of misses. Spreading your corpus out over more tokens has the same effect as making it smaller. If you consider exclamation points as constituents, for example, then you could end up not having a spam probability for free with seven exclamation points, even though you know that free with just two exclamation points has a probability of 99.99%.

One solution to this is what I call degeneration. If you can’t find an exact match for a token, treat it as if it were a less specific version. I consider terminal exclamation points, uppercase letters, and occurring in one of the five marked contexts as making a token more specific. For example, if I don’t find a probability for “Subject*free!”, I look for probabilities for “Subject*free”, “free!”, and “free”, and take whichever one is farthest from .5.

Here are the alternatives [7] considered if the filter sees “FREE!!!” in the Subject line and doesn’t have a probability for it.

Subject*Free!!!
Subject*free!!!
Subject*FREE!
Subject*Free!
Subject*free!
Subject*FREE
Subject*Free
Subject*free
FREE!!!
Free!!!
free!!!
FREE!
Free!
free!
FREE
Free
free
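In code, the degeneration lookup might look like this (a sketch: the variant ordering and helper names are mine, and the .4 default for wholly unseen tokens is the value used in "A Plan for Spam"):

```python
def degenerate(token):
    """Less specific versions of a token, most specific first: drop the
    context marking, strip terminal exclamation points, then try the
    initial-caps and all-lowercase forms."""
    context, _, word = token.rpartition("*")
    stripped = word.rstrip("!")
    out, seen, result = [], {token}, []
    for w in (word, stripped, stripped.capitalize(), stripped.lower()):
        if context:
            out.append(context + "*" + w)    # marked form first
        out.append(w)                        # then the unmarked form
    for t in out:                            # de-duplicate, keep order
        if t not in seen:
            seen.add(t)
            result.append(t)
    return result

def lookup(token, probs):
    """Probability for a token; if unknown, fall back to the known
    degeneration whose probability is farthest from .5."""
    if token in probs:
        return probs[token]
    known = [probs[t] for t in degenerate(token) if t in probs]
    if not known:
        return 0.4                           # default for unseen tokens
    return max(known, key=lambda p: abs(p - 0.5))
```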

If you do this, be sure to consider versions with initial caps as well as all uppercase and all lowercase. Spams tend to have more sentences in imperative mood, and in those the first word is a verb. So verbs with initial caps have higher spam probabilities than they would in all lowercase. In my filter, the spam probability of “Act” is 98% and for “act” only 62%.

If you increase your filter’s vocabulary, you can end up counting the same word multiple times, according to your old definition of “same”. Logically, they’re not the same token anymore. But if this still bothers you, let me add from experience that the words you seem to be counting multiple times tend to be exactly the ones you’d want to.

Another effect of a larger vocabulary is that when you look at an incoming mail you find more interesting tokens, meaning those with probabilities far from .5. I use the 15 most interesting to decide if mail is spam. But you can run into a problem when you use a fixed number like this. If you find a lot of maximally interesting tokens, the result can end up being decided by whatever random factor determines the ordering of equally interesting tokens. One way to deal with this is to treat some as more interesting than others.
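The decision step with a fixed number of tokens can be sketched as follows (assuming the naive-Bayes combination from "A Plan for Spam" and a .9 threshold; the tie-breaking problem described above is exactly what this version suffers from):

```python
def spam_probability(token_probs, n=15, threshold=0.9):
    """Combine the n most interesting token probabilities, i.e. the
    ones farthest from .5, using the naive-Bayes formula."""
    interesting = sorted(token_probs, key=lambda p: abs(p - 0.5),
                         reverse=True)[:n]
    prod = comp = 1.0
    for p in interesting:
        prod *= p
        comp *= 1.0 - p
    combined = prod / (prod + comp)
    return combined, combined > threshold
```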

For example, the token “dalco” occurs 3 times in my spam corpus and never in my legitimate corpus. The token “Url*optmails” (meaning “optmails” within a url) occurs 1223 times. And yet, as I used to calculate probabilities for tokens, both would have the same spam probability, the threshold of .99.

That doesn’t feel right. There are theoretical arguments for giving these two tokens substantially different probabilities (Pantel and Lin do), but I haven’t tried that yet. It does seem at least that if we find more than 15 tokens that only occur in one corpus or the other, we ought to give priority to the ones that occur a lot. So now there are two threshold values. For tokens that occur only in the spam corpus, the probability is .9999 if they occur more than 10 times and .9998 otherwise. Ditto at the other end of the scale for tokens found only in the legitimate corpus.
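The two-threshold rule is simple enough to state directly in code (a sketch; the cutoff of 10 occurrences and the four values are the ones given above):

```python
def single_corpus_prob(spam_count, nonspam_count):
    """Probability for a token seen in only one corpus.  Two threshold
    values at each end, so tokens that occur a lot sort as more
    interesting than tokens that occur a few times."""
    if nonspam_count == 0:                       # spam-only token
        return 0.9999 if spam_count > 10 else 0.9998
    if spam_count == 0:                          # nonspam-only token
        return 0.0001 if nonspam_count > 10 else 0.0002
    raise ValueError("token occurs in both corpora")
```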

I may later scale token probabilities substantially, but this tiny amount of scaling at least ensures that tokens get sorted the right way.

Another possibility would be to consider not just 15 tokens, but all the tokens over a certain threshold of interestingness. Steven Hauser does this in his statistical spam filter [8]. If you use a threshold, make it very high, or spammers could spoof you by packing messages with more innocent words.

Finally, what should one do about html? I’ve tried the whole spectrum of options, from ignoring it to parsing it all. Ignoring html is a bad idea, because it’s full of useful spam signs. But if you parse it all, your filter might degenerate into a mere html recognizer. The most effective approach seems to be the middle course, to notice some tokens but not others. I look at a, img, and font tags, and ignore the rest. Links and images you should certainly look at, because they contain urls.

I could probably be smarter about dealing with html, but I don’t think it’s worth putting a lot of time into this. Spams full of html are easy to filter. The smarter spammers already avoid it. So performance in the future should not depend much on how you deal with html.

Performance

Between December 10 2002 and January 10 2003 I got about 1750 spams. Of these, 4 got through. That’s a filtering rate of about 99.75%.

Two of the four spams I missed got through because they happened to use words that occur often in my legitimate email.

The third was one of those that exploit an insecure cgi script to send mail to third parties. They’re hard to filter based just on the content because the headers are innocent and they’re careful about the words they use. Even so I can usually catch them. This one squeaked by with a probability of .88, just under the threshold of .9.

Of course, looking at multiple token sequences would catch it easily. “Below is the result of your feedback form” is an instant giveaway.

The fourth spam was what I call a spam-of-the-future, because this is what I expect spam to evolve into: some completely neutral text followed by a url. In this case it was from someone saying they had finally finished their homepage and would I go look at it. (The page was of course an ad for a porn site.)

If the spammers are careful about the headers and use a fresh url, there is nothing in spam-of-the-future for filters to notice. We can of course counter by sending a crawler to look at the page. But that might not be necessary. The response rate for spam-of-the-future must be low, or everyone would be doing it. If it’s low enough, it won’t pay for spammers to send it, and we won’t have to work too hard on filtering it.

Now for the really shocking news: during that same one-month period I got three false positives.

In a way it’s a relief to get some false positives. When I wrote “A Plan for Spam” I hadn’t had any, and I didn’t know what they’d be like. Now that I’ve had a few, I’m relieved to find they’re not as bad as I feared. False positives yielded by statistical filters turn out to be mails that sound a lot like spam, and these tend to be the ones you would least mind missing [9].

Two of the false positives were newsletters from companies I’ve bought things from. I never asked to receive them, so arguably they were spams, but I count them as false positives because I hadn’t been deleting them as spams before. The reason the filters caught them was that both companies in January switched to commercial email senders instead of sending the mails from their own servers, and both the headers and the bodies became much spammier.

The third false positive was a bad one, though. It was from someone in Egypt and written in all uppercase. This was a direct result of making tokens case sensitive; the Plan for Spam filter wouldn’t have caught it.

It’s hard to say what the overall false positive rate is, because we’re up in the noise, statistically. Anyone who has worked on filters (at least, effective filters) will be aware of this problem. With some emails it’s hard to say whether they’re spam or not, and these are the ones you end up looking at when you get filters really tight. For example, so far the filter has caught two emails that were sent to my address because of a typo, and one sent to me in the belief that I was someone else. Arguably, these are neither my spam nor my nonspam mail.

Another false positive was from a vice president at Virtumundo. I wrote to them pretending to be a customer, and since the reply came back through Virtumundo’s mail servers it had the most incriminating headers imaginable. Arguably this isn’t a real false positive either, but a sort of Heisenberg uncertainty effect: I only got it because I was writing about spam filtering.

Not counting these, I’ve had a total of five false positives so far, out of about 7740 legitimate emails, a rate of .06%. The other two were a notice that something I bought was back-ordered, and a party reminder from Evite.

I don’t think this number can be trusted, partly because the sample is so small, and partly because I think I can fix the filter not to catch some of these.

False positives seem to me a different kind of error from false negatives. Filtering rate is a measure of performance. False positives I consider more like bugs. I approach improving the filtering rate as optimization, and decreasing false positives as debugging.

So these five false positives are my bug list. For example, the mail from Egypt got nailed because the uppercase text made it look to the filter like a Nigerian spam. This really is kind of a bug. As with html, the email being all uppercase is really conceptually one feature, not one for each word. I need to handle case in a more sophisticated way.

So what to make of this .06%? Not much, I think. You could treat it as an upper bound, bearing in mind the small sample size. But at this stage it is more a measure of the bugs in my implementation than some intrinsic false positive rate of Bayesian filtering.

Future

What next? Filtering is an optimization problem, and the key to optimization is profiling. Don’t try to guess where your code is slow, because you’ll guess wrong. Look at where your code is slow, and fix that. In filtering, this translates to: look at the spams you miss, and figure out what you could have done to catch them.

For example, spammers are now working aggressively to evade filters, and one of the things they’re doing is breaking up and misspelling words to prevent filters from recognizing them. But working on this is not my first priority, because I still have no trouble catching these spams [10].

There are two kinds of spams I currently do have trouble with. One is the type that pretends to be an email from a woman inviting you to go chat with her or see her profile on a dating site. These get through because they’re the one type of sales pitch you can make without using sales talk. They use the same vocabulary as ordinary email.

The other kind of spams I have trouble filtering are those from companies in e.g. Bulgaria offering contract programming services. These get through because I’m a programmer too, and the spams are full of the same words as my real mail.

I’ll probably focus on the personal ad type first. I think if I look closer I’ll be able to find statistical differences between these and my real mail. The style of writing is certainly different, though it may take multiword filtering to catch that. Also, I notice they tend to repeat the url, and someone including a url in a legitimate mail wouldn’t do that [11].

The outsourcing type are going to be hard to catch. Even if you sent a crawler to the site, you wouldn’t find a smoking statistical gun. Maybe the only answer is a central list of domains advertised in spams [12]. But there can’t be that many of this type of mail. If the only spams left were unsolicited offers of contract programming services from Bulgaria, we could all probably move on to working on something else.

Will statistical filtering actually get us to that point? I don’t know. Right now, for me personally, spam is not a problem. But spammers haven’t yet made a serious effort to spoof statistical filters. What will happen when they do?

I’m not optimistic about filters that work at the network level [13]. When there is a static obstacle worth getting past, spammers are pretty efficient at getting past it. There is already a company called Assurance Systems that will run your mail through Spamassassin and tell you whether it will get filtered out.

Network-level filters won’t be completely useless. They may be enough to kill all the “opt-in” spam, meaning spam from companies like Virtumundo and Equalamail who claim that they’re really running opt-in lists. You can filter those based just on the headers, no matter what they say in the body. But anyone willing to falsify headers or use open relays, presumably including most porn spammers, should be able to get some message past network-level filters if they want to. (By no means the message they’d like to send though, which is something.)

The kind of filters I’m optimistic about are ones that calculate probabilities based on each individual user’s mail. These can be much more effective, not only in avoiding false positives, but in filtering too: for example, finding the recipient’s email address base-64 encoded anywhere in a message is a very good spam indicator.
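As a sketch of that base-64 check (the minimum run length, the regex, and the offset handling here are my assumptions; an encoded run in a real message may start mid-group, hence the four offsets):

```python
import base64
import re

B64_RUN = re.compile(r"[A-Za-z0-9+/=]{16,}")   # plausible base-64 runs

def contains_b64_address(message, address):
    """True if any base-64-looking run in the message decodes to text
    containing the recipient's address."""
    target = address.encode().lower()
    for run in B64_RUN.findall(message):
        for off in range(4):                   # run may start mid-group
            chunk = run[off:]
            chunk = chunk[: len(chunk) - len(chunk) % 4]  # whole groups
            try:
                if target in base64.b64decode(chunk).lower():
                    return True
            except Exception:                  # stray padding etc.
                pass
    return False
```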

But the real advantage of individual filters is that they’ll all be different. If everyone’s filters have different probabilities, it will make the spammers’ optimization loop, what programmers would call their edit-compile-test cycle, appallingly slow. Instead of just tweaking a spam till it gets through a copy of some filter they have on their desktop, they’ll have to do a test mailing for each tweak. It would be like programming in a language without an interactive toplevel, and I wouldn’t wish that on anyone.

Notes

[1] Paul Graham. “A Plan for Spam.” August 2002. http://paulgraham.com/spam.html.

Probabilities in this algorithm are calculated using a degenerate case of Bayes’ Rule. There are two simplifying assumptions: that the probabilities of features (i.e. words) are independent, and that we know nothing about the prior probability of an email being spam.

The first assumption is widespread in text classification. Algorithms that use it are called “naive Bayesian.”

The second assumption I made because the proportion of spam in my incoming mail fluctuated so much from day to day (indeed, from hour to hour) that the overall prior ratio seemed worthless as a predictor. If you assume that P(spam) and P(nonspam) are both .5, they cancel out and you can remove them from the formula.
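Concretely, with n token probabilities p_i = P(spam|token_i), the independence assumption, and the equal priors canceling, Bayes' Rule reduces to:

```latex
P(\mathrm{spam} \mid t_1, \ldots, t_n)
  = \frac{\prod_{i=1}^{n} p_i}{\prod_{i=1}^{n} p_i + \prod_{i=1}^{n} (1 - p_i)}
```

This is the combined probability used in "A Plan for Spam".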

If you were doing Bayesian filtering in a situation where the ratio of spam to nonspam was consistently very high or (especially) very low, you could probably improve filter performance by incorporating prior probabilities. To do this right you’d have to track ratios by time of day, because spam and legitimate mail volume both have distinct daily patterns.

[2] Patrick Pantel and Dekang Lin. “SpamCop— A Spam Classification & Organization Program.” Proceedings of AAAI-98 Workshop on Learning for Text Categorization.

[3] Mehran Sahami, Susan Dumais, David Heckerman and Eric Horvitz. “A Bayesian Approach to Filtering Junk E-Mail.” Proceedings of AAAI-98 Workshop on Learning for Text Categorization.

[4] At the time I had zero false positives out of about 4,000 legitimate emails. If the next legitimate email was a false positive, this would give us .03%. These false positive rates are untrustworthy, as I explain later. I quote a number here only to emphasize that whatever the false positive rate is, it is less than 1.16%.

[5] Bill Yerazunis. “Sparse Binary Polynomial Hash Message Filtering and The CRM114 Discriminator.” Proceedings of 2003 Spam Conference.

[6] In “A Plan for Spam” I used thresholds of .99 and .01. It seems justifiable to use thresholds proportionate to the size of the corpora. Since I now have on the order of 10,000 of each type of mail, I use .9999 and .0001.

[7] There is a flaw here I should probably fix. Currently, when “Subject*foo” degenerates to just “foo”, what that means is you’re getting the stats for occurrences of “foo” in the body or header lines other than those I mark. What I should do is keep track of statistics for “foo” overall as well as specific versions, and degenerate from “Subject*foo” not to “foo” but to “Anywhere*foo”. Ditto for case: I should degenerate from uppercase to any-case, not lowercase.

It would probably be a win to do this with prices too, e.g. to degenerate from “$129.99” to “$--9.99”, “$--.99”, and “$--”.

You could also degenerate from words to their stems, but this would probably only improve filtering rates early on when you had small corpora.

[8] Steven Hauser. “Statistical Spam Filter Works for Me.” http://www.sofbot.com.

[9] False positives are not all equal, and we should remember this when comparing techniques for stopping spam. Whereas many of the false positives caused by filters will be near-spams that you wouldn’t mind missing, false positives caused by blacklists, for example, will be just mail from people who chose the wrong ISP. In both cases you catch mail that’s near spam, but for blacklists nearness is physical, and for filters it’s textual.

[10] If spammers get good enough at obscuring tokens for this to be a problem, we can respond by simply removing whitespace, periods, commas, etc. and using a dictionary to pick the words out of the resulting sequence. And of course finding words this way that weren’t visible in the original text would in itself be evidence of spam.

Picking out the words won’t be trivial. It will require more than just reconstructing word boundaries; spammers both add (“xHot nPorn cSite”) and omit (“P#rn”) letters. Vision research may be useful here, since human vision is the limit that such tricks will approach.
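A minimal version of the dictionary approach (an assumption-laden sketch: greedy longest-match against a word list, which handles added separators but not, as the note says, omitted letters):

```python
import re

def pick_words(text, dictionary, min_len=3):
    """Strip whitespace and punctuation, then greedily pull dictionary
    words out of the remaining letter sequence, longest match first."""
    seq = re.sub(r"[^a-z]", "", text.lower())    # drop separators
    words, i = [], 0
    while i < len(seq):
        for j in range(len(seq), i + min_len - 1, -1):
            if seq[i:j] in dictionary:
                words.append(seq[i:j])
                i = j
                break
        else:
            i += 1                               # no word starts here
    return words

# Words recovered this way that were not visible in the original text
# are themselves evidence of spam.
```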

[11] In general, spams are more repetitive than regular email. They want to pound that message home. I currently don’t allow duplicates in the top 15 tokens, because you could get a false positive if the sender happens to use some bad word multiple times. (In my current filter, “dick” has a spam probability of .9999, but it’s also a name.) It seems we should at least notice duplication though, so I may try allowing up to two of each token, as Brian Burton does in SpamProbe.

[12] This is what approaches like Brightmail’s will degenerate into once spammers are pushed into using mad-lib techniques to generate everything else in the message.

[13] It’s sometimes argued that we should be working on filtering at the network level, because it is more efficient. What people usually mean when they say this is: we currently filter at the network level, and we don’t want to start over from scratch. But you can’t dictate the problem to fit your solution.

Historically, scarce-resource arguments have been the losing side in debates about software design. People only tend to use them to justify choices (inaction in particular) made for other reasons.

Thanks to Sarah Harlin, Trevor Blackwell, and Dan Giffin for reading drafts of this paper, and to Dan again for most of the infrastructure that this filter runs on.

A Plan for Spam
Plan for Spam FAQ
2003 Spam Conference Proceedings
Japanese Translation
Chinese Translation
Test of These Suggestions