Filters that Fight Back

August 2003

We may be able to improve the accuracy of Bayesian spam filters by having them follow links to see what’s waiting at the other end. Richard Jowsey of death2spam now does this in borderline cases, and reports that it works well.
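To make that concrete, here is a rough sketch in Python of what the borderline-case behavior might look like. It assumes a spam_probability function standing in for whatever Bayesian scorer the filter already has, a token pattern loosely in the spirit of the one in A Plan for Spam, and thresholds that are made up for illustration.

    import re
    import requests

    SPAM_THRESHOLD, HAM_THRESHOLD = 0.9, 0.1   # borderline band lies between

    def score_with_link_following(msg_tokens, urls, spam_probability):
        """Score a message; in a borderline case, fetch each url once
        and rescore with the retrieved page's tokens mixed in."""
        p = spam_probability(msg_tokens)
        if HAM_THRESHOLD < p < SPAM_THRESHOLD:          # borderline case
            for url in urls:
                try:
                    page = requests.get(url, timeout=10).text
                except requests.RequestException:
                    continue                            # unreachable page: skip it
                msg_tokens = msg_tokens + re.findall(r"[a-zA-Z$'!-]+", page)
            p = spam_probability(msg_tokens)            # rescore with page text
        return p

Fetching once per url, and only for borderline messages, is what Jowsey does; the punish mode below is what you get when you stop being so restrained.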

Why only do it in borderline cases? And why only do it once?

As I mentioned in Will Filters Kill Spam?, following all the urls in a spam would have an amusing side-effect. If popular email clients did this in order to filter spam, the spammer’s servers would take a serious pounding. The more I think about this, the better an idea it seems. This isn’t just amusing; it would be hard to imagine a more perfectly targeted counterattack on spammers.

So I’d like to suggest an additional feature to those working on spam filters: a “punish” mode which, if turned on, would spider every url in a suspected spam n times, where n could be set by the user. [1]
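A minimal sketch of what such a mode might look like, assuming a user-settable n and deferring the question of which urls qualify to the blacklists discussed below:

    import requests

    def punish(urls, n):
        """Retrieve every url in a suspected spam n times.
        Responses don't matter; the load on the server is the point."""
        for url in urls:
            for _ in range(n):
                try:
                    requests.get(url, timeout=10)
                except requests.RequestException:
                    pass    # a dead server just means the punishment worked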

As many people have noted, one of the problems with the current email system is that it’s too passive. It does whatever you tell it. So far all the suggestions for fixing the problem seem to involve new protocols. This one wouldn’t.

If widely used, auto-retrieving spam filters would make the email system rebound. The huge volume of the spam, which has so far worked in the spammer’s favor, would now work against him, like a branch snapping back in his face. Auto-retrieving spam filters would drive the spammer’s costs up, and his sales down: his bandwidth usage would go through the roof, and his servers would grind to a halt under the load, which would make them unavailable to the people who would have responded to the spam.

Pump out a million emails an hour, get a million hits an hour on your servers. We would want to ensure that this is only done to suspected spams. As a rule, any url sent to millions of people is likely to be a spam url, so submitting every http request in every email would work fine nearly all the time. But there are a few cases where this isn’t true: the urls at the bottom of mails sent from free email services like Yahoo Mail and Hotmail, for example.

To protect such sites, and to prevent abuse, auto-retrieval should be combined with blacklists of spamvertised sites. Only sites on a blacklist would get crawled, and sites would be blacklisted only after being inspected by humans. The lifetime of a spam must be several hours at least, so it should be easy to update such a list in time to interfere with a spam promoting a new site. [2]
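The gating logic itself is simple. Something like the following, where BLACKLIST stands in for a human-vetted, regularly updated list and the host shown is just a placeholder:

    from urllib.parse import urlparse

    # Hypothetical human-vetted list of spamvertised hosts.
    BLACKLIST = {"spamvertised.example.com"}

    def should_crawl(url):
        """Crawl a url only if its host is on the blacklist."""
        return urlparse(url).hostname in BLACKLIST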

High-volume auto-retrieval would only be practical for users on high-bandwidth connections, but there are enough of those to cause spammers serious trouble. Indeed, this solution neatly mirrors the problem. The problem with spam is that in order to reach a few gullible people the spammer sends mail to everyone. The non-gullible recipients are merely collateral damage. But the non-gullible majority won’t stop getting spam until they can stop (or threaten to stop) the gullible from responding to it. Auto-retrieving spam filters offer them a way to do this.

Would that kill spam? Not quite. The biggest spammers could probably protect their servers against auto-retrieving filters. However, the easiest and cheapest way for them to do it would be to include working unsubscribe links in their mails. And this would be a necessity for smaller fry, and for “legitimate” sites that hired spammers to promote them. So if auto-retrieving filters became widespread, they’d become auto-unsubscribing filters.

In this scenario, spam would, like OS crashes, viruses, and popups, become one of those plagues that only afflict people who don’t bother to use the right software.


Notes

[1] Auto-retrieving filters will have to follow redirects, and should in some cases (e.g. a page that just says “click here”) follow more than one level of links. Make sure too that the http requests are indistinguishable from those of popular Web browsers, including the order and referrer.

If the response doesn’t come back within x amount of time, default to some fairly high spam probability.
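Here is one way those two requirements might look together, sketched with the requests library. The headers are illustrative rather than a real browser's exact set, and requests doesn't let you control header order, so a serious implementation would need a lower-level HTTP client; the 0.9 default is likewise just an assumed value for what happens when x expires.

    import requests

    # Illustrative browser-like headers; a real filter would copy a current
    # browser's exact headers, order included (requests can't guarantee order).
    BROWSER_HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
        "Referer": "http://www.google.com/",
    }

    TIMEOUT_SPAM_PROBABILITY = 0.9   # assumed default when the server stalls

    def fetch(url, timeout=10):
        """Fetch a url the way a browser would, following redirects.
        On timeout, report a fairly high spam probability instead."""
        try:
            r = requests.get(url, headers=BROWSER_HEADERS,
                             timeout=timeout, allow_redirects=True)
            return r.text, None
        except requests.Timeout:
            return None, TIMEOUT_SPAM_PROBABILITY
        except requests.RequestException:
            return None, None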

Instead of making n constant, it might be a good idea to make it a function of the number of spams that have been seen mentioning the site. This would add a further level of protection against abuse and accidents.
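Sketched as a function, with a cap supplying that extra protection (both numbers are assumptions, not recommendations):

    def retrieval_count(sightings, per_sighting=1, cap=1000):
        """Scale n with the number of spams seen mentioning the site,
        capped to bound the damage from abuse or accidents."""
        return min(per_sighting * sightings, cap)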

[2] The original version of this article used the term “whitelist” instead of “blacklist”. Though they were to work like blacklists, I preferred to call them whitelists because it might make them less vulnerable to legal attack. This just seems to have confused readers, though.

There should probably be multiple blacklists. A single point of failure would be vulnerable both to attack and abuse.

Thanks to Brian Burton, Bill Yerazunis, Dan Giffin, Eric Raymond, and Richard Jowsey for reading drafts of this.