[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <loom.20061003T123315-373@post.gmane.org>
Date: Tue, 3 Oct 2006 10:50:51 +0000 (UTC)
From: Gordon Cormack <gvcormac@...terloo.ca>
To: linux-kernel@...r.kernel.org
Subject: Re: Spam, bogofilter, etc
Linus Torvalds <torvalds <at> osdl.org> writes:
> I'm sorry, but spam-filtering is simply harder than the bayesian
> word-count weenies think it is. I even used to _know_ something about
> bayesian filtering, since it was one of the projects I worked on at uni,
> and dammit, it's not a good approach, as shown by the fact that it's
> trivial to get around.
Linus, I've seen no evidence that statistical filters are trivial
to beat. Can you provide some?
> I don't know why people got so excited about the whole bayesian thing.
> It's fine as _one_ small clause in a bigger framework of deciding spam,
> but it's totally inappropriate for a "yes/no" kind of decision on its own.
Why is that? Statistical filters (so-called 'Bayesian) have lower
false positive and false negative rates than many other approaches.
Bogofilter is one of the better ones, although it is not particularly
Bayesian.
> If you want a yes/no kind of thing, do it on real hard issues, like not
> accepting email from machines that aren't registered MX gateways. Sure,
> that will mean that people who just set up their local sendmail thing and
> connect directly to port 25 will just not be able to email, but let's face
> it, that's why we have ISP's and DNS in the first place.
You are saying that this sort of false positive is acceptable to
you. With no corresponding claim as to the corresponding false
negative rate.
So-called yes/no values are simply tests with their own failure
rates. As such, they have strictly less information than
scores or probability estimates that offer a confidence
estimate as well. The trick is in combining several sources
of evidence, and 'Bayesian' is but one method of combining this
evidence.
>
> If you want to do word analysis, use it like SpamAssassin does it - with
> some Bayesian rule perhaps adding a few points to the score. That's
> entirely appropriate. But running bogo-filter _instead_ of spamassassin is
> just asinine.
Spamassassin performs quite poorly with the default weight
given to its statistical filter. It works much better
if you increase the weight. Many tests show that it works
better still if you simply discard the ad hoc rules and
rely on the 'Bayesian' filter alone. I have found that
almost all of the false positives I've encountered in
the last 3 years have been due to Spamassassin's ad hoc
rules, not its statistical filter.
References
http://plg.uwaterloo.ca/~gvcormac/trecspamtrack05
http://plg.uwaterloo.ca/~gvcormac/spamassassin.html
http://www.ceas.cc/2006/listabs.html#12.pdf
Gordon Cormack
University of Waterloo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists