Message-ID: <1065546489.3151.47.camel@tantor.nuclearelephant.com>
From: jonathan at nuclearelephant.com (Jonathan A. Zdziarski)
Subject: Spam with PGP
Sorry, I didn't mean to sound like a troll. I'll follow up with some
information. I guess I shouldn't have gone out to lunch right after
sending that email =)
First, check out http://www.paulgraham.com. He explains in great detail
how probability-based filters work. His explanation is framed in terms
of 'Bayesian' filtering, but the same ideas apply to Chi-Square and
other similar types of filtering that rely on mathematical
probabilities.
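
To make the idea concrete, here is a minimal sketch of a Graham-style
probability filter. It is a simplification for illustration, not DSPAM's
code or Graham's exact algorithm: the whitespace tokenizer, the 0.4
default for unseen tokens, the 0.01/0.99 clamps, and the cutoff of 15
tokens are assumptions chosen for this example, and all function names
are made up.

```python
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def train(messages):
    # Count how many messages each token appears in.
    counts = Counter()
    for message in messages:
        counts.update(set(tokenize(message)))
    return counts, len(messages)


def token_probability(token, spam, n_spam, ham, n_ham, unknown=0.4):
    # Probability that a message containing this token is spam.
    s, h = spam[token], ham[token]
    if s + h == 0:
        return unknown  # never seen before: slightly innocent
    spam_freq = min(1.0, s / n_spam)
    ham_freq = min(1.0, h / n_ham)
    p = spam_freq / (spam_freq + ham_freq)
    return min(0.99, max(0.01, p))  # never fully certain


def spam_probability(text, spam, n_spam, ham, n_ham, keep=15):
    probs = [token_probability(t, spam, n_spam, ham, n_ham)
             for t in set(tokenize(text))]
    # Use only the tokens that are furthest from neutral (0.5).
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod_spam = prod_ham = 1.0
    for p in probs[:keep]:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)


# Tiny made-up corpus, for illustration only.
spam_counts, n_spam = train(["buy cheap pills now", "cheap pills online"])
ham_counts, n_ham = train(["lunch meeting at noon",
                           "patch for the kernel bug"])

spam_score = spam_probability("cheap pills", spam_counts, n_spam,
                              ham_counts, n_ham)
ham_score = spam_probability("kernel patch meeting", spam_counts, n_spam,
                             ham_counts, n_ham)
```

Because the counts come from mail the filter has been trained on, the
scores shift whenever the training data does, with no rule to rewrite.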
Heuristic filters are based on a set of static rules that identify
characteristics of spam. Some of the drawbacks of this are:
- The rules are not specific to each user's own behavior, which severely
hampers accuracy
- The rules require constant updating as spammers are always
circumventing the latest rulesets
- Most of these filters, SpamAssassin included, have no way of
learning; they must be reprogrammed
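
For contrast, a static rule-based scorer can be sketched as follows. The
rules, weights, and threshold here are invented for illustration and are
not taken from SpamAssassin's ruleset; the only idea borrowed is that
each matching rule adds to a score that is compared against a threshold.

```python
import re

# Hypothetical rules: (pattern, weight). Not real SpamAssassin rules.
RULES = [
    (re.compile(r"viagra", re.IGNORECASE), 3.0),
    (re.compile(r"click here", re.IGNORECASE), 1.5),
    (re.compile(r"!{3,}"), 1.0),
]
THRESHOLD = 4.0  # assumed cutoff for this sketch


def heuristic_score(text):
    # Sum the weights of every rule whose pattern matches.
    return sum(weight for pattern, weight in RULES if pattern.search(text))


def is_spam(text):
    return heuristic_score(text) >= THRESHOLD


caught = is_spam("Click here for VIAGRA")
evaded = is_spam("Cl1ck here for V1AGRA")
```

The second message slips through because the rules are fixed strings.
Catching it means someone has to write and ship a new rule, for every
user at once.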
Here is a short excerpt from the DSPAM FAQ about the difference between
DSPAM (my project) and SpamAssassin. I'm not knocking SpamAssassin; I
think it's a great tool, and it's good if you need out-of-the-box
filtering...but there are several long-term solutions that are much
better.
<snip>
SpamAssassin is based primarily on a set of rules to detect the
individual characteristics of spam. DSPAM, on the other hand, puts all
of its weight primarily on tokenized Bayesian filtering. The advantage
to using DSPAM's approach, I feel, is that almost all of the rules
SpamAssassin uses to identify the characteristics of spam are
automatically performed by DSPAM's approach. On top of this, because
DSPAM's analysis is on a per-user basis, it is able to determine just
how important each characteristic (or "rule" in SpamAssassin talk) is to
each user, rather than collectively. For example, SpamAssassin's first
rule is to identify if the MUA is pine. Many users receive more spam
than legitimate mail from a pine MUA. DSPAM performs this check
automatically as part of its Bayesian analysis and is able to calculate
the probability on a per-user basis, so a user who receives a lot of
innocent pine mail will get a more innocent probability than someone
whose only pine mail is spam. This keeps DSPAM very lightweight and
resource friendly. Out of
SpamAssassin's 921 rules, only 133 rules were not performed by the
advanced Bayesian filtering of DSPAM. Out of that 133, 39 were
duplicates, range rules, or nearly identical rules. 33 were blackhole
rules, 31 were rare, very low scoring, or unmeaningful rules, and 4 were
illogical. This left a total of 26 good rules performed by SpamAssassin
that were not performed by DSPAM. While these 26 remaining rules are
good, they do not positively identify spam by themselves; they only
pick up a few underlying characteristics that may or may not identify a
particular message (innocent or spam).
</snip>
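
The per-user point in the excerpt can be illustrated with a small
sketch. The user names and message counts below are invented for the
example, and the formula is the simple frequency ratio used in
Graham-style filters, not necessarily what DSPAM computes internally.

```python
def token_spam_probability(spam_hits, n_spam, ham_hits, n_ham):
    # Share of this user's spam containing the token, relative to the
    # share of their innocent mail containing it.
    spam_freq = spam_hits / n_spam
    ham_freq = ham_hits / n_ham
    return spam_freq / (spam_freq + ham_freq)


# Hypothetical per-user counts for one token: "MUA is pine".
users = {
    # alice gets lots of innocent mail from pine users
    "alice": dict(spam_hits=5, n_spam=100, ham_hits=40, n_ham=100),
    # bob almost only ever sees pine in spam
    "bob": dict(spam_hits=5, n_spam=100, ham_hits=1, n_ham=100),
}

pine_probability = {
    name: token_spam_probability(**counts)
    for name, counts in users.items()
}
```

With these made-up counts the same characteristic leans innocent for
alice and leans spam for bob, which a single shared rule weight cannot
express.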
As for alternatives, there's DSPAM, BogoFilter, Spambayes, and several
others. I can't say much about the rest, but I can tell you that DSPAM
uses a much more advanced approach, implementing Chained Tokens for
advanced language analysis, de-obfuscation techniques, etc.
All of these are great tools...and probability-based filtering is why
heuristic filters are obsolete...no trolling intended.
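
This message does not spell out how chained tokens are built. One
plausible reading, sketched here as an assumption, is that each pair of
adjacent words is added as an extra token alongside the single words, so
the filter can learn phrases as well as words. The function name and the
"+" separator are choices made for this example.

```python
def chained_tokens(text):
    # Single words plus every pair of adjacent words.
    words = text.split()
    pairs = [f"{first}+{second}"
             for first, second in zip(words, words[1:])]
    return words + pairs


tokens = chained_tokens("click here now")
```

A pair like "click+here" can then earn its own spam probability, even
when "click" and "here" are each fairly neutral on their own.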
On Tue, 2003-10-07 at 12:34, Gregory A. Gilliss wrote:
> Okay, maybe this is a troll, but in case it isn't how about listing
> some recommendations for spam filters to replace spamassassin? I'm
> sure there's probably still a few people on list using it who would
> be interested in what works better.