Message-ID: <Pine.LNX.4.58.0310071442360.27612@soyokaze.cynistar.net>
From: apthorpe+fd at cynistar.net (Bob Apthorpe)
Subject: Spam with PGP
Hi,
Before you start explaining what SpamAssassin does and how it does it, I
suggest you visit http://www.spamassassin.org/, specifically the
README at http://www.spamassassin.org/full/2.7x/dist/README
On Tue, 7 Oct 2003, Jonathan A. Zdziarski wrote:
> [missing attribution] wrote:
> > Of course, SpamAssassin does bayesian filtering as well.
> >
> > heuristic + bayesian is better than either alone, IMHO.
> Actually the way SA does it weakens filtering. SA's bayesian filtering
> is only a very small piece of SA, and unfortunately not much attention
> has been given to it. The filter's final calculation is only a small
> percentage of the actual final score.
Here are SA's Bayesian scores; the four columns of scores are:
1: no network tests (DNSBLs, Razor, DCC, Pyzor), no Bayes
2: network tests, no Bayes
3: no network tests, Bayes
4: network tests, Bayes
score BAYES_00  0  0  -4.901  -4.900
score BAYES_01  0  0  -0.600  -1.524
score BAYES_10  0  0  -0.734  -0.908
score BAYES_20  0  0  -0.127  -1.428
score BAYES_30  0  0  -0.349  -0.904
score BAYES_40  0  0  -0.001  -0.001
score BAYES_44  0  0  -0.001  -0.001
score BAYES_50  0  0   0.001   0.001
score BAYES_56  0  0   0.001   0.001
score BAYES_60  0  0   1.789   1.592
score BAYES_70  0  0   2.142   2.255
score BAYES_80  0  0   2.442   1.657
score BAYES_90  0  0   2.454   2.101
score BAYES_99  0  0   5.400   5.400
Leaving aside the near-zero BAYES_50 and BAYES_56 entries, the lowest
positive Bayesian score (BAYES_60 w/network tests) is 1.592, providing
~32% of the (default) 5 points necessary for a message to be flagged as
spam. At the top end, BAYES_99 alone (5.400) exceeds that threshold. This
would appear to counter your claims that SA's Bayesian classifier provides
only a small fraction of the total score.
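
To make the arithmetic checkable, here is a minimal sketch (mine, not SA
code) that picks the score column from the two settings listed above and
expresses a Bayes score as a share of the default 5-point threshold. The
function names and the column-index formula are my own assumptions; only
the numbers come from the table.

```python
# Illustrative sketch only; this is not SpamAssassin source code.
# Columns, in the order listed above:
#   0: no network tests, no Bayes   1: network tests, no Bayes
#   2: no network tests, Bayes      3: network tests, Bayes
BAYES_SCORES = {
    "BAYES_00": (0, 0, -4.901, -4.900),
    "BAYES_50": (0, 0, 0.001, 0.001),
    "BAYES_60": (0, 0, 1.789, 1.592),
    "BAYES_70": (0, 0, 2.142, 2.255),
    "BAYES_80": (0, 0, 2.442, 1.657),
    "BAYES_90": (0, 0, 2.454, 2.101),
    "BAYES_99": (0, 0, 5.400, 5.400),
}

DEFAULT_THRESHOLD = 5.0  # the default "flag as spam" score cited above


def score_set_index(network_tests: bool, bayes: bool) -> int:
    """Map the two on/off settings to one of the four columns (assumed mapping)."""
    return (2 if bayes else 0) + (1 if network_tests else 0)


def share_of_threshold(rule: str, network_tests: bool, bayes: bool,
                       threshold: float = DEFAULT_THRESHOLD) -> float:
    """Return the rule's score as a fraction of the spam threshold."""
    score = BAYES_SCORES[rule][score_set_index(network_tests, bayes)]
    return score / threshold


if __name__ == "__main__":
    for rule in ("BAYES_60", "BAYES_80", "BAYES_99"):
        pct = 100 * share_of_threshold(rule, network_tests=True, bayes=True)
        print(f"{rule}: {pct:.1f}% of threshold")
```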
> Because true Bayesian filtering
> performs a huge majority of the same tests that SA performs, SA's own
> ruleset easily waters down any bayesian findings whenever there are
> opposing values between the two.
The Bayesian classifier does not perform the same rule-based heuristic
tests. Provided the end user has trained the Bayesian classifier with
reasonable care, it is rare for the statistical and heuristic scores to be
both large and of opposite sign.
> For example, a pine MUA...SA thinks a
> pine MUA suggests an innocent message, but a majority of the emails with
> a pine MUA my wife receives are spams. In this case, the hard-coded MUA
> rule will unfortunately water down the score, even if Bayes thinks a
> pine MUA is spam. Obviously the pine MUA is just a small rule, but if
> you apply this to the other rules, you get the same results.
SA 2.5x had a number of negative-scoring tests that were easily forged
(various MUA signatures, REFERENCES, IN_REP_TO, PGP signatures, etc.).
To counter this known problem, these rules have either been dropped from
SA 2.60 or had their scores greatly reduced.
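
A rough sketch of why that change matters, assuming a message's total is
the sum of the scores of the rules it hits and that it is flagged at or
above the threshold. The two non-Bayes rule names and all of their scores
are hypothetical, invented for illustration; only the BAYES_80 value comes
from the table above.

```python
# Illustrative sketch only; rule names other than BAYES_80 are hypothetical.
REQUIRED_SCORE = 5.0  # default threshold cited earlier in this message


def total_score(hits: dict) -> float:
    """Sum the scores of every rule that matched the message."""
    return sum(hits.values())


def is_flagged(hits: dict, required: float = REQUIRED_SCORE) -> bool:
    """Assumed behavior: flag the message when the total reaches the threshold."""
    return total_score(hits) >= required


# A spam that forges a "friendly" header to trigger a negative-scoring rule.
with_large_negative_rule = {
    "BAYES_80": 1.657,                        # from the table (network tests, Bayes)
    "HYPOTHETICAL_SPAM_HEURISTICS": 4.0,      # made-up combined heuristic score
    "HYPOTHETICAL_FORGEABLE_MUA_RULE": -3.0,  # made-up large negative score
}

# The same message after the forgeable rule's score is cut to near zero.
with_reduced_negative_rule = dict(with_large_negative_rule)
with_reduced_negative_rule["HYPOTHETICAL_FORGEABLE_MUA_RULE"] = -0.1

if __name__ == "__main__":
    for label, hits in (("large negative score", with_large_negative_rule),
                        ("reduced negative score", with_reduced_negative_rule)):
        print(f"{label}: total={total_score(hits):.3f} flagged={is_flagged(hits)}")
```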
> What's worse is that last time I looked (this may have changed), SA's
> bayesian filter did not appear to have a mechanism for learning, but was
> just a static dictionary. If users got spam there was no way for the
> user to forward their spams into the system for processing. Again, this
> may have changed and if it has, that's great.
SA has included sa-learn for manual training ever since the Bayesian
classifier was incorporated into the code (v2.50). Additionally, SA
has score thresholds above and below which messages are automatically
learned as spam and ham respectively, so the system trains itself (albeit
slowly) without user intervention.
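
The idea behind threshold-based self-training can be sketched as follows.
This is my own simplification, not SA's implementation: the two threshold
values are placeholders I picked for illustration, not SA's defaults, and
the function name is hypothetical.

```python
# Illustrative sketch only; threshold values are placeholders, not SA defaults.
from typing import Optional

LEARN_AS_HAM_BELOW = 0.0    # hypothetical: scores under this train as ham
LEARN_AS_SPAM_ABOVE = 10.0  # hypothetical: scores over this train as spam


def auto_learn_decision(score: float,
                        ham_below: float = LEARN_AS_HAM_BELOW,
                        spam_above: float = LEARN_AS_SPAM_ABOVE) -> Optional[str]:
    """Return "ham" or "spam" for confidently scored mail, else None.

    Messages scoring between the two thresholds are not learned at all,
    which is why self-training of this kind proceeds slowly.
    """
    if score < ham_below:
        return "ham"
    if score > spam_above:
        return "spam"
    return None


if __name__ == "__main__":
    for score in (-2.5, 3.0, 6.0, 14.2):
        print(f"score {score:5.1f} -> {auto_learn_decision(score)}")
```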
> The product of Bayesian filtering includes all the heuristic tests as
> well, so having both _hurts_ you, and is not something you benefit
> from.
No, it does not, on all counts. You need to review the difference between
heuristic and statistical classifiers.
> It is much better to focus on creating a strong probability-based
> filter IMHO...and I think the statistics agree with me.
Then perhaps you should join forces with the people already performing
such statistical comparisons between SpamAssassin, CRM114, bogofilter, and
the like. The SA development list is at
http://lists.sourceforge.net/mailman/listinfo/spamassassin-devel
This problem (evading spam-filtering by including a bogus PGP sig) is a
recognized and dead issue. The solution is to keep your security tools
up-to-date. As SA filters more spam, spammers will find new ways around
the filters, heuristic, statistical, or otherwise.
--
Bob Apthorpe