Message-ID: <1068956658.12690.9.camel@tantor.nuclearelephant.com>
From: jonathan at nuclearelephant.com (Jonathan A. Zdziarski)
Subject: SPAM and "undisclosed recipients"

> There should be a way to stop the email spamming. You could use their
> weaknesses as a way to prevent spam. The fact is that most SPAM is sent in
> MASS quantities all at one time, or a very short interval. If servers could
> somehow have a "global awareness" of the activity of spammers this could be
> prevented.

We are working on adding new layers of "spam networking" on top of existing statistical filters, similar to what you are describing, and the great thing is that many of the open-source filter authors are working together to come up with new solutions.

One approach is a process we call inoculation. You can read about it here (an older copy of the draft, until they post the latest one we sent):

http://www.ietf.org/internet-drafts/draft-spamfilt-inoculation-00.txt

Another thing that has been discussed is the design of a peer-to-peer network for exchanging information about spams. The trick is to prevent any possibility of information leakage (e.g. you don't want to leak people's personal emails onto the network). There is no draft for this yet, but I've attached the initial email I sent out to the other authors about it.

We've already implemented the inoculation message format, and my own project (DSPAM) also supports this and several other methods of "spam networking", such as classification groups and even shared groups. (If you're really interested, you can read more about it at http://www.nuclearelephant.com/projects/dspam/.)

This is all an attempt to get past the 99.9% (1 in 1,000) accuracy plateau, as Bill Yerazunis [the author of CRM114] puts it, and to push toward 99.99% (1 in 10,000). Your point is well taken; statistical filtering is in itself extremely accurate, but the biggest weakness of filtering under the "Bayesian" buzzword is isolation.
Breaking past that iron curtain is definitely going to bring us to the next level of spam fighting.

Jonathan

-------------- next part --------------

OK, I have thought long and hard about this and have come up with a good starting point for a P2P fingerprinting network. If anyone is interested in this and would like to take part in the project, please let me know - if not, I'm going to go ahead and do it anyway ;)

First, let me qualify the concept. We have had 'classification groups' implemented in DSPAM for the past 2-3 weeks now. A classification group is used when a group member's filter is uncertain about a particular message (I presently use chi-square to determine confidence, but this could be determined using any algorithm). When uncertain, the filter queries the results of all the other group members' filters and makes a decision based on how many (if any) of the members' filters believe the message is spam. Within the past week, our collective system has caught an additional 40 spams that would otherwise have gotten through. Clearly it is proving helpful in (a) breaking down the isolation barrier between users of statistical filters, and (b) helping nudge past three-nines accuracy.

A P2P fingerprinting network runs on a similar concept; however, instead of performing Bayesian (or other statistical) calculations across a large network, the message is fingerprinted and a confidence match against known spam is performed. There is one significant caveat to get around: the solution must keep the contents of the email in question completely private. So how do you query users on a P2P network for their results without letting them see the email? Fingerprinting.

So far, I have come up with the idea below for fingerprinting an email. Basically, we fingerprint each line of the message and conveniently package the array of fingerprints for transmission:

1. Whitespace (NOT including newlines) should be removed from the message.
2. Each line of the message (subject and body) should be hashed individually using a one-way hashing algorithm such as MD5.
3. The resulting array of hashes can be base64-encoded (for transit) to create one large "fingerprint", _but in a way so that the individual hash values are not lost_.

Step 3 is not necessary to this concept's implementation, but might be good practice and convenient. For example:

From: Bob <bob@....com>
To: Bill <bill@...l.com>
Subject: This might be a spam

Bob,

This is some text.
This is some more text.

Becomes:

b8a9facfea0b01d5fdd4bbcaefbae494
bb0429bc1035da2210cfe4f9988ec7ae
1baba7198ab6892d764b4f7c1e8de939
68b329da9893e34099c7d8ad5cb9c940
9dd980027a8925f2e7deb8cb444493ba
24c29d4890980e676106852685ec0b4f

Which could encode to:

YjhhOWZhY2ZlYTBiMDFkNWZkZDRiYmNhZWZiYWU0OTRiYjA0MjliYzEwMzVkYTIyMTBjZmU0
Zjk5ODhlYzdhZTFiYWJhNzE5OGFiNjg5MmQ3NjRiNGY3YzFlOGRlOTM5NjhiMzI5ZGE5ODkz
ZTM0MDk5YzdkOGFkNWNiOWM5NDA5ZGQ5ODAwMjdhODkyNWYyZTdkZWI4Y2I0NDQ0OTNiYTI0
YzI5ZDQ4OTA5ODBlNjc2MTA2ODUyNjg1ZWMwYjRm

And it will decode back to the original hash values. The filter with the message in question can then perform the following process:

1. Hash the sender's host or IP address, From address, and Subject line into three different hashes and perform an OR query on the P2P server.
2. Search results will return a list of peers and which criteria were met; if no criteria were met at all, a more raw search can be performed using individual hashed(?) words from the subject (optional, as it requires somewhat of an information leak).
3. The filter chooses the peers it is most confident will have a useful answer, based on the criteria met.
4. The filter sends the fingerprint to each peer.
5. Each peer decodes the fingerprint into its original set of hashes. It searches for the single message in its database with the largest number of hash matches and returns a confidence level.
6. The filter, based on the results from all the peers and their confidence levels, determines whether it should mark the message as spam.

In English, what is going on is that each line of the message is irreversibly hashed and sent to a set of peers. Each peer determines how many lines of the message match any one spam in its hash list, and returns a confidence level. If 10 peers are queried and 7 return a 70% confidence level, for example, then the filter will know that the message is most likely spam. If, on the other hand, only 1 peer returns an 80% confidence level, the filter would in all likelihood accept the message to avoid a false positive.

From a peer perspective, each message that gets marked as spam on the system (not per-user) is hashed in the same fashion and stored by message-id or some other unique identifier in a database somewhere. Some type of binary tree with hash values as keys might be very useful in this circumstance, or a SQL-based environment where one could easily identify the message with the most matches at a very rapid rate. Something like:

MESSAGE_ID  VARCHAR2(256)
HASH        VARCHAR2(32)
FREQUENCY   NUMBER

This seems like a workable concept to me, but I would appreciate any input/suggestions on the idea, and would be interested in knowing who would add support for this type of P2P networking to their filter (both as a client and as a peer) should it get off the ground.

Jonathan
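[Editor's note: the fingerprinting and peer-matching scheme described above can be sketched in a few lines of Python. This is a minimal illustration of the steps in the email, not DSPAM's actual code; the function names (fingerprint, unpack, peer_confidence) and the in-memory spam_db dictionary are assumptions, and the P2P query/transport layer is omitted entirely.]

```python
import base64
import hashlib
import re


def fingerprint(subject, body):
    """Steps 1-3 above: strip whitespace (but not line breaks), MD5-hash
    each line of subject and body individually, then base64-encode the
    concatenated hex digests into one transit-friendly fingerprint."""
    hashes = []
    for line in [subject] + body.splitlines():
        stripped = re.sub(r"[ \t]+", "", line)            # step 1
        hashes.append(hashlib.md5(stripped.encode()).hexdigest())  # step 2
    # step 3: individual values survive because every MD5 hex digest
    # is exactly 32 characters long
    packed = base64.b64encode("".join(hashes).encode()).decode()
    return hashes, packed


def unpack(packed):
    """Peer side of step 5: decode a fingerprint back into its hash array."""
    joined = base64.b64decode(packed).decode()
    return [joined[i:i + 32] for i in range(0, len(joined), 32)]


def peer_confidence(message_hashes, spam_db):
    """Find the known spam sharing the most line hashes with the queried
    message and report the overlap fraction as a confidence level."""
    query = set(message_hashes)
    best = 0.0
    for spam_hashes in spam_db.values():
        matches = len(query & set(spam_hashes))
        best = max(best, matches / len(message_hashes))
    return best
```

A querying filter would send `packed` to its chosen peers; each peer runs `unpack` and `peer_confidence` against its own spam store and returns the resulting number, so the message text itself never crosses the network.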