Message-ID: <1068956658.12690.9.camel@tantor.nuclearelephant.com>
From: jonathan at nuclearelephant.com (Jonathan A. Zdziarski)
Subject: SPAM and "undisclosed recipients"

> There should be a way to stop the email spamming. You could use their
> weaknesses as a way to prevent spam. The fact is that most SPAM is sent in
> MASS quantities all at one time, or a very short interval. If servers could
> somehow have a "global awareness" of the activity of spammers this could be
> prevented.

We are working on adding new layers of "spam networking" on top of existing statistical filters, similar to what you are describing, and the great thing is that many of the open-source filter authors are working together to come up with new solutions.

One approach is a process we call inoculation. You can read about it here (an older copy of the draft, until they post the latest one we sent):

http://www.ietf.org/internet-drafts/draft-spamfilt-inoculation-00.txt

Another thing that has been discussed is the design of a peer-to-peer network for exchanging information about spams. The trick is to prevent any possibility of information leakage (e.g. you don't want to leak people's personal emails onto the network). There is no draft for this yet, but I've attached the initial email I sent out to the other authors about it.

We've already implemented the inoculation message format, and my own project (DSPAM) also supports this and several other methods of "spam networking", such as classification groups and even shared groups. (If you're really interested, you can read more about it at http://www.nuclearelephant.com/projects/dspam/.)

This is all an attempt to get past the 99.9% (1 in 1,000) accuracy plateau, as Bill Yerazunis [the author of CRM114] puts it, and to push toward 99.99% (1 in 10,000). Your point is well taken; statistical filtering is in itself extremely accurate, but the biggest weakness of filtering under the "Bayesian" buzzword is isolation.
Breaking past that iron curtain is definitely going to bring us to the next level of spam fighting.

Jonathan

-------------- next part --------------

OK, I have thought long and hard about this and have come up with a good starting point for a P2P fingerprinting network. If anyone is interested in this and would like to take part in the project, please let me know - if not, I'm going to go ahead and do it anyway ;)

First, let me qualify the concept. We have had 'classification groups' implemented in DSPAM for the past 2-3 weeks now. A classification group is used when a group member's filter is uncertain about a particular message (I presently use chi-square to determine confidence, but this could be determined using any algorithm). When uncertain, the filter queries the results of all the other group members' filters and makes a decision based on how many (if any) of the members' filters believe the message is spam. Within the past week, our collective system has caught an additional 40 spams that would otherwise have gotten through. Clearly it is proving helpful in (a) breaking down the isolation barrier between users of statistical filters, and (b) helping nudge past three-nines accuracy.

A P2P fingerprinting network runs on a similar concept; however, instead of performing Bayesian (or other statistical) calculations across a large network, the message is fingerprinted and a confidence match against known spam is performed. There is one significant caveat to get around: the solution must keep the contents of the email in question completely private. So how do you query users on a P2P network for their results without letting them see the email? Fingerprinting.

So far, I have come up with the idea below for fingerprinting an email. Basically, we fingerprint each line of the message and conveniently package the array of fingerprints for transmission:

1. Whitespace (NOT including newlines) should be removed from the message.
2. Each line of the message (subject and body) should be hashed individually using a one-way hashing algorithm such as MD5.
3. The resulting array of hashes can be base64-encoded (for transit) to create one large "fingerprint", _but in a way so that the individual hash values are not lost_.

Step 3 is not necessary to this concept's implementation, but might be good practice and convenient. For example:

From: Bob <bob@....com>
To: Bill <bill@...l.com>
Subject: This might be a spam

Bob,

This is some text.
This is some more text.

Becomes:

b8a9facfea0b01d5fdd4bbcaefbae494
bb0429bc1035da2210cfe4f9988ec7ae
1baba7198ab6892d764b4f7c1e8de939
68b329da9893e34099c7d8ad5cb9c940
9dd980027a8925f2e7deb8cb444493ba
24c29d4890980e676106852685ec0b4f

Which could encode to:

YjhhOWZhY2ZlYTBiMDFkNWZkZDRiYmNhZWZiYWU0OTRiYjA0MjliYzEwMzVkYTIyMTBjZmU0
Zjk5ODhlYzdhZTFiYWJhNzE5OGFiNjg5MmQ3NjRiNGY3YzFlOGRlOTM5NjhiMzI5ZGE5ODkz
ZTM0MDk5YzdkOGFkNWNiOWM5NDA5ZGQ5ODAwMjdhODkyNWYyZTdkZWI4Y2I0NDQ0OTNiYTI0
YzI5ZDQ4OTA5ODBlNjc2MTA2ODUyNjg1ZWMwYjRm

And it will decode back to the original hash values. The filter with the message in question can then perform the following process:

1. Hash the sender's host or IP address, From address, and Subject line into three different hashes and perform an OR query on the P2P server.
2. Search results will return a list of peers and which criteria were met; if no criteria were met at all, a more raw search can be performed using individual hashed(?) words from the subject (optional, as it requires somewhat of an information leak).
3. The filter chooses the peers it is most confident will have a useful answer, based on the criteria met.
4. The filter sends the fingerprint to each peer.
5. Each peer decodes the fingerprint into its original set of hashes. It searches for the single message in its database with the largest number of hash matches and returns a confidence level.
6. The filter, based on the results from all the peers and their confidence levels, determines whether it should mark the message as spam.

In English, what is going on is that each line of the message is irreversibly hashed and sent to a set of peers. Each peer determines how many lines of the message match any one spam in its hash list, and returns a confidence level. If 10 peers are queried and 7 return a 70% confidence level, for example, then the filter will know that the message is most likely spam. If, on the other hand, only 1 peer returns an 80% confidence level, the filter would in all likelihood accept the message to avoid a false positive.

From a peer perspective, each message that gets marked as spam on the system (not per-user) is hashed in the same fashion and stored by message-id or some other unique identifier in a database somewhere. Some type of binary tree with hash values as keys might be very useful in this circumstance, or a SQL-based environment where one could easily identify the message with the most matches at a very rapid rate. Something like:

MESSAGE_ID  VARCHAR2(256)
HASH        VARCHAR2(32)
FREQUENCY   NUMBER

This seems like a workable concept to me, but I would appreciate any input/suggestions on the idea, and would be interested in knowing who would add support for this type of P2P networking to their filter (both as a client and as a peer) should it get off the ground.

Jonathan
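[Editor's note: the fingerprinting and peer-matching scheme described above can be sketched in a few lines of Python. This is a minimal illustration of the steps in the email, not DSPAM's actual code; the function names (fingerprint, unpack, peer_confidence) and the in-memory spam_db dictionary are assumptions, and the P2P query/transport layer is omitted entirely.]

```python
import base64
import hashlib
import re


def fingerprint(subject, body):
    """Steps 1-3 above: strip whitespace (but not line breaks), MD5-hash
    each line of subject and body individually, then base64-encode the
    concatenated hex digests into one transit-friendly fingerprint."""
    hashes = []
    for line in [subject] + body.splitlines():
        stripped = re.sub(r"[ \t]+", "", line)            # step 1
        hashes.append(hashlib.md5(stripped.encode()).hexdigest())  # step 2
    # step 3: individual values survive because every MD5 hex digest
    # is exactly 32 characters long
    packed = base64.b64encode("".join(hashes).encode()).decode()
    return hashes, packed


def unpack(packed):
    """Peer side of step 5: decode a fingerprint back into its hash array."""
    joined = base64.b64decode(packed).decode()
    return [joined[i:i + 32] for i in range(0, len(joined), 32)]


def peer_confidence(message_hashes, spam_db):
    """Find the known spam sharing the most line hashes with the queried
    message and report the overlap fraction as a confidence level."""
    query = set(message_hashes)
    best = 0.0
    for spam_hashes in spam_db.values():
        matches = len(query & set(spam_hashes))
        best = max(best, matches / len(message_hashes))
    return best
```

A querying filter would send `packed` to its chosen peers; each peer runs `unpack` and `peer_confidence` against its own spam store and returns the resulting number, so the message text itself never crosses the network.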