Tuesday, May 02, 2006


John Graham-Cumming, the inventor of POPFile, has a new project called SpamOrHam. He is attempting to improve spam filter testing by verifying (or correcting) the accuracy of one of the few large sources of real-world test emails, the TREC 2005 Public Spam Corpus. The more accurately the test data is classified, the more accurately spam filters can be tested, which hopefully will lead to better filters. To do this, he needs volunteers to visit the site and classify a few messages. If enough people participate, the corpus could be done in no time.

The Public Spam Corpus is a body of email collected from a large corporation during a legal trial. Normally it is hard to get such a real-world collection of email because of privacy concerns. But thanks to the legal system, this collection holds mail from 150 recipients in the company, containing business and personal emails as well as spam. After a few messages you will understand more about the source and the kind of business it was in. This was a major company that was covered heavily in the news recently. It is a bit voyeuristic reading some of these emails (though I have seen no juicy details). I found it rather addictive.

Apparently this project is such a good idea that within just two days of its launch it has already seen hacking attempts. Thanks to John's planning ahead, these were already being prevented, partly through a CAPTCHA and logging.

Update: It looks like John has made some improvements. Now you get feedback after each classification, based on how the email was classified automatically. The site also keeps a running total of your classifications, possible errors, and those where you and the filter agreed. I am trying to recover a hard drive in Knoppix right now and had some free time while it copies, so I just did 200 emails. I disagreed with the filter or was unsure 13 times. Considering that many of those were ones I couldn't decide on, it looks like the corpus is pretty well filtered.
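
For anyone curious what 13 out of 200 works out to, here is a quick back-of-the-envelope sketch in Python. The variable names are my own, and lumping "disagreed" and "not sure" into one count is a simplification, since I did not track them separately.

```python
# Figures from my session: 200 messages judged,
# 13 where I disagreed with the filter or was unsure.
total_classified = 200
disagreed_or_unsure = 13

# Everything else counts as agreement with the automated classification.
agreed = total_classified - disagreed_or_unsure
agreement_rate = agreed / total_classified

print(f"Agreed on {agreed} of {total_classified} messages ({agreement_rate:.1%})")
```

That comes to 187 messages, or 93.5%. It is a rough number from one person's session, not a measurement of the corpus as a whole.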

Good writeup on what looks to be a good project.
I've signed up to be an evaluator also.
Since there doesn't seem to be a discussion forum concerning SpamOrHam.org, I'll put forward my question concerning emails received in a foreign language.
Should they be considered spam?
My take is that unless there are extenuating conditions pointing to either spam or ham, they should be marked "not sure".
I think a discussion forum sounds like a good idea. John, the creator of the project, has a blog, and there is currently some discussion on his latest post about the project. Some users of POPFile (John's other project) are using the POPFile forum to discuss exactly the kind of thing you mentioned.

I agree foreign-language mail is hard to deal with if you don't speak the language. Some could very well be legitimate, but many are clearly spam. Even without being able to read the language, you can look for the usual spammy characteristics in the headers. But if you aren't sure, use the "not sure" button. I probably use it more than absolutely necessary, but I would rather not be wrong.