Friday, September 16, 2005

My Review

After I read that was reviewed and showed really good results with a very small sample, I knew I had to run my own test. I wasted a lot of time on this, but I think it was worth it: my test used a much larger sample and came up with different results.

My sample was made up of 110 actual blogs and 201 splogs. Legitimate blogs were selected semi-randomly by searching Google for sites on Blogspot. The splogs were harder to find that way, so I took them from various lists of splogs and from some Google searches for spammy keywords. Most, but not all, sites were easily identifiable as blog or splog by a human, but many blogs had some spammy characteristics that could have thrown off the filter.

Out of 311 total sites, 113 (36%) were misclassified. It wasn't really as bad as that sounds. Most of the errors were splogs that were not caught, and I would much rather miss a few splogs than make mistakes on actual blogs. There were only 16 legitimate blogs that were falsely labeled as splogs. That is only 5% of the total sample, but it is 14.5% of the legitimate blogs. That is too high.
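The percentages above can be rechecked with a few lines of Python. This is only a sketch of the arithmetic: the counts come from this post, while the variable names are mine, and the splog-side figure assumes every error that was not a falsely flagged blog was on the splog side.

```python
# Counts reported in the review.
total_blogs = 110
total_splogs = 201
misclassified = 113
false_positives = 16  # legitimate blogs labeled as splogs

total = total_blogs + total_splogs

# Assumption: every remaining error is on the splog side.
splog_errors = misclassified - false_positives


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


rates = {
    "overall_error": pct(misclassified, total),
    "false_positive_of_total": pct(false_positives, total),
    "false_positive_of_blogs": pct(false_positives, total_blogs),
    "splog_errors_of_splogs": pct(splog_errors, total_splogs),
}

for name, value in rates.items():
    print(f"{name}: {value}%")
```

The gap between the 5% and 14.5% figures is just a change of denominator, which is why the false positive rate should be judged against the 110 legitimate blogs rather than the whole sample.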

There were also 5 splogs that the filter returned connection errors for. I had no trouble getting the sites to load myself. Trying again later did no good either, so I wonder if the results are cached. Caching makes a lot of sense, but there should be some number of retries for connection errors.
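To illustrate what I mean by retries, here is a minimal sketch in Python. I don't know how the filter actually fetches pages, so `fetch_with_retries`, its parameters, and the `fetch` callable are all hypothetical; the idea is only that a connection error should be retried a few times before a failure is recorded (and cached).

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on connection-type errors.

    fetch is any callable that returns the page or raises OSError
    (ConnectionError and TimeoutError are both OSError subclasses).
    The wait doubles after each failed attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError as err:
            last_error = err
            # Only wait if another attempt is still coming.
            if attempt < attempts - 1:
                sleep(delay * (2 ** attempt))
    # Every attempt failed: report the last error to the caller.
    raise last_error
```

Only a result that survives all attempts would be stored, so a temporary hiccup would not get cached as a permanent connection error.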

Most links I tested were to the index of the site, but to give a thorough test I included some links to archive pages or specific posts. I wanted to see if that would make a difference. It did. Of the 16 misclassified blogs, 4 would have been classified correctly had I used the front page of the blog. I don't know if the posts I selected were extra spammy, but I suspect that the filter only checks the link it is given. If that is true, it should either use both that page and the main page of the blog, or just automatically truncate anything after the domain and test only the front page. The second option may run into trouble on non-Blogspot blogs, though, since the blog may start from a subdirectory. If the main page of the blog could be correctly identified, that would give the best results.
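The truncation idea can be sketched with Python's standard `urllib.parse`. This is my own illustration, not how the filter works; `front_page` and `pages_to_check` are hypothetical names, and the example URL is made up.

```python
from urllib.parse import urlsplit, urlunsplit


def front_page(url):
    """Drop everything after the domain, keeping scheme and host."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "/", "", ""))


def pages_to_check(url):
    """Return the front page plus the given link, if they differ."""
    home = front_page(url)
    return [home] if url == home else [home, url]


post = "http://example.blogspot.com/2005/09/some-post.html"
print(front_page(post))
print(pages_to_check(post))
```

This simple version works for Blogspot, where the blog lives at the root of its own subdomain. For a blog hosted in a subdirectory it would test the wrong page, which is the problem mentioned above.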

I noticed that some, but not all, of the sites identified as splogs showed the Blogger flag icon and said the splog had been reported to Blogger. In fact, the image shown is the unflag image, but the flagging cannot be undone from the result page.

The interface for checking blogs needs some improvement. The bookmarklet is ok for checking or reporting one blog at a time, but that isn't really efficient for more active splog fighters. A textarea form to input multiple links and a table for the results output would be great.
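As a rough sketch of the kind of batch interface I have in mind, the Python below takes the raw contents of a textarea (one link per line) and produces a plain-text results table. Everything here is hypothetical: `parse_links`, `results_table`, and the `classify` callable stand in for whatever the filter really uses.

```python
def parse_links(textarea_text):
    """Split textarea input into links, skipping blanks and duplicates."""
    links = []
    for line in textarea_text.splitlines():
        link = line.strip()
        if link and link not in links:
            links.append(link)
    return links


def results_table(links, classify):
    """Build a two-column text table of link and classification.

    classify is a placeholder: any callable that takes a link and
    returns a label such as "blog" or "splog".
    """
    rows = [(link, classify(link)) for link in links]
    width = max([len("Link")] + [len(link) for link, _ in rows])
    lines = [f"{'Link':<{width}}  Result"]
    lines += [f"{link:<{width}}  {label}" for link, label in rows]
    return "\n".join(lines)
```

Pasting a list of suspect links and reading one table would be far quicker than running the bookmarklet once per site.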

The sites not correctly classified are listed on our .

Update: See's comments on my review.

If I had known, I would have set up a page for you with multi-link submission :)

And about the "auto reporting" interface, it's better to use the since it gives direct access to the results.

And thanks a lot for the review. Very helpful!