PenPlusBytes: Spam weapon helps preserve books

By Paul Rubens

A weapon used to fight spammers is now helping university researchers preserve old books and manuscripts.

Many websites use an automated test to tell computers and humans apart when signing up to an account or logging in.

The test consists of typing in a few random letters shown in an image and is designed to fight spammers.

Carnegie Mellon University (CMU) is using this test to help decipher words in books that machines cannot read, by letting websites use those words to authenticate log-ins.

The test, known as a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), was originally designed at Carnegie Mellon to help keep out automated programs known as "bots."

Spam messages

Bots are designed by spammers to post advertisements in discussion forums or to sign up for large numbers of e-mail addresses which are later used to send spam messages.

A CAPTCHA consists of an image containing letters or numbers which have been heavily distorted, making it hard or impossible for a bot to "read."

By requiring website visitors to type in the contents of the CAPTCHA before being allowed into the site, a site can admit humans while rebuffing all but the smartest bots.

CAPTCHAs are unpopular with many Internet users because the words they contain are often so heavily distorted to foil bots that many humans struggle to read them.

This means potential visitors' time is wasted while they make repeated attempts to decipher the CAPTCHA they are presented with.

But the CMU research team, based in Pittsburgh, Pennsylvania, has devised an ingenious system to put the time used interpreting CAPTCHAs to good use.

Text files

The team is involved in digitising old books and manuscripts supplied by a non-profit organisation called the Internet Archive, and uses Optical Character Recognition (OCR) software to examine scanned images of texts and turn them into digital text files which can be stored and searched by computers.

But the OCR software is unable to read about one in 10 words, due to the poor quality of the original documents.

The only reliable way to decode them is for a human to examine them individually - a mammoth task since CMU processes thousands of pages of text every month.

To solve this problem the team takes images of the words which the OCR software can't read, and uses them as CAPTCHAs.

These CAPTCHAs, known as reCAPTCHAs, are then distributed to websites around the world to be used in place of conventional CAPTCHAs.

When visitors decipher the reCAPTCHAs to gain access to the website, the answers - the results of humans examining the images - are sent back to CMU.

Every time an Internet user deciphers a reCAPTCHA, another word from an old book or manuscript is digitised.

Deciphered correctly

To ensure that the reCAPTCHAs are deciphered correctly, website visitors are actually presented with images of two words to examine, the contents of one of which is already known.

"If a person types the correct answer to the one we already know, we have confidence that they will give the correct answer to the other," says Luis von Ahn, a professor at CMU.

"We send the same unknown words to two different people, and if they both provide the same answer then effectively we can be sure that it is correct.

"If they don't agree then we send it to a lot more people to examine."
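The checking scheme von Ahn describes can be sketched in a few lines of Python. This is only an illustration of the idea, not CMU's actual code: the function names, the case-insensitive comparison and the two-matching-votes threshold are all assumptions made for the example, and the article does not say how answers are settled once a word has gone to "a lot more people."

```python
from collections import Counter


def normalise(word):
    """Compare answers case-insensitively, ignoring surrounding spaces (an assumption)."""
    return word.strip().lower()


def accept_response(control_answer, typed_control, typed_unknown):
    """Return the visitor's reading of the unknown word, or None if they
    got the already-known control word wrong."""
    if normalise(typed_control) != normalise(control_answer):
        return None
    return normalise(typed_unknown)


def resolve_unknown(readings, min_votes=2):
    """Return the most common reading once at least min_votes visitors agree
    on it; otherwise return None, meaning the word needs more readers."""
    if not readings:
        return None
    word, votes = Counter(readings).most_common(1)[0]
    return word if votes >= min_votes else None


# Three hypothetical visitors see the known word "upon" and one unknown word.
readings = []
for typed_control, typed_unknown in [
    ("upon", "morning"),
    ("upom", "mourning"),   # fails the control word, so this answer is discarded
    ("Upon", "Morning "),
]:
    reading = accept_response("upon", typed_control, typed_unknown)
    if reading is not None:
        readings.append(reading)

agreed = resolve_unknown(readings)
print(agreed)
```

Here the second visitor mistypes the known word, so only the first and third readings count, and they agree with each other.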

Thanks to the adoption of reCAPTCHAs by popular websites like Facebook, Twitter and StumbleUpon, the system is helping to decipher about one million words every day for CMU's book archiving project, according to von Ahn.

Given that it takes about 10 seconds to decipher a reCAPTCHA and type in the answer, this represents the equivalent of almost three thousand man-hours a day spent deciphering words that CMU's computers find illegible.
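The man-hours figure follows directly from the two numbers quoted above, as this short calculation shows. The variable names are chosen here for illustration; the inputs are the article's own rounded estimates.

```python
WORDS_PER_DAY = 1_000_000   # "about one million words every day"
SECONDS_PER_WORD = 10       # "about 10 seconds" per reCAPTCHA
SECONDS_PER_HOUR = 3600

total_seconds = WORDS_PER_DAY * SECONDS_PER_WORD
hours_per_day = total_seconds / SECONDS_PER_HOUR

print(f"{hours_per_day:.0f} hours per day")  # prints "2778 hours per day"
```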

A handy extra benefit of this system is that reCAPTCHAs are particularly good at foiling bots while remaining legible to people.

"Firstly, we are starting with words that we know our computers can't read," says von Ahn. "These words have also been distorted naturally over time, and the number of ways they have been distorted is very large.

'Distorted further'

"The more ways they are distorted, the harder it is for spammers to write software which can read them."

To make it even harder for bots, these words are then distorted further.

"What we do is the equivalent of placing the image on a rubber sheet and pulling it to distort the geometry," he says.
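A very crude stand-in for that kind of geometric distortion is to slide each row of an image sideways by a wave-shaped offset. The sketch below does this to a tiny "image" made of text characters. It is not the reCAPTCHA team's method, which the article does not describe in detail; the function name and the amplitude and wavelength parameters are invented for the example.

```python
import math


def warp_rows(image, amplitude=2.0, wavelength=6.0, fill=" "):
    """Shift each row left or right by a sine-wave offset, padding with fill
    so that every output row has the same length."""
    width = max(len(row) for row in image)
    pad = math.ceil(amplitude)  # largest possible shift in either direction
    warped = []
    for y, row in enumerate(image):
        shift = round(amplitude * math.sin(2 * math.pi * y / wavelength))
        line = fill * (pad + shift) + row.ljust(width, fill) + fill * (pad - shift)
        warped.append(line)
    return warped


image = [
    "###",
    "# #",
    "###",
]
warped = warp_rows(image)
for line in warped:
    print(repr(line))
```

Each row keeps its original pixels but lands at a different horizontal position, so straight vertical strokes come out wavy.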

Using the reCAPTCHA system, von Ahn's team is digitising documents and manuscripts as fast as the Internet Archive can supply them, and the good news for book lovers (and bad news for spammers) is that the supply of reCAPTCHAs is not likely to dry up any time soon.

"There's no danger of us running out of words," says von Ahn. "There's still about 100 million books to be digitised, which at the current rate will take us about 400 years to complete."

Story from BBC NEWS:
http://news.bbc.co.uk/go/pr/fr/-/2/hi/technology/7023627.stm

Published: 2007/10/02 10:01:32 GMT

© BBC MMVII

Sunday, October 07, 2007
