It sure would have been nice if the Carnegie Mellon University folks
had thought about accessibility when they invented CAPTCHA in the first
place, or, at least, if those who probably put Federal and other funds into
the work leading to the invention had done so. I'm just glad that audio CAPTCHA is
going to be a part of this new project. Of course, as we know, audio is
ultimately insufficient and a better way to secure resources must be
CCN Magazine, Canada
Sunday, May 27, 2007
ReCAPTCHA System Improves Internet Security and Book Searchability
A Carnegie Mellon University computer scientist is enlisting the unwitting
help of thousands, if not millions, of Web users each day to eliminate a
technical bottleneck that has slowed efforts to transform books, newspapers
and other printed materials into digitized text that is computer searchable.
Luis von Ahn, an assistant professor of computer science and recipient of a
MacArthur Foundation "genius grant," says the project will also improve Web
security systems used to reduce spam and make it possible for individuals to
safeguard their own email addresses from spammers.
Key to the new project is assigning a new, dual use to existing technology:
CAPTCHAs, the distorted-letter tests found at the bottom of registration
forms on Yahoo, Hotmail, PayPal, Wikipedia and hundreds of other sites
worldwide. CAPTCHAs, an acronym for Completely Automated Public Turing Test
to Tell Computers and Humans Apart, distinguish between legitimate human
users and malevolent computer programs designed by spammers to harvest
thousands of free email accounts. The tests require users to type the
distorted letters they see inside a box – a task that is difficult for
computers, but easy for humans.
Working with a team that includes computer science professor Manuel Blum,
undergraduate student Ben Maurer and research programmer Mike Crawford, von
Ahn invented a new version of the tests, called reCAPTCHAs, that will help
convert printed text into computer-readable letters on behalf of the
Internet Archive. The San Francisco-based non-profit group administers the
Open Content Alliance and is one of several large initiatives working to
digitize books and other printed materials under open principles, making the
text searchable by computer and capable of being reformatted for new uses.
Optical character recognition (OCR) systems that automatically perform this
conversion are often stumped by underlined text, scribbles and fuzzy or
otherwise poorly printed letters. ReCAPTCHAs will use words from these
troublesome passages to replace the artificially distorted letters and
numbers typically used in CAPTCHAs.
The new tests continue to distinguish between humans and machines because
they use text that OCR systems have already failed to read. And because
people must decipher these words to pass the reCAPTCHA test, they will help
complete the expensive digitization process.
"I think it's a brilliant idea – using the Internet to correct OCR
mistakes," said Brewster Kahle, director of the Internet Archive. ReCAPTCHAs
will speed the digitization process while also helping to improve OCR
methods and perhaps extend them to additional languages, he said. "This is
an example of why having open collections in the public domain is
important," he added. "People are working together to build a good, open
system."
Von Ahn hopes to substitute his reCAPTCHAs for as many conventional
CAPTCHAs as possible. "It is estimated that 60 million or more CAPTCHAs are
solved each day, with each test taking about 10 seconds," he said. "That's
more than 150,000 precious hours of human work that are lost each day, but
that we can put to good use with reCAPTCHAs."
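The figure in that quote can be checked with quick arithmetic. Below is a minimal sketch in Python that uses only the two estimates quoted above; the variable names are illustrative and not part of the reCAPTCHA project.

```python
# Estimates quoted by von Ahn; the variable names are illustrative only.
captchas_per_day = 60_000_000   # "60 million or more" CAPTCHAs solved each day
seconds_per_captcha = 10        # "about 10 seconds" per test

total_seconds = captchas_per_day * seconds_per_captcha
hours_per_day = total_seconds / 3600  # 3,600 seconds in one hour

print(f"{hours_per_day:,.0f} hours per day")  # prints "166,667 hours per day"
```

At the lower bound of 60 million tests, the total comes to roughly 166,667 hours a day, which is consistent with the "more than 150,000" hours cited.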
With support from Intel Corp., von Ahn's team has devised a free, Web-based
service that allows individual webmasters to install reCAPTCHAs to protect
their sites. Individuals can also use the service to protect their own email
addresses, or lists of addresses they post on personal Web pages. In the
case of some commercial Web sites with heavy traffic, reCAPTCHA may charge a
fee to pay for additional bandwidth.
To make certain that people are correctly deciphering the printed text, the
reCAPTCHA system will require Web site visitors to type two words, one of
which the system already knows. Each unknown word will be submitted to
multiple visitors. If the visitor types the known word correctly, the system
has greater confidence that the unknown word is being typed correctly. If
several visitors type the same answer for the unknown word, that answer will
be assumed to be correct.
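The two-word check described above might be sketched roughly as follows. This is an illustration only, not the reCAPTCHA team's code: the class name, the case-insensitive comparison, and the threshold of three matching answers are all assumptions, since the article says only that "several" visitors must agree.

```python
from collections import Counter


def normalize(word):
    """Compare answers ignoring case and surrounding spaces (an assumption)."""
    return word.strip().lower()


class UnknownWord:
    """Collects visitors' readings of one word that OCR could not read."""

    def __init__(self, min_votes=3):
        # min_votes is a hypothetical stand-in for "several visitors".
        self.min_votes = min_votes
        self.votes = Counter()

    def record(self, known_answer, typed_known, typed_unknown):
        """Return True if the visitor passes the test.

        The reading of the unknown word is counted only when the visitor
        typed the known (control) word correctly.
        """
        if normalize(typed_known) != normalize(known_answer):
            return False
        self.votes[normalize(typed_unknown)] += 1
        return True

    def accepted_reading(self):
        """Return the most common reading once enough visitors agree, else None."""
        if not self.votes:
            return None
        answer, count = self.votes.most_common(1)[0]
        return answer if count >= self.min_votes else None


# Example: three visitors see the known word "morning" and one unknown word.
word = UnknownWord(min_votes=3)
word.record("morning", "morning", "upon")
word.record("morning", "mourning", "open")   # control word wrong: vote ignored
word.record("morning", "Morning", "upon")
print(word.accepted_reading())               # prints "None": only two valid votes
word.record("morning", "morning", "upon")
print(word.accepted_reading())               # prints "upon"
```

In this sketch the known word acts as the gate: a visitor who gets it wrong fails the test and contributes nothing to the unknown word's tally, so careless or automated answers cannot push a bad reading over the threshold.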
An audio version of reCAPTCHA, which will transcribe portions of radio
programs that have defied speech recognition programs, will also be
available for blind Web users.