CAPTCHA: Spambots, eBooks and the Turing Test
Have you ever wondered about those blurry text boxes that appear when you sign in to a website? Probably not. But there’s a lot more to them than you might think. CAPTCHAs, an acronym for Completely Automated Public Turing test to tell Computers and Humans Apart, are used to distinguish between human users and computer programs online. Every day an estimated 100 million CAPTCHAs are completed worldwide.
The first CAPTCHA was created in 1997 in response to the problem of computer programs being used to bombard websites with advertisements or to gain access to private data. For example, some computer programs were being used to repeatedly sign up for email accounts or website addresses. The only way to stop these programs is to have a test that a nonhuman can’t solve.
Creating an online test that a computer will reliably fail is no easy task. Any conventional maths problem or multiple-choice question would be too easy. What about a password made up of a nonsense sequence of numbers or letters? Even this wouldn’t do the trick, as it would still be possible to write a program that works out how the sequences are generated.
Bypassing Optical Character Recognition Software
A team working at AltaVista in the mid-90s were faced with precisely this challenge. Their solution was to take advantage of the fact that computers have bad vision. Even the most advanced optical character recognition (OCR) technology currently in existence is no match for the human eye when it comes to recognising symbols. If you have an image that is distorted such that OCR technology cannot read it but a human can, you have a method for distinguishing between human users and computers.
CAPTCHAs have been described as a kind of 'reverse Turing Test' as they require the user to prove that they are really human to an automated system, whereas in Turing’s conception of the Imitation Game the test is taken by a machine and administered by a human.
Computer scientists at Carnegie Mellon University, creators of one of the first CAPTCHA programs, have described the system’s dual utility, providing security for websites while assisting AI research, as a win-win situation: “Either the CAPTCHA is not broken and there is a way to differentiate humans from computers, or the CAPTCHA is broken and a useful AI problem is solved.”
CAPTCHAs proved to be remarkably successful in protecting websites from spambots and other automated attacks. Their main flaw is that they are difficult – in some instances impossible – to solve for people with visual impairments. This led to the creation of a number of alternatives to the traditional text-based CAPTCHAs, including audio CAPTCHAs, where a distorted sound clip is used.
As efforts to create programs that can beat CAPTCHAs became more sophisticated, the system was updated and replaced with reCAPTCHA in 2009. It has become the main user verification system for websites including Twitter, Facebook and Google. The two main innovations of reCAPTCHA are that two images are used and that the images themselves are selected automatically.
In reCAPTCHA, scanned text is analysed using two optical character recognition (OCR) programs. If a word cannot be identified by both programs, the word is added to the pool of CAPTCHA puzzles. After a number of human users have typed the word, the most common answer is assumed to be the correct one. Answers which do not correspond to those given by the majority of users are assumed to have been given by computer programs.
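The two steps described above, flagging words the OCR programs cannot settle and then taking the most common human answer, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample words and the rule that a word is flagged when the two readings differ or one is empty are all hypothetical, not the actual reCAPTCHA implementation.

```python
from collections import Counter


def needs_human_help(ocr_a: str, ocr_b: str) -> bool:
    """Flag a word when either OCR reading is empty or the two readings differ.

    Assumption: this stands in for "cannot be identified by both programs".
    """
    return not ocr_a or not ocr_b or ocr_a != ocr_b


def majority_answer(responses: list[str]) -> str:
    """Return the most common human transcription, ignoring case and spaces."""
    counts = Counter(r.strip().lower() for r in responses)
    return counts.most_common(1)[0][0]


# Hypothetical readings of three scanned words by two OCR programs.
scans = [("morning", "morning"), ("upon", "uqon"), ("", "between")]

# Only the words the programs could not settle go into the puzzle pool.
puzzle_pool = [pair for pair in scans if needs_human_help(*pair)]

# Hypothetical answers typed by five users for the second word.
votes = ["upon", "upon", "Upon", "uqon", "upon"]
accepted = majority_answer(votes)
```

Here the first word never reaches a human because both programs agree on it, while the lone "uqon" response is outvoted by the four users who typed "upon".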
Remarkably, the reCAPTCHA system was originally developed not for internet security but to aid in the digitization of books. When the system scanning a book comes across a word that it fails to recognise, the word is flagged up and identified by a series of human users before being returned. Every time you complete a reCAPTCHA puzzle in Google you are unwittingly assisting in the digitization of a book. This fact has led some to criticize the system as a form of unpaid labour; more generous observers describe it as a kind of crowdsourcing.
The creators of reCAPTCHA describe its origin as a means of identifying scanned text: “For older prints with faded ink and yellowed pages, OCR cannot recognize about 20% of the words. By contrast, humans are more accurate at transcribing such print. For example, two humans using the ‘key and verify’ technique, where each types the text independently and then any discrepancies are identified, can achieve more than 99% accuracy.”
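The "key and verify" technique mentioned in the quotation is easy to picture in code: two people type the same passage independently and only the places where they disagree need a second look. The sketch below is a minimal illustration, with a hypothetical function name and made-up sample text.

```python
def key_and_verify(first: list[str], second: list[str]) -> list[int]:
    """Return the word positions where two independent transcriptions differ."""
    if len(first) != len(second):
        raise ValueError("transcriptions must cover the same number of words")
    return [i for i, (a, b) in enumerate(zip(first, second)) if a != b]


# Two hypothetical typists transcribe the same five words.
typist_one = "the quick brown fox jumps".split()
typist_two = "the quick brovvn fox jumps".split()

# Positions that need checking by a third person (counting from zero).
discrepancies = key_and_verify(typist_one, typist_two)
```

The approach works because independent typists rarely make the same mistake in the same place. If, purely for illustration, each mistyped one word in 20, both would err on the same word only about once in 400 words, and even then they would usually type different wrong answers.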
The Strengths and Weaknesses of reCAPTCHA
The strength of the reCAPTCHA system comes from the fact that the computer program administering the test does not know the answer itself. It takes an image that OCR software cannot read and derives the answer from the most common human response. This means it would be impossible to cheat, even if you were to look at the computer program in its entirety: there is no stored answer to find.
But its strength is also its weakness. If the word is too difficult to read then the human users will give a variety of responses and it will not be clear what the correct answer is. Most of us at one time or another have experienced the frustration of repeatedly typing a word to be told the answer is incorrect. It may be that what you are typing is correct but if most other users answered incorrectly then the system will identify you as a robot!
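The weakness is easy to reproduce with a toy model. The sketch below judges a new answer against earlier responses and refuses to decide when no single answer is common enough. The function name, the sample words and the 60% agreement threshold are assumptions made for illustration; the article does not say how the real system handles disagreement.

```python
from collections import Counter


def judge(responses: list[str], candidate: str, min_share: float = 0.6) -> str:
    """Classify a candidate answer against earlier responses.

    Returns "undecided" when the most common answer has less than
    min_share of the responses (a hypothetical threshold), otherwise
    "human" if the candidate matches the majority and "robot" if not.
    """
    top, top_count = Counter(responses).most_common(1)[0]
    if top_count / len(responses) < min_share:
        return "undecided"
    return "human" if candidate == top else "robot"


# The printed word is "modern", but most earlier users misread it.
misread = ["modem", "modem", "modem", "modem", "modern"]
verdict_for_correct_user = judge(misread, "modern")

# A word so faded that no two users give the same answer.
scattered = ["modem", "modern", "rnodern", "moden"]
verdict_for_faded_word = judge(scattered, "modern")
```

In the first case the user who typed the right word is labelled a robot because four out of five earlier users typed "modem". In the second, no answer reaches the threshold, so the toy system cannot tell what the word is at all.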
In 1950, Alan Turing first proposed a test of artificial intelligence in which typed sheets of paper were passed between an interrogator and two participants (one human and one machine). He could hardly have imagined that his idea would prove the inspiration for a system which is improving internet security, translation, the digitization of books and machine vision, as well as contributing to the ongoing debate over whether a computer will ever be able to outsmart a human.
Although his prediction that there would be AI capable of fooling even the most sophisticated tests by the year 2000 proved to be incorrect, he can take solace in the fact that, in addition to irritating millions on a day-to-day basis, CAPTCHAs are contributing to the future of machine intelligence.