The group transcription project we're all working on

Group translation is an increasingly popular service, as more internet users volunteer their time to translate the vast quantity of text on the internet from one language to another. Crowd-sourcing is not only used to make text accessible to people who speak different languages, however; it is also used to make text accessible to future generations. An ongoing group transcription project is helping to do just this.

The great transcription project

There can be few internet users who are not familiar with reCAPTCHA: the security system used to tell human users apart from computers by testing their character recognition skills. Try to access certain sites or forget your social media password a few too many times and you will be presented with a reCAPTCHA security question to complete by copying two words into an empty box to prove you’re not a spambot.

The system works because it uses words that a computer has already tried, and failed, to read correctly, so the text is known to be hard for machines to decipher. What’s surprising is that the words have not been invented by the security device or plucked from thin air, but instead originate from historic printed documents. Every piece of text featured on a reCAPTCHA has been taken from old printed documents that are being digitised in order to preserve them and make them more widely available.

When a reCAPTCHA box pops up, the two words the user is asked to type are often set in an old font, in smudged or faded ink that looks as though it was printed decades ago. This is probably because it was printed decades ago! The printed documents will have been scanned, uploaded to a computer and then run through an Optical Character Recognition (OCR) programme, which reads any writing in the scanned image. The writing is then converted into text, which is easier to store than the original image because it takes up less space, is more searchable and is cheaper to download.

Unfortunately, OCR is not able to read everything. Without a human’s ability to decipher text that is distorted, some of the results it returns are wrong. If there were only a few documents being run through OCR, one way to correct these faults could be to have a team of human proof-readers check the results and correct mistakes. However, the volume of documents being digitised means this is simply not possible.

That’s where reCAPTCHA comes in. The system was invented by a team at Carnegie Mellon University, where one of the original developers, Luis von Ahn, realised that each CAPTCHA security question requires about ten seconds of human brain time to solve – a precious and free resource that was not being tapped into. The team came up with the idea of taking words that computer programmes had been unable to decipher from images undergoing digitisation, and using these as security questions.

The system was acquired by Google in 2009, and since then has been rolled out across numerous websites, email providers and social media networks. The words featured in a reCAPTCHA are taken from scanned pages, and include words that OCR has been unable to decipher. So, every time an internet user fills in a reCAPTCHA, they are helping to transcribe and digitise a little more historic documentation. Indeed, a study by the developers, published in Science in 2008, found that users of the system are able to transcribe text with a word accuracy of greater than 99 per cent.

Deciphering the undecipherable

The system sounds ingenious. However, if the computer can’t read the word in the first place, how does it know that the answer the human reader gives is correct? reCAPTCHAs always feature two words for the user to type out, both of which have been distorted in some way, such as by using a wavy font or putting a line through them. One of these words is something OCR has been unable to decipher, while the other is something it has read and digitised correctly. If the internet user is able to correctly copy out the word the computer knows, the software assumes the answer given for the word it cannot read is also correct. Each unknown word will be given to several people and the results tallied before it is accepted by the computer as ready to be transformed into digital text.
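The checking process described above can be sketched in a few lines of Python. This is only an illustration of the idea, not reCAPTCHA's actual code: the function names, the minimum of three votes and the 75 per cent agreement threshold are all assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def normalise(word: str) -> str:
    """Ignore case and surrounding spaces when comparing typed words."""
    return word.strip().lower()


def tally_responses(
    control_answer: str,
    responses: Iterable[Tuple[str, str]],
) -> Counter:
    """Count answers for the unknown word, but only from users who
    typed the known (control) word correctly.

    Each response is a pair: (typed_control_word, typed_unknown_word).
    """
    votes: Counter = Counter()
    for typed_control, typed_unknown in responses:
        if normalise(typed_control) == normalise(control_answer):
            votes[normalise(typed_unknown)] += 1
    return votes


def accepted_transcription(
    votes: Counter,
    min_votes: int = 3,      # hypothetical threshold
    min_share: float = 0.75,  # hypothetical threshold
) -> Optional[str]:
    """Return the winning answer once enough users agree, else None."""
    total = sum(votes.values())
    if total < min_votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count / total >= min_share else None


if __name__ == "__main__":
    # "morning" is the word the computer already knows.
    responses = [
        ("morning", "upon"),
        ("morning", "upon"),
        ("mourning", "open"),  # failed the control word, so ignored
        ("morning", "up0n"),
        ("Morning", "upon"),
    ]
    votes = tally_responses("morning", responses)
    print(accepted_transcription(votes))
```

In this example the third user's answer is discarded because they got the known word wrong, and the unknown word is accepted as "upon" because three of the four remaining users agree on it.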

A couple of words and a few seconds may not seem like much, but in total some 200 million CAPTCHAs are completed every day. Put together, this adds up to more than 150,000 hours of human reading every day – something that would be near impossible for an in-house team of proof-readers to achieve. So, by giving people something useful to read, this time and effort can be put to good use transcribing and digitising historic books. The current goal of reCAPTCHA is to digitise content for Google Books and back issues of The New York Times.
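The daily total can be checked with some quick arithmetic. The sketch below uses only the figures quoted in this article (200 million CAPTCHAs a day, ten seconds each, 150,000 hours); the function names are made up for the example.

```python
CAPTCHAS_PER_DAY = 200_000_000  # figure quoted in the article
SECONDS_PER_HOUR = 3600


def hours_per_day(seconds_per_captcha: float) -> float:
    """Total human hours spent each day at a given time per CAPTCHA."""
    return CAPTCHAS_PER_DAY * seconds_per_captcha / SECONDS_PER_HOUR


def implied_seconds(total_hours: float) -> float:
    """Average seconds per CAPTCHA implied by a daily total in hours."""
    return total_hours * SECONDS_PER_HOUR / CAPTCHAS_PER_DAY


if __name__ == "__main__":
    print(round(hours_per_day(10)))   # total at ten seconds each
    print(implied_seconds(150_000))   # average implied by 150,000 hours
```

At a full ten seconds per CAPTCHA, the total would be roughly 555,000 hours a day, while the 150,000-hour figure corresponds to an average of about 2.7 seconds each. Either way, the total comfortably exceeds 150,000 hours of human effort every day.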

Of course, the purpose of reCAPTCHA is not only to read old books, but also to protect websites from spambots – and it’s likely the majority of users believe that’s all the system is doing. Spam is responsible for everything from diminishing the user experience on a website to putting that website at risk, yet a spambot cannot read distorted text the way a person can. A simple CAPTCHA can therefore protect a website from attack by automated programmes.

Thanks to reCAPTCHA, websites are now kept safe in an operation that also contributes to the digitisation of masses of content. So, while online content is protected from malicious computer programmes, our written history is protected for future generations – all thanks to ten seconds of work.
