The life of a book at Distributed Proofreaders

Distributed Proofreaders (DP) is an international web-based community working together to create eBooks from physical books in the public domain. I’ve been a volunteer and developer at DP for over 9 years and a site administrator for the past 6. During my sabbatical I’ve enjoyed a more active role in the community wearing my developer and system administrator hats.

After realizing that many folks aren’t aware of how DP works, I wrote a post on the DP blog that follows the life of a book at Distributed Proofreaders. Therein you can see how the children’s book Uncle Wiggily’s Auto Sled by Howard Roger Garis went from a physical form to a beautiful eBook at Project Gutenberg (you can read the HTML version in your web browser).

I’ve got a few more big DP development tasks I want to complete in the next 3 months before my sabbatical is over. Stay tuned!

The relevance of Distributed Proofreaders in a Google Books world

The mission of Distributed Proofreaders (DP) is to preserve public domain books by converting them into high-quality eBooks and publishing them to Project Gutenberg. But they’re not the only ones who are working to digitally preserve the world’s books. Other players in this space include:

  • The Internet Archive (TIA), through its OpenLibrary project, is digitizing all the world’s public domain books and making them accessible.
  • Google Books has a similar mission1, but focuses on indexing the contents of the books so users can search against them.
  • HathiTrust is a collaboration of multiple universities working to digitally preserve their collections. They work with TIA, Google Books, and others to source their material.

Both OpenLibrary and Google Books make not only the digitized images available, but also the underlying text. In fact, both bundle up that text into eBooks automatically. The creation of these eBooks is entirely automated, without any human interaction, and is thus lightning fast compared to DP.

If TIA, Google Books, and others are all providing digital books to the public, why does DP still exist? What is its relevance in today’s world?

The answer is actually quite simple: accuracy of the text.

Text from OCR is ‘data’ not ‘information’

The quality of the OCR text from an image depends on multiple factors, including the quality of the image, the capabilities of the OCR software, and how the OCR software was configured. Even with the crispest image and sophisticated software, the OCR text isn’t perfect and is filled with page artifacts and errors.

For example, I found a book in Project Gutenberg that went through DP and was also available from the 3 providers above. I selected a random page in the project and compared the text output from each. The book was Railway construction by William Hemingway Mills; page 20. Links to the book from the different sources:

  1. Distributed Proofreaders (eBook 50696 at Project Gutenberg)
  2. Google Books
  3. OpenLibrary
  4. HathiTrust (the edition was digitized by Google Books)

Using the DP version, specifically the text of the page after it finished the P3 proofreading round and before it was formatted, I did a diff against the ones provided by Google Books and TIA, ignoring any changes in whitespace. I’ve highlighted the differences below; the first version given (lines prefaced by <) is the DP version, and the second (lines prefaced by >) is the Google Books or TIA version.

The Google Books text was taken from their ePub version, the HTML tags stripped out, and the newlines reinserted for easier comparison.
$ diff -w railroad_dp.txt railroad_google.txt
< (b) At any intermediate siding connection upon a line
> (6) At any intermediate siding connection upon a line
< Every signal arm to be so weighted as to fly to and remain
> Every signal arm to be so weighted as to lly to and remain
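
For readers who want to reproduce the preparation step for the Google Books text, the tag stripping could be sketched in Python roughly as follows. This is not the script used for the comparison above: the names TextExtractor and strip_tags and the list of block-level tags are assumptions made for illustration, and it only uses the standard library's html.parser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content and drops every tag (hypothetical helper)."""

    # Closing one of these tags reinserts a newline (an assumed list).
    BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_starttag(self, tag, attrs):
        # A line break has no closing tag, so handle it when it opens.
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def text(self):
        return "".join(self.parts)


def strip_tags(markup):
    """Return the text of an HTML fragment with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()
```

Feeding it each XHTML file from the ePub and writing the result to a text file would give something comparable to railroad_google.txt, although the exact line breaks depend on how the ePub was marked up.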

The TIA text was taken from their plain text version. Note the cruft on the page at the end — yes, that’s really present in their text:
$ diff -w railroad_dp.txt railroad_tia.txt
< (b) At any intermediate siding connection upon a line
< worked under the train staff and ticket system, or under the
> (b) At any intermediate siding cormection upon a line
> worked u/nder the train staff and ticket system, or under the
< system: Sidings, if any, being locked as in (b).
> system : Sidings, if any, being locked as in (6).
< one arm on one side of a post, to be made to apply--the first, or
> one arm on one side of a post, to be made to apply — the first, or
< run over by trains of other companies using a different system
> run over by trains of other companies usiing a different system
> .â– J ML^
> *•**
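
The comparisons above were done with the diff command and its -w option, which ignores all whitespace when comparing lines. A rough Python equivalent, using the standard difflib module, might look like the sketch below. The function names are hypothetical, and the output only imitates the abridged form shown above (lines prefixed with < and >), not the full format that diff prints.

```python
import difflib


def normalize(line):
    """Drop all whitespace so lines compare the way diff -w compares them."""
    return "".join(line.split())


def diff_ignoring_whitespace(left_text, right_text):
    """Return the differing lines, '< ' for the left text, '> ' for the right."""
    left = left_text.splitlines()
    right = right_text.splitlines()

    # Match on the whitespace-free form, but report the original lines.
    matcher = difflib.SequenceMatcher(
        None,
        [normalize(line) for line in left],
        [normalize(line) for line in right],
        autojunk=False,
    )

    output = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        output.extend("< " + line for line in left[i1:i2])
        output.extend("> " + line for line in right[j1:j2])
    return output
```

Calling diff_ignoring_whitespace with the contents of railroad_dp.txt and railroad_tia.txt would list the lines where the two texts disagree, while a line that differs only in spacing is treated as identical.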

One important thing to note is that there are no cases on this page where the Google Books or TIA versions found errors in the DP version. At least for this one page in this one book, DP provides the most accurate text.2

Google Books cares about the accuracy of the text only as much as it can index and bring up a book based on that book’s indexed contents. They don’t care if pages have ancillary characters or incorrect words as long as, taken as a whole, the book is indexable. I presume TIA cares more about having valid text, but they don’t appear to have the resources to improve it.

Errors like the ones above are fairly minor and mostly just annoying for the average reader. However, consider such errors in a scientific book or journal where the accuracy of the numbers is very important.

OCR-only text is just a bunch of data, but without accuracy it’s not really information. In fact, in some subtle cases it could be misinformation.

DP provides more accurate text, but it does so at the cost of speed. A book can take from days to years to go through the whole process at DP and be published at Project Gutenberg.

Improving DP’s relevance

Currently, every text that goes through DP ends up in Project Gutenberg as an eBook. The eBooks are far superior to the ones produced by automated systems and are a delight to read. There will always be a need for these.

However, there are small things we can do at DP to become more relevant in today’s digital ecosystem.

Closing the loop with image providers

It’s sad that while DP sources many of its images from Google Books or TIA, those providers continue to offer sub-par text and eBooks for download well after DP has uploaded finished eBooks to Project Gutenberg.

DP should close the loop with TIA and Google Books to provide them with updated eBooks and page texts. Projects at DP already identify where the images were sourced from, so it would be straightforward to send the providers the updated text in an automated way. I can see providers like Google Books being particularly interested in accurate page texts to refine their OCR algorithms and improve their search index. Both TIA and Google Books could use the accurate page texts to update the underlying text in their image PDFs, allowing accurate PDF searching and better accessibility (e.g., screen readers).

Partnering with the image providers in this manner is the right thing to do for the world at large and a potential source for more volunteers, developers, and perhaps even funding.

Not everything needs to become an eBook

Not everything needs to be a beautiful, hand-crafted eBook. Some printed materials, like journal articles and legal briefs, would benefit most from simply having accurate text — something DP excels at.

If DP were to expand its mission to encompass the accurate preservation of all public-domain printed materials, with the end product varying depending on the needs of the item, it could increase the rate at which accurate public domain texts are produced. Such materials would only go through the proofreading rounds, skipping the formatting rounds and post-processing step that are the biggest bottleneck. This could result in the accurate text being available within mere days.

With this in place, I can see DP partnering with folks like JSTOR or CourtListener to proofread their public domain materials. Such partnerships would be good publicity and a valuable source of new volunteers. Because this would still be limited to public domain material, Project Gutenberg could accept these not-eBooks as a final product if it chose.

Potentially more than relevant

By increasing DP’s scope, partnering with image providers, and leveraging its strengths, DP can remain more than relevant in the age of Google Books, but it’s going to take some realignment of mission and buy-in from the community to get there.


1 Google Books is digitizing much more than just public domain books, but let’s focus on the public domain books in this discussion; there are more than enough of them out there.

2 The astute among you will point out that all I’ve done is compare the resulting text from each source, not the text to the actual image, which is required to determine accuracy. You are correct: all I have shown is how precise the texts are relative to one another. I have great confidence in the accuracy of the DP version compared to the image, but I leave proof of that as an exercise for the reader. And if you enjoy that kind of work, have I got a great site for you to volunteer with!

Books in every Nook and cranny

Barnes and Noble revealed their own eBook reader yesterday: Nook. I’ve been following the eBook readers closely, specifically the Sony PRS-600 and the Kindle 2.

My ideal eBook reader would have the following features in order of importance:

  • use eInk technology
  • native support of ePub format (the open standard eBook format)
  • native support of PDF format
  • wireless LAN support
  • no physical keyboard
  • bonus: touch screen support
  • bonus: SD card support

The Kindle 2 is right out, seeing as it has no support for ePub, no native PDF support (unless you get the larger Kindle DX), no wireless LAN support, and a physical keyboard. Also working against it is the inability to play with one before buying it and Amazon’s highly ironic Orwellian fubar. The cell connectivity is a neat gimmick but isn’t a feature I’m looking for.

The Sony PRS-600 thus far has most of the features that I’ve been wanting, including the touch screen but sans wireless LAN support. I played with one of the earlier PRS-700 models that a friend purchased and really liked it. The downside is that it’s sold by Sony: while they may make excellent hardware, I hate their business practices, like the use of their proprietary Memory Sticks in their cameras and laptops, their apparent disdain for consumers of the PSP Go, and their classic rootkit escapade. Oh, and they’re in bed with the RIAA, as if they needed another strike against them in my book.

The Nook seems to have all of the items on my wish list, plus a few more extras that I like (mini-SD card slot and color LCD touch screen in addition to the eInk screen). The ability to lend a friend a purchased book, even for a measly two weeks, is pretty interesting although I doubt I’d purchase books given how many interesting ones are available for free via Project Gutenberg. The ability to read any eBook in its entirety for free while inside a Barnes & Noble store is exceptionally cool. Because it’s sold by a brick-and-mortar store I can go into a Barnes and Noble and play with one before I buy it. And playing with one is the first thing I’m going to do come November 30th when the stores get them.