The life of a book at Distributed Proofreaders

Distributed Proofreaders (DP) is an international web-based community working together to create eBooks from physical books in the public domain. I’ve been a volunteer and developer at DP for over 9 years and a site administrator for the past 6. During my sabbatical I’ve enjoyed a more active role in the community wearing my developer and system administrator hats.

After realizing that many folks aren’t aware of how DP works, I wrote a post on the DP blog that follows the life of a book at Distributed Proofreaders. Therein you can see how the children’s book Uncle Wiggily’s Auto Sled by Howard Roger Garis went from a physical form to a beautiful eBook at Project Gutenberg (you can read the HTML version in your web browser).

I’ve got a few more big DP development tasks I want to complete in the next 3 months before my sabbatical is over. Stay tuned!

The relevance of Distributed Proofreaders in a Google Books world

The mission of Distributed Proofreaders (DP) is to preserve public domain books by converting them into high-quality eBooks and publishing them to Project Gutenberg. But they’re not the only ones who are working to digitally preserve the world’s books. Other players in this space include:

  • The Internet Archive (TIA), through its OpenLibrary project, is digitizing all the world’s public domain books and making them accessible.
  • Google Books has a similar mission1, but focuses on indexing the contents of the books so users can search against them.
  • HathiTrust is a collaboration of multiple universities working to digitally preserve their collections. They work with TIA, Google Books, and others to source their material.

Both OpenLibrary and Google Books make not only the digitized images available, but also the underlying text. In fact, both bundle up that text into eBooks automatically. The creation of these eBooks is entirely automated, without any human interaction, and is thus lightning fast compared to DP.

If TIA, Google Books, and others are already providing digital books to the public, why does DP still exist? What is its relevance in today’s world?

The answer is actually quite simple: accuracy of the text.

Text from OCR is ‘data’ not ‘information’

The quality of the OCR text from an image depends on multiple factors, including the quality of the image, the capabilities of the OCR software, and how the OCR software was configured. Even with the crispest image and sophisticated software, the OCR text isn’t perfect and is filled with page artifacts and errors.

For example, I found a book in Project Gutenberg that went through DP and is also available from all three providers above. I selected a random page in the project and compared the text output from each. The book was Railway construction by William Hemingway Mills; page 20. Links to the book from the different sources:

  1. Distributed Proofreaders (eBook 50696 at Project Gutenberg)
  2. Google Books
  3. OpenLibrary
  4. HathiTrust (the edition was digitized by Google Books)

Using the DP version, specifically the text of the page after it finished the P3 proofreading round and before it was formatted, I diffed it against the versions provided by Google Books and TIA, ignoring any changes in whitespace. The differences are shown below; lines prefaced by < are the DP version, and lines prefaced by > are the Google Books or TIA version.

The Google Books text was taken from their ePub version, the HTML tags stripped out, and the newlines reinserted for easier comparison.
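For the curious, the stripping-and-normalizing step can be sketched roughly like this (a Python illustration of the approach, not the exact tooling I used):

```python
import re

def strip_tags(html):
    """Strip HTML tags from ePub text, reinserting newlines at
    paragraph and line-break boundaries so diff -w compares words."""
    # turn paragraph-end and line-break tags into newlines
    text = re.sub(r"</p\s*>|<br\s*/?>", "\n", html, flags=re.I)
    # drop any remaining tags
    text = re.sub(r"<[^>]+>", "", text)
    # collapse horizontal whitespace and drop empty lines
    return "\n".join(" ".join(line.split())
                     for line in text.splitlines() if line.strip())

print(strip_tags("<p>Every signal arm</p><p>to be so weighted</p>"))
# Every signal arm
# to be so weighted
```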
$ diff -w railroad_dp.txt railroad_google.txt
4c4
< (b) At any intermediate siding connection upon a line
---
> (6) At any intermediate siding connection upon a line
31c31
< Every signal arm to be so weighted as to fly to and remain
---
> Every signal arm to be so weighted as to lly to and remain

The TIA text was taken from their plain text version. Note the cruft on the page at the end — yes, that’s really present in their text:
$ diff -w railroad_dp.txt railroad_tia.txt
4,5c4,5
< (b) At any intermediate siding connection upon a line
< worked under the train staff and ticket system, or under the
---
> (b) At any intermediate siding cormection upon a line
> worked u/nder the train staff and ticket system, or under the
11c11
< system: Sidings, if any, being locked as in (b).
---
> system : Sidings, if any, being locked as in (6).
15c15
< one arm on one side of a post, to be made to apply--the first, or
---
> one arm on one side of a post, to be made to apply — the first, or
41c41
< run over by trains of other companies using a different system
---
> run over by trains of other companies usiing a different system
50a51,58
>
>
> .â– J ML^
>
>
>
> *•**
>

Something important to note is that there are no cases on this page where the Google Books or TIA versions found errors in the DP version. At least for this one page in this one book, DP provides the most accurate text.2

Google Books cares about the accuracy of the text only as much as it can index and bring up a book based on that book’s indexed contents. They don’t care if pages have ancillary characters or incorrect words as long as, taken as a whole, the book is indexable. I presume TIA cares more about having valid text, but they don’t appear to have the resources to improve it.

Errors like the ones above are fairly minor and mostly just annoying for the average reader. However, consider such errors in a scientific book or journal where the accuracy of the numbers is very important.

OCR-only text is just a bunch of data, but without accuracy it’s not really information. In fact, in some subtle cases it could be misinformation.

DP provides more accurate text, but it does so at the cost of speed. A book can take from days to years to go through the whole process at DP and be published at Project Gutenberg.

Improving DP’s relevance

Currently, every text that goes through DP ends up in Project Gutenberg as an eBook. The eBooks are far superior to the ones produced by automated systems and are a delight to read. There will always be a need for these.

However, there are small things we can do at DP to become more relevant in today’s digital ecosystem.

Closing the loop with image providers

It’s sad that while DP sources many of its images from Google Books or TIA, those providers continue to offer sub-par text and eBooks for download well after DP has uploaded finished eBooks to Project Gutenberg.

DP should close the loop with TIA and Google Books to provide them with updated eBooks and page texts. Projects at DP already identify where the images were sourced from, so it would be straightforward to send the providers the updated text in an automated way. I can see providers like Google Books being particularly interested in accurate page texts to refine their OCR algorithms and improve their search index. Both TIA and Google Books could use the accurate page texts to update the underlying text in their image PDFs, allowing accurate PDF searching and accessibility (e.g., screen readers).

Partnering with the image providers in this manner is the right thing to do for the world at large and a potential source for more volunteers, developers, and perhaps even funding.

Not everything needs to become an eBook

Not everything needs to be a beautiful, hand-crafted eBook. Some printed materials, like journal articles and legal briefs, would benefit most from simply having accurate text — something DP excels at.

If DP were to expand its mission to encompass the accurate preservation of all public-domain printed materials, with the end product varying depending on the needs of the item, it could increase the rate at which accurate public domain texts are produced. Such materials would only go through the proofreading rounds, skipping the formatting rounds and post-processing step that are the biggest bottleneck. This could result in the accurate text being available within mere days.

With this in place, I can see DP partnering with folks like JSTOR or CourtListener to proofread their public domain materials. Such partnerships would be good publicity and a valuable source of new volunteers. Because this would still be limited to public domain material, Project Gutenberg could accept these not-eBooks as a final product if it chose.

Potentially more than relevant

By increasing DP’s scope, partnering with image providers, and leveraging its strengths, DP can remain more than relevant in the age of Google Books, but it’s going to take some realignment of mission and buy-in from the community to get there.

 

1 Google Books is digitizing much more than just public domain books, but let’s focus on the public domain books in this discussion; there are more than enough of them out there.

2 The astute among you will point out that all I’ve done is compare the resulting text from each source, not the text to the actual image, which is required to determine accuracy. You are correct, all I have shown is how precise the texts are to one another. I have great confidence in the accuracy of the DP version compared to the image, but I leave proof of that as an exercise for the reader. And if you enjoy that kind of work, have I got a great site for you to volunteer with!

Development leadership failure

Last night I did some dev work for DP. Mostly some code cleanup (heaven knows we need it) but also rolling out some committed code to production. I’ve made a concerted effort to get committed-but-not-released code deployed — some of which has been waiting for, literally, years.

Even worse, we have reams of code updates sitting uncommitted (and slowly suffering from bitrot) in volunteers’ sandboxes waiting for code review. In the case of Amy’s new quizzes, for almost 5(!!!!) years. In other cases volunteers have done a crazy amount of legwork to address architectural issues that remain unimplemented due to the lack of any solid commitment that, if they did the work, it would be reviewed, committed, and deployed — like Laurent’s site localization effort.

These are clear systemic failures by development leadership, i.e., me. It’s obvious why, even when the project attracts developers, we can’t retain them.

The first step is to get through the backlog of outstanding work. I have Laurent’s localization work almost finished. This will allow the site to be translated into other languages — I think Portuguese and French are already done. Next up is getting Amy’s new quizzes pushed out. She’s done a marvelous job of keeping her code up to date with HEAD based on my initial work last night. Now to get them committed and rolled out. Then comes a site-wide change to our include()s, which is required to get full site localization implemented.

After all that, we need to address how to better keep code committed and rolled out. I think we as a team suffer from a “don’t commit until it’s perfect, then wait until it’s simmered before rolling it out” mentality, where “simmered” means “sitting in CVS with no active testing done on it”. We need to move to more flexible check-in criteria or a more liberal roll-out policy. There’s no good reason why the bar is so crazy high on both ends of that.

But first – the backlog.

PGDP.net – spellcheck facelift

After taking a look back at my journal entries I was surprised to see that I’ve never posted anything about PGDP or WordCheck, which really is amazing since it has consumed many hours of my life.

I guess I should start with some background. Last September I came across Distributed Proofreaders whose mission is to preserve books by providing texts to Project Gutenberg (PG). The site provides a mechanism for content providers to upload scanned and OCR’d copies of books, proofers to validate and correct the text one page at a time by comparing it to the scanned image, formatters (aka foofers) to format the text, and post-processors to ensure consistency and do a final edit before the text is uploaded to PG. Each book, or project, is guided through the process by a project manager who is responsible for helping with questions from proofers or foofers.

I began as a simple proofer, proofing a couple of pages a day during my lunch break. The more pages I proofed the more annoyed I became with the spellcheck component of the proofing interface. It didn’t offer the ability to see the page image in the default interface, the misspelled words were displayed as a drop-down box with suggestions instead of a text box to make corrections, and there was no way to add words to the spell checker’s dictionary so there were many false positives.

After searching some of the forums, I found that my concerns were shared by others. A bit more digging revealed that the PGDP code is open source and thus available for improvement. It will surprise no one that I got annoyed enough that I started working to improve these issues, collaborating with more senior project developers and active members of the community. The resulting “spell check” was sufficiently advanced that we renamed it WordCheck to convey that the tool does much more than simply check for misspelled words:

  • The image is always shown when making corrections to the page text.
  • Instead of using drop-down boxes to indicate misspelled words, edit boxes are used instead.
  • If the project is listed as having multiple languages, the page text is checked using all possible language dictionaries instead of just the first one, reducing the number of false positives.
  • Each project now has its own ‘dictionary’ in the form of a Good Words List that project managers can use to add frequently-occurring words such as proper nouns to further reduce false positives.
  • Each project also has a Bad Words List that a project manager can use to add words that should be flagged even if they pass the dictionary. This is used to help proofers find and correct scannos (like typos but made by the OCR software) such as modem for modern or arid for and.
  • While proofing, proofers have the ability to suggest that a word be added to the project’s Good Words List.
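Putting the word lists and dictionaries together, the flagging logic can be sketched like this (a hypothetical Python illustration of the behavior described above; the actual site code is PHP, and the names here are illustrative):

```python
import re

def flag_words(page_text, dictionaries, good_words, bad_words):
    """Return the words WordCheck would flag for a proofer to review.

    A word is flagged if it is on the project's Bad Words List, or if
    it appears in no language dictionary and is not on the project's
    Good Words List.
    """
    flagged = []
    for word in re.findall(r"[A-Za-z']+", page_text):
        if word in bad_words:
            flagged.append(word)    # known scanno: always flag
        elif word in good_words:
            continue                # project-specific word: never flag
        elif not any(word.lower() in d for d in dictionaries):
            flagged.append(word)    # unknown to every dictionary
    return flagged

english = {"the", "signal", "arm", "to", "be", "so", "weighted", "modern"}
print(flag_words("the signal arm to be so weighted modem",
                 [english], good_words=set(), bad_words={"modem"}))
# ['modem']
```

Note how the Bad Words List catches “modem” even though it is a perfectly valid dictionary word — that is exactly the scanno case a plain spell checker misses.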

This initial version was released in mid-March.

After the initial release, development continued to enhance the ability for project managers to manage project word lists and proofer suggestions. These enhancements were released at the end of May and improved the project manager’s interaction with WordCheck. Further tool development occurred and another minor release was made at the end of July.

As a proofer I can vouch that the changes made my life much easier. Several project managers have said that they are seeing better quality texts coming out of each round since WordCheck has been released. Overall I’ve really enjoyed working with the PGDP folks, both developers and proofers. I think my active WordCheck development is coming to an end unless defects are found. Instead I have my sights on the proofing interface itself and the spaghetti code that makes it run.