The mission of Distributed Proofreaders (DP) is to preserve public domain books by converting them into high-quality eBooks and publishing them to Project Gutenberg. But DP is not the only organization working to digitally preserve the world’s books. Other players in this space include:
- The Internet Archive (TIA), through its OpenLibrary project, is digitizing all the world’s public domain books and making them accessible.
- Google Books has a similar mission[1], but focuses on indexing the contents of books so users can search them.
- HathiTrust is a collaboration of multiple universities working to digitally preserve their collections. They work with TIA, Google Books, and others to source their material.
Both OpenLibrary and Google Books make not only the digitized images available, but also the underlying text. In fact, both bundle up that text into eBooks automatically. The creation of these eBooks is entirely automated, without any human interaction, and is thus lightning fast compared to DP.
If TIA, Google Books, and others are already providing digital books to the public, why does DP still exist? What is its relevance in today’s world?
The answer is actually quite simple: accuracy of the text.
Text from OCR is ‘data’ not ‘information’
The quality of the OCR text from an image depends on multiple factors, including the quality of the image, the capabilities of the OCR software, and how the OCR software was configured. Even with the crispest image and sophisticated software, the OCR text isn’t perfect and is often filled with page artifacts and errors.
For example, I found a book at Project Gutenberg that went through DP and was also available from all three providers above. I selected a random page in the project and compared the text output from each. The book was Railway Construction by William Hemingway Mills; the page was page 20. Links to the book from the different sources:
- Distributed Proofreaders (eBook 50696 at Project Gutenberg)
- Google Books
- HathiTrust (the edition was digitized by Google Books)
Using the DP version, specifically the text of the page after it finished the P3 proofreading round and before it was formatted, I did a diff against the versions provided by Google Books and TIA, ignoring any changes in whitespace. I’ve highlighted the differences below; the first version given (lines prefaced by <) is the DP version, and the second (lines prefaced by >) is the Google Books or TIA version.
The Google Books text was taken from their ePub version, the HTML tags stripped out, and the newlines reinserted for easier comparison.
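That stripping step can be sketched with Python’s standard-library HTML parser. This is a minimal illustration of the idea, not the exact tooling I used; the set of block-level tags here is an assumption:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content, inserting a newline at each block-level tag."""

    # Assumed set of tags that should break the line in an ePub chapter.
    BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "li"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        return "".join(self.parts).strip()

def strip_html(html: str) -> str:
    """Strip tags from an XHTML chapter, reinserting newlines for diffing."""
    extractor = TextExtractor()
    extractor.feed(html)
    return extractor.text()
```

For instance, `strip_html("<p>Every signal arm</p><p>to fly</p>")` yields two lines of plain text, ready for `diff -w`.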
$ diff -w railroad_dp.txt railroad_google.txt
< (b) At any intermediate siding connection upon a line
> (6) At any intermediate siding connection upon a line
< Every signal arm to be so weighted as to fly to and remain
> Every signal arm to be so weighted as to lly to and remain
The TIA text was taken from their plain text version. Note the cruft on the page at the end — yes, that’s really present in their text:
$ diff -w railroad_dp.txt railroad_tia.txt
< (b) At any intermediate siding connection upon a line
< worked under the train staff and ticket system, or under the
> (b) At any intermediate siding cormection upon a line
> worked u/nder the train staff and ticket system, or under the
< system: Sidings, if any, being locked as in (b).
> system : Sidings, if any, being locked as in (6).
< one arm on one side of a post, to be made to apply--the first, or
> one arm on one side of a post, to be made to apply â€” the first, or
< run over by trains of other companies using a different system
> run over by trains of other companies usiing a different system
> .â– J ML^
Something important to note is that there are no cases on this page where the Google Books or TIA versions found errors in the DP version. At least for this one page of this one book, DP provides the most accurate text.[2]
Google Books cares about the accuracy of the text only as much as it can index and bring up a book based on that book’s indexed contents. They don’t care if pages have ancillary characters or incorrect words as long as, taken as a whole, the book is indexable. I presume TIA cares more about having valid text, but they don’t appear to have the resources to improve it.
Errors like the ones above are fairly minor and mostly just annoying for the average reader. However, consider such errors in a scientific book or journal where the accuracy of the numbers is very important.
OCR-only text is just a bunch of data, but without accuracy it’s not really information. In fact, in some subtle cases it could be misinformation.
DP provides more accurate text, but it does so at the cost of speed. A book can take from days to years to go through the whole process at DP and be published at Project Gutenberg.
Improving DP’s relevance
Currently, every text that goes through DP ends up in Project Gutenberg as an eBook. The eBooks are far superior to the ones produced by automated systems and are a delight to read. There will always be a need for these.
However, there are small things we can do at DP to become more relevant in today’s digital ecosystem.
Closing the loop with image providers
It’s sad that while DP sources many of its images from Google Books or TIA, those providers continue to offer sub-par text and eBooks for download well after DP has uploaded finished eBooks to Project Gutenberg.
DP should close the loop with TIA and Google Books to provide them with updated eBooks and page texts. Projects at DP already identify where the images were sourced from, so it would be straightforward to send the providers the updated text in an automated way. I can see providers like Google Books being particularly interested in accurate page texts to refine their OCR algorithms and improve their search index. Both TIA and Google Books could use the accurate page texts to update the underlying text in their image PDFs, allowing accurate PDF searching and accessibility (e.g., screen readers).
Partnering with the image providers in this manner is the right thing to do for the world at large and a potential source for more volunteers, developers, and perhaps even funding.
Not everything needs to become an eBook
Not everything needs to be a beautiful, hand-crafted eBook. Some printed materials, like journal articles and legal briefs, would benefit most from simply having accurate text — something DP excels at.
If DP were to expand its mission to encompass the accurate preservation of all public-domain printed materials, with the end product varying depending on the needs of the item, it could increase the rate at which accurate public domain texts are produced. Such materials would only go through the proofreading rounds, skipping the formatting rounds and post-processing step that are the biggest bottleneck. This could result in the accurate text being available within mere days.
With this in place, I can see DP partnering with folks like JSTOR or CourtListener to proofread their public domain materials. Such partnerships would be good publicity and a valuable source of new volunteers. Because this would still be limited to public domain material, Project Gutenberg could accept these not-eBooks as a final product if it chose.
Potentially more than relevant
By increasing DP’s scope, partnering with image providers, and leveraging its strengths, DP can remain more than relevant in the age of Google Books, but it’s going to take some realignment of mission and buy-in from the community to get there.
[1] Google Books is digitizing much more than just public domain books, but let’s focus on public domain books in this discussion; there are more than enough of them out there.
[2] The astute among you will point out that all I’ve done is compare the resulting texts from each source against one another, not against the actual page image, which is what’s required to determine accuracy. You are correct: all I have shown is how consistent the texts are with one another. I have great confidence in the accuracy of the DP version compared to the image, but I leave proof of that as an exercise for the reader. And if you enjoy that kind of work, have I got a great site for you to volunteer with!
2 thoughts on “The relevance of Distributed Proofreaders in a Google Books world”
One reason why DP has not worked with various image providers is that they want to be able to identify the position of each word on the page. This is so they can highlight the correct place for a search. They derive this information from ABBYY FineReader, which generates bounding boxes for words. If DP were being built today, it might well be able to retain that information. But at the time the base DP code was developed, FineReader, to the best of my knowledge, did not yet provide that information. Without that information, various potential partners, including Google Books, the New York Times, and (the big European digitizing initiative whose name I’ve forgotten) were not interested in our data. Previous contacts with TIA suggest that they would be interested in adding our texts to theirs, as an additional, optional download for a specific title, if there were an automated way to do so.
We should be able to provide corrected text with placement data to image providers without any changes to DP. From the PDFs the image providers have, we can extract the text with placement data, update the text with the post-P3 DP page text, and (if desired) recreate the PDF. All of this is technically feasible now. I’ve roughed out the algorithms and steps needed but I haven’t coded it up yet. I’m not saying it’ll be easy, but let’s assume that we can give the image providers updated text with placement data: is that allowed by DPF, and under what conditions/attribution/etc.?
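The core of that approach, reusing the provider’s existing bounding boxes while swapping in corrected words, can be sketched with a plain sequence alignment. Everything below is illustrative: the word and box data are made up (a real implementation would read them from the provider’s PDF text layer), and the merge strategy for split or joined words is one possible choice among several:

```python
from difflib import SequenceMatcher

# Hypothetical OCR output: each word paired with its bounding box
# (x0, y0, x1, y1), as a PDF text layer or ABBYY output would provide.
ocr_words = [
    ("(6)", (72, 100, 95, 112)),
    ("At", (100, 100, 115, 112)),
    ("any", (120, 100, 145, 112)),
    ("cormection", (150, 100, 230, 112)),
]
corrected_text = "(b) At any connection"

def merge_boxes(words):
    """Union of several word boxes, for words merged or split by correction."""
    xs0, ys0, xs1, ys1 = zip(*(box for _, box in words))
    return (min(xs0), min(ys0), max(xs1), max(ys1))

def align(ocr_words, corrected_text):
    """Attach existing boxes to corrected words via sequence alignment."""
    corrected = corrected_text.split()
    matcher = SequenceMatcher(a=[w for w, _ in ocr_words], b=corrected)
    result = []
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == "equal":
            # Unchanged words keep their original boxes.
            result.extend((corrected[j], ocr_words[a0 + (j - b0)][1])
                          for j in range(b0, b1))
        elif op == "replace":
            if a1 - a0 == b1 - b0:
                # One-for-one correction: reuse each box directly.
                result.extend((corrected[b0 + k], ocr_words[a0 + k][1])
                              for k in range(b1 - b0))
            else:
                # Words merged or split: spread the union box over them.
                box = merge_boxes(ocr_words[a0:a1])
                result.extend((corrected[j], box) for j in range(b0, b1))
        # 'delete' drops OCR cruft; 'insert' has no box to borrow.
    return result
```

Running `align(ocr_words, corrected_text)` keeps the boxes for “At” and “any” and transfers the boxes of “(6)” and “cormection” to their corrected replacements, which is exactly the placement data a search highlighter needs.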
It should be fairly straightforward to programmatically notify TIA when a project generated from their sources finishes; we just need to code it up.
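A notification like that might be little more than assembling a small payload from the project record. The schema, field names, and intake mechanism below are entirely hypothetical; TIA would define the real ones:

```python
import json

def build_notification(project):
    """Build a JSON payload announcing a finished DP project to its source.

    `project` is a hypothetical project record; only TIA-sourced projects
    produce a payload, others return None.
    """
    if project.get("image_source") != "TIA":
        return None
    return json.dumps({
        "source_identifier": project["source_id"],   # TIA item identifier
        "gutenberg_ebook": project["pg_ebook_id"],   # finished eBook number
        "page_text_url": project["text_url"],        # corrected page texts
    })
```

Sending the payload (and whatever authentication TIA required) would be a small follow-on step once an endpoint existed.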