Looking back through my journal entries, I was surprised to see that I’ve never posted anything about PGDP or WordCheck, which is amazing given how many hours of my life they have consumed.
I guess I should start with some background. Last September I came across Distributed Proofreaders (PGDP), whose mission is to preserve books by providing texts to Project Gutenberg (PG). The site provides a mechanism for content providers to upload scanned and OCR’d copies of books, proofers to validate and correct the text one page at a time by comparing it to the scanned image, formatters (aka foofers) to format the text, and post-processors to ensure consistency and do a final edit before the text is uploaded to PG. Each book, or project, is guided through the process by a project manager who is responsible for answering questions from proofers and foofers.
I began as a simple proofer, proofing a couple of pages a day during my lunch break. The more pages I proofed the more annoyed I became with the spellcheck component of the proofing interface. It didn’t offer the ability to see the page image in the default interface, the misspelled words were displayed as a drop-down box with suggestions instead of a text box to make corrections, and there was no way to add words to the spell checker’s dictionary so there were many false positives.
After searching some of the forums, I found that my concerns were shared by others. A bit more digging revealed that the PGDP code is open source and thus available for improvement. It will surprise no one that I got annoyed enough to start working on these issues, collaborating with more senior project developers and active members of the community. The resulting “spell check” was sufficiently advanced that we renamed it WordCheck to convey that the tool does much more than simply check for misspelled words:
- The image is always shown when making corrections to the page text.
- Instead of drop-down boxes, edit boxes are used to indicate misspelled words, so corrections can be typed in directly.
- If the project is listed as having multiple languages, the page text is checked against the dictionaries for all of those languages instead of just the first one, reducing the number of false positives.
- Each project now has its own ‘dictionary’ in the form of a Good Words List that project managers can use to add frequently occurring words, such as proper nouns, to further reduce false positives.
- Each project also has a Bad Words List that a project manager can use to add words that should be flagged even if they pass the dictionary. This is used to help proofers find and correct scannos (like typos but made by the OCR software) such as modem for modern or arid for and.
- While proofing, proofers can suggest that a word be added to the project’s Good Words List.
This initial version was released in mid-March.
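For the curious, here is a minimal sketch in Python of how the checks in the list above might fit together. It is not the actual PGDP code: the function name `flag_words`, its parameters, and the tiny example dictionaries are all hypothetical, and for simplicity I assume every comparison is case-insensitive.

```python
import re

# Runs of Unicode letters, optionally joined by apostrophes (e.g. "don't").
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def flag_words(page_text, dictionaries, good_words=(), bad_words=()):
    """Return the words on a page that a proofer should review.

    dictionaries: one collection of known words per project language
    good_words:   project-specific words to accept (e.g. proper nouns)
    bad_words:    words to flag even if a dictionary knows them (scannos)
    """
    # A word is accepted if ANY project language knows it,
    # or if it is on the project's Good Words List.
    accepted = set()
    for dictionary in dictionaries:
        accepted.update(word.lower() for word in dictionary)
    accepted.update(word.lower() for word in good_words)

    # The Bad Words List overrides everything else.
    bad = {word.lower() for word in bad_words}

    flagged = []
    for word in WORD_RE.findall(page_text):
        key = word.lower()
        if (key in bad or key not in accepted) and word not in flagged:
            flagged.append(word)
    return flagged


# Hypothetical example data for a project listed as English and French.
english = {"the", "modern", "modem", "house", "is", "in"}
french = {"la", "maison"}
page = "The modem house is la maison in Pemberley"

flagged = flag_words(page, [english, french],
                     good_words=["Pemberley"], bad_words=["modem"])
```

In this example only `modem` is flagged: the French words pass because both dictionaries are consulted, `Pemberley` passes because it is on the Good Words List, and `modem` is flagged despite being a real word because it is on the Bad Words List.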
After the initial release, development continued, focusing on making it easier for project managers to manage project word lists and proofer suggestions. These enhancements were released at the end of May. Further development followed, and another minor release was made at the end of July.
As a proofer, I can vouch that the changes made my life much easier. Several project managers have also said that they are seeing better-quality texts coming out of each round since WordCheck was released. Overall I’ve really enjoyed working with the PGDP folks, both developers and proofers. I think my active development on WordCheck is coming to an end unless defects are found. Instead, I have my sights on the proofing interface itself and the spaghetti code that makes it run.