Introducing the “new” DP logo

Distributed Proofreaders has been around since 2000, well before the advent of modern image formats like SVG vector images and PNG raster images. The DP logo, therefore, was a GIF available in only the size needed for the website:

Fast forward 15 years and our logo is still 360×68 pixels with no hope of being used at any larger size, in any of the instances where a square image is needed (like Twitter), and no chance of being used in print. Over the years folks have filled the void by creating new raster images, some of great quality, but never anything that was considered official and never in a vector format from which we could generate raster images of various sizes.

In modern logo development you design it in a vector format, such as SVG, and then rasterize it to whatever size you need for the web. In addition to a logo you need what’s commonly called a mark or badge, essentially a square design that readily brings your brand to mind when seen. Marks are often used to link back to your website.

Most large companies go one step further and include much more detail about their brand, including specifying colors, fonts, spacing, when to use which image, and much more. Some branding guidelines from companies you’ve probably heard of:

About two weeks ago I decided DP needed some modern branding assets. Loading up my trusty copy of Inkscape and all of the current images available to me, I created a “new” DP logo in SVG format we could use to create official brand logos. I also created a DP mark in SVG format. Today, we rolled them out.

Introducing the “new” DP logo and mark:

Logo
DP logo

Mark
DP mark

Included in the roll-out is a full branding page providing access to the SVG files as well as PNGs with both white and transparent backgrounds in various sizes. The nice thing about transparent PNGs is that because they have an alpha channel, we only need one PNG to use for all our different themes rather than one GIF per background color.
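
To make the alpha-channel point concrete, here is a minimal sketch of how a single transparent image can be blended over any theme background. The pixel values and theme colors are made up for illustration; they are not taken from the actual DP logo or site themes.

```python
def composite(fg_rgba, bg_rgb):
    """Blend one RGBA pixel over an opaque RGB background (straight alpha)."""
    r, g, b, a = fg_rgba
    alpha = a / 255  # 0.0 = fully transparent, 1.0 = fully opaque
    return tuple(
        round(alpha * fg + (1 - alpha) * bg)
        for fg, bg in zip((r, g, b), bg_rgb)
    )


# Hypothetical pixels: solid logo color, half-transparent drop shadow,
# and a fully transparent background pixel.
logo_pixels = [(0, 51, 102, 255), (0, 0, 0, 128), (0, 0, 0, 0)]

# Hypothetical theme background colors.
themes = {"white": (255, 255, 255), "parchment": (240, 230, 200)}

for name, background in themes.items():
    print(name, [composite(pixel, background) for pixel in logo_pixels])
```

The same three pixels produce a sensible result on either background, which is why one PNG can serve every theme. A GIF has only on/off transparency, so a soft edge like the drop shadow would have to be pre-blended against one specific background color.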

To round out the set, the branding page even includes a black and white logo without a drop-shadow for use in black and white print applications.

DP logo in black and white

Making these was a fun foray back to my design and publishing days. Turns out no one knows what font the original logo is in. It’s likely some variant of Garamond, but none that I could find. Luckily I was able to find Amiri, a free font from Google Fonts that was a pretty close match. That worked for everything except the ‘dp’ in the center of the logo and the core of the mark. Those bits of the logo are very visually striking and the letter shapes in Amiri were too different from the original to use. Fortunately Linda, the general manager, had already done some work in Corel to vectorize those bits into an image for use on Twitter. After combining the two together (and converting the final text to paths for better compatibility) it was a simple matter of adding the drop shadow and exporting some PNGs.

The “new” logo isn’t exactly like the old one, but it’s pretty close and hopefully conveys the most important visual aspects of the original. And now they’re available in easily-consumable formats for virtually any media.

Enabling DP development with a developer VM

Getting started doing development on the DP code can be quite challenging. You can get a copy of the source code quite readily, but creating a system to test any changes gets complicated due to the code dependencies — primarily its tight integration with phpBB.

For a long time now, developers have been able to request an account on our TEST server, which has all the prerequisites installed, including a shared database with loaded data. There are a few downsides to using the TEST server, however. The primary one is that everyone uses the same shared database, which significantly limits the changes that can be made without impacting others. Another downside is that you need internet connectivity to do development work.

Having a way to do development locally on your desktop would be ideal. Installations on modern desktops are almost impossible, however, given our current dependency on magic quotes, a “feature” which has us locked on PHP 5.3, a very archaic version that no modern Linux desktop includes.

Environments like this are a perfect use case for virtual machines. While validating the installation instructions on the recent release I set out to create a DP development VM. This ensured that our instructions could be used to set up a fully-working installation of DP as well as produce a VM that others could use.

The DP development VM is a VMware VM running Ubuntu 12.04 LTS with a fully-working installation of DP. It comes pre-loaded with a variety of DP user accounts (proofer, project manager, admin) and even a sample project ready for proofing. The VM is running the R201601 release of DP source directly from the master git repo, so it’s easy to update to newer ‘production’ milestones when they come out. With the included instructions a developer can start doing DP development within minutes of downloading the VM.

I used VMware because it was convenient: I already had Fusion on my Mac, and VMware Player is freely available for Windows and Linux. A better approach would have been VirtualBox1 as it’s freely available for all platforms. Thankfully it should be fairly straightforward to create a VirtualBox VM from the VMware .vmdk (I leave this as an exercise to another developer).

After I had the VM set up and working I discovered Vagrant while doing some hacking on OpenLibrary. If I had to create the VM again I would probably go the Vagrant route. Although I expect it would take me a lot longer to set up, it would significantly improve the development experience.

It’s too early to know if the availability of the development VM will increase the number of developers contributing to DP, but having yet another tool in the development tool-box can’t hurt.

1 Although I feel dirty using VirtualBox because it’s owned by Oracle. Granted, I feel dirty using MySQL for the same reason…

A new release of the DP site code, 9 years in the making

Today we released a new version of the Distributed Proofreaders code that runs pgdp.net! The announcement includes a list of what’s changed in the 9 years since the last release as well as a list of contributors, some statistics, and next steps. I’ve been working on getting a new release cut since mid-September so I’m pretty excited about it!

The prior release was in September 2006 and since that time there have been continuous, albeit irregular, updates to pgdp.net, but no package available for folks to download for new installations or to update their existing ones. Instead, enterprising individuals had to pull code from the ‘production’ tag in CVS (yes, seriously).

In the process of getting the code ready for release I noticed that there had been changes to the database on pgdp.net that hadn’t been reflected in the initial DB schema or the upgrade scripts in the code. So even if someone had downloaded the code from CVS they would have struggled to get it working.

As part of cutting the release I walked through the documentation that we provide, including the installation, upgrade, and configuration steps, and realized how much implied knowledge was in there. Much of the release process was me updating the documentation after learning what you were supposed to do.1 I ended up creating a full DP installation on a virtual machine to ensure the installation steps produced a working system. I’m not saying they’re now perfect, but they are certainly better than before.

Cutting a release is important for multiple reasons, including the ability for others to use code that is known to work. But the most important to me as a developer is the ability to reset dependency versions going forward. The current code, including that released today, continues to work on severely antiquated versions of PHP (4.x up through 5.3) and MySQL (4.x up to 5.1). This was a pseudo design decision in order to allow sites running on shared hosting with no control over their middleware to continue to function. Given how the hosting landscape has changed drastically over the past 9 years, and how really old those versions are, we decided it’s time to change that.

Going forward we’re resetting the requirements to be PHP 5.3 (but not later, due to our frustrating dependency on magic quotes) and MySQL 5.1 and later. This will allow us to use modern programming features like classes and exceptions that we couldn’t before.
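
As a rough illustration of the new requirements, the check below accepts only the PHP 5.3 series and any MySQL from 5.1 onward. This is a sketch in Python rather than code from the DP repository, and the function names are hypothetical.

```python
import re


def parse_version(text):
    """Turn '5.3.29' or '5.1.73-log' into a tuple of ints such as (5, 3, 29)."""
    numbers = re.findall(r"\d+", text.split("-")[0])
    return tuple(int(n) for n in numbers)


def php_supported(version):
    # Exactly the 5.3 series: magic quotes were removed in later versions.
    return parse_version(version)[:2] == (5, 3)


def mysql_supported(version):
    # 5.1 or anything newer.
    return parse_version(version) >= (5, 1)


for php in ("4.4.9", "5.3.29", "5.4.0"):
    print("PHP", php, "supported:", php_supported(php))
for mysql in ("4.1.22", "5.1.73-log", "5.5.40"):
    print("MySQL", mysql, "supported:", mysql_supported(mysql))
```

Comparing tuples of integers avoids the classic trap of comparing version strings, where "5.10" would sort before "5.3".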

Now that we have a release behind us, I’m excited to get more developers involved and start making some much-needed sweeping changes. Things like removing our dependency on magic quotes and creating a RESTful API to allow programmatic access to DP data. I’m hoping being on git and the availability of a development VM (more on that in a future blog post) will accelerate development.

If you’re looking for somewhere to volunteer as a developer for a literary2 great cause, come join us!

1 A serious hat-tip to all of my tech writer friends who do this on a daily basis!

2 See what I did there?

The life of a book at Distributed Proofreaders

Distributed Proofreaders (DP) is an international web-based community working together to create eBooks from physical books in the public domain. I’ve been a volunteer and developer at DP for over 9 years and a site administrator for the past 6. During my sabbatical I’ve enjoyed a more active role in the community wearing my developer and system administrator hats.

After realizing that many folks aren’t aware of how DP works, I wrote a post on the DP blog that follows the life of a book at Distributed Proofreaders. Therein you can see how the children’s book Uncle Wiggily’s Auto Sled by Howard Roger Garis went from a physical form to a beautiful eBook at Project Gutenberg (you can read the HTML version in your web browser).

I’ve got a few more big DP development tasks I want to complete in the next 3 months before my sabbatical is over. Stay tuned!

The relevance of Distributed Proofreaders in a Google Books world

The mission of Distributed Proofreaders (DP) is to preserve public domain books by converting them into high-quality eBooks and publishing them to Project Gutenberg. But they’re not the only ones who are working to digitally preserve the world’s books. Other players in this space include:

  • The Internet Archive (TIA), through its OpenLibrary project, is digitizing all the world’s public domain books and making them accessible.
  • Google Books has a similar mission1, but focuses on indexing the contents of the books so users can search against them.
  • HathiTrust is a collaboration of multiple universities working to digitally preserve their collections. They work with TIA, Google Books, and others to source their material.

Both OpenLibrary and Google Books make not only the digitized images available, but also the underlying text. In fact, both bundle up that text into eBooks automatically. The creation of these eBooks is entirely automated, without any human interaction, and is thus lightning fast compared to DP.

If TIA, Google Books, and others are all providing digital books to the public, why does DP still exist? What is its relevance in today’s world?

The answer is actually quite simple: accuracy of the text.

Text from OCR is ‘data’ not ‘information’

The quality of the OCR text from an image depends on multiple factors, including the quality of the image, the capabilities of the OCR software, and how the OCR software was configured. Even with the crispest image and sophisticated software the OCR text isn’t perfect and is filled with page artifacts and errors.

For example, I found a book that went through DP and is available in Project Gutenberg, and that is also available from the 3 providers above. I selected a random page in the project and compared the text output from each. The book was Railway construction by William Hemingway Mills; page 20. Links to the book from the different sources:

  1. Distributed Proofreaders (eBook 50696 at Project Gutenberg)
  2. Google Books
  3. OpenLibrary
  4. HathiTrust (the edition was digitized by Google Books)

Using the DP version, specifically the text of the page after it finished the P3 proofreading round and before it was formatted, I did a diff against the ones provided by Google Books and TIA, ignoring any changes in whitespace. I’ve highlighted the differences below; the first version given (lines prefaced by <) is the DP version, and the second (lines prefaced by >) is the Google Books or TIA version.
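
For anyone who wants to repeat this kind of comparison without the diff tool, the sketch below approximates a whitespace-insensitive comparison using Python's standard difflib module. It is an approximation of `diff -w`, not a reimplementation, and the two short line lists are a small excerpt of the page used here for illustration.

```python
import difflib


def normalize(line):
    """Remove all whitespace, roughly what `diff -w` ignores."""
    return "".join(line.split())


def changed_lines(left, right):
    """Yield (tag, left_lines, right_lines) for each region that differs."""
    matcher = difflib.SequenceMatcher(
        None,
        [normalize(line) for line in left],
        [normalize(line) for line in right],
        autojunk=False,
    )
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield tag, left[i1:i2], right[j1:j2]


dp_lines = [
    "(b) At any intermediate siding connection upon a line",
    "worked under the train staff and ticket system, or under the",
]
ocr_lines = [
    "(6) At any intermediate siding connection upon a line",
    "worked under  the train staff and ticket system,  or under the",
]

for tag, ours, theirs in changed_lines(dp_lines, ocr_lines):
    for line in ours:
        print("<", line)
    for line in theirs:
        print(">", line)
```

Only the first line is reported: the extra spaces in the second line are ignored, while the (b) versus (6) substitution is a real difference.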

The Google Books text was taken from their ePub version, the HTML tags stripped out, and the newlines reinserted for easier comparison.
$ diff -w railroad_dp.txt railroad_google.txt
4c4
< (b) At any intermediate siding connection upon a line
---
> (6) At any intermediate siding connection upon a line
31c31
< Every signal arm to be so weighted as to fly to and remain
---
> Every signal arm to be so weighted as to lly to and remain

The TIA text was taken from their plain text version. Note the cruft on the page at the end — yes, that’s really present in their text:
$ diff -w railroad_dp.txt railroad_tia.txt
4,5c4,5
< (b) At any intermediate siding connection upon a line
< worked under the train staff and ticket system, or under the
---
> (b) At any intermediate siding cormection upon a line
> worked u/nder the train staff and ticket system, or under the
11c11
< system: Sidings, if any, being locked as in (b).
---
> system : Sidings, if any, being locked as in (6).
15c15
< one arm on one side of a post, to be made to apply--the first, or
---
> one arm on one side of a post, to be made to apply — the first, or
41c41
< run over by trains of other companies using a different system
---
> run over by trains of other companies usiing a different system
50a51,58
>
>
> .â– J ML^
>
>
>
> *•**
>

Something important to note is that there are no cases in this page where the Google Books or TIA versions found errors in the DP version. At least for this one page in this one book, DP provides the most accurate text.2

Google Books cares about the accuracy of the text only as much as it can index and bring up a book based on that book’s indexed contents. They don’t care if pages have ancillary characters or incorrect words as long as, taken as a whole, the book is indexable. I presume TIA cares more about having valid text, but they don’t appear to have the resources to improve it.

Errors like the ones above are fairly minor and mostly just annoying for the average reader. However, consider such errors in a scientific book or journal where the accuracy of the numbers is very important.

OCR-only text is just a bunch of data, but without accuracy it’s not really information. In fact, in some subtle cases it could be misinformation.

DP provides more accurate text, but it does so at the cost of speed. A book can take from days to years to go through the whole process at DP and be published at Project Gutenberg.

Improving DP’s relevance

Currently, every text that goes through DP ends up in Project Gutenberg as an eBook. The eBooks are far superior to the ones produced by automated systems and are a delight to read. There will always be a need for these.

However, there are small things we can do at DP to become more relevant in today’s digital ecosystem.

Closing the loop with image providers

It’s sad that while DP sources many of its images from Google Books or TIA, those providers continue to offer sub-par text and eBooks for download well after DP has uploaded finished eBooks to Project Gutenberg.

DP should close the loop with TIA and Google Books to provide them with updated eBooks and page texts. Projects at DP already identify where the images were sourced from, so it would be straightforward to send the providers the updated text in an automated way. I can see providers like Google Books being particularly interested in accurate page texts to refine their OCR algorithms and improve their search index. Both TIA and Google Books could use the accurate page texts to update the underlying text in their image PDFs, allowing accurate PDF searching and accessibility (eg: screen readers).

Partnering with the image providers in this manner is the right thing to do for the world at large and a potential source for more volunteers, developers, and perhaps even funding.

Not everything needs to become an eBook

Not everything needs to be a beautiful, hand-crafted eBook. Some printed materials, like journal articles and legal briefs, would benefit most from simply having accurate text — something DP excels at.

If DP were to expand its mission to encompass the accurate preservation of all public-domain printed materials, with the end product varying depending on the needs of the item, it could increase the rate at which accurate public domain texts are produced. Such materials would only go through the proofreading rounds, skipping the formatting rounds and post-processing step that are the biggest bottleneck. This could result in the accurate text being available within mere days.

With this in place, I can see DP partnering with folks like JSTOR or CourtListener to proofread their public domain materials. Such partnerships would be good publicity and a valuable source of new volunteers. Because this would still be limited to public domain material, Project Gutenberg could accept these not-eBooks as a final product if it chose.

Potentially more than relevant

By increasing DP’s scope, partnering with image providers, and leveraging its strengths, DP can remain more than relevant in the age of Google Books, but it’s going to take some realignment of mission and buy-in from the community to get there.

 

1 Google Books is digitizing much more than just public domain books, but let’s focus on the public domain books in this discussion; there are more than enough of them out there.

2 The astute among you will point out that all I’ve done is compare the resulting text from each source, not the text to the actual image, which is required to determine accuracy. You are correct: all I have shown is how closely the texts agree with one another. I have great confidence in the accuracy of the DP version compared to the image, but I leave proof of that as an exercise for the reader. And if you enjoy that kind of work, have I got a great site for you to volunteer with!

Development leadership failure

Last night I did some dev work for DP. Mostly some code cleanup (heaven knows we need it) but also rolling out some committed code to production. I’ve made a concerted effort to get committed-but-not-released code deployed — some of which has been waiting for, literally, years.

Even worse, we have reams of code updates sitting uncommitted (and slowly suffering from bitrot) in volunteers’ sandboxes waiting for code review. In the case of Amy’s new quizzes, for almost 5(!!!!) years. In other cases volunteers have done a crazy amount of legwork to address architectural issues that remain unimplemented due to no solid commitment that if they did the work it would be reviewed, committed, and deployed — like Laurent’s site localization effort.

These are clear systemic failures by development leadership, ie: me. It’s obvious why even when the project attracts developers, we can’t retain them.

The first step is to get through the backlog of outstanding work. I have Laurent’s localization work almost finished. This will allow the site to be translated into other languages — I think Portuguese and French are already done. Next up is getting Amy’s new quizzes pushed out. She’s done a marvelous job of keeping her code up to date with HEAD based on my initial work last night. Now to get them committed and rolled out. Then a site-wide change on our include()s required to get full site localization implemented.

After all that, we need to address how to better keep code committed and rolled out. I think we as a team suffer from “don’t commit until it’s perfect, then wait until it’s simmered before rolling it out”. Where “simmered” means “sitting in CVS with no active testing done on it”. We need to move to more flexible check-in criteria or a more liberal roll-out. There’s no good reason why the bar is so crazy high on both ends of that.

But first – the backlog.

PGDP.net – spellcheck facelift

After taking a look back at my journal entries I was surprised to see that I’ve never posted anything about PGDP or WordCheck, which really is amazing since it has consumed many hours of my life.

I guess I should start with some background. Last September I came across Distributed Proofreaders whose mission is to preserve books by providing texts to Project Gutenberg (PG). The site provides a mechanism for content providers to upload scanned and OCR’d copies of books, proofers to validate and correct the text one page at a time by comparing it to the scanned image, formatters (aka foofers) to format the text, and post-processors to ensure consistency and do a final edit before the text is uploaded to PG. Each book, or project, is guided through the process by a project manager who is responsible for helping with questions from proofers or foofers.

I began as a simple proofer, proofing a couple of pages a day during my lunch break. The more pages I proofed the more annoyed I became with the spellcheck component of the proofing interface. It didn’t offer the ability to see the page image in the default interface, the misspelled words were displayed as a drop-down box with suggestions instead of a text box to make corrections, and there was no way to add words to the spell checker’s dictionary so there were many false positives.

After searching some of the forums, I found that my concerns were shared by others. A bit more digging revealed that the PGDP code is open source and thus available for improvement. It will surprise no one that I got annoyed enough that I started working to improve these issues, collaborating with more senior project developers and active members of the community. The resulting “spell check” was sufficiently advanced that we renamed it WordCheck to convey that the tool does much more than simply check for misspelled words:

  • The image is always shown when making corrections to the page text.
  • Instead of using drop-down boxes to indicate misspelled words, edit boxes are used instead.
  • If the project is listed as having multiple languages, the page text is checked using all possible language dictionaries instead of just the first one, reducing the number of false positives.
  • Each project now has its own ‘dictionary’ in the form of a Good Words List that project managers can use to add frequently occurring words such as proper nouns to further reduce false positives.
  • Each project also has a Bad Words List that a project manager can use to add words that should be flagged even if they pass the dictionary. This is used to help proofers find and correct scannos (like typos but made by the OCR software) such as modem for modern or arid for and.
  • While proofing, proofers have the ability to suggest that a word be added to the project’s Good Words List.
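
The way the word lists interact can be sketched in a few lines. This is an illustration of the rules in the list above, written in Python with made-up dictionaries; it is not the actual WordCheck code, and the function name is hypothetical.

```python
import re


def flag_words(page_text, dictionaries, good_words=(), bad_words=()):
    """Return the words a proofer should review, in page order, no repeats."""
    # A word is known if ANY project language dictionary or the
    # project's Good Words List contains it.
    known = set()
    for dictionary in dictionaries:
        known |= {word.lower() for word in dictionary}
    known |= {word.lower() for word in good_words}
    # Bad words are flagged even when a dictionary accepts them.
    bad = {word.lower() for word in bad_words}

    flagged = []
    for word in re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", page_text):
        key = word.lower()
        if (key in bad or key not in known) and word not in flagged:
            flagged.append(word)
    return flagged


# Tiny stand-in dictionaries for a project listed as English and French.
english = {"the", "signal", "is", "and", "modem", "arid"}
french = {"le", "signal", "est"}

print(flag_words(
    "The modem signal est arid; Wiggily agrees.",
    dictionaries=[english, french],
    good_words=["Wiggily"],
    bad_words=["modem", "arid"],
))
```

Here "est" passes because of the second dictionary, "Wiggily" passes because of the Good Words List, "modem" and "arid" are flagged as likely scannos despite being real words, and "agrees" is flagged only because the toy dictionary is so small.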

This initial version was released in mid-March.

After the initial release development continued to enhance the ability for project managers to manage project word lists and proofer suggestions. These enhancements were released at the end of May and improved the project manager’s interaction with WordCheck. Further tool development occurred and another minor release was made at the end of July.

As a proofer I can vouch that the changes made my life much easier. Several project managers have said that they are seeing better quality texts coming out of each round as well since WordCheck has been released. Overall I’ve really enjoyed working with the PGDP folks, both developers and proofers. I think my WordCheck active development is coming to an end unless defects are found. Instead I have my sights on the proofing interface itself and the spaghetti code that makes it run.