When Distributed Proofreaders started in 2000, Project Gutenberg only accepted eBooks in ASCII, later expanding to ISO-8859-1. pgdp.net has always supported only ISO-8859-1 (although in practice this was really Windows-1252), which we refer to simply as Latin-1. This character set was enforced not only for the eBooks themselves, but also for the user interface. While the DP codebase has long supported arbitrary character sets, in many places the code assumed 1-byte-per-character encodings.
But the world is much bigger than Latin-1 and there are millions of books in languages that can’t be represented with 1-byte-per-character encodings. Enter Unicode and UTF-8, which Project Gutenberg started accepting a while back.
There has been talk of moving pgdp.net to Unicode and UTF-8 for many years but the effort involved is daunting. At a glance:
- Updating the DP code to support UTF-8. The code is in PHP, which doesn’t support UTF-8 natively. Most PHP functions treat strings as arrays of bytes regardless of the encoding.
- Converting our hundreds of in-progress books from ISO-8859-1 to UTF-8.
- Finding monospace proofreading fonts that support the wide range of Unicode glyphs we might need.
- Updating documentation and guidelines.
- Educating hundreds of volunteers on the changes.
In addition, moving to Unicode introduces the possibility that proofreaders will insert incorrect Unicode characters into the text. Imagine the case where a proofreader inserts a κ (kappa) instead of a k. Because the two are visually very similar, this error may end up in the final eBook. Unicode contains well over one hundred thousand characters, almost none of which belong in our books.
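Look-alike characters like this are invisible to the eye but easy to expose programmatically. A quick sketch (not DP's actual tooling) using Python's standard `unicodedata` module to flag any non-ASCII character by its official Unicode name:

```python
import unicodedata

# Looks like "book", but the last letter is GREEK SMALL LETTER KAPPA,
# not LATIN SMALL LETTER K.
text = "boo\u03ba"

for ch in text:
    if ord(ch) > 0x7F:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+03BA GREEK SMALL LETTER KAPPA
```

A report like this makes the kappa jump out immediately, even though the rendered text looks fine.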
There has been much hemming and hawing and discussions for probably at least a decade, but little to no progress on the development side.
Unicode, here we come
Back in February 2018 I started working on breaking down the problem, doing research on our unknowns, making lists of lists, and throwing some code against the wall and seeing what stuck.
This past November, almost 2 years later, we got to what I believe is functionally code-complete. Next June, we intend to roll out the update to convert the site over to Unicode. Between now and then we are working to finish testing the new code, address any shortcomings, and update documentation. The goal is to have the site successfully moved over to Unicode well before our 20th anniversary in October 2020.
Discoveries and Decisions
Some interesting things we’ve learned and/or decided as part of this effort.
Rather than open up the floodgates and allow proofreaders to input any Unicode character into the proofreading interface, we’re allowing project managers to restrict what Unicode characters they want their project to use. Both the UI and the server-side normalization enforce that only those characters are used. This allows the code to support the full set of Unicode characters but reduces the possibility of invalid characters in a given project.
For our initial roll-out, we will only be allowing projects to use the Latin-1-based glyphs in Unicode. So while everything will be in Unicode, proofreaders will only be able to insert glyphs they are familiar with. This will give us some time to ensure the code is working correctly and that our documentation is updated before allowing other glyphsets to be used.
Unicode has some real oddities that we continue to stumble across. It took me weeks to wrap my head around how the Greek polytonic oxia forms get normalized down to the monotonic tonos forms, and what that meant for our books, which all predate the 1980s change from polytonic to monotonic. I also can’t believe I just summed up weeks of discussion and research in that one sentence.
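For the curious, the oxia/tonos behavior is easy to demonstrate: the precomposed oxia code points carry a canonical mapping to their tonos equivalents, so standard NFC normalization silently rewrites them. A small illustration with Python's `unicodedata`:

```python
import unicodedata

# U+1F71 GREEK SMALL LETTER ALPHA WITH OXIA (polytonic form)
oxia = "\u1f71"

# NFC normalization maps it to U+03AC GREEK SMALL LETTER ALPHA
# WITH TONOS (the monotonic form) -- the polytonic code point
# simply does not survive normalization.
tonos = unicodedata.normalize("NFC", oxia)
assert tonos == "\u03ac"
assert unicodedata.name(tonos) == "GREEK SMALL LETTER ALPHA WITH TONOS"
```

So any pipeline that normalizes its text, as most Unicode-aware software does, will quietly convert polytonic oxia forms to tonos forms, which matters a great deal for pre-1980s Greek texts.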
DP has our own proofreading font: DPCustomMono2. It is designed to help proofreaders distinguish between similar characters such as: l I 1. Not surprisingly, it only supports Latin-1 glyphs. We are still evaluating how to broaden it to support a wider set. That said, fonts render glyphs, not character sets, so the font will continue to work for projects that only use the Latin-1 character set.
We were able to find two other monospace fonts with very broad Unicode support: Noto Sans Mono and DejaVu Sans Mono. Moreover, both can be served as web fonts (see this blog post for the former, and FontSquirrel for the latter), ensuring that all of our proofreaders have access to a monospace Unicode-capable font. Note that Noto Mono, a prior version of Noto Sans Mono, is deprecated; use Noto Sans Mono instead.
Most browsers do sane glyph substitution. Say you have the style ‘font-family: DPCustomMono2, Noto Sans Mono;’ and both fonts are available to your browser as web fonts. If the text contains a glyph that doesn’t exist in DPCustomMono2, the browser will render that one glyph in Noto Sans Mono and keep going, rather than render a tofu. This is great news: it means we can provide one of our two wide-coverage Unicode monospace fonts as a fallback in the font-family, ensuring that we always have some monospace rendering of every character on a page.
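As a concrete sketch, the fallback stack is just an ordinary font-family rule (the class name here is hypothetical, not DP's actual stylesheet):

```css
/* Any glyph missing from DPCustomMono2 falls through to
   Noto Sans Mono; a generic monospace is the last resort. */
.proofreading-text {
    font-family: DPCustomMono2, "Noto Sans Mono", monospace;
}
```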
Modern versions of MySQL support two different UTF-8 encodings: utf8mb3 (aka utf8) & utf8mb4. The former only supports 3-byte UTF-8 characters whereas utf8mb4, introduced in MySQL v5.5.3, supports 4-byte UTF-8 characters. We opted for utf8mb4 to get the broadest language support and help future-proof the code (utf8mb3 is now deprecated).
An early fear was that we would need to increase the size of our varchar columns to handle the larger string widths needed with UTF-8. While this was true in MySQL 4.x, since 5.x a varchar’s declared size is measured in characters, not bytes, regardless of encoding, so no resizing was necessary.
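Converting a table is therefore a one-statement operation. A sketch (the table name is hypothetical, not DP's actual schema):

```sql
-- VARCHAR(255) still means 255 *characters* in MySQL 5.x and later,
-- so existing column definitions can stay as they are.
ALTER TABLE projects
    CONVERT TO CHARACTER SET utf8mb4
    COLLATE utf8mb4_unicode_ci;
```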
PHP continues to be a pain regarding UTF-8, but it wasn’t as bad as we feared. It turns out that although most PHP string functions operate on byte arrays and assume 1-byte characters, in most of the places where we use them that’s just fine.
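The byte-vs-character distinction at the heart of the problem is easy to see in any language. A quick Python illustration of the same gap that separates PHP's byte-oriented `strlen()` from the multibyte-aware `mb_strlen()`:

```python
# In UTF-8, "é" occupies two bytes (0xC3 0xA9), so byte length and
# character length disagree -- exactly the mismatch that bites
# byte-oriented string functions.
s = "café"
assert len(s) == 4                  # characters
assert len(s.encode("utf-8")) == 5  # bytes
```

Operations that merely copy, concatenate, or search for ASCII delimiters are unaffected by this mismatch, which is why most call sites needed no change.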
For other places, we found portable-utf8 to be a great starting point. Some of the functions aren’t particularly performant and it’s incredibly annoying that so many of them call utf8_clean() every time they are used, but it was super helpful in moving in the right direction.
mb_detect_encoding() is total crap, as I mentioned in a recent blog post, but we hacked around that.
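One workable alternative, sketched here in Python rather than PHP and not necessarily what DP's code does: attempt a strict UTF-8 decode first, and only fall back to a Windows-1252 interpretation if that fails. Valid UTF-8 is rarely also plausible legacy text, so the strict decode acts as a reliable discriminator.

```python
def guess_decode(raw: bytes) -> tuple[str, str]:
    """Decode bytes as UTF-8 if valid, else as Windows-1252."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Legacy content on the site is effectively Windows-1252.
        return raw.decode("windows-1252"), "windows-1252"

assert guess_decode("café".encode("utf-8")) == ("café", "utf-8")
assert guess_decode("café".encode("latin-1")) == ("café", "windows-1252")
```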