Migrating Distributed Proofreaders to Unicode

When Distributed Proofreaders started in 2000, Project Gutenberg only accepted eBooks in ASCII, later flexing to ISO-8859-1. pgdp.net has only ever supported ISO-8859-1 (although in practice this was really Windows-1252), which we refer to simply as Latin-1. This character set was enforced not only for the eBooks themselves, but also for the user interface. While the DP codebase has long supported arbitrary character sets, in many places it was assumed these were always 1-byte-per-character encodings.

But the world is much bigger than Latin-1 and there are millions of books in languages that can’t be represented with 1-byte-per-character encodings. Enter Unicode and UTF-8, which Project Gutenberg started accepting a while back.

There has been talk of moving pgdp.net to Unicode and UTF-8 for many years but the effort involved is daunting. At a glance:

  • Updating the DP code to support UTF-8. The code is in PHP which doesn’t support UTF-8 natively. Most PHP functions treat strings as an array of bytes regardless of the encoding.
  • Converting our hundreds of in-progress books from ISO-8859-1 to UTF-8.
  • Finding monospace proofreading fonts that support the wide range of Unicode glyphs we might need.
  • Updating documentation and guidelines.
  • Educating hundreds of volunteers on the changes.

In addition, moving to Unicode introduces the possibility that proofreaders will insert incorrect Unicode characters into the text. Imagine the case where a proofreader inserts a κ (kappa) instead of a k. Because visually they are very similar this error may end up in the final eBook. Unicode contains well over a hundred thousand characters that wouldn't belong in our books.
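The kappa-for-k problem is easy to see with Python's unicodedata module: the two characters are distinct codepoints with distinct names, even though many fonts render them nearly identically, so a stray kappa would pass a visual check but survive into the final text.

```python
import unicodedata

# 'k' and 'κ' look alike in many fonts but are different codepoints.
latin_k = "k"           # U+006B LATIN SMALL LETTER K
greek_kappa = "\u03ba"  # U+03BA GREEK SMALL LETTER KAPPA

print(unicodedata.name(latin_k))      # LATIN SMALL LETTER K
print(unicodedata.name(greek_kappa))  # GREEK SMALL LETTER KAPPA
print(latin_k == greek_kappa)         # False
```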

There has been much hemming, hawing, and discussion for probably at least a decade, but little to no progress on the development side.

Unicode, here we come

Back in February 2018 I started working on breaking down the problem, doing research on our unknowns, making lists of lists, and throwing some code against the wall and seeing what stuck.

This past November, almost 2 years later, we got to what I believe is functionally code-complete. Next June, we intend to roll out the update to convert the site over to Unicode. Between now and then we are working to finish testing the new code, address any shortcomings, and update documentation. The goal is to have the site successfully moved over to Unicode well before our 20th anniversary in October 2020.

Discoveries and Decisions

Some interesting things we’ve learned and/or decided as part of this effort.


Rather than open up the floodgates and allow proofreaders to input any Unicode character into the proofreading interface, we’re allowing project managers to restrict what Unicode characters they want their project to use. Both the UI and the server-side normalization enforce that only those characters are used. This allows the code to support the full set of Unicode characters but reduces the possibility of invalid characters in a given project.
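The site's code is PHP, but the idea behind the per-project restriction can be sketched in a few lines of Python. The character set below is a made-up example, not DP's actual configuration: any character outside the project's allowed set is flagged before the text is accepted.

```python
# A hypothetical per-project character set: Basic Latin plus a few accents.
ALLOWED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789 .,;:!?'\"-\n"
    "éèêëàâçîïôùû"
)

def invalid_characters(text):
    """Return the set of characters in text that the project doesn't allow."""
    return set(text) - ALLOWED

print(invalid_characters("café"))       # set() -- everything allowed
print(invalid_characters("caf\u03ba"))  # {'κ'} -- stray kappa caught
```

Enforcing this on both the client and the server, as described above, means a bad character can't sneak past a missing UI check.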

For our initial roll-out, we will only be allowing projects to use the Latin-1-based glyphs in Unicode. So while everything will be in Unicode, proofreaders will only be able to insert glyphs they are familiar with. This will give us some time to ensure the code is working correctly and that our documentation is updated before allowing other glyphsets to be used.

When you start allowing people to input Unicode characters, you have to provide them an easy way to select characters not on their keyboard. Our proofreaders come to us on all kinds of different devices and keyboards. We put a lot of effort into easing how proofreaders find and use common accented characters in addition to the full array of supported characters for a given project. One of our developers created a new, extensible character picker for our editing interface, in addition to coding up JavaScript that converts our custom diacritical markup into the desired character.

Unicode has some real oddities that we continue to stumble across. It took me weeks to wrap my head around how the Greek polytonic oxia forms get normalized down to the monotonic tonos forms and what that meant for our books, which all predate the 1980s change from polytonic to monotonic. I also can't believe I summed up weeks of discussions and research in that one sentence.
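The oxia/tonos behavior can be seen directly with Python's unicodedata: the polytonic oxia codepoints carry canonical decompositions to their monotonic tonos equivalents, so NFC normalization silently collapses one into the other and the distinction can't round-trip.

```python
import unicodedata

oxia = "\u1f71"   # GREEK SMALL LETTER ALPHA WITH OXIA  (polytonic)
tonos = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS (monotonic)

# NFC maps the oxia form onto the tonos form...
print(unicodedata.normalize("NFC", oxia) == tonos)  # True

# ...because both decompose to the same base letter + combining acute.
print(unicodedata.normalize("NFD", oxia) == "\u03b1\u0301")  # True
```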


DP has our own proofreading font: DPCustomMono2. It is designed to help proofreaders distinguish between similar characters such as: l I 1. Not surprisingly, it only supports Latin-1 glyphs. We are still evaluating how to broaden it to support a wider set. That said, fonts render glyphs, not character sets, so the font will continue to work for projects that only use the Latin-1 character set.

We were able to find two other monospace fonts with very broad Unicode support: Noto Sans Mono and DejaVu Sans Mono. Moreover, both of these can be provided as a web font (see this blog post for the former, and FontSquirrel for the latter), ensuring that all of our proofreaders have access to a monospace Unicode-capable font. Note that Noto Mono, a predecessor of Noto Sans Mono, is deprecated; use Noto Sans Mono instead.

Most browsers do sane glyph substitution. Say you have the following font-family style: ‘font-family: DPCustomMono2, Noto Sans Mono;’ and both fonts are available to your browser as web fonts. If you use that styling against text and there is a glyph to render that doesn’t exist in DPCustomMono2, the browser will render that one glyph in Noto Sans Mono and keep going rather than render a tofu. This is great news as it means we can provide one of our two sane wide-coverage Unicode monospace fonts as a fallback font-family ensuring that we will always have some monospace rendering of all characters on a page.


Modern versions of MySQL support two different UTF-8 encodings: utf8mb3 (aka utf8) & utf8mb4. The former only supports 3-byte UTF-8 characters whereas utf8mb4, introduced in MySQL v5.5.3, supports 4-byte UTF-8 characters. We opted for utf8mb4 to get the broadest language support and help future-proof the code (utf8mb3 is now deprecated).
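The conversion itself is a one-statement table rebuild; a sketch in SQL (the table name and collation here are placeholders, not DP's actual schema):

```sql
-- Convert an existing table's character set to utf8mb4.
-- Note: this rebuilds the entire table, which matters for large tables
-- (see the tablespace discussion below).
ALTER TABLE some_table
    CONVERT TO CHARACTER SET utf8mb4
    COLLATE utf8mb4_unicode_ci;
```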

An early fear was that we would need to increase the size of our varchar columns to handle the larger string widths needed with UTF-8. While this was true in MySQL 4.x, in 5.x varchar sizes represent the length of the string in characters, regardless of encoding.


PHP continues to be a pain regarding UTF-8, but it wasn't as bad as we feared. It turns out that although most of the PHP string functions operate only on byte arrays and assume 1-byte characters, in most of the places where we use them that's just fine.
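The byte-array behavior is easy to demonstrate; Python's bytes/str split makes explicit the same distinction that PHP's strlen()/mb_strlen() pair does:

```python
s = "héllo"              # 5 characters
b = s.encode("utf-8")    # 'é' takes 2 bytes in UTF-8

print(len(s))  # 5 -- character count (what mb_strlen() reports)
print(len(b))  # 6 -- byte count (what PHP's strlen() reports)

# Byte-oriented slicing can split a multi-byte character in half:
print(b[:2])   # b'h\xc3' -- a truncated 'é'
```

Whether that one-character discrepancy matters depends entirely on what the surrounding code does with the result, which is why so much of it could be left alone.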

For other places, we found portable-utf8 to be a great starting point. Some of the functions aren’t particularly performant and it’s incredibly annoying that so many of them call utf8_clean() every time they are used, but it was super helpful in moving in the right direction.

mb_detect_encoding() is total crap, as I mentioned in one recent blog post, but we hacked around that.

Operations that move InnoDB tables out of the system tablespace

Distributed Proofreaders has a very large InnoDB-backed table that was created many years ago on a MySQL version that only supported the system tablespace (ibdata1). We’ve since upgraded to 5.7 which supports file-per-table tablespaces and we have innodb_file_per_table=ON.

With file-per-table tablespaces enabled, some table operations will move tables from the system tablespace to their own per-file tablespace and given the size of the table in question it was important to understand which ones would cause this to happen.

My research led me to the Online DDL Operations doc, which was the key. Any operation that says it Rebuilds Table will move an InnoDB table out of the system tablespace, regardless of whether In Place says "Yes".

For example, the following will keep the table where it is:

  • Creating, dropping, or renaming non-primary-key indexes
  • Renaming a column
  • Setting a column default value
  • Dropping a column default value

And these will rebuild the table and move it to its own tablespace:

  • Adding or dropping primary keys
  • Adding or dropping a column
  • Reordering columns
  • Changing column types
  • Converting a character set

The above lists are not exhaustive; see the Online DDL Operations documentation for your version of MySQL for the definitive list.

It’s important to know that OPTIMIZE TABLE will also move an InnoDB table out of the system tablespace.

If there’s an operation you want to perform that will rebuild the table but you want to keep the table in the system tablespace, you can temporarily set innodb_file_per_table=OFF (it’s a dynamic variable), do the operation, and then turn it back ON.
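A sketch of that toggle (the table name is a placeholder); because innodb_file_per_table is a dynamic global variable, no server restart is needed:

```sql
-- Keep a rebuilt table in the system tablespace:
SET GLOBAL innodb_file_per_table = OFF;

-- Any rebuilding operation now leaves the table where it is, e.g.:
ALTER TABLE big_table CONVERT TO CHARACTER SET utf8mb4;

SET GLOBAL innodb_file_per_table = ON;
```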

And for the curious, if you have a table already in its per-file tablespace and set innodb_file_per_table=OFF, making changes that will rebuild the table won’t move it to the system tablespace. It looks like you have to drop and recreate the table to do that.

Casey’s 2019 Playlist

It’s December, which means it’s time to reveal this year’s playlist (aka: mix cd). Like every year, this playlist is a year in review.

Woman Up, Wannabe, and This Is Me are a shout-out to all the amazing women in my life. Thank you for being awesome just the way you are.

My friend Sam introduced me to Cowboy Bebop and the intro credits won me over. Sadly, the original Tank! by The Seatbelts isn’t available for digital purchase so I had to buy a CD and rip it. (The version on the Spotify list is a cover.)

At the beginning of the year I got to see Steve Grand in concert down in Puerto Vallarta, so he had to be on the list for sure. Pink and Norah Jones released new albums so they got a lot (and I mean a lot) of airtime. Yes, there are 4 Pink songs on the list this year — she’s just that amazing. And while Michael Buble’s new album was released in 2018, I didn’t discover it until the beginning of 2019.

From Pink’s 90 Days, which tears me up every time I hear it, I discovered Wrabel’s music. If you’re up for an emotional roller coaster about trans acceptance, watch his music video of The Village. Then feel a little better by telling all the conservative assholes You Need To Calm Down. Finally, pay homage to the leader of the queer mafia himself: Elton John. His biopic Rocketman reminded me that yes, sometimes that’s really why they call it the blues.

And most importantly: I got married this year! Up’s Married Life seemed like the most perfect way to end the set.

  1. Tank! – The Seatbelts
  2. Woman Up – Meghan Trainor
  3. Wannabe – Spice Girls
  4. (Hey Why) Miss You Sometimes – Pink
  5. Hustle – Pink
  6. This Is Me – Keala Settle
  7. Wintertime – Norah Jones
  8. Help Me Make It Through The Night – Michael Buble
  9. You Need To Calm Down – Taylor Swift
  10. Walk Me Home – Pink
  11. 90 Days – Pink & Wrabel
  12. Stay – Steve Grand
  13. Walking – Steve Grand
  14. 11 Blocks – Wrabel
  15. The Village – Wrabel
  16. Easy Like Sunday Morning – Lionel Richie
  17. That’s Why They Call It The Blues – Elton John
  18. Your Song (instrumental) – Moulin Rouge Soundtrack, disc 2
  19. Married Life – Up! Soundtrack

You can listen to the songs on Spotify too (except for track 1, where you get a cover of Tank!, and track 16, because Lionel Richie isn’t on Spotify). As always, the order of the songs has been carefully curated. You may not be able to listen to them in order with a Spotify free account.

Detecting Windows-1252 encoding

For DP’s move to Unicode we need to handle accepting files from content providers that are not in UTF-8. Usually these files come in as Windows-1252, but sometimes they might be ISO-8859-1, UTF-16, or even in UTF-32. We need to get the detection correct to ensure a valid conversion to UTF-8.

For reasons beyond my ken, PHP’s mb_detect_encoding() function appears to be completely unable to detect the difference between Windows-1252 and ISO-8859-1 for strings that clearly have characters in the 0x80 to 0x9F range. Shockingly, it also wasn’t able to detect files encoded as UTF-16 with BOMs, which I absolutely don’t understand. And it appears I’m not the only person having problems with it.

So we rolled our own, which I feel is almost as blasphemous as writing our own date handling library, but here we are. In case others out there are looking for something similar, here you go. Keep in mind that our objective is to determine an encoding from an expected set and ultimately convert the string to UTF-8.

This detection doesn’t have to be perfect. If the file isn’t in UTF-8 we warn the project manager about the detected encoding before they load the files, so if we guess the encoding wrong there’s a human to double-check it before proceeding.

# Attempt to detect a string's encoding from a subset of expected encodings:
# * UTF-8 (includes pure-ASCII which is a valid subset)
# * UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE based on the BOM
# * Windows-1252
# * ISO-8859-1
# These strings match what mb_detect_encoding() would return. The function
# returns False if it's unable to guess, although it will readily return
# ISO-8859-1 in many circumstances.
function guess_string_encoding($text)
{
    # a successful UTF-8-mode regex match means the string is valid UTF-8
    if (preg_match('//u', $text))
        return 'UTF-8';

    # evaluate the BOM, if one exists, borrowed from
    # https://stackoverflow.com/questions/49546403/php-checking-if-string-is-utf-8-or-utf-16le
    $first2 = substr($text, 0, 2);
    $first4 = substr($text, 0, 4);
    if ($first4 == "\x00\x00\xFE\xFF")
        return 'UTF-32BE';
    elseif ($first4 == "\xFF\xFE\x00\x00")
        return 'UTF-32LE';
    elseif ($first2 == "\xFE\xFF")
        return 'UTF-16BE';
    elseif ($first2 == "\xFF\xFE")
        return 'UTF-16LE';

    # if there are any characters in ranges that are either control characters
    # or invalid for ISO-8859-1 or CP-1252, return False
    if (preg_match('/[\x00-\x08\x0E-\x1F\x81\x8D\x8F\x90\x9D]/', $text))
        return False;

    # if we get here, we're going to assume it's either Windows-1252 or
    # ISO-8859-1. If the string contains characters in the ISO-8859-1
    # reserved range, that's probably Windows-1252.
    if (preg_match('/[\x80-\x9F]/', $text))
        return 'Windows-1252';

    # Give up and return ISO-8859-1
    return 'ISO-8859-1';
}

Like all dproofreaders code, the above is in the GPL v2.0.

Noto Sans Mono webfont

Over at Distributed Proofreaders we’re busy working to convert the site code over to Unicode from Latin-1. Part of that work is finding a monospace proofreading font that covers the set of Unicode codepoints we need and is, or can be made available as, a webfont.

The Google Noto font family is an obvious candidate as it contains two monospace fonts: Noto Mono and Noto Sans Mono. Frustratingly, while Noto Sans is available directly from the Google Font CDN, neither of the two Mono fonts are included.

Noto Mono is available for download as a webfont from FontSquirrel. Note that by default FontSquirrel will provide you with the Western Latin subset of the font, so if you want the full thing, disable subsetting. Note, however, that Noto Mono is deprecated in favor of Noto Sans Mono.

Noto Sans Mono is not available from FontSquirrel; their Noto Sans page says they can’t provide the font due to licensing restrictions. Noto Sans is licensed under the SIL Open Font License, which FontSquirrel reads as preventing them from providing a webfont version of it. Noto Sans Mono is under the same license.

If your read of the SIL Open Font License is different, or if you are willing to change the name of the font and use it, you can use their Webfont Generator to upload the desired TTF variants you’ve downloaded from the Google Noto Sans Mono page.

It’s worth talking about font variants for a moment. Noto Sans Mono comes with 36 variants, such as Regular, Bold, ExtraBold, Light, ExtraLight, Condensed — you get the idea. Each one of these contains glyphs to render a wide range of Unicode codepoints in the desired form. You are probably most interested in the Regular and Bold forms.

Let’s assume you upload the Regular and Bold TTF files to FontSquirrel’s Webfont Generator. You will probably want to enable Expert mode and de-select many of the things that FontSquirrel will do to the font, like truetype hinting, fixing missing glyphs, fixing vertical metrics, etc. Google has done a great job on these fonts and you shouldn’t need any of that. If you want to retain the font’s full set of glyphs you will want to disable subsetting too.

The download will include two WOFF files, two WOFF2 files, and a stylesheet to use them. FontSquirrel doesn’t understand that we’ve uploaded two variants of the same font, so we need to fix the stylesheet so that the bold and regular versions work like we want them to. We need to use style linking to tell the web browser that they are the same font, just different styles. Note how the CSS below uses the same font name for both the Regular and the Bold, but the bold version has a ‘font-weight: bold’ declaration:

@font-face {
    font-family: 'Not Noto Sans Mono';
    src: url('notnotosansmono-regular.woff2') format('woff2'),
         url('notnotosansmono-regular.woff') format('woff');
    font-weight: normal;
    font-style: normal;
}

@font-face {
    font-family: 'Not Noto Sans Mono';
    src: url('notnotosansmono-bold.woff2') format('woff2'),
         url('notnotosansmono-bold.woff') format('woff');
    font-weight: bold;
    font-style: normal;
}

This allows us to use the ‘Not Noto Sans Mono’ font in regular and bold versions just like we would expect:

<p style='font-family: Not Noto Sans Mono;'>Regular mono</p>
<p style='font-family: Not Noto Sans Mono; font-weight: bold;'>Bold mono</p>

It would be far simpler if Google just provided Noto Sans Mono on their font CDN. I’ve sent them an email to that effect, but who knows how that will go.

If you’re wary of converting Noto Sans Mono yourself, a reasonable alternative is DejaVu Sans Mono, which also covers a very wide set of Unicode codepoints — it’s available for download as a webfont from FontSquirrel.

Leverage your Library

If you live in Seattle or King County you have a rich set of library resources available to you! Here are some tips on how to get the most out of the Seattle Public Library (SPL) and the King County Library System (KCLS).

Use both libraries

While Seattle is within King County, they have separate library systems. In 2015 SPL and KCLS created a new, and frankly better, reciprocal use agreement. This means that if you live in Seattle you can use the King County library and vice versa. In addition to opening up a broader set of library resources, this covers checking out books in one library system and returning them at another one.

Consider bookmarking the catalog for your primary library on your mobile device: SPL | KCLS.

Both systems use BiblioCommons for their online catalog system and you can add both cards to your account. Then when you’re browsing one library catalog for a book you can use the upper right menu to swap to the other one. This makes it easy to see if the other library has a resource in a different format or with a shorter hold.

But where this really comes in handy is checking out eBooks. Both SPL and KCLS have a great selection of eBooks and if you are a member of both libraries you have an incredible selection available to you. Often I find an eBook available at one library system where the other one has a wait.

You’ll need to go in person to an SPL or KCLS branch to get a library card. All you need at either is a photo ID that shows your birthday and a piece of mail with your address on it.

Kindles & Amazon household

If you live in a household with more than one reader with a Kindle, Amazon makes it easy to share library eBooks with each other via Households. When a Kindle-format eBook is checked out to one household user, any member of the household can have it delivered to their Kindle too via Manage Your Content and Devices.

It’s worth pointing out that eBooks from the library don’t expire if your device is in airplane mode! So if you check out a Kindle eBook from the library and put your device in airplane mode you have as long as you want to read the book.

More than books

Both SPL and KCLS have much, much more to offer than just access to books (although I’ll be honest, I use them mostly for the books). You can listen to music (including streaming!), borrow movies (including streaming!), get free passes to local museums, reserve a meeting room, and so much more.

Happy reading!

Live Oak, with Moss – VI

What think you I have taken my pen to record?

Not the battle-ship, perfect-model’d, majestic, that I saw to day arrive in the offing, under full sail,

Nor the splendors of the past day–nor the splendors of the night that envelopes me–Nor the glory and growth of the great city spread around me,

But the two men I saw to-day on the pier, parting the parting of dear friends.

The one to remain hung on the other’s neck and passionately kissed him–while the one to depart tightly prest the one to remain in his arms.

Live Oak, with Moss; VI. by Walt Whitman

Leading with Empathy

About a year ago I noticed that my best friend was replying to my text messages a bit differently. It was subtle, but powerful. She was leading with empathy.

My BFF Jonobie and I text each other all kinds of things all the time. Something that made us think of the other person, a funny pic that we can’t stop laughing about, venting about a crappy day, sharing some exciting news, etc. In the latter two, the thing she and I need most often is someone to hear, acknowledge, and echo what we’re feeling. Not to attempt to solve the problem, or to offer advice, but to empathize with us.

In general I think we’ve always done a decent job of empathizing with one another, but it was often implicit rather than explicit. About a year ago I noticed that many times her response was more direct. For instance:

Me: Holy cow, I was just given an important high-profile project and now have an important deadline due in less than 6 weeks.

Her: That sounds both exciting and stressful!

Or also:

Me: Chest X-Rays are back and I do not have pneumonia! Woohoo!

Her: Yayyay!!! I’m super glad that you don’t have pneumonia!!!

In both cases she leads the response with empathy by expressing that she understands how I feel and shares with me in that feeling. Contrast that with other perfectly reasonable responses:

Me: Holy cow, I was just given an important high-profile project and now have an important deadline due in less than 6 weeks.

Them: Boo work!


Me: Chest X-Rays are back and I do not have pneumonia! Woohoo!

Them: <thumbsup>

There’s nothing inherently wrong with these, but they are missing that level of empathy that conveys the sender is there, present, and sharing in your feelings with you — all things that the first set (the ones she actually sent) provided.

At some point I noticed what she was doing, how awesome it was, and worked to integrate that into my texts with other people too. I want to be present for my friends, to convey to them that they are important to me, that I am here to hold space for them.

But what if you don’t know how the other person is feeling after receiving a wall of text? How are we to lead with empathy? She’s modeled that for me too by simply asking, e.g.:

Me: Holy cow, I was just given an important high-profile project and now have an important deadline due in less than 6 weeks.

Her: Oh my! How are you feeling about that?

or even:

Me: Holy cow, I was just given an important high-profile project and now have an important deadline due in less than 6 weeks.

Her: Woah! That sounds as if it could be either exciting or frustrating!

Me: Actually, I’m excited but stressed out.

For me, texting with empathy was a gateway to me being more empathetic in my in-person interactions too. A few weeks ago Daniel commented that I’ve been more empathetic towards him in our conversations and he’s really appreciated it.

And hearing that I’m better connecting with the person that I love makes me incredibly happy.

A Grand Time

On Tuesday night we saw Steve Grand perform at ACT II down here in Puerto Vallarta. And I can’t tell you how moving it was to hear a man sing about love for another man on stage.

Steve Grand is probably best known for his 2013 breakout country music video All-American Boy about one guy’s unrequited love for another man. As someone who grew up listening to country music, I remember being in awe that we finally, finally, had a country love song by a gay man. Then, admittedly, I lost track of Steve and what he was up to until this week.

On Saturday evening, almost on a lark, we bought tickets for his Tuesday show, where he performs covers as well as his own work. Of the four of us going to the show, two of us remembered his music video and the other two were game to be dragged along. It’s very common in PV for performers to walk around the beaches handing out cards marketing their shows, and Steve was no exception. Except Steve one-upped all the rest of them by walking around in a bright blue speedo and a blockbuster smile. And to say that he is incredibly handsome would be like saying that the ocean is merely damp. We were all excited to see him perform after that!

The show was good — really good. He’s an excellent performer and has a great stage presence. What surprised me the most was that, oddly, I knew several of his own songs without knowing they were his! It didn’t take me long to figure out that his song Stay was on Sean’s 2013 mix CD and his song Walking was on Sean’s 2017 mix CD (and somehow I knew his song We Are The Night too, although I haven’t figured out how). These CDs live in my ancient car that only has a CD player and the CDs get a lot of air time so I’ve listened to them, and these two songs, many many times. Enough that I could have sung along.

I never knew they were about another guy and knowing that changed everything about them.

I’ve been raised in such a heteronormative culture that when I hear a guy singing about someone else I assume it’s about a girl. Because it’s almost always true. And to have that preconception totally dismantled about songs that I love by this handsome guy on stage who wrote them was mind-blowing.

Stay with me, we don’t never have to leave
You my southern king, we live it for the daydreams
So don’t get mad—what’s past is in the past
And we can make this last
if you just give me that chance
So when my old man’s out of town but a couple days
I think that you should…

Stay with me
all summer
Stay with me
under the covers
Stay with me
Be my lover

He’s singing about another guy, folks!

We talk about the importance of representation all the time, and I guess I never thought about how that might apply to me in music. I’m ecstatic to have relearned the lesson in such a fantastic way.

I’m certain that my newly-purchased Steve Grand albums will get a lot of air time when I get back from vacation next week. If you identify as a gay man and have the opportunity to hear him in person, I highly recommend it!

We did good & fought bad

As promised in December 2017, last year Daniel and I upped our game to not just do good by donating to charities, but also to fight bad by giving money to local and national political campaigns.

Doing Good

This year we donated over $10k to the same organizations we supported in 2016 and 2017. We think these organizations, both local and national, are doing great work for youth, LGBTQ folks, women, POC, and the environment. Particularly in the case of the ACLU and Lambda Legal, we are proud to be a part of efforts to fight the GOP abomination, I mean administration.

Fighting Bad

In addition to donating to charitable organizations, we got political this year. We promised ourselves that we would do so only if we could continue supporting the charities we cared about first and I’m happy to say we were able to do that.

In total we gave over $9k to Democratic candidates running in the midterms. Not all of the candidates won, but it was worth every penny. I view it like an investment – some pay off in the short term (winning in the midterms) and some pay off in the long term (like Beto riling up Democratic voters for other races in Texas despite not winning his own race). And you can bet that we are in this for the long term.

Fighting Together

I’m more proud of how we were able to raise an additional $6k by engaging other people to donate. Together, we helped take back the House from the Republicans.

In 2020 we will be taking the gloves off again. We’ll continue to encourage voter registration & voter turn-out and put our money where our mouth is to fight for our country.