Migrating Distributed Proofreaders to Unicode

When Distributed Proofreaders started in 2000, Project Gutenberg only accepted eBooks in ASCII, later flexing to ISO-8859-1. pgdp.net has always only supported ISO-8859-1 (although practically this was really Windows-1252) which we refer to simply as Latin-1. This character set was enforced not only for the eBooks themselves, but also for the user interface. While the DP codebase has long supported arbitrary character sets, in many places it was assumed these were always using 1-byte-per-character encodings.

But the world is much bigger than Latin-1 and there are millions of books in languages that can’t be represented with 1-byte-per-character encodings. Enter Unicode and UTF-8, which Project Gutenberg started accepting a while back.

There has been talk of moving pgdp.net to Unicode and UTF-8 for many years but the effort involved is daunting. At a glance:

  • Updating the DP code to support UTF-8. The code is in PHP which doesn’t support UTF-8 natively. Most PHP functions treat strings as an array of bytes regardless of the encoding.
  • Converting our hundreds of in-progress books from ISO-8859-1 to UTF-8.
  • Finding monospace proofreading fonts that support the wide range of Unicode glyphs we might need.
  • Updating documentation and guidelines.
  • Educating hundreds of volunteers on the changes.

In addition, moving to Unicode introduces the possibility that proofreaders will insert incorrect Unicode characters into the text. Imagine the case where a proofreader inserts a κ (kappa) instead of a k. Because they are visually very similar, this error may end up in the final eBook. Unicode contains well over a hundred thousand characters, the vast majority of which wouldn’t belong in our books.
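To make the kappa problem concrete, here is a small Python sketch (purely illustrative, not part of the DP code) showing that the two look-alike letters are distinct code points with different official names:

```python
import unicodedata

def describe(ch):
    """Return the code point and official Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

latin_k = "k"
greek_kappa = "\u03ba"  # κ

print(describe(latin_k))        # U+006B LATIN SMALL LETTER K
print(describe(greek_kappa))    # U+03BA GREEK SMALL LETTER KAPPA
print(latin_k == greek_kappa)   # False
```

A reader's eye treats them as the same letter; a search, a spell checker, or a text-to-speech engine will not.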

There has been much hemming, hawing, and discussion for probably at least a decade, but little to no progress on the development side.

Unicode, here we come

Back in February 2018 I started working on breaking down the problem, doing research on our unknowns, making lists of lists, and throwing some code against the wall and seeing what stuck.

This past November, almost 2 years later, we got to what I believe is functionally code-complete. Next June, we intend to roll out the update to convert the site over to Unicode. Between now and then we are working to finish testing the new code, address any shortcomings, and update documentation. The goal is to have the site successfully moved over to Unicode well before our 20th anniversary in October 2020.

Discoveries and Decisions

Some interesting things we’ve learned and/or decided as part of this effort.

Unicode

Rather than open up the floodgates and allow proofreaders to input any Unicode character into the proofreading interface, we’re allowing project managers to restrict what Unicode characters they want their project to use. Both the UI and the server-side normalization enforce that only those characters are used. This allows the code to support the full set of Unicode characters but reduces the possibility of invalid characters in a given project.

For our initial roll-out, we will only be allowing projects to use the Latin-1-based glyphs in Unicode. So while everything will be in Unicode, proofreaders will only be able to insert glyphs they are familiar with. This will give us some time to ensure the code is working correctly and that our documentation is updated before allowing other glyphsets to be used.
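The idea of a per-project character set can be sketched in a few lines of Python. This is not the DP implementation (which is in PHP and JavaScript); the allowed set below, printable ASCII plus the printable upper half of Latin-1 and newline, is an assumption made for illustration, and find_disallowed is a hypothetical helper name.

```python
# Assumed project character set: printable ASCII, the printable upper half
# of Latin-1 (U+00A0 to U+00FF), and newline.
LATIN1_ALLOWED = frozenset(
    [chr(cp) for cp in range(0x20, 0x7F)]
    + [chr(cp) for cp in range(0xA0, 0x100)]
    + ["\n"]
)

def find_disallowed(text, allowed=LATIN1_ALLOWED):
    """Return (index, character) pairs for every character outside the allowed set."""
    return [(i, ch) for i, ch in enumerate(text) if ch not in allowed]

print(find_disallowed("caf\u00e9"))   # [] because é is in Latin-1
print(find_disallowed("boo\u03ba"))   # [(3, 'κ')] because kappa is not
```

A server-side check along these lines would reject or flag a page before it is saved, while the UI would simply not offer the character in the first place.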

When you start allowing people to input Unicode characters, you have to provide them an easy way to select characters not on their keyboard. Our proofreaders come to us on all kinds of different devices and keyboards. We put a lot of effort into easing how proofreaders find and use common accented characters in addition to the full array of supported characters for a given project. One of our developers created a new, extensible character picker for our editing interface, in addition to coding up JavaScript that converts our custom diacritical markup into the desired character.
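As a rough illustration of how markup-to-character conversion can work, here is a Python sketch. The real converter is JavaScript and its markup rules are not described here, so the bracket syntax and the three markers below are assumptions chosen only for the example. The technique is to append a combining mark to the base letter and let NFC normalization pick the precomposed character.

```python
import re
import unicodedata

# Hypothetical markup: [<marker><letter>], e.g. [=a] for a with macron.
COMBINING = {
    "=": "\u0304",  # combining macron
    ":": "\u0308",  # combining diaeresis
    "'": "\u0301",  # combining acute accent
}

MARKUP = re.compile(r"\[([=:'])([A-Za-z])\]")

def expand_markup(text):
    """Replace each markup token with the composed accented character."""
    def replace(match):
        mark, base = match.group(1), match.group(2)
        return unicodedata.normalize("NFC", base + COMBINING[mark])
    return MARKUP.sub(replace, text)

print(expand_markup("T[=a]ne and M[:u]ller"))  # Tāne and Müller
```

Text that does not match the pattern is passed through unchanged, so ordinary bracketed notes are left alone.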

Unicode has some real oddities that we continue to stumble across. It took me weeks to wrap my head around how the Greek polytonic oxia forms get normalized down to the monotonic tonos forms and what that meant for our books, which all predate the 1980s change from polytonic to monotonic. I also can’t believe I summed up weeks of discussions and research in that one sentence.
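The oxia-to-tonos behavior is easy to reproduce with Python's standard unicodedata module. This is a sketch using one letter, alpha, as the example:

```python
import unicodedata

oxia = "\u1f71"   # GREEK SMALL LETTER ALPHA WITH OXIA (polytonic block)
tonos = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS (monotonic)

for form in ("NFC", "NFD"):
    normalized = unicodedata.normalize(form, oxia)
    print(form, [f"U+{ord(c):04X}" for c in normalized])
# NFC ['U+03AC']
# NFD ['U+03B1', 'U+0301']

print(unicodedata.normalize("NFC", oxia) == tonos)  # True
```

In other words, once text has been normalized there is no way to tell whether the original used the polytonic or the monotonic code point.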

Fonts

DP has our own proofreading font: DPCustomMono2. It is designed to help proofreaders distinguish between similar characters such as: l I 1. Not surprisingly, it only supports Latin-1 glyphs. We are still evaluating how to broaden it to support a wider set. That said, fonts render glyphs, not character sets, so the font will continue to work for projects that only use the Latin-1 character set.

We were able to find two other monospace fonts with very broad Unicode support: Noto Sans Mono and DejaVu Sans Mono. Moreover, both of these can be provided as a web font (see this blog post for the former, and FontSquirrel for the latter), ensuring that all of our proofreaders have access to a monospace Unicode-capable font. Note that a prior version of Noto Sans Mono, called Noto Mono, is deprecated and you should use Noto Sans Mono instead.

Most browsers do sane glyph substitution. Say you have the following font-family style: ‘font-family: DPCustomMono2, Noto Sans Mono;’ and both fonts are available to your browser as web fonts. If you use that styling against text and there is a glyph to render that doesn’t exist in DPCustomMono2, the browser will render that one glyph in Noto Sans Mono and keep going rather than render a tofu. This is great news as it means we can provide one of our two sane wide-coverage Unicode monospace fonts as a fallback font-family ensuring that we will always have some monospace rendering of all characters on a page.

MySQL

Modern versions of MySQL support two different UTF-8 encodings: utf8mb3 (aka utf8) & utf8mb4. The former only supports UTF-8 characters that encode to at most 3 bytes, whereas utf8mb4, introduced in MySQL v5.5.3, also supports 4-byte UTF-8 characters. We opted for utf8mb4 to get the broadest language support and help future-proof the code (utf8mb3 is now deprecated).
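To get a feel for which characters need the fourth byte, here is a short Python sketch. The helper names are made up for the example; the byte counts come straight from the UTF-8 encoding itself.

```python
def max_utf8_bytes(text):
    """Return the largest number of UTF-8 bytes any single character needs."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)

def fits_utf8mb3(text):
    """True if every character encodes to at most 3 bytes of UTF-8."""
    return max_utf8_bytes(text) <= 3

print(max_utf8_bytes("k"))           # 1
print(max_utf8_bytes("\u03ba"))      # 2 (Greek kappa)
print(max_utf8_bytes("\u20ac"))      # 3 (euro sign)
print(max_utf8_bytes("\U0001d11e"))  # 4 (musical symbol G clef)
print(fits_utf8mb3("\U0001d11e"))    # False
```

Anything outside the Basic Multilingual Plane (code points above U+FFFF) takes 4 bytes and therefore needs utf8mb4.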

An early fear was that we would need to increase the size of our varchar columns to handle the larger string widths needed with UTF-8. While this was true in MySQL 4.x, in MySQL 5.x a varchar size represents the length of the string in characters, regardless of encoding.

PHP

PHP continues to be a pain regarding UTF-8, but it wasn’t as bad as we feared. It turns out that although most of the PHP string functions operate only on byte arrays, assuming 1-byte characters, in most of the places where we use them that’s just fine.
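The difference between "fine" and "not fine" byte-level operations can be shown with Python's bytes type standing in for PHP's byte-array strings. This is an analogy, not DP code: splitting on an ASCII delimiter is safe in UTF-8, while cutting at an arbitrary byte offset is not.

```python
text = "na\u00efve caf\u00e9\nsecond line"
raw = text.encode("utf-8")

print(len(text))  # 22 characters
print(len(raw))   # 24 bytes: ï and é take two bytes each

def is_valid_utf8(data):
    """True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Safe: an ASCII byte such as newline never appears inside a multi-byte
# UTF-8 sequence, so byte-level splitting gives the same lines.
lines_from_bytes = [part.decode("utf-8") for part in raw.split(b"\n")]
print(lines_from_bytes == text.split("\n"))  # True

# Not safe: a byte-offset cut can land in the middle of a character.
print(is_valid_utf8(raw[:3]))  # False, the cut splits the two bytes of ï
```

Length checks, truncation, padding, and case conversion are the usual places where a byte-oriented function needs a character-aware replacement.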

For other places, we found portable-utf8 to be a great starting point. Some of the functions aren’t particularly performant and it’s incredibly annoying that so many of them call utf8_clean() every time they are used, but it was super helpful in moving in the right direction.

mb_detect_encoding() is total crap, as I mentioned in one recent blog post, but we hacked around that.
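One common way to sidestep encoding detection when there are only two candidates is to attempt a strict UTF-8 decode and fall back if it fails. The Python sketch below shows that general technique; it is not necessarily the workaround we used, and guess_encoding is a hypothetical name.

```python
def guess_encoding(data):
    """Guess between UTF-8 and Windows-1252 by attempting a strict UTF-8 decode.

    Valid UTF-8 is very unlikely to occur by accident in Windows-1252 text
    that contains accented characters, so a clean decode is strong evidence.
    Pure ASCII is valid in both and is reported as UTF-8.
    """
    try:
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        return "windows-1252"

print(guess_encoding("caf\u00e9".encode("utf-8")))   # utf-8
print(guess_encoding("caf\u00e9".encode("cp1252")))  # windows-1252
```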

Operations that move InnoDB tables out of the system tablespace

Distributed Proofreaders has a very large InnoDB-backed table that was created many years ago on a MySQL version that only supported the system tablespace (ibdata1). We’ve since upgraded to 5.7 which supports file-per-table tablespaces and we have innodb_file_per_table=ON.

With file-per-table tablespaces enabled, some table operations will move tables from the system tablespace to their own per-file tablespace. Given the size of the table in question, it was important to understand which operations would cause this to happen.

My research led me to the Online DDL Operations doc, which was the key. Any operation that says it Rebuilds Table will move an InnoDB table out of the system tablespace, regardless of whether In Place says “Yes”.

For example, the following will keep the table where it is:

  • Creating, dropping, or renaming non-primary-key indexes
  • Renaming a column
  • Setting a column default value
  • Dropping a column default value

And these will rebuild the table and move it to its own tablespace:

  • Adding or dropping primary keys
  • Adding or dropping a column
  • Reordering columns
  • Changing column types
  • Converting a character set

The above is not an exhaustive list; see the Online DDL Operations documentation for your version of MySQL for the definitive list.

It’s important to know that OPTIMIZE TABLE will also move an InnoDB table out of the system tablespace.

If there’s an operation you want to perform that will rebuild the table but you want to keep the table in the system tablespace, you can temporarily set innodb_file_per_table=OFF (it’s a dynamic variable), do the operation, and then turn it back ON.
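The toggle-around-the-operation sequence can be written down as a tiny Python helper that produces the SQL statements in order. The table name example_table is hypothetical, and the helper only builds strings; running them against a server is left to whatever client you use.

```python
def keep_in_system_tablespace(ddl_statement):
    """Wrap a table-rebuilding DDL statement so the rebuild happens while
    innodb_file_per_table is OFF, then restore the setting."""
    return [
        "SET GLOBAL innodb_file_per_table = OFF;",
        ddl_statement.rstrip(";") + ";",
        "SET GLOBAL innodb_file_per_table = ON;",
    ]

# example_table is a placeholder name for illustration only.
for statement in keep_in_system_tablespace(
    "ALTER TABLE example_table CONVERT TO CHARACTER SET utf8mb4"
):
    print(statement)
```

Because the variable is global, any other table created or rebuilt while it is OFF will also land in the system tablespace, so the window should be kept as short as possible.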

And for the curious, if you have a table already in its per-file tablespace and set innodb_file_per_table=OFF, making changes that will rebuild the table won’t move it to the system tablespace. It looks like you have to drop and recreate the table to do that.

DP gets a CSS makeover

Today we rolled out a sweeping code release at Distributed Proofreaders that standardizes our CSS and moves us to HTML5. Along the way we worked to have a consistent look-and-feel across the entire site.

The DP codebase has grown very organically over the years, starting out in 2000 when Cascading Style Sheets (CSS) were young and browser support for CSS was very poor. Since that time developers have added new code and styling for code in a variety of ways. CSS, and browser support for it, have come a long way in 17 years and it was past time to get a common look-and-feel using modern CSS.

Some of our design goals:

  • Modern HTML & CSS
    We did not design for specific browsers, but rather designed for modern standards, specifically HTML5 and CSS3. HTML5 is the future and is largely backwards compatible with HTML4.x. Most of our pages should now validate cleanly against HTML5.
  • Pure-CSS for themes
    Moving to a pure-CSS system for themes, without theme-specific graphics, makes them immensely easier to create and update. Doing so means we don’t have to create or modify image files when working with themes.
  • Site-wide consistency
    The site has grown very organically over the past 17 years with each developer adding their own layout, table styles, etc. We made some subtle, and some not-so-subtle, changes to make pages across the site more consistent.
  • Consistent CSS
    Using consistent CSS across the site code allows developers to re-use components easily and makes it easier for users to adjust CSS browser-side for accessibility if necessary.
  • No (or little) per-page CSS
    Instead of embedding CSS styles directly in a page, we want to have the CSS in common files. This allows for better style re-use and gets us on the path to supporting Content Security Policies.

As part of this effort we created a Style Design Philosophy document to discuss what we were working towards as well as a Style Demo page.

Despite the removal of magic quotes and the mysqli changes being far more invasive, broad-reaching, and risky, the CSS work is the code deployment I’m most worried about. Not because I think we did anything wrong or I’m worried about how it will render in browsers1, but because users hate change, and this roll-out is full of change they can see. Some subtle, some not so subtle.

I expect to be fielding a wide range of “why did X change!?” and “I don’t like the way Y looks!” over the next few weeks. I can only hope these are intermixed with some appreciative comments as well to balance out the criticism.

1 IE6 being the known exception that we will just live with.

Enabling DP development with a developer VM

Getting started doing development on the DP code can be quite challenging. You can get a copy of the source code quite readily, but creating a system to test any changes gets complicated due to the code dependencies — primarily its tight integration with phpBB.

For a long time now, developers could request an account on our TEST server which has all the prerequisites installed, including a shared database with loaded data. There are a few downsides to using the TEST server, however. The primary one is that everyone is using the shared database, significantly limiting the changes that can be made without impacting others. Another downside is that you need internet connectivity to do development work.

Having a way to do development locally on your desktop would be ideal. Installations on modern desktops are almost impossible, however, given our current dependency on magic quotes, a “feature” which has us locked on PHP 5.3, a very archaic version that no modern Linux desktop includes.

Environments like this are a perfect use case for virtual machines. While validating the installation instructions on the recent release I set out to create a DP development VM. This ensured that our instructions could be used to set up a fully-working installation of DP as well as produce a VM that others could use.

The DP development VM is a VMware VM running Ubuntu 12.04 LTS with a fully-working installation of DP. It comes pre-loaded with a variety of DP user accounts (proofer, project manager, admin) and even a sample project ready for proofing. The VM is running the R201601 release of DP source directly from the master git repo, so it’s easy to update to newer ‘production’ milestones when they come out. With the included instructions a developer can start doing DP development within minutes of downloading the VM.

I used VMware because it was convenient: I already had Fusion on my Mac, and VMware Player is freely available for Windows and Linux. A better approach would have been VirtualBox1 as it’s freely available for all platforms. Thankfully it should be fairly straightforward to create a VirtualBox VM from the VMware .vmdk (I leave this as an exercise to another developer).

After I had the VM set up and working I discovered Vagrant while doing some hacking on OpenLibrary. If I had to create the VM again I would probably go the Vagrant route. Although I expect it would take me a lot longer to set up, it would significantly improve the development experience.

It’s too early to know if the availability of the development VM will increase the number of developers contributing to DP, but having yet another tool in the development tool-box can’t hurt.

1 Although I feel dirty using VirtualBox because it’s owned by Oracle. Granted, I feel dirty using MySQL for the same reason…

A new release of the DP site code, 9 years in the making

Today we released a new version of the Distributed Proofreaders code that runs pgdp.net! The announcement includes a list of what’s changed in the 9 years since the last release as well as a list of contributors, some statistics, and next steps. I’ve been working on getting a new release cut since mid-September so I’m pretty excited about it!

The prior release was in September 2006 and since that time there have been continuous, albeit irregular, updates to pgdp.net, but no package available for folks to download for new installations or to update their existing ones. Instead, enterprising individuals had to pull code from the ‘production’ tag in CVS (yes, seriously).

In the process of getting the code ready for release I noticed that there had been changes to the database on pgdp.net that hadn’t been reflected in the initial DB schema or the upgrade scripts in the code. So even if someone had downloaded the code from CVS they would have struggled to get it working.

As part of cutting the release I walked through the documentation that we provide, including the installation, upgrade, and configuration steps, and realized how much implied knowledge was in there. Much of the release process was me updating the documentation after learning what you were supposed to do.1 I ended up creating a full DP installation on a virtual machine to ensure the installation steps produced a working system. I’m not saying they’re now perfect, but they are certainly better than before.

Cutting a release is important for multiple reasons, including the ability for others to use code that is known to work. But the most important to me as a developer is the ability to reset dependency versions going forward. The current code, including that released today, continues to work on severely antiquated versions of PHP (4.x up through 5.3) and MySQL (4.x up to 5.1). This was a pseudo design decision in order to allow sites running on shared hosting with no control over their middleware to continue to function. Given how the hosting landscape has changed drastically over the past 9 years, and how really old those versions are, we decided it’s time to change that.

Going forward we’re resetting the requirements to be PHP 5.3 (but not later, due to our frustrating dependency on magic quotes) and MySQL 5.1 and later. This will allow us to use modern programming features like classes and exceptions that we couldn’t before.

Now that we have a release behind us, I’m excited to get more developers involved and start making some much-needed sweeping changes. Things like removing our dependency on magic quotes and creating a RESTful API to allow programmatic access to DP data. I’m hoping being on git and the availability of a development VM (more on that in a future blog post) will accelerate development.

If you’re looking for somewhere to volunteer as a developer for a literary2 great cause, come join us!

1 A serious hat-tip to all of my tech writer friends who do this on a daily basis!

2 See what I did there?

Development leadership failure

Last night I did some dev work for DP. Mostly some code cleanup (heaven knows we need it) but also rolling out some committed code to production. I’ve made a concerted effort to get committed-but-not-released code deployed — some of which has been waiting for, literally, years.

Even worse, we have reams of code updates sitting uncommitted (and slowly suffering from bitrot) in volunteers’ sandboxes waiting for code review. In the case of Amy’s new quizzes, for almost 5(!!!!) years. In other cases volunteers have done a crazy amount of legwork to address architectural issues that remain unimplemented due to no solid commitment that if they did the work it would be reviewed, committed, and deployed — like Laurent’s site localization effort.

These are clear systemic failures by development leadership, i.e., me. It’s obvious why even when the project attracts developers, we can’t retain them.

The first step is to get through the backlog of outstanding work. I have Laurent’s localization work almost finished. This will allow the site to be translated into other languages — I think Portuguese and French are already done. Next up is getting Amy’s new quizzes pushed out. She’s done a marvelous job of keeping her code up to date with HEAD based on my initial work last night. Now to get them committed and rolled out. Then a site-wide change on our include()s required to get full site localization implemented.

After all that, we need to address how to better keep code committed and rolled out. I think we as a team suffer from “don’t commit until it’s perfect, then wait until it’s simmered before rolling it out”, where “simmered” means “sitting in CVS with no active testing done on it”. We need to move to more flexible check-in criteria or a more liberal roll-out. There’s no good reason why the bar is so crazy high on both ends of that.

But first – the backlog.

Mystery of the terrible throughput (or how I solved a TCP problem)

It all started out with a simple single stream reading test. Just a simple request for the entirety of an 8GB file. We do this stuff all the time. Except this time instead of 700 MB/s I was getting 130 MB/s. What?

Usually we test with jumbo frames (9000 MTU) but for this exercise we were using standard frames (1500 MTU). Still, there’s no way that was the difference. After 2 days I discover a method to consistently reproduce the problem: while the streaming test is running, toggle the LRO flag on the server’s network interface. This is just as crazy as making your car go faster by removing your soda from the cupholder. There’s no way that it has anything to do with it, but for some reason it does. Consistently. At last I have a reproducible, if ludicrous, defect.

Fast forward through 5 days of eliminating nodes, clients, switches, and NFS overcommits. Add in packet traces, kernel debugging output, and assorted analysis. Eventually Case catches the first real clue: the congestion window behaves distinctly differently in the ‘fast’ and ‘slow’ states. In the ‘fast’ state, the congestion window stays fairly constant. In the ‘slow’ state, the window oscillates wildly, starting at the MTU, growing really large, and then starting over.

The LRO trick worked by causing enough retransmits that the stack dropped into slow start mode: one mystery solved. We hadn’t clearly identified this before because, once a node-client pair gets into the fast state, the slow start threshold is retained in the TCP hostcache between connections: another mystery solved.

Fast forward through a few more days of slogging through TCP code down the path of blaming slow start threshold (or rather the lack of slow start in the slow state). By this time I’m way more familiar with the TCP code, and our kernel debugging framework, than I want to be. I notice that every time the congestion window drops back to the MTU it’s caused by an ENOBUFS error. It’s very unlikely we’re running out of buffer space though. Checking the called function reveals that the error would show up not only when we’re out of buffers, but also if we can’t return one immediately. We surmise the problem is some contention causing an inability to immediately get the requested buffer. So I change the code to reduce the congestion window by a single segment size (aka MTU) instead of dropping it all the way down to the segment size. The assumption being the next time we request a buffer of this size, we’re likely to get one.
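The effect of the two back-off policies can be illustrated with a toy Python simulation. This is emphatically not the kernel code: the linear growth rule, the fixed error interval, the window cap, and the 1460-byte segment size are all assumptions chosen to make the contrast visible.

```python
MSS = 1460  # bytes; assumed segment size for standard 1500-byte frames

def simulate(on_enobufs, steps=1000, error_every=50, max_cwnd=64 * MSS):
    """Return the average congestion window over a toy run.

    The window grows by one segment per step up to max_cwnd, and an
    ENOBUFS-style error is injected every error_every steps.
    """
    cwnd = MSS
    total = 0
    for step in range(1, steps + 1):
        if step % error_every == 0:
            cwnd = on_enobufs(cwnd)
        else:
            cwnd = min(cwnd + MSS, max_cwnd)
        total += cwnd
    return total / steps

def collapse_to_one_segment(cwnd):
    """Original behavior: drop the window all the way down to one segment."""
    return MSS

def back_off_one_segment(cwnd):
    """Patched behavior: shrink the window by a single segment."""
    return max(cwnd - MSS, MSS)

original = simulate(collapse_to_one_segment)
patched = simulate(back_off_one_segment)
print(f"original policy: average cwnd {original / MSS:.1f} segments")
print(f"patched policy:  average cwnd {patched / MSS:.1f} segments")
```

With the collapse policy the window traces a sawtooth that never gets far off the floor; with the gentler back-off it stays near its ceiling, which is the same qualitative difference seen between the slow and fast states.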

And performance shoots up to 900 MB/s — even higher than the previous fast state.

The reason we’re unable to return the requested buffer immediately is unclear, and frankly above my paygrade. I’ll happily let the kernel devs work on that (it involves slabs and uma and things geekier than me).

The core of the problem remains “why aren’t we able to return the requested buffer immediately” but until the devs conquer that one we have a valid, shippable, workaround. And a lowly tester found, identified, and fixed it!

A geek and his keyboard

Accepting the death of one keyboard and the failure of its backup was simply not an option, so I started off this morning with my trusty screwdriver.

I opened up the bottom of the dead keyboard and studied its innards. From top to bottom the keyboard consists of:

  1. keys
  2. translucent rubber layer
  3. flexible transparent layer with printed circuit
  4. flexible transparent buffer layer with no circuit
  5. flexible transparent layer with printed circuit
  6. 3 large white plastic structural pieces
  7. 1 PCB

Given the simple structure it is apparent that the PCB is the failing component of the backup keyboard. The PCB design and rev number differ between the two keyboards, but I thought swapping them out would be worth a shot. Fortunately the physical structure of both keyboards is identical. Unfortunately swapping them didn’t work and examining the circuit layers (#3) it’s obvious why: they changed the circuit layout to the PCB.

I went with Plan B, which was determining why those specific keys on the dead keyboard were dead. One look at layer #3 confirmed that all the dead keys are on the same circuit. Bringing out my trusty multimeter, I discovered a break in the circuit to the PCB. But how to fix that? The transparent circuit layers are on a plastic layer so even if I had my soldering iron here in Denver, there was no way that was going to work. The dead gap wasn’t all that large, just a couple of millimeters; I just needed something to bridge it. A small piece of wire wasn’t optimal as it wouldn’t be flat and it would be hard to secure. Then the light bulb went on: aluminum foil. Conductive, easily trimmed down to the right size, and flat. Throw in a small piece of scotch tape and a few minutes later I had my first hardhack.

And thus far it works beautifully. As a bonus I moved layers #3-5 and #7 to the shell of the backup keyboard so I get the pearly white keys of the backup with the tried-and-true workings of the original.

I’m a bit concerned that the failure of that one circuit is simply a foreshadowing of things to come with different circuits. By the looks of the backup keyboard’s circuits it’s clear that the degradation isn’t from use but from age (which makes perfect sense anyway). We’ll see how long my hardhack works and if there are future failures elsewhere. Who knows, by the time I’m through maybe I’ll have a completely rebuilt keyboard full of aluminum foil.

The effective lifespan of a Microsoft Natural keyboard: ~13 years

I’m sad to report the demise of my Microsoft Natural keyboard (not the Elite, or the Pro or the MultiMedia – the original circa 1995). I turned the computer on today and the keys 67yhnujm no longer work. Given that I’ve had it for a minimum of 13 years, it’s had a good run.

Never one to be left unprepared I went to the basement and brought up my spare. Yes, I have a spare Microsoft Natural keyboard for just this circumstance. I love the keyboard so much that when I heard they were no longer making it and replacing it instead with the much inferior Elite, I purchased a spare. It’s been in its box for a good 8 or more years. (Don’t ask about the lengths I’ve gone to keep an original Logitech TrackMan Marble working, it’ll just make me sound obsessive.) Anyone who spends as much time in front of a computer as I do will completely understand about the attachment to specific input devices. The rest of you will call us freaks.

I plugged in the spare keyboard, gently caressing the palm rest and marveling at the perfectly white keys, only to discover that there’s Something Wrong with it. Yes, my backup keyboard failed. Upon certain key combinations the keyboard starts sending escape sequences. Suck a duck.

So now I’m left typing on a crappy Dell keyboard and trying to figure out where to go from here. Looks like I need to crack open both keyboards and see if I can’t merge the two together to make one workable version.

And here I thought I was set for another 13 years…

Inkscape development dependencies on Fedora 12

Just FYI, if you’re wanting to compile Inkscape from source, you’ll need (at least) the following dependency RPMs on Fedora 12:

  • gc-devel
  • glib-devel
  • gtk+-devel
  • gsl-devel
  • libxml
  • libxml-devel
  • poppler-devel
  • poppler-glib-devel
  • libsigc++20-devel
  • glibmm24-devel
  • cairomm-devel
  • pangomm-devel
  • gtkmm24-devel
  • ghostscript
  • ghostscript-devel
  • jasper-devel
  • ImageMagick-devel
  • ImageMagick-c++-devel
  • libwpd-devel
  • libwpg-devel