MediaWiki VisualEditor: slashes and namespaces

Recently I upgraded the wiki at Distributed Proofreaders to the MediaWiki 1.35 LTS release. This comes with the fancy new VisualEditor which should be great for our users who already deal with phpBB markup in the forums and our own custom markup for project formatting.

Setting up the VisualEditor was a bit of a head-scratcher in a couple of ways and hopefully this helps others who encounter similar problems.

Error contacting the Parsoid/RESTBase server (HTTP 404)

This one frustrated me for quite some time. Everything pointed to setting

AllowEncodedSlashes NoDecode

in our Apache config, but that wasn’t working for me. For reasons I don’t understand, I needed to include this in both the :80 and :443 VirtualHost sections, not just the :443 which was serving all traffic.
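For illustration, the relevant bits ended up looking roughly like the sketch below. The server name and everything else inside the VirtualHost blocks are placeholders, not our actual configuration; the point is simply that the directive appears in both blocks.

<VirtualHost *:80>
    ServerName wiki.example.org
    AllowEncodedSlashes NoDecode
    # ... the rest of the :80 configuration ...
</VirtualHost>

<VirtualHost *:443>
    ServerName wiki.example.org
    AllowEncodedSlashes NoDecode
    # ... SSL and MediaWiki configuration ...
</VirtualHost>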

Error contacting the Parsoid/RESTBase server (HTTP 500)

This was thankfully pretty obvious by looking in the php_errors log. As the VisualEditor Troubleshooting section calls out, our $wgTmpDirectory had the wrong write permissions.
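For reference, the fix amounts to making sure the configured directory is writable by the web server. A sketch, with a placeholder path and the Debian/Ubuntu www-data user as assumptions:

# In LocalSettings.php:
$wgTmpDirectory = "/var/www/mediawiki/images/tmp";

# The directory must exist and be writable by the web server user, e.g.:
#   chown www-data:www-data /var/www/mediawiki/images/tmp
#   chmod 0775 /var/www/mediawiki/images/tmp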

Enabling for Namespaces

The documentation says that to change the namespaces that the VisualEditor will be used on, you use the English canonical names. To get this to work, we needed to use the namespace constants instead. Note that the MW code will include all content namespaces as enabled by default so you only need to include those if you want to disable them.

$wgVisualEditorAvailableNamespaces = [
    // includes Content namespaces by default (Main)
    NS_PROJECT => true,
    NS_PROJECT_TALK => true,
    NS_TALK => true,
    NS_USER_TALK => true,
];

Stepping back from Distributed Proofreaders

After almost 14.5 years it’s time for me to step back from volunteering with Distributed Proofreaders. What was once an enjoyable activity has become a stressor that I simply don’t need 11 months into a pandemic.

In many ways DP has been a lifeline to me at various times in my life, giving me something constructive and meaningful I can do. This was true as I was going through my divorce a decade ago, during my sabbatical, and at the beginning of the pandemic. But the bitching and criticism that accompanies virtually any change we make to the site has recently become unbearable. Complaints about changes aren’t new — humans are classically change-averse and our community seems to be doubly so — but during the pandemic they seem to have increased in both frequency and volume.

Receiving verbal or written recognition of my work is important to me. Indeed, it’s the best, and easiest, way to keep me happy. While I have often received that type of feedback from Linda, the General Manager, and Sharon, a fellow admin and developer, I don’t usually get that from the rest of the community. Instead, I most often get the opposite. That’s very demoralizing after hours and hours of time spent.

Development Contributions

I’ve been a developer at DP for over a decade and the lead developer for the past 5+ years. Looking back I have to say we’ve collectively come a long way. I sat down and made a list of the most notable and memorable software changes that I was involved in and while there were some new features, almost all of the big changes were ensuring that the software could run on modern middleware.

My most enduring legacy at DP is likely to be that the site continues to function at all and that makes me incredibly happy.

New Features & Capabilities

Site Modernization

Middleware Support

Development Improvements

What’s Next

I’m not sure what stepping back means exactly or what’s next for me, but it’s time for a change. I’ve committed to finishing some of the planned maintenance work (assisting with the phpBB forum upgrade and eventual OS upgrade) and updating documentation. Beyond that, I’m not sure, but decidedly less of the forums and less of the dev work that results in all the despised changes.

I hope to find some other open source software I can contribute to. I thought perhaps I would work with other DP-adjacent open source projects like getting the Auth_phpBB MediaWiki extension updated to support the latest MediaWiki LTS, except that only took me about 12 hours.

Migrating Distributed Proofreaders to Unicode

When Distributed Proofreaders started in 2000, Project Gutenberg only accepted eBooks in ASCII, later flexing to ISO-8859-1. pgdp.net has always supported only ISO-8859-1 (although practically this was really Windows-1252), which we refer to simply as Latin-1. This character set was enforced not only for the eBooks themselves, but also for the user interface. While the DP codebase has long supported arbitrary character sets, in many places it was assumed these were always 1-byte-per-character encodings.

But the world is much bigger than Latin-1 and there are millions of books in languages that can’t be represented with 1-byte-per-character encodings. Enter Unicode and UTF-8, which Project Gutenberg started accepting a while back.

There has been talk of moving pgdp.net to Unicode and UTF-8 for many years but the effort involved is daunting. At a glance:

  • Updating the DP code to support UTF-8. The code is in PHP, which doesn’t support UTF-8 natively. Most PHP functions treat strings as arrays of bytes regardless of the encoding.
  • Converting our hundreds of in-progress books from ISO-8859-1 to UTF-8.
  • Finding monospace proofreading fonts that support the wide range of Unicode glyphs we might need.
  • Updating documentation and guidelines.
  • Educating hundreds of volunteers on the changes.

In addition, moving to Unicode introduces the possibility that proofreaders will insert incorrect Unicode characters into the text. Imagine the case where a proofreader inserts a κ (kappa) instead of a k. Because they are visually very similar, this error may end up in the final eBook. Unicode defines well over a hundred thousand characters, the vast majority of which wouldn’t belong in our books.

There has been much hemming and hawing and discussions for probably at least a decade, but little to no progress on the development side.

Unicode, here we come

Back in February 2018 I started working on breaking down the problem, doing research on our unknowns, making lists of lists, and throwing some code against the wall and seeing what stuck.

This past November, almost two years later, we reached what I believe is a functionally code-complete state. Next June, we intend to roll out the update that converts the site over to Unicode. Between now and then we are working to finish testing the new code, address any shortcomings, and update documentation. The goal is to have the site successfully moved over to Unicode well before our 20th anniversary in October 2020.

Discoveries and Decisions

Some interesting things we’ve learned and/or decided as part of this effort.

Unicode

Rather than open up the floodgates and allow proofreaders to input any Unicode character into the proofreading interface, we’re allowing project managers to restrict what Unicode characters they want their project to use. Both the UI and the server-side normalization enforce that only those characters are used. This allows the code to support the full set of Unicode characters but reduces the possibility of invalid characters in a given project.
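As a hypothetical sketch of the server-side half of that check (this is not the actual DP code; the function name and the allowed-set representation are made up for illustration):

<?php
# Flag any character in the submitted text that isn't in the project's
# allowed set. Assumes the text is already valid UTF-8.
function find_disallowed_characters($text, $allowed_characters)
{
    $disallowed = array();
    # Split the UTF-8 string into individual characters.
    foreach (preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if (!in_array($char, $allowed_characters, true)) {
            $disallowed[] = $char;
        }
    }
    return array_unique($disallowed);
}

# Example: a project that allows basic Latin letters, some punctuation, and æ/Æ.
$allowed = preg_split('//u',
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZæÆ '-.,;:",
    -1, PREG_SPLIT_NO_EMPTY);
print_r(find_disallowed_characters("Ærøskøbing", $allowed));  # Array ( [0] => ø )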

For our initial roll-out, we will only be allowing projects to use the Latin-1-based glyphs in Unicode. So while everything will be in Unicode, proofreaders will only be able to insert glyphs they are familiar with. This will give us some time to ensure the code is working correctly and that our documentation is updated before allowing other glyphsets to be used.

When you start allowing people to input Unicode characters, you have to provide them an easy way to select characters not on their keyboard. Our proofreaders come to us on all kinds of different devices and keyboards. We put a lot of effort into easing how proofreaders find and use common accented characters in addition to the full array of supported characters for a given project. One of our developers created a new, extensible character picker for our editing interface, in addition to writing JavaScript that converts our custom diacritical markup into the desired character.

Unicode has some real oddities that we continue to stumble across. It took me weeks to wrap my head around how the Greek polytonic oxia forms get normalized down to the monotonic tonos forms and what that meant for our books, which all predate the 1980s change from polytonic to monotonic. I also can’t believe I summed up weeks of discussions and research in that one sentence.
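To make that concrete, here is a tiny, illustrative example (requires PHP’s intl extension; not DP code): U+1F71, Greek small alpha with oxia, has a canonical mapping to U+03AC, alpha with tonos, so the two become indistinguishable after normalization.

<?php
$alpha_oxia  = "\u{1F71}";   # polytonic form (alpha with oxia)
$alpha_tonos = "\u{03AC}";   # monotonic form (alpha with tonos)
# NFC folds the oxia form into the tonos form
var_dump(Normalizer::normalize($alpha_oxia, Normalizer::FORM_C) === $alpha_tonos);  # bool(true)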

Fonts

DP has our own proofreading font: DPCustomMono2. It is designed to help proofreaders distinguish between similar characters such as: l I 1. Not surprisingly, it only supports Latin-1 glyphs. We are still evaluating how to broaden it to support a wider set. That said, fonts render glyphs, not character sets, so the font will continue to work for projects that only use the Latin-1 character set.

We were able to find two other monospace fonts with very broad Unicode support: Noto Sans Mono and DejaVu Sans Mono. Moreover, both of these can be provided as a web font (see this blog post for the former, and FontSquirrel for the latter), ensuring that all of our proofreaders have access to a monospace Unicode-capable font. Note that a prior version of Noto Sans Mono, called Noto Mono, is deprecated, and you should use Noto Sans Mono instead.

Most browsers do sane glyph substitution. Say you have the following font-family style: ‘font-family: DPCustomMono2, Noto Sans Mono;’ and both fonts are available to your browser as web fonts. If you use that styling against text and there is a glyph to render that doesn’t exist in DPCustomMono2, the browser will render that one glyph in Noto Sans Mono and keep going rather than render a tofu. This is great news as it means we can provide one of our two sane wide-coverage Unicode monospace fonts as a fallback font-family ensuring that we will always have some monospace rendering of all characters on a page.
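For example, a rule along these lines (the class name is just illustrative) keeps DPCustomMono2 as the primary proofreading font while letting Noto Sans Mono pick up any glyphs it lacks:

/* Any glyph missing from DPCustomMono2 falls back to Noto Sans Mono;
   anything missing from both falls back to a generic monospace font. */
.proofreading-text {
    font-family: DPCustomMono2, 'Noto Sans Mono', monospace;
}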

MySQL

Modern versions of MySQL support two different UTF-8 encodings: utf8mb3 (aka utf8) & utf8mb4. The former only supports characters that encode to at most 3 bytes in UTF-8 (the Basic Multilingual Plane), whereas utf8mb4, introduced in MySQL 5.5.3, supports the full 4-byte range. We opted for utf8mb4 to get the broadest language support and help future-proof the code (utf8mb3 is now deprecated).

An early fear was that we would need to increase the size of our varchar columns to handle the larger string widths needed with UTF-8. While this was true in MySQL 4.x, in 5.x a varchar’s declared size is measured in characters rather than bytes, regardless of encoding.
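As a sketch of what the conversion looks like (the table and database names here are placeholders, not our actual schema):

-- Convert an existing Latin-1 table to utf8mb4; a VARCHAR(255) column
-- still holds 255 characters after the conversion, not 255 bytes.
ALTER TABLE projects
    CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- And make utf8mb4 the default for new tables in the schema:
ALTER DATABASE dp_site CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;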

PHP

PHP continues to be a pain regarding UTF-8, but it wasn’t as bad as we feared. It turns out that although most of the PHP string functions operate on byte arrays and assume 1-byte characters, in most of the places where we use them that’s just fine.
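A minimal illustration of the byte-vs-character distinction (the mb_* functions come from the mbstring extension and have to be told the encoding):

<?php
$word = "héllo";                      # 'é' is two bytes in UTF-8
echo strlen($word);                   # 6 -- counts bytes
echo mb_strlen($word, 'UTF-8');       # 5 -- counts characters
echo strtoupper($word);               # "HéLLO" -- the multibyte 'é' is left alone
echo mb_strtoupper($word, 'UTF-8');   # "HÉLLO"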

For other places, we found portable-utf8 to be a great starting point. Some of the functions aren’t particularly performant and it’s incredibly annoying that so many of them call utf8_clean() every time they are used, but it was super helpful in moving in the right direction.

mb_detect_encoding() is total crap, as I mentioned in a recent blog post, but we hacked around that.

Operations that move InnoDB tables out of the system tablespace

Distributed Proofreaders has a very large InnoDB-backed table that was created many years ago on a MySQL version that only supported the system tablespace (ibdata1). We’ve since upgraded to 5.7, which supports file-per-table tablespaces, and we have innodb_file_per_table=ON.

With file-per-table tablespaces enabled, some table operations will move tables from the system tablespace to their own per-file tablespace, and given the size of the table in question it was important to understand which operations would cause this to happen.

My research led me to the Online DDL Operations doc, which was the key. Any operation listed as “Rebuilds Table” will move an InnoDB table out of the system tablespace, regardless of whether “In Place” says “Yes”.

For example, the following will keep the table where it is:

  • Creating, dropping, or renaming non-primary-key indexes
  • Renaming a column
  • Setting a column default value
  • Dropping a column default value

And these will rebuild the table and move it to its own tablespace:

  • Adding or dropping primary keys
  • Adding or dropping a column
  • Reordering columns
  • Changing column types
  • Converting a character set

These lists are not exhaustive; see the Online DDL Operations documentation for your version of MySQL for the definitive list.

It’s important to know that OPTIMIZE TABLE will also move an InnoDB table out of the system tablespace.

If there’s an operation you want to perform that will rebuild the table but you want to keep the table in the system tablespace, you can temporarily set innodb_file_per_table=OFF (it’s a dynamic variable), do the operation, and then turn it back ON.
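For example (the table and column here are placeholders):

-- innodb_file_per_table is dynamic, so it can be flipped just for the rebuild
SET GLOBAL innodb_file_per_table = OFF;
ALTER TABLE big_table ADD COLUMN new_col INT;   -- rebuilds, but stays in ibdata1
SET GLOBAL innodb_file_per_table = ON;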

And for the curious, if you have a table already in its per-file tablespace and set innodb_file_per_table=OFF, making changes that will rebuild the table won’t move it to the system tablespace. It looks like you have to drop and recreate the table to do that.

Detecting Windows-1252 encoding

For Distributed Proofreaders’ move to Unicode we need to handle accepting files from content providers that are not in UTF-8. Usually these files come in as Windows-1252, but sometimes they might be ISO-8859-1, UTF-16, or even UTF-32. We need to get the detection correct to ensure a valid conversion to UTF-8.

For reasons beyond my ken, PHP’s mb_detect_encoding() function appears to be completely unable to tell the difference between Windows-1252 and ISO-8859-1 for strings that clearly have characters in the 0x80 to 0x9F range. Shockingly, it also wasn’t able to detect files encoded as UTF-16 with BOMs, which I absolutely don’t understand. And it appears I’m not the only person having problems with it.

So we rolled our own, which I feel is almost as blasphemous as writing our own date handling library, but here we are. In case others out there are looking for something similar, here you go. Keep in mind that our objective is to determine an encoding from an expected, known set and ultimately convert the string to UTF-8.

This detection doesn’t have to be perfect. If the file isn’t in UTF-8 we warn the project manager about the detected encoding before they load the files, so if we guess the encoding wrong there’s a human to double-check it before proceeding.

# Attempt to detect a string's encoding from the following subset
# of expected encodings:
# * UTF-8 (includes pure-ASCII which is a valid subset)
# * UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE based on the BOM
# * Windows-1252
# * ISO-8859-1
# These strings match what mb_detect_encoding() would return.
# The function returns false if it's unable to guess.
# It will readily return ISO-8859-1 in many circumstances.
function guess_string_encoding($text)
{
    # See if preg_match() sees it as valid UTF-8 already
    if (preg_match('//u', $text)) {
        return 'UTF-8';
    }

    # evaluate the BOM, if one exists, borrowed from
    # https://stackoverflow.com/questions/49546403/php-checking-if-string-is-utf-8-or-utf-16le
    $first2 = substr($text, 0, 2);
    $first4 = substr($text, 0, 4);
    if ($first4 == "\x00\x00\xFE\xFF")
        return 'UTF-32BE';
    elseif ($first4 == "\xFF\xFE\x00\x00")
        return 'UTF-32LE';
    elseif ($first2 == "\xFE\xFF")
        return 'UTF-16BE';
    elseif ($first2 == "\xFF\xFE")
        return 'UTF-16LE';

    # If the string contains characters in ranges that are either
    # control characters or invalid for ISO-8859-1 or CP-1252
    # return false indicating we are unable to reliably guess
    if (preg_match('/[\x00-\x08\x0E-\x1F\x81\x8D\x8F\x90\x9D]/', $text, $matches)) {
        return false;
    }

    # If we get here, we're going to assume it's either Windows-1252
    # or ISO-8859-1. If the string contains characters in the
    # ISO-8859-1 reserved range, that's probably Windows-1252.
    if (preg_match('/[\x80-\x9F]/', $text)) {
        return 'Windows-1252';
    }

    # Give up and return ISO-8859-1
    return 'ISO-8859-1';
}

Like all dproofreaders code, the above is released under the GPL v2.0.
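For the curious, usage looks something like this (a hypothetical sketch, not part of the function above): detect the encoding and, if it isn’t already UTF-8, convert it with mb_convert_encoding().

<?php
$text = file_get_contents($uploaded_file);
$encoding = guess_string_encoding($text);
if ($encoding === false) {
    # unable to guess reliably -- ask the content provider
} elseif ($encoding != 'UTF-8') {
    $text = mb_convert_encoding($text, 'UTF-8', $encoding);
}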

Noto Sans Mono webfont

Over at Distributed Proofreaders we’re busy working to convert the site code over to Unicode from Latin-1. Part of that work is finding a monospace proofreading font that covers the set of Unicode codepoints we need and is, or can be made available as, a webfont.

The Google Noto font family is an obvious candidate as it contains two monospace fonts: Noto Mono and Noto Sans Mono. Frustratingly, while Noto Sans is available directly from the Google Font CDN, neither of the two Mono fonts are included.

Noto Mono is available for download as a webfont from FontSquirrel. By default FontSquirrel will provide you with the Western Latin subset of the font, so if you want the full thing, disable subsetting. Also note that Noto Mono is deprecated in favor of Noto Sans Mono.

Noto Sans Mono is not available from FontSquirrel; their Noto Sans page explains that they can’t provide the font due to licensing restrictions. Noto Sans is licensed under the SIL Open Font License, which FontSquirrel reads as preventing them from providing a webfont version of it. Noto Sans Mono is under the same license.

If your reading of the SIL Open Font License is different, or if you are willing to change the name of the font and use it, you can use their Webfont Generator to upload the desired TTF variants you’ve downloaded from the Google Noto Sans Mono page.

It’s worth talking about font variants for a moment. Noto Sans Mono comes with 36 variants, such as Regular, Bold, ExtraBold, Light, ExtraLight, Condensed — you get the idea. Each one of these contains glyphs to render a wide range of Unicode codepoints in the desired form. You are probably most interested in the Regular and Bold forms.

Let’s assume you upload the Regular and Bold TTF files to FontSquirrel’s Webfont Generator. You will probably want to enable Expert mode and de-select many of the things that FontSquirrel will do to the font, like truetype hinting, fixing missing glyphs, fixing vertical metrics, etc. Google has done a great job on these fonts and you shouldn’t need any of that. If you want to retain the font’s full set of glyphs you will want to disable subsetting too.

The download will include two WOFF files, two WOFF2 files, and a stylesheet to use them. FontSquirrel doesn’t understand that we’ve uploaded two variants of the same font, so we need to fix the stylesheet so that the bold and regular versions work like we want them to. We need to use style linking to tell the web browser that they are the same font, just different styles. Note how the CSS block below uses the same font name for both the Regular and the Bold, but the bold version has a ‘font-weight: bold’ declaration:


@font-face {
    font-family: 'Not Noto Sans Mono';
    src: url('notnotosansmono-regular.woff2') format('woff2'),
         url('notnotosansmono-regular.woff') format('woff');
    font-weight: normal;
    font-style: normal;
}

@font-face {
    font-family: 'Not Noto Sans Mono';
    src: url('notnotosansmono-bold.woff2') format('woff2'),
         url('notnotosansmono-bold.woff') format('woff');
    font-weight: bold;
    font-style: normal;
}

This allows us to use the ‘Not Noto Sans Mono’ font in regular and bold versions just like we would expect:

<p style='font-family: Not Noto Sans Mono;'>Regular mono</p>
<p style='font-family: Not Noto Sans Mono; font-weight: bold;'>Bold mono</p>

It would be far simpler if Google just provided Noto Sans Mono on their font CDN. I’ve sent them an email to that effect, but who knows how that will go.

If you’re wary of converting Noto Sans Mono yourself, a reasonable alternative is DejaVu Sans Mono, which also covers a very wide set of Unicode codepoints — it’s available for download as a webfont from FontSquirrel.

DP gets a CSS makeover

Today we rolled out a sweeping code release at Distributed Proofreaders that standardizes our CSS and moves us to HTML5. Along the way we worked to have a consistent look-and-feel across the entire site.

The DP codebase has grown very organically over the years, starting out in 2000 when Cascading Style Sheets (CSS) were young and browser support for CSS was very poor. Since that time developers have added new code and styling in a variety of ways. CSS, and browser support for it, has come a long way in 17 years, and it was past time to get a common look-and-feel using modern CSS.

Some of our design goals:

  • Modern HTML & CSS
    We did not design for specific browsers, but rather designed for modern standards, specifically HTML5 and CSS3. HTML5 is the future and is largely backwards compatible with HTML4.x. Most of our pages should now validate cleanly against HTML5.
  • Pure-CSS for themes
    Moving to a pure-CSS system for themes, without theme-specific graphics, makes them immensely easier to create and update. Doing so means we don’t have to create or modify image files when working with themes.
  • Site-wide consistency
    The site has grown very organically over the past 17 years with each developer adding their own layout, table styles, etc. We made some subtle, and some not-so-subtle, changes to make pages across the site more consistent.
  • Consistent CSS
    Using consistent CSS across the site code allows developers to re-use components easily and makes it easier for users to adjust CSS browser-side for accessibility if necessary.
  • No (or little) per-page CSS
    Instead of embedding CSS styles directly in a page, we want to have the CSS in common files. This allows for better style re-use and gets us on the path to supporting Content Security Policies.
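To make that last point concrete, once no pages rely on inline styles, a response header along these lines becomes possible (illustrative only, using Apache’s mod_headers; this is not a policy we ship today):

Header set Content-Security-Policy "default-src 'self'; style-src 'self'; script-src 'self'"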

As part of this effort we created a Style Design Philosophy document to discuss what we were working towards as well as a Style Demo page.

Despite the removal of magic quotes and the mysqli changes being far more invasive, broad-reaching, and risky, the CSS work is the code deployment I’m most worried about. Not because I think we did anything wrong or I’m worried about how it will render in browsers¹, but because users hate change, and this roll-out is full of change they can see. Some subtle, some not so subtle.

I expect to be fielding a wide range of “why did X change!?” and “I don’t like the way Y looks!” over the next few weeks. I can only hope these are intermixed with some appreciative comments as well to balance out the criticism.

¹ IE6 being the known exception that we will just live with.

Smile more and donate to a charity

If you shop at Amazon and are not using AmazonSmile, your favorite non-profit is missing out on money!

For the past 4 years, Amazon has donated millions of dollars to charities by having shoppers go through the AmazonSmile website. You, the buyer, shop just as you normally would and Amazon gives 0.5% of your purchase to the non-profit of your choice. It costs you, the buyer, absolutely nothing. The only catch is that you have to purchase through the AmazonSmile website.

Remembering to go to the AmazonSmile website is the hardest part of the whole endeavor. Luckily there are browser plugins that will do that redirection for you.

If you shop at Amazon I encourage you to install a plugin to make sure you are buying through AmazonSmile and helping, even if it’s just a little, a non-profit you love.

My donations go to Distributed Proofreaders; you can select them as your charity on AmazonSmile using this link.

To be clear, I’m not encouraging anyone to shop at Amazon who isn’t already (shop at local merchants whenever possible!) but if you are shopping there, I encourage you to use AmazonSmile.

DP code release with mysqli goodness

Today we set free the second DP code release this year: R201707. This comes just six months after the last major code release. Both were focused on getting us moved to modern coding practices and middleware.

Today’s release moved the code off the deprecated mysql PHP extension and over to the mysqli PHP extension for connecting to the MySQL database. This will enable the site to run on PHP 7.x in addition to PHP 5.3 and later. This change was essential in enabling the code to run on modern operating systems, such as Ubuntu 16.04¹.
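To give a flavor of the change (an illustrative sketch, not actual DP code; the credentials and table name are placeholders), the procedural mysqli API requires the connection handle to be passed explicitly:

<?php
# Before (mysql extension, removed in PHP 7):
#   $link = mysql_connect('localhost', 'dp_user', 'secret');
#   mysql_select_db('dp_db');
#   $result = mysql_query("SELECT COUNT(*) FROM projects");

# After (mysqli extension):
$link = mysqli_connect('localhost', 'dp_user', 'secret', 'dp_db');
$result = mysqli_query($link, "SELECT COUNT(*) FROM projects");
list($count) = mysqli_fetch_row($result);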

This release also included the ability to run against phpBB 3.2 allowing pgdp.net and others to upgrade to the latest-and-greatest (and supported) version of phpBB.

Perhaps most importantly to some of our international users, this release includes a full French translation of the DP user interface.

Next up for the DP code is modernizing our HTML and CSS to bring it up-to-date as well as standardizing the look-and-feel across the site. Work is well under way by several volunteers on this front.

Many thanks to all of the volunteers who developed and tested the code in this release!


¹ Technically you can run PHP 5.6 on Ubuntu 16.04 as well, but 7.x is clearly the future.

DP code release with modern PHP goodness

Today I’m proud to announce a new release of the software that runs pgdp.net: R201701. The last release was a year ago and I’m trying to hold us to a yearly release cadence (compared to the 9-year gap before the previous one).

This version contains a slew of small bug fixes and enhancements. The most notable two changes that I want to highlight are support for PHP versions > 5.3 and the new Format Preview feature.

This is the first DP release that does not depend on PHP’s Magic Quotes, allowing the code to run on PHP versions > 5.3 up to, but not including, PHP 7.x. This means that the DP code can run on modern operating systems such as Ubuntu 14.04¹ and RHEL/CentOS 7. This is a behind-the-scenes change that end users should never notice.
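To illustrate the kind of change involved (a sketch only, not actual DP code): code that leaned on Magic Quotes assumed request data arrived pre-escaped; dropping that dependency means normalizing the input and escaping it explicitly at the database boundary.

<?php
# Undo the automatic escaping if the host still has magic quotes enabled
if (function_exists('get_magic_quotes_gpc') && get_magic_quotes_gpc()) {
    $title = stripslashes($_POST['title']);
} else {
    $title = $_POST['title'];
}
# Escape explicitly where the value is used
$escaped_title = mysql_real_escape_string($title);
mysql_query("UPDATE projects SET title = '$escaped_title' WHERE projectid = " . intval($_POST['projectid']));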

The most exciting user-visible change in this release is the new Format Preview functionality that assists proofreaders in formatting rounds. The new tool renders formatting via a simple toggle allowing the user to see what the formatted page would look like and alerting if it detects markup problems.

What’s next for the DP code base? We have a smattering of smaller changes coming in over the next few months. The biggest change on the horizon is moving from the deprecated mysql extension to mysqli, which will allow the code to run on PHP 7.x, and moving to phpBB 3.2.

Many thanks to all of the DP volunteers who made this release possible, including developers, squirrels, and the multitude of people who assisted in testing!


¹ Ubuntu 16.04 uses PHP 7.0, but can be configured to use PHP 5.6.