Detecting Windows-1252 encoding

As part of DP’s move to Unicode we need to accept files from content providers that are not in UTF-8. Usually these files arrive as Windows-1252, but sometimes they are ISO-8859-1, UTF-16, or even UTF-32. We need to detect the encoding correctly to ensure a valid conversion to UTF-8.

For reasons beyond my ken, PHP’s mb_detect_encoding() function appears to be completely unable to tell Windows-1252 from ISO-8859-1, even for strings that clearly have characters in the 0x80 to 0x9F range. Shockingly, it also wasn’t able to detect files encoded as UTF-16 with BOMs, which I absolutely don’t understand. And it appears I’m not the only person having problems with it.
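To make the 0x80 to 0x9F distinction concrete, here is a small sketch, assuming the mbstring extension is loaded. The byte 0x80 is only an illustrative choice: Windows-1252 defines it as the euro sign, while ISO-8859-1 leaves that range to C1 control characters.

```php
<?php
// Illustrative only; assumes the mbstring extension is available.
// The same byte decodes to different characters in the two encodings.
$byte = "\x80";

$as_cp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
$as_latin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($as_cp1252), "\n"; // e282ac (U+20AC, the euro sign)
echo bin2hex($as_latin1), "\n"; // c280 (U+0080, a control character)
```

A file that uses bytes in this range for printable characters is therefore much more likely to be Windows-1252 than ISO-8859-1.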

So we rolled our own, which I feel is almost as blasphemous as writing our own date handling library, but here we are. In case others out there are looking for something similar, here you go. Keep in mind that our objective is to determine an encoding from an expected set and ultimately convert the string to UTF-8.

This detection doesn’t have to be perfect. If the file isn’t in UTF-8 we warn the project manager about the detected encoding before they load the files, so if we guess the encoding wrong there’s a human to double-check it before proceeding.

# Attempt to detect a string's encoding from a subset of expected encodings:
# * UTF-8 (includes pure-ASCII which is a valid subset)
# * UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE based on the BOM
# * Windows-1252
# * ISO-8859-1
# These strings match what mb_detect_encoding() would return. The function
# returns False if it's unable to guess, although it will readily return
# ISO-8859-1 in many circumstances.
function guess_string_encoding($text)
{
    if(preg_match('//u', $text))
        return 'UTF-8';

    # evaluate the BOM, if one exists, borrowed from
    # https://stackoverflow.com/questions/49546403/php-checking-if-string-is-utf-8-or-utf-16le
    $first2 = substr($text, 0, 2);
    $first4 = substr($text, 0, 4);
    if ($first4 == "\x00\x00\xFE\xFF") return 'UTF-32BE';
    elseif ($first4 == "\xFF\xFE\x00\x00") return 'UTF-32LE';
    elseif ($first2 == "\xFE\xFF") return 'UTF-16BE';
    elseif ($first2 == "\xFF\xFE") return 'UTF-16LE';

    # if there are any characters in ranges that are either control characters
    # or invalid for ISO-8859-1 or CP-1252, return False
    if(preg_match('/[\x00-\x08\x0E-\x1F\x81\x8D\x8F\x90\x9D]/', $text, $matches))
        return False;

    # if we get here, we're going to assume it's either Windows-1252 or ISO-8859-1.
    # if the string contains characters in the ISO-8859-1 reserved range,
    # that's probably Windows-1252
    if(preg_match('/[\x80-\x9F]/', $text))
        return 'Windows-1252';

    # Give up and return ISO-8859-1
    return 'ISO-8859-1';
}

Like all dproofreaders code, guess_string_encoding() is licensed under the GPL v2.0.
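Once an encoding has been guessed, the conversion to UTF-8 might look something like the sketch below. This is not the dproofreaders implementation: convert_to_utf8() is a hypothetical helper, it assumes the mbstring extension is loaded, and it takes the guessed encoding as a parameter so that it stands on its own.

```php
<?php
// Hypothetical helper, not part of the dproofreaders code.
// Assumes mbstring is available and that $encoding is one of the names
// returned by the guessing function, or False if no guess was possible.
function convert_to_utf8($text, $encoding)
{
    if ($encoding === false) {
        // No guess available: leave the decision to the caller.
        return false;
    }
    if ($encoding !== 'UTF-8') {
        $text = mb_convert_encoding($text, 'UTF-8', $encoding);
    }
    // A BOM in the source can survive conversion as U+FEFF; drop it.
    if (substr($text, 0, 3) === "\xEF\xBB\xBF") {
        $text = substr($text, 3);
    }
    return $text;
}

// Windows-1252 curly quotes (0x93, 0x94) become U+201C and U+201D.
echo bin2hex(convert_to_utf8("\x93hi\x94", 'Windows-1252')), "\n"; // e2809c6869e2809d
```

In practice the second argument would be the result of guess_string_encoding($text), with the False case surfaced to the project manager rather than silently converted.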

Noto Sans Mono webfont

Over at Distributed Proofreaders we’re busy working to convert the site code over to Unicode from Latin-1. Part of that work is finding a monospace proofreading font that covers the set of Unicode codepoints we need and is, or can be made available as, a webfont.

The Google Noto font family is an obvious candidate as it contains two monospace fonts: Noto Mono and Noto Sans Mono. Frustratingly, while Noto Sans is available directly from the Google Font CDN, neither of the two Mono fonts is included.

Noto Mono is available for download as a webfont from FontSquirrel. By default FontSquirrel will provide you with the Western Latin subset of the font, so if you want the full thing, disable subsetting. Also be aware that Noto Mono is deprecated in favor of Noto Sans Mono.

Noto Sans Mono is not available from FontSquirrel. Their Noto Sans page says they can’t provide that font due to licensing restrictions: Noto Sans is licensed under the SIL Open Font License, which FontSquirrel reads as preventing them from providing a webfont version of it. Noto Sans Mono is under the same license, which presumably explains its absence.

If your read of the SIL Open Font License is different, or if you are willing to change the name of the font and use it, you can use their Webfont Generator to upload the desired TTF variants you’ve downloaded from the Google Noto Sans Mono page.

It’s worth talking about font variants for a moment. Noto Sans Mono comes with 36 variants, such as Regular, Bold, ExtraBold, Light, ExtraLight, Condensed — you get the idea. Each one of these contains glyphs to render a wide range of Unicode codepoints in the desired form. You are probably most interested in the Regular and Bold forms.

Let’s assume you upload the Regular and Bold TTF files to FontSquirrel’s Webfont Generator. You will probably want to enable Expert mode and de-select many of the things that FontSquirrel will do to the font, like truetype hinting, fixing missing glyphs, fixing vertical metrics, etc. Google has done a great job on these fonts and you shouldn’t need any of that. If you want to retain the font’s full set of glyphs you will want to disable subsetting too.

The download will include two WOFF files, two WOFF2 files, and a stylesheet to use them. FontSquirrel doesn’t understand that we’ve uploaded two variants of the same font, so we need to fix the stylesheet so that the bold and regular versions work like we want them to. We need to use style linking to tell the web browser that they are the same font, just different styles. Note how the CSS block below uses the same font-family name for both the Regular and the Bold, but the bold version has a ‘font-weight: bold’ declaration:


@font-face {
    font-family: 'Not Noto Sans Mono';
    src: url('notnotosansmono-regular.woff2') format('woff2'),
         url('notnotosansmono-regular.woff') format('woff');
    font-weight: normal;
    font-style: normal;
}

@font-face {
    font-family: 'Not Noto Sans Mono';
    src: url('notnotosansmono-bold.woff2') format('woff2'),
         url('notnotosansmono-bold.woff') format('woff');
    font-weight: bold;
    font-style: normal;
}

This allows us to use the ‘Not Noto Sans Mono’ font in regular and bold versions just like we would expect:

<p style='font-family: Not Noto Sans Mono;'>Regular mono</p>
<p style='font-family: Not Noto Sans Mono; font-weight: bold;'>Bold mono</p>

It would be far simpler if Google just provided Noto Sans Mono on their font CDN. I’ve sent them an email to that effect, but who knows how that will go.

If you’re wary of converting Noto Sans Mono yourself, a reasonable alternative is DejaVu Sans Mono, which also covers a very wide set of Unicode codepoints. It’s available for download as a webfont from FontSquirrel.