Detecting Windows-1252 encoding

For Distributed Proofreaders' move to Unicode, we need to accept files from content providers that are not in UTF-8. These files usually arrive as Windows-1252, but sometimes they are ISO-8859-1, UTF-16, or even UTF-32. We need to get the detection right to ensure a valid conversion to UTF-8.

For reasons beyond my ken, PHP’s mb_detect_encoding() function appears to be completely unable to tell Windows-1252 from ISO-8859-1, even for strings that clearly have characters in the 0x80 to 0x9F range. Shockingly, it also wasn’t able to detect files encoded as UTF-16 with BOMs, which I absolutely don’t understand. And it appears I’m not the only person having problems with it.
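To see why the 0x80 to 0x9F range matters, here is a small sketch (not from our codebase, and assuming the mbstring extension is available) that decodes the same byte under both encodings. The variable names are made up for illustration.

```php
<?php
// Illustrative only: decode one byte under both candidate encodings.
// Requires the mbstring extension.
$byte = "\x93";

// Windows-1252 maps 0x93 to U+201C (left double quotation mark).
$as_cp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

// ISO-8859-1 maps 0x93 to the C1 control character U+0093.
$as_latin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($as_cp1252), "\n"; // e2809c
echo bin2hex($as_latin1), "\n"; // c293
```

Guess ISO-8859-1 for a file that is really Windows-1252 and every curly quote silently turns into an invisible control character, which is why the distinction is worth getting right.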

So we rolled our own, which I feel is almost as blasphemous as writing our own date handling library, but here we are. In case others out there are looking for something similar, here you go. Keep in mind that our objective is to determine an encoding from an expected, known set and ultimately convert the string to UTF-8.

This detection doesn’t have to be perfect. If the file isn’t in UTF-8 we warn the project manager about the detected encoding before they load the files, so if we guess the encoding wrong there’s a human to double-check it before proceeding.

# Attempt to detect a string's encoding from the following subset
# of expected encodings:
# * UTF-8 (includes pure-ASCII which is a valid subset)
# * UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE based on the BOM
# * Windows-1252
# * ISO-8859-1
# These strings match what mb_detect_encoding() would return.
# The function returns false if it's unable to guess.
# It will readily return ISO-8859-1 in many circumstances.
function guess_string_encoding($text)
{
    # See if preg_match() sees it as valid UTF-8 already
    if (preg_match('//u', $text)) {
        return 'UTF-8';
    }

    # evaluate the BOM, if one exists, borrowed from
    # https://stackoverflow.com/questions/49546403/php-checking-if-string-is-utf-8-or-utf-16le
    $first2 = substr($text, 0, 2);
    $first4 = substr($text, 0, 4);
    if ($first4 == "\x00\x00\xFE\xFF") return 'UTF-32BE';
    elseif ($first4 == "\xFF\xFE\x00\x00") return 'UTF-32LE';
    elseif ($first2 == "\xFE\xFF") return 'UTF-16BE';
    elseif ($first2 == "\xFF\xFE") return 'UTF-16LE';

    # If the string contains characters in ranges that are either
    # control characters or invalid for ISO-8859-1 or CP-1252
    # return false indicating we are unable to reliably guess
    if (preg_match('/[\x00-\x08\x0E-\x1F\x81\x8D\x8F\x90\x9D]/', $text, $matches)) {
        return false;
    }

    # If we get here, we're going to assume it's either Windows-1252
    # or ISO-8859-1. If the string contains characters in the
    # ISO-8859-1 reserved range, that's probably Windows-1252.
    if (preg_match('/[\x80-\x9F]/', $text)) {
        return 'Windows-1252';
    }

    # Give up and return ISO-8859-1
    return 'ISO-8859-1';
}
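Since the end goal is UTF-8, here is a rough sketch of what the conversion step could look like once an encoding has been guessed. convert_to_utf8() is a hypothetical helper written for this post, not the actual dproofreaders code, and it assumes the mbstring extension is available. It takes the encoding name as a parameter, so it works with whatever guess_string_encoding() returns, including false.

```php
<?php
// Hypothetical helper (an assumption for illustration): convert $text
// to UTF-8 given an encoding name such as those returned by
// guess_string_encoding(). Requires the mbstring extension.
function convert_to_utf8($text, $encoding)
{
    // The detector could not guess; leave the decision to a human.
    if ($encoding === false) {
        return false;
    }

    // Nothing to do for text that is already UTF-8.
    if ($encoding === 'UTF-8') {
        return $text;
    }

    $converted = mb_convert_encoding($text, 'UTF-8', $encoding);

    // If a BOM came through the conversion as U+FEFF, drop it.
    if (substr($converted, 0, 3) === "\xEF\xBB\xBF") {
        $converted = substr($converted, 3);
    }

    return $converted;
}
```

Something like convert_to_utf8($text, guess_string_encoding($text)) would then tie the two together, with a false result being the cue to stop and ask a human.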

Like all dproofreaders code, the above is licensed under the GPL v2.0.

Published by

cpeel
