Detecting Windows-1252 encoding

As part of DP’s move to Unicode, we need to accept files from content providers that are not encoded in UTF-8. These files usually arrive as Windows-1252, but they are sometimes ISO-8859-1, UTF-16, or even UTF-32. We need to detect the encoding correctly to ensure a valid conversion to UTF-8.

For reasons beyond my ken, PHP’s mb_detect_encoding() function appears to be completely unable to tell Windows-1252 from ISO-8859-1, even for strings that clearly have characters in the 0x80 to 0x9F range. Shockingly, it also wasn’t able to detect files encoded as UTF-16 with BOMs, which I absolutely don’t understand. And it appears I’m not the only person having problems with it.
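To make the 0x80 to 0x9F point concrete, here is a small sketch of how the same byte decodes under the two encodings. The helper name utf8_hex() is made up for illustration, and the sketch assumes the mbstring extension is available.

```php
<?php
// Illustrative helper (not part of the dproofreaders code): decode $bytes
// as $encoding and return the resulting UTF-8 bytes as a hex string.
function utf8_hex($bytes, $encoding)
{
    return bin2hex(mb_convert_encoding($bytes, 'UTF-8', $encoding));
}

// 0x93 is a left curly quote (U+201C) in Windows-1252 ...
echo utf8_hex("\x93", 'Windows-1252'), "\n";  // e2809c
// ... but a C1 control character (U+0093) in ISO-8859-1.
echo utf8_hex("\x93", 'ISO-8859-1'), "\n";    // c293
```

Guessing ISO-8859-1 for a file that is really Windows-1252 would turn every curly quote and dash into an invisible control character, which is why the distinction matters.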

So we rolled our own, which I feel is almost as blasphemous as writing our own date handling library, but here we are. In case others out there are looking for something similar, here you go. Keep in mind that our objective is to determine an encoding from an expected set and ultimately convert the string to UTF-8.

This detection doesn’t have to be perfect. If the file isn’t in UTF-8 we warn the project manager about the detected encoding before they load the files, so if we guess the encoding wrong there’s a human to double-check it before proceeding.

# Attempt to detect a string's encoding from a subset of expected encodings:
# * UTF-8 (includes pure-ASCII which is a valid subset)
# * UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE based on the BOM
# * Windows-1252
# * ISO-8859-1
# These strings match what mb_detect_encoding() would return. The function
# returns False if it's unable to guess, although it will readily return
# ISO-8859-1 in many circumstances.
function guess_string_encoding($text)
{
    if(preg_match('//u', $text))
        return 'UTF-8';

    # evaluate the BOM, if one exists, borrowed from
    # https://stackoverflow.com/questions/49546403/php-checking-if-string-is-utf-8-or-utf-16le
    $first2 = substr($text, 0, 2);
    $first4 = substr($text, 0, 4);
    if ($first4 == "\x00\x00\xFE\xFF")
        return 'UTF-32BE';
    elseif ($first4 == "\xFF\xFE\x00\x00")
        return 'UTF-32LE';
    elseif ($first2 == "\xFE\xFF")
        return 'UTF-16BE';
    elseif ($first2 == "\xFF\xFE")
        return 'UTF-16LE';

    # if there are any characters in ranges that are either control characters
    # or invalid for ISO-8859-1 or CP-1252, return False
    if(preg_match('/[\x00-\x08\x0E-\x1F\x81\x8D\x8F\x90\x9D]/', $text, $matches))
        return False;

    # if we get here, we're going to assume it's either Windows-1252 or ISO-8859-1.
    # if the string contains characters in the ISO-8859-1 reserved range,
    # that's probably Windows-1252
    if(preg_match('/[\x80-\x9F]/', $text))
        return 'Windows-1252';

    # Give up and return ISO-8859-1
    return 'ISO-8859-1';
}
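
Since the end goal is UTF-8, here is a rough sketch of what the conversion step might look like once an encoding name is in hand. The function convert_to_utf8() is a hypothetical helper, not the actual dproofreaders code. It assumes the mbstring extension is available and that the encoding name is one of the strings listed above (or false).

```php
<?php
// Hypothetical helper (not from the dproofreaders code): convert $text to
// UTF-8 given an encoding name like the ones listed above.
// Returns false if the encoding is unknown (false).
function convert_to_utf8($text, $encoding)
{
    if ($encoding === false) {
        return false;  // no guess available; leave the decision to a human
    }
    if ($encoding === 'UTF-8') {
        return $text;  // nothing to do
    }

    // Strip a leading BOM for the UTF-16/UTF-32 variants so it doesn't
    // survive the conversion as a U+FEFF character at the start.
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    if (isset($boms[$encoding])) {
        $bom = $boms[$encoding];
        if (substr($text, 0, strlen($bom)) === $bom) {
            $text = substr($text, strlen($bom));
        }
    }

    return mb_convert_encoding($text, 'UTF-8', $encoding);
}
```

In use, the detected encoding would be passed straight through, along the lines of convert_to_utf8($text, guess_string_encoding($text)), with a false result treated as "ask a human".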

Like all dproofreaders code, the above is licensed under the GPL v2.0.

Published by

cpeel

I'm a gay geek living in Seattle, WA.
