Matching UTF-8 encoded characters with regular expressions

Working with accented characters (or any Unicode, non-Latin character for that matter) often poses problems when you try to match them with regular expression functions such as preg_match or preg_replace in PHP.

The \w expression is meant to match any word character, but by default it won’t match é or ï in a Unicode (UTF-8) encoded string. For example, let’s take the following Wiktionary definition in raw form (succès):

# {{absolument}} [[réussite|Réussite]] de ce que l’on [[espérait]] et la [[gloire]] qui s'[[ensuit]].

Say we want to strip away the first half of any piped double-bracketed link (i.e. “[[réussite|”).

$string = "# {{absolument}} [[réussite|Réussite]] de ce que l'on [[espérait]] et la [[gloire]] qui s'[[ensuit]].";

// Two literal opening brackets, any run of whitespace or word characters, then a literal pipe
$pattern = "/\[\[[\s\w]*?\|/";

$results = preg_replace($pattern, "", $string);

Unfortunately, the pattern won’t match, even though it reads correctly (“match a double opening bracket followed by any run of whitespace or word characters, up to a pipe character”). The problem is the accented character: by default the preg functions work byte by byte, and \w does not recognize the bytes that encode é as a word character. One solution is to use the u modifier. This tells the preg function that both the pattern and the subject string should be treated as UTF-8. The trick, however, is to remember that it goes at the very end of the pattern, after the closing delimiter (the trailing slash or parenthesis), and not inside the pattern next to the characters you’re looking to match. Though it looks and works like a switch, it isn’t preceded by a backslash or any other identifier.

$pattern = "/\[\[[\s\w]*?\|/u";

// or

$pattern = "(\[\[[\s\w]*?\|)u";
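
To see the difference for yourself, here is a minimal sketch that runs the same pattern three ways: without the modifier, with u after slash delimiters, and with u after parenthesis delimiters. The array keys and variable names are just labels made up for this illustration, and the script assumes the file itself is saved as UTF-8.

```php
<?php
// Assumption: this file is saved as UTF-8, so the literal below holds UTF-8 bytes.
$string = "# {{absolument}} [[réussite|Réussite]] de ce que l'on [[espérait]] et la [[gloire]] qui s'[[ensuit]].";

// The same pattern with and without the u modifier, and with both delimiter styles.
// Single quotes keep the backslashes exactly as typed.
$patterns = [
    'no modifier'    => '/\[\[[\s\w]*?\|/',
    'u, slashes'     => '/\[\[[\s\w]*?\|/u',
    'u, parentheses' => '(\[\[[\s\w]*?\|)u',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // preg_replace returns the subject unchanged when nothing matches.
    $results[$label] = preg_replace($pattern, '', $string);
    echo $label, ': ', $results[$label], PHP_EOL;
}
```

Without the modifier the string should come back untouched; with it, “[[réussite|” is removed and the rest of the line is left alone.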

Either delimiter style should now match any accented characters the pattern encounters, provided that everything is truly encoded in UTF-8, as is the data stored in Wikipedia and Wiktionary.
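
The “truly encoded in UTF-8” part matters: with the u modifier, the preg functions report an error instead of matching when the subject contains invalid UTF-8. A rough sketch of a guard follows; stripLinkTargets is a hypothetical helper name invented for this example, and the “invalid” sample simply uses the ISO-8859-1 byte for é.

```php
<?php
// Hypothetical helper (not from the original post): strips the "[[target|" half
// of piped wiki links, but only when the input is valid UTF-8.
function stripLinkTargets(string $text): ?string
{
    // An empty pattern with the u modifier makes preg_match return false
    // when the subject is not valid UTF-8.
    if (preg_match('//u', $text) === false) {
        return null;
    }

    return preg_replace('/\[\[[\s\w]*?\|/u', '', $text);
}

// Assumption: this file is saved as UTF-8.
$valid   = "[[réussite|Réussite]]";
$invalid = "[[r\xE9ussite|R\xE9ussite]]"; // ISO-8859-1 bytes, not valid UTF-8

var_dump(stripLinkTargets($valid));   // the string "Réussite]]"
var_dump(stripLinkTargets($invalid)); // NULL
```

If you would rather not check up front, preg_replace itself returns null in that situation, and preg_last_error() will then equal PREG_BAD_UTF8_ERROR.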

