Thursday, March 01, 2007

Unicode Support for PHP 6

PHP 6 is scheduled for release at the end of this year, with lots of improvements and new support for Unicode (oh, and don't forget the major changes in how PHP works, mostly affecting common behavior that we rely on today, such as register_globals, magic quotes, and safe mode). The first alpha release is scheduled for the end of the first quarter of 2007.

Here's some information about Unicode and how it will be implemented in PHP 6:
Unicode is an effort to map the characters of all human languages for use with computers. Version 5.0 of Unicode, released in the fall of 2006, contains nearly 100,000 characters and has the capacity for about a million. Support for Unicode in software is well underway, usually via one of the Unicode Transformation Formats: UTF-8, UTF-16, or UTF-32.
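To make the difference between the three Unicode Transformation Formats a bit more concrete, here is a small sketch. It is written in Python rather than PHP, purely as an illustration, and the helper name encoded_sizes is my own. It shows how many bytes the same character takes in each format.

```python
# Compare how UTF-8, UTF-16 and UTF-32 encode the same characters.
# encoded_sizes is a hypothetical helper written for this illustration.

def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    # The "-le" variants are used so that no byte order mark is prepended.
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

samples = [
    "A",           # U+0041 LATIN CAPITAL LETTER A
    "\u00e9",      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    "\u20ac",      # U+20AC EURO SIGN
    "\U0001d11e",  # U+1D11E MUSICAL SYMBOL G CLEF (outside the 16-bit range)
]

for ch in samples:
    utf8, utf16, utf32 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}  utf-8={utf8}  utf-16={utf16}  utf-32={utf32}")
```

UTF-8 grows from one to four bytes per character, UTF-16 uses two bytes for most characters and four for those outside the 16-bit range, and UTF-32 always uses four.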

Unicode support in PHP 6.0 will draw on a broad selection of the International Components for Unicode (ICU). These components provide for such actions as converting between one locale or character set and another, collation, transliteration, Unicode text processing, and Unicode regular expressions. This functionality will be available when the unicode.semantics switch is enabled.

To accommodate this change, PHP 6.0 will switch from having a single, generic string type to having two: a Unicode string type for text data, implemented through UTF-16, and a binary type, which will include actual binary data and text data for legacy locales. Perhaps the most obvious difference in the string types is that each character in a binary string will be one byte long, while in a Unicode string, a character may use more than a single byte, depending on the language and how it is encoded. In addition, within Unicode strings, characters may be referenced by either name or code point.
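The split between a binary string and a Unicode string is easier to see with an example. The sketch below uses Python, which already has the same two-type model (bytes and str), so it only illustrates the idea and says nothing about the final PHP 6 syntax. It also shows a character referenced both by code point and by name.

```python
# A Unicode string versus a binary string holding the same text.
by_code_point = "na\u00efve"                                  # U+00EF by code point
by_name = "na\N{LATIN SMALL LETTER I WITH DIAERESIS}ve"       # same character by name

# Both spellings produce the identical Unicode string.
print(by_code_point == by_name)

# Encoding turns the Unicode string into a binary string (raw bytes).
as_utf8 = by_code_point.encode("utf-8")
as_utf16 = by_code_point.encode("utf-16-le")  # "-le" avoids a byte order mark

print(len(by_code_point))  # counted in characters
print(len(as_utf8))        # counted in bytes
print(len(as_utf16))       # counted in bytes
print(as_utf8[2:4])        # the two bytes that make up U+00EF in UTF-8
```

The Unicode string has five characters, while the binary versions are six bytes long in UTF-8 and ten in UTF-16. That is why length and offset functions behave differently on the two types.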

When a PHP program runs, the runtime encoding will specify which encoding to use. The encoding for a script will be specified either as an INI setting or with a declare() statement on the first line, much as an XML file declares its encoding. The encoding may be changed later in the script with a pragma. The encoding for standard output and for file and directory names may also be specified, as well as how conversions between the two string types are handled. Since legacy character sets cannot support all Unicode characters, programmers will also be able to set how conversion errors are handled and the format in which PHP reports them.
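Here is a rough idea of what "setting how conversion errors are handled" means in practice. This is a Python sketch, and the mode names (replace, ignore, and so on) belong to Python's codecs, not to PHP 6, whose setting names I am not assuming here. The helper to_legacy is my own. The euro sign does not exist in ISO-8859-1, so converting it forces a decision.

```python
# Converting a Unicode string to a legacy character set that lacks one character.
# to_legacy is a hypothetical helper; the error mode names are Python's own.

def to_legacy(text, mode):
    """Encode text to ISO-8859-1 using the given error-handling mode."""
    return text.encode("iso-8859-1", errors=mode)

text = "Price: 10 \u20ac"  # U+20AC EURO SIGN is not part of ISO-8859-1

# Lenient modes: substitute, drop, or escape the unconvertible character.
for mode in ("replace", "ignore", "xmlcharrefreplace", "backslashreplace"):
    print(mode, to_legacy(text, mode))

# Strict mode: stop and report the error instead of altering the data.
try:
    to_legacy(text, "strict")
except UnicodeEncodeError as err:
    print("strict: cannot encode character at position", err.start)
```

Which mode is right depends on the data: silently dropping a currency sign is risky in a price list, while an escaped form such as &#8364; is usually safe for HTML output.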

With Unicode support, not only will identifiers within the code be able to use Unicode characters, but a whole range of new functionality will become available. Programmers will be able to specify how information is collated by choosing a locale, and by specifying criteria, such as how accented or upper case characters are treated. Even more usefully, text can be converted from one locale to another, so that, for example, English speakers can read Greek names in Latin characters, or a Japanese reader can convert full-width characters to half-width ones on the fly.
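Two of these ideas can be approximated with nothing more than Python's standard unicodedata module. This is only a sketch of the concepts: it is not ICU, it is not locale-aware, and the helper names fold_key and to_halfwidth are my own. The first helper builds a sort key that ignores accents and case, and the second converts full-width Latin letters and digits to their ordinary half-width forms.

```python
import unicodedata

def fold_key(s):
    """Sort key that ignores accents and letter case (a rough stand-in for collation)."""
    decomposed = unicodedata.normalize("NFD", s)       # split letters from accents
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()                         # ignore upper/lower case

def to_halfwidth(s):
    """Map full-width Latin letters and digits to their half-width forms via NFKC."""
    return unicodedata.normalize("NFKC", s)

names = ["zebra", "\u00c9mile", "eclair", "Eva"]       # "\u00c9" is E with acute

print(sorted(names))                # plain code point order
print(sorted(names, key=fold_key))  # accents and case ignored

fullwidth = "\uff21\uff22\uff23\uff11\uff12\uff13"     # full-width "ABC123"
print(to_halfwidth(fullwidth))
```

In plain code point order the accented name ends up after "zebra", while the folded key sorts it between "eclair" and "Eva". Real collation has many more rules per locale, and NFKC changes more than character width, so treat both helpers as illustrations only.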


As usual, the developers have tried to make the migration from 5.x to 6.x as smooth as possible, but broken code is likely to be seen everywhere. So when the beta or RC version arrives, it is probably a good idea to start analyzing it and to test your site on a test server with PHP 6 installed before the final version comes out. Doing this will reduce the time needed to recode your work later, because an RC already works for most cases, even though it still needs some improvements before going public as the final version.

The PHP developers are moving very fast, while the public hasn't completely adopted PHP 5.x yet. Many web hosts still run PHP 4.x on their servers, because if they upgraded to PHP 5, some of their clients' websites might display a lot of error messages or warnings, and that wouldn't be good for business. Now the developers are ready to do the same thing with PHP 6.

References:
Minutes PHP Developer Meeting
Linux.com
Core PHP