> > Summary
> > "This document attempts to make the case that it is advantageous to
> > use UTF-16 (or 16-bit Unicode strings) for text processing..."
> 
> IMHO this is one of the worst mistakes Unicode is trying to make.
> It convinces people that they should not worry about characters above
> U+FFFF just because they are very rare. UTF-16 combines the worst
> aspects of UTF-8 and UTF-32.
No, that's wrong. Here's a direct quote from the document:
"Important: Supplementary code points must be supported for full Unicode 
support, regardless of the encoding form. Many characters are assigned 
supplementary code points, and even whole scripts are entirely encoded 
outside of the BMP. The opportunity for optimization of 16-bit Unicode 
string processing is that the most commonly used characters are stored 
with single 16-bit code units, so that it is useful to concentrate 
performance work on code paths for them, while also maintaining support 
and reasonable performance for supplementary code points."
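That point is easy to see in practice. Here's a minimal Java sketch (mine, not from the quoted document) showing that a BMP character occupies one 16-bit code unit while a supplementary code point takes a surrogate pair of two:

```java
// Sketch: BMP characters are single 16-bit code units in a Java String;
// supplementary code points (above U+FFFF) are stored as surrogate pairs.
public class SurrogateDemo {
    public static void main(String[] args) {
        String bmp = "A"; // U+0041, inside the BMP
        // U+1D11E MUSICAL SYMBOL G CLEF, a supplementary code point
        String supp = new String(Character.toChars(0x1D11E));

        System.out.println(bmp.length());   // 1 code unit
        System.out.println(supp.length());  // 2 code units (surrogate pair)
        // Still one logical character:
        System.out.println(supp.codePointCount(0, supp.length())); // 1
        // The first unit is a high (lead) surrogate:
        System.out.println(Character.isHighSurrogate(supp.charAt(0))); // true
    }
}
```

The "optimization opportunity" the document describes is exactly this: the fast path handles the one-unit case, and the surrogate-pair path stays correct but rarer.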
> If size is important and variable width of the representation of a code
> point is acceptable, then UTF-8 is usually a better choice. If O(1)
> indexing by code points is important, then UTF-32 is better. Nobody
> wants to process texts in terms of UTF-16 code units. Nobody wants to
> have surrogate processing sprinkled around the code, and thus if one
> accepts an API which extracts variable width characters, then the API
> could as well deal with UTF-8, which is better for interoperability.
> UTF-16 makes no sense.
No, that's wrong. I've provided links to many documents written by experts 
with experience in the field. For example, Dr. Mark Davis is a co-founder 
of Unicode, president of the Consortium, original architect of ICU, and 
Chief Globalization Architect at IBM. Richard Gillam was a member of IBM's 
Unicode Technology Group and an Engineer at the Unicode Technology Center 
for Java Technology. He was also part of the team that added Unicode to 
JavaScript. People like Markus Scherer have similar backgrounds. Each of 
those documents says the same thing: UTF-16 is the best overall trade-off 
of space & time & ease-of-use.
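And the space side of that trade-off is easy to measure yourself. Here's a small Java sketch (mine, added for illustration) comparing encoded sizes of a few sample strings; for Latin text UTF-8 wins, for CJK text UTF-16 wins, and UTF-32 never wins:

```java
import java.nio.charset.StandardCharsets;

// Sketch: compare per-string byte counts in UTF-8, UTF-16, and UTF-32.
public class EncodingSizes {
    public static void main(String[] args) {
        // ASCII, Greek, CJK, and one supplementary character (U+1F600)
        String[] samples = {
            "hello", "κόσμε", "你好",
            new String(Character.toChars(0x1F600))
        };
        for (String s : samples) {
            int utf8  = s.getBytes(StandardCharsets.UTF_8).length;
            int utf16 = s.getBytes(StandardCharsets.UTF_16BE).length; // no BOM
            // UTF-32 is always 4 bytes per code point:
            int utf32 = s.codePointCount(0, s.length()) * 4;
            System.out.printf("%s: UTF-8=%d UTF-16=%d UTF-32=%d%n",
                              s, utf8, utf16, utf32);
        }
    }
}
```

Which is the whole point: no fixed answer on space alone, so the argument comes down to the overall balance those documents describe.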
But I'll tell you what. Find a document, written by someone with 
substantial Unicode experience, that recommends UTF-32 as the best overall 
in-memory encoding. I haven't found such a document, not a single one, but 
maybe you can. (I mean that; maybe I wasn't searching in the right places 
or with the right words.)
Received on Sun Mar 25 2007 - 22:32:50 UTC