[R6RS] draft Unicode SRFI

Marc Feeley
Wed Jul 6 09:53:16 EDT 2005


I have started to implement the Unicode SRFI in Gambit and have
encountered some problems.  The Unicode collation algorithm, which is
used by string-locale=? and the related procedures, is very complex; see

   http://www.unicode.org/unicode/reports/tr10/

The algorithm uses non-trivial collation tables for each locale.
There is no point in reimplementing all of this from scratch, so it
makes sense to base the implementation on the wide-character C
library function "wcscoll".  Unfortunately, this function is not
supported equally well by all C libraries.  For example, on Mac OS X
the man page says:

   BUGS
        The current implementation of wcscoll() only works in single-byte
        LC_CTYPE locales, and falls back to using wcscmp() in locales with
        extended character sets.
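
To make the idea concrete, here is a minimal sketch of how the
case-sensitive comparisons might sit on top of wcscoll.  The helper
names (use_environment_locale, string_locale_compare,
string_locale_equal) are hypothetical and not part of the SRFI or of
Gambit; the sketch assumes the Scheme strings have already been
converted to NUL-terminated wchar_t arrays.

```c
#include <assert.h>
#include <locale.h>
#include <wchar.h>

/* Hypothetical helper: adopt the locale named by the environment.
   Returns nonzero on success.  wcscoll() consults LC_COLLATE, so
   this has to happen before any comparison is made. */
int use_environment_locale(void) {
    return setlocale(LC_ALL, "") != NULL;
}

/* Hypothetical helper: three-way comparison in the current locale.
   Returns <0, 0 or >0, exactly as wcscoll() does. */
int string_locale_compare(const wchar_t *a, const wchar_t *b) {
    assert(a != NULL && b != NULL);
    return wcscoll(a, b);
}

/* Hypothetical helper: the predicate behind string-locale=?. */
int string_locale_equal(const wchar_t *a, const wchar_t *b) {
    return string_locale_compare(a, b) == 0;
}
```

On a C library with the limitation quoted above, these helpers would
still compile and run, but in locales with extended character sets they
would just compare code points.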

Moreover, even if wcscoll had widespread support, it is not clear to
me how to implement the case-independent variants
string-locale-ci=?, ... because wcscoll does not support a
case-independent comparison.  The best I can come up with is to
downcase both strings in a locale-dependent way, and then use wcscoll.
But how do you downcase in a locale-dependent way?  I thought the C
library function "iconv" could be used for this, but it seems to be
useful only for converting between different character encodings.
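
A rough sketch of the "downcase, then wcscoll" idea follows.  It
assumes that the standard C function towlower (from <wctype.h>), whose
mapping depends on the LC_CTYPE category of the current locale, is an
acceptable downcasing step.  That is only an assumption: towlower maps
one character to one character, so it cannot express case mappings
that change the length of the string, and its coverage depends on the
C library.  The function names are hypothetical.

```c
#include <assert.h>
#include <locale.h>   /* setlocale(), to be called by the user of these helpers */
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: return a newly allocated copy of s with every
   character passed through towlower().  Returns NULL if allocation
   fails.  The caller frees the result. */
wchar_t *downcase_copy(const wchar_t *s) {
    assert(s != NULL);
    size_t n = wcslen(s);
    wchar_t *out = malloc((n + 1) * sizeof *out);
    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++)
        out[i] = (wchar_t)towlower((wint_t)s[i]);
    out[n] = L'\0';
    return out;
}

/* Hypothetical helper: case-independent three-way comparison.
   Stores <0, 0 or >0 in *result and returns 0 on success; returns -1
   (leaving *result untouched) if memory could not be allocated. */
int string_locale_ci_compare(const wchar_t *a, const wchar_t *b,
                             int *result) {
    assert(result != NULL);
    wchar_t *la = downcase_copy(a);
    wchar_t *lb = downcase_copy(b);
    int ok = (la != NULL && lb != NULL);
    if (ok)
        *result = wcscoll(la, lb);   /* collate the downcased copies */
    free(la);                        /* free(NULL) is a no-op */
    free(lb);
    return ok ? 0 : -1;
}
```

string-locale-ci=? would then be the case where the call succeeds and
*result is 0.  Note that this allocates two temporary strings per
comparison, which is part of why the approach is unattractive.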

I'm beginning to wonder if it is a good idea to put the
locale-specific string procedures in the language.  The runtime system
will be larger (in binary code size and in various tables), and we
don't seem to be able to pin down a definition of what a "locale" is
or how the locale is specified to the runtime system.  I think that,
given the support in R6RS for Unicode strings, all the
locale-dependent string operations can be written portably and placed
in a "locale" library.  Wouldn't this make more sense?

Marc


