[R6RS] Unicode normalization

Matthew Flatt mflatt at cs.utah.edu
Thu Mar 2 13:04:05 EST 2006

After experimenting with Unicode normalization and re-reading the SRFI
discussion, I propose the following simple change to the SRFI:

 * Add `string-normalize-nfd',
       `string-normalize-nfkd',
       `string-normalize-nfc', and
       `string-normalize-nfkc'.
   Each of these procedures takes a string and returns its D, KD, C, or
   KC normalization, respectively.
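
For readers who want to see how the four forms differ, here is a rough sketch. It uses Python's standard `unicodedata` module purely as an illustration, not the proposed Scheme procedures. The sample string and the `code_points` helper are my own choices for the example.

```python
import unicodedata

# Sample: U+FB01 (the "fi" ligature, a compatibility character)
# followed by U+00EA (precomposed e with circumflex).
sample = "\ufb01\u00ea"

def code_points(s):
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Apply each of the four normalization forms to the same input.
forms = {
    form: unicodedata.normalize(form, sample)
    for form in ("NFD", "NFKD", "NFC", "NFKC")
}

for form, result in forms.items():
    print(f"{form:5} {code_points(result)}")
```

The D forms split ê into e plus a combining circumflex (U+0302), the C forms recombine it, and only the K forms replace the ligature with plain "f" and "i".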

Originally, I imagined just picking one. But the one I would have
picked is NFC, and by the time you have NFC, it's a small step to have
all four. Also, all of them are useful to programs that deal with
Unicode somewhat explicitly.

I recommend against using any normalization for symbols or input
streams. Here's my rationale:

   * Normalizing symbols without normalizing strings will lead to
     confusion, since 'ê (that's a quote followed by U+00EA) would not
     be the same as (string->symbol "ê") with NFD or NFKD
     normalization. If NFC or NFKC is used, adjust the example by
     decomposing ê to two characters in the string.

   * For consistency overall, normalization needs to be pushed down to
     the lexical level. Otherwise, #\ê might turn out to be the ê
     character, or it might be a syntax error (i.e., a #\e followed by
     a non-delimiting character). In other words, we'd have to either
     require a program to be represented as a normalized stream of
     characters, or specify normalization as part of parsing.

   * Normalizing things like strings probably interferes with
     representing literal data in strings. For example, I would guess
     that pathnames like "ê" typically use NFC-like encodings, but I'm
     not sure.
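
The first two points above can be sketched roughly as follows. This is a Python model, not Scheme: symbols are modeled as plain strings, and `read_symbol` and `string_to_symbol` are hypothetical stand-ins for a reader that applies NFC to symbols and a `string->symbol` that leaves string contents alone.

```python
import unicodedata

DECOMPOSED = "e\u0302"   # e + COMBINING CIRCUMFLEX ACCENT, displays as ê
PRECOMPOSED = "\u00ea"   # the single code point ê

def read_symbol(name):
    """Hypothetical reader: normalizes symbol names to NFC."""
    return unicodedata.normalize("NFC", name)

def string_to_symbol(s):
    """Hypothetical string->symbol: uses the string contents unchanged."""
    return s

# The same decomposed text, once as a quoted symbol and once as a
# string literal passed to string->symbol.
quoted = read_symbol(DECOMPOSED)
converted = string_to_symbol(DECOMPOSED)
same_symbol = quoted == converted
print(same_symbol)  # False: the two "ê" symbols differ

# At the lexical level, the two spellings of ê have different lengths,
# so a character literal sees either one character or an "e" followed
# by a combining mark.
print(len(PRECOMPOSED), len(DECOMPOSED))  # 1 2
```

Under these assumptions, two pieces of source text that look identical on screen produce different symbols, which is the confusion described in the first point.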

By not specifying normalization, however, we push the problem into
programming environments and editors. If a programmer types an "ê", for
example, the specific meaning will depend on how the editor saves the
program text. So, I'm not sure it's the right approach.
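
To make the editor dependence concrete, here is a small sketch, again in Python for illustration only. It assumes the editor saves UTF-8; the variable names are mine.

```python
import unicodedata

# Two ways an editor might store the "ê" a programmer typed.
saved_precomposed = "\u00ea".encode("utf-8")   # one code point
saved_decomposed = "e\u0302".encode("utf-8")   # e + combining circumflex

print(saved_precomposed)  # b'\xc3\xaa'
print(saved_decomposed)   # b'e\xcc\x82'

# The files differ byte for byte, so a reader that does not normalize
# sees two different programs.
files_equal = saved_precomposed == saved_decomposed

# Normalizing after decoding would make them agree again.
normalized_equal = (
    unicodedata.normalize("NFC", saved_precomposed.decode("utf-8"))
    == unicodedata.normalize("NFC", saved_decomposed.decode("utf-8"))
)
print(files_equal, normalized_equal)  # False True
```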

