[R6RS] Procedures that depend on Unicode character classification

William D Clinger will at ccs.neu.edu
Tue Jun 13 22:11:50 EDT 2006


Mike wrote:
> I'd like to suggest that the procedures that depend on the Unicode
> character classification, i.e. the ...-ci... procedures, the
> {char,string}...case procedures, and `char-general-property' procedure
> be moved to a separate library.
>
> They're by far not as frequently used as most of the others, and
> leaving them out of a program might save significant space, as the
> tables they need to operate tend to weigh in roughly on the order of
> 100 kbytes (depending on the compact representation chosen).

Mike's estimate surprised me, so I spent a couple of hours
writing a parser for the UCD File Format and investigating
the table sizes.  My estimated table sizes for a 32-bit
system, with no serious compression, are:

tables for case folding and the -ci procedures:
    9 kbytes
tables for char-general-property and associated predicates:
   10 kbytes
tables for the four normalization procedures:
   20 kbytes

I don't fully understand string normalization yet, so I'm
less confident about that estimate than the other two, but
I think my estimated 20 kbytes is more likely to be on the
high side than on the low side.
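
For anyone who wants to reproduce this kind of estimate, here is a
minimal Python sketch. It is not the parser described above: the
sample records, the names parse_ucd, simple_lowercase_table and
table_bytes, and the layout of two 4-byte words per entry are all
assumptions made for illustration. It reads semicolon-separated
records in the UnicodeData.txt layout, keeps the simple lowercase
mapping (field 13, counting from 0), and reports the size of an
uncompressed table.

```python
# Sketch only. The sample data, the function names, and the
# two-words-per-entry layout are assumptions for illustration.

# Three records in the UnicodeData.txt layout (15 fields each).
SAMPLE = """\
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
"""


def parse_ucd(text):
    """Yield a list of fields for each non-empty, non-comment line."""
    for line in text.splitlines():
        # Other UCD files use '#' comments; strip them if present.
        line = line.split("#", 1)[0].strip()
        if line:
            yield [field.strip() for field in line.split(";")]


def simple_lowercase_table(records):
    """Return sorted (code point, simple lowercase mapping) pairs."""
    pairs = []
    for fields in records:
        # Field 13 is empty when the character has no lowercase mapping.
        if len(fields) > 13 and fields[13]:
            pairs.append((int(fields[0], 16), int(fields[13], 16)))
    return sorted(pairs)


def table_bytes(pairs, word_bytes=4):
    """Uncompressed size, assuming two machine words per entry."""
    return len(pairs) * 2 * word_bytes


table = simple_lowercase_table(parse_ucd(SAMPLE))
print(len(table), "entries,", table_bytes(table), "bytes")
```

Pointing parse_ucd at the contents of a real UnicodeData.txt instead
of SAMPLE gives the uncompressed size for one mapping. Any real
table would probably use a more compact representation, so a figure
obtained this way is only an upper bound for this layout.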

Will


