[R6RS] Procedures that depend on Unicode character classification

Thu Jun 15 13:57:05 EDT 2006

I'm quoting more context than usual.

Matthew wrote:
> > The Unicode general categories are represented by symbols
> > in lower case, e.g. 'lu instead of 'Lu.  Is this really
> > what we intend for a case-sensitive R6RS?
> 
> That's what I intended, at least. I have no objection to using 'Lu.

I'd like to raise this as an issue for next week's
conference call.  The Unicode documentation seems
to use mixed case (e.g. Lu) pretty consistently for
these categories, and I think it would be confusing
to have a case-sensitive language that deviates from
the Unicode standard's usage.

BTW, Matthew, I appreciate your answers to my ignorant
questions about Unicode.

Mike wrote:
> I'd like to suggest that the procedures that depend on the Unicode
> character classification, i.e. the ...-ci... procedures, the
> {char,string}...case procedures, and `char-general-property' procedure
> be moved to a separate library.
> 
> They're by far not as frequently used as most of the others, and
> leaving them out of a program might save significant space, as the
> tables they need to operate tend to weigh in roughly at the order of
> 100kbytes (depending on the compact representation chosen).

If these procedures are not frequently used, an
implementor might consider trading speed for space.

> But anyways, Scheme 48 puts the category, along with 1:1 case-mappings
> and special-case information into a single word.  It then uses
> a compact array to represent the table, which should be faster than
> binary search in most cases.  (Two indirections always.)

With the implementation of char-general-category
that I checked in last night, the double indirection
used for common characters is about 9 times as fast
as the binary search used for uncommon characters.

That's 160 nsec versus 1.5 usec on a SunBlade 1500.
For comparison, the char-whitespace? procedure of
Scheme 48 v1.3 came in at about 5.8 usec on that
benchmark and machine.  When you have more speed
to trade for space, the time/space tradeoff may
change enough to justify a different design.
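
To make the two lookup strategies concrete, here is a rough
sketch in Python.  It is not the code that was checked in:
the block size, the cutoff between common and uncommon
characters, and every name in it are assumptions chosen for
illustration, and the category data is borrowed from Python's
unicodedata module rather than from hand-built tables.
Code points below the cutoff go through two array
indirections; everything else goes through a binary search
over run-length-encoded ranges.

```python
import bisect
import unicodedata

BLOCK_BITS = 8                  # assumed block size: 256 code points
BLOCK_SIZE = 1 << BLOCK_BITS
COMMON_LIMIT = 0x3000           # assumed cutoff; a multiple of BLOCK_SIZE
MAX_CODE_POINT = 0x10FFFF


def category(cp):
    """Reference answer, taken from Python's own Unicode database."""
    return unicodedata.category(chr(cp))


def build_two_level(limit):
    """Build (index, blocks) for code points below limit.

    index maps the high bits of a code point to a block number;
    identical blocks are stored only once to save space.
    """
    index, blocks, seen = [], [], {}
    for start in range(0, limit, BLOCK_SIZE):
        block = tuple(category(cp) for cp in range(start, start + BLOCK_SIZE))
        if block not in seen:
            seen[block] = len(blocks)
            blocks.append(block)
        index.append(seen[block])
    return index, blocks


def build_ranges(start, stop):
    """Run-length encode categories for start..stop (inclusive).

    Returns two parallel lists: the first code point of each run,
    in increasing order, and the category shared by that run.
    """
    starts, cats = [], []
    for cp in range(start, stop + 1):
        c = category(cp)
        if not cats or cats[-1] != c:
            starts.append(cp)
            cats.append(c)
    return starts, cats


INDEX, BLOCKS = build_two_level(COMMON_LIMIT)
RANGE_STARTS, RANGE_CATS = build_ranges(COMMON_LIMIT, MAX_CODE_POINT)


def general_category(cp):
    """Return the general category of code point cp as a string."""
    if cp < COMMON_LIMIT:
        # Common characters: always exactly two indirections.
        return BLOCKS[INDEX[cp >> BLOCK_BITS]][cp & (BLOCK_SIZE - 1)]
    # Uncommon characters: find the last run starting at or before cp.
    i = bisect.bisect_right(RANGE_STARTS, cp) - 1
    return RANGE_CATS[i]
```

With these tables, general_category(ord('A')) returns 'Lu'.
Moving the cutoff up or down trades table space for the
fraction of lookups that take the slower path.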

> That's true, but I did write the Scheme 48 code with an eye towards
> portability (it relies on no Scheme-48-specific procedures, but does
> use some SRFIs), and, given the effort I had to put into it, porting
> it would seem to be much less work than doing it from scratch.

The current release of Scheme 48 does not contain
that code.  When I asked whether I could look at
it to see whether I could reduce the table sizes,
you said you doubted whether I could.  I believed
you in the sense that I doubted whether I could
reduce the table sizes given your design, but I
also believed that a different design might yield
smaller tables.  The only way to find out is to
attempt a different design.

With the straightforward design I am implementing,
the reference implementation will represent about
eight hours of work.  That is not much compared
to the effort of changing all of Larceny's code
generators to accommodate Unicode characters and
strings.

> PS: Have you looked at the bytes proposal? :-)

Yes.  I posted my comments a few minutes ago.

Will