[R6RS] Procedures that depend on Unicode character classification

William D Clinger will at ccs.neu.edu
Wed Jun 14 09:56:09 EDT 2006


Concerning my estimated table sizes for Unicode support,
Mike wrote:
> Could you give hints as to what representations you used?

The following implementation of char-general-property has
not been tested, but it should give you an idea.

Will

--------

; Given a character, returns its Unicode general property.
; The tables used to implement this procedure occupy about 9260 bytes.

(define (char-general-property c)
  (let ((n (char->integer c)))
    (vector-ref vector-of-general-property-symbols
        (if (< n 256)
            (bytes-ref general-property-indices-for-common-characters n)
            (bytes-ref general-property-indices-for-all-characters
                       (binary-search
                        n
                        vector-of-code-points-with-same-property))))))

; Given an exact integer k and a vector of exact integers
; in increasing order such that k is at least as large as
; the first element of the vector, returns the largest i
; such that element i of the vector is less than or equal to k.

(define (binary-search k vec) ...)

; The symbols that represent Unicode general properties.
; There are 30 of these.
; This table occupies about 128 bytes, not counting
; the space occupied by the symbols themselves.

(define vector-of-general-property-symbols
  '#(Lu Ll Lt Lm Lo
     Mn Mc Me
     Nd Nl No
     Pc Pd Ps Pe Pi Pf Po
     Sm Sc Sk So
     Zs Zl Zp
     Cc Cf Cs Co Cn))

; The following array of bytes implements a direct mapping
; from small code points to indices into the above vector.
; This array occupies about 264 bytes.

(define general-property-indices-for-common-characters
  (list->bytes
   '(25 25 25 ...)))

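; The following array of bytes, together with the vector below it,
; implements an indirect mapping from all Unicode scalar values to
; indices into vector-of-general-property-symbols: entry i of this
; array is the index for the run of scalar values that begins at
; entry i of the vector below.
; This array occupies about 1780 bytes.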

(define general-property-indices-for-all-characters
  (list->bytes
   '(25 22 17 19 ...)))

; The following vector of exact integers lists, in increasing
; order, each Unicode scalar value whose general category differs
; from that of the scalar value immediately preceding it, that is,
; the first scalar value of each run that shares one category.
; This vector contains 1772 entries, and occupies about 7096 bytes.

(define vector-of-code-points-with-same-property
  '#(#x00 #x20 #x21 #x24 #x25 #x28 #x29 #x2a
     #x2b #x2c #x2d #x2e #x30 #x3a #x3c #x3f
     #x41 #x5b #x5c #x5d #x5e #x5f #x60 #x61
     #x7b #x7c #x7d #x7e #x7f ...))

[end of implementation]
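
The body of binary-search is elided above. As a minimal sketch of one way it could be written (an assumption, not the code behind the size estimates), the version below returns the largest index whose element is less than or equal to k. The names toy-starts and toy-indices are hypothetical: they hold only the first four entries shown in the message, and plain vectors stand in for the byte arrays so the sketch does not depend on bytes-ref or list->bytes.

```scheme
; Sketch only: one possible body for the elided binary-search.
; Assumes vec is sorted in increasing order, is non-empty, and
; that k >= (vector-ref vec 0).
(define (binary-search k vec)
  (let loop ((lo 0)                      ; invariant: vec[lo] <= k
             (hi (vector-length vec)))   ; invariant: hi = length or vec[hi] > k
    (if (= (+ lo 1) hi)
        lo
        (let ((mid (quotient (+ lo hi) 2)))
          (if (<= (vector-ref vec mid) k)
              (loop mid hi)
              (loop lo mid))))))

; Hypothetical toy tables: only the first four entries shown above,
; so lookups are meaningful only for code points below #x25.
(define toy-starts  '#(#x00 #x20 #x21 #x24))
(define toy-indices '#(25 22 17 19))

(binary-search #x1f toy-starts)                            ; 0
(binary-search #x20 toy-starts)                            ; 1
(vector-ref toy-indices (binary-search #x23 toy-starts))   ; 17, i.e. Po
```

Each step halves the interval between lo and hi, so a lookup in the full 1772-entry vector needs about 11 comparisons.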

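The two tables behind the indirect mapping are a run-length compression of a full per-code-point table. The following sketch shows one way such tables could be derived; compress-runs and its argument are hypothetical names, and the message does not say how the real tables were generated.

```scheme
; Sketch only: derive the run-start table and the per-run index table.
; `categories` is a hypothetical vector whose element n is the
; property index of code point n.
; Returns a pair (starts . indices) of lists, both in increasing
; order of code point.
(define (compress-runs categories)
  (let loop ((n (- (vector-length categories) 1))
             (starts '())
             (indices '()))
    (cond ((< n 0) (cons starts indices))
          ; Code point n begins a run if it is the first code point
          ; or its index differs from that of its predecessor.
          ((or (= n 0)
               (not (= (vector-ref categories n)
                       (vector-ref categories (- n 1)))))
           (loop (- n 1)
                 (cons n starts)
                 (cons (vector-ref categories n) indices)))
          (else (loop (- n 1) starts indices)))))

; Toy input: seven code points forming four runs.
(compress-runs '#(25 25 22 17 17 17 19))
; starts are (0 2 3 6), indices are (25 22 17 19)
```

Scanning from the highest code point downward lets each cons build the result lists already in increasing order, with no final reversal.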

