[R6RS] Procedures that depend on Unicode character classification
William D Clinger
will at ccs.neu.edu
Wed Jun 14 09:56:09 EDT 2006
Concerning my estimated table sizes for Unicode support,
Mike wrote:
> Could you give hints as to what representations you used?
The following implementation of char-general-property has
not been tested, but it should give you an idea.
Will
--------
; Given a character, returns its Unicode general property.
; The tables used to implement this procedure occupy about 9260 bytes.
(define (char-general-property c)
  (let ((n (char->integer c)))
    (vector-ref vector-of-general-property-symbols
                (if (< n 256)
                    (bytes-ref general-property-indices-for-common-characters n)
                    (bytes-ref general-property-indices-for-all-characters
                               (binary-search
                                n
                                vector-of-code-points-with-same-property))))))
; Given an exact integer k and a vector of exact integers
; in increasing order such that k is at least as large as
; the first element of the vector, returns the largest i
; such that element i of the vector is less than or equal to k.
(define (binary-search k vec) ...)
; The symbols that represent Unicode general properties.
; There are 30 of these.
; This table occupies about 128 bytes, not counting
; the space occupied by the symbols themselves.
(define vector-of-general-property-symbols
  '#(Lu Ll Lt Lm Lo
     Mn Mc Me
     Nd Nl No
     Pc Pd Ps Pe Pi Pf Po
     Sm Sc Sk So
     Zs Zl Zp
     Cc Cf Cs Co Cn))
; The following array of bytes implements a direct mapping
; from small code points to indices into the above vector.
; This array occupies about 264 bytes.
(define general-property-indices-for-common-characters
  (list->bytes
   '(25 25 25 ...)))
; The following array of bytes, together with the vector below it,
; implements an indirect mapping from all Unicode scalar values to
; indices into vector-of-general-property-symbols.
; This array occupies about 1780 bytes.
(define general-property-indices-for-all-characters
  (list->bytes
   '(25 22 17 19 ...)))
; The following vector of exact integers lists the Unicode
; scalar values whose Unicode general category is different
; from that of the Unicode scalar value immediately less
; than it.
; This vector contains 1772 entries, and occupies about 7096 bytes.
(define vector-of-code-points-with-same-property
  '#(#x00 #x20 #x21 #x24 #x25 #x28 #x29 #x2a
     #x2b #x2c #x2d #x2e #x30 #x3a #x3c #x3f
     #x41 #x5b #x5c #x5d #x5e #x5f #x60 #x61
     #x7b #x7c #x7d #x7e #x7f ...))
[end of implementation]
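The body of binary-search is elided above. What follows is only a
sketch of one way it might be written, not necessarily the code behind
the size estimates. It assumes the vector is non-empty and sorted in
increasing order, and it returns the largest index whose element is at
most k, which is what the indirect lookup needs. The demo-* names are
hypothetical: they hold just the first four entries shown in each table
above, stored in plain vectors instead of byte arrays, so the demo
lookup is meaningful only for code points below #x25.

```scheme
; Sketch of the elided binary search (an assumption, not the
; original code).  Requires a non-empty vec in increasing order
; with (vector-ref vec 0) <= k.  Returns the largest i such that
; (vector-ref vec i) <= k.
(define (binary-search k vec)
  ; Invariant: (vector-ref vec lo) <= k, and either hi is the
  ; length of vec or k < (vector-ref vec hi).
  (let loop ((lo 0) (hi (vector-length vec)))
    (if (= (+ lo 1) hi)
        lo
        (let ((mid (quotient (+ lo hi) 2)))
          (if (<= (vector-ref vec mid) k)
              (loop mid hi)
              (loop lo mid))))))

; Hypothetical demo tables: the 30 symbols from the post, plus
; only the first four entries shown for the two indirect tables.
(define demo-symbols
  '#(Lu Ll Lt Lm Lo Mn Mc Me Nd Nl No
     Pc Pd Ps Pe Pi Pf Po Sm Sc Sk So
     Zs Zl Zp Cc Cf Cs Co Cn))
(define demo-boundaries '#(#x00 #x20 #x21 #x24))
(define demo-indices '#(25 22 17 19))

; Indirect lookup only; valid for code points below #x25 because
; the demo tables are truncated.
(define (demo-general-property c)
  (vector-ref demo-symbols
              (vector-ref demo-indices
                          (binary-search (char->integer c)
                                         demo-boundaries))))

(demo-general-property #\space)   ; => Zs
(demo-general-property #\#)       ; => Po
(demo-general-property #\$)       ; => Sc
```

Each entry of the boundary vector marks the start of a run of code
points sharing one category, so the search only has to find the run
that contains n; the parallel index table then supplies the category
for that run.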