[R6RS] proposed conversions between strings and bytevectors

William D Clinger will at ccs.neu.edu
Fri Mar 23 17:32:07 EDT 2007


I volunteered to draft rough specifications of the
following six procedures, which I recommend be added
to either library section 1.2 (Unicode strings) or
library chapter 2 (bytevectors).

                                * * *

(string->utf8 string)                                  procedure

Returns a newly allocated (unless empty) bytevector that
contains the UTF-8 encoding of the given string.

(string->utf16 string)                                 procedure
(string->utf16 string endianness)                      procedure

Returns a newly allocated (unless empty) bytevector that
contains the UTF-16BE or UTF-16LE encoding of the given
string (with no byte mark).

The endianness, if specified, must be the symbol big or the
symbol little.  If endianness is not specified or is big, then
UTF-16BE is used.  If endianness is the symbol little, then
UTF-16LE is used.

(string->utf32 string)                                 procedure
(string->utf32 string endianness)                      procedure

Returns a newly allocated (unless empty) bytevector that
contains the UTF-32BE or UTF-32LE encoding of the given
string (with no byte mark).

The endianness, if specified, must be the symbol big or the
symbol little.  If endianness is not specified or is big, then
UTF-32BE is used.  If endianness is the symbol little, then
UTF-32LE is used.

(utf8->string bytevector)                              procedure

The bytevector must contain the representation of a string in
UTF-8; otherwise an exception is raised.  Returns a newly
allocated (unless empty) string whose character sequence is
encoded by the given bytevector.

(utf16->string bytevector)                             procedure
(utf16->string bytevector endianness)                  procedure

If no endianness is specified, the bytevector must contain the
representation of a string according to the UTF-16 encoding
scheme (which permits but does not require a byte order mark).
If endianness is the symbol big, the bytevector must contain
the UTF-16BE encoding of a string.  If endianness is the symbol
little, the bytevector must contain the UTF-16LE encoding of a
string.  If endianness is anything other than the symbol big or
the symbol little, then an exception is raised.

Returns a newly allocated (unless empty) string whose character
sequence is encoded by the given bytevector.

(utf32->string bytevector)                             procedure
(utf32->string bytevector endianness)                  procedure

If no endianness is specified, the bytevector must contain the
representation of a string according to the UTF-32 encoding
scheme (which permits but does not require a byte order mark).
If endianness is the symbol big, the bytevector must contain
the UTF-32BE encoding of a string.  If endianness is the symbol
little, the bytevector must contain the UTF-32LE encoding of a
string.  If endianness is anything other than the symbol big or
the symbol little, then an exception is raised.

Returns a newly allocated (unless empty) string whose character
sequence is encoded by the given bytevector.

                                * * *

Note:  The six procedures specified above could be implemented
by the following (untested) definitions.  The complexity and
length of this code is part of the rationale for adding these
procedures to the R6RS.

    (import (r6rs bytevector)
            (r6rs i/o ports)
            (r6rs arithmetic bitwise))

    (define (string->utf8 string)
      (call-with-bytevector-output-port
       (lambda (out) (put-string out string))
       (make-transcoder (utf-8-codec))))

    ; (utf-32-codec) might write a byte order mark,
    ; so it's better not to use textual i/o for this.

    (define (string->utf16 string . rest)
      (let* ((endianness (cond ((null? rest) 'big)
                               ((eq? (car rest) 'big) 'big)
                               ((eq? (car rest) 'little) 'little)
                               (else (endianness-violation
                                      'string->utf16
                                      (car rest)))))
             (n (string-length string)))
        (call-with-bytevector-output-port
         (lambda (out)
           (define put16
             (if (eq? endianness 'big)
                 (lambda (u16)
                   (put-u8 out (bitwise-bit-field u16 8 16))
                   (put-u8 out (bitwise-bit-field u16 0 8)))
                 (lambda (u16)
                   (put-u8 out (bitwise-bit-field u16 0 8))
                   (put-u8 out (bitwise-bit-field u16 8 16)))))
           (do ((i 0 (+ i 1)))
               ((= i n))
             (let ((sv (char->integer (string-ref string i))))
               (if (<= sv #xffff)
                   (put16 sv)
                   (let* ((sp1 (bitwise-or #xdc00 (bitwise-and sv #x03ff)))
                          (sp0 (bitwise-or #xd800
                                           (bitwise-bit-field
                                            (- sv #x10000)
                                            10 20))))
                     (put16 sp0)
                     (put16 sp1)))))))))

    ; The current draft response to formal comment 181
    ; says we are removing utf-32-codec from the report,
    ; so we can't use textual i/o for this.

    (define (string->utf32 string . rest)
      (let* ((endianness (cond ((null? rest) 'big)
                               ((eq? (car rest) 'big) 'big)
                               ((eq? (car rest) 'little) 'little)
                               (else (endianness-violation
                                      'string->utf32
                                      (car rest)))))
             (n (string-length string))
             (result (make-bytevector (* 4 n))))
        (do ((i 0 (+ i 1)))
            ((= i n) result)
          (bytevector-u32-set! result
                               (* 4 i)
                               (char->integer (string-ref string i))
                               endianness))))

    (define (utf8->string bytevector)
      (call-with-port
       (lambda (in)
        (call-with-string-output-port
         (lambda (out)
           (put-string out (get-string-all in)))))
       (open-bytevector-input-port bytevector
                                   (make-transcoder (utf-8-codec)))))

    (define (utf16->string bytevector . rest)
      (let* ((endianness (cond ((null? rest) 'big)
                               ((eq? (car rest) 'big) 'big)
                               ((eq? (car rest) 'little) 'little)
                               (else (endianness-violation
                                      'utf16->string
                                      (car rest)))))

             ; There is no standard codec for UTF-16LE, so
             ; we copy the bytevector while swapping its bytes.

             (bytevector (if (eq? endianness 'little)
                             (let* ((n (bytevector-length bytevector))
                                    (bv (make-bytevector n)))
                               (if (odd? n)
                                   (assertion-violation
                                    'utf16->string
                                    "Bytevector has odd length." bytevector)
                                   (do ((i 0 (+ i 2)))
                                       ((= i n) bytevector)
                                     (let ((octet0
                                            (bytevector-ref bytevector i))
                                           (octet1
                                            (bytevector-ref bytevector
                                                            (+ i 1))))
                                       (bytevector-set! bv i octet1)
                                       (bytevector-set! bv (+ i 1) octet0)))))
                             bytevector)))

        (call-with-string-output-port
         (lambda (out) (put-string out string))
         (make-transcoder (utf-16-codec)))))

    ; The current draft response to formal comment 181
    ; says we are removing utf-32-codec from the report,
    ; so we can't use textual i/o for this.

    (define (utf32->string bytevector . rest)
      (let* ((endianness (cond ((null? rest) 'big)
                               ((eq? (car rest) 'big) 'big)
                               ((eq? (car rest) 'little) 'little)
                               (else (endianness-violation
                                      'string->utf32
                                      (car rest)))))
             (n (bytevector-length bytevector))
             (result (if (zero? (remainder n 4))
                         (make-string (quotient n 4))
                         (assertion-violation
                          'utf32->string
                          "Bytevector has bad length." bytevector))))
        (do ((i 0 (+ i 4))
             (j 0 (+ j 1)))
            ((= i n) result)
          (string-set! result
                       j
                       (integer->char
                        (bytevector-u32-ref bytevector i endianness))))))

    (define (endianness-violation who what)
      (assertion-violation who "Bad endianness." what))

                                * * *

Will



More information about the R6RS mailing list