[R6RS] Unicode scalar value escape sequences

Marc Feeley feeley
Tue Mar 1 08:33:28 EST 2005


> Marc> Increasingly I believe it is wrong in a high-level language like
> Marc> Scheme to have both a character type and a string type.  The character
> Marc> data type is really low-level, archaic and motivated by performance.
> Marc> Ask yourself this question: if performance was not an issue would it
> Marc> be possible to do text processing (elegantly) using only the string
> Marc> data type and the following primitives?
> 
> Marc>      (string-length str)
> Marc>      (substring str start end)
> Marc>      (string-append str...)
> Marc>      (string=? str...) ; and <, <=, ...
> Marc>      (char->integer str)  ; where str is a string of length 1
> Marc>      (integer->char n)    ; returns a string of length 1
> Marc>      (read-char [port])   ; returns a string of length 1
> 
> But you still retain CHAR->INTEGER, INTEGER->CHAR, and READ-CHAR.
> (And thus, a lot of the current character predicates.)  So the only
> real difference in your proposal is that you'd have (string?
> #\a) (or whatever the character notation is) return #t.  Right?

Well, strings would still use the "..." notation, so instead of
writing (string? #\a) you'd write (string? "a"), and instead
of (char->integer #\a) you'd write (char->integer "a"), and instead
of (string-ref "abc" 1) you'd write (substring "abc" 1 2), or
you could keep the string-ref procedure and define it as

   (define (string-ref str i) (substring str i (+ i 1)))
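
To make the idea concrete, here is a minimal sketch of the string-only
primitives layered on top of today's Scheme.  The names text-ref,
text->integer and integer->text are hypothetical (chosen so the standard
procedures are not shadowed), and the sketch assumes an implementation
that provides error (R7RS or SRFI 23).

```scheme
;; A "character" is simply a string of length 1.
(define (text-ref str i)
  (substring str i (+ i 1)))

;; Counterpart of (char->integer str) from the proposal.
(define (text->integer str)
  (if (= (string-length str) 1)
      (char->integer (string-ref str 0))
      (error "text->integer: expected a string of length 1" str)))

;; Counterpart of (integer->char n): returns a string of length 1.
(define (integer->text n)
  (string (integer->char n)))

(text-ref "abc" 1)      ; => "b"
(text->integer "a")     ; => 97
(integer->text 98)      ; => "b"
```

With these, every text value a program handles satisfies string?, and
no separate character predicate is needed.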

Having strings as the sole text type would also eliminate the problem
of Unicode case mappings that turn a single character into more than
one character.  For example, upcasing the German eszett (code point
U+00DF) is supposed to give "SS" (note that even though the 1996
spelling reform reduced the use of the eszett in German, there are
still plenty of texts around that do use it).  So char-upcase would
be replaced by string-upcase, and

   (string-upcase "\u00DF")  =>  "SS"
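
As a rough illustration of why the result has to be a string, here is a
sketch of an upcasing procedure that works one length-1 string at a
time.  The names eszett, text-upcase-1 and text-upcase are hypothetical,
only the U+00DF special case is handled, and the sketch assumes the
implementation can represent that character in a string.

```scheme
;; Build the one-character string without relying on a string escape.
(define eszett (string (integer->char #xDF)))

;; Upcase a string of length 1; the result may be longer than 1.
(define (text-upcase-1 s)
  (if (string=? s eszett)
      "SS"
      (string (char-upcase (string-ref s 0)))))

;; Upcase a whole string by appending the per-position results.
;; Repeated string-append is quadratic, which is fine for a sketch.
(define (text-upcase str)
  (let loop ((i 0) (acc ""))
    (if (= i (string-length str))
        acc
        (loop (+ i 1)
              (string-append acc
                             (text-upcase-1 (substring str i (+ i 1))))))))

(text-upcase (string-append "stra" eszett "e"))   ; => "STRASSE"
```

A char-upcase that must return exactly one character cannot express the
"SS" case, whereas a string-returning procedure handles it without any
special machinery.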

> Marc> I wonder how novice users react when confronted with the two text
> Marc> related datatypes in most current languages (strings and
> Marc> characters).
> 
> I think characters and strings are intuitive concepts to most human
> beings, not just novice Scheme users.

I'm not convinced.

Assuming we keep the character type in Scheme, an alternative syntax to
consider is #"<char>" for characters, which would make it compatible
with the string syntax.  We would have:

   #"a"      = #\a
   #"\\"     = #\\
   #" "      = #\space
   #"\n"     = #\newline
   #"\u00df" = (integer->char #x00df)  i.e. equivalent to Matthew's #\u00df
   #"\0"     = (integer->char 0)       i.e. equivalent to Matthew's #\nul
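
Since no current reader accepts #"...", the equivalences above can only
be approximated at run time.  The following sketch uses a hypothetical
procedure string->character to stand in for what the reader would do,
and assumes the implementation supports the \n string escape (R6RS and
R7RS do, R5RS does not require it).

```scheme
;; Hypothetical stand-in for the #"<char>" reader syntax:
;; turn a string of length 1 into the corresponding character.
(define (string->character s)
  (if (= (string-length s) 1)
      (string-ref s 0)
      (error "string->character: expected a string of length 1" s)))

(char=? (string->character "a")  #\a)         ; => #t
(char=? (string->character "\\") #\\)         ; => #t
(char=? (string->character " ")  #\space)     ; => #t
(char=? (string->character "\n") #\newline)   ; => #t
```

The point is that the string escapes already cover every case, so the
character syntax needs no escape vocabulary of its own.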

If this syntax is adopted, I propose we drop the proposed special
syntax for #\nul, #\alarm, etc. since this duplication of
functionality is conceptual clutter.  We could go as far as removing
the #\<char> syntax altogether, but I expect resistance from users due
to inertia.

While we're on the subject of strings, does anyone have a strong
opinion on the mutability of strings?  Would it improve the language to
have immutable strings?  Would it improve the language to have only
immutable strings?  (I'm not talking about strings in literals, which
are currently immutable.)
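
For anyone weighing the question, here is a small sketch contrasting the
two styles.  The procedure string-update is a hypothetical name, not
part of any standard; it shows what an in-place string-set! would turn
into if only immutable strings were available.

```scheme
;; Mutable style: string-set! changes the string in place.
(define s (string-copy "cat"))   ; a fresh copy, so mutation is allowed
(string-set! s 0 #\b)            ; s is now "bat"

;; Immutable style: build a new string and leave the argument alone.
(define (string-update str i replacement)
  (string-append (substring str 0 i)
                 replacement
                 (substring str (+ i 1) (string-length str))))

(define t "cat")
(define u (string-update t 0 "b"))   ; u is "bat", t is still "cat"
```

The immutable version allocates a new string on every update, which is
the performance cost that has to be weighed against the simpler
semantics.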

Marc

