[R6RS] Unicode scalar value escape sequences

Marc Feeley
Mon Feb 28 22:56:06 EST 2005


> After thinking about it a little bit more, I like Marc's proposal for
> allowing regular number literals to specify the scalar value of a
> character literal better than the C/Java clone:
> 
> #\n Unicode character n (n must start with a # character and it must
>     represent an exact integer, for example #\#x20 is the space character,
>     #\#d9 is the tab character, and #\#e1.2e2 is the lower case character
>     "x")
> 
> Of course, the downside is that this doesn't carry over directly to
> string literals.  Marc, have you thought about allowing \#<n> in
> string literals, and requiring some kind of delimiter (like ; or # or
> whatever) after it?

Although Gambit has supported this notation for some time now, I'm not
convinced it is really the best approach.  I think a single syntax
shared by characters and strings would be better.  So if this is a
valid string: "\u1234" then this should be a valid character: #\u1234 .
A syntax like #\#d32, while precise and flexible, is awkward if it
can't be used in strings.
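
To make the "shared syntax" point concrete, here is a minimal sketch of
one decoder serving both a character reader and a string reader.  All
three procedure names are hypothetical, and the sketch assumes that the
hex digits simply run to the end of the token; the thread has not
settled on a digit count or a delimiter.

```scheme
;; Hypothetical helper: decode the body of a "u" escape such as "u1234"
;; into its Unicode scalar value.  Returns #f if the body is malformed.
;; Assumption: all characters after the "u" are hex digits.
;; Limitation: a leading sign accepted by string->number is not rejected.
(define (u-escape->scalar body)
  (and (> (string-length body) 1)
       (string=? (substring body 0 1) "u")
       (let ((n (string->number
                 (substring body 1 (string-length body)) 16)))
         (and n (integer? n) (exact? n) (>= n 0) n))))

;; Character literal text, e.g. the seven characters  #\u0041
(define (char-literal->scalar text)
  (and (> (string-length text) 2)
       (string=? (substring text 0 2) "#\\")
       (u-escape->scalar (substring text 2 (string-length text)))))

;; String escape text, e.g. the six characters  \u0041
(define (string-escape->scalar text)
  (and (> (string-length text) 1)
       (string=? (substring text 0 1) "\\")
       (u-escape->scalar (substring text 1 (string-length text)))))
```

Both readers strip their own prefix and then share u-escape->scalar, so
(char-literal->scalar "#\\u0041") and (string-escape->scalar "\\u0041")
both give 65.  A notation like #\#d32 has no such counterpart inside a
string.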

Increasingly I believe it is wrong in a high-level language like
Scheme to have both a character type and a string type.  The character
data type is really low-level, archaic and motivated by performance.
Ask yourself this question: if performance were not an issue, would it
be possible to do text processing (elegantly) using only the string
data type and the following primitives?

     (string-length str)
     (substring str start end)
     (string-append str...)
     (string=? str...) ; and <, <=, ...
     (char->integer str)  ; where str is a string of length 1
     (integer->char n)    ; returns a string of length 1
     (read-char [port])   ; returns a string of length 1

The last 3 could be generalized to work on strings of any length but
that's not the main point here.  Doing away with the character data
type would have the interesting side-effect that only a syntax for
string escapes is needed.
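
As a rough illustration of the question above, the sketch below does a
little text processing with length-1 strings only.  Since a standard
Scheme still has a character type, the string-only versions of
char->integer and integer->char are emulated by two hypothetical
helpers (string-first-code and code->string); string-at, count-of,
string-reverse-1 and upcase-ascii are likewise invented names.

```scheme
;; Emulation of the proposed primitives on a standard Scheme.
(define (string-first-code str)      ; stands in for (char->integer str)
  (char->integer (string-ref str 0)))

(define (code->string n)             ; stands in for (integer->char n)
  (string (integer->char n)))

;; Length-1 substring at index i; takes the place of string-ref.
(define (string-at str i)
  (substring str i (+ i 1)))

;; Count occurrences of the length-1 string `one` in `str`.
(define (count-of str one)
  (let loop ((i 0) (n 0))
    (cond ((= i (string-length str)) n)
          ((string=? (string-at str i) one) (loop (+ i 1) (+ n 1)))
          (else (loop (+ i 1) n)))))

;; Reverse a string with string-length, substring and string-append.
(define (string-reverse-1 str)
  (let loop ((i 0) (acc ""))
    (if (= i (string-length str))
        acc
        (loop (+ i 1) (string-append (string-at str i) acc)))))

;; Upcase ASCII letters a-z (codes 97-122) via the code primitives.
(define (upcase-ascii str)
  (let loop ((i 0) (acc ""))
    (if (= i (string-length str))
        acc
        (let* ((s (string-at str i))
               (c (string-first-code s)))
          (loop (+ i 1)
                (string-append
                 acc
                 (if (and (>= c 97) (<= c 122))
                     (code->string (- c 32))
                     s)))))))
```

For example, (count-of "banana" "a") gives 3 and (upcase-ascii "aB c")
gives "AB C".  Repeated string-append makes these loops quadratic,
which is exactly the performance cost the question sets aside.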

I wonder how novice users react when confronted with the two
text-related datatypes in most current languages (strings and
characters).  In Scheme, symbols are another text-related type.  I'm
sure this multiplicity of text types must lead to confusion among
novice Scheme users.  The different properties of strings and symbols
justify keeping them as separate types.  I don't think the same can be
said for characters and strings.  If strings were immutable and
interned, all three types would be subsumed by strings.
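
A minimal sketch of what "interned" would buy, assuming a hypothetical
intern procedure backed by a simple list (a real implementation would
use a hash table and would also have to forbid mutation of the result):

```scheme
;; Hypothetical intern table: a list of the strings seen so far.
(define *interned* '())

;; Return the one canonical copy of `str`, adding it on first use.
(define (intern str)
  (let loop ((rest *interned*))
    (cond ((null? rest)
           (let ((copy (string-copy str)))
             (set! *interned* (cons copy *interned*))
             copy))
          ((string=? (car rest) str) (car rest))
          (else (loop (cdr rest))))))
```

With this, (eq? (intern "abc") (intern (string-append "a" "bc"))) is
#t, which is the identity-comparison property that symbols provide
today.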

Marc

