[R6RS] Unicode scalar value escape sequences

Michael Sperber sperber
Tue Mar 1 02:40:34 EST 2005


>>>>> "Marc" == Marc Feeley <feeley at IRO.UMontreal.CA> writes:

>> After thinking about it a little bit more, I like Marc's proposal for
>> allowing regular number literals to specify the scalar value of a
>> character literal better than the C/Java clone:
>> 
>> #\n Unicode character n (n must start with a # character and it must
>>     represent an exact integer, for example #\#x20 is the space character,
>>     #\#d9 is the tab character, and #\#e1.2e2 is the lower case character
>>     "x")
>> 
>> Of course, the downside is that this doesn't carry over directly to
>> string literals.  Marc, have you thought about allowing \#<n> in
>> string literals, and requiring some kind of delimiter (like ; or # or
>> whatever) after it?

Marc> Although Gambit has supported this notation for some time now, I'm not
Marc> convinced it is really the best approach.  I think a syntax that is
Marc> shared by characters and strings would be better (and have a single
Marc> unified syntax).  So if this is a valid string: "\u1234" then this
Marc> should be a valid character #\u1234 .  A syntax like #\#d32, while
Marc> precise and flexible, is awkward if it can't be used in strings.

Yeah, but my question was what you thought about allowing that same
notation with a delimiter in strings.  After all, \u sequences with
fewer than 4 digits must also be delimited.

Let me turn this into a proposal to be clear:

I propose using Gambit's notation for character literals that specify
a character through its scalar value.

I propose allowing 

\<number>;

in string literals, where <number> must start with a # sign, to denote
a character through its scalar value.  (Pick any other delimiter you
like, if it's only the semicolon that keeps you from liking the proposal.)
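
To make the proposed string escape concrete, here is a minimal sketch of a
decoder, written in Python rather than Scheme so it can be run as-is.  The
names parse_scalar and decode_string_body are hypothetical, and the handling
of prefixes (#x, #d, #o, #b, plus #e for exactness) is an assumption about
what "a number that represents an exact integer" would cover; it is not
taken from Gambit or from any draft of the report.

```python
import re
from fractions import Fraction

# Radix prefixes assumed for this sketch (mirroring Scheme's #x #d #o #b).
_RADIX = {"x": 16, "d": 10, "o": 8, "b": 2}

def parse_scalar(literal: str) -> int:
    """Turn a '#'-prefixed number literal such as '#x20' into a scalar value."""
    if not literal.startswith("#"):
        raise ValueError("number must start with '#'")
    radix, exact, rest = 10, False, literal
    # Consume prefixes such as '#x' or '#e', in any order.
    while rest.startswith("#"):
        tag = rest[1:2].lower()
        if tag in _RADIX:
            radix = _RADIX[tag]
        elif tag == "e":
            exact = True
        else:
            raise ValueError("unknown prefix in " + repr(literal))
        rest = rest[2:]
    if exact and radix == 10:
        # '#e' makes a decimal like 1.2e2 exact; Fraction parses it exactly.
        value = Fraction(rest)
    else:
        # Without '#e', only plain integer digits are accepted (int() rejects
        # '1.2e2'), which stands in for "must be an exact integer".
        value = Fraction(int(rest, radix))
    if value.denominator != 1:
        raise ValueError("not an integer: " + repr(literal))
    n = value.numerator
    # Unicode scalar values exclude the surrogate range D800-DFFF.
    if not 0 <= n <= 0x10FFFF or 0xD800 <= n <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: " + repr(literal))
    return n

# Matches the proposed escape: backslash, '#'-prefixed number, semicolon.
_ESCAPE = re.compile(r"\\(#[^;\\]*);")

def decode_string_body(body: str) -> str:
    """Replace every \\#<number>; escape in a string-literal body."""
    return _ESCAPE.sub(lambda m: chr(parse_scalar(m.group(1))), body)

print(repr(decode_string_body(r"a\#x20;b")))   # 'a b'
print(repr(decode_string_body(r"\#e1.2e2;")))  # 'x'
```

The same parse_scalar helper would serve for the character-literal form:
the text after #\ is handed to it unchanged, so #\#x20 and "\#x20;" would
both denote the space character.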

Marc> Increasingly I believe it is wrong in a high-level language like
Marc> Scheme to have both a character type and a string type.  The character
Marc> data type is really low-level, archaic and motivated by performance.
Marc> Ask yourself this question: if performance was not an issue would it
Marc> be possible to do text processing (elegantly) using only the string
Marc> data type and the following primitives?

Marc>      (string-length str)
Marc>      (substring str start end)
Marc>      (string-append str...)
Marc>      (string=? str...) ; and <, <=, ...
Marc>      (char->integer str)  ; where str is a string of length 1
Marc>      (integer->char n)    ; returns a string of length 1
Marc>      (read-char [port])   ; returns a string of length 1

But you still retain CHAR->INTEGER, INTEGER->CHAR, and READ-CHAR.
(And thus, a lot of the current character predicates.)  So the only
real difference in your proposal is that you'd have (string?
#\a) (or whatever the character notation is) return #t.  Right?
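
As a point of comparison, Python happens to have no separate character
type: ord, chr, and reading one character from a text stream all work with
strings of length 1.  The sketch below maps the primitives from Marc's list
onto Python under hypothetical names (string_length, char_to_integer, and
so on) and uses only those to upcase ASCII letters.  It illustrates the
model under discussion and is not a claim about how any Scheme would
implement it.

```python
import io

def string_length(s: str) -> int:
    return len(s)

def substring(s: str, start: int, end: int) -> str:
    return s[start:end]

def string_append(*parts: str) -> str:
    return "".join(parts)

def char_to_integer(s: str) -> int:
    # The argument is a string; it must have length 1.
    if len(s) != 1:
        raise ValueError("expected a string of length 1")
    return ord(s)

def integer_to_char(n: int) -> str:
    # The result is a string of length 1, not a distinct character object.
    return chr(n)

def read_char(port: io.TextIOBase) -> str:
    # Returns a string of length 1, or the empty string at end of input.
    return port.read(1)

def upcase_ascii(s: str) -> str:
    """Upcase a-z using only the string primitives defined above."""
    out = ""
    for i in range(string_length(s)):
        c = substring(s, i, i + 1)
        n = char_to_integer(c)
        if 97 <= n <= 122:          # 'a' .. 'z'
            c = integer_to_char(n - 32)
        out = string_append(out, c)
    return out

print(upcase_ascii("r6rs"))             # R6RS
print(read_char(io.StringIO("abc")))    # a
```

Note that the length-1 check in char_to_integer is exactly where the
character concept survives: the type is gone, but the precondition remains.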

Marc> I wonder how novice users react when confronted with the two text
Marc> related datatypes in most current languages (strings and
Marc> characters).

I think characters and strings are intuitive concepts to most human
beings, not just novice Scheme users.

-- 
Cheers =8-} Mike
Friede, Völkerverständigung und überhaupt blabla