[R6RS] R6RS Unicode SRFI controversial issues
Fri Jun 17 18:12:00 EDT 2005
I'm writing the R6RS Unicode SRFI and have encountered a few issues
which I think may be controversial (I won't mention the other, less
controversial stuff to save bandwidth). Please give your opinion on
the following.
String literal escapes:
1) Matthew suggested using \<newline> (i.e. a backslash at the end of
a line) in a string literal to indicate that the string continues on
the next line. I believe Common Lisp has this too, and it ignores all
the whitespace following the newline (so that strings can be
indented). Should R6RS do the same? Moreover, should R6RS prohibit
newlines in strings that are not preceded by a backslash? My position
is yes on both questions.
2) Matthew suggested using the \u<x>...<x> (with <= 4 hex digits <x>)
and \U<x>...<x> (with <= 6 hex digits <x>) escapes. The syntax is
similar but not exactly the same as Java's. Java requires exactly
4 hex digits. Moreover, Java transforms the \u<x>...<x> escapes into
Unicode characters before lexical analysis (so that "!\u0022 is a
valid string since \u0022 represents the closing doublequote).
I propose that the \u<x>...<x> escape require exactly 4 hex digits and
that the handling of \u<x>...<x> escapes be done as part of the
parser, i.e. "!\u0022 is not a valid string, but "!\u0022" is
equivalent to "!\"". For consistency, I propose that the \U<x>...<x>
escape require exactly 6 hex digits. That makes the syntax easy to
remember:
\<o><o><o> : range 0 to #xFF (as in C)
\x<x><x> : range 0 to #xFF (as in C)
\u<x><x><x><x> : range 0 to #xFFFF excluding
\U<x><x><x><x><x><x> : range 0 to #x10FFFF excluding
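
To make the two string proposals concrete, here is a minimal sketch in
Python (not Scheme, so that it can be run as-is) of a reader that
follows them. The names read_string_literal, StringSyntaxError and
HEX_WIDTHS are hypothetical, and so is the choice to treat only spaces
and tabs as the whitespace skipped after \<newline>. The exclusions in
the \u and \U ranges are not enforced here. The \\ escape is assumed
from R5RS.

```python
import string

# Escape letter -> exact number of hex digits required by the proposal.
HEX_WIDTHS = {"x": 2, "u": 4, "U": 6}


class StringSyntaxError(ValueError):
    """Raised when the text is not a valid string literal (hypothetical name)."""


def read_string_literal(src: str) -> str:
    """Decode one string literal; src[0] must be the opening double quote."""
    if not src.startswith('"'):
        raise StringSyntaxError("expected opening double quote")
    out = []
    i, n = 1, len(src)
    while i < n:
        c = src[i]
        if c == '"':
            # Only a raw double quote closes the literal; an escaped
            # \u0022 is decoded below and never reaches this test.
            return "".join(out)
        if c == "\n":
            raise StringSyntaxError("newline must be preceded by a backslash")
        if c != "\\":
            out.append(c)
            i += 1
            continue
        i += 1  # step over the backslash
        if i >= n:
            break
        e = src[i]
        if e == "\n":
            # Line continuation: drop the newline and the indentation after it.
            i += 1
            while i < n and src[i] in " \t":
                i += 1
        elif e in HEX_WIDTHS:
            width = HEX_WIDTHS[e]
            digits = src[i + 1:i + 1 + width]
            if len(digits) != width or any(d not in string.hexdigits for d in digits):
                raise StringSyntaxError(f"\\{e} needs exactly {width} hex digits")
            code = int(digits, 16)
            if code > 0x10FFFF:
                raise StringSyntaxError("escape is outside the range 0 to #x10FFFF")
            out.append(chr(code))
            i += 1 + width
        elif e in "01234567":
            digits = src[i:i + 3]
            if len(digits) != 3 or any(d not in "01234567" for d in digits):
                raise StringSyntaxError("octal escape needs exactly 3 octal digits")
            code = int(digits, 8)
            if code > 0xFF:
                raise StringSyntaxError("octal escape is outside the range 0 to #xFF")
            out.append(chr(code))
            i += 3
        elif e in '"\\':
            out.append(e)
            i += 1
        else:
            raise StringSyntaxError(f"unknown escape \\{e}")
    raise StringSyntaxError("unterminated string literal")
```

With this sketch, read_string_literal('"!\\u0022"') returns the two
characters !" while read_string_literal('"!\\u0022') raises
StringSyntaxError, because the escape is decoded after the reader has
already decided where the literal ends. A Java-style preprocessing
pass would give the opposite result for both inputs.
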
Character syntax:
1) #\newline is defined as the Unicode character 10, which has
been called linefeed. I suggest we add #\linefeed and that
(char->integer #\newline) = (char->integer #\linefeed) = 10.
2) For consistency with the string literal escapes, the character
syntax in R6RS should support a 2/4/6 digit hexadecimal notation:
#\x<x><x> : range 0 to #xFF
#\u<x><x><x><x> : range 0 to #xFFFF excluding
#\U<x><x><x><x><x><x> : range 0 to #x10FFFF excluding
The octal notation should not be supported because #\0 to #\7 would
be ambiguous (and making a special case for the single digit
would be ugly and error prone).
3) The named characters should be followed by a delimiter, so that
the datum (#\spaceous) is an error instead of being read as
the two-element list (#\space ous) as in R5RS.
4) For consistency with the case sensitivity of symbols (and because
case is significant to distinguish #\u... and #\U...), the named
characters should also be case sensitive. #\Space should be an error.
By the same reasoning, booleans should also be case sensitive, so that
#F and #T are errors.
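
The character proposals in points 1 to 4 can be sketched the same way.
The following Python is only an illustration under stated assumptions:
read_char_literal, CharSyntaxError, NAMED and DELIMITERS are
hypothetical names, the delimiter set is the R5RS one (whitespace,
parentheses, double quote, semicolon), and only three character names
are listed. The exclusions in the #\u and #\U ranges are not enforced.

```python
import string

# Assumed delimiter set, taken from R5RS; end of input also delimits.
DELIMITERS = set(" \t\n()\";")

# Case-sensitive character names; #\linefeed is the proposed alias.
NAMED = {"space": " ", "newline": "\n", "linefeed": "\n"}

# Prefix letter -> exact number of hex digits (no octal form).
HEX_WIDTHS = {"x": 2, "u": 4, "U": 6}


class CharSyntaxError(ValueError):
    """Raised when the text is not a valid character literal (hypothetical name)."""


def read_char_literal(src: str):
    """Return (character, characters consumed) for a literal starting at src[0]."""
    if not src.startswith("#\\") or len(src) < 3:
        raise CharSyntaxError("expected #\\ followed by a character")
    # The first character after #\ always belongs to the literal, even
    # if it is itself a delimiter; then read up to the next delimiter.
    end = 3
    while end < len(src) and src[end] not in DELIMITERS:
        end += 1
    token = src[2:end]
    if len(token) == 1:
        return token, end          # plain character, including #\0 .. #\7 and #\x
    if token in NAMED:
        return NAMED[token], end   # exact, case-sensitive match only
    width = HEX_WIDTHS.get(token[0])
    digits = token[1:]
    if width == len(digits) and all(d in string.hexdigits for d in digits):
        code = int(digits, 16)
        if code <= 0x10FFFF:
            return chr(code), end
    raise CharSyntaxError("unknown character literal: #\\" + token)
```

Because the whole token up to the delimiter is read before it is
classified, #\spaceous and #\Space both raise CharSyntaxError instead
of being split or case-folded, while #\x alone is still the letter x
and #\x41 is the letter A.
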