Lexical syntax and datum syntax

4.1  Symbol and identifier syntax

4.1.1  Escaped symbol constituents

While revising the syntax of symbols and identifiers, the editors’ goal was to make symbols subject to write/read invariance, i.e. to allow each symbol to be written out using put-datum (section on “Textual output”) or write (section on “Simple I/O”), and read back in using get-datum (section on “Textual input”) or read (section on “Simple I/O”), yielding the same symbol. This was not the case in the Revised^5 Report on the Algorithmic Language Scheme (R5RS), as symbols could contain arbitrary characters such as spaces which could not be part of their external representation. Moreover, symbols could distinguish case, whereas their external representation could not.

For representing unusual characters in the symbol syntax, the report provides the \x escape syntax, which allows specifying an arbitrary Unicode scalar value. This also has the advantage that arbitrary symbols can be represented using only ASCII, which allows referencing them from Scheme programs restricted to ASCII or some other subset of Unicode.
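As an illustration of the escape syntax, the following minimal sketch (written as an R6RS top-level program; the particular symbols are chosen only for the example) shows that an escaped spelling and a plain spelling denote the same symbol, and that a symbol containing a space can be spelled using ASCII only.

```scheme
#!r6rs
(import (rnrs))

;; \x65; is the inline hex escape for U+0065 (the letter e), so both
;; spellings denote the same symbol.
(display (eq? 'h\x65;llo 'hello))   ; prints #t
(newline)

;; A symbol whose name contains a space has no plain spelling, but it
;; can be written with an escape for U+0020 (space).
(display (eq? (string->symbol "hello world") 'hello\x20;world))   ; prints #t
(newline)
```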

Among existing implementations of Scheme, a popular choice for extending the set of characters that can occur in symbols is the vertical-bar syntax of Common Lisp. This syntax carries the risk of confusing the syntax of identifiers with that of consecutive lexemes, and also does not allow representing arbitrary characters using only ASCII. Consequently, it was not adopted for R6RS.

4.1.2  Case sensitivity

The change from case-insensitive syntax in R5RS to case-sensitive syntax is a major change. Many technical arguments exist in favor of both case sensitivity and case insensitivity, and any attempt to list them all here would be incomplete. Switching to case sensitivity will break backwards compatibility, and might set a precedent for switching a technically more or less arbitrary decision again in the future.

The editors decided to switch to case sensitivity because they perceived that a significant majority of the Scheme community favored the change. This perception has been strengthened by polls at the 2004 Scheme workshop, on the plt-scheme mailing list, and on the r6rs-discuss mailing list.

The directives described in the appendix on “Optional case insensitivity” allow specifying that a portion of code (or other syntactic data) was written under the old assumption of case insensitivity and therefore must be case-folded upon reading.
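The practical effect of case sensitivity can be sketched as follows. The helper fold-symbol is hypothetical and not part of R6RS; it only imitates, at run time, the kind of folding a reader would apply to old case-insensitive code. The directives themselves are optional, so they are deliberately left out of the example to keep it portable.

```scheme
#!r6rs
(import (rnrs))

;; Under R6RS, Hello and hello are two different symbols.
(display (eq? 'Hello 'hello))                 ; prints #f
(newline)

;; Hypothetical helper (not part of R6RS): fold the case of a
;; symbol's name by hand using string-foldcase.
(define (fold-symbol s)
  (string->symbol (string-foldcase (symbol->string s))))

;; After folding, the two spellings coincide.
(display (eq? (fold-symbol 'Hello) 'hello))   ; prints #t
(newline)
```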

4.1.3  Identifiers starting with ->

R6RS introduces a special rule in the lexical syntax for identifiers starting with the characters ->. In R5RS, such identifiers were not valid lexemes. (In R5RS, a lexeme starting with a - character, other than - itself, has to be a representation of a number object.) However, many existing Scheme implementations prior to R6RS already supported identifiers starting with ->. (Many readers would classify as an identifier any lexeme starting with - for which string->number returns #f.) As a result, a significant amount of otherwise portable Scheme code used identifiers starting with ->, which are a convenient choice for many names. Therefore, R6RS legalizes these identifiers. The separate production in the grammar is not particularly elegant. However, designing a more elegant production that does not overlap with representations of number objects or other lexeme classes has proved to be surprisingly difficult.
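A small sketch of the kind of code this rule legalizes follows. The procedure ->list is a hypothetical conversion helper invented for the example. The two string->number calls illustrate the distinction readers have to draw between a number representation and an identifier when a lexeme starts with -.

```scheme
#!r6rs
(import (rnrs))

;; Hypothetical conversion helper whose name starts with ->.
(define (->list x)
  (cond ((vector? x) (vector->list x))
        ((string? x) (string->list x))
        (else (list x))))

(display (->list '#(1 2 3)))          ; prints (1 2 3)
(newline)

;; A lexeme starting with - may represent a number object ...
(display (string->number "-5"))       ; prints -5
(newline)

;; ... whereas ->list does not, so it can only be an identifier.
(display (string->number "->list"))   ; prints #f
(newline)
```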

4.2  Comments

While R5RS provided only the ; syntax for comments, the report now describes three kinds of comment: In addition to ;, #| and |# delimit block comments, and #; starts a “datum comment”. (#!r6rs is also a kind of comment, albeit with a specific, fixed purpose.)

Block comments provide a more convenient way of writing multi-line comments, and are an often-requested and often-implemented syntactic addition to the language.

The rationale for #; is not as readily apparent: It automatically comments out a single datum, the basic unit of Scheme syntax, something that the other comment mechanisms cannot do. #| ...|# cannot generally be used to comment out an arbitrary datum or set of data. Moreover, while #; is probably most useful during development and debugging, it is still useful to have a standard notation for commenting out a datum, particularly since programmers sometimes develop and debug a single piece of code concurrently on multiple systems.
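The difference between the comment forms can be sketched with a short program. The names xs and n are chosen only for the example. The last definition shows one case where a block comment would be awkward: the datum to be disabled contains the characters |# inside a string, which would end a surrounding block comment early, whereas #; skips the datum as a whole.

```scheme
#!r6rs
(import (rnrs))

; A line comment runs to the end of the line.

#| A block comment can span
   several lines. |#

;; #; comments out exactly one datum, here the parenthesized list.
(define xs (list 1 #;(this whole datum is skipped) 2 3))
(display xs)   ; prints (1 2 3)
(newline)

;; The skipped datum contains |# inside a string; #; still removes
;; the complete datum.
(define n (+ 1 #;(display "|#") 2))
(display n)    ; prints 3
(newline)
```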

4.3  Future extensions

The # is the prefix of several different kinds of syntactic data: vectors, bytevectors, syntactic abbreviations related to quasiquotes and syntax construction, nested comments, characters, #!r6rs, and implementation-specific extensions to the syntax that start with #!. In each case, the character following the # identifies the class of syntactic datum. In the case of bytevectors, the syntax anticipates several different kinds of homogeneous vectors, even though R6RS specifies only one. The u8 after the #v identifies the components of the vector as unsigned 8-bit entities or octets.
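A few of the #-prefixed forms listed above can be seen side by side in the following sketch. The names bv and v are chosen only for the example, and the literals are quoted so that the example does not depend on which of them evaluate to themselves.

```scheme
#!r6rs
(import (rnrs))

;; #vu8( starts a bytevector: u8 marks the elements as octets.
(define bv '#vu8(1 2 255))
(display (bytevector-length bv))      ; prints 3
(newline)
(display (bytevector-u8-ref bv 2))    ; prints 255
(newline)

;; #( starts an ordinary vector, and #\ starts a character.
(define v '#(1 2 3))
(display (vector-length v))           ; prints 3
(newline)
(display (char->integer #\a))         ; prints 97
(newline)
```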