Lexical syntax and datum syntax

4.1  Symbol and identifier syntax

4.1.1  Escaped symbol constituents

While revising the syntax of symbols and identifiers, the editors' goal was to make symbols subject to write/read invariance, i.e. to allow each symbol to be written out using put-datum (section on “Textual output”) or write (section on “Simple I/O”), and read back in using get-datum (section on “Textual input”) or read (section on “Simple I/O”), yielding the same symbol. This was not the case in the Revised⁵ Report on the Algorithmic Language Scheme (R5RS), where symbols could contain arbitrary characters, such as spaces, that could not be part of their external representation. Moreover, symbols could be distinguished internally by case, whereas their external representation could not.

For representing unusual characters in the symbol syntax, the report provides the \x escape syntax, which allows an arbitrary Unicode scalar value to be specified. This also has the advantage that arbitrary symbols can be represented using only ASCII, which allows referencing them from Scheme programs restricted to ASCII or some other subset of Unicode.
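The round trip described above can be illustrated with a minimal Python sketch. It is not the report's reader or writer: the names write_symbol and read_symbol are hypothetical, the set of characters written literally is a deliberately conservative assumption, and the report's rules about which characters may begin an identifier are ignored. Only the escape form itself, \x followed by a hexadecimal scalar value and a semicolon, is taken from the report.

```python
import re

# Assumption: only these ASCII characters are written literally. The report's
# real identifier grammar is richer (e.g. it restricts the first character).
PLAIN = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789!$%&*/:<=>?^_~+-.@"
)

def write_symbol(name: str) -> str:
    """Return an ASCII-only external representation of a symbol name."""
    out = []
    for ch in name:
        if ch in PLAIN:
            out.append(ch)
        else:
            # Inline hex escape: \x<hex scalar value>;
            out.append("\\x" + format(ord(ch), "x") + ";")
    return "".join(out)

def read_symbol(text: str) -> str:
    """Decode every \\x<hex>; escape back into the character it denotes."""
    return re.sub(
        r"\\x([0-9A-Fa-f]+);",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )
```

In this sketch a symbol whose name contains a space, such as "hello world", is written as hello\x20;world, and reading that text back yields the original name. Because the backslash itself is not in the literal set, it is always escaped, so decoding is unambiguous.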

Among existing implementations of Scheme, a popular choice for extending the set of characters that can occur in symbols is the vertical-bar syntax of Common Lisp. This syntax carries the risk of confusing the syntax of identifiers with that of consecutive lexemes, and it does not allow representing arbitrary characters using only ASCII. Consequently, it was not adopted for R6RS.

4.1.2  Case sensitivity

The change from case-insensitive syntax in R5RS to case-sensitive syntax is a major change. Many technical arguments exist in favor of both case sensitivity and case insensitivity, and any attempt to list them all here would be incomplete.

The editors decided to switch to case sensitivity, because they perceived that a significant majority of the Scheme community favored the change. This perception has been strengthened by polls at the 2004 Scheme workshop, on the plt-scheme mailing list, and the r6rs-discuss mailing list.

The suggested directives described in the appendix on “Optional case insensitivity” allow programs to specify that a section of the code (or other syntactic data) was written under the old assumption of case insensitivity and therefore must be case-folded upon reading.

4.1.3  Identifiers starting with ->

R6RS introduces a special rule in the lexical syntax for identifiers starting with the characters ->. In R5RS, such identifiers are not valid lexemes. (In R5RS, a lexeme starting with a - character, other than - itself, must be a representation of a number object.) However, many Scheme implementations prior to R6RS already supported identifiers starting with ->. (Many readers would classify as an identifier any lexeme starting with - for which string->number returns #f.) As a result, a significant amount of otherwise portable Scheme code used identifiers starting with ->, which are a convenient choice for certain names. Therefore, R6RS legalizes these identifiers. The separate production in the grammar is not particularly elegant. However, designing a more elegant production that does not overlap with representations of number objects or other lexeme classes has proven to be surprisingly difficult.
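The effect of the extra production can be sketched with two regular expressions in Python. This is an approximation under stated assumptions: it covers only the ASCII part of the identifier grammar, omits \x escapes and non-ASCII constituents, and the names is_r5rs_identifier and is_r6rs_identifier are hypothetical helpers, not part of either report.

```python
import re

# ASCII-only approximation of the identifier grammar (an assumption of this
# sketch; escapes and Unicode constituents are left out).
INITIAL = r"[A-Za-z!$%&*/:<=>?^_~]"
SUBSEQUENT = r"[A-Za-z0-9!$%&*/:<=>?^_~+\-.@]"

# R5RS: ordinary identifiers plus the peculiar identifiers +, - and ...
R5RS_IDENTIFIER = re.compile(
    rf"{INITIAL}{SUBSEQUENT}*|\+|-|\.\.\."
)

# R6RS: the same, plus a separate production for lexemes starting with ->
R6RS_IDENTIFIER = re.compile(
    rf"{INITIAL}{SUBSEQUENT}*|\+|-|\.\.\.|->{SUBSEQUENT}*"
)

def is_r5rs_identifier(lexeme: str) -> bool:
    """True if the whole lexeme matches the R5RS-style approximation."""
    return R5RS_IDENTIFIER.fullmatch(lexeme) is not None

def is_r6rs_identifier(lexeme: str) -> bool:
    """True if the whole lexeme matches the R6RS-style approximation."""
    return R6RS_IDENTIFIER.fullmatch(lexeme) is not None
```

Under this approximation, ->string is rejected by the first pattern and accepted by the second, while -1 and -x are rejected by both, so the new production does not overlap with representations of number objects.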

4.2  Comments

While R5RS provides only the ; syntax for comments, the report now describes three comment forms: in addition to ;, the delimiters #| and |# enclose block comments, and #; starts a “datum comment”. (#!r6rs is also a kind of comment, albeit with a specific, fixed purpose.)

Block comments provide a convenient way of writing multi-line comments, and are an often-requested and often-implemented syntactic addition to the language.

A datum comment always comments out exactly one datum, no more and no less, which the other comment forms cannot reliably do. Uses of datum comments include commenting out alternative versions of a form and commenting out forms that may be required only in certain circumstances. Datum comments are perhaps most useful during development and debugging and may thus be less likely to appear in the final version of a distributed library or top-level program; even so, a programmer or group of programmers sometimes develops and debugs a single piece of code concurrently on multiple systems, in which case a standard notation for commenting out a datum is useful.
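The difference between the three comment forms can be made concrete with a toy reader, sketched here in Python. It is an illustration under simplifying assumptions, not the report's reader: it handles only parenthesized lists and whitespace-delimited atoms, ignores strings and characters, does no error handling for unbalanced input, and all function names are hypothetical. Line and block comments are removed during tokenization, whereas a datum comment must be resolved by the parser, because only the parser knows where the next datum ends.

```python
def tokenize(text):
    """Split text into tokens, dropping ; and nested #| ... |# comments."""
    tokens, i, n = [], 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch == ";":                      # line comment: skip to end of line
            while i < n and text[i] != "\n":
                i += 1
        elif text.startswith("#|", i):       # block comment, may be nested
            depth, i = 1, i + 2
            while i < n and depth > 0:
                if text.startswith("#|", i):
                    depth, i = depth + 1, i + 2
                elif text.startswith("|#", i):
                    depth, i = depth - 1, i + 2
                else:
                    i += 1
        elif text.startswith("#;", i):       # datum comment marker: keep as token
            tokens.append("#;")
            i += 2
        elif ch in "()":
            tokens.append(ch)
            i += 1
        else:                                # atom: run up to a delimiter
            j = i
            while j < n and not text[j].isspace() and text[j] not in "();":
                j += 1
            tokens.append(text[i:j])
            i = j
    return tokens

def skip_datum_comments(tokens, pos):
    """Each #; discards exactly one following datum."""
    while pos < len(tokens) and tokens[pos] == "#;":
        _, pos = read_datum(tokens, pos + 1)
    return pos

def read_datum(tokens, pos):
    """Read one datum at tokens[pos]; return (datum, next position)."""
    pos = skip_datum_comments(tokens, pos)
    tok = tokens[pos]
    if tok == "(":
        items, pos = [], skip_datum_comments(tokens, pos + 1)
        while tokens[pos] != ")":
            item, pos = read_datum(tokens, pos)
            items.append(item)
            pos = skip_datum_comments(tokens, pos)
        return items, pos + 1
    return tok, pos + 1

def read_all(text):
    """Read every datum in text, with all three comment forms removed."""
    tokens, data = tokenize(text), []
    pos = skip_datum_comments(tokens, 0)
    while pos < len(tokens):
        datum, pos = read_datum(tokens, pos)
        data.append(datum)
        pos = skip_datum_comments(tokens, pos)
    return data
```

With this sketch, reading (define x #;(old 1 2) (new 3)) drops the whole list (old 1 2) and nothing else, regardless of how many lines that list spans.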

4.3  Future extensions

The # is the prefix of several different kinds of syntactic entities: vectors, bytevectors, syntactic abbreviations related to syntax construction, nested comments, characters, #!r6rs, and implementation-specific extensions to the syntax that start with #!. In each case, the character following the # specifies what kind of syntactic datum follows. In the case of bytevectors, the syntax anticipates several different kinds of homogeneous vectors, even though R6RS specifies only one. The u8 after the #v identifies the components of the vector as unsigned 8-bit entities or octets.
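The dispatch on the character following # can be sketched as a small Python function. This is only an illustration of the kinds listed above; the function name classify_hash_syntax and the returned labels are hypothetical, and other uses of # in the lexical syntax (for example in booleans and number prefixes) are lumped together as "other".

```python
def classify_hash_syntax(text: str) -> str:
    """Name the kind of syntactic entity introduced by a leading #."""
    if not text.startswith("#"):
        raise ValueError("expected text starting with #")
    rest = text[1:]
    if rest.startswith("("):
        return "vector"
    if rest.startswith("vu8("):
        # v = homogeneous vector, u8 = unsigned 8-bit components (octets)
        return "bytevector"
    if rest.startswith(("'", "`", ",")):
        return "syntax abbreviation"
    if rest.startswith("|"):
        return "block comment"
    if rest.startswith(";"):
        return "datum comment"
    if rest.startswith("\\"):
        return "character"
    if rest.startswith("!r6rs"):
        return "#!r6rs"
    if rest.startswith("!"):
        return "implementation-specific extension"
    return "other"
```

Because the bytevector check looks at the characters after #v separately from the prefix itself, a reader organized this way leaves room for further homogeneous vector kinds to be distinguished by a different suffix.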