Chapter 3

Lexical syntax and read syntax

The syntax of Scheme code is organized in three levels:

the lexical syntax that describes how a program text is split into a sequence of lexemes,
the read syntax, formulated in terms of the lexical syntax, that structures the lexeme sequence as a sequence of syntactic datums, where a syntactic datum is a recursively structured entity,
the program syntax formulated in terms of the read syntax, imposing further structure and assigning meaning to syntactic datums.

Syntactic datums (also called external representations) double as a notation for data, and Scheme’s (rnrs io ports (6)) library (library section on “Port I/O”) provides the get-datum and put-datum procedures for reading and writing syntactic datums, converting between their textual representation and the corresponding values. Each syntactic datum uniquely determines a corresponding datum value. A syntactic datum can be used in a program to obtain the corresponding datum value using quote (see section 9.5.1).

Scheme source code consists of syntactic datums and (non-significant) comments. Syntactic datums in Scheme source code are called forms. Consequently, Scheme’s syntax has the property that any sequence of characters that is a form is also a syntactic datum representing some object. This can lead to confusion, since it may not be obvious out of context whether a given sequence of characters is intended to denote data or program. It is also a source of power, since it facilitates writing programs such as interpreters and compilers that treat programs as data (or vice versa). A form nested inside another form is called a subform.
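The correspondence between program text and data can be illustrated with a short sketch. It assumes an implementation in which (import (rnrs)) makes get-datum and open-string-input-port available; the helper name string->datum is hypothetical and not part of this report.

```scheme
(import (rnrs))

;; Hypothetical helper: read one syntactic datum from a string.
(define (string->datum text)
  (get-datum (open-string-input-port text)))

;; The text of a form, read as data, is an ordinary three-element list.
(define form-as-data (string->datum "(define x 10)"))

;; quote obtains the same datum value from a datum written in a program.
(define quoted-form '(define x 10))

(define same-value? (equal? form-as-data quoted-form))
(define head-is-symbol? (eq? (car form-as-data) 'define))
```

Here the same character sequence would be a definition if it appeared as a form, but it is only a list of a symbol, a symbol, and a number when read or quoted.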

A datum value may have several different external representations. For example, both “#e28.000” and “#x1c” are syntactic datums representing the exact integer 28, and the syntactic datums “(8 13)”, “( 08 13 )”, “(8 . (13 . ()))” all represent a list containing the integers 8 and 13. Syntactic datums that denote equal objects (in the sense of equal?; see section 9.6) are always equivalent as forms of a program.
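As a minimal sketch of the examples above, the following program compares the different external representations; the variable names are illustrative only.

```scheme
(import (rnrs))

;; Two spellings of the exact integer 28.
(define same-number?
  (and (exact? #e28.000)
       (= #e28.000 #x1c)))

;; Three spellings of the list containing the integers 8 and 13.
(define same-list?
  (and (equal? '(8 13) '( 08 13 ))
       (equal? '(8 13) '(8 . (13 . ())))))
```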

Because of the close correspondence between syntactic datums and datum values, this report sometimes uses the term datum to denote either a syntactic datum or a datum value when the exact meaning is apparent from the context.

An implementation is not permitted to extend the lexical or read syntax in any way, with one exception: it need not treat the syntax #!<identifier>, for any <identifier> (see section 3.2.4) that is not r6rs, as a syntax violation, and it may use specific #!-prefixed identifiers as flags indicating that subsequent input contains extensions to the standard lexical or read syntax. The syntax #!r6rs may be used to signify that the input afterward is written with the lexical syntax and read syntax described by this report when no other #!<identifier> appears; #!r6rs is otherwise treated as a comment; see section 3.2.3.

This chapter overviews and provides formal accounts of the lexical syntax and the read syntax.

3.1 Notation

The formal syntax for Scheme is written in an extended BNF. Non-terminals are written using angle brackets. Case is insignificant for non-terminal names.

All spaces in the grammar are for legibility. <Empty> stands for the empty string.

The following extensions to BNF are used to make the description more concise: <thing>* means zero or more occurrences of <thing>, and <thing>⁺ means at least one <thing>.

Some non-terminal names refer to the Unicode scalar values of the same name: <character tabulation> (U+0009), <linefeed> (U+000A), <line tabulation> (U+000B), <form feed> (U+000C), <carriage return> (U+000D), <space> (U+0020), <next line> (U+0085), <line separator> (U+2028), and <paragraph separator> (U+2029).

3.2 Lexical syntax

The lexical syntax determines how a character sequence is split into a sequence of lexemes, omitting non-significant portions such as comments and whitespace. The character sequence is assumed to be text according to the Unicode standard [45]. Some of the lexemes, such as numbers, identifiers, strings etc., of the lexical syntax are syntactic datums in the read syntax, and thus represent data. Besides the formal account of the syntax, this section also describes what datum values are denoted by these syntactic datums.

The lexical syntax, in the description of comments, contains a forward reference to <datum>, which is described as part of the read syntax. Being comments, however, these <datum>s do not play a significant role in the syntax.

Case is significant except in boolean datums, number datums, and hexadecimal numbers denoting Unicode scalar values. For example, #x1A and #X1a are equivalent. The identifier Foo is, however, distinct from the identifier FOO.
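The case rules can be checked with a small sketch, assuming an implementation that follows the lexical syntax of this chapter; the variable names are illustrative.

```scheme
(import (rnrs))

;; Number and boolean datums ignore case.
(define numbers-fold-case? (= #x1A #X1a))
(define booleans-fold-case? (eq? #T #t))

;; Identifiers do not: Foo and FOO are different symbols.
(define identifiers-distinct? (not (eq? 'Foo 'FOO)))
```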

3.2.1 Formal account

<Interlexeme space> may occur on either side of any lexeme, but not within a lexeme.

Identifiers, numbers, characters, booleans, and dot must be terminated by a <delimiter> (e.g., parenthesis, space, or comment) or by the end of the input.

The following two characters are reserved for future extensions to the language: { }

<lexeme> → <identifier> | <boolean> | <number>
    | <character> | <string>
    | ( | ) | [ | ] | #( | #vu8( | ' | ` | , | ,@ | .
    | #' | #` | #, | #,@
<delimiter> → <interlexeme space> | ( | ) | [ | ] | " | ;
<whitespace> → <character tabulation> | <linefeed>
    | <line tabulation> | <form feed> | <carriage return>
    | <next line>
    | <any character whose category is Zs, Zl, or Zp>
<line ending> → <linefeed> | <carriage return>
    | <carriage return> <linefeed> | <next line>
    | <carriage return> <next line> | <line separator>
<comment> → ; ⟨all subsequent characters up to a
        <line ending> or <paragraph separator>⟩
    | <nested comment>
    | #; <interlexeme space> <datum>
    | #!r6rs
<nested comment> → #| <comment text> <comment cont>* |#
<comment text> → ⟨character sequence not containing #| or |#⟩
<comment cont> → <nested comment> <comment text>
<atmosphere> → <whitespace> | <comment>
<interlexeme space> → <atmosphere>*

<identifier> → <initial> <subsequent>*
    | <peculiar identifier>
<initial> → <constituent> | <special initial>
    | <inline hex escape>
<letter> → a | b | c | ... | z
    | A | B | C | ... | Z
<constituent> → <letter>
    | ⟨any character whose Unicode scalar value is greater than
        127, and whose category is Lu, Ll, Lt, Lm, Lo, Mn,
        Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co⟩
<special initial> → ! | $ | % | & | * | / | : | < | =
    | > | ? | ^ | _ | ~
<subsequent> → <initial> | <digit>
    | <any character whose category is Nd, Mc, or Me>
    | <special subsequent>
<digit> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<hex digit> → <digit>
    | a | A | b | B | c | C | d | D | e | E | f | F
<special subsequent> → + | - | . | @
<inline hex escape> → \x<hex scalar value>;
<hex scalar value> → <hex digit>⁺
<peculiar identifier> → + | - | ... | -> <subsequent>*
<boolean> → #t | #T | #f | #F
<character> → #\<any character>
    | #\<character name>
    | #\x<hex scalar value>
<character name> → nul | alarm | backspace | tab
    | linefeed | vtab | page | return | esc
    | space | delete
<string> → " <string element>* "
<string element> → <any character other than " or \>
    | <line ending>
    | \a | \b | \t | \n | \v | \f | \r
    | \" | \\
    | \<line ending> | \<space>
    | <inline hex escape>

A <hex scalar value> represents a Unicode scalar value between 0 and #x10FFFF, excluding the range [#xD800, #xDFFF].

The rules for <num R>, <complex R>, <real R>, <ureal R>, <uinteger R>, and <prefix R> below should be replicated for R = 2, 8, 10, and 16. There are no rules for <decimal 2>, <decimal 8>, and <decimal 16>, which means that numbers containing decimal points or exponents must be in decimal radix.

3.2.2 Line endings

Line endings are significant in Scheme in single-line comments (see section 3.2.3) and within string literals. In Scheme source code, any of the line endings in <line ending> marks the end of a line. Moreover, the two-character line endings <carriage return> <linefeed> and <carriage return> <next line> each count as a single line ending.

In a string literal, a line ending not preceded by a \ denotes a linefeed character, which is the standard line-ending character of Scheme.

3.2.3 Whitespace and comments

Whitespace characters are spaces, linefeeds, carriage returns, character tabulations, form feeds, line tabulations, and any other character whose category is Zs, Zl, or Zp. Whitespace is used for improved readability and as necessary to separate lexemes from each other. Whitespace may occur between any two lexemes, but not within a lexeme. Whitespace may also occur inside a string, where it is significant.

The lexical syntax includes several comment forms. In all cases, comments are invisible to Scheme, except that they act as delimiters, so, for example, a comment cannot appear in the middle of an identifier or number.

A semicolon (;) indicates the start of a line comment. The comment continues to the end of the line on which the semicolon appears.

Another way to indicate a comment is to prefix a <datum> (cf. Section 3.3.1) with #;, possibly with <interlexeme space> before the <datum>. The comment consists of the comment prefix #; and the <datum> together. This notation is useful for “commenting out” sections of code.

Block comments may be indicated with properly nested #| and |# pairs.

#|
   The FACT procedure computes the factorial
   of a non-negative integer.
|#
(define fact
  (lambda (n)
    ;; base case
    (if (= n 0)
        #;(= n 1)
        1       ; identity of *
        (* n (fact (- n 1))))))

Rationale: #| ... |# cannot be used to comment out an arbitrary datum or set of datums; it works only when none of the datums include a string with an unmatched #| or |# character sequence. While #| ... |# and ; can often be used, with care, to comment out a datum, only #; allows the programmer to clearly communicate that a single datum has been commented out, as opposed to a block or line of arbitrary text.
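The three comment forms can be seen side by side in the following sketch, which assumes a reader that implements the comment syntax described in this section. The name items is illustrative.

```scheme
(import (rnrs))

;; Every comment form is skipped by the reader, so only the
;; arguments 1, 2, and 3 reach the call to list.
(define items
  (list 1
        #;(this whole datum is skipped)
        2
        #| block comment with a #| nested |# part |#
        3)) ; line comment up to the end of the line
```

Note that #; removes exactly one datum, here a parenthesized list, regardless of how many lines that datum spans.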

The lexeme #!r6rs, which signifies that the program text that follows is written with the lexical and read syntax described in this report, is also otherwise treated as a comment.

3.2.4 Identifiers

Most identifiers allowed by other programming languages are also acceptable to Scheme. In general, a sequence of letters, digits, and “extended alphabetic characters” is an identifier when it begins with a character that cannot begin a number. In addition, +, -, and ... are identifiers, as is a sequence of letters, digits, and extended alphabetic characters that begins with the two-character sequence ->. Here are some examples of identifiers:

lambda q soup list->vector + V17a <= a34kTMNs ->- the-word-recursion-has-many-meanings

Extended alphabetic characters may be used within identifiers as if they were letters. The following are extended alphabetic characters:

! $ % & * + - . / : < = > ? @ ^ _ ~

Moreover, all characters whose Unicode scalar values are greater than 127 and whose Unicode category is Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nd, Nl, No, Pd, Pc, Po, Sc, Sm, Sk, So, or Co can be used within identifiers. In addition, any character can be used within an identifier when denoted via an <inline hex escape>. For example, the identifier H\x65;llo is the same as the identifier Hello, and the identifier \x3BB; is the same as the identifier λ.

Any identifier may be used as a variable or as a syntactic keyword (see sections 4.2 and 6.3.2) in a Scheme program. Any identifier may also be used as a syntactic datum, in which case it denotes a symbol (see section 9.11).
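A brief sketch of identifiers used as syntactic datums follows. It assumes a reader that accepts inline hex escapes in identifiers as described above; the variable names are illustrative.

```scheme
(import (rnrs))

;; An inline hex escape denotes the same character as the literal one,
;; so both spellings name the same symbol.
(define escaped-same? (eq? 'H\x65;llo 'Hello))

;; Peculiar identifiers and ->-prefixed identifiers are ordinary symbols
;; when quoted.
(define peculiar-names
  (map symbol->string '(+ - ... ->- list->vector <=)))
```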

3.2.5 Booleans

The standard boolean objects for true and false are written as #t and #f.

3.2.6 Characters

Characters are written using the notation #\<character> or #\<character name> or #\x<hex scalar value>.

For example:

`#\a`	lower case letter a
`#\A`	upper case letter A
`#\(`	left parenthesis
`#\ `	space character
`#\nul`	U+0000
`#\alarm`	U+0007
`#\backspace`	U+0008
`#\tab`	U+0009
`#\linefeed`	U+000A
`#\vtab`	U+000B
`#\page`	U+000C
`#\return`	U+000D
`#\esc`	U+001B
`#\space`	U+0020
	preferred way to write a space
`#\delete`	U+007F
`#\xFF`	U+00FF
`#\x03BB`	U+03BB
`#\x00006587`	U+6587
`#\λ`	U+03BB
`#\x0001z`	`&lexical` exception
`#\λx`	`&lexical` exception
`#\alarmx`	`&lexical` exception
`#\alarm x`	U+0007
	followed by `x`
`#\Alarm`	`&lexical` exception
`#\alert`	`&lexical` exception
`#\xA`	U+000A
`#\xFF`	U+00FF
`#\xff`	U+00FF
`#\x ff`	U+0078
	followed by another datum, `ff`
`#\x(ff)`	U+0078
	followed by another datum,
	a parenthesized `ff`
`#\(x)`	`&lexical` exception
`#\(x`	`&lexical` exception
`#\((x)`	U+0028
	followed by another datum,
	parenthesized `x`
`#\x00110000`	`&lexical` exception
	out of range
`#\x000000001`	U+0001
`#\xD800`	`&lexical` exception
	in excluded range

(The notation &lexical exception means that the line in question is a lexical syntax violation.)

Case is significant in #\<character>, and in #\⟨character name⟩, but not in #\x<hex scalar value>. A <character> must be followed by a <delimiter> or by the end of the input. This rule resolves various ambiguous cases involving named characters, requiring, for example, the sequence of characters “#\space” to be interpreted as the space character rather than as the character “#\s” followed by the identifier “pace”.
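Some rows of the table above can be checked directly. The following is a minimal sketch with illustrative variable names; it relies only on char->integer and char=? from the base library.

```scheme
(import (rnrs))

;; #\x03BB denotes U+03BB, whose scalar value is 955 in decimal.
(define lambda-code (char->integer #\x03BB))

;; Hex digits in #\x<hex scalar value> ignore case.
(define hex-case-insensitive? (char=? #\xFF #\xff))

;; A character name and its hex form denote the same character.
(define named-linefeed? (char=? #\linefeed #\xA))

;; "#\x ff" is the character x (U+0078) followed by the datum ff,
;; because the space acts as a delimiter.
(define x-then-symbol '(#\x ff))
```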

3.2.7 Strings

Strings are written as sequences of characters enclosed within doublequotes ("). Within a string literal, various escape sequences denote characters other than themselves. Escape sequences always start with a backslash (\):

\a : alarm, U+0007
\b : backspace, U+0008
\t : character tabulation, U+0009
\n : linefeed, U+000A
\v : line tabulation, U+000B
\f : formfeed, U+000C
\r : return, U+000D
\" : doublequote, U+0022
\\ : backslash, U+005C
\<linefeed> : nothing
\<space> : space, U+0020 (useful for terminating the previous escape sequence before continuing with whitespace)
\x<hex scalar value>; : specified character (note the terminating semi-colon).

These escape sequences are case-sensitive, except that the alphabetic digits of a <hex scalar value> can be uppercase or lowercase.

Any other character in a string after a backslash is an error. Except for a line ending, any character outside of an escape sequence and not a doublequote stands for itself in the string literal. For example the single-character string "λ" (doublequote, a lower case lambda, doublequote) denotes the same string literal as "\x03bb;". A line ending stands for a linefeed character.

Examples:

`"abc"`	U+0061, U+0062, U+0063
`"\x41;bc"`	`"Abc"` ; U+0041, U+0062, U+0063
`"\x41; bc"`	`"A bc"`
	U+0041, U+0020, U+0062, U+0063
`"\x41bc;"`	U+41BC
`"\x41"`	`&lexical` exception
`"\x;"`	`&lexical` exception
`"\x41bx;"`	`&lexical` exception
`"\x00000041;"`	`"A"` ; U+0041
`"\x0010FFFF;"`	U+10FFFF
`"\x00110000;"`	`&lexical` exception
	out of range
`"\x000000001;"`	U+0001
`"\xD800;"`	`&lexical` exception
	in excluded range
`"A`
`bc"`	U+0041, U+000A, U+0062, U+0063
	if no space occurs after the `A`
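The escape examples above can be verified with a short sketch. The variable names are illustrative, and the checks use only string=?, string-length, and string-ref.

```scheme
(import (rnrs))

;; The semicolon terminates the hex escape, so "bc" follows the A.
(define escaped "\x41;bc")

;; A space after the terminating semicolon is an ordinary character.
(define with-space "\x41; bc")

;; Without an early semicolon, all hex digits belong to one escape:
;; this string holds the single character U+41BC.
(define one-char "\x41bc;")
```

Evaluating (string=? escaped "Abc") yields #t, (string-length with-space) yields 4, and (string-length one-char) yields 1.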

3.2.8 Numbers

The syntax of written representations for numbers is described formally by the <number> rule in the formal grammar. Case is not significant in numerical constants.

A number may be written in binary, octal, decimal, or hexadecimal by the use of a radix prefix. The radix prefixes are #b (binary), #o (octal), #d (decimal), and #x (hexadecimal). With no radix prefix, a number is assumed to be expressed in decimal.

A numerical constant may be specified to be either exact or inexact by a prefix. The prefixes are #e for exact, and #i for inexact. An exactness prefix may appear before or after any radix prefix that is used. If the written representation of a number has no exactness prefix, the constant is inexact if it contains a decimal point, an exponent, a “#” character in the place of a digit, or a nonempty mantissa width; otherwise it is exact.
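The radix and exactness prefixes can be illustrated as follows. This is a sketch with illustrative variable names; the mantissa-width and “#” digit notations are left out because support for them varies between implementations.

```scheme
(import (rnrs))

;; The same exact integer, 28, in each radix.
(define radix-values (list #b11100 #o34 #d28 #x1c))

;; The exactness prefix may come before or after the radix prefix.
(define prefix-order-free? (= #e#x1c #x#e1c))

;; #e forces exactness despite the decimal point: this is 3/2.
(define exact-decimal #e1.5)

;; #i forces inexactness of a ratio: this is 0.75.
(define inexact-ratio #i3/4)

;; With no exactness prefix, a decimal point or exponent means inexact.
(define plain-decimal 28.0)
(define plain-exponent 1e2)
```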

In systems with inexact numbers of varying precisions, it may be useful to specify the precision of a constant. For this purpose, numerical constants may be written with an exponent marker that indicates the desired precision of the inexact representation. The letters s, f, d, and l specify the use of short, single, double, and long precision, respectively. (When fewer than four internal inexact representations exist, the four size specifications are mapped onto those available. For example, an implementation with two internal representations may map short and single together and long and double together.) In addition, the exponent marker e specifies the default precision for the implementation. The default precision has at least as much precision as double, but implementations may wish to allow this default to be set by the user.

3.1415926535898F0    Round to single, perhaps 3.141593
0.6L0                Extend to long, perhaps .600000000000000

An inexact real number with nonempty mantissa width, x|p, denotes the best binary floating point approximation of x using a p-bit significand. For example, 1.1|53 is an external representation of the best approximation of 1.1 in IEEE double precision. If x is an external representation of an inexact real number that contains no vertical bar, it should be treated as if specified with a mantissa width of 53.

Implementations that use binary floating point representations of real numbers should represent x|p using a p-bit significand if practical, or by a greater precision if a p-bit significand is not practical, or by the largest available precision if p or more bits of significand are not practical within the implementation.

Note: The precision of a significand should not be confused with the number of bits used to represent the significand. In the IEEE floating point standards, for example, the significand’s most significant bit is implicit in single and double precision but is explicit in extended precision. Whether that bit is implicit or explicit does not affect the mathematical precision. In implementations that use binary floating point, the default precision can be calculated by calling the following procedure:

(define (precision)
  (do ((n 0 (+ n 1))
       (x 1.0 (/ x 2.0)))
    ((= 1.0 (+ 1.0 x)) n)))

Note: When the underlying floating-point representation is IEEE double precision, the |p suffix should not always be omitted: Denormalized numbers have diminished precision, and therefore should carry a |p suffix with the actual width of the significand.

The literals +inf.0 and -inf.0 represent positive and negative infinity, respectively. The +nan.0 literal represents the NaN that is the result of (/ 0.0 0.0), and may represent other NaNs as well.
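The following sketch checks the special literals with the base-library predicates infinite? and nan?. It assumes flonum division by zero behaves as described above; the variable names are illustrative.

```scheme
(import (rnrs))

;; The infinity literals are infinite and carry the expected sign.
(define pos-inf? (and (infinite? +inf.0) (> +inf.0 0)))
(define neg-inf? (and (infinite? -inf.0) (< -inf.0 0)))

;; The NaN literal and the result of (/ 0.0 0.0) are both NaNs.
(define nan-literal? (nan? +nan.0))
(define nan-computed? (nan? (/ 0.0 0.0)))
```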

If x is an external representation of an inexact real number and contains no vertical bar and no exponent marker other than e, the inexact real number it denotes is a flonum (see library section on “Flonums”). Some or all of the other external representations of inexact reals may also denote flonums, but that is not required by this report.

3.3 Read syntax

The read syntax describes the syntax of syntactic datums in terms of a sequence of <lexeme>s, as defined in the lexical syntax.

Syntactic datums include the lexeme datums described in the previous section as well as the following constructs for forming compound datums:

pairs and lists, enclosed by ( ) or [ ] (see section 3.3.2)
vectors (see section 3.3.3)
bytevectors (see section 3.3.4)

3.3.1 Formal account

The following grammar describes the syntax of syntactic datums in terms of various kinds of lexemes defined in the grammar in section 3.2:

<datum> → <lexeme datum>
    | <compound datum>
<lexeme datum> → <boolean> | <number>
    | <character> | <string> | <symbol>
<symbol> → <identifier>
<compound datum> → <list> | <vector> | <bytevector>
<list> → (<datum>*) | [<datum>*]
    | (<datum>⁺ . <datum>) | [<datum>⁺ . <datum>]
    | <abbreviation>
<abbreviation> → <abbrev prefix> <datum>
<abbrev prefix> → ' | ` | , | ,@
    | #' | #` | #, | #,@
<vector> → #(<datum>*)
<bytevector> → #vu8(<u8>*)
<u8> → ⟨any <number> denoting an exact integer in {0, ..., 255}⟩

3.3.2 Pairs and lists

List and pair datums, denoting pairs and lists of values (see section 9.10), are written using parentheses or brackets. Matching pairs of brackets that occur in the rules of <list> are equivalent to matching pairs of parentheses.

The most general notation for Scheme pairs as syntactic datums is the “dotted” notation (<datum₁> . <datum₂>) where <datum₁> is the representation of the value of the car field and <datum₂> is the representation of the value of the cdr field. For example (4 . 5) is a pair whose car is 4 and whose cdr is 5.

A more streamlined notation can be used for lists: the elements of the list are simply enclosed in parentheses and separated by spaces. The empty list is written (). For example,

(a b c d e)

and

(a . (b . (c . (d . (e . ())))))

are equivalent notations for a list of symbols.

The general rule is that, if a dot is followed by an open parenthesis, the dot, open parenthesis, and matching closing parenthesis can be omitted in the external representation.

The sequence of characters “(4 . 5)” is the external representation of a pair, not an expression that evaluates to a pair. Similarly, the sequence of characters “(+ 2 6)” is not an external representation of the integer 8, even though it is a base-library expression evaluating to the integer 8; rather, it is a syntactic datum representing a three-element list, the elements of which are the symbol + and the integers 2 and 6.
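The distinction between a datum and the value of an expression can be sketched as follows; the variable names are illustrative.

```scheme
(import (rnrs))

;; Dotted, streamlined, and bracketed notations denote the same list.
(define dotted '(a . (b . (c . (d . (e . ()))))))
(define streamlined '(a b c d e))
(define bracketed '[a b c d e])

;; A pair that is not a list: its cdr is 5, not another pair.
(define pair '(4 . 5))

;; As a datum, "(+ 2 6)" is a list of the symbol + and two integers;
;; only evaluating it as an expression yields 8.
(define sum-datum '(+ 2 6))
(define sum-value (+ 2 6))
```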

3.3.3 Vectors

Vector datums, denoting vectors of values (see section 9.14), are written using the notation #(<datum> ...). For example, a vector of length 3 containing the number zero in element 0, the list (2 2 2 2) in element 1, and the string "Anna" in element 2 can be written as follows:

#(0 (2 2 2 2) "Anna")

This is the external representation of a vector, not a base-library expression that evaluates to a vector.
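Because the notation is a datum rather than an expression, a program obtains the vector by quoting it. A minimal sketch, with illustrative variable names:

```scheme
(import (rnrs))

;; quote turns the vector datum into the corresponding vector value.
(define v '#(0 (2 2 2 2) "Anna"))

;; The same vector built by an expression, for comparison.
(define built (vector 0 (list 2 2 2 2) "Anna"))
```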

3.3.4 Bytevectors

Bytevector datums, denoting bytevectors (see library chapter on “Bytevectors”), are written using the notation #vu8(<u8> ...), where the <u8>s represent the octets of the bytevector. For example, a bytevector of length 3 containing the octets 2, 24, and 123 can be written as follows:

#vu8(2 24 123)

This is the external representation of a bytevector, not an expression that evaluates to a bytevector.
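As with vectors, a program can obtain the bytevector by quoting the datum. The sketch below assumes the bytevector procedures exported by (rnrs); the variable names are illustrative.

```scheme
(import (rnrs))

;; quote turns the bytevector datum into the corresponding value.
(define bv '#vu8(2 24 123))

;; The same octets assembled by a procedure call, for comparison.
(define built (u8-list->bytevector '(2 24 123)))
```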

3.3.5 Abbreviations

'<datum>

`<datum>

,<datum>

,@<datum>

#'<datum>

#`<datum>

#,<datum>

#,@<datum>

Each of these is an abbreviation:
    '<datum> for (quote <datum>),
    `<datum> for (quasiquote <datum>),
    ,<datum> for (unquote <datum>),
    ,@<datum> for (unquote-splicing <datum>),
    #'<datum> for (syntax <datum>),
    #`<datum> for (quasisyntax <datum>),
    #,<datum> for (unsyntax <datum>), and
    #,@<datum> for (unsyntax-splicing <datum>).
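Since the abbreviations are expanded by the reader, their effect can be observed by quoting them and comparing against the long forms. This is a sketch with illustrative variable names; nothing is evaluated beyond the outer quote.

```scheme
(import (rnrs))

;; A quoted abbreviation is the two-element list it stands for.
(define quote-expanded?
  (equal? ''a '(quote a)))

;; The quasiquote family expands the same way, including nested uses.
(define quasi-expanded?
  (equal? '`(a ,b ,@c)
          '(quasiquote (a (unquote b) (unquote-splicing c)))))

;; The #-prefixed abbreviations expand to the syntax family.
(define syntax-expanded?
  (equal? '(#'a #`b #,c #,@d)
          '((syntax a) (quasisyntax b) (unsyntax c) (unsyntax-splicing d))))
```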