From: William D Clinger
Date: Mon, 18 Jun 2007 23:17:45 -0400
Subject: [Formal] please repair and simplify lexical syntax

Submitter: William D Clinger
Issue type: Defect
Priority: Minor
Component: Lexical Syntax
Report version: 5.94
Summary: please repair and simplify lexical syntax

Full description of issue:

The current draft's description of lexical syntax contains several
technical errors. The lexical description is also so complex as to
invite controversy concerning its interpretation. For example, I
count well over 800,000 distinct errors in the current reference
implementation of get-datum, but its author does not agree that
these are errors [private communication].

Implementations of the R5RS were allowed to simplify the lexical
syntax by extending it to a simpler and more regular syntax that
includes the required R5RS syntax as a proper sublanguage. Many did
so. One of the more popular extensions adds a nonterminal, which
generates all identifiers, all numbers, and many non-R5RS tokens
besides. Any such token that is rejected by string->number is then
accepted as an identifier.

The current draft R6RS forbids the use of such extensions to
simplify the lexical syntax. Hence simplicity of lexical syntax has
become more important for the R6RS than it was in the R5RS.
Unfortunately, the lexical syntax has become more complicated, not
simpler.

* * *

Details:

The end of input is not listed as a lexeme, but it probably should
be.

The current draft says "Identifiers, numbers, characters, booleans,
and dot must be terminated by a <delimiter> (e.g. parenthesis,
space, or comment) or by the end of the input." That sounds like an
excellent requirement, but it is vacuous. According to the formal
syntax, a <delimiter> can be <interlexeme space>, and the empty
sequence of characters is one kind of <interlexeme space>, so it is
simply impossible for a lexeme (other than the end of input) not to
be followed by a <delimiter>. The solution I recommend is to change
the first production of <delimiter> to <whitespace>.
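
The vacuity argument can be made concrete with a small sketch. The
Python below is only an illustration, not part of the draft or of any
implementation. Its regular expressions are simplified stand-ins
(assumptions) for the grammar: only spaces, tabs, newlines, and line
comments count as space between lexemes, and only a handful of
delimiter characters are modelled. The helper is_terminated is a
hypothetical name.

```python
import re

# Simplified stand-ins for the grammar. These patterns are assumptions made
# for illustration, not the draft's actual productions.
WHITESPACE = r"[ \t\n]"
LINE_COMMENT = r";[^\n]*"
# A possibly empty run of whitespace and comments between lexemes.
INTERLEXEME_SPACE = rf"(?:{WHITESPACE}|{LINE_COMMENT})*"
OTHER_DELIMITERS = r'[()\[\]";]'

# Delimiter as the comment describes it: the first alternative can match
# the empty string.
DELIMITER_DRAFT = re.compile(rf"{INTERLEXEME_SPACE}|{OTHER_DELIMITERS}")
# Delimiter after the recommended change: the first alternative must
# consume at least one whitespace character.
DELIMITER_FIXED = re.compile(rf"{WHITESPACE}|{OTHER_DELIMITERS}")


def is_terminated(text: str, end: int, delimiter: re.Pattern) -> bool:
    """Return True if the lexeme ending at index `end` is followed by a
    delimiter or by the end of the input."""
    return end == len(text) or delimiter.match(text, end) is not None


# In every input below, the candidate lexeme "foo" ends at index 3.
print(is_terminated("foobar", 3, DELIMITER_DRAFT))   # True: empty match
print(is_terminated("foobar", 3, DELIMITER_FIXED))   # False
print(is_terminated("foo bar", 3, DELIMITER_FIXED))  # True
print(is_terminated("foo(bar)", 3, DELIMITER_FIXED)) # True
```

With the first pattern the check succeeds at every position, so it
constrains nothing. With the second it rejects "foo" immediately
followed by "bar", which is what the quoted requirement appears to
intend.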

In the 5.94 draft, the following are legal:

    foo#;13#;15()           ; read as foo ()
    foo#|comment|#()        ; read as foo ()
    foo#!r6rs#!r6rs()       ; read as foo ()

but the following are illegal:

    foo#()
    foo#!r5rs#!r5rs()

Treating #!r6rs as a delimiter, but not #!r5rs or #!fold-case, is
especially confusing. I recommend the addition of # to the list of
delimiters. This might cause problems for backwards compatibility,
however, since several systems have been allowing # inside
identifiers.

Since there are over 235,000 Unicode characters that can begin an
R6RS identifier, programmers are likely to assume that identifiers
can begin with any alphabetic character. With the current draft,
however, 163 alphabetic characters (of Unicode 5.0.0) cannot begin
an identifier. This should be fixed.

The peculiar identifiers that begin with -> are not needed by the
current draft, and are no more useful than many other extensions
that are just as widely implemented. There is no principled reason
for the R6RS to allow the -> peculiarities while forbidding others.
Either remove the -> wart from the lexical syntax, or generalize the
lexical syntax so the -> identifiers are no longer exceptions to the
general rules for constructing identifiers.

Scheme's lexical syntax for numbers has always been complex, but the
draft R6RS proposes to make that more of a problem by outlawing
simplifying extensions. I think it is time to drop the # notation
for insignificant digits. Our earlier discussion of this revealed
general confusion concerning its semantics, and it also appears that
hardly anyone uses it. The R6RS <mantissa width> is arguably a
better solution than the # notation, and should replace it.

The #\nul, #\esc, and #\delete characters have no corresponding
two-letter escape sequence in strings, while the other eight named
characters do. This seems arbitrary and capricious.

The formal syntax is ambiguous, which implies it is not LR. This
creates unnecessary obstacles to the use of some standard scanner
and parser generators.
The ambiguities I have noticed so far include:

    The fact that any two input characters are separated by
    <interlexeme space>.

    The first two productions for overlap.

    The first and third productions for overlap.

    The two productions for overlap.

RESPONSE:

The delimiter problem has been resolved. The last three ambiguities
mentioned are harmless and were retained to avoid introducing new
problems this late in the process.