Formal comment #46 (defect)

LF should not be the only line separator
Reported by:	Reinder Verlinde

Component:	unicode
Version:	5.91

Section 3.2.1 contains the following production: 
     <intra-line whitespace> --> <any <whitespace> that is not <linefeed>

First, a minor, textual issue: the < and > tags do not balance (a > is
missing at the end of the line)

The major issue here is the choice to make "line feed" the only inter-
line separator.

http://en.wikipedia.org/wiki/Newline#Unicode, although far from
normative, states:

"The Unicode standard addresses the problem by defining a large number
of characters that conforming applications should recognize as line
terminators:

      LF:    Line Feed, u000A
      CR:    Carriage Return, u000D
      CR+LF: CR followed by LF, u000D followed by u000A
      NEL:   Next Line, u0085
      FF:    Form Feed, u000C
      LS:    Line Separator, u2028
      PS:    Paragraph Separator, u2029"

but in my reading, it is consistent with
http://www.unicode.org/reports/tr14. If a goal of the spec is to make
Scheme Unicode compliant, it must follow the mandatory aspects of that
page.

When not making this change, be aware that source files will be
single-line on some (minority) platforms (examples: Mac OS 9 and
earlier, and if I read things correctly, EBCDIC-based systems)

This may mean additional changes to the grammar. I haven't completely
thought it over, but I think the cleanest approach would be to define
the lexical syntax as consisting of two phases, the first of which
normalizes line endings (say to LFs).

RESPONSE:

Unicode TR 14 does not seem relevant to the question of which
characters are recognized as a linefeed following backslash in a
string.  TR 14 is about rendering text that is represented as a
sequence of Unicode scalar values.

Unicode TR 13, however, seems directly relevant. The recommendation of
TR 13 is to recognize the platform-specific newline character (not LS
(U+2028) and PS (U+2029).

The R6RS grammar is meant to apply to a decoded stream. That is,
specific bytes that represent a newline some stream are meant to be
converted to the LF character in the character stream that is parsed
as a program.

For example, under Mac OS 9, it is expected that the decoder used for
reading ASCII-encoded source code will decode a #xD byte as an LF
character. If the file is read via an R6RS port, the port should use a
transcoder whose eol-style is 'cr.

Putting these pieces together, the R6RS grammar is almost consistent
with Unicode TR 13. To make it more consistent, however, LS and PS
should be added as alternatives for <linefeed> in the grammar, and we
intend to make this change in the next draft.