Formal comment #68 (defect)

R6RS must provide a UTF-16 codec, because UTF-16 is an essential encoding
Reported by:	John Cowan

Component:	i/o
Version:	5.91

R6RS implementations are currently required to support the UTF-8,
Latin-1 (ISO 8859-1), UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE
encodings. This list omits the essential UTF-16 encoding.

The difference between UTF-16 and UTF-16{BE,LE} is that in the former,
the presence of a BOM (U+FEFF) character at the beginning of the input
stream indicates the ordering of the two bytes that make up each 16-bit
code unit. The BOM is not considered part of the content. (If no BOM
is present, the environment's default ordering is used; failing that,
big-endian order is used.)
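
The BOM rule just described can be sketched in a few lines. The example
is in Python rather than Scheme, purely for illustration; the function
name decode_utf16 and its default_order parameter are hypothetical and
are not part of any R6RS draft. The parameter stands in for "the
environment's default ordering".

```python
import codecs


def decode_utf16(data: bytes, default_order: str = "big") -> str:
    """Decode UTF-16 bytes, letting a leading BOM select the byte order.

    Hypothetical helper for illustration only. `default_order` ("big" or
    "little") models the environment's default ordering and is consulted
    only when the input carries no BOM.
    """
    if data.startswith(codecs.BOM_UTF16_BE):      # bytes FE FF
        return data[2:].decode("utf-16-be")       # BOM is not content
    if data.startswith(codecs.BOM_UTF16_LE):      # bytes FF FE
        return data[2:].decode("utf-16-le")       # BOM is not content
    # No BOM: fall back to the supplied default, big-endian unless told otherwise.
    codec = "utf-16-be" if default_order == "big" else "utf-16-le"
    return data.decode(codec)
```

With this sketch, the byte sequences FF FE 68 00 69 00 and
FE FF 00 68 00 69 both decode to the two-character string "hi".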

In the UTF-16BE and UTF-16LE encodings, no BOM is permitted; an
initial U+FEFF character has its alternative semantics of zero-width
no-break space. These encodings are far less commonly used than the
UTF-16 encoding.
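
As an illustration from outside Scheme, Python's built-in codecs draw
the same distinction, so they can be used to show what goes wrong when
a BOM-prefixed stream is read with the wrong codec. This is a sketch of
the behaviour, not a proposal for how an R6RS codec should be written;
the variable names are arbitrary.

```python
# "hi" as a Windows-style UTF-16 document: BOM (FF FE) then little-endian units.
data = b"\xff\xfeh\x00i\x00"

# The "utf-16" codec consumes the BOM and uses it to pick the byte order.
as_utf16 = data.decode("utf-16")

# The "utf-16-le" codec treats the same two bytes as content: U+FEFF is kept
# as a character at the start of the decoded string.
as_utf16le = data.decode("utf-16-le")

# Going the other way, "utf-16-le" never writes a BOM.
encoded_le = "hi".encode("utf-16-le")
```

Here as_utf16 is "hi", while as_utf16le is a three-character string
beginning with U+FEFF, which a program then has to strip by hand.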

In particular, the Windows operating system consistently creates
UTF-16 documents in little-endian order (not UTF-16LE documents)
whenever characters must be written that are not available in the
locale-dependent encoding. In essence, Windows systems provide two
different encodings at any one time: the "ANSI" (locale-dependent,
8-bit or 8/16-bit) encoding, and the UTF-16 encoding. (The MS-DOS
compatibility support provides a third encoding for use by MS-DOS
programs.) Failing to provide a UTF-16 codec will make it
unnecessarily hard to process Unicode documents generated by Windows.

In addition, UTF-16 (not UTF-16LE or UTF-16BE) is one of the two
encodings which all XML processors (parsers) are required to accept,
the other being UTF-8. Depending on the predominant language of the
document, UTF-16 encoding may be more or less compact than UTF-8
encoding. Failing to provide a UTF-16 codec will make a substantial
range of XML documents difficult to process.
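
The compactness point can be checked directly. The following Python
sketch compares encoded sizes for an ASCII string and a Japanese
string; the helper name encoded_sizes is hypothetical, and the UTF-16
figure excludes the BOM, which would add two bytes per document.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count without BOM) for text."""
    utf8_len = len(text.encode("utf-8"))
    utf16_len = len(text.encode("utf-16-le"))  # the LE codec writes no BOM
    return utf8_len, utf16_len


ascii_sizes = encoded_sizes("hello")     # 1 byte per char in UTF-8, 2 in UTF-16
japanese_sizes = encoded_sizes("日本語")  # 3 bytes per char in UTF-8, 2 in UTF-16
```

For "hello" the sizes are 5 and 10 bytes; for "日本語" they are 9 and 6
bytes, so which encoding is smaller does depend on the text.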

I propose that a procedure named "utf-16-codec" be added to section
15.3.3 (p. 86). I further propose that the codecs for the rarely used
UTF-{16,32}{BE,LE} encodings be removed. No form of UTF-32 encoding is
in common use in I/O, though UTF-32 format is sometimes convenient for
internal use.

RESPONSE:

The next draft of the report will reflect these suggestions.