Formal comment #134 (defect)

the CHAR? type - relaxing surrogate restriction
Reported by:	Thomas Lord

Component:	other
Version:	5.91

R6RS component: The CHAR? type. (Section 9.14) 

Summary

The restriction in section 9.14, prohibitting the domain of
INTEGER->CHAR from including surrogates, should be
relaxed. Implementations should be permitted, not required, to adopt
that restriction.

Body

The text of 9.14 says, concerning the domain of the INTEGER->CHAR
procedure:
      (integer->char sv)

Sv must be a scalar value, i.e. a non-negative exact integer in
[0,#xD7FF] union [#xE000,#x10FFFF].

I think it should say: 

Implementations are permitted to require that Sv must be a scalar
value, i.e. a non-negative exact integer in [0,#xD7FF] union
[#xE000,#x10FFFF].

or words to that effect. 

Opinions vary about the desirability of an implementation in which an
"unpaired surrogate" can be represented as a CHAR? value. There seem
to be no definitive arguments for or against this proposition. I would
be happy to explain in detail an implementation that permits unpaired
surrogates as CHAR? values, and why I prefer such an implementation.

John Cowan and I have both asserted that a problem with allowing
unpaired surrogates as CHAR? values is that there is no standard way
to write them to a UTF-8 or UTF-16 port. That is true, but it is not
an argument for the restriction in 9.14.

What is not clear to me is why the authors favor the restriction and
what kind of arguments, examples, logic etc. to offer in order to
attempt to persuade them otherwise. Would it be helpful for me to
describe an implementation that doesn't have the restriction? Or to
explain how the I/O issues can be addressed? I am hoping it is a
simple matter to drop the restriction on the general principle that
restrictions like that need a strong, positive rationale which, in
this case, is clearly lacking.

Very briefly, therefore: 

In general, the less restricted model is simpler and more powerful. In
an implementation without the restriction, the CHAR? type can simply
be isomorphic with a set of exact integers in some (possibly improper)
superset of [0,#xFFFFFFFF]. That enables things like "bucky bits" (a
fine lisp tradition). It is certainly easy to teach and learn. It
seems to be simpler to implement, too.

The I/O issues can be solved in a clever way -- by reinterpreting
ill-formed UTF-8 and UTF-16 as spellings of sequences of certain
private-use codepoints. Round-trips with processes that don't
understand these private use characters are perfectly robust to the
extent that those processes are conforming.

RESPONSE:

We concede that scalar values (as reflected by the R6RS character data
type) is not a suitable representation for all forms of data. Perhaps
more surprisingly, scalar values turn out to be unsuitable for
representing UTF-16 code units.

In response to the key remark:

 What is not clear to me is why the authors favor the restriction and
 what kind of arguments, examples, logic etc. to offer in order to
 attempt to persuade them otherwise.

Given the complexity of the topic, and given that our areas of
expertise lie elsewhere, the editors have simply chosen to follow
other experts on this point: the Unicode consortium. By our reading,
every consortium standard and recommendation that we find explicitly
prohibits unpaired surrogates, including the the UTF-8 encoding, the
UTF-16 encoding, the UTF-32 encoding, and recommendations for
implementing the ANSI C wchar_t type.

Even UCS-4, which originally permitted a larger range of values that
includes the surroagte range, has been redefined to match UTF-32
exactly. That is, the original UCS-4 range was shrunk and surrogates
were exlcuded.

We would be pursuaded by a published recommendation from the Unicode
consortium that seems to us to unambiguously support your suggestion,
e.g., a recommendation of Unicode code points as a suitable definition
for a "character" datatype.