Formal comment #229 (enhancement)

ports, characters, strings, Unicode
Reported by: 	Thomas Lord
Version: 	5.92

Type of Issue: Simplification/Enhancement/Defect

R6RS components: base library, concepts, formal syntax, I/O, Lexical
Syntax, Unicode Synopsis:

    Conformant implementations should not be *required* to support any
    characters beyond the portable character set of R5RS.

    The report should define a standard way to extend beyond the
    portable character set by addition of characters corresponding to
    Unicode scalar values.

    The report should recognize and honor a role for a character type
    that transcends the specifics of Unicode and encompasses discrete
    communications channels in general. In particular, the report
    should permit the inclusion of characters which do not correspond
    to Unicode scalar values.

    The fundamental conformance requirement of an implementation
    should explicitly pertain to observable consequences of running a
    program, principly reflected as operations on ports.

Disclaimers:

    This comment is incomplete: some changes are indicated but not
    fully spelled out; some needed changes (under the premise of this
    comment) have no doubt been missed; the proposed substitute
    wording is, at best, a rough first draft; the notion of permitting
    implementations to support less than all of Unicode has broad
    implications that merit discussion; the implications of the
    proposals herein have not explained, here, for the standard
    libraries.

Full Description:

    I propose a number of changes to the treatment of ports, characters and strings.

    * Change to "Summary", page 1 

    For

            "Chapter 2 explain's Scheme's number types"

    Substitute

            "Chapter 2 explains several of Scheme's fundamental types."

    * Changes to "1.1 Basic Types", page 7 

    Retitle:

                1.1 Fundamental Types

    For

            Characters

            Scheme characters mostly correspond to textual
            characters. More precisely, they are isomorphic to the
            scalar values of the Unicode standard.

            Strings

            Strings are finite sequences of characters with fixed
            length and thus represent arbitrary Unicode texts.

    Substitute

            Ports

            A port is an object representing one end of a discrete
            communications channel over which Scheme programs can
            transmit and/or receive characters selected from a finite
            alphabet associated with the port.

            Characters

            Character objects represent characters such as are
            transmitted and received over a communication channel
            associated with a port. Most commonly, character objects
            correspond to Unicode scalar values and are used as
            primitive elements when representing textual data.

            Strings

            A string is a linear data structure representing a finite
            sequence of arbitrary characters. Elements of a string are
            addressed by an integer index. For example, a Unicode text
            can be usefully represented as a string.

    * Chapter 2, "Numbers", pages 10 and 11 

    Retitle the chapter: "Fundamental Types"

    Renumber the entire current content of Chapter 2, "2.1" (renumbering the current "2.1" to "2.1.1", etc.)

    For

                "This chapter describes Scheme's representations for numbers" (page 10)

    Substitute

                "This section describes Scheme's representations for numbers" (page 10)

    Add a new introduction:

                ~2. Fundamental Types

                This chapter explains several of Scheme's fundamental types.

    Add a new section:

                2.2 Ports, Characters, and Strings

                This section describes Scheme's mechanisms and
                representation for synchronous communication between
                Scheme programs and processes which are external to
                the execution of a program. Thus, ports characters,
                and strings comprise an important part of Scheme's
                model for the formally observable side effects of
                running a program and the model for observations of
                external events which may effect a running program.

                Often but not always, such observable communication
                conveys textual information. Thus, it is useful to
                first explain these types beginning with an abstract
                mathematical model of communication, and then to
                explain how that model applies specifically to textual
                information.

                2.2.1 Program Execution as World-line

                            and Implementation Correctness

                Conceptually, for the purpose of understanding the
                observable consequences of running a program, the
                execution of a Scheme program corresponds to a
                relativistic world-line. Information about events
                external to a running program become available to that
                program at a specific point on the execution's
                world-line when the program explicitly completes a
                step to receive that information. Similarly,
                information from the running program becomes
                externally observable when explicitly transmitted at a
                specific point on the execution's world-line. In
                portable programs, all transmissions and receipt of
                information are comprised of discrete atomic events --
                the conveyance of a single character via a port -- and
                these are totally ordered along the conceptual
                world-line of a program. Each is a unique
                event. Implementations are permitted, however, to make
                extensions which allow for simultaneous transmissions
                and/or receipts.

                In an important sense, the transmission and receipt
                events that occur as a Scheme program runs are the
                *only* formally observable consequence of running the
                program. An implementation is correct, in an important
                sense, provided only that these events occur as
                specified and in a permitted order when running a
                portable program.

                It should be noted that, while the order of
                communication events on the world-line of a running
                program is formally well-defined, that order is not
                directly observable. That is to say that external
                observations of and transmissions to a Scheme program
                may occur, from the perspective of external observers,
                in a different order, and possibly with loss of
                information. Only causality relationships, as imposed
                externally and as implied by execution-order rules in
                this report, define a partial ordering of
                communications events upon which all observers can, in
                principle, agree.

                [This section should cite the source of its conceptual
                model of communication, the paper:

                    "The Mutual Exclusion Problem: Part I -- A Theory

                        of Interprocess Communication", Leslie
                        Lamport; Journal for the Association of
                        Computing Machinery; Volume 33, Number 2,
                        April 1986.

                ]

                2.2.2 Ports as Discrete Communication Channel Terminals

                Scheme adopts a mathematical model of communication
                based on discrete communication channels. Each channel
                is associated with a finite, abstract alphabet. The
                channel conveys letters from that alphabet in one or
                both directions, one at a time. For example, the size
                of the alphabet, together with the number of letters
                than can be conveyed in a unit of time, determine the
                bandwidth of the channel.

                A port object represents a Scheme program's direct
                interface to one end of such a communication's
                channel. It is through a port object that a program
                transmits and receives on the channel. It is
                noteworthy that a port represents only one terminal
                point on the channel: the physical channel itself as
                well as the terminal point(s) of external processes
                are not directly accessible to the program.

                In this model of communication, we make no a priori
                assumptions about the alphabet whose letters are
                conveyed, other than it is finite. In particular,
                distinct ports may use different alphabets.

                When two ports use different alphabets, it is
                sometimes useful to treat the alphabets as disjoint
                sets and othertimes useful to identify letters in one
                alphabet with letters in another. An example of the
                latter case can be seen by comparing an ASCII-only
                channel to a Unicode scalar value channel: it is often
                desirable to treat ASCII as a subset of Unicode. An
                example of usefully disjoint alphabets can be seen by
                comparing a Unicode channel, used to convey textual
                information, to channel used to control a certain
                style of traffic signal, on which a program wishes to
                transmit letters that correspond to "red", "yellow",
                and "green".

                It is, nevertheless, the case that many useful
                procedures reasonably operate generically on all
                letters, without regard to which alphabet they come
                from. For example, if a procedure is intended to
                concatenate finite sequences of letters ("strings", in
                Scheme) the same implementation for that procedure
                suffices regardless of whether the sequence comprises
                text, traffic signals, or some mix of these. For that
                reason, Scheme includes the fundamental type
                "character", which contains all letters from all
                alphabets supported by an implementation.

                [This section should cite the source of the
                mathematical model of communication to which it
                refers, such as:

                    "The Mathematical Theory of Communication",

                        Claude E. Shannon and Warren Weaver; University of Illinois Press; 1963

                ]

                2.2.3 Unicode Scalar Values: A Portable, Textual Alphabet

                This report defines certain character values which
                must be supported by all implementations and others
                which may be supported by any implementation but only
                in specified ways. Together, these comprise the
                Unicode scalar values and they are included in Scheme
                so that portable programs may reliably manipulate
                textual information in the broadest practical range of
                human languages and, more specifically, to that
                portable Scheme program can reliably manipulate the
                source text of portable Scheme programs.

                Unicode scalar values are formally defined by an
                established but evolving standard, "The Unicode
                Standard," as published by The Unicode
                Consortium. Informally speaking, the scalar values
                "roughly correspond" to the character-like elements of
                human writing systems however, in its details the
                exact relationship to writing systems is complex and
                readers are referred to The Unicode Standard for a
                complete explanation.

                2.2.4 Character Order

                Communications channel alphabets in general, and
                Unicode in particular, are frequently defined by
                standards procedures which are external to the process
                which defines Scheme. Frequently, as with Unicode
                scalar values, a total ordering of the letters within
                an alphabet are included in the definition.

                Consequently, Scheme includes procedures which compare
                two or more characters for their ordering. Portable
                program may rely on Unicode scalar values being
                well-ordered and on that order corresponding to the
                definitions of The Unicode Standard.

                When characters represent letters from either an
                unordered alphabet or from disjoint alphabets, the
                ordering imposed on them may be implementation
                specific or the characters may be unordered. Thus,
                portable programs which assume that all characters
                they encounter are well-ordered may cause errors if
                run in implementations and contexts that present these
                programs with non-portable characters. Nevertheless,
                it is generally reasonable for portable programs that
                are concerned mainly with Unicode scalar values to
                assume that all characters they encounter will be
                well-ordered.

                2.2.5 Character Enumeration

                Similarly, external standards, The Unicode Standard in
                particular, often define a mapping from the letters of
                an abstract alphabet to (usually non-negative) exact
                integer values.

                Because of the central importance of enabling portable
                programs to reliably manipulate textual data, this
                report requires implementations to convert Unicode
                scalar values to the corresponding integer, and vice
                versa. Implementations are permitted but not required
                to include additional characters that can be converted
                to and from integers, provided they satisfy this
                Unicode requirement.

                Implementations may include characters for which there
                is no conversion to and from integers, using the
                standard procedures defined herein. Nevertheless, it
                is generally reasonable for portable programs that are
                concerned mainly with Unicode scalar values to assume
                that all characters they encounter will be convertable
                to and from integers.

                2.2.6 Strings and String Ordering

                Ports, by definition, convey characters, one at a
                time. It is commonly necessary, especially when
                textual information is being manipulated, to manage
                finite sequences of characters.

                Scheme's string objects represent finite sequences of
                arbitrary characters.

                When two strings are comprised entirely of
                well-ordered characters, a natural lexical ordering of
                the strings may be inferred. In the case of characters
                corresponding to Unicode scalar values, that ordering
                is an imperfect but frequently useful approximation of
                the lexical linguistic ordering of texts.

                2.2.7 Characters, Strings, and Case Conversions

                The lexical syntax of Scheme relies upon certain very
                limited forms of case conversion among textual
                letters. These conversions are a subset of a standard,
                linguistically approximate case conversion among
                Unicode scalar values. Scheme includes procedures
                which effect these conversions, as well as their
                natural character-wise extensions to strings.

                2.2.8 Ports, Characters, and Strings: A Summary

                Ports are communication channel end-points held by a
                running Scheme program. Characters are letters, from
                finite abstract alphabets, conveyed over these
                channels. Strings are finite sequences of characters.

                Portable programs must restrict themselves to
                characters corresponding to Unicode scalar
                values. These characters are well-ordered and
                correspond to standardized integer values. A
                linguistically approximate case conversion is defined
                among these characters.

                Implementations may extend the character type (and by
                implication, the port and string types) with
                additional characters. The full set of characters
                supported by an implementation may be well-ordered but
                need not be.

    [or words to similar effect]

    * Chapter 3, "Lexical syntax and read syntax" 

    In general, implementations should not be required to support more
    than a minimal portable character set while, at the same time,
    there should be only one permitted way to add support for fully
    general Unicode scalar value characters.

    In 3.2.1 ("Formal Account" p. 12) the definition of <consitutent> is too strong.

    For

                <any character whose Unicode scalar value....>

    Substitute

                <any character, supported by the implementation, whose Unicode scalar value ....>

    In 3.2.3, p.14:

    For

                Moreover, all characters whose...

    Substitute

                Moreover, all chacters supported by an implemtnation, whose

    Similar fixes to 3.2.5, p. 14.

    In 3.2.6, p 15, the definition of "\x" notation needs similar fixes.

    * Chapter 4, section 4.3, "Exceptional situations", p. 18 

    It is unclear whether or not it is intended to permit
    implementations to use the condition system as a means to
    asynchronously communicate information to an application.

    If so, slight changes are merited to the proposed addition of
    section 2.2 ("Ports, Characters, and Strings") above.

    [Note: it is a matter worthy of explicit debate whether or not the
    condition system should be used for asynchronous communication.]

    * Chapter 9, Section 9.1, "Base Types" 

    Add "port?" to the list. I suggest renaming the section,
    "Fundamental types" because "base" carries too many overtones from
    the vocabulary of object oriented programming languages.

    Ports should be considered a fundamental type for reasons given in
    the proposed addition of 2.2 ("Ports, Characters, and Strings"),
    above.

    * Chapter 9, Section 9.13, "Characters", p. 49 

    Insert a section here introducing ports.

    * Chapter 9, Section 9.13, "Characters", p. 49ff 

    For

        *Characters* are objects that represent Unicode scalar values[46].

    Substitute

        *Characters* are objects that represent abstract letters from
        a communications channel (port) alphabet.

    For

        *Note:* Unicode defines .... (whose code is in the range #x10000 to #X10FFFF).

    Substitute

        All implementations of scheme are required to support the
        characters [as per the R5 portable character set].

        Implementations should additionally support a larger character
        set corresponding to Unicode scalar values.

    For

            [the definitions of char->integer and integer->char]

    Substitute

            (char->integer /char/) procedure (integer->char /int/) procedure

            For characters with an integer mapping (see section 2.2)
            these procedures implement a bijective mapping between
            characters and integers. In particular, characters which
            correspond to Unicode scalar values must be mapped to the
            corresponding exact integer.

            For other characters which an implementation may support,
            these procedures have unspecified behavior and return
            values.

    For (p.50)

                These procedures impose a total ordering on the set of
                characters according to their Unicode scalar values.

    Substitute

                These procedures define a partial ordering among
                characters. For characters with an integer mapping (as
                given by char->integer) the ordering among characters
                is the same as the ordering of the corresponding
                integers.


RESPONSE:

We agree that implementations should be able to extend the set of
characters beyond Unicode scalar values. However, we believe that such
extensions can be added in libraries other than the standard ones, and
made invisible to the standard operations, so that the standard need
not specifically address the issue. At the same time, the editors are
convinced that defining `character' to Unicode scalar values --- as
far as the standard operations are concerned --- is an important step
in promoting portability of Scheme programs.

Discussion on the r6rs-discuss mailing list covered a broad range of
issues. In addition to the above decision, the editors agreed on
several changes and non-changes:

 - strings are mutable, but `string-set!' will be moved to a separate
   library, much like the handling of `set-car!' and `set-cdr!'

 - `string-ref' will remain in the standard

 - implementors will be encouraged to provide `string-ref' and
   `string-set!' that provide results in O(1) time

 - `string-for-each' will be added

 - characters correspond to Unicode scalar values (as opposed, in
   particular, UTF-16 code units)