Formal comment #235 (enhancement) String positions and string slices Reported by: John Cowan Version: 5.92 Summary Proposes string positions and string slices to solve problems with O(n) Unicode string operations. Description This proposal is designed to provide a set of procedures for R6RS strings that will preserve the historic linear behavior of most R5RS implementations, while still providing the Unicode scalar value semantics of R6RS characters. These procedures are specifically designed to be equally applicable to representations using fixed-width arrays, variable-width arrays, and trees of various types. Internally, characters in strings may not be laid out in uniform-width cells in memory. Consequently, random access to characters by the natural number of their position in the sequence may not run in constant time. Positions in strings are therefore identified by opaque string-positions. Following the convention of string editors, string-positions identify not characters in the string but the positions between them. String-positions are *not* necessarily disjoint from existing Scheme data types; they may be identical to exact non-negative integers, for example. Each string-position is associated with a set of strings into which it is valid. Strings may be "sliced up" and seen in parts with the string-slice procedure or its derivatives; string-positions that were valid in a string are also valid at corresponding positions in slices of the string, because slices of the string still refer to the same underlying string structure. Because operating on a portion of string is such a common operation, the running time of string-slice is guaranteed to be sublinear in the number of characters within that slice of string; programmers are encouraged to use string-slice frequently. IMPORTANT NOTE: Any use of string-set! on a string invalidates all string positions associated with those string; it is an error to use any of them after that. There is no method by which to find the string associated with a position, primarily because there is no such unique string, but rather a set of strings, some of which are slices of others; and in part to permit efficient implementations of string in which positions are represented as offsets into the string storage that fit in machine registers and require no storage on the heap. Contrariwise, there is no guarantee that string positions are exact non-negative integers, so implementations less focussed on run-time performance may provide enough information to verify the correct use of positions. The following new procedures are proposed for the base library: (string-start-position ) -> position Returns the first position in , before which there are no characters. (string-end-position ) -> position Returns the last position in , after which there are no characters. (position-in-string? ) -> boolean Returns true if is a valid position in , and false if not. must be a string-position in either case. (string-position=? ) -> boolean Returns true if and identify the same position in . Returns false if they identify different positions. It is an error if either of or is not a valid position in . (string-position ) -> boolean (string-position>=? ) -> boolean (string-position>? ) -> boolean (string-position<=? ) -> boolean Procedures imposing a total ordering on string-positions. It is an error if and are not valid positions into . (string-forward []) -> position Returns a string-position for the position in that is characters after , or false if there are fewer than characters after in . If is not supplied, its default value is 1. It is an error if is not a valid position in . (string-backward []) -> position Returns a string-position for the position in that is characters before , or false if there are fewer than characters before in . If is not supplied, its default value is 1. It is an error if is not a valid position in . (string-slice ) -> string Returns a string that contains the sequence of characters in between and . This slice of the string may refer to storage shared by . The running time of string-slice on average must be sublinear in the number of characters between and . For example, string-slice may run in (amortized) constant time, if a string is a triple of internal storage, a start bound, and an end bound, and string-slice need only construct a new triple with tighter bounds; or string-slice may run in logarithmic time, if a string is structured as a tree of content. string-slice may *not* simply copy all of the characters in ; programmers may rely on its performance, and should not be afraid to use it frequently. (string-prefix ) -> string (string-suffix ) -> string string-prefix returns a slice of that contains the sequence of characters before . string-suffix returns a slice of that contains the sequence of characters after . (define (string-prefix string end-position) (string-slice string (string-start-position string) end-position)) (define (string-suffix string start-position) (string-slice string start-position (string-end-position string))) Because a string comprises a sequence of characters, each character in the sequence may be assigned an index (an exact and non-negative number) as used in the string-ref and string-set! procedures. There is an isomorphism between these indices and string-positions; the following procedures deal with this isomorphism. (string-position->index ) -> index Returns the number of characters in that precede , or the number of iterated applications of string-forward to the starting position in necessary to find a position equal to . This operation may run in time proportional to the value of the index. It is an error if is not a valid position in . (define (string-position->index position string) (do ((index 0 (+ index 1)) (position* (string-start-position string) (string-forward position* string))) ((string-position=? position* position) index))) (index->string-position ) -> position Returns a string-position after characters in , or iteratively applies string-forward times to the starting position in . This operation may run in time proportional to the value of the index. It is an error if exceeds the number of characters in . (define (index->string-position index string) (do ((index index (- index 1)) (position (string-start-position string) (string-forward position string))) ((zero? index) position))) (string-position-difference ) -> integer difference Returns the number of characters after and before . If precedes in , this is the additive inverse of the number of characters after and before , i.e. (- (string-position-difference )). This is provided separately so that implementations may provide a more efficient implementation than counting up with string-forward. The following procedures provide a portable interface to searching and functional editing. They are proposed as the (r6rs string) library, but could be moved to a SRFI instead. (string-search-forward ) -> [match-start-position match-end-position] or [#F #F] (string-search-backward ) -> [match-start-position match-end-position] or [#F #F] string-search-forward searches forward through for the first occurrence of the pattern; string-search-backward searches backward through for the last occurrence of the pattern. If a match is found, returns two positions identifying the starting and ending positions of the match in ; if no match is found, returns (values #f #f). If is not a string, the behavior is implementation-dependent. (string-search-forward/ci ) -> [match-start-position match-end-position] or [#F #F] (string-search-backward/ci ) -> [match-start-position match-end-position] or [#F #F] Case-insensitive variants of the preceding two procedures. (string-append ...) -> string (string-concatenate ) -> string (string-join []) -> string These return concatenations of string. string-append returns the concatenation of each .... string-concatenate returns the concatenation of each element of . string-join returns , or an empty string, if all elements of are empty; or a string composed by concatenating , each of the elements of with between each pair of consecutive elements, and . (string-replace ) -> string Returns a string composed of the sequence of characters in before , followed by the sequence of characters in , followed by the sequence of characters in after . It is an error if either of and is not a valid position in . (string-insert ) -> string Returns a string composed of the sequence of characters in before , followed by the sequence of characters in , followed by the sequence of characters in after . It is an error if is not a valid position in . (string-delete ) -> [string deleted-string position] Returns three values: a string composed by deleting the sequence of characters in between and , as if with string-delete; the string that was deleted; and a position identifying the position in the first string where the second string was deleted. RESPONSE: To preserve backwards compatibility for Scheme programs that manipulate characters and strings in a portable way, the editors have taken a conservative approach by retaining Scheme's traditional interfaces to characters and strings while expanding the character type to encompass Unicode scalar values. The next draft will - encourage implementors to make string-ref and string-set! run in O(1) average time, as has traditionally been expected - move string-set! and string-fill! into a separate library, as was done with set-car! and set-cdr! - add string-for-each - add transformations between strings and their encodings as bytevectors in UTF-8, UTF-16, and UTF-32 The editors agree that Scheme needs higher-level operations on Unicode texts. Scheme's traditional strings may assist when implementing those texts, but the editors suspect it would be better to add texts as a new data type than to extend Scheme's mutable strings. The editors also believe the design and specification of this new data type is an important and delicate task that should be given at least as much time and public review as was given to SRFI 75 and to the specifications of strings in various drafts of the R6RS. The editors therefore will not adopt the formal comment's proposal for R6RS, but would like to encourage the submission of a SRFI along these lines, and hope that a new data type of Unicode texts could be included in some future revision of the Scheme reports.