[R6RS] Source code encoding

Marc Feeley feeley
Mon Mar 7 15:07:56 EST 2005


> Marc> Something's strange here.  First of all there is no need for a BOM in
> Marc> UTF-8 because UTF-8 is a sequence of bytes. [...]
> 
> For an explanation, check 
> 
> http://www.unicode.org/faq/utf_bom.html#BOM

But this reference also says that a BOM on UTF-8 is only useful as a
signature to distinguish it from other encodings such as UTF-32 and
Latin-1, and we would not use these encodings anyway.  Moreover, have
you read this part:

   Q: Can a UTF-8 data stream contain the BOM character (in UTF-8
      form)?  If yes, then can I still assume the remaining UTF-8
      bytes are in big-endian order?

   A: Yes, UTF-8 can contain a BOM. However, it makes no difference as
      to the endianness of the byte stream. UTF-8 always has the same
      byte order. An initial BOM is only used as a signature -- an
      indication that an otherwise unmarked text file is in
      UTF-8. Note that some recipients of UTF-8 encoded data do not
      expect a BOM. Where UTF-8 is used transparently in 8-bit
      environments, the use of a BOM will interfere with any protocol
      or file format that expects specific ASCII characters at the
      beginning, such as the use of "#!" at the beginning of Unix
      shell scripts.

Requiring a BOM would mean that you can't use a UTF-8 encoded Scheme
source file as a shell script.  That would be bad.
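
To make the problem concrete, here is a minimal Python sketch of the
check involved.  The interpreter path and the helper name
starts_with_shebang are made up for illustration, and the two-byte
comparison only approximates what a Unix kernel does when it decides
whether a file is a script:

```python
import codecs

# Hypothetical first line of a Scheme script; the interpreter path is illustrative.
shebang_line = "#!/usr/bin/env scheme-script\n"

plain = shebang_line.encode("utf-8")   # file saved without a BOM
with_bom = codecs.BOM_UTF8 + plain     # same file with EF BB BF prepended

def starts_with_shebang(data: bytes) -> bool:
    """Approximate the script check: the first two bytes must be '#' and '!'."""
    return data[:2] == b"#!"

print(starts_with_shebang(plain))     # True
print(starts_with_shebang(with_bom))  # False: the file now starts with EF BB
```

The three BOM bytes push "#!" to offset 3, so the file is no longer
recognized as a script.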

I maintain that allowing UTF-16 + BOM and UTF-8 is a good compromise
(it covers the two most popular Unicode file encodings, allows shell
scripts, plain ASCII files need not be changed, and a wide range of
editors can be used).  We could, however, add that an initial BOM
on a UTF-8 encoded file is ignored.
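
As a rough sketch of how a reader could apply this rule, assuming the
whole file is available as bytes (decode_source is a hypothetical
helper, not part of any proposal text): a UTF-16 BOM selects UTF-16 in
the indicated byte order, an initial UTF-8 BOM is skipped, and
everything else is treated as UTF-8, which covers plain ASCII
unchanged.

```python
import codecs

def decode_source(data: bytes) -> str:
    """Decode source bytes under the proposed rule (hypothetical helper).

    UTF-16 requires a BOM; UTF-8 needs none, and an initial UTF-8 BOM
    is ignored. UTF-32 is deliberately not handled, since the proposal
    does not allow it.
    """
    if data.startswith(codecs.BOM_UTF16_BE):      # FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):      # FF FE
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF8):          # EF BB BF, skipped
        return data[3:].decode("utf-8")
    return data.decode("utf-8")                   # includes plain ASCII
```

For example, decode_source(b"(display 1)") returns the text unchanged,
and the same text prefixed with codecs.BOM_UTF8 decodes to the same
string.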

Marc

