[R6RS] Source code encoding

Marc Feeley feeley
Mon Mar 7 13:54:57 EST 2005


> Marc> Why would this be interesting, since an ASCII encoded file also happens
> Marc> to be a UTF-8 encoded file?  Why would you want to distinguish these
> Marc> encodings by adding a BOM to UTF-8?
> 
> You don't---you want the BOM to distinguish UTF-8 from UTF-16, not
> ASCII from UTF-8.

Something's strange here.  First of all, there is no need for a BOM in
UTF-8 because UTF-8 is defined as a sequence of bytes, so it has no
byte order to mark.  Second, the bytes 0xFE and 0xFF cannot appear at
the beginning of a valid UTF-8 encoding, so the BOM at the beginning of
a UTF-16 + BOM encoded file cannot be mistaken for UTF-8 encoded
characters.  So the three cases are:

 INITIAL BYTES                         ENCODING
 0xFE 0xFF ....                        A UTF-16 encoded file in big-endian
 0xFF 0xFE ....                        A UTF-16 encoded file in little-endian
 byte other than 0xFE and 0xFF ....    A UTF-8 encoded file

This also has the advantage that you can "peek" the first byte
to know which encoding to use (so if it is not 0xFE or 0xFF you
can call the Scheme parser and the first character it sees will
include the first byte of the file).
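
As a rough illustration of the peek approach, here is a minimal sketch
in Python (not Scheme, and not part of any R6RS proposal).  The names
detect_encoding and read_source are hypothetical, and raising an error
on a lone 0xFE or 0xFF is my own assumption about how to treat a
malformed BOM.

```python
import io


def detect_encoding(stream: io.BufferedReader) -> str:
    """Peek at the start of a binary stream and pick an encoding.

    Follows the three-case table: a 0xFE 0xFF or 0xFF 0xFE prefix means
    UTF-16 (the BOM is consumed), anything else is treated as UTF-8 and
    nothing is consumed.
    """
    # peek() does not advance the stream; it may return more than 2 bytes.
    head = stream.peek(2)[:2]
    if head == b"\xfe\xff":
        stream.read(2)  # skip the BOM so the parser never sees it
        return "utf-16-be"
    if head == b"\xff\xfe":
        stream.read(2)
        return "utf-16-le"
    if head[:1] in (b"\xfe", b"\xff"):
        # Assumption: a lone 0xFE/0xFF is an error, since neither byte
        # can begin valid UTF-8.
        raise ValueError("incomplete or malformed UTF-16 BOM")
    return "utf-8"


def read_source(data: bytes) -> str:
    """Hypothetical helper: decode a whole source file held in memory."""
    stream = io.BufferedReader(io.BytesIO(data))
    encoding = detect_encoding(stream)
    return stream.read().decode(encoding)
```

In the UTF-8 case no bytes are consumed, so the first character the
parser reads still starts at the first byte of the file.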

Marc

