[R6RS] Source code encoding

Tue Mar 15 08:33:19 EST 2005

> Marc> Which decoders are you referring to?  Are these really
> Marc> widespread? 
> 
> Anything based on the author looking at Gillam's book, which has a
> table "If the file starts with ..., it's ...".  Part of the problem
> for me is that I have a poor sense as to what Unicode tools are out
> there.  But the same holds true for the other side of the equation,
> namely how many tools we care about don't support UTF-8.

Depends what you mean by "we".  There are plenty of Windows users who
use Notepad and WordPad for programming.  Do we care?  I actually do,
but I may be in the minority.  WordPad saves Unicode files as UTF-16 +
BOM, but not as UTF-8.  In Notepad you can save as UTF-16 + BOM or as
UTF-8 + BOM, but not as plain UTF-8.  There's also the issue of "old"
versions of an operating system or editor that do not support UTF-8
(e.g. earlier versions of Windows that nevertheless support UTF-16 +
BOM).

> >> - Because the perceived (by me) complexity.
> 
> Marc> You mean implementation complexity?  
> 
> No, I meant issue complexity.
> 
> Marc> How about:
> 
> Marc> (define (determine-encoding port)
> Marc>   (case (peek-byte port)
> Marc>      ((#xFE #xFF) 'UTF-16+BOM)
> Marc>      ((#xEF)      'UTF-8+BOM)
> Marc>      (else        'UTF-8)))
> 
> Does *this exact* code actually work in Gambit-C?

Not currently, but it is on the TODO list (as you know, Gambit treats
files as streams of bytes ***and*** streams of characters, so it is
not difficult to implement peek-byte on text file ports).  Anyway, the
code was simply meant to show that not much logic is needed to
support all three encodings.
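
A slightly fuller sketch of the same idea, for illustration only.  It
works on a list of the file's leading bytes rather than on a port, so
it does not depend on peek-byte.  The names starts-with? and
determine-encoding-from-bytes are made up for this example; they are
not part of Gambit or of any R6RS proposal.  It compares the whole
signature (EF BB BF for UTF-8, FE FF or FF FE for UTF-16), since a
BOM-less UTF-8 file can legitimately begin with the byte EF.

```scheme
;; Hypothetical helper: #t if the list BYTES begins with the list PREFIX.
;; Both are lists of exact integers in the range 0..255.
(define (starts-with? bytes prefix)
  (cond ((null? prefix) #t)                 ; whole prefix matched
        ((null? bytes) #f)                  ; ran out of input first
        ((= (car bytes) (car prefix))
         (starts-with? (cdr bytes) (cdr prefix)))
        (else #f)))

;; Hypothetical variant of determine-encoding that takes the leading
;; bytes of a file and returns one of the same three symbols.
(define (determine-encoding-from-bytes bytes)
  (cond ((starts-with? bytes '(#xEF #xBB #xBF)) 'UTF-8+BOM)
        ((starts-with? bytes '(#xFE #xFF))      'UTF-16+BOM) ; big-endian
        ((starts-with? bytes '(#xFF #xFE))      'UTF-16+BOM) ; little-endian
        (else                                   'UTF-8)))
```

For example, (determine-encoding-from-bytes '(#xFF #xFE #x28 #x00))
yields the UTF-16+BOM symbol, while a list starting with #x28 (an open
parenthesis in ASCII) falls through to plain UTF-8.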

Marc