[R6RS] Source code encoding

Marc Feeley feeley
Tue Mar 15 08:33:19 EST 2005


> Marc> Which decoders are you referring to?  Are these really
> Marc> widespread? 
> 
> Anything based on the author looking at Gillam's book, which has a
> table "If the file starts with ..., it's ...".  Part of the problem
> for me is that I have a poor sense as to what Unicode tools are out
> there.  But the same holds true for the other side of the equation,
> namely how many tools we care about don't support UTF-8.

Depends what you mean by "we".  There are plenty of Windows users who
use Notepad and WordPad for programming.  Do we care?  I actually do,
but I may be in the minority.  WordPad saves Unicode files as UTF-16 +
BOM, but not as UTF-8.  In Notepad you can save as UTF-16 + BOM, or
UTF-8 + BOM, but not as plain UTF-8.  There's also the issue of "old"
versions of an operating system or editor that do not support UTF-8
(e.g. earlier versions of Windows that nevertheless support UTF-16 +
BOM).

> >> - Because the perceived (by me) complexity.
> 
> Marc> You mean implementation complexity?  
> 
> No, I meant issue complexity.
> 
> Marc> How about:
> 
> Marc> (define (determine-encoding port)
> Marc>   (case (peek-byte port)
> Marc>      ((#xFE #xFF) 'UTF-16+BOM)
> Marc>      ((#xEF)      'UTF-8+BOM)
> Marc>      (else        'UTF-8)))
> 
> Does *this exact* code actually work in Gambit-C?

Not currently, but it is on the TODO list.  As you know, Gambit treats
files as streams of bytes *and* streams of characters, so it is not
difficult to implement peek-byte on text file ports.  Anyway, the code
was simply meant to show that not much logic is needed to support all
three encodings.
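
For illustration, here is a slightly fuller sketch of the same idea.
It is not the code from the quoted message and makes several
assumptions: it uses R7RS (scheme base) bytevectors rather than a
peek-byte on a port, it takes the first few bytes of the file as an
argument, and it distinguishes the two UTF-16 byte orders.  The
procedure name and the symbols it returns are only illustrative.

```scheme
(import (scheme base))

;; Classify an encoding from the leading bytes of a file.
;; bv is a bytevector holding (at least) the first three bytes,
;; or fewer if the file is shorter than that.
(define (determine-encoding bv)
  (let ((n (bytevector-length bv)))
    ;; True if bv has a byte at index i and it equals b.
    (define (byte-at? i b)
      (and (< i n) (= (bytevector-u8-ref bv i) b)))
    (cond
      ;; FE FF: UTF-16 byte order mark, big-endian
      ((and (byte-at? 0 #xFE) (byte-at? 1 #xFF)) 'UTF-16BE+BOM)
      ;; FF FE: UTF-16 byte order mark, little-endian
      ((and (byte-at? 0 #xFF) (byte-at? 1 #xFE)) 'UTF-16LE+BOM)
      ;; EF BB BF: the BOM code point encoded as UTF-8
      ((and (byte-at? 0 #xEF) (byte-at? 1 #xBB) (byte-at? 2 #xBF))
       'UTF-8+BOM)
      ;; No recognized BOM: assume plain UTF-8
      (else 'UTF-8))))
```

For example, (determine-encoding (bytevector #xEF #xBB #xBF 40))
classifies the input as UTF-8 + BOM, and an empty bytevector falls
through to plain UTF-8.  Checking all of the BOM bytes, rather than
only the first, costs a few more comparisons but avoids misreading a
plain UTF-8 file that happens to start with a non-ASCII character.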

Marc

