[R6RS] Source code encoding

Tue Mar 15 08:33:19 EST 2005

I understand your arguments, but don't agree with your conclusions.

> Marc> Now you are advocating for using UTF-8 only.  Why not allow UTF-16 +
> Marc> BOM also, since it does not conflict in any way with UTF-8 and UTF-16
> Marc> + BOM is the norm on Windows for encoding Unicode text files?  What is
> Marc> the downside of supporting both of these popular Unicode encodings?
> 
> - Because there are standard decoders out there where you can say
>   "UTF-xx + BOM" where the auto-detection wouldn't work in the setup
>   you describe.

Which decoders are you referring to?  Are these really widespread?  I
suspect it might be easiest to ask the developers of these decoders to
also have a mode to autodetect between UTF-8 and UTF-16 + BOM, since
that is possible and reasonable in itself.

> - Because, if we allow two different concrete encodings now, we might
>   want to add a third one in the future, and it's not clear that
>   leaving out the BOM on one of them where it's actually allowed will
>   scale.

But an important reason for leaving out the BOM for UTF-8 is to allow
shell scripts, whose first two bytes must be "#!".  This alone precludes
using BOMs with UTF-8, so we have to give up on autodetection between
all possible Unicode encodings.

> - Because this auto-detection based on a tag that isn't there always
>   makes me feel queasy, and doesn't seem very robust.

I don't understand.  It is possible to distinguish (with no ambiguities)
the following encodings:

   - UTF-16 + BOM
   - UTF-8 + BOM
   - UTF-8

So why do you say it is not robust?

> - Because the perceived (by me) complexity.

You mean implementation complexity?  How about:

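(define (determine-encoding port)
  ;; Look only at the first byte of the file.
  (case (peek-byte port)
    ((#xFE #xFF) 'UTF-16+BOM)   ; FE FF or FF FE: UTF-16 BOM
    ((#xEF)      'UTF-8+BOM)    ; EF BB BF: UTF-8 BOM
    (else        'UTF-8)))      ; anything else: UTF-8 without BOM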
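
If checking only the first byte seems too loose, here is a slightly
more careful sketch that matches the complete BOM and also reports the
UTF-16 byte order.  It relies on the fact that the bytes FE and FF
never occur in well-formed UTF-8, so a BOM-less UTF-8 file cannot be
mistaken for UTF-16.  To keep it independent of any particular
byte-port API, it assumes the first few bytes of the file are already
available as a list of integers; the names starts-with? and
determine-encoding-from-bytes are made up for this example.

```scheme
;; Sketch only.  `bytes` is a list of integers in 0..255 holding the
;; first bytes of the file (a hypothetical input format chosen so that
;; no byte-port primitives are needed).

;; True if the list `prefix` is an initial segment of the list `bytes`.
(define (starts-with? bytes prefix)
  (cond ((null? prefix) #t)
        ((null? bytes) #f)
        ((= (car bytes) (car prefix))
         (starts-with? (cdr bytes) (cdr prefix)))
        (else #f)))

;; Classify the encoding from the leading bytes.
(define (determine-encoding-from-bytes bytes)
  (cond ((starts-with? bytes '(#xFE #xFF))      'UTF-16BE+BOM)
        ((starts-with? bytes '(#xFF #xFE))      'UTF-16LE+BOM)
        ((starts-with? bytes '(#xEF #xBB #xBF)) 'UTF-8+BOM)
        (else                                   'UTF-8)))
```

For example, (determine-encoding-from-bytes '(#xEF #xBB #xBF #x28))
selects the UTF-8+BOM case, while a file starting with "#!" (bytes
#x23 #x21) falls through to plain UTF-8, so shell scripts keep working.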

Marc