classification
Title: Source encoding rules are extreme.
Type:
Stage:
Components: Unicode
Versions: Python 2.3
process
Status: closed
Resolution:
Dependencies:
Superseder:
Assigned To: lemburg
Nosy List: kirill_simonov, lemburg, ods
Priority: low
Keywords:

Created on 2003-02-06 22:17 by kirill_simonov, last changed 2022-04-10 16:06 by admin. This issue is now closed.

Messages (10)
msg14485 - (view) Author: Kirill Simonov (kirill_simonov) Date: 2003-02-06 22:17
According to PEP 263, source code that contains non-ASCII characters
(ord(ch) > 127) and does not declare an encoding triggers a
DeprecationWarning. In the future, such code will raise a SyntaxError.

While I believe that declaring the source code encoding is very useful,
I think that the current solution is unnecessarily extreme.

It is very unfriendly to beginners. Imagine a student who types her
first script:

name = raw_input("What's your name? ")   # Russian here, of course
print "Hi %s!" % name

Do not even try to convince me that she must declare an encoding here.
Such a requirement would rule out using Python in schools.

In practice, the source code encoding only affects Unicode literals.
The above script works the same way with any declared encoding, so
the warning for this code is unnecessary.

As a solution, I propose issuing the DeprecationWarning (or SyntaxError)
only when a non-ASCII character appears inside a Unicode literal.
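
The condition that triggers the warning can be sketched roughly as follows. This is a minimal illustration written for current Python 3, not the tokenizer code in the 2.3 interpreter; the regular expression is modeled on the one given in PEP 263, and the names `declared_encoding`, `has_non_ascii` and `would_warn` are hypothetical.

```python
import re

# Pattern modeled on the one in PEP 263. A declaration only counts if it
# appears in a comment on the first or second line of the file.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source):
    """Return the declared encoding name, or None if there is none."""
    for line in source.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


def has_non_ascii(source):
    """Return True if any byte lies outside the 7-bit ASCII range."""
    return any(byte > 127 for byte in source)


def would_warn(source):
    """Hypothetical helper: non-ASCII bytes present, no encoding declared."""
    return has_non_ascii(source) and declared_encoding(source) is None
```

Under this sketch the student's script (a plain string literal with Cyrillic bytes and no marker) is flagged, while the same bytes preceded by a coding comment are not.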
msg14486 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2003-02-06 22:45
Sorry, but the implementation we chose decodes the complete file, not
only the Unicode literals, so if you want to use a specific encoding in
the source code, you have to be explicit about it.

Python source code was never originally meant to contain non-ASCII
characters. The PEP implementation now officially allows this, provided
that you use an encoding marker, e.g.

"""
# -*- coding: windows-1251 -*-
name = raw_input("   ? ")
print " %s" % name
"""

Note that this is also needed in order to support UTF-16 file formats,
which use two bytes per character. Python will detect these files
automatically, so if you really don't like the coding marker, simply
write the file using a UTF-16 aware editor that prepends a UTF-16 BOM
to the file.
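
For illustration, detecting a UTF-16 file by its BOM amounts to inspecting the first two bytes. The sketch below uses the BOM constants from the standard `codecs` module in Python 3; it is not the interpreter's own detection code, and the function name `sniff_utf16_bom` is hypothetical. It deliberately ignores UTF-32, whose little-endian BOM starts with the same two bytes.

```python
import codecs


def sniff_utf16_bom(data):
    """Return 'utf-16-le' or 'utf-16-be' if data starts with a UTF-16 BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None  # no UTF-16 BOM found


# The generic "utf-16" codec prepends a BOM in the platform's byte order.
with_bom = "x = 1\n".encode("utf-16")
without_bom = "x = 1\n".encode("ascii")
```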
msg14487 - (view) Author: Kirill Simonov (kirill_simonov) Date: 2003-02-06 23:28
Hello,

Yes, I understand that the encoding applies to the whole source file.

But:

1. The current implementation already assumes an ASCII-compatible
encoding. So we can go a step further and not apply any encoding when
reading a source file, and then decode u"..." literals using the
'ascii' encoding.

2. How do you want to support the UTF-16 encoding? This will completely
break ordinary string literals! "aa" in the source code would become
"a\x00a\x00" after compilation. Or am I missing something?

3. Do not forget that your change breaks billions of scripts that use
non-ASCII characters even in comments!

4. I can write a patch. I would be forced to do this anyway.
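
The byte layout behind point 2 can be checked directly. This is a small sketch in Python 3 notation (where bytes and text are separate types), assuming little-endian UTF-16 without a BOM; it only shows how the two characters are stored on disk, not what the compiler does with them.

```python
text = "aa"

# One byte per character in an ASCII-compatible encoding.
as_ascii = text.encode("ascii")

# Two bytes per character in UTF-16 (little-endian, no BOM).
as_utf16 = text.encode("utf-16-le")

# Decoding the whole file before compiling recovers the original text,
# which is why the decoder must run over the complete source.
round_trip = as_utf16.decode("utf-16-le")
```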

msg14488 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2003-02-10 09:43
I've had a private discussion with Guido and Roman Suzi:

We'll add a way to set the source code default encoding via the
site.py/sitecustomize.py files. This should then allow anyone
wishing to customize the default behaviour to do so.
msg14489 - (view) Author: Kirill Simonov (kirill_simonov) Date: 2003-02-10 15:39
I like this. Thanks.
msg14490 - (view) Author: Denis S. Otkidach (ods) * Date: 2003-02-12 14:36
An 8-bit string in Python is just a stream of bytes now. Why should I specify
an encoding for inline image data, for instance? And which encoding should I
use?
msg14491 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2003-02-12 14:49
Encode the 8-bit data as base64 and put that into the source code.
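
A minimal sketch of that suggestion, using the standard `base64` module in Python 3. The sample bytes are a stand-in (the 8-byte PNG signature), not data from this report.

```python
import base64

# Stand-in binary data: the 8-byte PNG file signature.
raw = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

# Encode once, outside the program, and paste the resulting ASCII text
# into the source file as an ordinary string literal.
encoded = base64.b64encode(raw).decode("ascii")

# At run time, decode the literal back into the original bytes.
restored = base64.b64decode(encoded)
```

The source file then contains only ASCII characters, at the cost of one decode call when the data is first needed.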
msg14492 - (view) Author: Denis S. Otkidach (ods) * Date: 2003-02-12 15:05
Hmm... Is there no type for byte streams in Python anymore? That is too much to
change in existing code. Base64 is not the best solution: it adds too many
unwanted and slow operations. There are too many areas where we need
literals for binary data. One more example: translation tables for different
encodings. Yes, I know about unicode/encode/decode etc., but they are
_very_ slow for many applications. Should we use map(chr, [...list of ints...])?
msg14493 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2003-02-12 15:28
You shouldn't put binary data into Python source files to begin with.
If you absolutely must, then base64 provides a good start for an ASCII
encoding. The other alternative is using Python octal escapes. Both are
fast.

I don't know where you got the idea that encode/decode are slow. They
are certainly faster than first building a list of ints in memory and
then applying map() to the list.
msg14494 - (view) Author: Denis S. Otkidach (ods) * Date: 2003-02-12 16:24
encode/decode is slow compared to translate. Octal/hexadecimal escapes
are OK. I've noticed that declaring an arbitrary encoding for the source allows
arbitrary binary data in strings (a bit ugly, but OK when this setting is
hidden in site.py), so there is no problem even for old code.
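
The escape-based approach mentioned here can be sketched as follows, in Python 3 syntax. The mapping itself is a made-up example (bytes 0xE0 to 0xE2 mapped to "abc"), not a real encoding table; the point is only that both the data and the translation table can be written in pure ASCII source.

```python
# Binary data written with octal escapes, so the source stays pure ASCII.
data = b"\340\341\342!"  # the bytes 0xE0, 0xE1, 0xE2 followed by "!"

# Build a 256-entry translation table: identity everywhere except for
# the three bytes remapped in this hypothetical example.
table = bytearray(range(256))
table[0xE0:0xE3] = b"abc"
table = bytes(table)

# translate() maps each byte through the table in a single pass.
result = data.translate(table)
```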
History
Date                 User            Action  Args
2022-04-10 16:06:37  admin           set     github: 37929
2003-02-06 22:17:14  kirill_simonov  create