Issue 681960: Source encoding rules are extreme.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/37929

classification

Title:	Source encoding rules are extreme.
Type:		Stage:
Components:	Unicode	Versions:	Python 2.3

process

Status:	closed	Resolution:
Dependencies:		Superseder:
Assigned To:	lemburg	Nosy List:	kirill_simonov, lemburg, ods
Priority:	low	Keywords:

Created on 2003-02-06 22:17 by kirill_simonov, last changed 2022-04-10 16:06 by admin. This issue is now closed.

Messages (10)
msg14485	Author: Kirill Simonov (kirill_simonov)	Date: 2003-02-06 22:17
According to PEP 263, source code that contains non-ASCII characters (ord(ch) > 127) and does not declare an encoding triggers a DeprecationWarning. In the future, such code will raise a SyntaxError.

While I believe that the idea of declaring the source code encoding is very useful, I think that the current solution is unnecessarily extreme. It is very unfriendly to beginners. Imagine a student typing her first script:

    name = raw_input("What's your name? ")  # Russian here, of course
    print "Hi %s!" % name

Do not even try to convince me that she must define an encoding here. That requirement would rule out any possibility of using Python in schools.

Actually, the source code encoding only affects Unicode literals. The above script works the same way with any declared encoding, so the warning for this code is unnecessary.

As a solution, I propose issuing the DeprecationWarning (or SyntaxError) only when a non-ASCII character appears in a Unicode literal.
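To illustrate the distinction drawn above between byte strings and Unicode literals, here is a minimal sketch. It uses modern Python 3 syntax rather than the Python 2.3 under discussion, and the sample word and the two codec names are assumptions chosen purely for illustration. The same bytes are unaffected when treated as a byte string, but decode to different text depending on which encoding is assumed.

```python
# Illustrative only: Python 3 syntax, not the Python 2.3 discussed in this issue.
# The sample word and the codec names are assumptions for the demonstration.
text = "\u041f\u0440\u0438\u0432\u0435\u0442"  # a Russian greeting, written with escapes

# The bytes an editor configured for KOI8-R would write to disk for that word.
raw = text.encode("koi8-r")

# Decoding to Unicode depends on which encoding is assumed for the bytes.
as_koi8 = raw.decode("koi8-r")
as_cp1251 = raw.decode("cp1251")

print(as_koi8 == text)    # the matching codec recovers the word
print(as_cp1251 == text)  # a different codec yields different characters
```

The byte string `raw` itself never changes; only the interpretation as text does, which is the case where an encoding declaration matters.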
msg14486	Author: Marc-Andre Lemburg (lemburg) *	Date: 2003-02-06 22:45
Sorry, but the implementation we chose decodes the complete file, not only the Unicode literals, so if you want to use a specific encoding in the source code, you have to be explicit about it.

Python's source code was originally never meant to contain non-ASCII characters. The PEP implementation now officially allows this, provided that you use an encoding marker, e.g.

    """
    # -*- coding: windows-1251 -*-
    name = raw_input(" ? ")
    print " %s" % name
    """

Note that this is also needed in order to support UTF-16 file formats, which use two bytes per character. Python will automatically detect these files, so if you really don't like the coding marker, simply write the file using a UTF-16 aware editor that prepends a UTF-16 BOM to the file.
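As a hedged illustration of how such a coding marker is recognized, the sketch below uses `tokenize.detect_encoding` from the Python 3 standard library. This is not the Python 2.3 implementation discussed in this thread, and Python 3 falls back to UTF-8 rather than ASCII when no marker is present. The helper name `declared_encoding` and the sample source bytes are hypothetical.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python 3's tokenizer detects for the given source bytes.

    Hypothetical helper for illustration; it only wraps tokenize.detect_encoding.
    """
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


# Sample sources (assumptions): one with a PEP 263 marker, one without.
with_marker = b"# -*- coding: windows-1251 -*-\nname = 'x'\n"
without_marker = b"name = 'x'\n"

print(declared_encoding(with_marker))
print(declared_encoding(without_marker))
```

Only the first two lines of a file are inspected for the marker, which is why it has to appear at the top of the source.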
msg14487	Author: Kirill Simonov (kirill_simonov)	Date: 2003-02-06 23:28
Hello,

Yes, I understand that the encoding is for the whole source file. But:

1. The current implementation already assumes that one uses an ASCII-compatible encoding. So we can go a step further and not use any encoding while reading a source file. Then we would translate u"..." using the 'ascii' encoding.
2. How do you want to support the UTF-16 encoding? This will completely break ordinary string literals! "aa" in source code would become "a\x00a\x00" after compilation. Or am I missing something?
3. Do not forget that your change breaks billions of scripts that use non-ASCII characters, even in comments!
4. I can write a patch. I would be forced to do this anyway.
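The byte pattern mentioned in point 2 can be reproduced directly. This is a small sketch in Python 3, assuming little-endian UTF-16 without a BOM; it only shows what the raw bytes of "aa" look like in each encoding and says nothing about how a compiler would treat them.

```python
# Illustrative only: the raw bytes of the two-character string "aa".
ascii_bytes = "aa".encode("ascii")
utf16_bytes = "aa".encode("utf-16-le")  # little-endian, no BOM (an assumption)

print(ascii_bytes)  # b'aa'
print(utf16_bytes)  # b'a\x00a\x00'
```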
msg14488	Author: Marc-Andre Lemburg (lemburg) *	Date: 2003-02-10 09:43
I've had a private discussion with Guido and Roman Suzi: we'll add a way to set the source code default encoding via the site.py/sitecustomize.py files. This should then allow anyone wishing to customize the default behaviour to do so.
msg14489	Author: Kirill Simonov (kirill_simonov)	Date: 2003-02-10 15:39
I like this. Thanks.
msg14490	Author: Denis S. Otkidach (ods) *	Date: 2003-02-12 14:36
An 8-bit string in Python is just a stream of bytes now. Why should I specify an encoding for inline image data, for instance? And what encoding should I use?
msg14491	Author: Marc-Andre Lemburg (lemburg) *	Date: 2003-02-12 14:49
Encode the 8-bit data as a base64 value and put that into the source code.
msg14492	Author: Denis S. Otkidach (ods) *	Date: 2003-02-12 15:05
Hmm... Is there no type for byte streams in Python anymore? That is too much to change in existing code. Base64 is not the best solution: too many unwanted and slow operations. There are too many areas where we need literals for binary data. One more example: translation tables for different encodings. Yes, I know about unicode/encode/decode etc., but they are _very_ slow for many applications. Use map(ord, [...list of ints...])?
msg14493	Author: Marc-Andre Lemburg (lemburg) *	Date: 2003-02-12 15:28
You shouldn't put binary data into Python source files to begin with. If you absolutely must, then base64 provides a good start for an ASCII encoding. The other alternative is using Python octal escapes. Both are fast.

I don't know where you got the idea that encode/decode are slow. They are certainly faster than first building a list of ints in memory and then applying map() to the list.
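The two alternatives suggested above, base64 and octal escapes, can be sketched as follows. This is written for Python 3 rather than Python 2.3, and the payload (the eight bytes of a PNG signature) is a hypothetical stand-in for real inline binary data. Both forms keep the source file pure ASCII.

```python
import base64

# Hypothetical binary payload for illustration: the eight-byte PNG signature.
raw = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

# Option 1: base64 text, which is plain ASCII and can be pasted into source code.
encoded = base64.b64encode(raw).decode("ascii")
decoded = base64.b64decode(encoded)

# Option 2: an ASCII-only literal with octal escapes for the non-printable bytes.
escaped = b"\211PNG\015\012\032\012"

print(encoded)
print(decoded == raw, escaped == raw)
```

Base64 needs a decoding step at import time, while the escaped literal is usable as-is but is harder to read for long payloads.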
msg14494	Author: Denis S. Otkidach (ods) *	Date: 2003-02-12 16:24
encode/decode is slow compared to translate. Octal/hexadecimal escapes are OK. I've noticed that defining an arbitrary source encoding allows arbitrary binary data in strings (a bit ugly, but OK when this setting is hidden in site.py), so there is no problem even for old code.
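For readers unfamiliar with the translate approach mentioned here, the following Python 3 sketch builds a 256-byte translation table between two single-byte encodings and applies it with `bytes.translate`. The choice of KOI8-R and cp1251 and the sample word are assumptions for illustration; the sketch checks only that both routes give the same bytes and does not measure their relative speed.

```python
# Illustrative only: Python 3 syntax; the codecs and sample word are assumptions.
text = "\u041f\u0440\u0438\u0432\u0435\u0442"  # a Russian greeting, written with escapes

# Build a 256-entry table mapping each KOI8-R byte to its cp1251 byte.
# Characters that cp1251 cannot represent are replaced with b"?".
table = (
    bytes(range(256))
    .decode("koi8-r", errors="replace")
    .encode("cp1251", errors="replace")
)

koi8_data = text.encode("koi8-r")

# Route 1: a single pass over the bytes using the precomputed table.
via_translate = koi8_data.translate(table)

# Route 2: decode to Unicode, then encode to the target encoding.
via_codecs = koi8_data.decode("koi8-r").encode("cp1251")

print(via_translate == via_codecs)
```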

History
Date	User	Action	Args
2022-04-10 16:06:37	admin	set	github: 37929
2003-02-06 22:17:14	kirill_simonov	create