Issue 1324237: ISO8859-9 broken


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/42469

classification

Title:	ISO8859-9 broken
Type:		Stage:
Components:	Unicode	Versions:	Python 2.4

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:	lemburg	Nosy List:	exa, lemburg
Priority:	normal	Keywords:

Created on 2005-10-11 21:35 by exa, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (9)
msg26561	Author: Eray Ozkural (exa)	Date: 2005-10-11 21:35
Probably not limited to ISO8859-9. The problem is that the encodings returned by getlocale() and getpreferredencoding() are not guaranteed to work with, say, the encode method of a string. I'm on MDK10.2 and I switch to the Turkish locale:

    >>> locale.setlocale(locale.LC_ALL, '')
    'tr_TR'

There is nothing in sys.stdout.encoding!

    >>> sys.stdout.encoding
    >>>

So I take a look at the encoding:

    >>> locale.getlocale()
    ['tr_TR', 'ISO8859-9']
    >>> locale.getpreferredencoding()
    'ISO-8859-9'

Too bad I cannot use either encoding to encode innocent unicode strings:

    >>> a = unicode('André','latin-1')
    >>> print a.encode(locale.getpreferredencoding())
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    LookupError: unknown encoding: ISO-8859-9
    >>> print a.encode(locale.getlocale()[1])
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    LookupError: unknown encoding: ISO8859-9

So I take a look at the Python page and I see that all encoding names are in lowercase. That's no good, because:

    >>> locale.getpreferredencoding().lower()
    '\xfdso-8859-9'

(see bug 1193061)

So I have to do this by hand! But of course this is unacceptable for any locale-aware application.

    >>> print a.encode('iso-8859-9')
    André

Expected:
1. I expect the encoding strings returned by getpreferredencoding and getlocale to be identical.
2. I expect the encoding string returned to work with the encode method and, in general, with any function that accepts locales.

Got:
1. Different, ad hoc strings.
2. Not all aliases present, only lowercase names present, no reliable way to find a canonical locale name.

Recommendations:
a. Please consider the Java-like solution of making Locale a class or an enum, something reliable, rather than just a string.
b. Please test the locale functions in locales other than US (that is not really a locale anyway).
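One way an application could guard against a locale-reported name that the codec registry does not recognize is to resolve it before use. This is only a sketch for a current Python 3 interpreter, not the Python 2.4 setup in the report; the helper name `resolve_encoding` and the UTF-8 fallback are assumptions made for illustration and were not proposed in this thread.

```python
import codecs
import locale


def resolve_encoding(name, fallback="utf-8"):
    """Return the codec registry's name for `name`, or for `fallback` if unknown.

    Hypothetical helper: the fallback policy is an assumption, not stdlib behavior.
    """
    try:
        return codecs.lookup(name).name
    except LookupError:
        return codecs.lookup(fallback).name


# Passing False asks getpreferredencoding() not to call setlocale() itself.
preferred = locale.getpreferredencoding(False)
encoding = resolve_encoding(preferred)

# errors="replace" keeps the call from failing if the codec cannot represent "é".
print("André".encode(encoding, errors="replace"))
```

Resolving once at startup keeps the rest of the program working with a name the registry is known to accept.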
msg26562	Author: Eray Ozkural (exa)	Date: 2005-10-11 21:46
BTW, I put this into the Unicode category because the bugs in it seemed relevant to localization. Thank you very much for your consideration.
msg26563	Author: Marc-Andre Lemburg (lemburg) *	Date: 2005-10-21 14:12
Something in your installation must be broken: it seems the system cannot find the ISO-8859-9 codec.

Note that the .encode() method uses the codec registry to look up the codec. The lookup itself is case-insensitive and subject to a few other normalizations (see encodings/__init__.py).

Please check your system and then report back whether you still see the reported error. Thanks.
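The case-insensitive lookup described above can be checked directly against the codec registry. The sketch below assumes a current Python 3 interpreter with a working standard library, not the reporter's environment, and uses the three spellings that appear in this thread.

```python
import codecs

# The spellings that appear in this report: two from the locale module, one typed by hand.
spellings = ["ISO-8859-9", "ISO8859-9", "iso-8859-9"]

# codecs.lookup() returns a CodecInfo object; its .name is the registry's own name.
resolved = {spelling: codecs.lookup(spelling).name for spelling in spellings}

for spelling, codec_name in resolved.items():
    print(f"{spelling!r} -> {codec_name!r}")

# If the registry normalizes as described, every spelling maps to the same codec.
print(len(set(resolved.values())) == 1)
```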
msg26565	Author: Marc-Andre Lemburg (lemburg) *	Date: 2005-10-21 14:25
SF has problems again it seems... Anyway, I tried to set the tr_TR locale on my system and got a surprising result:

    >>> import locale
    >>> locale.setlocale(locale.LC_ALL, 'tr_TR')
    'tr_TR'
    >>> locale.getpreferredencoding().lower()
    'ans\xfd_x3.4-1968'
    >>> locale.getpreferredencoding()
    'ANSI_X3.4-1968'

So I think the problem lies with the fact that string.lower() is locale dependent and the GLIBC folks chose a highly incompatible way of dealing with the special Turkish lower-case mapping of the capital "I". While this kind of mapping may make sense for text processing in applications, it certainly does not make sense when dealing with programming code or things that need to be specified in plain ASCII.

In short: the encoding used for the tr_TR locale is not ASCII-compatible and thus not suitable for Python source code.

I'm not sure what to say to this. My only advice is to not set the global locale setting to tr_TR, but to do this only when it comes to actually processing text in an application. Alternatively, you could write your application text using Unicode and then use the ISO-8859-9 codec to encode it for I/O.
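Both pieces of advice can be illustrated in a few lines. The sketch below assumes Python 3, where `str` methods do not consult the C locale, so the explicit table is shown only to make the ASCII-only intent visible; the helper name `ascii_lower` is hypothetical and not part of the standard library.

```python
import string

# Translation table covering only A-Z -> a-z; all other characters pass through unchanged.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(name):
    """Lower-case ASCII letters only (hypothetical helper, for illustration)."""
    return name.translate(_ASCII_LOWER)


# Normalizing an encoding name never produces a non-ASCII character.
print(ascii_lower("ISO-8859-9"))

# Keep text as Unicode and encode explicitly at the I/O boundary.
text = "André"
encoded = text.encode("iso-8859-9")
print(encoded)
```

Because the table names every affected character explicitly, the result cannot vary with the process locale.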
msg26566	Author: Eray Ozkural (exa)	Date: 2005-10-24 14:01
First, my system isn't broken. All applications run fine in this particular locale setting. The system was Mandrake 10.2, and now I have upgraded to Mandriva 2006, which is the same regarding this matter (however, I will check once again).

I do not understand your suggestion of not setting the locale to tr_TR. I am not doing that. I am doing:

    locale.setlocale(locale.LC_ALL, '')

which must work for _any_ locale, not just one or two. As you know, that is the standard way of starting up a localized application.

My suggestions stand:
1. Make the locale identifier something other than a string. Make it an object, just like in the Java standard library.
2. To _all_ text processing functions affected by the locale setting, most notably the lower() and upper() methods, append an optional locale argument.

The problem here might be greater than you seem to think it is. I should be able to use the result of locale.getpreferredencoding() without recourse to any text processing (the frustrating bit here is that simply using lower() is not sufficient in this case, but that is just a side matter). The simple answer is that it should return an ID or an object that is not text. I suggest you also review the Java standard library's handling of these functions.

At any rate, it is unacceptable for locale-specific functions to not work in some locales, in a locale-aware application that supports any locale.

Regards,
--
Eray Ozkural, eray at uludag.org.tr
Uludag Developer
http://uludag.org.tr
msg26567	Author: Marc-Andre Lemburg (lemburg) *	Date: 2005-10-24 14:51
I can only repeat: Python will not work if you set up the GLIBC to have it convert ASCII characters from lower to upper case, or vice versa, to characters outside the ASCII range. Please reread my reply.

If you write a locale-aware application that deals with text data, you should use Unicode to store the text data, not 8-bit strings.

And no, writing a locale-aware application does not mean that you start it up with setlocale(LC_ALL, ''). This simply doesn't work, and it is also the reason why the locale module goes to great lengths to use this C API only temporarily in order to apply a few conversions.

If you think that we should have locale-dependent string conversion functions that work in the same way (temporarily set a certain locale and then reset it to what it was previously set to), please provide a patch for the locale module. Thanks.
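The "temporarily set a certain locale and then reset it" pattern mentioned above could look roughly like the following. This is a sketch for Python 3, not the patch requested in the thread; the name `temporary_locale` is hypothetical. Note that setlocale() changes process-wide state, so the pattern is not safe when other threads depend on the locale.

```python
import locale
from contextlib import contextmanager


@contextmanager
def temporary_locale(category, name):
    """Switch `category` to `name` for the duration of a with-block (hypothetical helper)."""
    saved = locale.setlocale(category)  # one-argument form only queries the current setting
    try:
        yield locale.setlocale(category, name)
    finally:
        locale.setlocale(category, saved)  # always restore the previous setting


before = locale.setlocale(locale.LC_NUMERIC)
with temporary_locale(locale.LC_NUMERIC, "C") as active:
    # Locale-dependent conversions would go here, e.g. locale.format_string().
    formatted = locale.format_string("%.2f", 1234.5)
after = locale.setlocale(locale.LC_NUMERIC)
print(active, formatted, before == after)
```

The "C" locale is used in the usage lines only because it is available on every platform; a real application would pass the locale it needs for the conversion.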
msg26568	Author: Eray Ozkural (exa)	Date: 2005-10-24 15:08
I had read your reply very carefully. Had it made sense, I would think that this bug was invalid. However, it is not invalid. More explanation below. Please read carefully.

First, we are not foolish enough to store the text as 8-bit strings in our application. All text is stored as UTF-8, if that is what you are wondering. This bug report is wholly concerned with the question: why does the following call not work?

    >>> print a.encode(locale.getpreferredencoding())

where a is obviously unicode. What makes you think it is not unicode? It was indicated carefully in the original report. What use is the encode() method if it is not supposed to work in _every_ locale? Why should my glibc have anything to do with the failure of this call?

Again, the bug that lower() and upper() are broken is the subject of an existing report. This bug report is concerned with the issue above; the shortcomings of lower() and upper() are completely irrelevant to this bug.

Second, you say that setlocale(LC_ALL, '') simply doesn't work. I request your wisdom then (perhaps you can also help the authors of the gettext manual): what is the correct way to initialize an application which uses gettext?
msg26569	Author: Marc-Andre Lemburg (lemburg) *	Date: 2005-10-24 15:32
1. The reason why a.encode(locale.getpreferredencoding()) does not work is that the normalization function used by the codec lookup function uses .lower() to normalize the encoding string and expects this to happen using the ASCII mapping of lower-case characters, just like many other places in the Python standard library.

2. string.lower() and .upper() are not broken. It's just that they depend on the GLIBC settings for these mappings. Python expects, at the very least, these GLIBC APIs to map the ASCII characters in their ASCII-defined way. Obviously, this is no longer the case when switching to the tr_TR locale. I'd consider that a bug in GLIBC, not Python. A C implementation that maps a capital ASCII I to anything other than a lower-case ASCII i is broken, IMHO. It certainly is not ASCII compatible, and that's one of the few requirements Python makes regarding the platform.

3. If you want to use gettext, please use the gettext module. It does not need any of the setlocale() functions.

4. Since you are already using Unicode, I don't really understand why you want to bother with all the problems that setlocale() introduces at all. You should probably leave setlocale() untouched and instead use the locale module functions to access the various converter functions, e.g. for numeric values.
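Point 3 can be sketched as follows: the gettext module selects a catalog from an explicit language list, with no call to setlocale(). The sketch assumes Python 3; the domain "myapp" and the "locale" directory are hypothetical names chosen for illustration.

```python
import gettext

translations = gettext.translation(
    "myapp",             # hypothetical message domain
    localedir="locale",  # hypothetical directory holding <lang>/LC_MESSAGES/myapp.mo
    languages=["tr"],    # language chosen explicitly, not read from the C locale
    fallback=True,       # use NullTranslations instead of raising if no catalog exists
)
_ = translations.gettext

# With no catalog installed, the fallback returns the message unchanged.
greeting = _("Hello")
print(greeting)
```

Passing `languages` explicitly is a design choice for the example; when it is omitted, gettext consults environment variables such as LANGUAGE and LANG, which still does not require setlocale().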

History
Date	User	Action	Args
2022-04-11 14:56:13	admin	set	github: 42469
2005-10-11 21:35:58	exa	create