Issue 568669: gettext module charset changes



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/36749

classification

Title:	gettext module charset changes
Type:		Stage:
Components:	Library (Lib)	Versions:	Python 2.3

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:	barry	Nosy List:	barry, loewis, nnorwitz
Priority:	normal	Keywords:	patch

Created on 2002-06-13 20:13 by barry, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Files
File name	Uploaded	Description
gettext-diff.txt	barry, 2002-06-13 20:13

Messages (4)
msg40302 - (view)	Author: Barry A. Warsaw (barry) *	Date: 2002-06-13 20:13
The GNU gettext docs make two recommendations: that the source string passed to gettext() be in us-ascii, and that the default output charset be the locale's character set. I think the latter makes the most sense for our ugettext() methods.
The attached patch sets the default character set to us-ascii for NullTranslations. For GNUTranslations, the default character set is taken from the Content-Type: header if given in the .po/.mo file; otherwise it's taken from the default locale information, if available. It falls back to the base class charset (by default us-ascii).
This patch also provides the following:
- add a set_charset() method to the NullTranslations base class, so that it is easier to change the default character set. For symmetry, I also rename charset() to get_charset() and keep the former for backwards compatibility.
- convert Lib/test/test_gettext.py to unittest style (sans the cvs rm of Lib/test/output/test_gettext, which we'll do separately)
- update the docs for all the code changes described above.
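For readers following the thread, the fallback order described in this message (Content-Type header first, then the locale's charset, then the base class default) might be sketched as below. This is an illustration only, not the code from gettext-diff.txt; the function name `resolve_charset` and its parameters are hypothetical.

```python
def resolve_charset(content_type=None, locale_charset=None,
                    base_charset="us-ascii"):
    """Pick a charset using the fallback order described in the message.

    Hypothetical helper for illustration; not part of the gettext module.
    """
    # 1. Prefer the charset parameter of the catalog's Content-Type header,
    #    e.g. "text/plain; charset=UTF-8".
    if content_type and "charset=" in content_type:
        value = content_type.split("charset=", 1)[1]
        value = value.split(";", 1)[0].strip().strip('"')
        if value:
            return value.lower()
    # 2. Otherwise use the locale's charset, if the caller could determine it.
    if locale_charset:
        return locale_charset.lower()
    # 3. Finally fall back to the base class default.
    return base_charset
```

For example, `resolve_charset("text/plain; charset=UTF-8")` yields `"utf-8"`, while a header without a charset parameter and no locale information yields the `us-ascii` default.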
msg40303 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2002-06-13 20:53
Obtaining the locale's codeset by parsing environment variables is bogus. For example, in most installations, the codeset for de_DE@euro is iso-8859-15. However, this is impossible to find out by just parsing the environment variables. Instead, the proper way is to use locale.nl_langinfo(CODESET) where available. If that is not available, the following heuristics could be applied:
- On Windows, it is "mbcs"
- On Unix, parse the environment variables
As for the actual usage of the charset, I think you misinterpret the gettext recommendation: the result of gettext ought to be in the locale's encoding (this is not a default encoding). This means that, if the codeset of the locale and the charset of the catalog differ, character set conversion needs to be invoked; I can see no traces of that happening in your patch.
The common case is a catalog in UTF-8 with a language-specific user codeset (such as Latin-9). In that case, conversion works well. There is also the case of unsupported conversions (e.g. usage of EURO SIGN in the catalog, but Latin-1 in the locale); in this case, glibc iconv uses transliteration (to "EUR", in the example). Since we have no transliteration, we would probably fall back to returning the string in the catalog's encoding :-(
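The detection heuristics and the conversion step described in this message could look roughly like the following sketch. It assumes modern Python, where `locale.nl_langinfo` and `locale.CODESET` exist on most Unix platforms; the helper names `codeset_from_env`, `locale_codeset`, and `recode`, and the final "ascii" default, are hypothetical and not part of any patch on this issue.

```python
import locale
import sys


def codeset_from_env(environ):
    """Last-resort guess: parse language[_territory][.codeset][@modifier]."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if value:
            if "." in value:
                return value.split(".", 1)[1].split("@", 1)[0]
            # e.g. "de_DE@euro": the codeset cannot be recovered this way.
            return None
    return None


def locale_codeset(environ):
    """Best-effort codeset lookup following the heuristics in the message."""
    # Preferred: ask the C library where nl_langinfo is available.
    if hasattr(locale, "nl_langinfo") and hasattr(locale, "CODESET"):
        codeset = locale.nl_langinfo(locale.CODESET)
        if codeset:
            return codeset
    if sys.platform == "win32":
        return "mbcs"
    # Assumption: fall back to "ascii" when nothing can be determined.
    return codeset_from_env(environ) or "ascii"


def recode(raw, catalog_charset, target_codeset):
    """Convert catalog bytes to the locale's codeset.

    With no transliteration available, an unsupported conversion returns
    the bytes unchanged, i.e. still in the catalog's encoding.
    """
    try:
        return raw.decode(catalog_charset).encode(target_codeset)
    except (UnicodeError, LookupError):
        return raw
```

A UTF-8 catalog entry containing EURO SIGN converts cleanly to iso-8859-15 but comes back unchanged when the target is Latin-1, which is the fallback case the message regrets.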
msg40304 - (view)	Author: Neal Norwitz (nnorwitz) *	Date: 2003-04-12 01:15
Barry, what's the status of this patch now? Should it be closed?
msg40305 - (view)	Author: Barry A. Warsaw (barry) *	Date: 2003-04-12 01:43
Yes, let's reject this.

History
Date	User	Action	Args
2022-04-10 16:05:25	admin	set	github: 36749
2002-06-13 20:13:57	barry	create