classification
Title: gettext module charset changes
Type:
Stage:
Components: Library (Lib)
Versions: Python 2.3
process
Status: closed
Resolution: rejected
Dependencies:
Superseder:
Assigned To: barry
Nosy List: barry, loewis, nnorwitz
Priority: normal
Keywords: patch

Created on 2002-06-13 20:13 by barry, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Files
File name Uploaded Description
gettext-diff.txt barry, 2002-06-13 20:13
Messages (4)
msg40302 - Author: Barry A. Warsaw (barry) (Python committer) Date: 2002-06-13 20:13
The GNU gettext docs make two recommendations: that the
source string to gettext() be in us-ascii, and that the
default output charset be in the locale's character
set.  I think the latter makes the most sense for our
ugettext() methods.

The attached patch sets the default character set to
us-ascii for NullTranslations.  For GNUTranslations,
the default character set is taken from the
Content-Type: header if given in the .po/.mo file,
otherwise it's taken from the default locale
information, if available.  It falls back to the base
class charset (by default us-ascii).
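
The lookup order described above (catalog header, then locale, then base class default) can be sketched as follows. This is only an illustration under stated assumptions, not the code in gettext-diff.txt: `resolve_charset`, `charset_from_content_type`, and the `locale_charset` parameter are hypothetical names, and the caller is assumed to supply the locale's charset if it is known.

```python
import re


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    match = re.search(r"charset=([^\s;]+)", header_value, re.IGNORECASE)
    if not match:
        return None
    return match.group(1).strip('"').lower()


def resolve_charset(headers, locale_charset=None, base_charset="us-ascii"):
    """Pick a charset: catalog header first, then locale, then the default.

    `headers` is a dict of lower-cased catalog header names (hypothetical
    shape); `locale_charset` is whatever the caller found for the locale.
    """
    content_type = headers.get("content-type")
    if content_type:
        charset = charset_from_content_type(content_type)
        if charset:
            return charset
    if locale_charset:
        return locale_charset
    return base_charset


print(resolve_charset({"content-type": "text/plain; charset=UTF-8"}, "iso-8859-15"))
print(resolve_charset({}, "iso-8859-15"))
print(resolve_charset({}))
```

The three calls exercise each step of the chain in turn: a header that names a charset, no header but a known locale charset, and neither.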

This patch also provides the following:

- add a set_charset() method to the NullTranslations
base class, so that it is easier to change the default
character set.  For symmetry, I also rename charset()
to get_charset() and keep the former for backwards
compatibility.

- convert Lib/test/test_gettext.py to unittest style
(sans the cvs rm of Lib/test/output/test_gettext which
we'll do separately)

- update the docs for all the code changes described above.
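
For illustration, the accessor pair proposed in the first item might look roughly like the sketch below. The class `SketchTranslations` is a hypothetical stand-in, not the standard library's `NullTranslations`; since the patch was rejected, `set_charset()` and `get_charset()` should not be assumed to exist in the real `gettext` module.

```python
class SketchTranslations:
    """Hypothetical stand-in for the proposed base-class charset API."""

    def __init__(self, charset="us-ascii"):
        # Default mirrors the us-ascii default described in the patch summary.
        self._charset = charset

    def get_charset(self):
        """Return the current default character set."""
        return self._charset

    def set_charset(self, charset):
        """Change the default character set."""
        self._charset = charset

    # Old spelling kept as an alias for backwards compatibility.
    charset = get_charset


t = SketchTranslations()
t.set_charset("utf-8")
print(t.get_charset(), t.charset())
```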
msg40303 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2002-06-13 20:53
Obtaining the locale's codeset by parsing environment
variables is bogus. For example, in most installations, the
codeset for de_DE@euro is iso-8859-15. However, this is
impossible to find out by just parsing the environment
variables.

Instead, the proper way is to use
locale.nl_langinfo(CODESET) where available. If that is not
available, the following heuristics could be applied:
- On Windows, it is "mbcs"
- On Unix, parse the environment variables
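
A rough sketch of that lookup order, for illustration only. The helper names `guess_codeset` and `codeset_from_env` are hypothetical, and the choice of LC_ALL, LC_CTYPE, LANG as the variables to inspect is an assumption rather than something stated in this message. Note that `locale.nl_langinfo` reports the codeset of the currently active locale, so a program would normally call `locale.setlocale(locale.LC_ALL, "")` first.

```python
import locale
import os
import sys


def codeset_from_env(environ):
    """Parse the codeset out of the first non-empty locale variable.

    Handles values shaped like 'de_DE.ISO-8859-15@euro'. Returns None when
    the value carries no codeset, as with 'de_DE@euro'.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):  # assumed precedence
        value = environ.get(name)
        if value:
            rest = value.partition(".")[2]
            codeset = rest.partition("@")[0]
            return codeset or None
    return None


def guess_codeset(environ=None):
    """Prefer nl_langinfo(CODESET); otherwise fall back to heuristics."""
    if hasattr(locale, "nl_langinfo") and hasattr(locale, "CODESET"):
        codeset = locale.nl_langinfo(locale.CODESET)
        if codeset:
            return codeset
    if sys.platform == "win32":
        return "mbcs"
    return codeset_from_env(os.environ if environ is None else environ)


print(codeset_from_env({"LANG": "de_DE@euro"}))
print(codeset_from_env({"LANG": "de_DE.ISO-8859-15@euro"}))
```

The first call shows the weakness described above: the environment value alone gives no hint that de_DE@euro usually means iso-8859-15.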

As for the actual usage of the charset, I think you
misinterpret the gettext recommendation: the result of
gettext ought to be in the locale's encoding (this is not a
default encoding). This means that, if the codeset of the
locale and the charset of the catalog differ, character set
conversion needs to be invoked; I can see no traces of that
happening in your patch. 

The common case is a catalog in UTF-8, and the user's
codeset is language-specific (such as Latin-9). In that
case, conversion works well. There is also the case of
unsupported conversions (e.g. usage of EURO SIGN in the
catalog, but Latin-1 in the locale); in this case, glibc
iconv uses transliteration (to "EUR", in the example). Since
we have no transliteration, we would probably fall back to
return the string in the catalog's encoding :-(
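
The conversion step and its fallback could be sketched like this. It is a minimal illustration written for present-day Python, where catalog entries are treated as bytes; `convert_message` is a hypothetical helper, and no transliteration is attempted.

```python
def convert_message(raw, catalog_charset, locale_codeset):
    """Re-encode a catalog message from the catalog's charset to the locale's.

    If some character cannot be represented in the locale's codeset, return
    the original bytes unchanged, i.e. still in the catalog's encoding.
    """
    text = raw.decode(catalog_charset)
    try:
        return text.encode(locale_codeset)
    except UnicodeEncodeError:
        return raw


greeting = "Gr\u00fc\u00dfe".encode("utf-8")  # UTF-8 catalog entry
price = "5 \u20ac".encode("utf-8")            # contains EURO SIGN

print(convert_message(greeting, "utf-8", "iso-8859-15"))
print(convert_message(price, "utf-8", "iso-8859-15"))
print(convert_message(price, "utf-8", "iso-8859-1"))
```

The last call is the unsupported case: Latin-1 has no EURO SIGN, so the UTF-8 bytes come back unchanged instead of a transliterated "EUR".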
msg40304 - Author: Neal Norwitz (nnorwitz) (Python committer) Date: 2003-04-12 01:15
Barry, what's the status of this patch now?  Should it be
closed?
msg40305 - Author: Barry A. Warsaw (barry) (Python committer) Date: 2003-04-12 01:43
Yes, let's reject this.
History
Date User Action Args
2022-04-10 16:05:25 admin set github: 36749
2002-06-13 20:13:57 barry create