Issue1026480
Created on 2004-09-11 21:28 by kowaltowski, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Messages (3) | |||
---|---|---|---|
msg22432 - (view) | Author: Tomasz Kowaltowski (kowaltowski) | Date: 2004-09-11 21:28 | |
I have no problems in Python in using strings which contain accented letters (my Emacs has no problems in producing them using one-byte iso-8859-1 encoding). However, the functions 'lower' and 'upper' do not work properly on these letters, as shown below (I hope all accents appear properly within your browsers):

-------------------------------------------------------------
as = "aáàâãä" # except for the first 'a', all others have accents
AS = "AÁÀÂÃÄ" # except for the first 'A', all others have accents
print "direct: %s -- %s" % (as, AS)
print "lower: %s -- %s" % (as.lower(), AS.lower())
print "upper: %s -- %s" % (as.upper(), AS.upper())
-------------------------------------------------------------

The output is:

--------------------------------------------------------------
direct: aáàâãä -- AÁÀÂÃÄ
lower: aáàâãä -- aÁÀÂÃÄ
upper: Aáàâãä -- AÁÀÂÃÄ
--------------------------------------------------------------

i.e., accented letters (above 128) are not translated. It did not make any difference to put the line

# -*- coding: iso-latin-1 -*-

about the encoding, as recommended by PEP 0263.

I am not sure whether this is a bug or whether it is intentional, i.e. these functions work only for pure ASCII letters. However, it is a major inconvenience for those who use any language which is not English but uses the Latin alphabet :-(. There should be some mechanism to tell these functions which Latin variant (iso-8859-1, iso-8859-2, ...) is being used, so that they behave properly; e.g., an optional second argument? |
|||
msg22433 - (view) | Author: Scott David Daniels (scott_daniels) * | Date: 2004-09-13 20:00 | |
Note: lower and upper are defined as for ASCII on strs, but work correctly for unicode.

uas = u"aáàâãä" # except first 'a', all have accents
UAS = u"AÁÀÂÃÄ" # except first 'A', all have accents
print u"direct: %s -- %s" % (uas, UAS)
print u"lower: %s -- %s" % (uas.lower(), UAS.lower())
print u"upper: %s -- %s" % (uas.upper(), UAS.upper())

What you are asking is pretty hopeless. With two modules loaded with differing encodings, whose idea of "how to uppercase an 8-bit character" should be used? What you might want to use is:

def codedupper(coding, string):
    return string.decode(coding).upper().encode(coding)

def codedlower(coding, string):
    return string.decode(coding).lower().encode(coding)

or:

def latinupper(string):
    return string.decode('latin-1').upper().encode('latin-1')

def latinlower(string):
    return string.decode('latin-1').lower().encode('latin-1')

Any of these functions is well-defined even with several modules of differing encodings loaded. |
|||
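For readers on Python 3, the decode/upper/encode round trip suggested in msg22433 might look roughly like the sketch below. It is an adaptation, not code from the thread: the names `coded_upper` and `coded_lower` and the type hints are assumptions, and in Python 3 the 8-bit string is a `bytes` object whose own `upper()` still only affects ASCII letters.

```python
def coded_upper(coding: str, data: bytes) -> bytes:
    """Uppercase encoded bytes by round-tripping through str."""
    return data.decode(coding).upper().encode(coding)


def coded_lower(coding: str, data: bytes) -> bytes:
    """Lowercase encoded bytes by round-tripping through str."""
    return data.decode(coding).lower().encode(coding)


# The accented sample from the original report, as Latin-1 bytes.
raw = "aáàâãä".encode("latin-1")

# bytes.upper() leaves every byte above 127 untouched.
ascii_only = raw.upper()

# The round trip uppercases the accented letters as well.
converted = coded_upper("latin-1", raw)

print(ascii_only.decode("latin-1"))
print(converted.decode("latin-1"))
```

One caveat with this approach: a few characters have an uppercase form that the target encoding cannot represent (in Latin-1, for example, `ÿ` uppercases to a character outside the encoding), in which case `encode` raises `UnicodeEncodeError`.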
msg22434 - (view) | Author: Tomasz Kowaltowski (kowaltowski) | Date: 2004-09-14 00:12 | |
I guess you are right from a conceptual point of view. It is just somewhat frustrating, because almost every language which uses the Latin alphabet needs characters above 128 (is English the only exception?). On the other hand, 'lower' and 'upper' work for the Unicode (really utf-8) representation, in which many alphabets do not even have the concept of lower and upper case! Your suggestion about 'latinlower' and 'latinupper' is basically what I asked for, but it is about 10 times slower than the direct 'lower' and 'upper' :-(. Thanks anyway -- I guess the matter may be closed. |
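The speed concern in msg22434 is not addressed in the thread. One possible way to avoid a decode/encode round trip on every call, sketched here for Python 3 under the assumption of a single-byte encoding, is to build a 256-entry translation table once and apply it with `bytes.translate`. The names `build_upper_table`, `LATIN1_UPPER`, and `latin_upper` are hypothetical, and no timing was measured for this sketch.

```python
def build_upper_table(coding: str) -> bytes:
    """Build a 256-byte table mapping each byte to its uppercase form.

    Assumes `coding` is a single-byte encoding. A byte is left unchanged
    when it is undefined in the encoding or when its uppercase form does
    not fit in exactly one byte of that encoding.
    """
    table = bytearray(range(256))
    for b in range(256):
        try:
            upper = bytes([b]).decode(coding).upper().encode(coding)
        except UnicodeError:
            continue  # undefined byte, or uppercase form not encodable
        if len(upper) == 1:
            table[b] = upper[0]
    return bytes(table)


# Built once at import time, then reused for every call.
LATIN1_UPPER = build_upper_table("latin-1")


def latin_upper(data: bytes) -> bytes:
    """Uppercase Latin-1 encoded bytes with a single table lookup pass."""
    return data.translate(LATIN1_UPPER)


sample = "aáàâãä".encode("latin-1")
print(latin_upper(sample).decode("latin-1"))
```

Unlike the round trip, this variant never raises on characters such as `ß` or `ÿ`, whose uppercase forms do not fit in one Latin-1 byte; it simply leaves them as they are, which may or may not be the behavior a caller wants.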
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:56:07 | admin | set | github: 40901 |
2004-09-11 21:28:34 | kowaltowski | create |