Issue989185
Created on 2004-07-12 03:59 by donut, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Files
File name | Uploaded | Description |
---|---|---|
east_asian_width.diff | hyeshik.chang, 2004-07-14 15:15 | A patch that moves unicode.width to unicodedata.east_asian_width |
east_asian_width2.diff.gz | hyeshik.chang, 2004-08-01 16:01 | Revision 2 |
Messages (20) | |||
---|---|---|---|
msg21489 - (view) | Author: Matthew Mueller (donut) | Date: 2004-07-12 03:59 | |
Python 2.4a1+ (#38, Jul 11 2004, 20:36:10) [GCC 3.3.4 (Debian 1:3.3.4-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'\u3060'.width()
2
>>> u'\u305f\u3099'.width()
4
Width should be two in both cases. |
|||
msg21490 - (view) | Author: Hyeshik Chang (hyeshik.chang) * | Date: 2004-07-12 04:46 | |
It sounds like we need to normalize to NFC before evaluating unicode.width(). So I think we need to choose how the width() method gets access to the normalization database:
1. Export the normalization C API functions from the unicodedata module, like ucnhash_CAPI, and have unicodeobject use them when width() is first called.
2. Move unicode.width() to the unicodedata module and use the normalization functions statically.
I would prefer 2. ;) |
|||
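To illustrate the normalization idea above, here is a minimal sketch. It assumes a current Python 3 interpreter rather than the 2.4a1 build from the report, and uses only `unicodedata.normalize`. It shows that NFC composes the two-code-point sequence from the original report into the single code point U+3060.

```python
import unicodedata

# HIRAGANA LETTER TA followed by COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
decomposed = "\u305f\u3099"

# NFC composes the pair into the precomposed HIRAGANA LETTER DA
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))  # 2 1
print(composed == "\u3060")            # True
```

Summing per-code-point widths after this step would count one wide character instead of two.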
msg21491 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2004-07-12 09:45 | |
To be honest: I don't really know how .width() ended up as a method. The use context seems to be rather limited in that it only applies to East Asian code points according to Unicode Standard Annex #11. I'd suggest moving the whole implementation to unicodedata instead (and then applying normalization before looking up the width). Reading UAX11 (http://www.unicode.org/reports/tr11/) I also have a feeling that taking the sum of all widths in a string of Unicode code points is not a very useful approach. Since the width is mainly used for rendering East Asian text, only the per code point information is useful. I think it would be more appropriate to raise an exception if you pass more than one code point to the function. |
|||
msg21492 - (view) | Author: Matthew Mueller (donut) | Date: 2004-07-12 13:06 | |
I don't think normalization is sufficient. For example, consider:
>>> u'\u01b5\u0327\u0308'.width()
3
>>> unicodedata.normalize('NFC',u'\u01b5\u0327\u0308').width()
3
But width should be one. |
|||
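The example above can be reproduced without the removed `.width()` method. This sketch assumes Python 3 and shows why NFC does not help here: no precomposed character exists for this sequence, so the two combining marks survive normalization. `unicodedata.combining` reports their non-zero canonical combining classes, which is one possible way to detect them.

```python
import unicodedata

# LATIN CAPITAL LETTER Z WITH STROKE + COMBINING CEDILLA + COMBINING DIAERESIS
text = "\u01b5\u0327\u0308"
nfc = unicodedata.normalize("NFC", text)

# Canonical combining class of each code point; 0 means "not a combining mark"
classes = [unicodedata.combining(ch) for ch in nfc]

print(len(nfc))  # 3
print(classes)   # [0, 202, 230]
```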
msg21493 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2004-07-12 14:00 | |
It would help if you would include the Unicode code point descriptions...
01B5;LATIN CAPITAL LETTER Z WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER Z BAR;;;01B6;
0327;COMBINING CEDILLA;Mn;202;NSM;;;;;N;NON-SPACING CEDILLA;;;;
0308;COMBINING DIAERESIS;Mn;230;NSM;;;;;N;NON-SPACING DIAERESIS;Dialytika;;;
I.e. your example does not even include East Asian characters. If you read TR11, you'll find that:
"""
ED7. Not East Asian (Neutral) - all other characters. Neutral characters do not occur in legacy East Asian character sets. By extension, they also do not occur in East Asian typography. For example, there is no traditional Japanese way of typesetting Devanagari. Strictly speaking, it makes no sense to talk of narrow and wide for neutral characters, but since for all practical purposes they behave like Na, they are treated as narrow characters (the same as Na) under the recommendations below.
"""
Combining marks such as the ones your example uses cannot be processed by doing a simple database lookup. The two marks you include are marked as A -- Ambiguous. Furthermore, you should not mistake the East Asian Width for the display width. It is merely a hint for rendering engines. See the TR for details.
Hye-Shik, could you give an example of where the EAW is actually useful in Python programming? I have a feeling that it is going to cause more confusion than good. It may also be wise to rename the function to east_asian_width() to signal that the return value does not have anything to do with display width, glyphs, etc. |
|||
msg21494 - (view) | Author: Matthew Mueller (donut) | Date: 2004-07-12 14:28 | |
TR11 says "Strictly speaking, it makes no sense to talk of narrow and wide for neutral characters, but since for all practical purposes they behave like Na, they are treated as narrow characters (the same as Na) under the recommendations below." In addition, the current implementation gives a width of 1 to non-East Asian characters. So fixing the effect of combining characters on non-East Asian characters is, IMHO, just as applicable as fixing it for Asian text. As for display width, I'd say it is useful when writing to a terminal, but not in its current form. Combining characters obviously have no width, whether they are "wide" (which just means they are normally combined with wide characters) or not. |
|||
msg21495 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2004-07-12 15:23 | |
I don't understand your complaint: width 1 means "narrow", just as defined in the TR?! How do you write Asian characters to a terminal? I think you are mixing up glyphs and code points here. |
|||
msg21496 - (view) | Author: Hyeshik Chang (hyeshik.chang) * | Date: 2004-07-12 15:53 | |
The major uses I expected for width() are:
- Hints for terminal-based applications (for cursor position and layouts).
- Generating fixed-width text documents that are not ugly, e.g. printing a "------" decoration under each subject.
- A more readable limit for table columns, e.g. topics in web bulletins: limiting by the same number of 'characters' makes topic lines full of East Asian characters very wide and leaves English topics narrowly snipped.
- Locating the "^" in the correct position in Python tracebacks. This isn't implemented in the standard traceback module, but width() allows third parties to implement a sys.excepthook for East Asian text easily.
In fact, I don't know whether width() can easily be modified to support the variety of combining characters from Western scripts. But if it isn't too heavy or complicated, I would volunteer to extend the width() implementation so that it provides a generic fixed-width rendering hint. |
|||
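The underline use case above can be sketched as follows. This is an illustration, not the code from the patch. It assumes Python 3 and a policy that is only one of several defensible choices: wide (W) and fullwidth (F) characters take two columns, combining marks take none, and everything else takes one, including ambiguous (A) characters. The helper names `columns` and `underline` are hypothetical.

```python
import unicodedata

def columns(text):
    """Estimate terminal columns under the assumed policy described above."""
    total = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are assumed to occupy no column
        # W and F count as two columns; everything else, including A, as one
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total

def underline(title, fill="-"):
    """Return the title followed by a rule as wide as the estimated columns."""
    return title + "\n" + fill * columns(title)

print(underline("Python \u3060"))
```

Under this policy both strings from the original report, `"\u3060"` and `"\u305f\u3099"`, are estimated at two columns. Control characters and zero-width characters whose combining class is 0 are not handled.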
msg21497 - (view) | Author: Matthew Mueller (donut) | Date: 2004-07-12 16:03 | |
My complaint was that you were attacking my example for using non-Asian characters, when the TR specifically says they are handled as narrow. I write Asian characters the same way as anything else. If it's a unicode string, Python converts it with sys.stdout.encoding (for print, anyway). Otherwise you just have to write in whatever encoding the terminal expects. And when you are talking about a fixed-width text terminal, wide characters take 2 columns and narrow ones take 1, assuming you ignore combining characters, which is what this is all about. You said the width is only a "hint for rendering engines", but I cannot think of any rendering engine that would benefit from a hint that can be 2-3x wider (due to counting combining characters) than what you actually display. |
|||
msg21498 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2004-07-12 19:41 | |
Thanks for your descriptions, Hye-Shik. Since the application space is very much targeted at East Asian scripts, I would like the implementation to be moved into unicodedata, where all the other special Unicode features are implemented. The .width() method should be removed. Now that I understand better what the EAW is about, I would also like to see the function renamed to east_asian_width(), since that's what the function is based on. If possible, I'd also like to see the full width mapping implemented (as defined in the TR). The reduction to narrow vs. wide seems to be oversimplified. The east_asian_width() function should return the strings "N", "A", "H", "W", "F", "Na" and let the user decide how to map these to character or string widths. We have followed the same methodology for the other Unicode database properties, and this has not only given us much more flexibility, it is also standards compliant, and you can get good documentation on these features. Matthew, I suggest you write your own implementation of what you think is right. In the face of ambiguity, there's no such thing as the right approach to a certain problem. Thanks. |
|||
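For reference, the six property values requested above can be observed with `unicodedata.east_asian_width()` as it exists in current Python 3 (an assumption; the thread is about 2.4). The sample code points are chosen for illustration only.

```python
import unicodedata

# One sample code point per East Asian Width property value
samples = [
    "\u05d0",  # HEBREW LETTER ALEF                -> N  (neutral)
    "\u00a1",  # INVERTED EXCLAMATION MARK         -> A  (ambiguous)
    "\uff71",  # HALFWIDTH KATAKANA LETTER A       -> H  (halfwidth)
    "\u3042",  # HIRAGANA LETTER A                 -> W  (wide)
    "\uff21",  # FULLWIDTH LATIN CAPITAL LETTER A  -> F  (fullwidth)
    "a",       # LATIN SMALL LETTER A              -> Na (narrow)
]

values = [unicodedata.east_asian_width(ch) for ch in samples]
for ch, value in zip(samples, values):
    print("U+%04X %s" % (ord(ch), value))
```

The function only reports the property. Mapping the values to column counts is left to the caller.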
msg21499 - (view) | Author: Hyeshik Chang (hyeshik.chang) * | Date: 2004-07-14 15:15 | |
Marc-Andre, here's a patch written as you suggested. Could you please review it? |
|||
msg21500 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2004-07-15 09:39 | |
Hye-Shik, the patch only includes the move to unicodedata, but not the full implementation of the EAW as per the TR. I would much prefer to have the east_asian_width() function return the strings defined in the TR, because this allows users of the function to read the information and implement their own interpretation of "width". The new function should work very much like unicodedata.category(). It would also be wise to move the data itself over to the Unicode database; that way the extra data does not affect Python programs that don't use the function. Thanks. |
|||
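The parallel with `unicodedata.category()` can be seen in current Python 3 (an assumption; this reflects today's behavior, not necessarily the patch under review). Both functions take a single character and return a short property string. Passing a longer string raises `TypeError`, which matches the earlier suggestion to reject more than one code point.

```python
import unicodedata

ch = "\u3060"  # HIRAGANA LETTER DA
print(unicodedata.category(ch), unicodedata.east_asian_width(ch))  # Lo W

# Like category(), the lookup is per code point: longer strings are rejected
try:
    unicodedata.east_asian_width("\u305f\u3099")
except TypeError as exc:
    print("rejected:", exc)
```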
msg21501 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2004-07-15 17:24 | |
I still think a function that computes the number of ems a conventional application would expect is useful. That function should raise an exception for a neutral character: the example of the combining characters shows that such characters should *not* be treated as narrow "for all practical purposes". Whether it is useful to include the entire UAX#11 classification in the database, I don't know; it seems the only application of the data would be computing the width anyway. It would not be wise to move the data to the Unicode database, because the extra data currently doesn't affect Python programs that don't use the function anyway: the data does not consume any additional space. |
|||
msg21502 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2004-07-15 17:36 | |
Martin, you can code such a function in your application based on the information you'd get from unicodedata.east_asian_width(). As we've seen, there is no generally sound way to define such a function. As for the location of the data: the unicodedata module is the place where any extra information related to Unicode should go. unicodectype.c is reserved for data needed at the C level by the Python Unicode C API. |
|||
msg21503 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2004-07-23 10:12 | |
Hye-Shik, are you working on a patch to move the implementation as suggested? Just asking, because I don't want the current layout to go into the Python 2.4 final... |
|||
msg21504 - (view) | Author: Hyeshik Chang (hyeshik.chang) * | Date: 2004-07-23 12:29 | |
Yes, I am. I was a little bit busy. I'll submit a new patch after this weekend. |
|||
msg21505 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2004-07-23 12:31 | |
Great! Thanks. |
|||
msg21506 - (view) | Author: Hyeshik Chang (hyeshik.chang) * | Date: 2004-08-01 16:01 | |
Here's my revised patch for the new API. :) |
|||
msg21507 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2004-08-03 14:39 | |
Looks good. Please check it in. Thanks, Hye-Shik! |
|||
msg21508 - (view) | Author: Hyeshik Chang (hyeshik.chang) * | Date: 2004-08-04 07:44 | |
Committed in CVS.
Doc/api/concrete.tex 1.54
Doc/lib/libstdtypes.tex 1.159
Doc/lib/libunicodedata.tex 1.6
Include/unicodeobject.h 2.45
Lib/test/string_tests.py 1.39
Lib/test/test_unicode.py 1.92
Lib/test/test_unicodedata.py 1.11
Lib/test/test_userstring.py 1.13
Misc/NEWS 1.1068
Modules/unicodedata.c 2.32
Modules/unicodedata_db.h 1.11
Objects/unicodectype.c 2.16
Objects/unicodeobject.c 2.219
Objects/unicodetype_db.h 1.9
Tools/unicode/makeunicodedata.py 1.19
Thanks! |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:56:05 | admin | set | github: 40543 |
2004-07-12 03:59:43 | donut | create |