classification
Title: Support Unicode normalization
Type:
Stage:
Components: Interpreter Core
Versions:
process
Status: closed
Resolution: accepted
Dependencies:
Superseder:
Assigned To: loewis
Nosy List: lemburg, loewis
Priority: normal
Keywords: patch

Created on 2002-10-21 19:02 by loewis, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Files
File name Uploaded Description
normal.txt loewis, 2002-10-22 06:09
normal.txt loewis, 2002-11-23 15:19
Messages (7)
msg41413 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-10-21 19:02
This patch adds support for the normalization forms
NFC, NFKC, NFD, NFKD. It passes the
NormalizationTest-3.2.0.txt tests.
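
For readers unfamiliar with the four forms, here is a minimal sketch of what they do. It uses `unicodedata.normalize(form, text)` as it exists in current Python 3, which is an assumption about the reader's environment; the sample strings are illustrative and are not taken from the patch.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (one code point)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI (a compatibility character)

# Canonical forms: NFD decomposes, NFC decomposes and then recomposes.
nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)

# Compatibility forms additionally replace compatibility characters.
nfkd = unicodedata.normalize("NFKD", ligature)
nfkc = unicodedata.normalize("NFKC", ligature)

# The canonical forms leave the ligature untouched.
nfc_ligature = unicodedata.normalize("NFC", ligature)
```

Here `nfd` equals `decomposed`, `nfc` equals `composed`, and both compatibility forms turn the ligature into the two letters "fi".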
msg41414 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2002-10-23 10:27
The patch looks OK except for a few nits:

* I'd prefer a single API, normalize(form), which takes
  the form as a string argument, instead of NFKD, etc.

* __getrecord should be renamed to _getrecord_ex;
  perhaps both should use a different name altogether,
  e.g. getunicoderecord 

* I think you have to add some #ifdef Py_UNICODE_WIDE
  in the code to avoid compiler warnings for narrow builds
  about non-const if expressions being always true due to 
  size limits.

* The filenames you are using should not include the '-Latest'
  suffix. If you download the files from unicode.org via FTP
  they don't have this extension.

* The skip test message should include a reference to where to
  get the test file from, i.e.
  ftp://ftp.unicode.org/Public/UNIDATA/NormalizationTest.txt

Thanks for working on this!
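
The single-API suggestion in the first bullet can be pictured roughly as follows. This is only a sketch under assumptions: `nfd`, `nfc`, `nfkd`, and `nfkc` are hypothetical stand-ins for one-function-per-form entry points (the names used in the patch are not shown in this thread), they are built here on top of today's `unicodedata.normalize`, and the `(form, text)` argument order follows that function rather than anything stated above.

```python
import unicodedata

# Hypothetical per-form entry points, standing in for a one-function-per-form API.
def nfd(text):
    return unicodedata.normalize("NFD", text)

def nfc(text):
    return unicodedata.normalize("NFC", text)

def nfkd(text):
    return unicodedata.normalize("NFKD", text)

def nfkc(text):
    return unicodedata.normalize("NFKC", text)

# Table mapping the form name (a string) to its implementation.
_FORMS = {"NFD": nfd, "NFC": nfc, "NFKD": nfkd, "NFKC": nfkc}

def normalize(form, text):
    """Single entry point: dispatch on the form name given as a string."""
    try:
        func = _FORMS[form]
    except KeyError:
        raise ValueError("invalid normalization form: %r" % (form,)) from None
    return func(text)
```

With one entry point, callers can pass the form around as data (for example, read it from a configuration value), and an unknown form name fails in one well-defined place.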

msg41415 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2002-10-23 10:36
One more minor nit: the indentation in the C file is 4
chars; please reindent your code accordingly.
msg41416 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-10-25 15:03
This patch addresses your issues in the following way:

- single API: done.

- add _getrecord_ex: done. Rename to getunicoderecord:
  since this is a static function in unicodedata.c, this
  renaming would not add that much information, so not done.

- #ifdef Py_UNICODE_WIDE: I could not spot any place where
  this is necessary.

- Drop -Latest: done.

- adjust skip message: done.

- reformat to 4 spaces: not done, I think PEP 7 should be
  followed.
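
The skip-message point and the conformance test can be sketched as below. This is not the committed test_normalization.py; it is a minimal illustration assuming current Python 3, a local copy of the data file named `NormalizationTest.txt` in the working directory (the `TESTDATAFILE` name is an assumption), and the five-column, semicolon-separated line format with the invariants described in the data file's header.

```python
import os
import unicodedata
import unittest

# Assumed local file name; the URL is the one quoted in the review above.
TESTDATAFILE = "NormalizationTest.txt"
TESTDATAURL = "ftp://ftp.unicode.org/Public/UNIDATA/NormalizationTest.txt"

def parse_line(line):
    """Return the columns c1..c5 as strings, or None for non-data lines."""
    data = line.split("#", 1)[0].strip()      # drop trailing comment
    if not data or data.startswith("@"):      # blank, comment, or @Part line
        return None
    fields = data.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in field.split())
            for field in fields]

def check_columns(c1, c2, c3, c4, c5):
    """Check the conformance invariants for one data line."""
    norm = unicodedata.normalize
    return (
        all(norm("NFC", c) == c2 for c in (c1, c2, c3))
        and all(norm("NFC", c) == c4 for c in (c4, c5))
        and all(norm("NFD", c) == c3 for c in (c1, c2, c3))
        and all(norm("NFD", c) == c5 for c in (c4, c5))
        and all(norm("NFKC", c) == c4 for c in (c1, c2, c3, c4, c5))
        and all(norm("NFKD", c) == c5 for c in (c1, c2, c3, c4, c5))
    )

# The skip message tells the reader where to get the missing file.
@unittest.skipUnless(
    os.path.exists(TESTDATAFILE),
    "%s not found, download from %s" % (TESTDATAFILE, TESTDATAURL),
)
class NormalizationTest(unittest.TestCase):
    def test_file(self):
        with open(TESTDATAFILE, encoding="utf-8") as f:
            for line in f:
                columns = parse_line(line)
                if columns is not None:
                    self.assertTrue(check_columns(*columns), line)
```

Note that a data file from a different Unicode version than the one reported by `unicodedata.unidata_version` may produce failures that are not bugs, so a real test would need to pin or check the version.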
msg41417 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-11-23 15:19
This version changes the indentation to 4 spaces. Are any
further changes needed?
msg41418 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2002-11-23 21:50
Looks good (I don't have time to review the patch
in detail, though). Please check it in.

Thanks.
msg41419 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-11-23 22:08
Thanks! Committed as

libunicodedata.tex 1.4
test_normalization.py 1.1
NEWS 1.541
unicodedata.c 2.24
unicodedata_db.h 1.7
makeunicodedata.py 1.15
History
Date User Action Args
2022-04-10 16:05:46 admin set github: 37352
2002-10-21 19:02:28 loewis create