classification
Title: Python may contain NFC/NFKC bug per Unicode PRI #29
Type: behavior
Stage: resolved
Components: Unicode
Versions: Python 3.1, Python 3.2, Python 2.7, Python 2.6
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: loewis
Nosy List: ajaksu2, christian.heimes, ezio.melotti, lemburg, loewis, rick_mcgowan, vstinner
Priority: normal
Keywords: patch

Created on 2004-10-26 23:58 by rick_mcgowan, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name: unicode_pr29.patch
Uploaded: vstinner, 2009-05-04 10:42
Messages (11)
msg22884 - (view) Author: Rick McGowan (rick_mcgowan) Date: 2004-10-26 23:58
The Unicode Technical Committee posted Public Review
Issue #29, describing a bug in the documentation of NFC
and NFKC in the text of UAX #15 Unicode Normalization
Forms. I have examined unicodedata.c in the Python
implementation (2.3.4) and it appears the
implementation of normalization in Python 2.3.4 may
have the bug therein described. Please see the
description of the bug and the textual fix that is
being made to UAX #15, at the URL:
http://www.unicode.org/review/pr-29.html
The bug is in the definition of rule D2, affecting the
characters "blocked" during re-composition.

You may contact me by e-mail, or fill out the
Unicode.org error reporting form if you have any
questions or concerns.

Since Python uses Unicode internally, it may also be
wise to have someone from the Python development
community on the Unicode Consortium's notification list
to receive immediate notifications of public review
issues, bugs, and other announcements affecting
implementation of the standard.
msg22885 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-10-27 18:11
Thanks for submitting a bug report. The problem does indeed
occur in the Python normalization code:

>>> import unicodedata
>>> unicodedata.normalize('NFC', u'\u0B47\u0300\u0B3E')
u'\u0b4b\u0300'

I think the following line in unicodedata.c needs to be changed:

          if (comb1 && comb == comb1) {
              /* Character is blocked. */
              i1++;
              continue;
          }

to

          if (comb && (comb1 == 0 || comb == comb1)) {
              /* Character is blocked. */
              i1++;
              continue;
          }

Martin, what do you think?
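
For reference, the behaviour reported above can be checked with a short script. This is a minimal sketch, assuming a Python 3 interpreter whose unicodedata module already carries the PR-29 correction; the helper name is_nfc_stable is hypothetical and is not part of unicodedata.

```python
import unicodedata


def is_nfc_stable(text):
    """Return True if NFC normalization leaves `text` unchanged (hypothetical helper)."""
    return unicodedata.normalize("NFC", text) == text


# Sequence from this report: U+0300 sits between U+0B47 and U+0B3E,
# so under the corrected rule the two must not be recomposed.
blocked = "\u0B47\u0300\u0B3E"

# Control case: with nothing in between, U+0B47 U+0B3E composes to U+0B4B.
adjacent = "\u0B47\u0B3E"

print(is_nfc_stable(blocked))
print(unicodedata.normalize("NFC", adjacent) == "\u0B4B")
```

On an interpreter that still has the bug, the first check would be expected to fail, because the sequence is rewritten to U+0B4B U+0300 as shown in the session above.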
msg22886 - (view) Author: Rick McGowan (rick_mcgowan) Date: 2004-10-27 20:11
Thanks, all, for the quick reply. My initial thoughts regarding a
fix were as below. The relevant piece of code seems to be in
the function "nfc_nfkc()" in the file unicodedata.c.

>           if (comb1 && comb == comb1) { 
>               /* Character is blocked. */ 
>               i1++; 
>               continue; 
>           } 

That should possibly be changed to: 

>           if (comb1 && (comb <= comb1)) { 
>               /* Character is blocked. */ 
>               i1++; 
>               continue; 
>           } 

because the new spec says "either B is a starter or it has
the same or higher combining class as C".
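
The quoted wording of the rule can be written out as a small predicate. This is only a sketch of the condition for a single intervening character B and a candidate C, not the code in unicodedata.c; the function name is_blocked is hypothetical, and the full definition in UAX #15 quantifies over every character between the starter and C.

```python
import unicodedata


def is_blocked(b, c):
    """Sketch of the corrected condition: B blocks C from the preceding starter
    if B is itself a starter or has the same or a higher combining class than C.
    (Hypothetical helper; handles one intervening character only.)
    """
    ccc_b = unicodedata.combining(b)
    ccc_c = unicodedata.combining(c)
    return ccc_b == 0 or ccc_b >= ccc_c


# U+0300 (class 230) in front of U+0B3E (class 0): blocked.
print(is_blocked("\u0300", "\u0B3E"))
# U+0323 (class 220) in front of U+0301 (class 230): not blocked.
print(is_blocked("\u0323", "\u0301"))
```

The first call corresponds to the U+0B47 U+0300 U+0B3E sequence discussed earlier in this thread.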
msg22887 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2005-03-15 08:59
Is it true that the most recent interpretation of this PR
suggests that the correction should only apply to Unicode
4.1? If so, I think Python should abstain from adopting the
change right now, and should defer that to the point when
the Unicode 4.1 database is incorporated.
msg22888 - (view) Author: Rick McGowan (rick_mcgowan) Date: 2005-03-15 16:45
Yes. The "current" version of UAX #15 is an annex to Unicode
4.1, which will be coming out very soon. No previous
versions of Unicode have been changed. Previous versions of
UAX #15 apply to previous versions of the standard. The UTC
plans to issue a "corrigendum" for this problem, and the
corrigendum is something that *can* be applied to
implementations of earlier versions of Unicode. In that
case, one would cite the implementation of "Unicode Version
X with Corrigendum Y" as shown on the "Enumerated Versions"
page of the Unicode web site. To follow corrigenda, you may
want to keep tabs on the "Updates and Errata" page on the
Unicode web site. This is likely to be Corrigendum #5. You
could fix the bug when you update Python to Unicode 4.1, or
fix it when the corrigendum comes out. Of course, I would
recommend fixing bugs sooner rather than later, but your
release plans may be such that one path is easier. If it's
going to be a long time before you update to 4.1, you may
want to fix the bug and cite the corrigendum when it comes
out. If you plan to update to 4.1 soon after it comes out,
perhaps fixing the bug with that update is fine.
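
To see which Unicode database version a given interpreter ships, something like the following can be used. It is a minimal sketch assuming a Python 3 interpreter; note that it only reports the database version and says nothing about whether the normalization code itself contains the correction.

```python
import unicodedata

# Version string of the Unicode Character Database compiled into this build.
version = unicodedata.unidata_version
major, minor = (int(part) for part in version.split(".")[:2])

print(version)
# True when the bundled database is at least Unicode 4.1, the version whose
# UAX #15 carries the corrected definition according to the message above.
print((major, minor) >= (4, 1))
```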
msg22889 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-03-10 12:00
When this is fixed, the @Part3 data in the normalization
tests need to be considered as well.
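
As a rough illustration of what checking such test data involves, the sketch below verifies the NFC invariants for one line in the NormalizationTest.txt layout (five semicolon-separated columns of hex code points: source, NFC, NFD, NFKC, NFKD). The helper names are hypothetical, the sample lines are hand-written for this example rather than copied from the file, and a Python 3 interpreter with the correction applied is assumed.

```python
import unicodedata


def parse_field(field):
    """Turn a column such as '0B47 0300' into a string (hypothetical helper)."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def check_nfc_line(line):
    """Check the NFC invariants for one data line:
    c2 == NFC(c1) == NFC(c2) == NFC(c3) and c4 == NFC(c4) == NFC(c5).
    """
    data = line.split("#", 1)[0]  # drop the trailing comment
    c1, c2, c3, c4, c5 = [parse_field(f) for f in data.split(";")[:5]]

    def nfc(s):
        return unicodedata.normalize("NFC", s)

    return (c2 == nfc(c1) == nfc(c2) == nfc(c3)) and (c4 == nfc(c4) == nfc(c5))


# Hand-written sample in the file's format, using the sequence from this report.
sample = ("0B47 0300 0B3E;0B47 0300 0B3E;0B47 0300 0B3E;"
          "0B47 0300 0B3E;0B47 0300 0B3E; # hand-written sample")
# Same line with the pre-correction NFC result in column 2.
bad = ("0B47 0300 0B3E;0B4B 0300;0B47 0300 0B3E;"
       "0B47 0300 0B3E;0B47 0300 0B3E; # pre-correction result")

print(check_nfc_line(sample))
print(check_nfc_line(bad))
```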
msg59202 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-01-04 01:21
Python 2.6, and probably 2.5 as well, still contains the line
if (comb1 && comb == comb1) {...}
msg86583 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2009-04-26 01:05
The code is still the same as described by MAL, and we're now on Unicode DB 5.1.
msg87111 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-05-04 10:42
Here is a patch fixing Unicode issue "PR29". I used the test cases
given in http://www.unicode.org/review/pr-29.html
msg100382 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-03-04 12:16
Committed: r78646 (trunk), r78647 (py3k), r78648 (3.1).

I'm leaving the issue open to remind me that I have to backport to 2.6 (after the 2.6.5 release).
msg101424 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-03-21 13:42
> Committed: r78646 (trunk)

Backport done: r79201 (2.6).
History
Date | User | Action | Args
2022-04-11 14:56:07 | admin | set | github: 41086
2010-03-21 13:42:03 | vstinner | set | status: open -> closed; resolution: remind -> fixed; messages: + msg101424
2010-03-06 16:09:20 | loewis | set | status: pending -> open
2010-03-06 15:30:21 | ezio.melotti | set | status: open -> pending; versions: + Python 2.7, Python 3.2; nosy: + ezio.melotti; resolution: remind; stage: test needed -> resolved
2010-03-04 12:16:53 | vstinner | set | messages: + msg100382
2009-05-04 10:42:34 | vstinner | set | files: + unicode_pr29.patch; keywords: + patch; messages: + msg87111
2009-04-26 01:05:37 | ajaksu2 | set | versions: + Python 3.1, - Python 2.5; nosy: + ajaksu2, vstinner; messages: + msg86583; type: behavior; stage: test needed
2008-01-04 01:21:07 | christian.heimes | set | nosy: + christian.heimes; messages: + msg59202; versions: + Python 2.6, Python 2.5, - Python 2.3
2004-10-26 23:58:06 | rick_mcgowan | create |