Issue 1450212: int() and isdigit() accept non-digit unicode numbers

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/43036

classification

Title:	int() and isdigit() accept non-digit unicode numbers
Type:		Stage:
Components:	Interpreter Core	Versions:	Python 2.4

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	hyeshik.chang, lemburg, peufeu
Priority:	normal	Keywords:

Created on 2006-03-15 09:05 by peufeu, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (5)
msg27780 - (view)	Author: Peufeu (peufeu)	Date: 2006-03-15 09:05
I had a very surprising bug this morning in a Python script that extracts numeric information from human-entered text. The problem is the following: many Unicode characters, in unicode strings, are considered to be digits. For instance, the character "²" (does it appear on your screen? It's u'\xb2').

The output of the following command is pretty interesting:

    print ''.join([x for x in map( unichr, xrange( 65536 )) if x.isdigit()])

Then, int() will happily parse the string:

    int( u"٥٦٧٨٩۰۱۲" )
    56789012

(I really hope this bug system supports unicode.) However, I can't do a=٥٦٧٨٩۰۱۲, for instance.

Philosophically, Python is right: these characters are probably all digits, and it's pretty cool to be able to parse numbers written in ARABIC-INDIC DIGITs or something (as unicodedata.name says). However, from a practical point of view, I guess most parsing done with Python isn't on OCR'd cuneiform stone tablets, but rather on modern computer documents...

Whenever a surface (in m²) was near a phone number in my human-entered text, the "²" would be absorbed as part of the phone number, because u"²".isdigit() is True. Then bullshit phone numbers would appear on the website. Any number followed by a little footnote number will get the footnote number embedded...

I had to replace all the .isdigit() calls with re.compile( ur"^\d+$" ).match(). Interestingly, for re, even in unicode, \d is 0-9 and nothing else.

At the very least, it would be normal for int() to raise an exception when fed this type of data. Please.
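The classification described in this message can be reproduced with a short sketch. It assumes Python 3, where every str is Unicode, rather than the Python 2.4 of the report, so details may differ from what the reporter saw. The `report` dictionary is a name chosen only for this illustration; the sample characters are taken from the message.

```python
import unicodedata

# Sample characters: superscript two, an Arabic-Indic digit, and an ASCII digit.
samples = ["\u00b2", "\u0665", "5"]

# Map each character's Unicode name to (isdigit, isdecimal, is ASCII 0-9).
report = {}
for ch in samples:
    report[unicodedata.name(ch)] = (
        ch.isdigit(),         # True for any character with a Unicode digit value
        ch.isdecimal(),       # True only for decimal digits usable in base 10
        ch in "0123456789",   # explicit ASCII-only check
    )

for name, flags in report.items():
    print(name, flags)

# int() accepts decimal digits from any script (string from the report,
# written with escapes: five Arabic-Indic and three Extended Arabic-Indic digits).
value = int("\u0665\u0666\u0667\u0668\u0669\u06f0\u06f1\u06f2")
print(value)
```

In this Python 3 sketch, "²" passes isdigit() but not isdecimal(), which is why the footnote-style superscript slips through an isdigit() filter.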
msg27781 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2006-03-15 10:42
Python is following the Unicode standard in this respect. If you want to make sure that only a subset of numbers is parsed, I'd suggest that you write a little helper function that implements the RE check and then lets int() do its work. Rejecting as "invalid".
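A minimal sketch of such a helper, assuming Python 3 and an ASCII-only policy; the name `parse_ascii_int` is hypothetical and not part of any Python API:

```python
import re

# Explicit [0-9] class, so the check does not depend on how \d is
# interpreted for Unicode patterns or on any regex flags.
_ASCII_DIGITS = re.compile(r"[0-9]+")


def parse_ascii_int(text):
    """Return int(text) if text consists solely of ASCII digits 0-9.

    Raises ValueError for anything else, including the empty string,
    superscripts such as "²", and digits from other scripts.
    """
    if _ASCII_DIGITS.fullmatch(text) is None:
        raise ValueError("not an ASCII decimal number: %r" % (text,))
    return int(text)


print(parse_ascii_int("0123456789"))
try:
    parse_ascii_int("12\u00b2")
except ValueError as exc:
    print("rejected:", exc)
```

The regular expression does the filtering and int() still does the conversion, which is the split of responsibilities suggested in the message above.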
msg27782 - (view)	Author: Hyeshik Chang (hyeshik.chang) *	Date: 2006-03-15 12:18
In the meantime, it can simply be regarded as Unicode-conforming. But a minor issue came to mind: I think the name `isdigit' is quite similar to ISO C's equivalent, but they don't behave the same. ISO C and POSIX SUSv3 specify that isdigit() is true only for 0 1 2 3 4 5 6 7 8 9, so C's isdigit() doesn't return true for any Unicode character > ord('9'). I just fear that the inconsistency might cause some confusion.
msg27783 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2006-03-15 12:32
I can see your point, but if we were to follow that scheme, we'd have to introduce a whole new set of APIs for Unicode character testing. Note that the comparison to C standards is flawed in this respect: Unicode APIs would have to be compared to the wide character APIs, e.g. iswdigit(), which does behave (more or less) like isdigit() does in Python for Unicode characters. Furthermore, the isXYZ() and iswXYZ() APIs in C are locale aware (and so are the Python functions for strings), whereas the Python Unicode implementation deliberately is not. So in summary, you can't really compare the C functions to the Python functions.
msg27784 - (view)	Author: Peufeu (peufeu)	Date: 2006-03-15 13:05
It certainly is confusing, and it bit me ;) That .isdigit() is Unicode-conformant is understandable (but a hint should be added to the docs, IMHO). I wish there were an .isasciidigit() method on the unicode string, because using a helper is ugly. However, int() accepting all these characters and happily parsing them worries me a bit more. Is it really supposed to do this?
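No .isasciidigit() method appears in this thread's outcome, but the check asked for here can be approximated by combining two existing string methods. This is only a sketch: it assumes Python 3.7 or later (where str.isascii() is available), and `is_ascii_digit` is a hypothetical helper name.

```python
def is_ascii_digit(text):
    """True only if text is non-empty and every character is one of 0-9."""
    # isascii() rules out "²" and digits from other scripts;
    # isdigit() rules out ASCII letters, punctuation, and the empty string.
    return text.isascii() and text.isdigit()


for candidate in ["0612345678", "25\u00b2", "\u0665\u0666", ""]:
    print(repr(candidate), is_ascii_digit(candidate))
```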

History
Date	User	Action	Args
2022-04-11 14:56:15	admin	set	github: 43036
2006-03-15 09:05:09	peufeu	create