
classification
Title:       Decoding with unicode_internal segfaults on UCS-4 builds
Components:  Unicode
Versions:    Python 2.5

process
Status:      closed
Resolution:  fixed
Assigned To: doerwalter
Nosy List:   doerwalter, lemburg, nhaldimann
Priority:    normal

Created on 2005-08-03 19:49 by nhaldimann, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name              Uploaded                      Description
unicode_internal.diff  nhaldimann, 2005-08-05 14:50  Patch
unicode_internal.diff  nhaldimann, 2005-08-05 21:08  Improved patch
Messages (11)
msg25964 - Author: Nik Haldimann (nhaldimann) Date: 2005-08-03 19:49
On UCS-4 builds, decoding a byte string with the
unicode_internal codec doesn't work correctly for code
points from 0x80000000 upwards and can even segfault. I
have observed the same behaviour with 2.5 from CVS and
2.4.0 on OS X/PowerPC, as well as with 2.3.5 on Linux/x86.
Here's an example:

Python 2.5a0 (#1, Aug  3 2005, 21:34:05) 
[GCC 3.3 20030304 (Apple Computer, Inc. build 1671)] on
darwin
Type "help", "copyright", "credits" or "license" for
more information.
>>> "\x7f\xff\xff\xff".decode("unicode_internal")
u'\U7fffffff'
>>> "\x80\x00\x00\x00".decode("unicode_internal")
u'\x00'
>>> "\x80\x00\x00\x01".decode("unicode_internal")
u'\x01'
>>> "\x81\x00\x00\x00".decode("unicode_internal")
Segmentation fault

On little-endian architectures the byte strings must be
reversed to produce the same results.
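For example, on a little-endian UCS-4 build the first call
above should presumably read:

>>> "\xff\xff\xff\x7f".decode("unicode_internal")
u'\U7fffffff'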

I'm not sure if I understand what's going on, but I see
two solution strategies:

1. Make unicode_internal work for any code point up to
0xFFFFFFFF.

2. Make unicode_internal raise a UnicodeDecodeError for
anything above 0x10FFFF (== sys.maxunicode for UCS-4
builds).

Unicode defines no code points above 0x10FFFF, so the
latter solution feels more correct to me, even though it
might break backwards compatibility slightly. The
unicodeescape codec already does a similar thing:

>>> u"\U00110000"
UnicodeDecodeError: 'unicodeescape' codec can't decode
bytes in position 0-9: illegal Unicode character
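
Under solution 2 I'd expect something like this instead
(the exact message is invented for illustration):

>>> "\x80\x00\x00\x00".decode("unicode_internal")
UnicodeDecodeError: 'unicode_internal' codec can't decode
bytes in position 0-3: code point not in range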
msg25965 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2005-08-04 14:41
I think solution 2 is the right approach, since Unicode
only defines code points up to 0x10FFFF.

Could you provide a patch?
msg25966 - Author: Nik Haldimann (nhaldimann) Date: 2005-08-05 14:50
OK, I put something together. Please review carefully, as
I'm not very familiar with the C API. I have tested this
with CVS HEAD on OS X and Linux.
msg25967 - Author: Walter Dörwald (doerwalter) (Python committer) Date: 2005-08-05 16:03
Your patch doesn't support PEP 293 error handlers. Could you
add support for that?
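
For reference, PEP 293 error handlers are plain callables
registered via codecs.register_error(); they receive the
UnicodeError instance and return a (replacement, resume
position) tuple. A minimal sketch of the mechanism, using
the ascii codec and a made-up handler name:

>>> import codecs
>>> def qmark(exc):
...     # replace the offending bytes with u"?" and
...     # resume decoding right after them
...     return (u"?", exc.end)
...
>>> codecs.register_error("qmark", qmark)
>>> "a\xffb".decode("ascii", "qmark")
u'a?b'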
msg25968 - Author: Nik Haldimann (nhaldimann) Date: 2005-08-05 16:35
Ah, that PEP clears some things up for me. I will look into
it, but I hope you realize this requires tinkering with
unicodeobject.c since the error handler code seems to live
there.
msg25969 - Author: Nik Haldimann (nhaldimann) Date: 2005-08-05 21:08
Here's the patch with error handler support and a test.
Again, please review carefully.
msg25970 - Author: Walter Dörwald (doerwalter) (Python committer) Date: 2005-08-18 20:17
The patch has a problem with input strings whose length is
not a multiple of 4: e.g.
"\x00".decode("unicode-internal") returns u"" instead of
raising an error. Also, in a UCS-2 build most of the tests
are irrelevant (it's not possible to create code points
above 0x10FFFF even when using surrogates), so they should
probably be #ifdef'd out.
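
After a fix, I'd expect the truncated input to fail
instead of decoding silently; something like this (the
exact message is illustrative):

>>> "\x00".decode("unicode-internal")
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'unicode_internal' codec can't decode
bytes in position 0-0: truncated data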
msg25971 - Author: Nik Haldimann (nhaldimann) Date: 2005-08-19 14:17
I agree about the ifdefs. I'm not sure how to handle input
strings of incorrect length, though. I guess raising a
UnicodeDecodeError is in order. But I don't think it makes
sense to let it pass through the error handler, since the
data the handler would see is potentially nonsensical
(e.g., the code point value). Can you comment on this? Is
it OK to raise a UnicodeDecodeError and skip the error
handler here?
msg25972 - Author: Walter Dörwald (doerwalter) (Python committer) Date: 2005-08-19 15:39
The data the handler sees is nonsensical by definition. ;)
To get an idea how to handle an incorrect length, take a
look at Objects/unicodeobject.c::PyUnicode_DecodeUTF16Stateful()
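
As a rough pure-Python model of that approach (assuming a
little-endian UCS-4 build; the real fix belongs in C):

import codecs
import struct

def decode_unicode_internal(data, errors="strict"):
    # Toy model: consume 4-byte native-order units and
    # delegate anything invalid to the PEP 293 error
    # handler, the same way PyUnicode_DecodeUTF16Stateful
    # handles a trailing odd byte.
    handler = codecs.lookup_error(errors)
    result = []
    pos = 0
    while pos < len(data):
        if len(data) - pos < 4:
            # trailing partial unit: truncated data
            exc = UnicodeDecodeError(
                "unicode_internal", data, pos, len(data),
                "truncated data")
            repl, pos = handler(exc)
            result.append(repl)
            continue
        (cp,) = struct.unpack("<I", data[pos:pos + 4])
        if cp > 0x10FFFF:
            # code point outside the Unicode range
            exc = UnicodeDecodeError(
                "unicode_internal", data, pos, pos + 4,
                "code point not in range")
            repl, pos = handler(exc)
            result.append(repl)
            continue
        result.append(unichr(cp))
        pos += 4
    return u"".join(result)

With errors="strict" the handler simply raises the
exception; a substituting handler like the one sketched
above would insert its replacement and continue decoding.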
msg25973 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2005-08-19 15:45
Assigning to Walter, the error handler expert :-)
msg25974 - Author: Walter Dörwald (doerwalter) (Python committer) Date: 2005-08-30 10:47
I've checked in a version that detects truncated data. The
checkins are:

Include/unicodeobject.h 2.49
Lib/test/test_codeccallbacks.py 1.18
Lib/test/test_codecs.py 1.26
Misc/NEWS 1.1358
Modules/_codecsmodule.c 2.22
Objects/unicodeobject.c 2.231

and

Include/unicodeobject.h 2.48.2.1
Lib/test/test_codeccallbacks.py 1.16.4.2
Lib/test/test_codecs.py 1.15.2.8
Misc/NEWS 1.1193.2.92
Modules/_codecsmodule.c 2.20.2.2
Objects/unicodeobject.c 2.230.2.1

Thanks for the patch!
History
Date                 User        Action  Args
2022-04-11 14:56:12  admin       set     github: 42248
2005-08-03 19:49:18  nhaldimann  create