Issue 1455898: patch for mbcs codecs

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/43070

classification

Title:	patch for mbcs codecs
Type:		Stage:
Components:	Windows	Versions:	Python 2.5

process

Status:	closed	Resolution:	accepted
Dependencies:		Superseder:
Assigned To:		Nosy List:	doerwalter, jwnmulder, lemburg, loewis, ocean-city
Priority:	high	Keywords:	patch

Created on 2006-03-22 07:31 by ocean-city, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
a.zip	ocean-city, 2006-03-22 07:31	the files to reproduce the problem
b.txt	ocean-city, 2006-03-22 07:32	broken result from a.zip
test_mbcs.py	ocean-city, 2006-05-25 11:07	test (place this in lib/test)
mbcs.patch	ocean-city, 2006-05-26 21:43	patch (support 64bit ssize_tenvironment)
fix.zip	ocean-city, 2006-07-25 16:17	addtional patch + test
fix.patch	ocean-city, 2006-07-25 18:46	addtional patch (version2)

Messages (29)
msg49812 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-03-22 07:31
Hello. I have noticed mbcs codecs sometimes generates broken string. I'm using Windows(Japanese) so mbcs is mapped to cp932 (close to shift_jis) When I run the attached script "a.zip", the entry "Error 00007"'s message becomes broken like attached file "b.txt". I think this happens because the string passed to PyUnicode_DecodeMBCS() sometimes terminates with leading byte, and MultiByteToWideChar() counts it for size of result string.buffer size. I hope attached patch "mbcs.patch" may fix the problem. It would be nice if this bug will be fixed in 2.4.3... Thank you.
msg49813 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-03-22 07:42
Logged In: YES user_id=1200846 I forgot to mention this. "mbcs.patch" is for release24-maint branch.
msg49814 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2006-03-22 09:11
Logged In: YES user_id=38388 As I understand your comment, the mbcs codec will have a problem if the input string terminates with a lead byte. Could you add a comment regarding this to the patch ?! I can't test the patch, since I don't have a Japanese Windows to check on, but from looking at the patch, it seems OK.
msg49815 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2006-03-22 09:12
Logged In: YES user_id=38388 One more nit: the doc patch is missing. Please add a patch for the API docs.
msg49816 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-03-22 10:13
Logged In: YES user_id=1200846 Thank you for reply. How about this? (I'm a newbie, I hope this is right tex format but... can you confirm this? I created this patch by copy & paste from PyUnicode_DecodeUTF16Stateful and some modification)
msg49817 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-03-22 10:31
Logged In: YES user_id=1200846 Sorry, I found problem when tried more long text file... Please wait. I'll investigate more intensibly.
msg49818 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-03-22 14:36
Logged In: YES user_id=1200846 Sorry, I was stupid. MSDN (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_0o2t.asp) saids, > IsDBCSLeadByte can only indicate a potential lead byte value. IsDBCSLeadByte was returning 1 for some trail byte (ex: "歴"[1]) The patch "mbcs3a.patch" worked for me, but _mbsbtype is probably compiler specific. Is that OK? The patch "mbcs3b.patch" also worked for me and it only uses Win32API, but I don't have enough faith on this implementation...
msg49819 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-03-23 02:51
Logged In: YES user_id=1200846 Hello. This is my final patch. (mbcs4.patch) - mbcs3a.patch: _mbsbtype depends on locale not system ANSI code page. so probably it's not good to use it with MultiByteToWideChar. - mbcs3b.patch: CharNext may cause buffer overflow. and this patch always calls CharPrev but it's not needed if string is not terminated with "potensial" lead byte. I hope this is stable enough to commit on repositry. Thank you.
msg49820 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2006-03-23 21:44
Logged In: YES user_id=89016 This isn't a bugfix in the strictest sense, so IMHO this patch shouldn't go into 2.4. If the patch goes into 2.5, it would need the appropriate changes to encodings/mbcs.py (i.e. the IncrementalDecoder would have to be changed (inheriting from BufferedIncrementalDecoder). I realize that this patch might be hard to test, as results are dependent on locale. Nevertheless at least some tests would be good (even if they are only run or do something useful on a certain locale and are skipped otherwise). ocean-city, can you update the patch for the trunk and add tests?
msg49821 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-03-24 14:02
Logged In: YES user_id=1200846 OK, I'll try.
msg49822 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2006-03-26 21:05
Logged In: YES user_id=21627 I have reservations against this patch because of the quasi-anonymous nature of the submission. ocean-city, can you please state your real name? Would you also be willing to fill out a contributor form, as shown on http://www.python.org/psf/contrib-form.html
msg49823 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-03-27 05:44
Logged In: YES user_id=1200846 My real name is Hirokazu Yamamoto. But sorry, I don't have FAX. (It's needed to send contributor form, isn't it?) I'll attach the patch updated for trunk. And I'll attach the tests.
msg49824 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-03-27 07:41
Logged In: YES user_id=1200846 I replaced tests. Probably this is better instead of comparing the two string generated by same decoder.
msg49825 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2006-03-27 16:06
Logged In: YES user_id=89016 _buffer_decode() in the IncrementalDecoder ignores the final argument. IncrementalDecoder._buffer_decode() should pass on its final argument to _codecsmodules.c::mbcs_decode(), which should be extended to accept the final argument. Also PyUnicode_DecodeMBCSStateful() must handle consumed == NULL correctly (with your patch it drops trailing lead bytes even if consumed == NULL)
msg49826 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-03-28 08:16
Logged In: YES user_id=1200846 You are right. I've updated the patch. (mbcs5.patch) >>> import codecs [20198 refs] >>> d = codecs.getincrementaldecoder("mbcs")() [20198 refs] >>> d.decode('\x82\xa0\x82') u'\u3042' [20198 refs] >>> d.decode('') u'' [20198 refs] >>> d.decode('', final=True) u'\x00' [20198 refs]
msg49827 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-04-07 09:10
Logged In: YES user_id=1200846 I have sent contributor form via postal mail. Probably you can get it after 10 days.
msg49828 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2006-04-25 17:22
Logged In: YES user_id=89016 I think the default value for final in mbcs_decode() should be true, so that the stateless decoder detects incomplete byte sequences too.
msg49829 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-05-02 11:40
Logged In: YES user_id=1200846 I updated the patch. (I copy and pasted "int final = 0" from above code (ex: utf_16_ex_decode), maybe they also should be changed for consistency?) And one more thing, I noticed "errors" is ignored now. We can detect invalid character if we set MB_ERR_INVALID_CHARS flag when calling MultiByteToWideChar, but we cannot tell where is the position of invalid character, and MSDN saids this flag is available Win2000SP4 or later (I don't know why) http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_17si.asp So I didn't make the patch for it.
msg49830 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-05-24 20:18
Logged In: YES user_id=1200846 I updated the patch. - PyUnicode_DecodeMBCS now supports size >= INT_MAX. (I don't have machine to test such big string, but I have tested this routine replaced INT_MAX with 2 and 3) PyUnicode_DecodeMBCS does not support size >= INT_MAX yet, but probably I'll fix it too. This patch includes Patch#1494487.
msg49831 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-05-25 11:06
Logged In: YES user_id=1200846 >PyUnicode_DecodeMBCS does not support size >= INT_MAX yet, >but probably I'll fix it too. Done. Attached as "mbcs_win64_support.patch". Now, total summary... - MBCS decoder and encoder now supports 64bit Py_ssize_t environment. (I don't have such machine, but I checked routine by defining NEED_RETRY and redefining INT_MAX as 2, 3, 4) - Fixed a bug of MBCS incremental decoder which was originaly reported by me.
msg49832 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2006-05-26 15:43
Logged In: YES user_id=89016 The change to PyUnicode_Resize() should be reverted (or done in a way that doesn't lead to bugs). Unfortunately I don't have a Windows where I can test the patch, so I'm unassigning the bug. You should probably find someone on python-dev with a multibyte version of Windows to look at the patch.
msg49833 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-05-26 21:18
Logged In: YES user_id=1200846 >The change to PyUnicode_Resize() should be reverted (or done >in a way that doesn't lead to bugs). Sorry, how about this? Index: Objects/unicodeobject.c =================================================================== --- Objects/unicodeobject.c (revision 46417) +++ Objects/unicodeobject.c (working copy) @@ -326,7 +326,7 @@ return -1; } v = (PyUnicodeObject )unicode; - if (v == NULL \|\| !PyUnicode_Check(v) \|\| v->ob_refcnt != 1 \|\| length < 0) { + if (v == NULL \|\| !PyUnicode_Check(v) \|\| length < 0) { PyErr_BadInternalCall(); return -1; } @@ -335,7 +335,7 @@ possible since these are being shared. We simply return a fresh copy with the same Unicode content. / if (v->length != length && - (v == unicode_empty \|\| v->length == 1)) { + (v == unicode_empty \|\| v->length == 1 \|\| v->ob_refcnt != 1)) { PyUnicodeObject w = _PyUnicode_New(length); if (w == NULL) return -1;
msg49834 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-05-26 21:40
Logged In: YES user_id=1200846 I reverted PyUnicode_Resize() patch for now, and recreated the patch as "mbcs.patch". >You should probably find someone on python-dev with a >multibyte version of Windows to look at the patch. OK, I will.
msg49835 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2006-06-14 05:21
Logged In: YES user_id=21627 Thanks for the patch. Committed as r46945.
msg49836 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-07-25 16:17
Logged In: YES user_id=1200846 Sorry, I reopened this issue because I found problem. With attached "mbcs.py" and "mbcs.txt", result file "hoge.txt" gets corrupted. This happens because Codec.decode is called incrementally, while default value for final in mbcs_decode() is True. >I think the default value for final in mbcs_decode() should >be true, so that the stateless decoder detects incomplete >byte sequences too. Probably this is not true. I think "stateless" means codec doesn't have internal state for incremental usage, doesn't mean it is not called incrementally. # I hope attached "fix.patch" fixes the problem.
msg49837 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-07-25 18:46
Logged In: YES user_id=1200846 >>I think the default value for final in mbcs_decode() should >>be true, so that the stateless decoder detects incomplete >>byte sequences too. > >Probably this is not true. Sorry, I lied. Stateless decoder was exactly the thing you said. >>> import codecs >>> d = codecs.getdecoder("cp932") >>> d('\x82') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'cp932' codec can't decode byte 0x82 in position 0: incomplete multibyte sequence Problem was on StreamReader. StreamReader should treat "final" as false, but I didn't change this code, class StreamReader(Codec,codecs.StreamReader): pass so StreamReader wrongly treated "final" as True. I cloned routine from Lib/encoding/utf-8.py. I hope finally this meets requirement of codec system...
msg49838 - (view)	Author: Jan-Willem (jwnmulder)	Date: 2006-08-01 21:30
Logged In: YES user_id=770969 could be related to bug report 1532726 ?
msg49839 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2006-08-02 13:01
Logged In: YES user_id=21627 jwnmulder: there is definitely no relationship.
msg49840 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2006-08-02 13:54
Logged In: YES user_id=21627 Thanks for the update. Committed as r51046. Please create new a new patch the next time you need to further changes.

History
Date	User	Action	Args
2022-04-11 14:56:16	admin	set	github: 43070
2006-03-22 07:31:46	ocean-city	create