Issue 1452697: broken string on mbcs

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/43048

classification

Title:	broken string on mbcs
Type:		Stage:
Components:	Unicode	Versions:	Python 2.4

process

Status:	closed	Resolution:
Dependencies:		Superseder:
Assigned To:	lemburg	Nosy List:	lemburg, ocean-city
Priority:	high	Keywords:

Created on 2006-03-17 20:07 by ocean-city, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
a.py	ocean-city, 2006-03-17 20:08	script to reproduce the problem
a.txt	ocean-city, 2006-03-17 20:08	input text file
b.txt	ocean-city, 2006-03-17 20:10	and broken result (see No.7's message)
mbcs.patch	ocean-city, 2006-03-18 04:17	Probably this patch will fix the problem
mbcs_2.patch	ocean-city, 2006-03-19 02:08	Probably this patch will fix the problem (version2)

Messages (4)
msg27818 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-03-17 20:07
Hello. I noticed that unicode conversion from mbcs is sometimes broken. This happens when I iterate over the file object returned by codecs.open("foo", "r", "mbcs"). (It works correctly if I use "shift_jis" or "cp932" instead.) I'll attach a script and a text file to reproduce the problem. I'm using Windows 2000 SP4 (Japanese). Thank you.
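For readers without the attachments, a check along these lines can show whether line-by-line iteration and whole-file decoding agree. This is a hypothetical sketch, not the attached a.py: the function name `iterated_matches_whole` is invented for illustration, and it builds the codec's stream reader directly with codecs.getreader, which is also what codecs.open relies on. The mbcs codec exists only on Windows, so pass "cp932" elsewhere.

```python
import codecs


def iterated_matches_whole(path, encoding):
    """Return True if iterating the decoded file line by line yields
    exactly the same text as decoding the whole file in one call."""
    # Reference result: decode every byte of the file at once.
    with open(path, "rb") as f:
        whole = f.read().decode(encoding)

    # Result under test: iterate over the codec's stream reader,
    # which decodes the file piece by piece.
    with open(path, "rb") as f:
        reader = codecs.getreader(encoding)(f)
        pieces = "".join(reader)

    return pieces == whole
```

On an affected Windows system one would call iterated_matches_whole("a.txt", "mbcs") and expect False while the bug is present; that expectation is inferred from the report above and has not been verified here.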
msg27819 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-03-18 04:17
Probably this patch will fix the problem (for release24-maint).
Cause: MultiByteToWideChar returns a non-zero value for an incomplete multibyte character. For example, if the buffer ends with a lead byte, MultiByteToWideChar returns 1 (not 0) for it. It should return 0; otherwise the result will be broken.
Solution: Set the MB_ERR_INVALID_CHARS flag to avoid incorrect handling of a trailing incomplete multibyte sequence. If an error occurs, remove the trailing byte and try again.
Caution: I have not tested this very intensively.
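The failure mode described in this message can be illustrated without Windows. The following is a minimal sketch, not the patch: it uses the portable cp932 codec as a stand-in for mbcs (an assumption; the report only says that cp932 and shift_jis behave correctly) and cuts a buffer in the middle of a double-byte character. Decoding each chunk on its own corrupts the split character, while a decoder that holds back the dangling lead byte does not.

```python
import codecs

text = "日本語"
data = text.encode("cp932")          # three characters, two bytes each

# Cut the buffer so the first chunk ends with a lone lead byte.
first, second = data[:3], data[3:]

# Stateless decoding: each chunk is decoded independently, so the
# split character cannot be reassembled.
naive = (first.decode("cp932", errors="replace")
         + second.decode("cp932", errors="replace"))

# Incremental decoding: the decoder keeps the incomplete trailing
# byte and prepends it to the next chunk.
decoder = codecs.getincrementaldecoder("cp932")()
incremental = (decoder.decode(first)
               + decoder.decode(second)
               + decoder.decode(b"", final=True))

print(naive == text)        # False: the split character was mangled
print(incremental == text)  # True
```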
msg27820 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-03-19 02:08
I updated the patch. Compared to version 1:
* [bug] consumed should be 0 if the length of the string is 0.
* [enhancement] Use IsDBCSLeadByte to detect a buffer that ends in an incomplete character, instead of retrying MultiByteToWideChar with MB_ERR_INVALID_CHARS. The retry approach could cause a performance hit if the string contains invalid characters in its early part.
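The lead-byte idea can be sketched in pure Python. This is an illustration under stated assumptions, not the C patch: `is_lead_byte` is a hypothetical stand-in for the Win32 IsDBCSLeadByte call, hard-coded to the lead-byte ranges usually given for code page 932 (0x81-0x9F and 0xE0-0xFC), and `complete_prefix_len` and `decode_chunk` are names invented here.

```python
def is_lead_byte(b):
    """Hypothetical stand-in for IsDBCSLeadByte, assuming code page 932."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def complete_prefix_len(data):
    """Return how many leading bytes of data form whole characters."""
    i = 0
    n = len(data)
    while i < n:
        # A lead byte starts a two-byte character; anything else is one byte.
        step = 2 if is_lead_byte(data[i]) else 1
        if i + step > n:
            break  # the buffer ends in the middle of a character
        i += step
    return i


def decode_chunk(data):
    """Decode only the complete prefix and report the bytes consumed."""
    consumed = complete_prefix_len(data)
    return data[:consumed].decode("cp932"), consumed
```

The scan runs from the start of the buffer because a trail byte can fall inside the lead-byte range, so looking at the last byte alone is ambiguous. An empty buffer yields ("", 0), which matches the "consumed should be 0" fix listed above.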
msg27821 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-03-22 07:14
I'll move this to the "Patches" tracker.

History
Date	User	Action	Args
2022-04-11 14:56:16	admin	set	github: 43048
2006-03-17 20:07:12	ocean-city	create