classification
Title: broken string on mbcs
Type: Stage:
Components: Unicode Versions: Python 2.4
process
Status: closed Resolution:
Dependencies: Superseder:
Assigned To: lemburg Nosy List: lemburg, ocean-city
Priority: high Keywords:

Created on 2006-03-17 20:07 by ocean-city, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
a.py ocean-city, 2006-03-17 20:08 script to reproduce the problem
a.txt ocean-city, 2006-03-17 20:08 input text file
b.txt ocean-city, 2006-03-17 20:10 and broken result (see No.7's message)
mbcs.patch ocean-city, 2006-03-18 04:17 Probably this patch will fix the problem
mbcs_2.patch ocean-city, 2006-03-19 02:08 Probably this patch will fix the problem (version2)
Messages (4)
msg27818 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2006-03-17 20:07
Hello. I noticed that Unicode conversion from mbcs is
sometimes broken. This happens when I use
codecs.open("foo", "r", "mbcs") as an iterator.

# It's OK if I use "shift_jis" or "cp932".

I'll attach the script and the text file to reproduce the
problem. I'm using Windows 2000 SP4 (Japanese).

Thank you.
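
The following is a minimal sketch of the kind of failure described above, not the attached a.py. It assumes the reader splits the input at a byte offset that falls inside a two-byte character, and it uses "cp932" as a portable stand-in for "mbcs" (which exists only on Windows). The sample text and the split offset are illustrative assumptions.

```python
import codecs

# Sample text and split position are assumptions chosen for illustration.
text = "日本語のテキスト"
raw = text.encode("cp932")

# Split after 3 bytes, so the second character is cut in half.
chunks = [raw[:3], raw[3:]]

# Decoding each chunk on its own loses the character that spans the boundary.
naive = "".join(chunk.decode("cp932", errors="replace") for chunk in chunks)

# An incremental decoder keeps the dangling lead byte until the next chunk.
decoder = codecs.getincrementaldecoder("cp932")()
stateful = "".join(decoder.decode(chunk) for chunk in chunks)
stateful += decoder.decode(b"", final=True)

print(naive == text)     # the naive result differs from the input text
print(stateful == text)  # the stateful result matches the input text
```

A decoder that reports fewer consumed bytes when the buffer ends mid-character gives the stream reader the same effect as the incremental decoder here.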
msg27819 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2006-03-18 04:17
Probably this patch will fix the problem. (for release24-maint)

Cause: MultiByteToWideChar returns a nonzero value for an
incomplete multibyte character. For example, if the buffer
ends with a lead byte, MultiByteToWideChar returns 1 (not 0)
for it. It should return 0; otherwise the result is broken.

Solution: Set the MB_ERR_INVALID_CHARS flag to avoid incorrect
handling of a trailing incomplete multibyte sequence. If an
error occurs, remove the trailing byte and try again.
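
The retry idea can be sketched in Python as follows. This is not the C patch: it assumes a strict "cp932" decode is a reasonable stand-in for MultiByteToWideChar with MB_ERR_INVALID_CHARS, and the helper name decode_complete_prefix is hypothetical.

```python
def decode_complete_prefix(data: bytes, encoding: str = "cp932"):
    """Return (text, consumed), leaving a dangling trailing byte undecoded.

    Hypothetical helper: a strict decode plays the role of
    MB_ERR_INVALID_CHARS, and one retry drops the last byte.
    """
    if not data:
        # Nothing to decode, so nothing is consumed.
        return "", 0
    try:
        return data.decode(encoding, "strict"), len(data)
    except UnicodeDecodeError:
        # Assume the failure is an incomplete trailing character.
        # If the data is invalid elsewhere, this second decode raises too.
        trimmed = data[:-1]
        return trimmed.decode(encoding, "strict"), len(trimmed)


# One complete character followed by a lone lead byte.
text, consumed = decode_complete_prefix("あ".encode("cp932") + b"\x82")
print(text, consumed)
```

The caller would keep the unconsumed byte and prepend it to the next chunk it reads.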

Caution: I have not tested this intensively.
msg27820 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2006-03-19 02:08
I updated the patch. Compared to version 1:

  * [bug] consumed should be 0 if the length of the string is 0

  * [enhancement] use IsDBCSLeadByte to detect incomplete
    buffer termination instead of trying MultiByteToWideChar
    with MB_ERR_INVALID_CHARS. The MB_ERR_INVALID_CHARS approach
    could cause a performance hit if the string contains
    invalid chars in its early part.
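
A rough Python analogue of the lead-byte check is sketched below. It is not the code in mbcs_2.patch. It assumes code page 932, where the lead-byte ranges are taken to be 0x81-0x9F and 0xE0-0xFC, and both function names are hypothetical. Because a trail byte can also fall in the lead-byte range, the sketch counts the run of lead-range bytes at the end of the buffer rather than testing only the last byte.

```python
def is_cp932_lead_byte(byte: int) -> bool:
    # Assumed lead-byte ranges for code page 932 (stand-in for IsDBCSLeadByte).
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def complete_prefix_length(data: bytes) -> int:
    """Return how many leading bytes of data form complete characters."""
    run = 0
    for byte in reversed(data):
        if not is_cp932_lead_byte(byte):
            break
        run += 1
    # A byte outside the lead range always ends a character, so the bytes
    # after it pair up as lead+trail. An odd run leaves one lead byte over.
    return len(data) - (run % 2)


data = "あ".encode("cp932") + b"\x82"  # complete character + lone lead byte
consumed = complete_prefix_length(data)
print(consumed, data[:consumed].decode("cp932"))
```

This check inspects only the tail of the buffer, so its cost does not depend on whether invalid bytes appear earlier in the string.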
msg27821 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2006-03-22 07:14
I'll move this to the "Patches" tracker.
History
Date User Action Args
2022-04-11 14:56:16 admin set github: 43048
2006-03-17 20:07:12 ocean-city create