Issue 1532726: incorrect behaviour of PyUnicode_EncodeMBCS?

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/43757

classification

Title:	incorrect behaviour of PyUnicode_EncodeMBCS?
Type:		Stage:
Components:	Interpreter Core	Versions:

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	jwnmulder, nnorwitz, ocean-city
Priority:	normal	Keywords:

Created on 2006-08-01 21:20 by jwnmulder, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (7)
msg29415 - (view)	Author: Jan-Willem (jwnmulder)	Date: 2006-08-01 21:20
Using Python 2.4.3.

This behaviour is not reproducible on a Windows or Linux machine. I found the bug while trying to track down a problem in Python 2.4.3 ported to the xbox.

Running the next two commands

    test_string = 'encode me'
    print repr(test_string.encode('mbcs'))

results on Windows in:

    'encode me'

and on the xbox in:

    'encode me\x00'

The problem is that 'PyUnicode_EncodeMBCS' returns a PyStringObject that contains the data 'encode me' but with an object size of 10. string_repr(test_string) assumes the string contains a 0 character and encodes it as '\x00'.

Looking at the function 'PyUnicode_EncodeMBCS(const Py_UNICODE *p, int size, const char *errors)', there are basically two calls:

    {
        mbcssize = WideCharToMultiByte(CP_ACP, 0, p, size, NULL, 0, NULL, NULL);
        repr = PyString_FromStringAndSize(NULL, mbcssize);
    }

WideCharToMultiByte returns the number of bytes needed for the buffer; because of the string termination this function returns 10. PyString_FromStringAndSize assumes its second argument to be the number of needed characters, not bytes. So an easy fix would be to change

    repr = PyString_FromStringAndSize(NULL, mbcssize);

to

    repr = PyString_FromStringAndSize(NULL, mbcssize - 1);

Just checked the 2.4.3 svn trunk and it contains the same bug.
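Illustration (not part of the original report): the symptom described above can be reproduced by hand on any platform. The sketch below does not call the mbcs codec, which exists only on Windows. It simply builds a byte string one byte longer than its text, assuming the extra byte happens to be zero, and uses Python 3 bytes as a stand-in for the Python 2.4 str type.

```python
# Illustration only: mimic a string object whose recorded size is one byte
# larger than its text. Python 3 bytes stand in for the Python 2.4 str type.
text = b"encode me"            # 9 bytes of real data
oversized = text + b"\x00"     # assumed content of a size-10 object (extra byte is zero)

# repr() walks every byte counted in the object's size, so the extra
# zero byte is rendered as an escape sequence.
print(len(text), repr(text))
print(len(oversized), repr(oversized))
```

The second line of output shows the trailing \x00 escape, which matches the shape of the xbox result quoted in the report.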
msg29416 - (view)	Author: Jan-Willem (jwnmulder)	Date: 2006-08-01 21:30
Related to patch 1455898?
msg29417 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-08-02 05:31
I think this is not related to that patch. On my win2000sp4, the terminating null character is not passed to PyUnicode_EncodeMBCS.

    //////////////////////////////////////////////
    // patch for debug (release24-maint branch)

    Index: Objects/unicodeobject.c
    ===================================================================
    --- Objects/unicodeobject.c (revision 51033)
    +++ Objects/unicodeobject.c (working copy)
    @@ -2782,6 +2782,20 @@
         char *s;
         DWORD mbcssize;

    +{ /* debug */
    +
    +    int i;
    +
    +    printf("------------> %d\n", size);
    +
    +    for (i = 0; i < size; ++i) {
    +        printf("%d ", (int)p[i]);
    +    }
    +
    +    printf("\n");
    +
    +} /* debug */
    +
         /* If there are no characters, bail now! */
         if (size==0)
             return PyString_FromString("");

    //////////////////////////////////
    // a.py

    test_string = 'encode me'
    print repr(test_string.encode('mbcs'))

    //////////////////////////////////
    // result

    R:\>py a.py
    ------------> 9
    101 110 99 111 100 101 32 109 101
    'encode me'
    [7660 refs]

And I tried this.

    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>

    void count(LPCWSTR w, int size)
    {
        char *buf;
        int i;
        const int ret = ::WideCharToMultiByte(
            CP_ACP, 0, w, size, NULL, 0, NULL, NULL
        );
        if (ret == 0) {
            printf("error\n");
        }
        else {
            printf("%d\n", ret);
        }
        buf = (char*)malloc(ret);
        ::WideCharToMultiByte(
            CP_ACP, 0, w, size, buf, ret, NULL, NULL
        );
        for (i = 0; i < ret; ++i) {
            printf("%d ", (int)buf[i]);
        }
        printf("\n");
        free(buf);
    }

    int main()
    {
        count(L"encode me", 9);
        count(L"encode me", 10); /* include null character */
    }

    /*
    9
    101 110 99 111 100 101 32 109 101
    10
    101 110 99 111 100 101 32 109 101 0
    */

As stated in http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_2bj9.asp , WideCharToMultiByte never outputs a null character if the source string doesn't contain one. So I think the usage of WideCharToMultiByte is correct.

I don't know why, but probably there is some behavior difference between win2000 and xbox. (i.e. xbox calls PyUnicode_EncodeMBCS with size 10 ... or WideCharToMultiByte on xbox outputs a null character even if the source string doesn't contain it?)

Can you try the above C code and debug patch on xbox?
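Illustration (not part of the original message): the size rule discussed above can be written down as a small, platform-independent model. The function name model_wide_char_to_multi_byte is hypothetical and the model handles ASCII text only. It does not call the real Win32 API; it only encodes the documented behaviour that an explicit length converts exactly that many characters, while a length of -1 converts through the terminating null.

```python
def model_wide_char_to_multi_byte(src, size):
    """Hypothetical ASCII-only model of WideCharToMultiByte's output.

    src  -- the text of a C wide string literal, without its terminator
    size -- number of wide characters to convert, or -1 meaning
            "everything up to and including the terminating null"
    """
    wide = src + "\0"              # a C literal always carries a terminator
    if size == -1:
        size = len(wide)           # -1: process through the terminator
    if size < 0 or size > len(wide):
        # In C this would read outside the buffer; the model refuses instead.
        raise ValueError("size is outside the source buffer")
    # No terminator is appended: the output holds exactly the converted input.
    return wide[:size].encode("ascii")


for size in (9, 10, -1):
    out = model_wide_char_to_multi_byte("encode me", size)
    print(size, len(out), list(out))
```

For sizes 9 and 10 the model yields 9 and 10 bytes, in line with the Win2000 output quoted above. A conforming call with size 9 therefore leaves no trailing zero for PyUnicode_EncodeMBCS to pick up.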
msg29418 - (view)	Author: Jan-Willem (jwnmulder)	Date: 2006-08-02 17:44
And the result for the xbox:

    10
    101 110 99 111 100 101 32 109 101 0
    11
    101 110 99 111 100 101 32 109 101 0 0

It seems the xbox calculates an extra character for a '\0'.

    count(L"encode me", -1);

results on both platforms in ret = 10.

So I think this bug can be closed and classified as an xbox bug... Not so hard for us to fix: almost all API calls for DLLs are emulated in our application, so it is easy enough for us to put a fix in.
msg29419 - (view)	Author: Neal Norwitz (nnorwitz) *	Date: 2006-08-03 04:09
Thanks for letting us know. Closing as requested.
msg29420 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2006-08-03 07:55
Sorry, if you don't mind, can you try another program?

    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>

    void count(LPCWSTR w, int size)
    {
        char *buf;
        int i, ret;

        /* first call: ask only for the required buffer size */
        ret = WideCharToMultiByte(
            CP_ACP, 0, w, size, NULL, 0, NULL, NULL
        );
        if (ret == 0) {
            printf("error\n");
            return;
        }
        printf("required = %d, ", ret);
        buf = (char*)malloc(ret);
        /* second call: convert, and report how many bytes were written */
        ret = WideCharToMultiByte(
            CP_ACP, 0, w, size, buf, ret, NULL, NULL
        );
        printf("written = %d\n", ret);
        for (i = 0; i < ret; ++i) {
            printf("%d ", (int)buf[i]);
        }
        printf("\n");
        free(buf);
    }

    int main()
    {
        count(L"encode me", 9);
        count(L"encode me", 10);
    }

    ////////////////////////////
    // Result on Win2000

    R:\>a
    required = 9, written = 9
    101 110 99 111 100 101 32 109 101
    required = 10, written = 10
    101 110 99 111 100 101 32 109 101 0

On Windows, "required buffer size" equals "written size", and I thought this was always true. But I noticed that there is no such statement in the MSDN document. Maybe on xbox, "required buffer size" is more than the really required size, like this...

    ////////////////////////////
    // Maybe on xbox....?

    R:\>a
    required = 10, written = 9
    101 110 99 111 100 101 32 109 101
    required = 11, written = 10
    101 110 99 111 100 101 32 109 101 0
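Illustration (not part of the original message): the "required versus written" hypothesis above can be sketched without Windows. The two converter functions below are hypothetical stand-ins, not real APIs. The conforming one reports the same size it writes, as in the Win2000 output. The overreporting one asks for one byte more than it writes, which is only the guess made above and is not confirmed anywhere in this thread. The encode helper shows how sizing the result from the written count hides the difference.

```python
def conforming(text, buf):
    """Stand-in converter: the size query equals the bytes written."""
    data = text.encode("ascii")
    if buf is None:                # size query, nothing is written
        return len(data)
    buf[:len(data)] = data
    return len(data)


def overreporting(text, buf):
    """Hypothetical converter: the size query is one byte too large."""
    data = text.encode("ascii")
    if buf is None:
        return len(data) + 1       # assumed over-estimate, not a verified xbox fact
    buf[:len(data)] = data
    return len(data)


def encode(convert, text, trust_written):
    """Two-call pattern: query the size, allocate, then convert."""
    required = convert(text, None)
    buf = bytearray(required)      # zero-filled, like a freshly cleared buffer
    written = convert(text, buf)
    # Sizing the result from 'written' drops any unused tail of the buffer.
    return bytes(buf[:written]) if trust_written else bytes(buf)


for convert in (conforming, overreporting):
    print(convert.__name__,
          encode(convert, "encode me", trust_written=False),
          encode(convert, "encode me", trust_written=True))
```

With the overreporting stand-in, trusting the size query leaves a trailing zero byte in the result, while trusting the written count does not. With the conforming stand-in both choices give the same bytes.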
msg29421 - (view)	Author: Jan-Willem (jwnmulder)	Date: 2006-08-03 10:50
Sure, but it will have to wait a few days since it's time for a holiday now.

History
Date	User	Action	Args
2022-04-11 14:56:19	admin	set	github: 43757
2006-08-01 21:20:34	jwnmulder	create