
classification
Title: incorrect behaviour of PyUnicode_EncodeMBCS?
Type:
Stage:
Components: Interpreter Core
Versions:
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: jwnmulder, nnorwitz, ocean-city
Priority: normal
Keywords:

Created on 2006-08-01 21:20 by jwnmulder, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (7)
msg29415 - Author: Jan-Willem (jwnmulder) Date: 2006-08-01 21:20
Using Python 2.4.3.
This behaviour is not reproducible on a Windows or
Linux machine. I found the bug while tracking down a
problem in Python 2.4.3 ported to the Xbox.

Running the following two commands

  test_string = 'encode me'
  print repr(test_string.encode('mbcs'))

results in 'encode me' on Windows
and in 'encode me\\x00' on the Xbox.

The problem is that 'PyUnicode_EncodeMBCS' returns a
PyStringObject that contains the data 'encode me' but
has an object size of 10.
string_repr(test_string) therefore sees a tenth byte,
a 0 character, and encodes it as '\\x00'.
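
To illustrate the symptom without an Xbox, here is a minimal sketch in Python 3 syntax (an assumption; the report itself uses Python 2.4, where the same applies to str objects). It only shows how an extra trailing NUL byte changes the length and the repr of a byte string. It does not call the MBCS codec, and the variable names are illustrative.

```python
# Two byte strings with the same visible text. The second one carries one
# extra NUL byte, as if the string object had been sized 10 instead of 9.
expected = b"encode me"
oversized = b"encode me" + b"\x00"

# repr() escapes the NUL byte, which is how the extra byte becomes visible.
print(len(expected), repr(expected))
print(len(oversized), repr(oversized))
```

The second line of output ends in an escaped \x00, matching the shape of the Xbox result described above.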

Looking at the function 'PyUnicode_EncodeMBCS(const
Py_UNICODE *p, int size, const char *errors)', there
are basically two function calls:

{
  mbcssize = WideCharToMultiByte(CP_ACP, 0, p, size,
                                 NULL, 0, NULL, NULL);
  repr = PyString_FromStringAndSize(NULL, mbcssize);
}

WideCharToMultiByte returns the number of bytes
needed for the buffer; because of the string
terminator, this function returns 10.
PyString_FromStringAndSize assumes its second
argument to be the number of characters needed, not
bytes. So an easy fix would be
to change
  repr = PyString_FromStringAndSize(NULL, mbcssize);
to
  repr = PyString_FromStringAndSize(NULL, mbcssize - 1);

Just checked the 2.4.3 svn trunk and it contains the 
same bug.
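
The effect of the proposed `mbcssize - 1` change depends on what the size query reports. The following Python 3 sketch models that with a hypothetical helper, `build_result`, which is not CPython code. It assumes a zero-filled buffer and plain ASCII input, and it compares a platform reporting 10 bytes for 'encode me' (as described above) with one reporting 9 (as the Windows 2000 output later in this thread shows).

```python
def build_result(encoded, reported_size, adjust=0):
    """Hypothetical model: allocate reported_size + adjust bytes and copy
    as much of `encoded` into them as fits. Unused bytes stay NUL."""
    size = reported_size + adjust
    buf = bytearray(size)              # zero-filled buffer (assumption)
    n = min(size, len(encoded))
    buf[:n] = encoded[:n]
    return bytes(buf)

encoded = b"encode me"

# Size query reports 10: the unpatched result has a trailing NUL,
# and subtracting 1 removes it.
print(repr(build_result(encoded, 10)))
print(repr(build_result(encoded, 10, adjust=-1)))

# Size query reports 9: the unpatched result is already correct,
# and subtracting 1 drops the last character.
print(repr(build_result(encoded, 9)))
print(repr(build_result(encoded, 9, adjust=-1)))
```

Under these assumptions, an unconditional `- 1` would only be safe where the size query always over-reports by one.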
msg29416 - Author: Jan-Willem (jwnmulder) Date: 2006-08-01 21:30
Is this related to patch 1455898?
msg29417 - Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2006-08-02 05:31
I think this is not related to that patch.

On my win2000sp4, the terminating null character is not passed to
PyUnicode_EncodeMBCS.

//////////////////////////////////////////////
// patch for debug (release24-maint branch)

Index: Objects/unicodeobject.c
===================================================================
--- Objects/unicodeobject.c	(revision 51033)
+++ Objects/unicodeobject.c	(working copy)
@@ -2782,6 +2782,20 @@
     char *s;
     DWORD mbcssize;
 
+{ /* debug */
+
+    int i;
+
+    printf("------------> %d\n", size);
+
+    for (i = 0; i < size; ++i) {
+	printf("%d ", (int)p[i]);
+    }
+
+    printf("\n");
+
+} /* debug */
+
     /* If there are no characters, bail now! */
     if (size==0)
 	    return PyString_FromString("");

//////////////////////////////////
// a.py

test_string = 'encode me'
print repr(test_string.encode('mbcs'))

//////////////////////////////////
// result

R:\>py a.py
------------> 9
101 110 99 111 100 101 32 109 101
'encode me'
[7660 refs]


And I tried this.



#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

void count(LPCWSTR w, int size)
{
    char *buf; int i;

    const int ret = ::WideCharToMultiByte(
        CP_ACP,
        0,
        w,
        size,
        NULL,
        0,
        NULL,
        NULL
    );

    if (ret == 0)
    {
        printf("error\n");
        return; /* nothing to convert, avoid malloc(0) */
    }

    printf("%d\n", ret);

    buf = (char*)malloc(ret);

    ::WideCharToMultiByte(
        CP_ACP,
        0,
        w,
        size,
        buf,
        ret,
        NULL,
        NULL
    );

    for (i = 0; i < ret; ++i)
    {
        printf("%d ", (int)buf[i]);
    }

    printf("\n");

    free(buf);
}

int main()
{
    count(L"encode me", 9);
    count(L"encode me", 10); /* include null character */
}

/*
9
101 110 99 111 100 101 32 109 101
10
101 110 99 111 100 101 32 109 101 0
*/


As stated in
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_2bj9.asp
, WideCharToMultiByte never outputs a null character if the source
string doesn't contain one. So I think the usage of
WideCharToMultiByte is correct.
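
The counting convention described here can be modelled in a few lines of Python 3. This is only a sketch: `required_size` is a hypothetical stand-in for the size query, it handles ASCII text only, and it does not call the Win32 API. A size of -1 is modelled as "scan up to and including the terminating null".

```python
def required_size(wide, size):
    """Hypothetical model of the size query for ASCII text.

    `wide` holds the text followed by its terminating NUL, like a C wide
    string literal. `size` is the number of characters to convert, or -1
    to convert everything up to and including the terminator.
    """
    if size == -1:
        size = wide.index("\0") + 1
    return len(wide[:size].encode("ascii"))

literal = "encode me\0"

print(required_size(literal, 9))    # terminator not included in the input
print(required_size(literal, 10))   # terminator explicitly included
print(required_size(literal, -1))   # terminator included via -1
```

In this model the result only counts a null byte when the input range itself covers the terminator.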

I don't know why, but there is probably some difference in
behavior between win2000 and xbox (i.e. xbox calls
PyUnicode_EncodeMBCS with size 10, or WideCharToMultiByte
on xbox outputs a null character even if the source string
doesn't contain one?).

Can you try the above C code and the debug patch on xbox?
msg29418 - Author: Jan-Willem (jwnmulder) Date: 2006-08-02 17:44
And the result on the xbox:
10
101 110 99 111 100 101 32 109 101 0 
11
101 110 99 111 100 101 32 109 101 0 0 

It seems the xbox counts an extra character for a '\0'.

count(L"encode me", -1);
results in ret = 10 on both platforms.

So I think this bug can be closed and classified as an xbox
bug. It is not hard for us to fix: almost all API calls for
DLLs are emulated in our application, so it is easy enough
for us to put a fix in.
msg29419 - Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2006-08-03 04:09
Thanks for letting us know.  Closing as requested.
msg29420 - Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2006-08-03 07:55
Sorry, if you don't mind, can you try another program?

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

void count(LPCWSTR w, int size)
{
    char *buf;

    int i, ret;

    ret = WideCharToMultiByte(
        CP_ACP,
        0,
        w,
        size,
        NULL,
        0,
        NULL,
        NULL
    );

    if (ret == 0)
    {
        printf("error\n");

        return;
    }

    printf("required = %d, ", ret);

    buf = (char*)malloc(ret);

    ret = WideCharToMultiByte(
        CP_ACP,
        0,
        w,
        size,
        buf,
        ret,
        NULL,
        NULL
    );

    printf("written = %d\n", ret);

    for (i = 0; i < ret; ++i)
    {
        printf("%d ", (int)buf[i]);
    }

    printf("\n");

    free(buf);
}

int main()
{
    count(L"encode me", 9);
    count(L"encode me", 10);
}

////////////////////////////
// Result on Win2000

R:\>a
required = 9, written = 9
101 110 99 111 100 101 32 109 101
required = 10, written = 10
101 110 99 111 100 101 32 109 101 0

On Windows, the "required buffer size" equals the "written size",
and I thought this was always true. But I noticed that there
is no such statement in the MSDN documentation.

Maybe on xbox the "required buffer size" is larger than the size
really required, like this...

////////////////////////////
// Maybe on xbox....?

R:\>a
required = 10, written = 9
101 110 99 111 100 101 32 109 101
required = 11, written = 10
101 110 99 111 100 101 32 109 101 0
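
If "required" can exceed "written", one defensive pattern is to size the final result from the written count rather than from the required count. The Python 3 sketch below models that idea with hypothetical stand-ins (`query_size_overestimating`, `write_bytes`, `convert`). None of these are real Win32 or CPython functions, and the over-reporting size query is only the guess made above, not confirmed Xbox behavior.

```python
def query_size_overestimating(text):
    # Hypothetical size query that reports one byte more than is written.
    return len(text.encode("ascii")) + 1

def write_bytes(text, buf):
    # Hypothetical conversion: copy the bytes and return how many were written.
    data = text.encode("ascii")
    buf[:len(data)] = data
    return len(data)

def convert(text, query_size, write, trim=True):
    required = query_size(text)
    buf = bytearray(required)          # zero-filled buffer (assumption)
    written = write(text, buf)
    # Trust the written count, not the required count, for the final size.
    return bytes(buf[:written]) if trim else bytes(buf)

print(repr(convert("encode me", query_size_overestimating, write_bytes, trim=False)))
print(repr(convert("encode me", query_size_overestimating, write_bytes)))
```

With trimming, the result is the same whether or not the size query over-reports.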

msg29421 - Author: Jan-Willem (jwnmulder) Date: 2006-08-03 10:50
Sure, but it will have to wait a few days since it's time
for a holiday now.
History
Date                 User       Action  Args
2022-04-11 14:56:19  admin      set     github: 43757
2006-08-01 21:20:34  jwnmulder  create