There seems to be an error when reading a text file
encoded in ISO-2022-JP-2 with the codecs module. In all
my test cases, accented Latin-1 characters (e.g. e
acute) are missing from the decoded string. However, if
I convert the file manually with iconv, everything
comes out right. Here is a simple script that
illustrates the problem:
###########################################
import codecs
import pygtk
import gtk
f = codecs.open( "test.iso-2022-jp-2" , "r" , \
"iso-2022-jp-2" )
s1 = f.readline().strip()
f.close()
f = open( "test.utf-8" , "r" )
s2 = f.readline().strip()
f.close()
pack = gtk.VBox()
pack.pack_start( gtk.Label( s1 ) )
pack.pack_start( gtk.Label( s2 ) )
window = gtk.Window( gtk.WINDOW_TOPLEVEL )
window.add( pack )
window.show_all()
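# "destroy" handlers receive ( widget , user_data ) only,
# unlike "delete_event" handlers, which also get the event
def event_destroy( widget , data ) :
    gtk.main_quit()
    return 0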
window.connect( "delete_event" , \
lambda w,e,d: False , None )
window.connect( "destroy" , event_destroy , None )
gtk.main()
###########################################
I have attached the file "test.iso-2022-jp-2". To
create the UTF-8 version of the file, I used the
following shell command:
iconv -f ISO-2022-JP-2 -t UTF-8 \
test.iso-2022-jp-2 > test.utf-8
When running this script, I would expect a window
showing the same label twice. However, the first one
is missing the e acute.
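To rule out GTK as the cause, the same comparison can
be done without a GUI. The following is only a minimal
sketch, assuming a Python version that provides
io.open; the helper names first_line and
compare_first_lines are made up for illustration and
are not part of the codecs module:

```python
import io


def first_line(path, encoding):
    """Return the first line of `path`, decoded with `encoding` and stripped."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.readline().strip()


def compare_first_lines(path_a, encoding_a, path_b, encoding_b):
    """Decode the first line of both files and report whether they match."""
    s1 = first_line(path_a, encoding_a)
    s2 = first_line(path_b, encoding_b)
    # Both values are unicode strings, so == compares code points,
    # not the raw bytes stored on disk.
    return s1, s2, s1 == s2
```

Calling compare_first_lines( "test.iso-2022-jp-2" ,
"iso-2022-jp-2" , "test.utf-8" , "utf-8" ) and printing
repr() of the first two values should show whether the
e acute is already lost when decoding or only when the
label is displayed.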
--
Francois