This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: utf-16 codec problems with multiple file append
Type: Stage:
Components: Unicode Versions: Python 2.5
process
Status: closed Resolution: remind
Dependencies: Superseder:
Assigned To: lemburg Nosy List: doerwalter, iceberg4ever, lemburg
Priority: normal Keywords:

Created on 2007-04-16 10:05 by iceberg4ever, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
temp0.py iceberg4ever, 2007-04-16 10:05 the code to expose the bug
_codecs.py iceberg4ever, 2007-05-03 14:08 The wrapper for original codecs.py
Messages (9)
msg31804 - (view) Author: Iceberg Luo (iceberg4ever) Date: 2007-04-16 10:05
This bug is similar but not exactly the same as bug215974.  (http://sourceforge.net/tracker/?group_id=5470&atid=105470&aid=215974&func=detail)

In my test, even multiple write() calls within a single open()/close() lifespan do not cause the multiple-BOM phenomenon mentioned in bug 215974. Maybe bug 215974 was somehow fixed during the past 7 years, although Lemburg classified it as WontFix.

However, if a file is appended to more than once via "codecs.open('file.txt', 'a', 'utf16')", multiple BOMs appear.

At the same time, the claim in bug 215974 that "(extra unnecessary) BOM marks are removed from the input stream by the Python UTF-16 codec" is still not true today, on Python 2.4.4 and Python 2.5.1c1 on Windows XP.

Iceberg
------------------

PS: I did not find the "File Upload" checkbox mentioned on this web page, so I think I'd better paste the code right here...

import codecs, os

filename = "test.utf-16"
if os.path.exists(filename): os.unlink(filename)  # reset

def myOpen():
  return codecs.open(filename, "a", 'UTF-16')
def readThemBack():
  return list( codecs.open(filename, "r", 'UTF-16') )
def clumsyPatch(raw): # you can read it after your first run of this program
  for line in raw:
    if line[0] in (u'\ufffe', u'\ufeff'): # get rid of the BOMs
      yield line[1:]
    else:
      yield line

fout = myOpen()
fout.write(u"ab\n") # to simplify the problem, I only use ASCII chars here
fout.write(u"cd\n")
fout.close()
print readThemBack()
assert readThemBack() == [ u'ab\n', u'cd\n' ]
assert os.stat(filename).st_size == 14  # Only one BOM in the file

fout = myOpen()
fout.write(u"ef\n")
fout.write(u"gh\n")
fout.close()
print readThemBack()
#print list( clumsyPatch( readThemBack() ) )  # later you can enable this fix
assert readThemBack() == [ u'ab\n', u'cd\n', u'ef\n', u'gh\n' ] # fails here
assert os.stat(filename).st_size == 26  # not to mention here: multi BOM appears
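The failing assertions above come from each append-mode StreamWriter emitting its own BOM. A minimal writer-side workaround, as a Python 3 sketch (the helper name append_utf16 is mine, not a stdlib API): emit the BOM by hand exactly once, when the file is created, and do every write with the endianness-explicit, BOM-free 'utf-16-le' codec.

```python
import codecs
import os
import tempfile

def append_utf16(filename, text):
    # Hypothetical helper: write the BOM only when the file is new/empty,
    # then append with the BOM-free 'utf-16-le' codec so that repeated
    # appends never add another mark.
    is_new = not os.path.exists(filename) or os.path.getsize(filename) == 0
    with open(filename, "ab") as f:
        if is_new:
            f.write(codecs.BOM_UTF16_LE)
        f.write(text.encode("utf-16-le"))

path = os.path.join(tempfile.mkdtemp(), "test.utf-16")
append_utf16(path, "ab\ncd\n")
append_utf16(path, "ef\ngh\n")   # second append: no extra BOM this time
with open(path, encoding="utf-16") as f:
    lines = f.readlines()
print(lines)
```

With this pattern the equivalents of both failing assertions hold: the file reads back as four clean lines and stays at 26 bytes (one 2-byte BOM plus 12 characters at 2 bytes each).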
msg31805 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2007-04-19 10:30
Append mode is simply not supported for codecs. How would the codec find out the codec state that was active after the last characters were written to the file?
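The state problem can be seen directly with the incremental encoder (a Python 3 sketch): a fresh 'utf-16' encoder, which is effectively what each new append-mode open creates, starts with no state and therefore emits the BOM again.

```python
import codecs

# A fresh UTF-16 incremental encoder has no state, so its first call
# emits a BOM; later calls on the same encoder do not. Each append-mode
# codecs.open() creates a fresh writer, hence a fresh BOM.
enc = codecs.getincrementalencoder("utf-16")()
first = enc.encode("ab")    # BOM (2 bytes) + 2 chars * 2 bytes = 6 bytes
second = enc.encode("cd")   # state carried over: 4 bytes, no BOM
fresh = codecs.getincrementalencoder("utf-16")().encode("cd")  # BOM again
print(len(first), len(second), len(fresh))
```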
msg31806 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2007-04-19 10:35
I suggest you close this as wont fix.
msg31807 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2007-04-19 11:30
Closing as "won't fix"
msg31808 - (view) Author: Iceberg Luo (iceberg4ever) Date: 2007-04-20 03:39
If such a bug would be fixed, either StreamWriter or StreamReader should do something.

I can understand doerwalter's point that it is somewhat awkward for a StreamWriter to detect whether there is already a BOM at the current file header, especially when operating in append mode. But, IMHO, the StreamReader should be able to detect multiple BOMs during its life span and automatically ignore all but the first, provided that a BOM is never supposed to occur in normal content. Not to mention that such a reader seems to have existed for a while, according to the "(extra unnecessary) BOM marks are removed from the input stream by the Python UTF-16 codec" statement in bug215974 (http://sourceforge.net/tracker/?group_id=5470&atid=105470&aid=215974&func=detail).

Therefore I don't think a WontFix will be the proper FINAL solution for this case.
msg31809 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2007-04-23 10:56
But BOMs *may* appear in normal content; in that case their meaning is that of ZERO WIDTH NO-BREAK SPACE (see http://docs.python.org/lib/encodings-overview.html for more info).
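This is easy to check (a Python 3 sketch): a U+FEFF in the middle of a string survives an encode/decode round trip intact; only the leading BOM is treated as an encoding signature and stripped, so a decoder that dropped every U+FEFF would corrupt such content.

```python
s = "a\ufeffb"               # U+FEFF mid-string: ZERO WIDTH NO-BREAK SPACE
data = s.encode("utf-16")    # leading BOM signature + the three characters
roundtrip = data.decode("utf-16")
print(repr(roundtrip))       # the embedded U+FEFF is preserved
```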
msg31810 - (view) Author: Iceberg Luo (iceberg4ever) Date: 2007-05-03 14:08
The long-disputed ZWNBSP usage is deprecated nowadays (http://www.unicode.org/unicode/faq/utf_bom.html#24 suggests "U+2060 WORD JOINER" instead of ZWNBSP). However, I can understand that backwards compatibility is always a valid concern, and that's why StreamReader seems reluctant to change.

In practice, a ZWNBSP inside a file is rarely intentional (please also refer to the topic "Q: What should I do with U+FEFF in the middle of a file?" at the same URL above). IMHO, it is far more likely caused by a multi-append file operation or the like. Well, at the very least, the asymmetric "what you write is NOT what you get/read" effect between "codecs.open(filename, 'a', 'UTF-16')" and "codecs.open(filename, 'r', 'UTF-16')" is not elegant.

To address the asymmetry, I finally came up with a wrapper function for codecs.open(), which solves (or, you may say, "bypasses") the problem well in my case. I'll post the code as an attachment.
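The attached _codecs.py is not reproduced in this thread; a minimal Python 3 sketch of the reader-side idea (the helper name open_utf16_clean is hypothetical) drops a U+FEFF/U+FFFE at the start of each line, which removes the stray BOMs left behind by repeated append-mode opens, at the cost of also dropping a legitimate leading ZWNBSP.

```python
import codecs
import os
import tempfile

def open_utf16_clean(filename):
    # Hypothetical reader-side wrapper: strip any leading U+FEFF/U+FFFE
    # from each line. Trade-off: a real leading ZWNBSP is lost too.
    with codecs.open(filename, "r", "utf-16") as f:
        for line in f:
            yield line.lstrip("\ufeff\ufffe")

# Simulate two append-mode opens: each encode("utf-16") emits its own BOM.
path = os.path.join(tempfile.mkdtemp(), "t.utf-16")
with open(path, "ab") as f:
    f.write("ab\n".encode("utf-16"))  # BOM + "ab\n"
    f.write("cd\n".encode("utf-16"))  # a second BOM + "cd\n"

cleaned = list(open_utf16_clean(path))
print(cleaned)   # the stray mid-file BOM is gone
```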

BTW, even the official documentation of Python 2.4, chapter "7.3.2.1 Built-in Codecs", mentions that:
   PyObject* PyUnicode_DecodeUTF16( const char *s, int size, const char *errors, int *byteorder)
"switches according to all byte order marks (BOM) it finds in the input data. BOMs are not copied into the resulting Unicode string". I don't know whether it is the BOM-less decoder we have talked about for so long. //shrug

Hope the information above can serve as some kind of recipe for those who encounter the same problem. That's it. Thanks for your patience.

Best regards,
                            Iceberg
File Added: _codecs.py
msg31811 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2007-05-03 15:03
>BTW, even the official document of Python2.4, chapter "7.3.2.1 Built-in
> Codecs", mentions that the:
>   PyObject* PyUnicode_DecodeUTF16( const char *s, int size, const char
> *errors, int *byteorder)
> can "switches according to all byte order marks (BOM) it finds in the
> input data. BOMs are not copied into the resulting Unicode string".  I
> don't know whether it is the BOM-less decoder we talked for long time.

This seems to be wrong. Looking at the source code (Objects/unicodeobject.c) reveals that only the first BOM is skipped.
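The source-level behavior is also visible from Python itself (a Python 3 sketch): with two BOMs in the byte stream, the decoder consumes only the first as the byte-order signature and passes the second through as the character U+FEFF.

```python
import codecs

# Two BOMs in the byte stream: the decoder strips the first (signature)
# and returns the second as a literal U+FEFF in the result.
data = (codecs.BOM_UTF16_LE + "ab".encode("utf-16-le")
        + codecs.BOM_UTF16_LE + "cd".encode("utf-16-le"))
text = data.decode("utf-16")
print(repr(text))
```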
msg31812 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2007-05-03 17:12
OK, I've updated the documentation (r55094, r55095)
History
Date User Action Args
2022-04-11 14:56:23adminsetgithub: 44853
2007-04-16 10:05:22iceberg4evercreate