This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

classification
Title: utf-16 codec problems with multiple file append
Type: Stage:
Components: Unicode Versions: Python 2.5
process
Status: closed Resolution: remind
Dependencies: Superseder:
Assigned To: lemburg Nosy List: doerwalter, iceberg4ever, lemburg
Priority: normal Keywords:

Created on 2007-04-16 10:05 by iceberg4ever, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
temp0.py iceberg4ever, 2007-04-16 10:05 the code to expose the bug
_codecs.py iceberg4ever, 2007-05-03 14:08 The wrapper for original codecs.py
Messages (9)
msg31804 - (view) Author: Iceberg Luo (iceberg4ever) Date: 2007-04-16 10:05
This bug is similar but not exactly the same as bug215974.  (http://sourceforge.net/tracker/?group_id=5470&atid=105470&aid=215974&func=detail)

In my test, even multiple write() calls within a single open()/close() lifespan do not cause the multiple-BOM phenomenon mentioned in bug 215974. Maybe bug 215974 was somehow fixed during the past 7 years, although Lemburg classified it as WontFix.

However, if a file is appended to more than once via "codecs.open('file.txt', 'a', 'utf16')", multiple BOMs appear.

At the same time, the claim in bug 215974 that "(extra unnecessary) BOM marks are removed from the input stream by the Python UTF-16 codec" is still not true today, on Python 2.4.4 and Python 2.5.1c1 on Windows XP.

Iceberg
------------------

PS: I did not find the "File Upload" checkbox mentioned on this web page, so I think I'd better paste the code right here...

import codecs, os

filename = "test.utf-16"
if os.path.exists(filename): os.unlink(filename)  # reset

def myOpen():
  return codecs.open(filename, "a", 'UTF-16')
def readThemBack():
  return list( codecs.open(filename, "r", 'UTF-16') )
def clumsyPatch(raw): # you can read it after your first run of this program
  for line in raw:
    if line[0] in (u'\ufffe', u'\ufeff'): # get rid of the BOMs
      yield line[1:]
    else:
      yield line

fout = myOpen()
fout.write(u"ab\n") # to simplify the problem, I only use ASCII chars here
fout.write(u"cd\n")
fout.close()
print readThemBack()
assert readThemBack() == [ u'ab\n', u'cd\n' ]
assert os.stat(filename).st_size == 14  # Only one BOM in the file

fout = myOpen()
fout.write(u"ef\n")
fout.write(u"gh\n")
fout.close()
print readThemBack()
#print list( clumsyPatch( readThemBack() ) )  # later you can enable this fix
assert readThemBack() == [ u'ab\n', u'cd\n', u'ef\n', u'gh\n' ] # fails here
assert os.stat(filename).st_size == 26  # not to mention here: multi BOM appears
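The failing assertions above come from each append-mode StreamWriter emitting its own BOM. A minimal writer-side workaround, as a Python 3 sketch (the helper name append_utf16 is mine, not a stdlib API): emit the BOM by hand exactly once, when the file is created, and do every write with the endianness-explicit, BOM-free 'utf-16-le' codec.

```python
import codecs
import os
import tempfile

def append_utf16(filename, text):
    # Hypothetical helper: write the BOM only when the file is new/empty,
    # then append with the BOM-free 'utf-16-le' codec so that repeated
    # appends never add another mark.
    is_new = not os.path.exists(filename) or os.path.getsize(filename) == 0
    with open(filename, "ab") as f:
        if is_new:
            f.write(codecs.BOM_UTF16_LE)
        f.write(text.encode("utf-16-le"))

path = os.path.join(tempfile.mkdtemp(), "test.utf-16")
append_utf16(path, "ab\ncd\n")
append_utf16(path, "ef\ngh\n")   # second append: no extra BOM this time
with open(path, encoding="utf-16") as f:
    lines = f.readlines()
print(lines)
```

With this pattern the equivalents of both failing assertions hold: the file reads back as four clean lines and stays at 26 bytes (one 2-byte BOM plus 12 characters at 2 bytes each).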
msg31805 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2007-04-19 10:30
Append mode is simply not supported for codecs. How would the codec find out the codec state that was active after the last characters were written to the file?
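The state problem can be seen directly with the incremental encoder (a Python 3 sketch): a fresh 'utf-16' encoder, which is effectively what each new append-mode open creates, starts with no state and therefore emits the BOM again.

```python
import codecs

# A fresh UTF-16 incremental encoder has no state, so its first call
# emits a BOM; later calls on the same encoder do not. Each append-mode
# codecs.open() creates a fresh writer, hence a fresh BOM.
enc = codecs.getincrementalencoder("utf-16")()
first = enc.encode("ab")    # BOM (2 bytes) + 2 chars * 2 bytes = 6 bytes
second = enc.encode("cd")   # state carried over: 4 bytes, no BOM
fresh = codecs.getincrementalencoder("utf-16")().encode("cd")  # BOM again
print(len(first), len(second), len(fresh))
```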
msg31806 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2007-04-19 10:35
I suggest you close this as wont fix.
msg31807 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2007-04-19 11:30
Closing as "won't fix"
msg31808 - (view) Author: Iceberg Luo (iceberg4ever) Date: 2007-04-20 03:39
If such a bug would be fixed, either StreamWriter or StreamReader should do something.

I can understand doerwalter's point that it is somewhat awkward for a StreamWriter to detect whether there is already a BOM at the current file header, especially when operating in append mode. But, IMHO, the StreamReader should be able to detect multiple BOMs during its life span and automatically ignore all but the first, provided that a BOM is never supposed to occur in normal content. Not to mention that such a reader seems to have existed for a while, according to the "(extra unnecessary) BOM marks are removed from the input stream by the Python UTF-16 codec" statement in bug215974 (http://sourceforge.net/tracker/?group_id=5470&atid=105470&aid=215974&func=detail).

Therefore I don't think a WontFix will be the proper FINAL solution for this case.
msg31809 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2007-04-23 10:56
But BOMs *may* appear in normal content; in that case their meaning is that of ZERO WIDTH NO-BREAK SPACE (see http://docs.python.org/lib/encodings-overview.html for more info).
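This is easy to check (a Python 3 sketch): a U+FEFF in the middle of a string survives an encode/decode round trip intact; only the leading BOM is treated as an encoding signature and stripped, so a decoder that dropped every U+FEFF would corrupt such content.

```python
s = "a\ufeffb"               # U+FEFF mid-string: ZERO WIDTH NO-BREAK SPACE
data = s.encode("utf-16")    # leading BOM signature + the three characters
roundtrip = data.decode("utf-16")
print(repr(roundtrip))       # the embedded U+FEFF is preserved
```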
msg31810 - (view) Author: Iceberg Luo (iceberg4ever) Date: 2007-05-03 14:08
The long-disputed ZWNBSP usage is deprecated nowadays (http://www.unicode.org/unicode/faq/utf_bom.html#24 suggests "U+2060 WORD JOINER" instead of ZWNBSP). However, I can understand that backwards compatibility is always a valid concern, and that's why StreamReader seems reluctant to change.

In practice, a ZWNBSP inside a file is rarely intentional (please also refer to the topic "Q: What should I do with U+FEFF in the middle of a file?" at the same URL above). IMHO, it is far more likely caused by a multi-append file operation or the like. Well, at the very least, the asymmetric "what you write is NOT what you get/read" effect between "codecs.open(filename, 'a', 'UTF-16')" and "codecs.open(filename, 'r', 'UTF-16')" is not elegant.

To address the asymmetry, I finally came up with a wrapper function for codecs.open(), which solves (or, you may say, "bypasses") the problem well in my case. I'll post the code as an attachment.
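The attached _codecs.py is not reproduced in this thread; a minimal Python 3 sketch of the reader-side idea (the helper name open_utf16_clean is hypothetical) drops a U+FEFF/U+FFFE at the start of each line, which removes the stray BOMs left behind by repeated append-mode opens, at the cost of also dropping a legitimate leading ZWNBSP.

```python
import codecs
import os
import tempfile

def open_utf16_clean(filename):
    # Hypothetical reader-side wrapper: strip any leading U+FEFF/U+FFFE
    # from each line. Trade-off: a real leading ZWNBSP is lost too.
    with codecs.open(filename, "r", "utf-16") as f:
        for line in f:
            yield line.lstrip("\ufeff\ufffe")

# Simulate two append-mode opens: each encode("utf-16") emits its own BOM.
path = os.path.join(tempfile.mkdtemp(), "t.utf-16")
with open(path, "ab") as f:
    f.write("ab\n".encode("utf-16"))  # BOM + "ab\n"
    f.write("cd\n".encode("utf-16"))  # a second BOM + "cd\n"

cleaned = list(open_utf16_clean(path))
print(cleaned)   # the stray mid-file BOM is gone
```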

BTW, even the official documentation of Python 2.4, chapter "7.3.2.1 Built-in Codecs", mentions that:
   PyObject* PyUnicode_DecodeUTF16( const char *s, int size, const char *errors, int *byteorder)
"switches according to all byte order marks (BOM) it finds in the input data. BOMs are not copied into the resulting Unicode string". I don't know whether it is the BOM-less decoder we have talked about for so long. //shrug

Hope the information above can serve as some kind of recipe for those who encounter the same problem. That's it. Thanks for your patience.

Best regards,
                            Iceberg
File Added: _codecs.py
msg31811 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2007-05-03 15:03
>BTW, even the official document of Python2.4, chapter "7.3.2.1 Built-in
> Codecs", mentions that the:
>   PyObject* PyUnicode_DecodeUTF16( const char *s, int size, const char
> *errors, int *byteorder)
> can "switches according to all byte order marks (BOM) it finds in the
> input data. BOMs are not copied into the resulting Unicode string".  I
> don't know whether it is the BOM-less decoder we talked for long time.

This seems to be wrong. Looking at the source code (Objects/unicodeobject.c) reveals that only the first BOM is skipped.
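The source-level behavior is also visible from Python itself (a Python 3 sketch): with two BOMs in the byte stream, the decoder consumes only the first as the byte-order signature and passes the second through as the character U+FEFF.

```python
import codecs

# Two BOMs in the byte stream: the decoder strips the first (signature)
# and returns the second as a literal U+FEFF in the result.
data = (codecs.BOM_UTF16_LE + "ab".encode("utf-16-le")
        + codecs.BOM_UTF16_LE + "cd".encode("utf-16-le"))
text = data.decode("utf-16")
print(repr(text))
```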
msg31812 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2007-05-03 17:12
OK, I've updated the documentation (r55094, r55095)
History
Date User Action Args
2022-04-11 14:56:23adminsetgithub: 44853
2007-04-16 10:05:22iceberg4evercreate