
classification
Title: codecs.open(filename, 'U', 'UTF-16') corrupts text
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 2.7, Python 2.6
process
Status: closed
Resolution: accepted
Dependencies:
Superseder:
Assigned To: flox
Nosy List: aclover, christian.heimes, flox, jackjansen, jorend, lemburg
Priority: normal
Keywords: patch

Created on 2003-02-22 19:21 by jorend, last changed 2022-04-10 16:07 by admin. This issue is now closed.

Files
File name Uploaded Description
UTest.py jorend, 2003-02-22 20:01 Unit test demonstrating bug with codecs.open(filename, 'rU', 'UTF-16')
issue691291_py3k.diff flox, 2009-12-01 08:01 Patch against branches/py3k r76622 (test only)
issue691291_v2.diff flox, 2009-12-30 09:46 Patch, apply to trunk
Messages (10)
msg53767 - Author: Jason Orendorff (jorend) Date: 2003-02-22 19:21
Tested in Python 2.3a1.

If I write u'Hello\r\nworld\r\n' to a file, then read
it back in 'U' mode, I should get u'Hello\nworld\n'.

However, if I do this using codecs.open() and the
UTF-16 encoding, I get u'Hello\n\nworld\n\n'.

codecs.open() is not 'U'-mode-aware.  The underlying
file is opened in universal newline mode, so the byte
'\x0d' is erroneously translated to '\x0a' before the
UTF-16 codec has a chance to decode it.

The attached unit test should show specifically what it
is that I wish would work.
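
A minimal sketch of the corruption described above, written for Python 3, where recent releases no longer accept the 'U' mode. It assumes universal-newline translation can be modelled as a byte-level substitution of \r\n and lone \r with \n. The helper name translate_universal_newlines is hypothetical and is not part of any library.

```python
import re


def translate_universal_newlines(raw):
    """Hypothetical stand-in for byte-level universal-newline translation."""
    return re.sub(rb"\r\n|\r", b"\n", raw)


text = "Hello\r\nworld\r\n"
raw = text.encode("utf-16-le")  # '\r' becomes b'\r\x00', '\n' becomes b'\n\x00'

# Translating before decoding: every b'\r' is followed by b'\x00', so it is
# treated as a lone carriage return and rewritten to b'\n'.
corrupted = translate_universal_newlines(raw).decode("utf-16-le")

# Translating after decoding gives the result the report asks for.
expected = text.replace("\r\n", "\n").replace("\r", "\n")

print(repr(corrupted))
print(repr(expected))
```

The two printed values differ only in the doubled newlines, which is the symptom reported for codecs.open() with UTF-16.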
msg53768 - Author: Jason Orendorff (jorend) Date: 2003-02-22 21:17
Tested in Python 2.3a2 as well (the bug is still there).

Note that this isn't limited to UTF-16.  It will affect any
encoding that uses the byte '\x0d' to mean anything other
than u'\r'.  The most common American/European encodings are
safe (ASCII, Latin-1 and friends, and UTF-8).
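
One way to check whether an encoding is exposed to this is to look for the byte 0x0d in the encoded form of a character other than u'\r'. The following is only a sketch; has_stray_cr_byte is a hypothetical helper name, and the sample characters were picked for illustration.

```python
def has_stray_cr_byte(char, encoding):
    """Return True if encoding a non-CR character yields a 0x0d byte."""
    if char == "\r":
        return False
    return b"\x0d" in char.encode(encoding)


# U+010D encodes as b'\x0d\x01' in UTF-16-LE, so a byte-level newline
# translation would rewrite its first byte.
print(has_stray_cr_byte("\u010d", "utf-16-le"))

# UTF-8 encodes non-ASCII characters using only bytes >= 0x80.
print(has_stray_cr_byte("\u010d", "utf-8"))

# Latin-1 maps each character to a single byte equal to its code point.
print(has_stray_cr_byte("\u00e9", "latin-1"))
```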
msg53769 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2003-02-26 13:44
I'm turning this into a feature request. codecs.open()
does not support 'U' as file mode.

Assigning to Jack since he introduced the 'U' mode option.
Jack, what can we do about this?
msg53770 - Author: Jack Jansen (jackjansen) * (Python committer) Date: 2003-03-03 12:10
The problem is that codecs.open() forces binary mode on the underlying file object, and this defeats the U mode.

My feeling is that it should be okay to open the underlying file in text mode, thereby enabling the U flag to be passed. Opening the file in text mode would break, however, if either of the following conditions is met:
- there are encodings where the bytes 0x0a or 0x0d are valid parts of other characters, not end of line.
- there are libc implementations where opening a file in text mode has more implications than converting \r or \r\n to \n, i.e. if they change other bytes as well.

Re-assigning to MAL, as he put the binary mode in in the first place. If this was just defensive programming we might try taking it out. If there was a real error case with text mode, then codecs.open should probably at least signal an error if universal newline mode is requested.
msg53771 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2003-03-04 10:12
The proper thing to do would be to read the file content
as Unicode and then use the .splitlines() method on the
resulting data. The latter knows about the various line-ending
conventions in Unicode, including the Mac, DOS
and Unix variations.
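
The approach suggested here might look roughly like the sketch below: read the raw bytes, decode them, and only then split into lines. The function name read_lines and the temporary file are illustrative assumptions. Note that str.splitlines() also splits on boundaries beyond \r, \n and \r\n (for example form feed and U+2028), which may or may not be wanted.

```python
import os
import tempfile


def read_lines(path, encoding):
    """Read in binary, decode, then split on line boundaries."""
    with open(path, "rb") as f:
        text = f.read().decode(encoding)
    # Splitting happens on decoded text, so no encoded byte is altered.
    return text.splitlines()


# Write a small UTF-16 file with DOS line endings, then read it back.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write("Hello\r\nworld\r\n".encode("utf-16"))
    lines = read_lines(path, "utf-16")
finally:
    os.remove(path)

print(lines)
```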

I don't have time for this, so unassigning it again.
msg59293 - Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-01-05 18:00
Checks this for 2.6
msg81182 - Author: And Clover (aclover) * Date: 2009-02-05 01:42
> The problem is that codecs.open() forces binary mode on the underlying
> file object, and this defeats the U mode.

Actually the problem is it doesn't defeat it!

The function is documented to force binary, but it actually only does
"mode = mode + 'b'", which can leave you with a mode of 'rUb'. This mode
should be invalid but in practice the 'U' wins out, and causes the
expected problems for UTF-16 and some East Asian codecs.

Until such time as text/universal mode is supported at the overlying
decoded stream level, I suggest that 'U' should be .replace()d out of
the mode as well as 'b' being added, as the documentation would imply.
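
A rough sketch of the mode handling suggested above; it is not the committed patch. The function name force_binary_mode is hypothetical, and defaulting to 'r' when only 'U' was given is an assumption made so that a bare 'U' still yields a usable mode.

```python
def force_binary_mode(mode):
    """Drop 'U' from a file mode and make sure the result is binary."""
    # Remove the universal-newline flag so no byte-level translation occurs.
    mode = mode.strip().replace("U", "")
    # Assumption: a mode that was only 'U' is treated as a read mode.
    if mode[:1] not in ("r", "w", "a"):
        mode = "r" + mode
    # Force binary mode, as the codecs.open() documentation describes.
    if "b" not in mode:
        mode = mode + "b"
    return mode


for requested in ("rU", "U", "rUb", "r", "w", "r+"):
    print(requested, "->", force_binary_mode(requested))
```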
msg95849 - Author: Florent Xicluna (flox) * (Python committer) Date: 2009-12-01 08:00
Proposed patch following suggestion of And Clover.

Compliant with documentation:
«Files are always opened in binary mode, even if no binary mode was
specified. This is done to avoid data loss due to encodings using 8-bit
values. This means that no automatic conversion of '\n' is done on
reading and writing.»
msg97023 - Author: Florent Xicluna (flox) * (Python committer) Date: 2009-12-30 09:46
slight update.
msg100146 - Author: Florent Xicluna (flox) * (Python committer) Date: 2010-02-26 10:43
Fixed on trunk with r78461. The test will be ported to py3k.
History
Date User Action Args
2022-04-10 16:07:02 admin set github: 38031
2010-02-27 11:41:33 flox set status: pending -> closed
2010-02-26 10:43:09 flox set status: open -> pending
messages: + msg100146

assignee: flox
resolution: accepted
stage: patch review -> resolved
2009-12-30 09:48:24 flox set files: - issue691291.diff
2009-12-30 09:46:21 flox set files: + issue691291_v2.diff
versions: + Python 2.7
messages: + msg97023

type: enhancement -> behavior
stage: patch review
2009-12-02 08:14:22 flox set files: - issue691291.diff
2009-12-02 08:14:04 flox set files: + issue691291.diff
2009-12-01 08:01:49 flox set files: + issue691291_py3k.diff
2009-12-01 08:00:30 flox set files: + issue691291.diff

nosy: + flox
messages: + msg95849

keywords: + patch
2009-02-05 01:42:20 aclover set nosy: + aclover
messages: + msg81182
2008-01-05 18:00:24 christian.heimes set nosy: + christian.heimes
messages: + msg59293
components: + Library (Lib), - None
versions: + Python 2.6
2003-02-22 19:21:01 jorend create